Commit graph

4320 commits

Author SHA1 Message Date
Herman van Hovell 0d63487502 [SPARK-17616][SQL] Support a single distinct aggregate combined with a non-partial aggregate
## What changes were proposed in this pull request?
We currently cannot execute an aggregate that contains a single distinct aggregate function and an one or more non-partially plannable aggregate functions, for example:
```sql
select   grp,
         collect_list(col1),
         count(distinct col2)
from     tbl_a
group by 1
```
This is a regression from Spark 1.6. This is caused by the fact that the single distinct aggregation code path assumes that all aggregates can be planned in two phases (is partially aggregatable). This PR works around this issue by triggering the `RewriteDistinctAggregates` in such cases (this is similar to the approach taken in 1.6).

## How was this patch tested?
Created `RewriteDistinctAggregatesSuite` which checks if the aggregates with distinct aggregate functions get rewritten into two `Aggregates` and an `Expand`. Added a regression test to `DataFrameAggregateSuite`.

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #15187 from hvanhovell/SPARK-17616.
2016-09-22 14:29:27 -07:00
Burak Yavuz 85d609cf25 [SPARK-17613] S3A base paths with no '/' at the end return empty DataFrames
## What changes were proposed in this pull request?

Consider you have a bucket as `s3a://some-bucket`
and under it you have files:
```
s3a://some-bucket/file1.parquet
s3a://some-bucket/file2.parquet
```
Getting the parent path of `s3a://some-bucket/file1.parquet` yields
`s3a://some-bucket/` and the ListingFileCatalog uses this as the key in the hash map.

When catalog.allFiles is called, we use `s3a://some-bucket` (no slash at the end) to get the list of files, and we're left with an empty list!

This PR fixes this by adding a `/` at the end of the `URI` iff the given `Path` doesn't have a parent, i.e. is the root. This is a no-op if the path already had a `/` at the end, and is handled through the Hadoop Path, path merging semantics.

## How was this patch tested?

Unit test in `FileCatalogSuite`.

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #15169 from brkyvz/SPARK-17613.
2016-09-22 13:05:41 -07:00
Wenchen Fan 8a02410a92 [SQL][MINOR] correct the comment of SortBasedAggregationIterator.safeProj
## What changes were proposed in this pull request?

This comment went stale long time ago, this PR fixes it according to my understanding.

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #15095 from cloud-fan/update-comment.
2016-09-22 23:25:32 +08:00
Zhenhua Wang de7df7defc [SPARK-17625][SQL] set expectedOutputAttributes when converting SimpleCatalogRelation to LogicalRelation
## What changes were proposed in this pull request?

We should set expectedOutputAttributes when converting SimpleCatalogRelation to LogicalRelation, otherwise the outputs of LogicalRelation are different from outputs of SimpleCatalogRelation - they have different exprId's.

## How was this patch tested?

add a test case

Author: Zhenhua Wang <wzh_zju@163.com>

Closes #15182 from wzhfy/expectedAttributes.
2016-09-22 14:48:49 +08:00
gatorsmile 3a80f92f8f [SPARK-17492][SQL] Fix Reading Cataloged Data Sources without Extending SchemaRelationProvider
### What changes were proposed in this pull request?
For data sources without extending `SchemaRelationProvider`, we expect users to not specify schemas when they creating tables. If the schema is input from users, an exception is issued.

Since Spark 2.1, for any data source, to avoid infer the schema every time, we store the schema in the metastore catalog. Thus, when reading a cataloged data source table, the schema could be read from metastore catalog. In this case, we also got an exception. For example,

```Scala
sql(
  s"""
     |CREATE TABLE relationProvierWithSchema
     |USING org.apache.spark.sql.sources.SimpleScanSource
     |OPTIONS (
     |  From '1',
     |  To '10'
     |)
   """.stripMargin)
spark.table(tableName).show()
```
```
org.apache.spark.sql.sources.SimpleScanSource does not allow user-specified schemas.;
```

This PR is to fix the above issue. When building a data source, we introduce a flag `isSchemaFromUsers` to indicate whether the schema is really input from users. If true, we issue an exception. Otherwise, we will call the `createRelation` of `RelationProvider` to generate the `BaseRelation`, in which it contains the actual schema.

### How was this patch tested?
Added a few cases.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #15046 from gatorsmile/tempViewCases.
2016-09-22 13:19:06 +08:00
Yadong Qi cb324f6115 [SPARK-17425][SQL] Override sameResult in HiveTableScanExec to make ReuseExchange work in text format table
## What changes were proposed in this pull request?
The PR will override the `sameResult` in `HiveTableScanExec` to make `ReuseExchange` work in text format table.

## How was this patch tested?
# SQL
```sql
SELECT * FROM src t1
JOIN src t2 ON t1.key = t2.key
JOIN src t3 ON t1.key = t3.key;
```

# Before
```
== Physical Plan ==
*BroadcastHashJoin [key#30], [key#34], Inner, BuildRight
:- *BroadcastHashJoin [key#30], [key#32], Inner, BuildRight
:  :- *Filter isnotnull(key#30)
:  :  +- HiveTableScan [key#30, value#31], MetastoreRelation default, src
:  +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
:     +- *Filter isnotnull(key#32)
:        +- HiveTableScan [key#32, value#33], MetastoreRelation default, src
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
   +- *Filter isnotnull(key#34)
      +- HiveTableScan [key#34, value#35], MetastoreRelation default, src
```

# After
```
== Physical Plan ==
*BroadcastHashJoin [key#2], [key#6], Inner, BuildRight
:- *BroadcastHashJoin [key#2], [key#4], Inner, BuildRight
:  :- *Filter isnotnull(key#2)
:  :  +- HiveTableScan [key#2, value#3], MetastoreRelation default, src
:  +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
:     +- *Filter isnotnull(key#4)
:        +- HiveTableScan [key#4, value#5], MetastoreRelation default, src
+- ReusedExchange [key#6, value#7], BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
```

cc: davies cloud-fan

Author: Yadong Qi <qiyadong2010@gmail.com>

Closes #14988 from watermen/SPARK-17425.
2016-09-22 13:04:42 +08:00
Wenchen Fan b50b34f561 [SPARK-17609][SQL] SessionCatalog.tableExists should not check temp view
## What changes were proposed in this pull request?

After #15054 , there is no place in Spark SQL that need `SessionCatalog.tableExists` to check temp views, so this PR makes `SessionCatalog.tableExists` only check permanent table/view and removes some hacks.

This PR also improves the `getTempViewOrPermanentTableMetadata` that is introduced in  #15054 , to make the code simpler.

## How was this patch tested?

existing tests

Author: Wenchen Fan <wenchen@databricks.com>

Closes #15160 from cloud-fan/exists.
2016-09-22 12:52:09 +08:00
Davies Liu 8bde03bf9a [SPARK-17494][SQL] changePrecision() on compact decimal should respect rounding mode
## What changes were proposed in this pull request?

Floor()/Ceil() of decimal is implemented using changePrecision() by passing a rounding mode, but the rounding mode is not respected when the decimal is in compact mode (could fit within a Long).

This Update the changePrecision() to respect rounding mode, which could be ROUND_FLOOR, ROUND_CEIL, ROUND_HALF_UP, ROUND_HALF_EVEN.

## How was this patch tested?

Added regression tests.

Author: Davies Liu <davies@databricks.com>

Closes #15154 from davies/decimal_round.
2016-09-21 21:02:30 -07:00
Michael Armbrust 3497ebe511 [SPARK-17627] Mark Streaming Providers Experimental
All of structured streaming is experimental in its first release.  We missed the annotation on two of the APIs.

Author: Michael Armbrust <michael@databricks.com>

Closes #15188 from marmbrus/experimentalApi.
2016-09-21 20:59:46 -07:00
Burak Yavuz 7cbe216449 [SPARK-17569] Make StructuredStreaming FileStreamSource batch generation faster
## What changes were proposed in this pull request?

While getting the batch for a `FileStreamSource` in StructuredStreaming, we know which files we must take specifically. We already have verified that they exist, and have committed them to a metadata log. When creating the FileSourceRelation however for an incremental execution, the code checks the existence of every single file once again!

When you have 100,000s of files in a folder, creating the first batch takes 2 hours+ when working with S3! This PR disables that check

## How was this patch tested?

Added a unit test to `FileStreamSource`.

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #15122 from brkyvz/SPARK-17569.
2016-09-21 17:12:52 -07:00
Liang-Chi Hsieh 248922fd4f [SPARK-17590][SQL] Analyze CTE definitions at once and allow CTE subquery to define CTE
## What changes were proposed in this pull request?

We substitute logical plan with CTE definitions in the analyzer rule CTESubstitution. A CTE definition can be used in the logical plan for multiple times, and its analyzed logical plan should be the same. We should not analyze CTE definitions multiple times when they are reused in the query.

By analyzing CTE definitions before substitution, we can support defining CTE in subquery.

## How was this patch tested?

Jenkins tests.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #15146 from viirya/cte-analysis-once.
2016-09-21 06:53:42 -07:00
hyukjinkwon 25a020be99
[SPARK-17583][SQL] Remove uesless rowSeparator variable and set auto-expanding buffer as default for maxCharsPerColumn option in CSV
## What changes were proposed in this pull request?

This PR includes the changes below:

1. Upgrade Univocity library from 2.1.1 to 2.2.1

  This includes some performance improvement and also enabling auto-extending buffer in `maxCharsPerColumn` option in CSV. Please refer the [release notes](https://github.com/uniVocity/univocity-parsers/releases).

2. Remove useless `rowSeparator` variable existing in `CSVOptions`

  We have this unused variable in [CSVOptions.scala#L127](29952ed096/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala (L127)) but it seems possibly causing confusion that it actually does not care of `\r\n`. For example, we have an issue open about this, [SPARK-17227](https://issues.apache.org/jira/browse/SPARK-17227), describing this variable.

  This variable is virtually not being used because we rely on `LineRecordReader` in Hadoop which deals with only both `\n` and `\r\n`.

3. Set the default value of `maxCharsPerColumn` to auto-expending.

  We are setting 1000000 for the length of each column. It'd be more sensible we allow auto-expending rather than fixed length by default.

  To make sure, using `-1` is being described in the release note, [2.2.0](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.2.0).

## How was this patch tested?

N/A

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #15138 from HyukjinKwon/SPARK-17583.
2016-09-21 10:35:29 +01:00
VinceShieh 57dc326bd0
[SPARK-17219][ML] Add NaN value handling in Bucketizer
## What changes were proposed in this pull request?
This PR fixes an issue when Bucketizer is called to handle a dataset containing NaN value.
Sometimes, null value might also be useful to users, so in these cases, Bucketizer should
reserve one extra bucket for NaN values, instead of throwing an illegal exception.
Before:
```
Bucketizer.transform on NaN value threw an illegal exception.
```
After:
```
NaN values will be grouped in an extra bucket.
```
## How was this patch tested?
New test cases added in `BucketizerSuite`.
Signed-off-by: VinceShieh <vincent.xieintel.com>

Author: VinceShieh <vincent.xie@intel.com>

Closes #14858 from VinceShieh/spark-17219.
2016-09-21 10:20:57 +01:00
Burak Yavuz 28fafa3ee8 [SPARK-17599] Prevent ListingFileCatalog from failing if path doesn't exist
## What changes were proposed in this pull request?

The `ListingFileCatalog` lists files given a set of resolved paths. If a folder is deleted at any time between the paths were resolved and the file catalog can check for the folder, the Spark job fails. This may abruptly stop long running StructuredStreaming jobs for example.

Folders may be deleted by users or automatically by retention policies. These cases should not prevent jobs from successfully completing.

## How was this patch tested?

Unit test in `FileCatalogSuite`

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #15153 from brkyvz/SPARK-17599.
2016-09-21 17:07:16 +08:00
Sean Zhong 3977223a32 [SPARK-17617][SQL] Remainder(%) expression.eval returns incorrect result on double value
## What changes were proposed in this pull request?

Remainder(%) expression's `eval()` returns incorrect result when the dividend is a big double. The reason is that Remainder converts the double dividend to decimal to do "%", and that lose precision.

This bug only affects the `eval()` that is used by constant folding, the codegen path is not impacted.

### Before change
```
scala> -5083676433652386516D % 10
res2: Double = -6.0

scala> spark.sql("select -5083676433652386516D % 10 as a").show
+---+
|  a|
+---+
|0.0|
+---+
```

### After change
```
scala> spark.sql("select -5083676433652386516D % 10 as a").show
+----+
|   a|
+----+
|-6.0|
+----+
```

## How was this patch tested?

Unit test.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #15171 from clockfly/SPARK-17617.
2016-09-21 16:53:34 +08:00
wm624@hotmail.com 61876a4279
[CORE][DOC] Fix errors in comments
## What changes were proposed in this pull request?
While reading source code of CORE and SQL core, I found some minor errors in comments such as extra space, missing blank line and grammar error.

I fixed these minor errors and might find more during my source code study.

## How was this patch tested?
Manually build

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #15151 from wangmiao1981/mem.
2016-09-21 09:33:29 +01:00
jerryshao e48ebc4e40 [SPARK-15698][SQL][STREAMING][FOLLW-UP] Fix FileStream source and sink log get configuration issue
## What changes were proposed in this pull request?

This issue was introduced in the previous commit of SPARK-15698. Mistakenly change the way to get configuration back to original one, so here with the follow up PR to revert them up.

## How was this patch tested?

N/A

Ping zsxwing , please review again, sorry to bring the inconvenience. Thanks a lot.

Author: jerryshao <sshao@hortonworks.com>

Closes #15173 from jerryshao/SPARK-15698-follow.
2016-09-20 22:36:24 -07:00
petermaxlee 976f3b1227 [SPARK-17513][SQL] Make StreamExecution garbage-collect its metadata
## What changes were proposed in this pull request?
This PR modifies StreamExecution such that it discards metadata for batches that have already been fully processed. I used the purge method that was added as part of SPARK-17235.

This is a resubmission of 15126, which was based on work by frreiss in #15067, but fixed the test case along with some typos.

## How was this patch tested?
A new test case in StreamingQuerySuite. The test case would fail without the changes in this pull request.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #15166 from petermaxlee/SPARK-17513-2.
2016-09-20 19:08:07 -07:00
Yin Huai 9ac68dbc57 [SPARK-17549][SQL] Revert "[] Only collect table size stat in driver for cached relation."
This reverts commit 39e2bad6a8 because of the problem mentioned at https://issues.apache.org/jira/browse/SPARK-17549?focusedCommentId=15505060&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15505060

Author: Yin Huai <yhuai@databricks.com>

Closes #15157 from yhuai/revert-SPARK-17549.
2016-09-20 11:53:57 -07:00
jerryshao a6aade0042 [SPARK-15698][SQL][STREAMING] Add the ability to remove the old MetadataLog in FileStreamSource
## What changes were proposed in this pull request?

Current `metadataLog` in `FileStreamSource` will add a checkpoint file in each batch but do not have the ability to remove/compact, which will lead to large number of small files when running for a long time. So here propose to compact the old logs into one file. This method is quite similar to `FileStreamSinkLog` but simpler.

## How was this patch tested?

Unit test added.

Author: jerryshao <sshao@hortonworks.com>

Closes #13513 from jerryshao/SPARK-15698.
2016-09-20 10:24:12 -07:00
Wenchen Fan eb004c6620 [SPARK-17051][SQL] we should use hadoopConf in InsertIntoHiveTable
## What changes were proposed in this pull request?

Hive confs in hive-site.xml will be loaded in `hadoopConf`, so we should use `hadoopConf` in `InsertIntoHiveTable` instead of `SessionState.conf`

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14634 from cloud-fan/bug.
2016-09-20 09:53:28 -07:00
gatorsmile d5ec5dbb0d [SPARK-17502][SQL] Fix Multiple Bugs in DDL Statements on Temporary Views
### What changes were proposed in this pull request?
- When the permanent tables/views do not exist but the temporary view exists, the expected error should be `NoSuchTableException` for partition-related ALTER TABLE commands. However, it always reports a confusing error message. For example,
```
Partition spec is invalid. The spec (a, b) must match the partition spec () defined in table '`testview`';
```
- When the permanent tables/views do not exist but the temporary view exists, the expected error should be `NoSuchTableException` for `ALTER TABLE ... UNSET TBLPROPERTIES`. However, it reports a missing table property. For example,
```
Attempted to unset non-existent property 'p' in table '`testView`';
```
- When `ANALYZE TABLE` is called on a view or a temporary view, we should issue an error message. However, it reports a strange error:
```
ANALYZE TABLE is not supported for Project
```

- When inserting into a temporary view that is generated from `Range`, we will get the following error message:
```
assertion failed: No plan for 'InsertIntoTable Range (0, 10, step=1, splits=Some(1)), false, false
+- Project [1 AS 1#20]
   +- OneRowRelation$
```

This PR is to fix the above four issues.

### How was this patch tested?
Added multiple test cases

Author: gatorsmile <gatorsmile@gmail.com>

Closes #15054 from gatorsmile/tempViewDDL.
2016-09-20 20:11:48 +08:00
Wenchen Fan f039d964d1 Revert "[SPARK-17513][SQL] Make StreamExecution garbage-collect its metadata"
This reverts commit be9d57fc9d.
2016-09-20 16:12:35 +08:00
petermaxlee be9d57fc9d [SPARK-17513][SQL] Make StreamExecution garbage-collect its metadata
## What changes were proposed in this pull request?
This PR modifies StreamExecution such that it discards metadata for batches that have already been fully processed. I used the purge method that was added as part of SPARK-17235.

This is based on work by frreiss in #15067, but fixed the test case along with some typos.

## How was this patch tested?
A new test case in StreamingQuerySuite. The test case would fail without the changes in this pull request.

Author: petermaxlee <petermaxlee@gmail.com>
Author: frreiss <frreiss@us.ibm.com>

Closes #15126 from petermaxlee/SPARK-17513.
2016-09-19 22:19:51 -07:00
Josh Rosen e719b1c045 [SPARK-17160] Properly escape field names in code-generated error messages
This patch addresses a corner-case escaping bug where field names which contain special characters were unsafely interpolated into error message string literals in generated Java code, leading to compilation errors.

This patch addresses these issues by using `addReferenceObj` to store the error messages as string fields rather than inline string constants.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #15156 from JoshRosen/SPARK-17160.
2016-09-19 20:20:36 -07:00
Davies Liu d8104158a9 [SPARK-17100] [SQL] fix Python udf in filter on top of outer join
## What changes were proposed in this pull request?

In optimizer, we try to evaluate the condition to see whether it's nullable or not, but some expressions are not evaluable, we should check that before evaluate it.

## How was this patch tested?

Added regression tests.

Author: Davies Liu <davies@databricks.com>

Closes #15103 from davies/udf_join.
2016-09-19 13:24:16 -07:00
Davies Liu e063206263 [SPARK-16439] [SQL] bring back the separator in SQL UI
## What changes were proposed in this pull request?

Currently, the SQL metrics looks like `number of rows: 111111111111`, it's very hard to read how large the number is. So a separator was added by #12425, but removed by #14142, because the separator is weird in some locales (for example, pl_PL), this PR will add that back, but always use "," as the separator, since the SQL UI are all in English.

## How was this patch tested?

Existing tests.
![metrics](https://cloud.githubusercontent.com/assets/40902/14573908/21ad2f00-030d-11e6-9e2c-c544f30039ea.png)

Author: Davies Liu <davies@databricks.com>

Closes #15106 from davies/metric_sep.
2016-09-19 11:49:03 -07:00
Sean Owen d720a40194
[SPARK-17297][DOCS] Clarify window/slide duration as absolute time, not relative to a calendar
## What changes were proposed in this pull request?

Clarify that slide and window duration are absolute, and not relative to a calendar.

## How was this patch tested?

Doc build (no functional change)

Author: Sean Owen <sowen@cloudera.com>

Closes #15142 from srowen/SPARK-17297.
2016-09-19 09:38:25 +01:00
petermaxlee 8f0c35a4d0 [SPARK-17571][SQL] AssertOnQuery.condition should always return Boolean value
## What changes were proposed in this pull request?
AssertOnQuery has two apply constructor: one that accepts a closure that returns boolean, and another that accepts a closure that returns Unit. This is actually very confusing because developers could mistakenly think that AssertOnQuery always require a boolean return type and verifies the return result, when indeed the value of the last statement is ignored in one of the constructors.

This pull request makes the two constructor consistent and always require boolean value. It will overall make the test suites more robust against developer errors.

As an evidence for the confusing behavior, this change also identified a bug with an existing test case due to file system time granularity. This pull request fixes that test case as well.

## How was this patch tested?
This is a test only change.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #15127 from petermaxlee/SPARK-17571.
2016-09-18 15:22:01 -07:00
Liwei Lin 1dbb725dbe
[SPARK-16462][SPARK-16460][SPARK-15144][SQL] Make CSV cast null values properly
## Problem

CSV in Spark 2.0.0:
-  does not read null values back correctly for certain data types such as `Boolean`, `TimestampType`, `DateType` -- this is a regression comparing to 1.6;
- does not read empty values (specified by `options.nullValue`) as `null`s for `StringType` -- this is compatible with 1.6 but leads to problems like SPARK-16903.

## What changes were proposed in this pull request?

This patch makes changes to read all empty values back as `null`s.

## How was this patch tested?

New test cases.

Author: Liwei Lin <lwlin7@gmail.com>

Closes #14118 from lw-lin/csv-cast-null.
2016-09-18 19:25:58 +01:00
jiangxingbo 5d3f4615f8
[SPARK-17506][SQL] Improve the check double values equality rule.
## What changes were proposed in this pull request?

In `ExpressionEvalHelper`, we check the equality between two double values by comparing whether the expected value is within the range [target - tolerance, target + tolerance], but this can cause a negative false when the compared numerics are very large.
Before:
```
val1 = 1.6358558070241E306
val2 = 1.6358558070240974E306
ExpressionEvalHelper.compareResults(val1, val2)
false
```
In fact, `val1` and `val2` are but with different precisions, we should tolerant this case by comparing with percentage range, eg.,expected is within range [target - target * tolerance_percentage, target + target * tolerance_percentage].
After:
```
val1 = 1.6358558070241E306
val2 = 1.6358558070240974E306
ExpressionEvalHelper.compareResults(val1, val2)
true
```

## How was this patch tested?

Exsiting testcases.

Author: jiangxingbo <jiangxb1987@gmail.com>

Closes #15059 from jiangxb1987/deq.
2016-09-18 16:04:37 +01:00
Wenchen Fan 3fe630d314 [SPARK-17541][SQL] fix some DDL bugs about table management when same-name temp view exists
## What changes were proposed in this pull request?

In `SessionCatalog`, we have several operations(`tableExists`, `dropTable`, `loopupRelation`, etc) that handle both temp views and metastore tables/views. This brings some bugs to DDL commands that want to handle temp view only or metastore table/view only. These bugs are:

1. `CREATE TABLE USING` will fail if a same-name temp view exists
2. `Catalog.dropTempView`will un-cache and drop metastore table if a same-name table exists
3. `saveAsTable` will fail or have unexpected behaviour if a same-name temp view exists.

These bug fixes are pulled out from https://github.com/apache/spark/pull/14962 and targets both master and 2.0 branch

## How was this patch tested?

new regression tests

Author: Wenchen Fan <wenchen@databricks.com>

Closes #15099 from cloud-fan/fix-view.
2016-09-18 21:15:35 +08:00
gatorsmile 3a3c9ffbd2 [SPARK-17518][SQL] Block Users to Specify the Internal Data Source Provider Hive
### What changes were proposed in this pull request?
In Spark 2.1, we introduced a new internal provider `hive` for telling Hive serde tables from data source tables. This PR is to block users to specify this in `DataFrameWriter` and SQL APIs.

### How was this patch tested?
Added a test case

Author: gatorsmile <gatorsmile@gmail.com>

Closes #15073 from gatorsmile/formatHive.
2016-09-18 15:37:15 +08:00
hyukjinkwon 86c2d393a5
[SPARK-17480][SQL][FOLLOWUP] Fix more instances which calls List.length/size which is O(n)
## What changes were proposed in this pull request?

This PR fixes all the instances which was fixed in the previous PR.

To make sure, I manually debugged and also checked the Scala source. `length` in [LinearSeqOptimized.scala#L49-L57](https://github.com/scala/scala/blob/2.11.x/src/library/scala/collection/LinearSeqOptimized.scala#L49-L57) is O(n). Also, `size` calls `length` via [SeqLike.scala#L106](https://github.com/scala/scala/blob/2.11.x/src/library/scala/collection/SeqLike.scala#L106).

For debugging, I have created these as below:

```scala
ArrayBuffer(1, 2, 3)
Array(1, 2, 3)
List(1, 2, 3)
Seq(1, 2, 3)
```

and then called `size` and `length` for each to debug.

## How was this patch tested?

I ran the bash as below on Mac

```bash
find . -name *.scala -type f -exec grep -il "while (.*\\.length)" {} \; | grep "src/main"
find . -name *.scala -type f -exec grep -il "while (.*\\.size)" {} \; | grep "src/main"
```

and then checked each.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #15093 from HyukjinKwon/SPARK-17480-followup.
2016-09-17 16:52:30 +01:00
David Navas 9dbd4b864e
[SPARK-17529][CORE] Implement BitSet.clearUntil and use it during merge joins
## What changes were proposed in this pull request?

Add a clearUntil() method on BitSet (adapted from the pre-existing setUntil() method).
Use this method to clear the subset of the BitSet which needs to be used during merge joins.

## How was this patch tested?

dev/run-tests, as well as performance tests on skewed data as described in jira.

I expect there to be a small local performance hit using BitSet.clearUntil rather than BitSet.clear for normally shaped (unskewed) joins (additional read on the last long).  This is expected to be de-minimis and was not specifically tested.

Author: David Navas <davidn@clearstorydata.com>

Closes #15084 from davidnavas/bitSet.
2016-09-17 16:22:23 +01:00
Daniel Darabos 69cb049697
Correct fetchsize property name in docs
## What changes were proposed in this pull request?

Replace `fetchSize` with `fetchsize` in the docs.

## How was this patch tested?

I manually tested `fetchSize` and `fetchsize`. The latter has an effect. See also [`JdbcUtils.scala#L38`](https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L38) for the definition of the property.

Author: Daniel Darabos <darabos.daniel@gmail.com>

Closes #14975 from darabos/patch-3.
2016-09-17 12:28:42 +01:00
Marcelo Vanzin 39e2bad6a8 [SPARK-17549][SQL] Only collect table size stat in driver for cached relation.
The existing code caches all stats for all columns for each partition
in the driver; for a large relation, this causes extreme memory usage,
which leads to gc hell and application failures.

It seems that only the size in bytes of the data is actually used in the
driver, so instead just colllect that. In executors, the full stats are
still kept, but that's not a big problem; we expect the data to be distributed
and thus not really incur in too much memory pressure in each individual
executor.

There are also potential improvements on the executor side, since the data
being stored currently is very wasteful (e.g. storing boxed types vs.
primitive types for stats). But that's a separate issue.

On a mildly related change, I'm also adding code to catch exceptions in the
code generator since Janino was breaking with the test data I tried this
patch on.

Tested with unit tests and by doing a count a very wide table (20k columns)
with many partitions.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #15112 from vanzin/SPARK-17549.
2016-09-16 14:02:56 -07:00
Sean Owen b9323fc938 [SPARK-17561][DOCS] DataFrameWriter documentation formatting problems
## What changes were proposed in this pull request?

Fix `<ul> / <li>` problems in SQL scaladoc.

## How was this patch tested?

Scaladoc build and manual verification of generated HTML.

Author: Sean Owen <sowen@cloudera.com>

Closes #15117 from srowen/SPARK-17561.
2016-09-16 13:43:05 -07:00
Sean Zhong a425a37a5d [SPARK-17426][SQL] Refactor TreeNode.toJSON to avoid OOM when converting unknown fields to JSON
## What changes were proposed in this pull request?

This PR is a follow up of SPARK-17356. Current implementation of `TreeNode.toJSON` recursively converts all fields of TreeNode to JSON, even if the field is of type `Seq` or type Map. This may trigger out of memory exception in cases like:

1. the Seq or Map can be very big. Converting them to JSON may take huge memory, which may trigger out of memory error.
2. Some user space input may also be propagated to the Plan. The user space input can be of arbitrary type, and may also be self-referencing. Trying to print user space input to JSON may trigger out of memory error or stack overflow error.

For a code example, please check the Jira description of SPARK-17426.

In this PR, we refactor the `TreeNode.toJSON` so that we only convert a field to JSON string if the field is a safe type.

## How was this patch tested?

Unit test.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #14990 from clockfly/json_oom2.
2016-09-16 19:37:30 +08:00
Andrew Ray b72486f82d [SPARK-17458][SQL] Alias specified for aggregates in a pivot are not honored
## What changes were proposed in this pull request?

This change preserves aliases that are given for pivot aggregations

## How was this patch tested?

New unit test

Author: Andrew Ray <ray.andrew@gmail.com>

Closes #15111 from aray/SPARK-17458.
2016-09-15 21:45:29 +02:00
Sean Zhong a6b8182006 [SPARK-17364][SQL] Antlr lexer wrongly treats full qualified identifier as a decimal number token when parsing SQL string
## What changes were proposed in this pull request?

The Antlr lexer we use to tokenize a SQL string may wrongly tokenize a fully qualified identifier as a decimal number token. For example, table identifier `default.123_table` is wrongly tokenized as
```
default // Matches lexer rule IDENTIFIER
.123 // Matches lexer rule DECIMAL_VALUE
_TABLE // Matches lexer rule IDENTIFIER
```

The correct tokenization for `default.123_table` should be:
```
default // Matches lexer rule IDENTIFIER,
. // Matches a single dot
123_TABLE // Matches lexer rule IDENTIFIER
```

This PR fix the Antlr grammar so that it can tokenize fully qualified identifier correctly:
1. Fully qualified table name can be parsed correctly. For example, `select * from database.123_suffix`.
2. Fully qualified column name can be parsed correctly, for example `select a.123_suffix from a`.

### Before change

#### Case 1: Failed to parse fully qualified column name

```
scala> spark.sql("select a.123_column from a").show
org.apache.spark.sql.catalyst.parser.ParseException:
extraneous input '.123' expecting {<EOF>,
...
, IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 8)
== SQL ==
select a.123_column from a
--------^^^
```

#### Case 2: Failed to parse fully qualified table name
```
scala> spark.sql("select * from default.123_table")
org.apache.spark.sql.catalyst.parser.ParseException:
extraneous input '.123' expecting {<EOF>,
...
IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 21)

== SQL ==
select * from default.123_table
---------------------^^^
```

### After Change

#### Case 1: fully qualified column name, no ParseException thrown
```
scala> spark.sql("select a.123_column from a").show
```

#### Case 2: fully qualified table name, no ParseException thrown
```
scala> spark.sql("select * from default.123_table")
```

## How was this patch tested?

Unit test.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #15006 from clockfly/SPARK-17364.
2016-09-15 20:53:48 +02:00
岑玉海 fe767395ff [SPARK-17429][SQL] use ImplicitCastInputTypes with function Length
## What changes were proposed in this pull request?
select length(11);
select length(2.0);
these sql will return errors, but hive is ok.
this PR will support casting input types implicitly for function length
the correct result is:
select length(11) return 2
select length(2.0) return 3

Author: 岑玉海 <261810726@qq.com>
Author: cenyuhai <cenyuhai@didichuxing.com>

Closes #15014 from cenyuhai/SPARK-17429.
2016-09-15 20:45:00 +02:00
Herman van Hovell d403562eb4 [SPARK-17114][SQL] Fix aggregates grouped by literals with empty input
## What changes were proposed in this pull request?
This PR fixes an issue with aggregates that have an empty input, and use a literals as their grouping keys. These aggregates are currently interpreted as aggregates **without** grouping keys, this triggers the ungrouped code path (which aways returns a single row).

This PR fixes the `RemoveLiteralFromGroupExpressions` optimizer rule, which changes the semantics of the Aggregate by eliminating all literal grouping keys.

## How was this patch tested?
Added tests to `SQLQueryTestSuite`.

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #15101 from hvanhovell/SPARK-17114-3.
2016-09-15 20:24:15 +02:00
John Muller 71a65825c5 [SPARK-17536][SQL] Minor performance improvement to JDBC batch inserts
## What changes were proposed in this pull request?

Optimize a while loop during batch inserts

## How was this patch tested?

Unit tests were done, specifically "mvn  test" for sql

Author: John Muller <jmuller@us.imshealth.com>

Closes #15098 from blue666man/SPARK-17536.
2016-09-15 10:00:28 +01:00
Adam Roberts f893e26250 [SPARK-17524][TESTS] Use specified spark.buffer.pageSize
## What changes were proposed in this pull request?

This PR has the appendRowUntilExceedingPageSize test in RowBasedKeyValueBatchSuite use whatever spark.buffer.pageSize value a user has specified to prevent a test failure for anyone testing Apache Spark on a box with a reduced page size. The test is currently hardcoded to use the default page size which is 64 MB so this minor PR is a test improvement

## How was this patch tested?
Existing unit tests with 1 MB page size and with 64 MB (the default) page size

Author: Adam Roberts <aroberts@uk.ibm.com>

Closes #15079 from a-roberts/patch-5.
2016-09-15 09:37:12 +01:00
gatorsmile 6a6adb1673 [SPARK-17440][SPARK-17441] Fixed Multiple Bugs in ALTER TABLE
### What changes were proposed in this pull request?
For the following `ALTER TABLE` DDL, we should issue an exception when the target table is a `VIEW`:
```SQL
 ALTER TABLE viewName SET LOCATION '/path/to/your/lovely/heart'

 ALTER TABLE viewName SET SERDE 'whatever'

 ALTER TABLE viewName SET SERDEPROPERTIES ('x' = 'y')

 ALTER TABLE viewName PARTITION (a=1, b=2) SET SERDEPROPERTIES ('x' = 'y')

 ALTER TABLE viewName ADD IF NOT EXISTS PARTITION (a='4', b='8')

 ALTER TABLE viewName DROP IF EXISTS PARTITION (a='2')

 ALTER TABLE viewName RECOVER PARTITIONS

 ALTER TABLE viewName PARTITION (a='1', b='q') RENAME TO PARTITION (a='100', b='p')
```

In addition, `ALTER TABLE RENAME PARTITION` is unable to handle data source tables, just like the other `ALTER PARTITION` commands. We should issue an exception instead.

### How was this patch tested?
Added a few test cases.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #15004 from gatorsmile/altertable.
2016-09-15 14:43:10 +08:00
Shixiong Zhu e33bfaed3b [SPARK-17463][CORE] Make CollectionAccumulator and SetAccumulator's value can be read thread-safely
## What changes were proposed in this pull request?

Make CollectionAccumulator and SetAccumulator's value can be read thread-safely to fix the ConcurrentModificationException reported in [JIRA](https://issues.apache.org/jira/browse/SPARK-17463).

## How was this patch tested?

Existing tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #15063 from zsxwing/SPARK-17463.
2016-09-14 13:33:51 -07:00
Xin Wu 040e46979d [SPARK-10747][SQL] Support NULLS FIRST|LAST clause in ORDER BY
## What changes were proposed in this pull request?
Currently, ORDER BY clause returns nulls value according to sorting order (ASC|DESC), considering null value is always smaller than non-null values.
However, SQL2003 standard support NULLS FIRST or NULLS LAST to allow users to specify whether null values should be returned first or last, regardless of sorting order (ASC|DESC).

This PR is to support this new feature.

## How was this patch tested?
New test cases are added to test NULLS FIRST|LAST for regular select queries and windowing queries.

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Author: Xin Wu <xinwu@us.ibm.com>

Closes #14842 from xwu0226/SPARK-10747.
2016-09-14 21:14:29 +02:00
hyukjinkwon a79838bdee [MINOR][SQL] Add missing functions for some options in SQLConf and use them where applicable
## What changes were proposed in this pull request?

I first thought they are missing because they are kind of hidden options but it seems they are just missing.

For example, `spark.sql.parquet.mergeSchema` is documented in [sql-programming-guide.md](https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md) but this function is missing whereas many options such as `spark.sql.join.preferSortMergeJoin` are not documented but have its own function individually.

So, this PR suggests making them consistent by adding the missing functions for some options in `SQLConf` and use them where applicable, in order to make them more readable.

## How was this patch tested?

Existing tests should cover this.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14678 from HyukjinKwon/sqlconf-cleanup.
2016-09-15 01:33:56 +08:00
Josh Rosen 6d06ff6f7e [SPARK-17514] df.take(1) and df.limit(1).collect() should perform the same in Python
## What changes were proposed in this pull request?

In PySpark, `df.take(1)` runs a single-stage job which computes only one partition of the DataFrame, while `df.limit(1).collect()` computes all partitions and runs a two-stage job. This difference in performance is confusing.

The reason why `limit(1).collect()` is so much slower is that `collect()` internally maps to `df.rdd.<some-pyspark-conversions>.toLocalIterator`, which causes Spark SQL to build a query where a global limit appears in the middle of the plan; this, in turn, ends up being executed inefficiently because limits in the middle of plans are now implemented by repartitioning to a single task rather than by running a `take()` job on the driver (this was done in #7334, a patch which was a prerequisite to allowing partition-local limits to be pushed beneath unions, etc.).

In order to fix this performance problem I think that we should generalize the fix from SPARK-10731 / #8876 so that `DataFrame.collect()` also delegates to the Scala implementation and shares the same performance properties. This patch modifies `DataFrame.collect()` to first collect all results to the driver and then pass them to Python, allowing this query to be planned using Spark's `CollectLimit` optimizations.

## How was this patch tested?

Added a regression test in `sql/tests.py` which asserts that the expected number of jobs, stages, and tasks are run for both queries.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #15068 from JoshRosen/pyspark-collect-limit.
2016-09-14 10:10:01 -07:00