Commit graph

3695 commits

Sean Zhong 06514d689c [SPARK-12988][SQL] Can't drop top level columns that contain dots
## What changes were proposed in this pull request?

Fixes "Can't drop top level columns that contain dots".

This work is based on dilipbiswal's https://github.com/apache/spark/pull/10943.
This PR fixes problems like:

```
scala> Seq((1, 2)).toDF("a.b", "a.c").drop("a.b")
org.apache.spark.sql.AnalysisException: cannot resolve '`a.c`' given input columns: [a.b, a.c];
```

`drop(columnName)` can only be used to drop a top-level column, so we should parse the column name literally, WITHOUT interpreting the dot ".".

We should also NOT interpret back tick "`", otherwise it is hard to understand what

````
```aaa```bbb``
````

actually means.
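
For reference, a minimal sketch of the intended behavior after the fix, reusing the example above (spark-shell style; the expected outputs are illustrative):

```scala
val df = Seq((1, 2)).toDF("a.b", "a.c")
df.drop("a.b").columns   // expected: Array("a.c"), the name is matched literally, dots and all
df.drop("a.c").columns   // expected: Array("a.b")
```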

## How was this patch tested?

Unit tests.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13306 from clockfly/fix_drop_column.
2016-05-31 17:34:10 -07:00
Josh Rosen 8ca01a6feb [SPARK-15680][SQL] Disable comments in generated code in order to avoid perf. issues
## What changes were proposed in this pull request?

In benchmarks involving tables with very wide and complex schemas (thousands of columns, deep nesting), I noticed that significant amounts of time (order of tens of seconds per task) were being spent generating comments during the code generation phase.

The root cause of the performance problem stems from the fact that calling toString() on a complex expression can involve thousands of string concatenations, resulting in huge amounts (tens of gigabytes) of character array allocation and copying.

In the long term, we can avoid this problem by passing StringBuilders down the tree and using them to accumulate output. As a short-term workaround, this patch guards comment generation behind a flag and disables comments by default (for wide tables / complex queries, these comments were being truncated prior to display and thus were not very useful).
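
A minimal sketch of the short-term workaround, assuming a boolean config flag; the flag and helper names below are illustrative, not Spark's actual config keys or codegen API:

```scala
// Only pay the costly toString() when comments are requested.
def exprComment(expr: AnyRef, commentsEnabled: Boolean): String =
  if (commentsEnabled) s"/* ${expr.toString} */" else ""
```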

## How was this patch tested?

This was tested manually by running a Spark SQL query over an empty table with a very wide schema obtained from a real workload. Disabling comments brought the per-task time down from about 16 seconds to 600 milliseconds.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #13421 from JoshRosen/disable-line-comments-in-codegen.
2016-05-31 17:30:03 -07:00
Reynold Xin 223f1d58c4 [SPARK-15662][SQL] Add since annotation for classes in sql.catalog
## What changes were proposed in this pull request?
This patch does a few things:

1. Adds a since version annotation to methods and classes in sql.catalog.
2. Fixes a typo in FilterFunction and a whitespace issue in spark/api/java/function/package.scala.
3. Adds a "database" field to the Function class (see the sketch below).

## How was this patch tested?
Updated unit test case for "database" field in Function class.

Author: Reynold Xin <rxin@databricks.com>

Closes #13406 from rxin/SPARK-15662.
2016-05-31 17:29:10 -07:00
Tathagata Das 90b11439b3 [SPARK-15517][SQL][STREAMING] Add support for complete output mode in Structured Streaming
## What changes were proposed in this pull request?
Currently, Structured Streaming only supports the Append output mode. This PR adds the following.

- Added support for the Complete output mode in the internal state store, analyzer, and planner.
- Added a public API in Scala and Python for users to specify the output mode (see the sketch after this list).
- Added checks for unsupported combinations of output mode and DataFrame operations:
  - Plans with no aggregation should support only Append mode
  - Plans with aggregation should support only Update and Complete modes
  - The default output mode is Append mode (**Question: should we change this to automatically set to Complete mode when there is aggregation?**)
- Added support for the Complete output mode in the Memory Sink. The Memory Sink internally supports Append, Complete, and Update, but only the Complete and Append output modes are exposed through the public API.
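
A hedged usage sketch of the writer-level API; only `outputMode` itself is taken from this description, `streamingDf` stands in for any streaming DataFrame, and the surrounding calls are illustrative for this Spark version:

```scala
// Aggregation query, so Complete mode applies; "append" remains the default otherwise.
streamingDf
  .groupBy("value").count()
  .write
  .outputMode("complete")   // also settable from Python
  .queryName("counts")
```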

## How was this patch tested?
Unit tests in various test suites
- StreamingAggregationSuite: tests for complete mode
- MemorySinkSuite: tests for checking behavior in Append and Complete modes.
- UnsupportedOperationSuite: tests for checking unsupported combinations of DF ops and output modes
- DataFrameReaderWriterSuite: tests for checking that output mode cannot be called on static DFs
- Python doc test and existing unit tests modified to call write.outputMode.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13286 from tdas/complete-mode.
2016-05-31 15:57:01 -07:00
Dilip Biswal dfe2cbeb43 [SPARK-15557] [SQL] cast the string into DoubleType when it's used together with decimal
In this case, the result type of the expression becomes DECIMAL(38, 36), as we promote the individual string literals to DECIMAL(38, 18) when we handle string promotions for `BinaryArithmeticExpression`.

I think we need to cast the string literals to Double type instead. I looked at the history and found that  this was changed to use decimal instead of double to avoid potential loss of precision when we cast decimal to double.

To double-check, I ran the query against Hive and MySQL. The query returns a non-NULL result in both databases, and both promote the expression to double.
Here is the output.

- Hive
```SQL
hive> create table l2 as select (cast(99 as decimal(19,6)) + '2') from l1;
OK
hive> describe l2;
OK
_c0                 	double
```
- MySQL
```SQL
mysql> create table foo2 as select (cast(99 as decimal(19,6)) + '2') from test;
Query OK, 1 row affected (0.01 sec)
Records: 1  Duplicates: 0  Warnings: 0

mysql> describe foo2;
+-----------------------------------+--------+------+-----+---------+-------+
| Field                             | Type   | Null | Key | Default | Extra |
+-----------------------------------+--------+------+-----+---------+-------+
| (cast(99 as decimal(19,6)) + '2') | double | NO   |     | 0       |       |
+-----------------------------------+--------+------+-----+---------+-------+
```
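
For comparison, a sketch of the same expression in Spark SQL; before this change the string literal was promoted to DECIMAL(38, 18), the result type became DECIMAL(38, 36), and the value came back NULL, whereas after the change the string is cast to DOUBLE, matching the Hive and MySQL behavior above:

```scala
// Expected schema after the fix: the single column is reported as double.
spark.sql("SELECT cast(99 as decimal(19,6)) + '2'").printSchema()
```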

## How was this patch tested?
Added a new test in SQLQuerySuite

Author: Dilip Biswal <dbiswal@us.ibm.com>

Closes #13368 from dilipbiswal/spark-15557.
2016-05-31 15:49:45 -07:00
Davies Liu 2df6ca848e [SPARK-15327] [SQL] fix split expression in whole stage codegen
## What changes were proposed in this pull request?

Right now, we split the code for expressions into multiple functions when it exceeds 64 KB. That splitting requires the expressions to operate on a Row object, which is not true for whole-stage codegen, so the generated code fails to compile after being split.

This PR will not split the code in whole-stage codegen.
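
A very rough sketch of the guard, under the assumption that the splitting helper takes the generated blocks as strings; every name below is invented and the real logic lives in Spark's code generator:

```scala
def emitExpressions(blocks: Seq[String], wholeStageCodegen: Boolean): String =
  if (wholeStageCodegen) {
    blocks.mkString("\n")  // keep the code inline; the split form assumes a Row-based input
  } else {
    blocks.zipWithIndex
      .map { case (code, i) => s"private void evalExpr$i(InternalRow row) { $code }" }
      .mkString("\n")
  }
```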

## How was this patch tested?

Added regression tests.

Author: Davies Liu <davies@databricks.com>

Closes #13235 from davies/fix_nested_codegen.
2016-05-31 15:36:02 -07:00
Shixiong Zhu 9a74de18a1 Revert "[SPARK-11753][SQL][TEST-HADOOP2.2] Make allowNonNumericNumbers option work
## What changes were proposed in this pull request?

This reverts commit c24b6b679c. Sent a PR to run Jenkins tests due to the revert conflicts of `dev/deps/spark-deps-hadoop*`.

## How was this patch tested?

Jenkins unit tests, integration tests, manual tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #13417 from zsxwing/revert-SPARK-11753.
2016-05-31 14:50:07 -07:00
Yin Huai c6de5832bf [SPARK-15622][SQL] Wrap the parent classloader of Janino's classloader in the ParentClassLoader.
## What changes were proposed in this pull request?
At https://github.com/aunkrig/janino/blob/janino_2.7.8/janino/src/org/codehaus/janino/ClassLoaderIClassLoader.java#L80-L85, Janino's classloader throws the exception when its parent throws a ClassNotFoundException with a cause set. However, it does not throw the exception when there is no cause set. It seems we need to use a special ClassLoader to wrap the actual parent classloader set on Janino in order to handle this behavior.
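
A hedged sketch of the wrapping idea (the real wrapper class lives in Spark core and does more): the JVM's default `findClass` throws a `ClassNotFoundException` without a cause, so a missing class surfaced to Janino through the wrapper no longer carries one and compilation is not aborted:

```scala
// Illustrative only; the name is invented and Janino's setParentClassLoader would
// receive an instance of the real Spark wrapper instead.
class WrappingParentClassLoader(parent: ClassLoader) extends ClassLoader(parent)
```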

## How was this patch tested?
I have reverted the workaround made by https://issues.apache.org/jira/browse/SPARK-11636 ( https://github.com/apache/spark/compare/master...yhuai:SPARK-15622?expand=1#diff-bb538fda94224dd0af01d0fd7e1b4ea0R81) and `test-only *ReplSuite -- -z "SPARK-2576 importing implicits"` still passes (without the change in `CodeGenerator`, this test does not pass with the change in `ExecutorClassLoader`).

Author: Yin Huai <yhuai@databricks.com>

Closes #13366 from yhuai/SPARK-15622.
2016-05-31 12:30:34 -07:00
Wenchen Fan 2bfed1a0c5 [SPARK-15658][SQL] UDT serializer should declare its data type as udt instead of udt.sqlType
## What changes were proposed in this pull request?

When we build the serializer for a UDT object, we should declare its data type as the udt instead of udt.sqlType; otherwise, if we deserialize it again, we lose the information that it is a UDT object and throw an analysis exception.

## How was this patch tested?

new test in `UserDefinedTypeSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13402 from cloud-fan/udt.
2016-05-31 11:00:38 -07:00
gatorsmile d67c82e4b6 [SPARK-15647][SQL] Fix Boundary Cases in OptimizeCodegen Rule
#### What changes were proposed in this pull request?

The following condition in the Optimizer rule `OptimizeCodegen` is not right.
```Scala
branches.size < conf.maxCaseBranchesForCodegen
```

- The number of branches in case when clause should be `branches.size + elseBranch.size`.
- `maxCaseBranchesForCodegen` is the maximum boundary for enabling codegen. Thus, we should use `<=` instead of `<`.

This PR is to fix this boundary case and also add missing test cases for verifying the conf `MAX_CASES_BRANCHES`.
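
Under that reading, the corrected check would look roughly like the following (a sketch; the names mirror the snippet above and `elseBranch` is assumed to be an `Option`, so it contributes 1 only when an ELSE is present):

```scala
def shouldCodegen(branches: Seq[Any], elseBranch: Option[Any], maxCaseBranchesForCodegen: Int): Boolean = {
  val totalBranches = branches.size + (if (elseBranch.isDefined) 1 else 0)  // count the ELSE branch too
  totalBranches <= maxCaseBranchesForCodegen                                // inclusive boundary
}
```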

#### How was this patch tested?
Added test cases in `SQLConfSuite`

Author: gatorsmile <gatorsmile@gmail.com>

Closes #13392 from gatorsmile/maxCaseWhen.
2016-05-31 10:08:00 -07:00
Lianhui Wang 2bfc4f1521 [SPARK-15649][SQL] Avoid to serialize MetastoreRelation in HiveTableScanExec
## What changes were proposed in this pull request?
In HiveTableScanExec, `schema` is lazy and depends on `relation.attributeMap`, so serializing the task binary bytes also serializes the MetastoreRelation. This change avoids serializing the MetastoreRelation.

## How was this patch tested?

Author: Lianhui Wang <lianhuiwang09@gmail.com>

Closes #13397 from lianhuiwang/avoid-serialize.
2016-05-31 09:21:51 -07:00
Takeshi YAMAMURO 95db8a44f3 [SPARK-15528][SQL] Fix race condition in NumberConverter
## What changes were proposed in this pull request?
A mutable value inside NumberConverter is wrongly shared between threads.
This PR fixes the race condition.
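
A hedged sketch of the general fix pattern (all names invented; the real change is in NumberConverter's conversion routine): give each call its own scratch buffer instead of reusing one mutable buffer shared across threads:

```scala
object NumberConverterSketch {
  // before: private val temp = new Array[Byte](64)   // one shared buffer, races under concurrency
  def convert(digits: String): Array[Byte] = {
    val temp = new Array[Byte](64)                     // allocated per call, thread-safe
    digits.getBytes("UTF-8").copyToArray(temp)
    temp
  }
}
```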

## How was this patch tested?
Manually checked.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #13391 from maropu/SPARK-15528.
2016-05-31 07:25:16 -05:00
Reynold Xin 675921040e [SPARK-15638][SQL] Audit Dataset, SparkSession, and SQLContext
## What changes were proposed in this pull request?
This patch contains a list of changes as a result of my auditing Dataset, SparkSession, and SQLContext. The patch audits the categorization of experimental APIs, function groups, and deprecations. For the detailed list of changes, please see the diff.

## How was this patch tested?
N/A

Author: Reynold Xin <rxin@databricks.com>

Closes #13370 from rxin/SPARK-15638.
2016-05-30 22:47:58 -07:00
Cheng Lian 1360a6d636 [SPARK-15112][SQL] Disables EmbedSerializerInFilter for plan fragments that change schema
## What changes were proposed in this pull request?

`EmbedSerializerInFilter` implicitly assumes that the plan fragment being optimized doesn't change plan schema, which is reasonable because `Dataset.filter` should never change the schema.

However, due to another issue involving `DeserializeToObject` and `SerializeFromObject`, typed filter *does* change plan schema (see [SPARK-15632][1]). This breaks `EmbedSerializerInFilter` and causes corrupted data.

This PR disables `EmbedSerializerInFilter` when there's a schema change to avoid data corruption. The schema change issue should be addressed in follow-up PRs.

## How was this patch tested?

New test case added in `DatasetSuite`.

[1]: https://issues.apache.org/jira/browse/SPARK-15632

Author: Cheng Lian <lian@databricks.com>

Closes #13362 from liancheng/spark-15112-corrupted-filter.
2016-05-29 23:19:12 -07:00
Sean Owen ce1572d16f [MINOR] Resolve a number of miscellaneous build warnings
## What changes were proposed in this pull request?

This change resolves a number of build warnings that have accumulated, before 2.x. It does not address a large number of deprecation warnings, especially related to the Accumulator API. That will happen separately.

## How was this patch tested?

Jenkins

Author: Sean Owen <sowen@cloudera.com>

Closes #13377 from srowen/BuildWarnings.
2016-05-29 16:48:14 -05:00
Reynold Xin 472f16181d [SPARK-15636][SQL] Make aggregate expressions more concise in explain
## What changes were proposed in this pull request?
This patch reduces the verbosity of aggregate expressions in explain (but does not actually remove any information). As an example, for the following command:
```
spark.range(10).selectExpr("sum(id) + 1", "count(distinct id)").explain(true)
```

Output before this patch:
```
== Physical Plan ==
*TungstenAggregate(key=[], functions=[(sum(id#0L),mode=Final,isDistinct=false),(count(id#0L),mode=Final,isDistinct=true)], output=[(sum(id) + 1)#3L,count(DISTINCT id)#16L])
+- Exchange SinglePartition, None
   +- *TungstenAggregate(key=[], functions=[(sum(id#0L),mode=PartialMerge,isDistinct=false),(count(id#0L),mode=Partial,isDistinct=true)], output=[sum#18L,count#21L])
      +- *TungstenAggregate(key=[id#0L], functions=[(sum(id#0L),mode=PartialMerge,isDistinct=false)], output=[id#0L,sum#18L])
         +- Exchange hashpartitioning(id#0L, 5), None
            +- *TungstenAggregate(key=[id#0L], functions=[(sum(id#0L),mode=Partial,isDistinct=false)], output=[id#0L,sum#18L])
               +- *Range (0, 10, splits=2)
```

Output after this patch:
```
== Physical Plan ==
*TungstenAggregate(key=[], functions=[sum(id#0L),count(distinct id#0L)], output=[(sum(id) + 1)#3L,count(DISTINCT id)#16L])
+- Exchange SinglePartition, None
   +- *TungstenAggregate(key=[], functions=[merge_sum(id#0L),partial_count(distinct id#0L)], output=[sum#18L,count#21L])
      +- *TungstenAggregate(key=[id#0L], functions=[merge_sum(id#0L)], output=[id#0L,sum#18L])
         +- Exchange hashpartitioning(id#0L, 5), None
            +- *TungstenAggregate(key=[id#0L], functions=[partial_sum(id#0L)], output=[id#0L,sum#18L])
               +- *Range (0, 10, splits=2)
```

Note the change from `(sum(id#0L),mode=PartialMerge,isDistinct=false)` to `merge_sum(id#0L)`.

In general aggregate explain is still very verbose, but further work will be done as follow-up pull requests.

## How was this patch tested?
Tested manually.

Author: Reynold Xin <rxin@databricks.com>

Closes #13367 from rxin/SPARK-15636.
2016-05-28 14:14:36 -07:00
Yadong Qi b4c32c4952 [SPARK-15549][SQL] Disable bucketing when the output doesn't contain all bucketing columns
## What changes were proposed in this pull request?
I create a bucketed table `bucketed_table` with bucket column `i`,
```scala
case class Data(i: Int, j: Int, k: Int)
sc.makeRDD(Array((1, 2, 3))).map(x => Data(x._1, x._2, x._3)).toDF.write.bucketBy(2, "i").saveAsTable("bucketed_table")
```

and then run the following SQL statements:
```sql
SELECT j FROM bucketed_table;
Error in query: bucket column i not found in existing columns (j);

SELECT j, MAX(k) FROM bucketed_table GROUP BY j;
Error in query: bucket column i not found in existing columns (j, k);
```

I think we should add a check so that we only enable bucketing when all of the conditions below hold (see the sketch after this list):
1. the conf is enabled
2. the relation is bucketed
3. the output contains all bucketing columns
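
A hedged sketch of the combined check (all names invented):

```scala
def bucketingUsable(confEnabled: Boolean,
                    bucketColumns: Option[Seq[String]],
                    outputColumns: Seq[String]): Boolean =
  confEnabled &&                                    // 1. the conf is enabled
    bucketColumns.exists(cols =>                    // 2. the relation is bucketed
      cols.forall(outputColumns.contains))          // 3. the output contains all bucketing columns
```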

## How was this patch tested?
Updated test cases to reflect the changes.

Author: Yadong Qi <qiyadong2010@gmail.com>

Closes #13321 from watermen/SPARK-15549.
2016-05-28 10:19:29 -07:00
Liang-Chi Hsieh f1b220eeee [SPARK-15553][SQL] Dataset.createTempView should use CreateViewCommand
## What changes were proposed in this pull request?

Let `Dataset.createTempView` and `Dataset.createOrReplaceTempView` use `CreateViewCommand`, rather than calling `SparkSession.createTempView`. Besides, this patch also removes `SparkSession.createTempView`.

## How was this patch tested?
Existing tests.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #13327 from viirya/dataset-createtempview.
2016-05-27 21:24:08 -07:00
Reynold Xin 73178c7556 [SPARK-15633][MINOR] Make package name for Java tests consistent
## What changes were proposed in this pull request?
This is a simple patch that makes the package names for Java 8 test suites consistent. I moved everything to test.org.apache.spark so we can test package-private APIs properly. I also added "java8" to the package name so we can easily run all the tests related to Java 8.

## How was this patch tested?
This is a test only change.

Author: Reynold Xin <rxin@databricks.com>

Closes #13364 from rxin/SPARK-15633.
2016-05-27 21:20:02 -07:00
Andrew Or 4a2fb8b87c [SPARK-15594][SQL] ALTER TABLE SERDEPROPERTIES does not respect partition spec
## What changes were proposed in this pull request?

Previously, these commands ignored the partition spec and changed the storage properties of the table itself:
```
ALTER TABLE table_name PARTITION (a=1, b=2) SET SERDE 'my_serde'
ALTER TABLE table_name PARTITION (a=1, b=2) SET SERDEPROPERTIES ('key1'='val1')
```
Now they change the storage properties of the specified partition.

## How was this patch tested?

DDLSuite

Author: Andrew Or <andrew@databricks.com>

Closes #13343 from andrewor14/alter-table-serdeproperties.
2016-05-27 17:27:24 -07:00
Ryan Blue 776d183c82 [SPARK-9876][SQL] Update Parquet to 1.8.1.
## What changes were proposed in this pull request?

This includes minimal changes to get Spark using the current release of Parquet, 1.8.1.

## How was this patch tested?

This uses the existing Parquet tests.

Author: Ryan Blue <blue@apache.org>

Closes #13280 from rdblue/SPARK-9876-update-parquet.
2016-05-27 16:59:38 -07:00
Xin Wu 019afd9c78 [SPARK-15431][SQL][BRANCH-2.0-TEST] rework the clisuite test cases
## What changes were proposed in this pull request?
This PR reworks on the CliSuite test cases for `LIST FILES/JARS` commands.

CC yhuai Thanks!

Author: Xin Wu <xinwu@us.ibm.com>

Closes #13361 from xwu0226/SPARK-15431-clisuite-new.
2016-05-27 14:07:12 -07:00
Tejas Patil a96e4151a9 [SPARK-14400][SQL] ScriptTransformation does not fail the job for bad user command
## What changes were proposed in this pull request?

- Refer to the JIRA for the problem: https://issues.apache.org/jira/browse/SPARK-14400
- The fix is to check in `hasNext()` whether the process has exited with a non-zero exit code (see the sketch after this list). I have moved this check, together with the check for writer-thread exceptions, into a separate method.
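
A hedged sketch of that check; the exception type and names are illustrative, and the real code lives in ScriptTransformation's output iterator:

```scala
def assertScriptExitedCleanly(proc: Process): Unit = {
  if (!proc.isAlive && proc.exitValue() != 0) {
    throw new IllegalStateException(
      s"Script transformation process exited with code ${proc.exitValue()}")
  }
}
```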

## How was this patch tested?

- Ran a job which had incorrect transform script command and saw that the job fails
- Existing unit tests for `ScriptTransformationSuite`. Added a new unit test

Author: Tejas Patil <tejasp@fb.com>

Closes #12194 from tejasapatil/script_transform.
2016-05-27 12:05:11 -07:00
Xinh Huynh 5bdbedf220 [MINOR][DOCS] Typo fixes in Dataset scaladoc
## What changes were proposed in this pull request?

Minor typo fixes in Dataset scaladoc
* Corrected the context type to SparkSession, not SQLContext.
liancheng rxin andrewor14

## How was this patch tested?

Compiled locally

Author: Xinh Huynh <xinh_huynh@yahoo.com>

Closes #13330 from xinhhuynh/fix-dataset-typos.
2016-05-27 11:13:53 -07:00
Reynold Xin a52e681339 [SPARK-15597][SQL] Add SparkSession.emptyDataset
## What changes were proposed in this pull request?
This patch adds a new function emptyDataset to SparkSession, for creating an empty dataset.
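
A hedged usage sketch (spark-shell style; the encoder comes from the usual implicits import):

```scala
import spark.implicits._
val emptyStrings = spark.emptyDataset[String]   // a Dataset[String] with zero rows
emptyStrings.count()                            // expected: 0
```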

## How was this patch tested?
Added a test case.

Author: Reynold Xin <rxin@databricks.com>

Closes #13344 from rxin/SPARK-15597.
2016-05-27 11:13:09 -07:00
Sameer Agarwal 635fb30f83 [SPARK-15599][SQL][DOCS] API docs for createDataset functions in SparkSession
## What changes were proposed in this pull request?

Adds API docs and usage examples for the 3 `createDataset` calls in `SparkSession`
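
A hedged usage sketch of one of the `createDataset` overloads (spark-shell style):

```scala
import spark.implicits._
val ds = spark.createDataset(Seq(1, 2, 3))   // Dataset[Int] from a local Seq
ds.show()
```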

## How was this patch tested?

N/A

Author: Sameer Agarwal <sameer@databricks.com>

Closes #13345 from sameeragarwal/dataset-doc.
2016-05-27 11:11:31 -07:00
Dongjoon Hyun 4538443e27 [SPARK-15584][SQL] Abstract duplicate code: spark.sql.sources. properties
## What changes were proposed in this pull request?

This PR replaces `spark.sql.sources.` strings with `CreateDataSourceTableUtils.*` constant variables.
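
A hedged sketch of the refactoring pattern; the object name comes from the description, while the constant names are illustrative. The idea is to centralize the `spark.sql.sources.` keys behind named constants instead of repeating the raw strings at each call site:

```scala
object CreateDataSourceTableUtilsSketch {
  val DATASOURCE_PREFIX   = "spark.sql.sources."
  val DATASOURCE_PROVIDER = DATASOURCE_PREFIX + "provider"   // used instead of the raw string
}
```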

## How was this patch tested?

Pass the existing Jenkins tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13349 from dongjoon-hyun/SPARK-15584.
2016-05-27 11:10:31 -07:00
gatorsmile c17272902c [SPARK-15565][SQL] Add the File Scheme to the Default Value of WAREHOUSE_PATH
#### What changes were proposed in this pull request?
The default value of `spark.sql.warehouse.dir` is `System.getProperty("user.dir")/spark-warehouse`. Since `System.getProperty("user.dir")` is a local dir, we should explicitly set the scheme to local filesystem.
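
A hedged sketch of the resulting default; only the config key and the `file` scheme are taken from the description, the exact construction is illustrative:

```scala
val defaultWarehousePath = "file:" + System.getProperty("user.dir") + "/spark-warehouse"
// e.g. spark.conf.set("spark.sql.warehouse.dir", defaultWarehousePath)
```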

cc yhuai

#### How was this patch tested?
Added two test cases

Author: gatorsmile <gatorsmile@gmail.com>

Closes #13348 from gatorsmile/addSchemeToDefaultWarehousePath.
2016-05-27 09:54:31 -07:00
Xin Wu 6f95c6c030 [SPARK-15431][SQL][HOTFIX] ignore 'list' command testcase from CliSuite for now
## What changes were proposed in this pull request?
The test cases for the `list` command added to `CliSuite` by PR #13212 do not pass in some Jenkins jobs after being merged.
However, these Jenkins jobs do pass:
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.4/
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.2/
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.3/

Other jobs failed on this test case, but the failures occur at slightly different checkpoints in different jobs. So it seems that CliSuite's output capture is flaky when checking the expected output of list commands. There are already test cases in `HiveQuerySuite` and `SparkContextSuite` that cover these cases, so I am ignoring the 2 test cases added by PR #13212.

Author: Xin Wu <xinwu@us.ibm.com>

Closes #13276 from xwu0226/SPARK-15431-clisuite.
2016-05-27 08:54:14 -07:00
gatorsmile d5911d1173 [SPARK-15529][SQL] Replace SQLContext and HiveContext with SparkSession in Test
#### What changes were proposed in this pull request?
This PR uses the new entry point `SparkSession` to replace the existing `SQLContext` and `HiveContext` in SQL test suites.

No change is made in the following suites:
- `ListTablesSuite` is to test the APIs of `SQLContext`.
- `SQLContextSuite` is to test `SQLContext`
- `HiveContextCompatibilitySuite` is to test `HiveContext`

**Update**: Move tests in `ListTableSuite` to `SQLContextSuite`

#### How was this patch tested?
N/A

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes #13337 from gatorsmile/sparkSessionTest.
2016-05-26 22:40:57 -07:00
Zheng RuiFeng 6b1a6180e7 [MINOR] Fix Typos 'a -> an'
## What changes were proposed in this pull request?

`a` -> `an`

I used a regex to generate potential error lines:
`grep -in ' a [aeiou]' mllib/src/main/scala/org/apache/spark/ml/*/*scala`
and reviewed them line by line.

## How was this patch tested?

local build
`lint-java` checking

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #13317 from zhengruifeng/a_an.
2016-05-26 22:39:14 -07:00
Andrew Or 3fca635b4e [SPARK-15583][SQL] Disallow altering datasource properties
## What changes were proposed in this pull request?

Certain table properties (and SerDe properties) are in the protected namespace `spark.sql.sources.`, which we use internally for datasource tables. The user should not be allowed to

(1) Create a Hive table setting these properties
(2) Alter these properties in an existing table

Previously, we threw an exception if the user tried to alter the properties of an existing datasource table. However, this is overly restrictive for datasource tables and does not do anything for Hive tables.

## How was this patch tested?

DDLSuite

Author: Andrew Or <andrew@databricks.com>

Closes #13341 from andrewor14/alter-table-props.
2016-05-26 20:11:09 -07:00
Andrew Or 008a5377d5 [SPARK-15538][SPARK-15539][SQL] Truncate table fixes round 2
## What changes were proposed in this pull request?

Two more changes:
(1) Fix truncate table for data source tables (only for cases without `PARTITION`)
(2) Disallow truncating external tables or views

## How was this patch tested?

`DDLSuite`

Author: Andrew Or <andrew@databricks.com>

Closes #13315 from andrewor14/truncate-table.
2016-05-26 19:01:41 -07:00
Yin Huai 3ac2363d75 [SPARK-15532][SQL] SQLContext/HiveContext's public constructors should use SparkSession.builder.getOrCreate
## What changes were proposed in this pull request?
This PR changes SQLContext/HiveContext's public constructors to use SparkSession.builder.getOrCreate and removes isRootContext from SQLContext.

## How was this patch tested?
Existing tests.

Author: Yin Huai <yhuai@databricks.com>

Closes #13310 from yhuai/SPARK-15532.
2016-05-26 16:53:31 -07:00
Cheng Lian e7082caeb4 [SPARK-15550][SQL] Dataset.show() should show contents nested products as rows
## What changes were proposed in this pull request?

This PR addresses two related issues:

1. `Dataset.showString()` should show case classes/Java beans at all levels as rows, while the current master code only handles top-level ones.

2. `Dataset.showString()` should show the full contents produced by the underlying query plan.

   A Dataset is only a view of the underlying query plan. Columns not referred to by the encoder are still reachable using methods like `Dataset.col`, so it probably makes more sense to show the full contents of the query plan.

## How was this patch tested?

Two new test cases are added in `DatasetSuite` to check `.showString()` output.

Author: Cheng Lian <lian@databricks.com>

Closes #13331 from liancheng/spark-15550-ds-show.
2016-05-26 16:23:48 -07:00
Sean Zhong b5859e0bb8 [SPARK-13445][SQL] Improves error message and add test coverage for Window function
## What changes were proposed in this pull request?

Add a more informative error message when the ORDER BY clause is missing while using a window function.

## How was this patch tested?

Unit test.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13333 from clockfly/spark-13445.
2016-05-26 14:50:00 -07:00
Reynold Xin 0f61d6efb4 [SPARK-15552][SQL] Remove unnecessary private[sql] methods in SparkSession
## What changes were proposed in this pull request?
SparkSession has a list of unnecessary private[sql] methods. These methods cause some trouble because private[sql] doesn't apply in Java. In the cases that they are easy to remove, we can simply remove them. This patch does that.

As part of this pull request, I also replaced a bunch of protected[sql] with private[sql], to tighten up visibility.

## How was this patch tested?
Updated test cases to reflect the changes.

Author: Reynold Xin <rxin@databricks.com>

Closes #13319 from rxin/SPARK-15552.
2016-05-26 13:03:07 -07:00
Andrew Or 2b1ac6cea8 [SPARK-15539][SQL] DROP TABLE throw exception if table doesn't exist
## What changes were proposed in this pull request?

Same as #13302, but for DROP TABLE.

## How was this patch tested?

`DDLSuite`

Author: Andrew Or <andrew@databricks.com>

Closes #13307 from andrewor14/drop-table.
2016-05-26 12:04:18 -07:00
Bo Meng 53d4abe9e9 [SPARK-15537][SQL] fix dir delete issue
## What changes were proposed in this pull request?

Some of the test cases, e.g. `OrcSourceSuite`, create temp folders and temp files inside them, but after the tests finish the folders are not removed. This leaves lots of temp files around and wastes disk space if we keep running the test cases.

The reason is that `dir.delete()` does not work if the directory is not empty. We need to recursively delete the contents before deleting the folder.
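
A hedged sketch of the recursive delete (Spark has its own utility for this; the helper below just shows the idea):

```scala
import java.io.File

def deleteRecursively(f: File): Unit = {
  Option(f.listFiles()).toSeq.flatten.foreach(deleteRecursively)  // listFiles() is null for plain files
  f.delete()
}
```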

## How was this patch tested?

Manually checked the temp folder to make sure the temp files were deleted.

Author: Bo Meng <mengbo@hotmail.com>

Closes #13304 from bomeng/SPARK-15537.
2016-05-26 00:22:47 -07:00
Reynold Xin 361ebc282b [SPARK-15543][SQL] Rename DefaultSources to make them more self-describing
## What changes were proposed in this pull request?
This patch renames various DefaultSources to make their names more self-describing. The choice of "DefaultSource" was from the days when we did not have a good way to specify short names.

They are now named:
- LibSVMFileFormat
- CSVFileFormat
- JdbcRelationProvider
- JsonFileFormat
- ParquetFileFormat
- TextFileFormat

Backward compatibility is maintained through aliasing.

## How was this patch tested?
Updated relevant test cases too.

Author: Reynold Xin <rxin@databricks.com>

Closes #13311 from rxin/SPARK-15543.
2016-05-25 23:54:24 -07:00
Sameer Agarwal 06ed1fa3e4 [SPARK-15533][SQL] Deprecate Dataset.explode
## What changes were proposed in this pull request?

This patch deprecates `Dataset.explode` and documents appropriate workarounds to use `flatMap()` or `functions.explode()` instead.
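
A hedged sketch of the documented workarounds (spark-shell style; the case class, column, and field names are invented):

```scala
import spark.implicits._
import org.apache.spark.sql.functions.{col, explode}

case class Doc(words: Seq[String])
val ds = Seq(Doc(Seq("a", "b"))).toDS()

ds.select(explode(col("words")).as("word")).show()   // instead of the deprecated ds.explode(...)
ds.flatMap(_.words).show()                           // typed alternative
```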

## How was this patch tested?

N/A

Author: Sameer Agarwal <sameer@databricks.com>

Closes #13312 from sameeragarwal/deprecate.
2016-05-25 19:10:57 -07:00
Andrew Or ee682fe293 [SPARK-15534][SPARK-15535][SQL] Truncate table fixes
## What changes were proposed in this pull request?

Two changes:
- When things fail, `TRUNCATE TABLE` just returns nothing. Instead, we should throw exceptions.
- Remove `TRUNCATE TABLE ... COLUMN`, which was never supported by either Spark or Hive.

## How was this patch tested?
Jenkins.

Author: Andrew Or <andrew@databricks.com>

Closes #13302 from andrewor14/truncate-table.
2016-05-25 15:08:39 -07:00
Jurriaan Pruis c875d81a3d [SPARK-15493][SQL] default QuoteEscapingEnabled flag to true when writing CSV
## What changes were proposed in this pull request?

Default the QuoteEscapingEnabled flag to true when writing CSV, and add an `escapeQuotes` option to be able to change this.

See f3eb2af263/src/main/java/com/univocity/parsers/csv/CsvWriterSettings.java (L231-L247)

This change is needed to be able to write RFC 4180 compatible CSV files (https://tools.ietf.org/html/rfc4180#section-2)

https://issues.apache.org/jira/browse/SPARK-15493
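
A hedged usage sketch of the new option; the option name comes from the description, while `df`, the value, and the path are illustrative:

```scala
df.write
  .option("escapeQuotes", "false")   // defaults to true after this change
  .csv("/tmp/escaping-disabled")
```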

## How was this patch tested?

Added a test that verifies the output is quoted correctly.

Author: Jurriaan Pruis <email@jurriaanpruis.nl>

Closes #13267 from jurriaan/quote-escaping.
2016-05-25 12:40:16 -07:00
Takuya UESHIN 4b88067416 [SPARK-15483][SQL] IncrementalExecution should use extra strategies.
## What changes were proposed in this pull request?

Extra strategies do not work for streams because `IncrementalExecution` uses a modified planner with stateful operations, but that planner does not include the extra strategies.

This PR fixes `IncrementalExecution` to include the extra strategies so that they are used.

## How was this patch tested?

I added a test to check if extra strategies work for streams.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #13261 from ueshin/issues/SPARK-15483.
2016-05-25 12:02:07 -07:00
lfzCarlosC 02c8072eea [MINOR][MLLIB][STREAMING][SQL] Fix typos
Fixed typos in the source code of the [mllib], [streaming], and [SQL] components.

None and obvious.

Author: lfzCarlosC <lfz.carlos@gmail.com>

Closes #13298 from lfzCarlosC/master.
2016-05-25 10:53:57 -07:00
Jeff Zhang 01e7b9c85b [SPARK-15345][SQL][PYSPARK] SparkSession's conf doesn't take effect when there is already an existing SparkContext
## What changes were proposed in this pull request?

Override the existing SparkContext if the provided SparkConf is different. The PySpark part hasn't been fixed yet; I will do that after the first round of review to ensure this is the correct approach.

## How was this patch tested?

Manually verify it in spark-shell.

rxin  Please help review it, I think this is a very critical issue for spark 2.0

Author: Jeff Zhang <zjffdu@apache.org>

Closes #13160 from zjffdu/SPARK-15345.
2016-05-25 10:46:51 -07:00
Reynold Xin 4f27b8dd58 [SPARK-15436][SQL] Remove DescribeFunction and ShowFunctions
## What changes were proposed in this pull request?
This patch removes the last two commands defined in the catalyst module: DescribeFunction and ShowFunctions. They were unnecessary since the parser could just generate DescribeFunctionCommand and ShowFunctionsCommand directly.

## How was this patch tested?
Created a new SparkSqlParserSuite.

Author: Reynold Xin <rxin@databricks.com>

Closes #13292 from rxin/SPARK-15436.
2016-05-25 19:17:53 +02:00
Wenchen Fan 50b660d725 [SPARK-15498][TESTS] fix slow tests
## What changes were proposed in this pull request?

This PR fixes 3 slow tests:

1. `ParquetQuerySuite.read/write wide table`: This is not a good unit test as it runs for more than 5 minutes. This PR removes it and adds a new regression test in `CodeGenerationSuite`, which is more of a "unit" test.
2. `ParquetQuerySuite.returning batch for wide table`: reduce the threshold and use a smaller data size.
3. `DatasetSuite.SPARK-14554: Dataset.map may generate wrong java code for wide table`: Improving `CodeFormatter.format` (introduced at https://github.com/apache/spark/pull/12979) dramatically speeds it up.

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13273 from cloud-fan/test.
2016-05-24 21:23:39 -07:00
Parth Brahmbhatt 4acababcab [SPARK-15365][SQL] When table size statistics are not available from metastore, we should fallback to HDFS
## What changes were proposed in this pull request?
Currently, if a table is used in a join operation, we rely on the size returned by the metastore to decide whether we can convert the operation to a broadcast join. This optimization only kicks in for tables that have statistics available in the metastore. Hive generally falls back to HDFS when the statistics are not available directly from the metastore, and this seems like a reasonable choice to adopt given the optimization benefit of broadcast joins (see the sketch below).
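
A hedged sketch of the fallback (names invented): prefer the metastore statistic and otherwise sum the bytes under the table's location on HDFS:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

def tableSizeInBytes(metastoreSize: Option[Long], fs: FileSystem, location: Path): Long =
  metastoreSize.getOrElse(fs.getContentSummary(location).getLength)
```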

## How was this patch tested?
I have executed queries locally to test.

Author: Parth Brahmbhatt <pbrahmbhatt@netflix.com>

Closes #13150 from Parth-Brahmbhatt/SPARK-15365.
2016-05-24 20:58:20 -07:00
Dongjoon Hyun f08bf587b1 [SPARK-15512][CORE] repartition(0) should raise IllegalArgumentException
## What changes were proposed in this pull request?

Previously, SPARK-8893 added constraints requiring a positive number of partitions for repartition/coalesce operations in general. This PR adds one missing piece of that and adds two explicit test cases.

**Before**
```scala
scala> sc.parallelize(1 to 5).coalesce(0)
java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
...
scala> sc.parallelize(1 to 5).repartition(0).collect()
res1: Array[Int] = Array()   // empty
scala> spark.sql("select 1").coalesce(0)
res2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [1: int]
scala> spark.sql("select 1").coalesce(0).collect()
java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
scala> spark.sql("select 1").repartition(0)
res3: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [1: int]
scala> spark.sql("select 1").repartition(0).collect()
res4: Array[org.apache.spark.sql.Row] = Array()  // empty
```

**After**
```scala
scala> sc.parallelize(1 to 5).coalesce(0)
java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
...
scala> sc.parallelize(1 to 5).repartition(0)
java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
...
scala> spark.sql("select 1").coalesce(0)
java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
...
scala> spark.sql("select 1").repartition(0)
java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
...
```

## How was this patch tested?

Pass the Jenkins tests with new testcases.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13282 from dongjoon-hyun/SPARK-15512.
2016-05-24 18:55:23 -07:00