### What changes were proposed in this pull request?
PostgreSQL and Oracle both support using CUBE/ROLLUP/GROUPING SETS inside a grouping set of GROUPING SETS as syntactic sugar.
![image](https://user-images.githubusercontent.com/46485123/114975588-139a1180-9eb7-11eb-8f53-498c1db934e0.png)
In this PR, we support it in Spark SQL too.
### Why are the changes needed?
Keep consistent with PostgreSQL and Oracle.
### Does this PR introduce _any_ user-facing change?
Users can write grouping analytics like:
```
SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(ROLLUP(a, b));
SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS((a, b), (a), ());
SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(GROUPING SETS((a, b), (a), ()));
```
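The three queries above are meant to describe the same grouping sets inside `GROUPING SETS`. A minimal Python sketch of that flattening follows; the helper names `rollup` and `grouping_sets` are hypothetical illustrations, not Spark APIs, and grouping sets are modeled as plain tuples of column names.

```python
def rollup(*cols):
    """ROLLUP(c1, ..., cn): every prefix of the columns, longest first."""
    return [tuple(cols[:i]) for i in range(len(cols), -1, -1)]


def grouping_sets(*items):
    """Flatten GROUPING SETS whose items are plain sets or nested constructs.

    A plain grouping set is a tuple; a nested ROLLUP/GROUPING SETS is
    modeled as the list of tuples it expands to (an assumption of this sketch).
    """
    flat = []
    for item in items:
        if isinstance(item, list):
            flat.extend(item)   # nested construct: splice its sets in place
        else:
            flat.append(item)   # plain grouping set such as ("a", "b")
    return flat


explicit = grouping_sets(("a", "b"), ("a",), ())
via_rollup = grouping_sets(rollup("a", "b"))
via_nested = grouping_sets(grouping_sets(("a", "b"), ("a",), ()))
print(explicit)
```

All three variables hold the same list, which mirrors the sugar described in this PR.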
### How was this patch tested?
Added tests.
Closes #32201 from AngersZhuuuu/SPARK-35026.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Support GROUP BY with both separate columns and CUBE/ROLLUP.
PostgreSQL supports:
```
select a, b, c, count(1) from t group by a, b, cube (a, b, c);
select a, b, c, count(1) from t group by a, b, rollup(a, b, c);
select a, b, c, count(1) from t group by cube(a, b), rollup (a, b, c);
select a, b, c, count(1) from t group by a, b, grouping sets((a, b), (a), ());
```
In this PR, we have done two things:
1. Support partial grouping analytics such as `group by a, cube(a, b)`
2. Support mixed grouping analytics such as `group by cube(a, b), rollup(b,c)`
*Partial Groupings*
Partial groupings means that the GROUP BY clause contains both `group_expression` and
`CUBE|ROLLUP|GROUPING SETS`. For example:
`GROUP BY warehouse, CUBE(product, location)` is equivalent to
`GROUP BY GROUPING SETS((warehouse, product, location), (warehouse, product), (warehouse, location), (warehouse))`.
`GROUP BY warehouse, ROLLUP(product, location)` is equivalent to
`GROUP BY GROUPING SETS((warehouse, product, location), (warehouse, product), (warehouse))`.
`GROUP BY warehouse, GROUPING SETS((product, location), (product), ())` is equivalent to
`GROUP BY GROUPING SETS((warehouse, product, location), (warehouse, product), (warehouse))`.
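The CUBE and ROLLUP equivalences above can be checked with a small Python sketch. The helpers `cube`, `rollup`, and `partial` are hypothetical names used only for illustration; they model the expansion described here, not Spark's analyzer code.

```python
from itertools import combinations


def cube(*cols):
    """CUBE(c1, ..., cn): all subsets of the columns, largest first."""
    return [c for r in range(len(cols), -1, -1) for c in combinations(cols, r)]


def rollup(*cols):
    """ROLLUP(c1, ..., cn): every prefix of the columns, longest first."""
    return [tuple(cols[:i]) for i in range(len(cols), -1, -1)]


def partial(prefix, sets):
    """Prepend the plain GROUP BY expressions to every grouping set."""
    return [tuple(prefix) + s for s in sets]


# GROUP BY warehouse, CUBE(product, location)
cube_sets = partial(["warehouse"], cube("product", "location"))
# GROUP BY warehouse, ROLLUP(product, location)
rollup_sets = partial(["warehouse"], rollup("product", "location"))
print(cube_sets)
print(rollup_sets)
```

In this model the plain expression simply appears in every resulting grouping set, which is why no set without `warehouse` is produced.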
*Concatenated Groupings*
Concatenated groupings offer a concise way to generate useful combinations of groupings. Groupings specified
with concatenated groupings yield the cross-product of groupings from each grouping set. The cross-product
operation enables even a small number of concatenated groupings to generate a large number of final groups.
The concatenated groupings are specified simply by listing multiple `GROUPING SETS`, `CUBE`, and `ROLLUP`,
and separating them with commas. For example:
`GROUP BY GROUPING SETS((warehouse), (product)), GROUPING SETS((location), (size))` is equivalent to
`GROUP BY GROUPING SETS((warehouse, location), (warehouse, size), (product, location), (product, size))`.
`GROUP BY CUBE((warehouse), (product)), ROLLUP((location), (size))` is equivalent to
`GROUP BY GROUPING SETS((warehouse, product), (warehouse), (product), ()), GROUPING SETS((location, size), (location), ())`, which is in turn equivalent to
`GROUP BY GROUPING SETS(
(warehouse, product, location, size), (warehouse, product, location), (warehouse, product),
(warehouse, location, size), (warehouse, location), (warehouse),
(product, location, size), (product, location), (product),
(location, size), (location), ())`.
`GROUP BY order, CUBE((warehouse), (product)), ROLLUP((location), (size))` is equivalent to
`GROUP BY order, GROUPING SETS((warehouse, product), (warehouse), (product), ()), GROUPING SETS((location, size), (location), ())`, which is in turn equivalent to
`GROUP BY GROUPING SETS(
(order, warehouse, product, location, size), (order, warehouse, product, location), (order, warehouse, product),
(order, warehouse, location, size), (order, warehouse, location), (order, warehouse),
(order, product, location, size), (order, product, location), (order, product),
(order, location, size), (order, location), (order))`.
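The cross-product rule for concatenated groupings can be sketched in Python with `itertools.product`. The function name `concat` is hypothetical, and grouping sets are again modeled as tuples of column names; this is an illustration of the rule, not Spark's implementation.

```python
from itertools import product


def concat(*groupings):
    """Cross-product of several lists of grouping sets.

    Each result joins one grouping set from every input list, in order.
    """
    return [sum(combo, ()) for combo in product(*groupings)]


# CUBE((warehouse), (product)) and ROLLUP((location), (size)), already expanded
cube_sets = [("warehouse", "product"), ("warehouse",), ("product",), ()]
rollup_sets = [("location", "size"), ("location",), ()]

combined = concat(cube_sets, rollup_sets)
with_order = concat([("order",)], cube_sets, rollup_sets)
print(len(combined), len(with_order))
```

Four sets crossed with three sets give twelve grouping sets; treating the plain `order` expression as a single one-element grouping keeps the count at twelve and prefixes every set with `order`.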
### Why are the changes needed?
Support more flexible grouping analytics
### Does this PR introduce _any_ user-facing change?
Users can write SQL like:
```
select a, b, c, agg_expr() from table group by a, cube(b, c)
```
### How was this patch tested?
Added UTs.
Closes #30144 from AngersZhuuuu/SPARK-33229.
Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
GROUP BY ... GROUPING SETS (...) is a weird SQL syntax we copied from Hive. It's not in the SQL standard or any other mainstream database. This syntax requires users to repeat the expressions inside `GROUPING SETS (...)` after `GROUP BY`, and has weird null semantics if `GROUP BY` contains expressions that are not in `GROUPING SETS (...)`.
This PR deprecates this syntax:
1. Do not promote it in the document and only mention it as a Hive-compatible syntax.
2. Simplify the code to only keep it for Hive compatibility.
### Why are the changes needed?
Deprecate a weird grammar.
### Does this PR introduce _any_ user-facing change?
No breaking change, but it removes a check to simplify the code: `GROUP BY a GROUPING SETS(a, b)` used to fail and forced users to also put `b` after `GROUP BY`. Now this works just like `GROUP BY GROUPING SETS(a, b)`.
### How was this patch tested?
Existing tests.
Closes #32022 from cloud-fan/followup.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
As discussed in
https://github.com/apache/spark/pull/30145#discussion_r514728642 and
https://github.com/apache/spark/pull/30145#discussion_r514734648,
we need to rewrite the current Grouping Analytics grammar to be as flexible as PostgreSQL's, to support subsequent development.
PostgreSQL supports:
```
select a, b, c, count(1) from t group by cube (a, b, c);
select a, b, c, count(1) from t group by cube(a, b, c);
select a, b, c, count(1) from t group by cube (a, b, c, (a, b), (a, b, c));
select a, b, c, count(1) from t group by rollup(a, b, c);
select a, b, c, count(1) from t group by rollup (a, b, c);
select a, b, c, count(1) from t group by rollup (a, b, c, (a, b), (a, b, c));
```
In this PR, we have done the three things below, and we will split them into separate PRs:
- Refactor CUBE/ROLLUP (regarding them as ANTLR tokens in a parser)
- Refactor GROUPING SETS (the logical node -> a new expr)
- Support new syntax for CUBE/ROLLUP (e.g., GROUP BY CUBE ((a, b), (a, c)))
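The new `CUBE ((a, b), (a, c))` form treats each parenthesized group as one unit that is kept or dropped as a whole. A minimal Python sketch of that expansion follows; `cube_units` is a hypothetical name, and counting a column that appears in two units only once is an assumption of this sketch rather than something stated in the PR.

```python
from itertools import combinations


def cube_units(*units):
    """CUBE over units; each unit is a tuple of columns kept or dropped together."""
    sets = []
    for r in range(len(units), -1, -1):
        for chosen in combinations(units, r):
            cols = []
            for unit in chosen:
                for col in unit:
                    if col not in cols:   # assumption: a repeated column counts once
                        cols.append(col)
            sets.append(tuple(cols))
    return sets


# GROUP BY CUBE ((a, b), (a, c))
expanded = cube_units(("a", "b"), ("a", "c"))
print(expanded)
```

With two units the cube yields four grouping sets instead of the eight that `CUBE (a, b, c)` would produce over three separate columns.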
### Why are the changes needed?
Rewrite the current Grouping Analytics grammar to be as flexible as PostgreSQL's, to support subsequent development.
### Does this PR introduce _any_ user-facing change?
Users can write Grouping Analytics queries with a grammar as flexible as PostgreSQL's.
### How was this patch tested?
Added UTs.
Closes #30212 from AngersZhuuuu/refact-grouping-analytics.
Lead-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR intends to fix typos in the sub-modules:
* `bin`
* `core`
* `docs`
* `external`
* `mllib`
* `repl`
* `pom.xml`
Split per srowen https://github.com/apache/spark/pull/30323#issuecomment-728981618
NOTE: The misspellings have been reported at 706a726f87 (commitcomment-44064356)
### Why are the changes needed?
Misspelled words make it harder to read / understand content.
### Does this PR introduce _any_ user-facing change?
There are various fixes to documentation, etc...
### How was this patch tested?
No testing was performed
Closes #30530 from jsoref/spelling-bin-core-docs-external-mllib-repl.
Authored-by: Josh Soref <jsoref@users.noreply.github.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Fix typos in docs, log messages, and comments.
### Why are the changes needed?
Typo fixes to improve readability.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
manual test has been performed to test the updated
Closes #29443 from brandonJY/spell-fix-doc.
Authored-by: Brandon Jiang <Brandon.jiang.a@outlook.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Remove the unneeded embedded inline HTML markup by using the basic markdown syntax.
Please see #28414
### Why are the changes needed?
Make the doc cleaner and easily editable by MD editors.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manually build and check
Closes #28451 from huaxingao/html_cleanup.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR intends to clean up the SQL documents in `doc/sql-ref*`.
The main changes are as follows:
- Fixes wrong syntax and capitalizes sub-titles
- Adds some DDL queries in `Examples` so that users can run examples there
- Makes query output in `Examples` follow the `Dataset.showString` (right-aligned) format
- Adds/removes spaces, indents, or blank lines to follow the format below:
```
---
license...
---
### Description
Writes what the syntax is.
### Syntax
{% highlight sql %}
SELECT...
    WHERE... // 4 indents after the second line
...
{% endhighlight %}
### Parameters
<dl>
<dt><code><em>Param Name</em></code></dt>
<dd>
Param Description
</dd>
...
</dl>
### Examples
{% highlight sql %}
-- It is better that users are able to execute example queries here.
-- So, we prepare test data in the first section if possible.
CREATE TABLE t (key STRING, value DOUBLE);
INSERT INTO t VALUES
    ('a', 1.0), ('a', 2.0), ('b', 3.0), ('c', 4.0);
-- query output has 2 indents and it follows the `Dataset.showString`
-- format (right-aligned).
SELECT * FROM t;
  +---+-----+
  |key|value|
  +---+-----+
  |  a|  1.0|
  |  a|  2.0|
  |  b|  3.0|
  |  c|  4.0|
  +---+-----+
-- Query statements after the second line have 4 indents.
SELECT key, SUM(value)
    FROM t
    GROUP BY key;
  +---+----------+
  |key|sum(value)|
  +---+----------+
  |  c|       4.0|
  |  b|       3.0|
  |  a|       3.0|
  +---+----------+
...
{% endhighlight %}
### Related Statements
* [XXX](xxx.html)
* ...
```
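To make the "right-aligned" convention concrete, here is a small Python sketch that renders rows in the same visual style as the template. `show_string` is a hypothetical helper written for this illustration; it only mimics the layout and does not reproduce other behavior of Spark's `Dataset.showString`, such as truncation.

```python
def show_string(header, rows):
    """Render a table with right-aligned cells and +---+ borders."""
    cells = [[str(c) for c in header]] + [[str(c) for c in row] for row in rows]
    # Each column is as wide as its longest cell, header included.
    widths = [max(len(r[i]) for r in cells) for i in range(len(header))]
    sep = "+" + "+".join("-" * w for w in widths) + "+"

    def fmt(row):
        return "|" + "|".join(c.rjust(w) for c, w in zip(row, widths)) + "|"

    lines = [sep, fmt(cells[0]), sep] + [fmt(r) for r in cells[1:]] + [sep]
    return "\n".join(lines)


table = show_string(["key", "value"], [("a", 1.0), ("b", 3.0)])
print(table)
```

Padding on the left is what distinguishes this format from the left-aligned output that some of the docs used before this cleanup.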
### Why are the changes needed?
Most of the changes in this PR are minor, but I think consistent formats and rules for writing documents are important for long-term maintenance in our community.
### Does this PR introduce any user-facing change?
Yes.
### How was this patch tested?
Manually checked.
Closes #28151 from maropu/MakeRightAligned.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR intends to improve the SQL document of `GROUP BY`; it adds a description of the FILTER clause of aggregate functions.
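As a rough illustration of what a FILTER clause on an aggregate means, the Python sketch below emulates `sum(quantity) FILTER (WHERE ...)` next to an unfiltered `sum(quantity)` per group. The function name, the column names, and the sample rows are all hypothetical and exist only for this example.

```python
from collections import defaultdict


def grouped_filtered_sum(rows, key, value, predicate):
    """Per group, return (filtered sum, plain sum).

    Only rows passing the predicate feed the filtered aggregate; a group
    with no passing rows gets None, mirroring SQL NULL.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    result = {}
    for k, members in groups.items():
        kept = [m[value] for m in members if predicate(m)]
        total = sum(m[value] for m in members)
        result[k] = (sum(kept) if kept else None, total)
    return result


rows = [
    {"id": 100, "car": "Honda Civic", "quantity": 10},
    {"id": 100, "car": "Honda Accord", "quantity": 15},
    {"id": 200, "car": "Ford Focus", "quantity": 20},
]
sums = grouped_filtered_sum(rows, "id", "quantity", lambda r: r["car"].startswith("Honda"))
print(sums)
```

The filter narrows the input of one aggregate only; the grouping itself and the other aggregates still see every row.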
### Why are the changes needed?
To improve the SQL documents
### Does this PR introduce any user-facing change?
Yes.
<img src="https://user-images.githubusercontent.com/692303/78558612-e2234a80-784d-11ea-9353-b3feac4d57a7.png" width="500">
### How was this patch tested?
Manually checked.
Closes #28134 from maropu/SPARK-31358.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
A few improvements to the SQL reference SELECT doc:
1. correct the syntax of SELECT query
2. correct the default of null sort order
3. correct the GROUP BY syntax
4. several minor fixes
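For context on item 2, Spark's documented default, as far as I know, is that NULLs sort first for ascending order and last for descending order when no null ordering is given. The Python sketch below models that rule with `None` standing in for NULL; `order_by` is a hypothetical helper, not a Spark API.

```python
def order_by(values, descending=False):
    """Sort values with NULLS FIRST for ASC and NULLS LAST for DESC."""
    nulls = [v for v in values if v is None]
    rest = sorted((v for v in values if v is not None), reverse=descending)
    # Ascending: nulls lead the result; descending: nulls trail it.
    return rest + nulls if descending else nulls + rest


ascending = order_by([3, None, 1])
descending = order_by([3, None, 1], descending=True)
print(ascending, descending)
```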
### Why are the changes needed?
refine document
### Does this PR introduce any user-facing change?
N/A
### How was this patch tested?
N/A
Closes #27866 from cloud-fan/doc.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
- Set up links between related sections.
- Add "Related sections" for each section.
- Change the left-hand side menu to reflect the current status of the doc.
- Other minor cleanups.
### Why are the changes needed?
Currently Spark lacks documentation on the supported SQL constructs, causing
confusion among users, who sometimes have to look at the code to understand the
usage. This PR aims to address that.
### Does this PR introduce any user-facing change?
Yes.
### How was this patch tested?
Tested using jekyll build --serve
Closes #27371 from dilipbiswal/select_finalization.
Authored-by: Dilip Biswal <dkbiswal@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Fix a few super nit problems
### Why are the changes needed?
To make doc look better
### Does this PR introduce any user-facing change?
Yes
### How was this patch tested?
Tested using jekyll build --serve
Closes #27332 from huaxingao/spark-30575-followup.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
## What changes were proposed in this pull request?
This is an initial PR that creates the table of contents for the SQL reference guide. The left sidebar will display additional menu items corresponding to supported SQL constructs. Once this PR is merged, we will fill in the content incrementally. Additionally, this PR contains a minor change to make the left sidebar scrollable. Currently it is not possible to scroll in the left-hand side window.
## How was this patch tested?
Used jekyll build and serve to verify.
Closes #25459 from dilipbiswal/ref-doc.
Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>