Conversion of outer joins, if the predicates in filter conditions can restrict the result sets so that all null-supplying rows are eliminated.
- `full outer` -> `inner` if both sides have such predicates
- `left outer` -> `inner` if the right side has such predicates
- `right outer` -> `inner` if the left side has such predicates
- `full outer` -> `left outer` if only the left side has such predicates
- `full outer` -> `right outer` if only the right side has such predicates
If applicable, this can greatly improve the performance, since outer join is much slower than inner join, full outer join is much slower than left/right outer join.
The original PR is https://github.com/apache/spark/pull/10542
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#10567 from gatorsmile/outerJoinEliminationByFilterCond.
This patch expose `maxCharactersPerColumn` and `maxColumns` to user in CSV data source.
Author: Hossein <hossein@databricks.com>
Closes#11147 from falaki/SPARK-13261.
Fixes error `org.postgresql.util.PSQLException: Unable to find server array type for provided name decimal(38,18)`.
* Passes scale metadata to JDBC dialect for usage in type conversions.
* Removes unused length/scale/precision parameters from `createArrayOf` parameter `typeName` (for writing).
* Adds configurable precision and scale to Postgres `DecimalType` (for reading).
* Adds a new kind of test that verifies the schema written by `DataFrame.write.jdbc`.
Author: Brandon Bradley <bradleytastic@gmail.com>
Closes#10928 from blbradley/spark-12966.
`rand` and `randn` functions with a `seed` argument are commonly used. Based on the common sense, the results of `rand` and `randn` should be deterministic if the `seed` parameter value is provided. For example, in MS SQL Server, it also has a function `rand`. Regarding the parameter `seed`, the description is like: ```Seed is an integer expression (tinyint, smallint, or int) that gives the seed value. If seed is not specified, the SQL Server Database Engine assigns a seed value at random. For a specified seed value, the result returned is always the same.```
Update: the current implementation is unable to generate deterministic results when the partitions are not fixed. This PR documents this issue in the function descriptions.
jkbradley hit an issue and provided an example in the following JIRA: https://issues.apache.org/jira/browse/SPARK-13333
Author: gatorsmile <gatorsmile@gmail.com>
Closes#11232 from gatorsmile/randSeed.
This PR support codegen for broadcast outer join.
In order to reduce the duplicated codes, this PR merge HashJoin and HashOuterJoin together (also BroadcastHashJoin and BroadcastHashOuterJoin).
Author: Davies Liu <davies@databricks.com>
Closes#11130 from davies/gen_out.
Add local ivy repo to the SBT build file to fix this.
Scaladoc compile error is fixed.
Author: jerryshao <sshao@hortonworks.com>
Closes#11001 from jerryshao/SPARK-13109.
`TakeOrderedAndProjectNode` should use generated projection and ordering like other `LocalNode`s.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#11230 from ueshin/issues/SPARK-13357.
Add `LazilyGenerateOrdering` to support generated ordering for `RangePartitioner` of `Exchange` instead of `InterpretedOrdering`.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#10894 from ueshin/issues/SPARK-12976.
Using GroupingSets will generate a wrong result when Aggregate Functions containing GroupBy columns.
This PR is to fix it. Since the code changes are very small. Maybe we also can merge it to 1.6
For example, the following query returns a wrong result:
```scala
sql("select course, sum(earnings) as sum from courseSales group by course, earnings" +
" grouping sets((), (course), (course, earnings))" +
" order by course, sum").show()
```
Before the fix, the results are like
```
[null,null]
[Java,null]
[Java,20000.0]
[Java,30000.0]
[dotNET,null]
[dotNET,5000.0]
[dotNET,10000.0]
[dotNET,48000.0]
```
After the fix, the results become correct:
```
[null,113000.0]
[Java,20000.0]
[Java,30000.0]
[Java,50000.0]
[dotNET,5000.0]
[dotNET,10000.0]
[dotNET,48000.0]
[dotNET,63000.0]
```
UPDATE: This PR also deprecated the external column: GROUPING__ID.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#11100 from gatorsmile/groupingSets.
This patch adds a new optimizer rule for performing limit pushdown. Limits will now be pushed down in two cases:
- If a limit is on top of a `UNION ALL` operator, then a partition-local limit operator will be pushed to each of the union operator's children.
- If a limit is on top of an `OUTER JOIN` then a partition-local limit will be pushed to one side of the join. For `LEFT OUTER` and `RIGHT OUTER` joins, the limit will be pushed to the left and right side, respectively. For `FULL OUTER` join, we will only push limits when at most one of the inputs is already limited: if one input is limited we will push a smaller limit on top of it and if neither input is limited then we will limit the input which is estimated to be larger.
These optimizations were proposed previously by gatorsmile in #10451 and #10454, but those earlier PRs were closed and deferred for later because at that time Spark's physical `Limit` operator would trigger a full shuffle to perform global limits so there was a chance that pushdowns could actually harm performance by causing additional shuffles/stages. In #7334, we split the `Limit` operator into separate `LocalLimit` and `GlobalLimit` operators, so we can now push down only local limits (which don't require extra shuffles). This patch is based on both of gatorsmile's patches, with changes and simplifications due to partition-local-limiting.
When we push down the limit, we still keep the original limit in place, so we need a mechanism to ensure that the optimizer rule doesn't keep pattern-matching once the limit has been pushed down. In order to handle this, this patch adds a `maxRows` method to `SparkPlan` which returns the maximum number of rows that the plan can compute, then defines the pushdown rules to only push limits to children if the children's maxRows are greater than the limit's maxRows. This idea is carried over from #10451; see that patch for additional discussion.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#11121 from JoshRosen/limit-pushdown-2.
This pull request has the following changes:
1. Moved UserDefinedFunction into expressions package. This is more consistent with how we structure the packages for window functions and UDAFs.
2. Moved UserDefinedPythonFunction into execution.python package, so we don't have a random private class in the top level sql package.
3. Move everything in execution/python.scala into the newly created execution.python package.
Most of the diffs are just straight copy-paste.
Author: Reynold Xin <rxin@databricks.com>
Closes#11181 from rxin/SPARK-13296.
Expand suffer from create the UnsafeRow from same input multiple times, with codegen, it only need to copy some of the columns.
After this, we can see 3X improvements (from 43 seconds to 13 seconds) on a TPCDS query (Q67) that have eight columns in Rollup.
Ideally, we could mask some of the columns based on bitmask, I'd leave that in the future, because currently Aggregation (50 ns) is much slower than that just copy the variables (1-2 ns).
Author: Davies Liu <davies@databricks.com>
Closes#11177 from davies/gen_expand.
https://issues.apache.org/jira/browse/SPARK-13260
This is a quicky fix for `count(*)`.
When the `requiredColumns` is empty, currently it returns `sqlContext.sparkContext.emptyRDD[Row]` which does not have the count.
Just like JSON datasource, this PR lets the CSV datasource count the rows but do not parse each set of tokens.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#11169 from HyukjinKwon/SPARK-13260.
Add the table name validation at the temp table creation
Author: jayadevanmurali <jayadevan.m@tcs.com>
Closes#11051 from jayadevanmurali/branch-0.2-SPARK-12982.
For lots of SQL operators, we have metrics for both of input and output, the number of input rows should be exactly the number of output rows of child, we could only have metrics for output rows.
After we improved the performance using whole stage codegen, the overhead of SQL metrics are not trivial anymore, we should avoid that if it's not necessary.
This PR remove all the SQL metrics for number of input rows, add SQL metric of number of output rows for all LeafNode. All remove the SQL metrics from those operators that have the same number of rows from input and output (for example, Projection, we may don't need that).
The new SQL UI will looks like:
![metrics](https://cloud.githubusercontent.com/assets/40902/12965227/63614e5e-d009-11e5-88b3-84fea04f9c20.png)
Author: Davies Liu <davies@databricks.com>
Closes#11163 from davies/remove_metrics.
Grouping() returns a column is aggregated or not, grouping_id() returns the aggregation levels.
grouping()/grouping_id() could be used with window function, but does not work in having/sort clause, will be fixed by another PR.
The GROUPING__ID/grouping_id() in Hive is wrong (according to docs), we also did it wrongly, this PR change that to match the behavior in most databases (also the docs of Hive).
Author: Davies Liu <davies@databricks.com>
Closes#10677 from davies/grouping.
This PR addresses two issues:
- Self join does not work in SQL Generation
- When creating new instances for `LogicalRelation`, `metastoreTableIdentifier` is lost.
liancheng Could you please review the code changes? Thank you!
Author: gatorsmile <gatorsmile@gmail.com>
Closes#11084 from gatorsmile/selfJoinInSQLGen.
Some analysis rules generate aliases or auxiliary attribute references with the same name but different expression IDs. For example, `ResolveAggregateFunctions` introduces `havingCondition` and `aggOrder`, and `DistinctAggregationRewriter` introduces `gid`.
This is OK for normal query execution since these attribute references get expression IDs. However, it's troublesome when converting resolved query plans back to SQL query strings since expression IDs are erased.
Here's an example Spark 1.6.0 snippet for illustration:
```scala
sqlContext.range(10).select('id as 'a, 'id as 'b).registerTempTable("t")
sqlContext.sql("SELECT SUM(a) FROM t GROUP BY a, b ORDER BY COUNT(a), COUNT(b)").explain(true)
```
The above code produces the following resolved plan:
```
== Analyzed Logical Plan ==
_c0: bigint
Project [_c0#101L]
+- Sort [aggOrder#102L ASC,aggOrder#103L ASC], true
+- Aggregate [a#47L,b#48L], [(sum(a#47L),mode=Complete,isDistinct=false) AS _c0#101L,(count(a#47L),mode=Complete,isDistinct=false) AS aggOrder#102L,(count(b#48L),mode=Complete,isDistinct=false) AS aggOrder#103L]
+- Subquery t
+- Project [id#46L AS a#47L,id#46L AS b#48L]
+- LogicalRDD [id#46L], MapPartitionsRDD[44] at range at <console>:26
```
Here we can see that both aggregate expressions in `ORDER BY` are extracted into an `Aggregate` operator, and both of them are named `aggOrder` with different expression IDs.
The solution is to automatically add the expression IDs into the attribute name for the Alias and AttributeReferences that are generated by Analyzer in SQL Generation.
In this PR, it also resolves another issue. Users could use the same name as the internally generated names. The duplicate names should not cause name ambiguity. When resolving the column, Catalyst should not pick the column that is internally generated.
Could you review the solution? marmbrus liancheng
I did not set the newly added flag for all the alias and attribute reference generated by Analyzers. Please let me know if I should do it? Thank you!
Author: gatorsmile <gatorsmile@gmail.com>
Closes#11050 from gatorsmile/namingConflicts.
Update Aggregator links to point to #org.apache.spark.sql.expressions.Aggregator
Author: raela <raela@databricks.com>
Closes#11158 from raelawang/master.
### Management API for Continuous Queries
**API for getting status of each query**
- Whether active or not
- Unique name of each query
- Status of the sources and sinks
- Exceptions
**API for managing each query**
- Immediately stop an active query
- Waiting for a query to be terminated, correctly or with error
**API for managing multiple queries**
- Listing all active queries
- Getting an active query by name
- Waiting for any one of the active queries to be terminated
**API for listening to query life cycle events**
- ContinuousQueryListener API for query start, progress and termination events.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#11030 from tdas/streaming-df-management-api.
This pr adds benchmark codes for in-memory cache compression to make future developments and discussions more smooth.
Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
Closes#10965 from maropu/ImproveColumnarCache.
The patch for SPARK-8964 ("use Exchange to perform shuffle in Limit" / #7334) inadvertently broke the planning of the TakeOrderedAndProject operator: because ReturnAnswer was the new root of the query plan, the TakeOrderedAndProject rule was unable to match before BasicOperators.
This patch fixes this by moving the `TakeOrderedAndCollect` and `CollectLimit` rules into the same strategy.
In addition, I made changes to the TakeOrderedAndProject operator in order to make its `doExecute()` method lazy and added a new TakeOrderedAndProjectSuite which tests the new code path.
/cc davies and marmbrus for review.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#11145 from JoshRosen/take-ordered-and-project-fix.
`FileStreamSource` is an implementation of `org.apache.spark.sql.execution.streaming.Source`. It takes advantage of the existing `HadoopFsRelationProvider` to support various file formats. It remembers files in each batch and stores it into the metadata files so as to recover them when restarting. The metadata files are stored in the file system. There will be a further PR to clean up the metadata files periodically.
This is based on the initial work from marmbrus.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#11034 from zsxwing/stream-df-file-source.
This PR improve the lookup of BytesToBytesMap by:
1. Generate code for calculate the hash code of grouping keys.
2. Do not use MemoryLocation, fetch the baseObject and offset for key and value directly (remove the indirection).
Author: Davies Liu <davies@databricks.com>
Closes#11010 from davies/gen_map.
WIP: running tests. Code needs a bit of clean up.
This patch completes the vectorized decoding with the goal of passing the existing
tests. There is still more patches to support the rest of the format spec, even
just for flat schemas.
This patch adds a new flag to enable the vectorized decoding. Tests were updated
to try with both modes where applicable.
Once this is working well, we can remove the previous code path.
Author: Nong Li <nong@databricks.com>
Closes#11055 from nongli/spark-12992-2.
This PR improve the performance for Broadcast join with dimension tables, which is common in data warehouse.
If the join key can fit in a long, we will use a special api `get(Long)` to get the rows from HashedRelation.
If the HashedRelation only have unique keys, we will use a special api `getValue(Long)` or `getValue(InternalRow)`.
If the keys can fit within a long, also the keys are dense, we will use a array of UnsafeRow, instead a hash map.
TODO: will do cleanup
Author: Davies Liu <davies@databricks.com>
Closes#11065 from davies/gen_dim.
nullability should only be considered as an optimization rather than part of the type system, so instead of failing analysis for mismatch nullability, we should pass analysis and add runtime null check.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11035 from cloud-fan/ignore-nullability.
This patch changes the implementation of the physical `Limit` operator so that it relies on the `Exchange` operator to perform data movement rather than directly using `ShuffledRDD`. In addition to improving efficiency, this lays the necessary groundwork for further optimization of limit, such as limit pushdown or whole-stage codegen.
At a high-level, this replaces the old physical `Limit` operator with two new operators, `LocalLimit` and `GlobalLimit`. `LocalLimit` performs per-partition limits, while `GlobalLimit` applies the final limit to a single partition; `GlobalLimit`'s declares that its `requiredInputDistribution` is `SinglePartition`, which will cause the planner to use an `Exchange` to perform the appropriate shuffles. Thus, a logical `Limit` appearing in the middle of a query plan will be expanded into `LocalLimit -> Exchange to one partition -> GlobalLimit`.
In the old code, calling `someDataFrame.limit(100).collect()` or `someDataFrame.take(100)` would actually skip the shuffle and use a fast-path which used `executeTake()` in order to avoid computing all partitions in case only a small number of rows were requested. This patch preserves this optimization by treating logical `Limit` operators specially when they appear as the terminal operator in a query plan: if a `Limit` is the final operator, then we will plan a special `CollectLimit` physical operator which implements the old `take()`-based logic.
In order to be able to match on operators only at the root of the query plan, this patch introduces a special `ReturnAnswer` logical operator which functions similar to `BroadcastHint`: this dummy operator is inserted at the root of the optimized logical plan before invoking the physical planner, allowing the planner to pattern-match on it.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#7334 from JoshRosen/remove-copy-in-limit.
rxin srowen
I work out note message for rdd.take function, please help to review.
If it's fine, I can apply to all other function later.
Author: Tommy YU <tummyyu@163.com>
Closes#10874 from Wenpei/spark-5865-add-warning-for-localdatastructure.
Trivial search-and-replace to eliminate deprecation warnings in Scala 2.11.
Also works with 2.10
Author: Jakob Odersky <jakob@odersky.com>
Closes#11085 from jodersky/SPARK-13171.
Since we remove the configuration for codegen, we are heavily reply on codegen (also TungstenAggregate require the generated MutableProjection to update UnsafeRow), should remove the fallback, which could make user confusing, see the discussion in SPARK-13116.
Author: Davies Liu <davies@databricks.com>
Closes#11097 from davies/remove_fallback.
https://issues.apache.org/jira/browse/SPARK-12939
Now we will catch `ObjectOperator` in `Analyzer` and resolve the `fromRowExpression/deserializer` inside it. Also update the `MapGroups` and `CoGroup` to pass in `dataAttributes`, so that we can correctly resolve value deserializer(the `child.output` contains both groupking key and values, which may mess things up if they have same-name attribtues). End-to-end tests are added.
follow-ups:
* remove encoders from typed aggregate expression.
* completely remove resolve/bind in `ExpressionEncoder`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#10852 from cloud-fan/bug.
This patch adds option function for boolean, long, and double types. This makes it slightly easier for Spark users to specify options without turning them into strings. Using the JSON data source as an example.
Before this patch:
```scala
sqlContext.read.option("primitivesAsString", "true").json("/path/to/json")
```
After this patch:
Before this patch:
```scala
sqlContext.read.option("primitivesAsString", true).json("/path/to/json")
```
Author: Reynold Xin <rxin@databricks.com>
Closes#11072 from rxin/SPARK-13187.
JIRA: https://issues.apache.org/jira/browse/SPARK-12850
This PR is to support bucket pruning when the predicates are `EqualTo`, `EqualNullSafe`, `IsNull`, `In`, and `InSet`.
Like HIVE, in this PR, the bucket pruning works when the bucketing key has one and only one column.
So far, I do not find a way to verify how many buckets are actually scanned. However, I did verify it when doing the debug. Could you provide a suggestion how to do it properly? Thank you! cloud-fan yhuai rxin marmbrus
BTW, we can add more cases to support complex predicate including `Or` and `And`. Please let me know if I should do it in this PR.
Maybe we also need to add test cases to verify if bucket pruning works well for each data type.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#10942 from gatorsmile/pruningBuckets.
Spark SQL should collapse adjacent `Repartition` operators and only keep the last one.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#11064 from JoshRosen/collapse-repartition.
Make an internal non-deprecated version of incBytesRead and incRecordsRead so we don't have unecessary deprecation warnings in our build.
Right now incBytesRead and incRecordsRead are marked as deprecated and for internal use only. We should make private[spark] versions which are not deprecated and switch to those internally so as to not clutter up the warning messages when building.
cc andrewor14 who did the initial deprecation
Author: Holden Karau <holden@us.ibm.com>
Closes#11056 from holdenk/SPARK-13152-fix-task-metrics-deprecation-warnings.
Best time is stabler than average time, also added a column for nano seconds per row (which could be used to estimate contributions of each components in a query).
Having best time and average time together for more information (we can see kind of variance).
rate, time per row and relative are all calculated using best time.
The result looks like this:
```
Intel(R) Core(TM) i7-4558U CPU 2.80GHz
rang/filter/sum: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------
rang/filter/sum codegen=false 14332 / 16646 36.0 27.8 1.0X
rang/filter/sum codegen=true 845 / 940 620.0 1.6 17.0X
```
Author: Davies Liu <davies@databricks.com>
Closes#11018 from davies/gen_bench.
They seem redundant and we can simply use DataFrameReader/Writer. The new usage looks like:
```scala
val df = sqlContext.read.stream("...")
val handle = df.write.stream("...")
handle.stop()
```
Author: Reynold Xin <rxin@databricks.com>
Closes#11062 from rxin/SPARK-13166.
A row from stream side could match multiple rows on build side, the loop for these matched rows should not be interrupted when emitting a row, so we buffer the output rows in a linked list, check the termination condition on producer loop (for example, Range or Aggregate).
Author: Davies Liu <davies@databricks.com>
Closes#10989 from davies/gen_join.
1. try to avoid the suffix (unique id)
2. remove the comment if there is no code generated.
3. re-arrange the order of functions
4. trop the new line for inlined blocks.
Author: Davies Liu <davies@databricks.com>
Closes#11032 from davies/better_suffix.
This PR add spilling support for generated TungstenAggregate.
If spilling happened, it's not that bad to do the iterator based sort-merge-aggregate (not generated).
The changes will be covered by TungstenAggregationQueryWithControlledFallbackSuite
Author: Davies Liu <davies@databricks.com>
Closes#10998 from davies/gen_spilling.