As a follow-up to https://github.com/apache/spark/pull/5944
Author: Reynold Xin <rxin@databricks.com>
Closes#6064 from rxin/jointype-better-error and squashes the following commits:
7629bf7 [Reynold Xin] [SQL] Show better error messages for incorrect join types in DataFrames.
(cherry picked from commit 4f4dbb030c)
Signed-off-by: Reynold Xin <rxin@databricks.com>
Updated Java, Scala, Python, and R.
Author: Reynold Xin <rxin@databricks.com>
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes#5996 from rxin/groupby-retain and squashes the following commits:
aac7119 [Reynold Xin] Merge branch 'groupby-retain' of github.com:rxin/spark into groupby-retain
f6858f6 [Reynold Xin] Merge branch 'master' into groupby-retain
5f923c0 [Reynold Xin] Merge pull request #15 from shivaram/sparkr-groupby-retrain
c1de670 [Shivaram Venkataraman] Revert workaround in SparkR to retain grouped cols Based on reverting code added in commit 9a6be746ef
b8b87e1 [Reynold Xin] Fixed DataFrameJoinSuite.
d910141 [Reynold Xin] Updated rest of the files
1e6e666 [Reynold Xin] [SPARK-7462] By default retain group by columns in aggregate
(cherry picked from commit 0a4844f90a)
Signed-off-by: Reynold Xin <rxin@databricks.com>
Bugs description:
1. There are extra commas on the top of session list.
2. The format of time in "Start at:" part is not the same as others.
3. The total number of online sessions is wrong.
Author: tianyi <tianyi.asiainfo@gmail.com>
Closes#6048 from tianyi/SPARK-7519 and squashes the following commits:
ed366b7 [tianyi] fix bug
(cherry picked from commit 2242ab31e9)
Signed-off-by: Cheng Lian <lian@databricks.com>
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/6038)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes#6038 from liancheng/fix-typo and squashes the following commits:
572c2a4 [Cheng Lian] Fixes variable name typo
(cherry picked from commit 6bf9352fa5)
Signed-off-by: Cheng Lian <lian@databricks.com>
Issue appears when one tries to create DataFrame using sqlContext.load("jdbc"...) statement when "dbtable" contains query with renamed columns.
If original column is used in SQL query once the resulting DataFrame will contain non-renamed column.
If original column is used in SQL query several times with different aliases, sqlContext.load will fail.
Original implementation of JDBCRDD.resolveTable uses getColumnName to detect column names in RDD schema.
Suggested implementation uses getColumnLabel to handle column renames in SQL statement which is aware of SQL "AS" statement.
Readings:
http://stackoverflow.com/questions/4271152/getcolumnlabel-vs-getcolumnnamehttp://stackoverflow.com/questions/12259829/jdbc-getcolumnname-getcolumnlabel-db2
Official documentation unfortunately a bit misleading in definition of "suggested title" purpose however clearly defines behavior of AS keyword in SQL statement.
http://docs.oracle.com/javase/7/docs/api/java/sql/ResultSetMetaData.html
getColumnLabel - Gets the designated column's suggested title for use in printouts and displays. The suggested title is usually specified by the SQL AS clause. If a SQL AS is not specified, the value returned from getColumnLabel will be the same as the value returned by the getColumnName method.
Author: Oleg Sidorkin <oleg.sidorkin@gmail.com>
Closes#6032 from osidorkin/master and squashes the following commits:
10fc44b [Oleg Sidorkin] [SPARK-7345][SQL] Regression test for JDBCSuite (resolved scala style test error)
2aaf6f7 [Oleg Sidorkin] [SPARK-7345][SQL] Regression test for JDBCSuite (renamed fields in JDBC query)
b7d5b22 [Oleg Sidorkin] [SPARK-7345][SQL] Regression test for JDBCSuite
09559a0 [Oleg Sidorkin] [SPARK-7345][SQL] Spark cannot detect renamed columns using JDBC connector
(cherry picked from commit d7a37bcaf1)
Signed-off-by: Reynold Xin <rxin@databricks.com>
Author: tedyu <yuzhihong@gmail.com>
Closes#6031 from tedyu/master and squashes the following commits:
5c2580c [tedyu] Reference fasterxml.jackson.version in sql/core/pom.xml
ff2a44f [tedyu] Merge branch 'master' of github.com:apache/spark
28c8394 [tedyu] Upgrade version of jackson-databind in sql/core/pom.xml
(cherry picked from commit bd74301ff8)
Signed-off-by: Michael Armbrust <michael@databricks.com>
Currently version of jackson-databind in sql/core/pom.xml is 2.3.0
This is older than the version specified in root pom.xml
This PR upgrades the version in sql/core/pom.xml so that they're consistent.
Author: tedyu <yuzhihong@gmail.com>
Closes#6028 from tedyu/master and squashes the following commits:
28c8394 [tedyu] Upgrade version of jackson-databind in sql/core/pom.xml
(cherry picked from commit 3071aac387)
Signed-off-by: Michael Armbrust <michael@databricks.com>
This patch refactors the SQL `Exchange` operator's logic for determining whether map outputs need to be copied before being shuffled. As part of this change, we'll now avoid unnecessary copies in cases where sort-based shuffle operates on serialized map outputs (as in #4450 /
SPARK-4550).
This patch also includes a change to copy the input to RangePartitioner partition bounds calculation, which is necessary because this calculation buffers mutable Java objects.
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/5948)
<!-- Reviewable:end -->
Author: Josh Rosen <joshrosen@databricks.com>
Closes#5948 from JoshRosen/SPARK-7375 and squashes the following commits:
f305ff3 [Josh Rosen] Reduce scope of some variables in Exchange
899e1d7 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-7375
6a6bfce [Josh Rosen] Fix issue related to RangePartitioning:
ad006a4 [Josh Rosen] [SPARK-7375] Avoid defensive copying in exchange operator when sort.serializeMapOutputs takes effect.
(cherry picked from commit cde5483884)
Signed-off-by: Yin Huai <yhuai@databricks.com>
Changes include
1. Rename sortDF to arrange
2. Add new aliases `group_by` and `sample_frac`, `summarize`
3. Add more user friendly column addition (mutate), rename
4. Support mean as an alias for avg in Scala and also support n_distinct, n as in dplyr
Using these changes we can pretty much run the examples as described in http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html with the same syntax
The only thing missing in SparkR is auto resolving column names when used in an expression i.e. making something like `select(flights, delay)` works in dply but we right now need `select(flights, flights$delay)` or `select(flights, "delay")`. But this is a complicated change and I'll file a new issue for it
cc sun-rui rxin
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes#6005 from shivaram/sparkr-df-api and squashes the following commits:
5e0716a [Shivaram Venkataraman] Fix some roxygen bugs
1254953 [Shivaram Venkataraman] Merge branch 'master' of https://github.com/apache/spark into sparkr-df-api
0521149 [Shivaram Venkataraman] Changes to make SparkR DataFrame dplyr friendly. Changes include 1. Rename sortDF to arrange 2. Add new aliases `group_by` and `sample_frac`, `summarize` 3. Add more user friendly column addition (mutate), rename 4. Support mean as an alias for avg in Scala and also support n_distinct, n as in dplyr
(cherry picked from commit 0a901dd3a1)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
The DAG visualization currently displays only low-level Spark primitives (e.g. `map`, `reduceByKey`, `filter` etc.). For SQL, these aren't particularly useful. Instead, we should display higher level physical operators (e.g. `Filter`, `Exchange`, `ShuffleHashJoin`). cc marmbrus
-----------------
**Before**
<img src="https://issues.apache.org/jira/secure/attachment/12731586/before.png" width="600px"/>
-----------------
**After** (Pay attention to the words)
<img src="https://issues.apache.org/jira/secure/attachment/12731587/after.png" width="600px"/>
-----------------
Author: Andrew Or <andrew@databricks.com>
Closes#5999 from andrewor14/dag-viz-sql and squashes the following commits:
0db23a4 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-sql
1e211db [Andrew Or] Update comment
0d49fd6 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-sql
ffd237a [Andrew Or] Fix style
202dac1 [Andrew Or] Make ignoreParent false by default
e61b1ab [Andrew Or] Visualize SQL operators, not low-level Spark primitives
569034a [Andrew Or] Add a flag to ignore parent settings and scopes
(cherry picked from commit bd61f07039)
Signed-off-by: Andrew Or <andrew@databricks.com>
JIRA: https://issues.apache.org/jira/browse/SPARK-7390
Also fix a minor typo.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#5931 from viirya/fix_covariancecounter and squashes the following commits:
352eda6 [Liang-Chi Hsieh] Only merge other CovarianceCounter when its count is greater than zero.
(cherry picked from commit 90527f5604)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
It's the first step: generalize UnresolvedGetField to support all map, struct, and array
TODO: add `apply` in Scala and `__getitem__` in Python, and unify the `getItem` and `getField` methods to one single API(or should we keep them for compatibility?).
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#5744 from cloud-fan/generalize and squashes the following commits:
715c589 [Wenchen Fan] address comments
7ea5b31 [Wenchen Fan] fix python test
4f0833a [Wenchen Fan] add python test
f515d69 [Wenchen Fan] add apply method and test cases
8df6199 [Wenchen Fan] fix python test
239730c [Wenchen Fan] fix test compile
2a70526 [Wenchen Fan] use _bin_op in dataframe.py
6bf72bc [Wenchen Fan] address comments
3f880c3 [Wenchen Fan] add java doc
ab35ab5 [Wenchen Fan] fix python test
b5961a9 [Wenchen Fan] fix style
c9d85f5 [Wenchen Fan] generalize UnresolvedGetField to support all map, struct, and array
(cherry picked from commit 2d05f325dc)
Signed-off-by: Michael Armbrust <michael@databricks.com>
Added a new batch named `Substitution` before `Resolution` batch. The motivation for this is there are kind of cases we want to do some substitution on the parsed logical plan before resolve it.
Consider this two cases:
1 CTE, for cte we first build a row logical plan
```
'With Map(q1 -> 'Subquery q1
'Project ['key]
'UnresolvedRelation [src], None)
'Project [*]
'Filter ('key = 5)
'UnresolvedRelation [q1], None
```
In `With` logicalplan here is a map stored the (`q1-> subquery`), we want first take off the with command and substitute the `q1` of `UnresolvedRelation` by the `subquery`
2 Another example is Window function, in window function user may define some windows, we also need substitute the window name of child by the concrete window. this should also done in the Substitution batch.
Author: wangfei <wangfei1@huawei.com>
Closes#5776 from scwf/addbatch and squashes the following commits:
d4b962f [wangfei] added WindowsSubstitution
70f6932 [wangfei] Merge branch 'master' of https://github.com/apache/spark into addbatch
ecaeafb [wangfei] address yhuai's comments
553005a [wangfei] fix test case
0c54798 [wangfei] address comments
29aaaaf [wangfei] fix compile
1c9a092 [wangfei] added Substitution bastch
(cherry picked from commit f496bf3c53)
Signed-off-by: Yin Huai <yhuai@databricks.com>
With 0a2b15ce43, the serialization stream and deserialization stream has enough information to determine it is handling a key-value pari, a key, or a value. It is safe to use `SparkSqlSerializer2` in more cases.
Author: Yin Huai <yhuai@databricks.com>
Closes#5849 from yhuai/serializer2MoreCases and squashes the following commits:
53a5eaa [Yin Huai] Josh's comments.
487f540 [Yin Huai] Use BufferedOutputStream.
8385f95 [Yin Huai] Always create a new row at the deserialization side to work with sort merge join.
c7e2129 [Yin Huai] Update tests.
4513d13 [Yin Huai] Use Serializer2 in more places.
(cherry picked from commit 3af423c92f)
Signed-off-by: Yin Huai <yhuai@databricks.com>
This PR switches Spark SQL's Hive support to use the isolated hive client interface introduced by #5851, instead of directly interacting with the client. By using this isolated client we can now allow users to dynamically configure the version of Hive that they are connecting to by setting `spark.sql.hive.metastore.version` without the need recompile. This also greatly reduces the surface area for our interaction with the hive libraries, hopefully making it easier to support other versions in the future.
Jars for the desired hive version can be configured using `spark.sql.hive.metastore.jars`, which accepts the following options:
- a colon-separated list of jar files or directories for hive and hadoop.
- `builtin` - attempt to discover the jars that were used to load Spark SQL and use those. This
option is only valid when using the execution version of Hive.
- `maven` - download the correct version of hive on demand from maven.
By default, `builtin` is used for Hive 13.
This PR also removes the test step for building against Hive 12, as this will no longer be required to talk to Hive 12 metastores. However, the full removal of the Shim is deferred until a later PR.
Remaining TODOs:
- Remove the Hive Shims and inline code for Hive 13.
- Several HiveCompatibility tests are not yet passing.
- `nullformatCTAS` - As detailed below, we now are handling CTAS parsing ourselves instead of hacking into the Hive semantic analyzer. However, we currently only handle the common cases and not things like CTAS where the null format is specified.
- `combine1` now leaks state about compression somehow, breaking all subsequent tests. As such we currently add it to the blacklist
- `part_inherit_tbl_props` and `part_inherit_tbl_props_with_star` do not work anymore. We are correctly propagating the information
- "load_dyn_part14.*" - These tests pass when run on their own, but fail when run with all other tests. It seems our `RESET` mechanism may not be as robust as it used to be?
Other required changes:
- `CreateTableAsSelect` no longer carries parts of the HiveQL AST with it through the query execution pipeline. Instead, we parse CTAS during the HiveQL conversion and construct a `HiveTable`. The full parsing here is not yet complete as detailed above in the remaining TODOs. Since the operator is Hive specific, it is moved to the hive package.
- `Command` is simplified to be a trait that simply acts as a marker for a LogicalPlan that should be eagerly evaluated.
Author: Michael Armbrust <michael@databricks.com>
Closes#5876 from marmbrus/useIsolatedClient and squashes the following commits:
258d000 [Michael Armbrust] really really correct path handling
e56fd4a [Michael Armbrust] getAbsolutePath
5a259f5 [Michael Armbrust] fix typos
81bb366 [Michael Armbrust] comments from vanzin
5f3945e [Michael Armbrust] Merge remote-tracking branch 'origin/master' into useIsolatedClient
4b5cd41 [Michael Armbrust] yin's comments
f5de7de [Michael Armbrust] cleanup
11e9c72 [Michael Armbrust] better coverage in versions suite
7e8f010 [Michael Armbrust] better error messages and jar handling
e7b3941 [Michael Armbrust] more permisive checking for function registration
da91ba7 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into useIsolatedClient
5fe5894 [Michael Armbrust] fix serialization suite
81711c4 [Michael Armbrust] Initial support for running without maven
1d8ae44 [Michael Armbrust] fix final tests?
1c50813 [Michael Armbrust] more comments
a3bee70 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into useIsolatedClient
a6f5df1 [Michael Armbrust] style
ab07f7e [Michael Armbrust] WIP
4d8bf02 [Michael Armbrust] Remove hive 12 compilation
8843a25 [Michael Armbrust] [SPARK-6908] [SQL] Use isolated Hive client
(cherry picked from commit cd1d4110cf)
Signed-off-by: Yin Huai <yhuai@databricks.com>
Avoid translating to CaseWhen and evaluate the key expression many times.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#5979 from cloud-fan/condition and squashes the following commits:
3ce54e1 [Wenchen Fan] add CaseKeyWhen
(cherry picked from commit 35f0173b8f)
Signed-off-by: Michael Armbrust <michael@databricks.com>
Go through the context classloader when reflecting on user types in ScalaReflection.
Replaced calls to `typeOf` with `typeTag[T].in(mirror)`. The convenience method assumes
all types can be found in the classloader that loaded scala-reflect (the primordial
classloader). This assumption is not valid in all contexts (sbt console, Eclipse launchers).
Fixed SPARK-5281
Author: Iulian Dragos <jaguarul@gmail.com>
Closes#5981 from dragos/issue/mirrors-missing-requirement-error and squashes the following commits:
d103e70 [Iulian Dragos] Go through the context classloader when reflecting on user types in ScalaReflection
(cherry picked from commit 937ba798c5)
Signed-off-by: Michael Armbrust <michael@databricks.com>
JIRA: https://issues.apache.org/jira/browse/SPARK-7277
As automatically determining the number of reducers is not supported (`mapred.reduce.tasks` is set to `-1`), we should throw exception to users.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#5811 from viirya/no_neg_reduce_tasks and squashes the following commits:
e518f96 [Liang-Chi Hsieh] Consider other wrong setting values.
fd9c817 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into no_neg_reduce_tasks
4ede705 [Liang-Chi Hsieh] Throw exception instead of warning message.
68a1c70 [Liang-Chi Hsieh] Show warning message if mapred.reduce.tasks is set to -1.
(cherry picked from commit ea3077f19c)
Signed-off-by: Michael Armbrust <michael@databricks.com>
`Star` and `MultiAlias` just used in `analyzer` and them will be substituted after analyze, So just like `Alias` they do not need extend `Attribute`
Author: scwf <wangfei1@huawei.com>
Closes#5928 from scwf/attribute and squashes the following commits:
73a0560 [scwf] star and multialias do not need extend attribute
(cherry picked from commit 97d1182af6)
Signed-off-by: Michael Armbrust <michael@databricks.com>
This is a follow up of #5827 to remove the additional `SparkSQLParser`
Author: Cheng Hao <hao.cheng@intel.com>
Closes#5965 from chenghao-intel/remove_sparksqlparser and squashes the following commits:
509a233 [Cheng Hao] Remove the HiveQlQueryExecution
a5f9e3b [Cheng Hao] Remove the duplicated SparkSQLParser
(cherry picked from commit 074d75d4c8)
Signed-off-by: Michael Armbrust <michael@databricks.com>
This patch simply removes a `cache()` on an intermediate RDD when evaluating Python UDFs.
Author: ksonj <kson@siberie.de>
Closes#5973 from ksonj/udf and squashes the following commits:
db5b564 [ksonj] removed TODO about cleaning up
fe70c54 [ksonj] Remove cache() causing memory leak
(cherry picked from commit dec8f53719)
Signed-off-by: Michael Armbrust <michael@databricks.com>
Address marmbrus and scwf's comments in #5604.
Author: Yin Huai <yhuai@databricks.com>
Closes#5945 from yhuai/windowFollowup and squashes the following commits:
0ef879d [Yin Huai] Add collectFirst to TreeNode.
2373968 [Yin Huai] wip
4a16df9 [Yin Huai] Address minor comments for [SPARK-1442].
(cherry picked from commit 5784c8d955)
Signed-off-by: Michael Armbrust <michael@databricks.com>
Thank nadavoosh point this out in #5590
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#5877 from adrian-wang/jdbcrdd and squashes the following commits:
cc11900 [Daoyuan Wang] avoid NPE in jdbcrdd
(cherry picked from commit ed9be06a47)
Signed-off-by: Yin Huai <yhuai@databricks.com>
Author: Shiti <ssaxena.ece@gmail.com>
Closes#5867 from Shiti/spark-7295 and squashes the following commits:
71a9913 [Shiti] implementation for bitwise and,or, not and xor on Column with tests and docs
(cherry picked from commit fa8fddffd5)
Signed-off-by: Reynold Xin <rxin@databricks.com>
This patch comprises of a few related pieces of work:
* Schema inference is performed directly on the JSON token stream
* `String => Row` conversion populate Spark SQL structures without intermediate types
* Projection pushdown is implemented via CatalystScan for DataFrame queries
* Support for the legacy parser by setting `spark.sql.json.useJacksonStreamingAPI` to `false`
Performance improvements depend on the schema and queries being executed, but it should be faster across the board. Below are benchmarks using the last.fm Million Song dataset:
```
Command | Baseline | Patched
---------------------------------------------------|----------|--------
import sqlContext.implicits._ | |
val df = sqlContext.jsonFile("/tmp/lastfm.json") | 70.0s | 14.6s
df.count() | 28.8s | 6.2s
df.rdd.count() | 35.3s | 21.5s
df.where($"artist" === "Robert Hood").collect() | 28.3s | 16.9s
```
To prepare this dataset for benchmarking, follow these steps:
```
# Fetch the datasets from http://labrosa.ee.columbia.edu/millionsong/lastfm
wget http://labrosa.ee.columbia.edu/millionsong/sites/default/files/lastfm/lastfm_test.zip \
http://labrosa.ee.columbia.edu/millionsong/sites/default/files/lastfm/lastfm_train.zip
# Decompress and combine, pipe through `jq -c` to ensure there is one record per line
unzip -p lastfm_test.zip lastfm_train.zip | jq -c . > lastfm.json
```
Author: Nathan Howell <nhowell@godaddy.com>
Closes#5801 from NathanHowell/json-performance and squashes the following commits:
26fea31 [Nathan Howell] Recreate the baseRDD each for each scan operation
a7ebeb2 [Nathan Howell] Increase coverage of inserts into a JSONRelation
e06a1dd [Nathan Howell] Add comments to the `useJacksonStreamingAPI` config flag
6822712 [Nathan Howell] Split up JsonRDD2 into multiple objects
fa8234f [Nathan Howell] Wrap long lines
b31917b [Nathan Howell] Rename `useJsonRDD2` to `useJacksonStreamingAPI`
15c5d1b [Nathan Howell] JSONRelation's baseRDD need not be lazy
f8add6e [Nathan Howell] Add comments on lack of support for precision and scale DecimalTypes
fa0be47 [Nathan Howell] Remove unused default case in the field parser
80dba17 [Nathan Howell] Add comments regarding null handling and empty strings
842846d [Nathan Howell] Point the empty schema inference test at JsonRDD2
ab6ee87 [Nathan Howell] Add projection pushdown support to JsonRDD/JsonRDD2
f636c14 [Nathan Howell] Enable JsonRDD2 by default, add a flag to switch back to JsonRDD
0bbc445 [Nathan Howell] Improve JSON parsing and type inference performance
7ca70c1 [Nathan Howell] Eliminate arrow pattern, replace with pattern matches
(cherry picked from commit 2d6612cc8b)
Signed-off-by: Yin Huai <yhuai@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Closes#5951 from yhuai/fixBuildMaven and squashes the following commits:
fdde183 [Yin Huai] Move HiveWindowFunctionQuerySuite.scala to hive compatibility dir.
(cherry picked from commit 7740996700)
Signed-off-by: Yin Huai <yhuai@databricks.com>
This patch extends the `Serializer` interface with a new `Private` API which allows serializers to indicate whether they support relocation of serialized objects in serializer stream output.
This relocatibilty property is described in more detail in `Serializer.scala`, but in a nutshell a serializer supports relocation if reordering the bytes of serialized objects in serialization stream output is equivalent to having re-ordered those elements prior to serializing them. The optimized shuffle path introduced in #4450 and #5868 both rely on serializers having this property; this patch just centralizes the logic for determining whether a serializer has this property. I also added tests and comments clarifying when this works for KryoSerializer.
This change allows the optimizations in #4450 to be applied for shuffles that use `SqlSerializer2`.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#5924 from JoshRosen/SPARK-7311 and squashes the following commits:
50a68ca [Josh Rosen] Address minor nits
0a7ebd7 [Josh Rosen] Clarify reason why SqlSerializer2 supports this serializer
123b992 [Josh Rosen] Cleanup for submitting as standalone patch.
4aa61b2 [Josh Rosen] Add missing newline
2c1233a [Josh Rosen] Small refactoring of SerializerPropertiesSuite to enable test re-use:
0ba75e6 [Josh Rosen] Add tests for serializer relocation property.
450fa21 [Josh Rosen] Back out accidental log4j.properties change
86d4dcd [Josh Rosen] Flag that SparkSqlSerializer2 supports relocation
b9624ee [Josh Rosen] Expand serializer API and use new function to help control when new UnsafeShuffle path is used.
(cherry picked from commit 002c12384d)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
Adding more information about the implementation...
This PR is adding the support of window functions to Spark SQL (specifically OVER and WINDOW clause). For every expression having a OVER clause, we use a WindowExpression as the container of a WindowFunction and the corresponding WindowSpecDefinition (the definition of a window frame, i.e. partition specification, order specification, and frame specification appearing in a OVER clause).
# Implementation #
The high level work flow of the implementation is described as follows.
* Query parsing: In the query parse process, all WindowExpressions are originally placed in the projectList of a Project operator or the aggregateExpressions of an Aggregate operator. It makes our changes to simple and keep all of parsing rules for window functions at a single place (nodesToWindowSpecification). For the WINDOWclause in a query, we use a WithWindowDefinition as the container as the mapping from the name of a window specification to a WindowSpecDefinition. This changes is similar with our common table expression support.
* Analysis: The query analysis process has three steps for window functions.
* Resolve all WindowSpecReferences by replacing them with WindowSpecReferences according to the mapping table stored in the node of WithWindowDefinition.
* Resolve WindowFunctions in the projectList of a Project operator or the aggregateExpressions of an Aggregate operator. For this PR, we use Hive's functions for window functions because we will have a major refactoring of our internal UDAFs and it is better to switch our UDAFs after that refactoring work.
* Once we have resolved all WindowFunctions, we will use ResolveWindowFunction to extract WindowExpressions from projectList and aggregateExpressions and then create a Window operator for every distinct WindowSpecDefinition. With this choice, at the execution time, we can rely on the Exchange operator to do all of work on reorganizing the table and we do not need to worry about it in the physical Window operator. An example analyzed plan is shown as follows
```
sql("""
SELECT
year, country, product, sales,
avg(sales) over(partition by product) avg_product,
sum(sales) over(partition by country) sum_country
FROM sales
ORDER BY year, country, product
""").explain(true)
== Analyzed Logical Plan ==
Sort [year#34 ASC,country#35 ASC,product#36 ASC], true
Project [year#34,country#35,product#36,sales#37,avg_product#27,sum_country#28]
Window [year#34,country#35,product#36,sales#37,avg_product#27], [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFSum(sales#37) WindowSpecDefinition [country#35], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS sum_country#28], WindowSpecDefinition [country#35], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
Window [year#34,country#35,product#36,sales#37], [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFAverage(sales#37) WindowSpecDefinition [product#36], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS avg_product#27], WindowSpecDefinition [product#36], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
Project [year#34,country#35,product#36,sales#37]
MetastoreRelation default, sales, None
```
* Query planning: In the process of query planning, we simple generate the physical Window operator based on the logical Window operator. Then, to prepare the executedPlan, the EnsureRequirements rule will add Exchange and Sort operators if necessary. The EnsureRequirements rule will analyze the data properties and try to not add unnecessary shuffle and sort. The physical plan for the above example query is shown below.
```
== Physical Plan ==
Sort [year#34 ASC,country#35 ASC,product#36 ASC], true
Exchange (RangePartitioning [year#34 ASC,country#35 ASC,product#36 ASC], 200), []
Window [year#34,country#35,product#36,sales#37,avg_product#27], [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFSum(sales#37) WindowSpecDefinition [country#35], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS sum_country#28], WindowSpecDefinition [country#35], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
Exchange (HashPartitioning [country#35], 200), [country#35 ASC]
Window [year#34,country#35,product#36,sales#37], [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFAverage(sales#37) WindowSpecDefinition [product#36], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS avg_product#27], WindowSpecDefinition [product#36], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
Exchange (HashPartitioning [product#36], 200), [product#36 ASC]
HiveTableScan [year#34,country#35,product#36,sales#37], (MetastoreRelation default, sales, None), None
```
* Execution time: At execution time, a physical Window operator buffers all rows in a partition specified in the partition spec of a OVER clause. If necessary, it also maintains a sliding window frame. The current implementation tries to buffer the input parameters of a window function according to the window frame to avoid evaluating a row multiple times.
# Future work #
Here are three improvements that are not hard to add:
* Taking advantage of the window frame specification to reduce the number of rows buffered in the physical Window operator. For some cases, we only need to buffer the rows appearing in the sliding window. But for other cases, we will not be able to reduce the number of rows buffered (e.g. ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING).
* When aRAGEN frame is used, for <value> PRECEDING and <value> FOLLOWING, it will be great if the <value> part is an expression (we can start with Literal). So, when the data type of ORDER BY expression is a FractionalType, we can support FractionalType as the type <value> (<value> still needs to be evaluated as a positive value).
* When aRAGEN frame is used, we need to support DateType and TimestampType as the data type of the expression appearing in the order specification. Then, the <value> part of <value> PRECEDING and <value> FOLLOWING can support interval types (once we support them).
This is a joint work with guowei2 and yhuai
Thanks hbutani hvanhovell for his comments
Thanks scwf for his comments and unit tests
Author: Yin Huai <yhuai@databricks.com>
Closes#5604 from guowei2/windowImplement and squashes the following commits:
76fe1c8 [Yin Huai] Implementation.
aa2b0ae [Yin Huai] Tests.
(cherry picked from commit f2c47082c3)
Signed-off-by: Michael Armbrust <michael@databricks.com>
huangjs
Acutally spark sql will first go through analysis period, in which we do widen types and promote strings, and then optimization, where constant IN will be converted into INSET.
So it turn out that we only need to fix this for IN.
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#4945 from adrian-wang/inset and squashes the following commits:
71e05cc [Daoyuan Wang] minor fix
581fa1c [Daoyuan Wang] mysql way
f3f7baf [Daoyuan Wang] address comments
5eed4bc [Daoyuan Wang] promote string and do widen types for IN
(cherry picked from commit c3eb441f54)
Signed-off-by: Yin Huai <yhuai@databricks.com>
Author: Reynold Xin <rxin@databricks.com>
Closes#5935 from rxin/df-doc1 and squashes the following commits:
aaeaadb [Reynold Xin] [SQL] JavaDoc update for various DataFrame functions.
(cherry picked from commit 322e7e7f68)
Signed-off-by: Reynold Xin <rxin@databricks.com>
After a discussion on the user mailing list, it was decided to put all UDF's under `o.a.s.sql.functions`
cc rxin
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#5923 from brkyvz/move-math-funcs and squashes the following commits:
a8dc3f7 [Burak Yavuz] address comments
cf7a7bb [Burak Yavuz] [SPARK-7358] Move DataFrame mathfunctions into functions
(cherry picked from commit ba2b56614d)
Signed-off-by: Reynold Xin <rxin@databricks.com>
See the comment in join function for more information.
Author: Reynold Xin <rxin@databricks.com>
Closes#5919 from rxin/self-join-resolve and squashes the following commits:
e2fb0da [Reynold Xin] Updated SQLConf comment.
7233a86 [Reynold Xin] Updated comment.
6be2b4d [Reynold Xin] Removed println
9f6b72f [Reynold Xin] [SPARK-6231][SQL/DF] Automatically resolve ambiguity in join condition for self-joins.
make StringComparison extends ExpectsInputTypes and added expectedChildTypes, so do not need override expectedChildTypes in each subclass
Author: wangfei <wangfei1@huawei.com>
Closes#5905 from scwf/ExpectsInputTypes and squashes the following commits:
b374ddf [wangfei] make stringcomparison extends ExpectsInputTypes
(cherry picked from commit 3059291e20)
Signed-off-by: Reynold Xin <rxin@databricks.com>
Reduced take size from 1e8 to 1e6.
cc rxin
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#5900 from brkyvz/df-cont-followup and squashes the following commits:
c11e762 [Burak Yavuz] fix grammar
b30ace2 [Burak Yavuz] address comments
a417ba5 [Burak Yavuz] [SPARK-7243][SQL] Reduce size for Contingency Tables in DataFrames
(cherry picked from commit 18340d7be5)
Signed-off-by: Reynold Xin <rxin@databricks.com>
Two minor doc errors in `BytesToBytesMap` and `UnsafeRow`.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#5906 from viirya/minor_doc and squashes the following commits:
27f9089 [Liang-Chi Hsieh] Minor update for doc.
(cherry picked from commit b83091ae45)
Signed-off-by: Sean Owen <sowen@cloudera.com>
This should gives us better analysis time error messages (rather than runtime) and automatic type casting.
Author: Reynold Xin <rxin@databricks.com>
Closes#5796 from rxin/expected-input-types and squashes the following commits:
c900760 [Reynold Xin] [SPARK-7266] Add ExpectsInputTypes to expressions when possible.
(cherry picked from commit 678c4da0fa)
Signed-off-by: Reynold Xin <rxin@databricks.com>
Author: 云峤 <chensong.cs@alibaba-inc.com>
Closes#5865 from kaka1992/df.show and squashes the following commits:
c79204b [云峤] Update
a1338f6 [云峤] Update python dataFrame show test and add empty df unit test.
734369c [云峤] Update python dataFrame show test and add empty df unit test.
84aec3e [云峤] Update python dataFrame show test and add empty df unit test.
159b3d5 [云峤] update
03ef434 [云峤] update
7394fd5 [云峤] update test show
ced487a [云峤] update pep8
b6e690b [云峤] Merge remote-tracking branch 'upstream/master' into df.show
30ac311 [云峤] [SPARK-7294] ADD BETWEEN
7d62368 [云峤] [SPARK-7294] ADD BETWEEN
baf839b [云峤] [SPARK-7294] ADD BETWEEN
d11d5b9 [云峤] [SPARK-7294] ADD BETWEEN
(cherry picked from commit f32e69ecc3)
Signed-off-by: Reynold Xin <rxin@databricks.com>
This PR is a rebased version of #3946 , and mainly focused on creating an independent tab for the thrift server in spark web UI.
Features:
1. Session related statistics ( username and IP are only supported in hive-0.13.1 )
2. List all the SQL executing or executed on this server
3. Provide links to the job generated by SQL
4. Provide link to show all SQL executing or executed in a specified session
Prototype snapshots:
This is the main page for thrift server
![image](https://cloud.githubusercontent.com/assets/1411869/7361379/df7dcc64-ed89-11e4-9964-4df0b32f475e.png)
Author: tianyi <tianyi.asiainfo@gmail.com>
Closes#5730 from tianyi/SPARK-5100 and squashes the following commits:
cfd14c7 [tianyi] style fix
0efe3d5 [tianyi] revert part of pom change
c0f2fa0 [tianyi] extends HiveThriftJdbcTest to start/stop thriftserver for UI test
aa20408 [tianyi] fix style problem
c9df6f9 [tianyi] add testsuite for thriftserver ui and fix some style issue
9830199 [tianyi] add webui for thriftserver
submitting this PR from a phone, excuse the brevity.
adds Pearson correlation to Dataframes, reusing the covariance calculation code
cc mengxr rxin
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#5858 from brkyvz/df-corr and squashes the following commits:
285b838 [Burak Yavuz] addressed comments v2.0
d10babb [Burak Yavuz] addressed comments v0.2
4b74b24 [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into df-corr
4fe693b [Burak Yavuz] addressed comments v0.1
a682d06 [Burak Yavuz] ready for PR
This PR adds initial support for loading multiple versions of Hive in a single JVM and provides a common interface for extracting metadata from the `HiveMetastoreClient` for a given version. This is accomplished by creating an isolated `ClassLoader` that operates according to the following rules:
- __Shared Classes__: Java, Scala, logging, and Spark classes are delegated to `baseClassLoader`
allowing the results of calls to the `ClientInterface` to be visible externally.
- __Hive Classes__: new instances are loaded from `execJars`. These classes are not
accessible externally due to their custom loading.
- __Barrier Classes__: Classes such as `ClientWrapper` are defined in Spark but must link to a specific version of Hive. As a result, the bytecode is acquired from the Spark `ClassLoader` but a new copy is created for each instance of `IsolatedClientLoader`.
This new instance is able to see a specific version of hive without using reflection where ever hive is consistent across versions. Since
this is a unique instance, it is not visible externally other than as a generic
`ClientInterface`, unless `isolationOn` is set to `false`.
In addition to the unit tests, I have also tested this locally against mysql instances of the Hive Metastore. I've also successfully ported Spark SQL to run with this client, but due to the size of the changes, that will come in a follow-up PR.
By default, Hive jars are currently downloaded from Maven automatically for a given version to ease packaging and testing. However, there is also support for specifying their location manually for deployments without internet.
Author: Michael Armbrust <michael@databricks.com>
Closes#5851 from marmbrus/isolatedClient and squashes the following commits:
c72f6ac [Michael Armbrust] rxins comments
1e271fa [Michael Armbrust] [SPARK-6907][SQL] Isolated client for HiveMetastore
based on #4015, we should not delete `sqlParser` from sqlcontext, that leads to mima failed. Users implement dialect to give a fallback for `sqlParser` and we should construct `sqlParser` in sqlcontext according to the dialect
`protected[sql] val sqlParser = new SparkSQLParser(getSQLDialect().parse(_))`
Author: Cheng Hao <hao.cheng@intel.com>
Author: scwf <wangfei1@huawei.com>
Closes#5827 from scwf/sqlparser1 and squashes the following commits:
81b9737 [scwf] comment fix
0878bd1 [scwf] remove comments
c19780b [scwf] fix mima tests
c2895cf [scwf] Merge branch 'master' of https://github.com/apache/spark into sqlparser1
493775c [Cheng Hao] update the code as feedback
81a731f [Cheng Hao] remove the unecessary comment
aab0b0b [Cheng Hao] polish the code a little bit
49b9d81 [Cheng Hao] shrink the comment for rebasing
At least in the version of Hive I tested on, the test was deleting
a temp directory generated by Hive instead of one containing partition
data. So fix the filter to only consider partition directories when
deciding what to delete.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#5854 from vanzin/hive-test-fix and squashes the following commits:
7594ae9 [Marcelo Vanzin] Fix typo.
729fa80 [Marcelo Vanzin] [minor] [hive] Fix QueryPartitionSuite.
The python api for DataFrame's plus addressed your comments from previous PR.
rxin
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#5859 from brkyvz/df-freq-py2 and squashes the following commits:
f9aa9ce [Burak Yavuz] addressed comments v0.1
4b25056 [Burak Yavuz] added python api for freqItems
Remove the method, since it causes infinite recursive calls. And seems it's a dummy method, since we have the API:
`def createDataFrame(rowRDD: JavaRDD[Row], schema: StructType): DataFrame`
Author: Cheng Hao <hao.cheng@intel.com>
Closes#5804 from chenghao-intel/spark_6999 and squashes the following commits:
63220a8 [Cheng Hao] remove the infinite recursive method (useless)
Refer to the JIRA for the design doc and some perf results.
I wanted to call out some of the more possibly controversial changes up front:
* Map outputs are only stored in serialized form when Kryo is in use. I'm still unsure whether Java-serialized objects can be relocated. At the very least, Java serialization writes out a stream header which causes problems with the current approach, so I decided to leave investigating this to future work.
* The shuffle now explicitly operates on key-value pairs instead of any object. Data is written to shuffle files in alternating keys and values instead of key-value tuples. `BlockObjectWriter.write` now accepts a key argument and a value argument instead of any object.
* The map output buffer can hold a max of Integer.MAX_VALUE bytes. Though this wouldn't be terribly difficult to change.
* When spilling occurs, the objects that still in memory at merge time end up serialized and deserialized an extra time.
Author: Sandy Ryza <sandy@cloudera.com>
Closes#4450 from sryza/sandy-spark-4550 and squashes the following commits:
8c70dd9 [Sandy Ryza] Fix serialization
9c16fe6 [Sandy Ryza] Fix a couple tests and move getAutoReset to KryoSerializerInstance
6c54e06 [Sandy Ryza] Fix scalastyle
d8462d8 [Sandy Ryza] SPARK-4550
Adds the functions `rand` (Uniform Dist) and `randn` (Normal Dist.) as expressions to DataFrames.
cc mengxr rxin
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#5819 from brkyvz/df-rng and squashes the following commits:
50d69d4 [Burak Yavuz] add seed for test that failed
4234c3a [Burak Yavuz] fix Rand expression
13cad5c [Burak Yavuz] couple fixes
7d53953 [Burak Yavuz] waiting for hive tests
b453716 [Burak Yavuz] move radn with seed down
03637f0 [Burak Yavuz] fix broken hive func
c5909eb [Burak Yavuz] deleted old implementation of Rand
6d43895 [Burak Yavuz] implemented random generators
Run following sql get error
`SELECT r.*
FROM testData l join testData2 r on (l.key = r.a)`
Author: scwf <wangfei1@huawei.com>
Closes#5690 from scwf/tablestar and squashes the following commits:
3b2e2b6 [scwf] support table.star
This PR aims to make the SQL Parser Pluggable, and user can register it's own parser via Spark SQL CLI.
```
# add the jar into the classpath
$hchengmydesktop:spark>bin/spark-sql --jars sql99.jar
-- switch to "hiveql" dialect
spark-sql>SET spark.sql.dialect=hiveql;
spark-sql>SELECT * FROM src LIMIT 1;
-- switch to "sql" dialect
spark-sql>SET spark.sql.dialect=sql;
spark-sql>SELECT * FROM src LIMIT 1;
-- switch to a custom dialect
spark-sql>SET spark.sql.dialect=com.xxx.xxx.SQL99Dialect;
spark-sql>SELECT * FROM src LIMIT 1;
-- register the non-exist SQL dialect
spark-sql> SET spark.sql.dialect=NotExistedClass;
spark-sql> SELECT * FROM src LIMIT 1;
-- Exception will be thrown and switch to default sql dialect ("sql" for SQLContext and "hiveql" for HiveContext)
```
Author: Cheng Hao <hao.cheng@intel.com>
Closes#4015 from chenghao-intel/sqlparser and squashes the following commits:
493775c [Cheng Hao] update the code as feedback
81a731f [Cheng Hao] remove the unecessary comment
aab0b0b [Cheng Hao] polish the code a little bit
49b9d81 [Cheng Hao] shrink the comment for rebasing
Fixed `java.sql.SQLException: No suitable driver found` when loading DataFrame into Spark SQL if the driver is supplied with `--jars` argument.
The problem is in `java.sql.DriverManager` class that can't access drivers loaded by Spark ClassLoader.
Wrappers that forward requests are created for these drivers.
Also, it's not necessary any more to include JDBC drivers in `--driver-class-path` in local mode, specifying in `--jars` argument is sufficient.
Author: Vyacheslav Baranov <slavik.baranov@gmail.com>
Closes#5782 from SlavikBaranov/SPARK-6913 and squashes the following commits:
510c43f [Vyacheslav Baranov] [SPARK-6913] Fixed review comments
b2a727c [Vyacheslav Baranov] [SPARK-6913] Fixed thread race on driver registration
c8294ae [Vyacheslav Baranov] [SPARK-6913] Fixed "No suitable driver found" when using using JDBC driver added with SparkContext.addJar
Now in spark sql optimizer we only push down right side filter for left semi join, actually we can push down left side filter because left semi join is doing filter on left table essentially.
Author: wangfei <wangfei1@huawei.com>
Author: scwf <wangfei1@huawei.com>
Closes#5677 from scwf/leftsemi and squashes the following commits:
483d205 [wangfei] update with master to fix compile issue
82df0e1 [wangfei] Merge branch 'master' of https://github.com/apache/spark into leftsemi
d68a053 [wangfei] added apply
8f48a3d [scwf] added test
ebadaa9 [wangfei] left filter push down for left semi join
Using newPredicate in NestedLoopJoin instead of InterpretedPredicate to make it can make use of code generation
Author: scwf <wangfei1@huawei.com>
Closes#5665 from scwf/NLP and squashes the following commits:
d19dd31 [scwf] improvement
a887c02 [scwf] improve for NLP boundCondition
Takes a column name/s and returns a new DataFrame that drops a column/s.
Author: rakeshchalasani <vnit.rakesh@gmail.com>
Closes#5818 from rakeshchalasani/SPARK-7280 and squashes the following commits:
ce2ec09 [rakeshchalasani] Minor edit
45c06f1 [rakeshchalasani] Change withColumnRename and format changes
f68945a [rakeshchalasani] Minor fix
0b9104d [rakeshchalasani] Drop one column at a time
289afd2 [rakeshchalasani] [SPARK-7280][SQL] Add "drop" column/s on a data frame
Finding frequent items with possibly false positives, using the algorithm described in `http://www.cs.umd.edu/~samir/498/karp.pdf`.
public API under:
```
df.stat.freqItems(cols: Array[String], support: Double = 0.001): DataFrame
```
The output is a local DataFrame having the input column names with `-freqItems` appended to it. This is a single pass algorithm that may return false positives, but no false negatives.
cc mengxr rxin
Let's get the implementations in, I can add python API in a follow up PR.
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#5799 from brkyvz/freq-items and squashes the following commits:
a6ec82c [Burak Yavuz] addressed comments v?
39b1bba [Burak Yavuz] removed toSeq
0915e23 [Burak Yavuz] addressed comments v2.1
3a5c177 [Burak Yavuz] addressed comments v2.0
482e741 [Burak Yavuz] removed old import
38e784d [Burak Yavuz] addressed comments v1.0
8279d4d [Burak Yavuz] added default value for support
3d82168 [Burak Yavuz] made base implementation
Now that 1.3 has been released, we should enable MiMa checks for the `sql` subproject.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#5727 from JoshRosen/enable-more-mima-checks and squashes the following commits:
3ad302b [Josh Rosen] Merge remote-tracking branch 'origin/master' into enable-more-mima-checks
0c48e4d [Josh Rosen] Merge remote-tracking branch 'origin/master' into enable-more-mima-checks
e276cee [Josh Rosen] Fix SQL MiMa checks via excludes and private[sql]
44d0d01 [Josh Rosen] Add back 'launcher' exclude
1aae027 [Josh Rosen] Enable MiMa checks for launcher and sql projects.
JIRA: https://issues.apache.org/jira/browse/SPARK-7196
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#5777 from viirya/jdbc_precision and squashes the following commits:
f40f5e6 [Liang-Chi Hsieh] Support precision and scale for NUMERIC type.
49acbf9 [Liang-Chi Hsieh] Add unit test.
a509e19 [Liang-Chi Hsieh] Support precision and scale of decimal type for JDBC.
small fixes regarding comments in PR #5761
cc rxin
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#5795 from brkyvz/split-followup and squashes the following commits:
369c522 [Burak Yavuz] changed wording a little
1ea456f [Burak Yavuz] Addressed follow up comments
Author: 云峤 <chensong.cs@alibaba-inc.com>
Closes#5778 from kaka1992/fix_codegenon_datetype_mismatch and squashes the following commits:
1ad4cff [云峤] SPARK-7234 fix dateType mismatch
Author: Cheng Hao <hao.cheng@intel.com>
Closes#5772 from chenghao-intel/specific_row and squashes the following commits:
2cd064d [Cheng Hao] scala style issue
60347a2 [Cheng Hao] SpecificMutableRow should take integer type as internal representation for DateType
This is built on top of kaka1992 's PR #5711 using Logical plans.
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#5761 from brkyvz/random-sample and squashes the following commits:
a1fb0aa [Burak Yavuz] remove unrelated file
69669c3 [Burak Yavuz] fix broken test
1ddb3da [Burak Yavuz] copy base
6000328 [Burak Yavuz] added python api and fixed test
3c11d1b [Burak Yavuz] fixed broken test
f400ade [Burak Yavuz] fix build errors
2384266 [Burak Yavuz] addressed comments v0.1
e98ebac [Burak Yavuz] [SPARK-7156][SQL] support RandomSplit in DataFrames
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#5712 from cloud-fan/minor and squashes the following commits:
be23064 [Wenchen Fan] fix java doc for DataFrame.agg
This patch adds managed-memory-based aggregation to Spark SQL / DataFrames. Instead of working with Java objects, this new aggregation path uses `sun.misc.Unsafe` to manipulate raw memory. This reduces the memory footprint for aggregations, resulting in fewer spills, OutOfMemoryErrors, and garbage collection pauses. As a result, this allows for higher memory utilization. It can also result in better cache locality since objects will be stored closer together in memory.
This feature can be eanbled by setting `spark.sql.unsafe.enabled=true`. For now, this feature is only supported when codegen is enabled and only supports aggregations for which the grouping columns are primitive numeric types or strings and aggregated values are numeric.
### Managing memory with sun.misc.Unsafe
This patch supports both on- and off-heap managed memory.
- In on-heap mode, memory addresses are identified by the combination of a base Object and an offset within that object.
- In off-heap mode, memory is addressed directly with 64-bit long addresses.
To support both modes, functions that manipulate memory accept both `baseObject` and `baseOffset` fields. In off-heap mode, we simply pass `null` as `baseObject`.
We allocate memory in large chunks, so memory fragmentation and allocation speed are not significant bottlenecks.
By default, we use on-heap mode. To enable off-heap mode, set `spark.unsafe.offHeap=true`.
To track allocated memory, this patch extends `SparkEnv` with an `ExecutorMemoryManager` and supplies each `TaskContext` with a `TaskMemoryManager`. These classes work together to track allocations and detect memory leaks.
### Compact tuple format
This patch introduces `UnsafeRow`, a compact row layout. In this format, each tuple has three parts: a null bit set, fixed length values, and variable-length values:
![image](https://cloud.githubusercontent.com/assets/50748/7328538/2fdb65ce-ea8b-11e4-9743-6c0f02bb7d1f.png)
- Rows are always 8-byte word aligned (so their sizes will always be a multiple of 8 bytes)
- The bit set is used for null tracking:
- Position _i_ is set if and only if field _i_ is null
- The bit set is aligned to an 8-byte word boundary.
- Every field appears as an 8-byte word in the fixed-length values part:
- If a field is null, we zero out the values.
- If a field is variable-length, the word stores a relative offset (w.r.t. the base of the tuple) that points to the beginning of the field's data in the variable-length part.
- Each variable-length data type can have its own encoding:
- For strings, the first word stores the length of the string and is followed by UTF-8 encoded bytes. If necessary, the end of the string is padded with empty bytes in order to ensure word-alignment.
For example, a tuple that consists 3 fields of type (int, string, string), with value (null, “data”, “bricks”) would look like this:
![image](https://cloud.githubusercontent.com/assets/50748/7328526/1e21959c-ea8b-11e4-9a28-a4350fe4a7b5.png)
This format allows us to compare tuples for equality by directly comparing their raw bytes. This also enables fast hashing of tuples.
### Hash map for performing aggregations
This patch introduces `UnsafeFixedWidthAggregationMap`, a hash map for performing aggregations where the aggregation result columns are fixed-with. This map's keys and values are `Row` objects. `UnsafeFixedWidthAggregationMap` is implemented on top of `BytesToBytesMap`, an append-only map which supports byte-array keys and values.
`BytesToBytesMap` stores pointers to key and value tuples. For each record with a new key, we copy the key and create the aggregation value buffer for that key and put them in a buffer. The hash table then simply stores pointers to the key and value. For each record with an existing key, we simply run the aggregation function to update the values in place.
This map is implemented using open hashing with triangular sequence probing. Each entry stores two words in a long array: the first word stores the address of the key and the second word stores the relative offset from the key tuple to the value tuple, as well as the key's 32-bit hashcode. By storing the full hashcode, we reduce the number of equality checks that need to be performed to handle position collisions ()since the chance of hashcode collision is much lower than position collision).
`UnsafeFixedWidthAggregationMap` allows regular Spark SQL `Row` objects to be used when probing the map. Internally, it encodes these rows into `UnsafeRow` format using `UnsafeRowConverter`. This conversion has a small overhead that can be eliminated in the future once we use UnsafeRows in other operators.
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/5725)
<!-- Reviewable:end -->
Author: Josh Rosen <joshrosen@databricks.com>
Closes#5725 from JoshRosen/unsafe and squashes the following commits:
eeee512 [Josh Rosen] Add converters for Null, Boolean, Byte, and Short columns.
81f34f8 [Josh Rosen] Follow 'place children last' convention for GeneratedAggregate
1bc36cc [Josh Rosen] Refactor UnsafeRowConverter to avoid unnecessary boxing.
017b2dc [Josh Rosen] Remove BytesToBytesMap.finalize()
50e9671 [Josh Rosen] Throw memory leak warning even in case of error; add warning about code duplication
70a39e4 [Josh Rosen] Split MemoryManager into ExecutorMemoryManager and TaskMemoryManager:
6e4b192 [Josh Rosen] Remove an unused method from ByteArrayMethods.
de5e001 [Josh Rosen] Fix debug vs. trace in logging message.
a19e066 [Josh Rosen] Rename unsafe Java test suites to match Scala test naming convention.
78a5b84 [Josh Rosen] Add logging to MemoryManager
ce3c565 [Josh Rosen] More comments, formatting, and code cleanup.
529e571 [Josh Rosen] Measure timeSpentResizing in nanoseconds instead of milliseconds.
3ca84b2 [Josh Rosen] Only zero the used portion of groupingKeyConversionScratchSpace
162caf7 [Josh Rosen] Fix test compilation
b45f070 [Josh Rosen] Don't redundantly store the offset from key to value, since we can compute this from the key size.
a8e4a3f [Josh Rosen] Introduce MemoryManager interface; add to SparkEnv.
0925847 [Josh Rosen] Disable MiMa checks for new unsafe module
cde4132 [Josh Rosen] Add missing pom.xml
9c19fc0 [Josh Rosen] Add configuration options for heap vs. offheap
6ffdaa1 [Josh Rosen] Null handling improvements in UnsafeRow.
31eaabc [Josh Rosen] Lots of TODO and doc cleanup.
a95291e [Josh Rosen] Cleanups to string handling code
afe8dca [Josh Rosen] Some Javadoc cleanup
f3dcbfe [Josh Rosen] More mod replacement
854201a [Josh Rosen] Import and comment cleanup
06e929d [Josh Rosen] More warning cleanup
ef6b3d3 [Josh Rosen] Fix a bunch of FindBugs and IntelliJ inspections
29a7575 [Josh Rosen] Remove debug logging
49aed30 [Josh Rosen] More long -> int conversion.
b26f1d3 [Josh Rosen] Fix bug in murmur hash implementation.
765243d [Josh Rosen] Enable optional performance metrics for hash map.
23a440a [Josh Rosen] Bump up default hash map size
628f936 [Josh Rosen] Use ints intead of longs for indexing.
92d5a06 [Josh Rosen] Address a number of minor code review comments.
1f4b716 [Josh Rosen] Merge Unsafe code into the regular GeneratedAggregate, guarded by a configuration flag; integrate planner support and re-enable all tests.
d85eeff [Josh Rosen] Add basic sanity test for UnsafeFixedWidthAggregationMap
bade966 [Josh Rosen] Comment update (bumping to refresh GitHub cache...)
b3eaccd [Josh Rosen] Extract aggregation map into its own class.
d2bb986 [Josh Rosen] Update to implement new Row methods added upstream
58ac393 [Josh Rosen] Use UNSAFE allocator in GeneratedAggregate (TODO: make this configurable)
7df6008 [Josh Rosen] Optimizations related to zeroing out memory:
c1b3813 [Josh Rosen] Fix bug in UnsafeMemoryAllocator.free():
738fa33 [Josh Rosen] Add feature flag to guard UnsafeGeneratedAggregate
c55bf66 [Josh Rosen] Free buffer once iterator has been fully consumed.
62ab054 [Josh Rosen] Optimize for fact that get() is only called on String columns.
c7f0b56 [Josh Rosen] Reuse UnsafeRow pointer in UnsafeRowConverter
ae39694 [Josh Rosen] Add finalizer as "cleanup method of last resort"
c754ae1 [Josh Rosen] Now that the store*() contract has been stregthened, we can remove an extra lookup
f764d13 [Josh Rosen] Simplify address + length calculation in Location.
079f1bf [Josh Rosen] Some clarification of the BytesToBytesMap.lookup() / set() contract.
1a483c5 [Josh Rosen] First version that passes some aggregation tests:
fc4c3a8 [Josh Rosen] Sketch how the converters will be used in UnsafeGeneratedAggregate
53ba9b7 [Josh Rosen] Start prototyping Java Row -> UnsafeRow converters
1ff814d [Josh Rosen] Add reminder to free memory on iterator completion
8a8f9df [Josh Rosen] Add skeleton for GeneratedAggregate integration.
5d55cef [Josh Rosen] Add skeleton for Row implementation.
f03e9c1 [Josh Rosen] Play around with Unsafe implementations of more string methods.
ab68e08 [Josh Rosen] Begin merging the UTF8String implementations.
480a74a [Josh Rosen] Initial import of code from Databricks unsafe utils repo.
Adds support for the math functions for DataFrames in PySpark.
rxin I love Davies.
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#5750 from brkyvz/python-math-udfs and squashes the following commits:
7c4f563 [Burak Yavuz] removed is_math
3c4adde [Burak Yavuz] cleanup imports
d5dca3f [Burak Yavuz] moved math functions to mathfunctions
25e6534 [Burak Yavuz] addressed comments v2.0
d3f7e0f [Burak Yavuz] addressed comments and added tests
7b7d7c4 [Burak Yavuz] remove tests for removed methods
33c2c15 [Burak Yavuz] fixed python style
3ee0c05 [Burak Yavuz] added python functions
Coalesce and repartition now show up as part of the query plan, rather than resulting in a new `DataFrame`.
cc rxin
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#5762 from brkyvz/df-repartition and squashes the following commits:
b1e76dd [Burak Yavuz] added documentation on repartitions
5807e35 [Burak Yavuz] renamed coalescepartitions
fa4509f [Burak Yavuz] rename coalesce
2c349b5 [Burak Yavuz] address comments
f2e6af1 [Burak Yavuz] add ticks
686c90b [Burak Yavuz] made coalesce and repartition a part of the query plan
Update Maven build plugin versions and centralize plugin version management
Author: Sean Owen <sowen@cloudera.com>
Closes#5720 from srowen/SPARK-7168 and squashes the following commits:
98a8947 [Sean Owen] Make install, deploy plugin versions explicit
4ecf3b2 [Sean Owen] Update Maven build plugin versions and centralize plugin version management
Add new config "spark.sql.parquet.output.committer.class" to allow custom parquet output committer and an output committer class specific to use on s3.
Fix compilation error introduced by https://github.com/apache/spark/pull/5042.
Respect ParquetOutputFormat.ENABLE_JOB_SUMMARY flag.
Author: Pei-Lun Lee <pllee@appier.com>
Closes#5525 from ypcat/spark-6352 and squashes the following commits:
54c6b15 [Pei-Lun Lee] error handling
472870e [Pei-Lun Lee] add back custom parquet output committer
ddd0f69 [Pei-Lun Lee] Merge branch 'master' of https://github.com/apache/spark into spark-6352
9ece5c5 [Pei-Lun Lee] compatibility with hadoop 1.x
8413fcd [Pei-Lun Lee] Merge branch 'master' of https://github.com/apache/spark into spark-6352
fe65915 [Pei-Lun Lee] add support for parquet config parquet.enable.summary-metadata
e17bf47 [Pei-Lun Lee] Merge branch 'master' of https://github.com/apache/spark into spark-6352
9ae7545 [Pei-Lun Lee] [SPARL-6352] [SQL] Change to allow custom parquet output committer.
0d540b9 [Pei-Lun Lee] [SPARK-6352] [SQL] add license
c42468c [Pei-Lun Lee] [SPARK-6352] [SQL] add test case
0fc03ca [Pei-Lun Lee] [SPARK-6532] [SQL] hide class DirectParquetOutputCommitter
769bd67 [Pei-Lun Lee] DirectParquetOutputCommitter
f75e261 [Pei-Lun Lee] DirectParquetOutputCommitter
Implemented almost all math functions found in scala.math (max, min and abs were already present).
cc mengxr marmbrus
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#5616 from brkyvz/math-udfs and squashes the following commits:
fb27153 [Burak Yavuz] reverted exception message
836a098 [Burak Yavuz] fixed test and addressed small comment
e5f0d13 [Burak Yavuz] addressed code review v2.2
b26c5fb [Burak Yavuz] addressed review v2.1
2761f08 [Burak Yavuz] addressed review v2
6588a5b [Burak Yavuz] fixed merge conflicts
b084e10 [Burak Yavuz] Addressed code review
029e739 [Burak Yavuz] fixed atan2 test
534cc11 [Burak Yavuz] added more tests, addressed comments
fa68dbe [Burak Yavuz] added double specific test data
937d5a5 [Burak Yavuz] use doubles instead of ints
8e28fff [Burak Yavuz] Added apache header
7ec8f7f [Burak Yavuz] Added math functions for DataFrames
Remove use of commons-lang in favor of commons-lang3 classes; remove commons-io use in favor of Guava
Author: Sean Owen <sowen@cloudera.com>
Closes#5703 from srowen/SPARK-7145 and squashes the following commits:
21fbe03 [Sean Owen] Remove use of commons-lang in favor of commons-lang3 classes; remove commons-io use in favor of Guava
according liancheng‘s comment in https://issues.apache.org/jira/browse/SPARK-6505, this patch remove the reflection call in HiveFunctionWrapper, and implement the functions named "deserializeObjectByKryo" and "serializeObjectByKryo" according the functions with the save name in
org.apache.hadoop.hive.ql.exec.Utilities.java
Author: baishuo <vc_java@hotmail.com>
Closes#5660 from baishuo/SPARK-6505-20150423 and squashes the following commits:
ae61ec4 [baishuo] modify code style
78d9fa3 [baishuo] modify code style
0b522a7 [baishuo] modify code style
a5ff9c7 [baishuo] Remove the reflection call in HiveFunctionWrapper
rename DataTypeParser.apply to DataTypeParser.parse to make it more clear and readable.
/cc rxin
Author: wangfei <wangfei1@huawei.com>
Closes#5710 from scwf/apply and squashes the following commits:
c319977 [wangfei] rename apply to parse
Author: Reynold Xin <rxin@databricks.com>
Closes#5705 from rxin/df-pid and squashes the following commits:
401018f [Reynold Xin] [SPARK-7152][SQL] Add a Column expression for partition ID.
Author: Yin Huai <yhuai@databricks.com>
Closes#5702 from yhuai/howToGenerateGoldenFiles and squashes the following commits:
9c4a7f8 [Yin Huai] Update readme to include instructions on generating golden answer files based on Hive 0.13.1.
This is a reopening of #4867.
A short summary of the issues resolved from the previous PR:
1. HTTPClient version mismatch: Selenium (used for UI tests) requires version 4.3.x, and Tachyon included 4.2.5 through a transitive dependency of its shaded thrift jar. To address this, Tachyon 0.6.3 will promote the transitive dependencies of the shaded jar so they can be excluded in spark.
2. Jackson-Mapper-ASL version mismatch: In lower versions of hadoop-client (ie. 1.0.4), version 1.0.1 is included. The parquet library used in spark sql requires version 1.8+. Its unclear to me why upgrading tachyon-client would cause this dependency to break. The solution was to exclude jackson-mapper-asl from hadoop-client.
It seems that the dependency management in spark-parent will not work on transitive dependencies, one way to make sure jackson-mapper-asl is included with the correct version is to add it as a top level dependency. The best solution would be to exclude the dependency in the modules which require a higher version, but that did not fix the unit tests. Any suggestions on the best way to solve this would be appreciated!
Author: Calvin Jia <jia.calvin@gmail.com>
Closes#5354 from calvinjia/upgrade_tachyon_0.6.3 and squashes the following commits:
0eefe4d [Calvin Jia] Handle httpclient version in maven dependency management. Remove httpclient version setting from profiles.
7c00dfa [Calvin Jia] Set httpclient version to 4.3.2 for selenium. Specify version of httpclient for sql/hive (previously 4.2.5 transitive dependency of libthrift).
9263097 [Calvin Jia] Merge master to test latest changes
dbfc1bd [Calvin Jia] Use Tachyon 0.6.4 for cleaner dependencies.
e2ff80a [Calvin Jia] Exclude the jetty and curator promoted dependencies from tachyon-client.
a3a29da [Calvin Jia] Update tachyon-client exclusions.
0ae6c97 [Calvin Jia] Change tachyon version to 0.6.3
a204df9 [Calvin Jia] Update make distribution tachyon version.
a93c94f [Calvin Jia] Exclude jackson-mapper-asl from hadoop client since it has a lower version than spark's expected version.
a8a923c [Calvin Jia] Exclude httpcomponents from Tachyon
910fabd [Calvin Jia] Update to master
eed9230 [Calvin Jia] Update tachyon version to 0.6.1.
11907b3 [Calvin Jia] Use TachyonURI for tachyon paths instead of strings.
71bf441 [Calvin Jia] Upgrade Tachyon client version to 0.6.0.
Also took the chance to improve documentation for various types.
Author: Reynold Xin <rxin@databricks.com>
Closes#5675 from rxin/data-type-matching-expr and squashes the following commits:
0f31856 [Reynold Xin] One more function documentation.
27c1973 [Reynold Xin] Added more documentation.
336a36d [Reynold Xin] [SQL] Fixed expression data type matching.
It was over 1000 lines of code, making it harder to find all the types. Only moved code around, and didn't change any.
Author: Reynold Xin <rxin@databricks.com>
Closes#5670 from rxin/break-types and squashes the following commits:
8c59023 [Reynold Xin] Check in missing files.
dcd5193 [Reynold Xin] [SQL] Break dataTypes.scala into multiple files.
Author: Vinod K C <vinod.kc@huawei.com>
Closes#5633 from vinodkc/use_correct_classloader_driverload and squashes the following commits:
73c5380 [Vinod K C] Use correct ClassLoader for JDBC Driver
Author: Cheng Hao <hao.cheng@intel.com>
Closes#5625 from chenghao-intel/transform and squashes the following commits:
5ec1dd2 [Cheng Hao] fix the deadlock issue in ScriptTransform
Author: Prashant Sharma <prashant.s@imaginea.com>
Closes#5652 from ScrapCodes/hf/compilation-fix-scala-2.11 and squashes the following commits:
819ff06 [Prashant Sharma] [HOTFIX] Fix compilation for scala 2.11.
Also renamed JvmType to InternalType.
Author: Reynold Xin <rxin@databricks.com>
Closes#5651 from rxin/native-to-atomic-type and squashes the following commits:
cbd4028 [Reynold Xin] [SPARK-7069][SQL] Rename NativeType -> AtomicType.
Author: Reynold Xin <rxin@databricks.com>
Closes#5646 from rxin/remove-primitive-type and squashes the following commits:
01b673d [Reynold Xin] [SPARK-7068][SQL] Remove PrimitiveType
Added in #5475. Pointed as broken in #5639.
/cc marmbrus
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#5640 from viirya/fix_cached_test and squashes the following commits:
c0cf69a [Liang-Chi Hsieh] Fix broken cached test.
Author: Reynold Xin <rxin@databricks.com>
Closes#5642 from rxin/mllib-native-type and squashes the following commits:
e23af5b [Reynold Xin] Remove StringType
7cbb205 [Reynold Xin] [SPARK-7066][MLlib] VectorAssembler should use NumericType and StringType, not NativeType.
This pr convert java.sql.Date type into Int for JDBCRDD.
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#5590 from adrian-wang/datebug and squashes the following commits:
f897b81 [Daoyuan Wang] add a test case
3c9184c [Daoyuan Wang] fix date type convertion in jdbcrdd
Author: Reynold Xin <rxin@databricks.com>
Closes#5638 from rxin/joinUsing and squashes the following commits:
13e9cc9 [Reynold Xin] Code review + Python.
b1bd914 [Reynold Xin] [SPARK-7059][SQL] Create a DataFrame join API to facilitate equijoin and self join.
I was looking at the code gen code and got confused by a few of use cases of apply, in particular apply on objects. So I went ahead and changed a few of them. Hopefully slightly more clear with a proper verb.
Author: Reynold Xin <rxin@databricks.com>
Closes#5624 from rxin/apply-rename and squashes the following commits:
ee45034 [Reynold Xin] [SQL] Rename some apply functions.
This change adds some new utility code to handle shutdown hooks in
Spark. The main goal is to take advantage of Hadoop 2.x's API for
shutdown hooks, which allows Spark to register a hook that will
run before the one that cleans up HDFS clients, and thus avoids
some races that would cause exceptions to show up and other issues
such as failure to properly close event logs.
Unfortunately, Hadoop 1.x does not have such APIs, so in that case
correctness is still left to chance.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#5560 from vanzin/SPARK-6014 and squashes the following commits:
edfafb1 [Marcelo Vanzin] Better scaladoc.
fcaeedd [Marcelo Vanzin] Merge branch 'master' into SPARK-6014
e7039dc [Marcelo Vanzin] [SPARK-6014] [core] Revamp Spark shutdown hooks, fix shutdown races.
It's a bug while do query like:
```sql
select d from (select explode(array(1,1)) d from src limit 1) t
```
And it will throws exception like:
```
org.apache.spark.sql.AnalysisException: cannot resolve 'd' given input columns _c0; line 1 pos 7
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:48)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:45)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249)
at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:103)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:117)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:116)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
```
To solve the bug, it requires code refactoring for UDTF
The major changes are about:
* Simplifying the UDTF development, UDTF will manage the output attribute names any more, instead, the `logical.Generate` will handle that properly.
* UDTF will be asked for the output schema (data types) during the logical plan analyzing.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#4602 from chenghao-intel/explode_bug and squashes the following commits:
c2a5132 [Cheng Hao] add back resolved for Alias
556e982 [Cheng Hao] revert the unncessary change
002c361 [Cheng Hao] change the rule of resolved for Generate
04ae500 [Cheng Hao] add qualifier only for generator output
5ee5d2c [Cheng Hao] prepend the new qualifier
d2e8b43 [Cheng Hao] Update the code as feedback
ca5e7f4 [Cheng Hao] shrink the commits
liancheng mengxr this is similar to #5146.
Author: Punya Biswal <pbiswal@palantir.com>
Closes#5578 from punya/feature/SPARK-6996 and squashes the following commits:
d56c3e0 [Punya Biswal] Fix imports
c7e308b [Punya Biswal] Support java iterable types in POJOs
5e00685 [Punya Biswal] Support map types in java beans
https://issues.apache.org/jira/browse/SPARK-6969
Author: Yin Huai <yhuai@databricks.com>
Closes#5583 from yhuai/refreshTableRefreshDataCache and squashes the following commits:
1e5142b [Yin Huai] Add todo.
92b2498 [Yin Huai] Minor updates.
367df92 [Yin Huai] Recache data in the command of REFRESH TABLE.
For `GetField` outside `UnresolvedAttribute`, we will throw exception in `Analyzer`.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#5588 from cloud-fan/tmp and squashes the following commits:
7ac74d2 [Wenchen Fan] small refactor
It looked weird that up to now there was no way in Spark's Scala API to access fields of `DataFrame/sql.Row` by name, only by their index.
This tries to solve this issue.
Author: vidmantas zemleris <vidmantas@vinted.com>
Closes#5573 from vidma/features/row-with-named-fields and squashes the following commits:
6145ae3 [vidmantas zemleris] [SPARK-6994][SQL] Allow to fetch field values by name on Row
9564ebb [vidmantas zemleris] [SPARK-6994][SQL] Add fieldIndex to schema (StructType)
JIRA https://issues.apache.org/jira/browse/SPARK-6635
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#5541 from viirya/replace_with_column and squashes the following commits:
b539c7b [Liang-Chi Hsieh] For comment.
72f35b1 [Liang-Chi Hsieh] DataFrame.withColumn can replace original column with identical column name.
Author: Michael Armbrust <michael@databricks.com>
Closes#5545 from marmbrus/addCoalesce and squashes the following commits:
9fdf3f6 [Michael Armbrust] [SPARK-6972][SQL] Add Coalesce to DataFrame
Otherwise we cannot add jars with drivers after the fact.
Author: Michael Armbrust <michael@databricks.com>
Closes#5543 from marmbrus/jdbcClassloader and squashes the following commits:
d9930f3 [Michael Armbrust] fix imports
73d0614 [Michael Armbrust] [SPARK-6966][SQL] Use correct ClassLoader for JDBC Driver
JIRA https://issues.apache.org/jira/browse/SPARK-6899
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#5517 from viirya/fix_codegen_average and squashes the following commits:
8ae5f65 [Liang-Chi Hsieh] Add the case of DecimalType.Unlimited to Average.
`foreachUp` should runs the given function recursively on [[children]] then on this node(just like transformUp). The current implementation does not follow this.
This will leads to checkanalysis do not check from bottom of logical tree.
Author: scwf <wangfei1@huawei.com>
Author: Fei Wang <wangfei1@huawei.com>
Closes#5518 from scwf/patch-1 and squashes the following commits:
18e28b2 [scwf] added a test case
1ccbfa8 [Fei Wang] fix foreachUp
Fix this error by adding BinaryType comparor in GenerateOrdering.
JIRA https://issues.apache.org/jira/browse/SPARK-6927
Author: 云峤 <chensong.cs@alibaba-inc.com>
Closes#5524 from kaka1992/fix-codegen-sort and squashes the following commits:
d7e2afe [云峤] fix codegen sorting error
SparkSQL CLI has an option --database as follows.
But, the option --database is ignored.
```
$ spark-sql --help
:
CLI options:
:
--database <databasename> Specify the database to use
```
Author: Jin Adachi <adachij2002@yahoo.co.jp>
Author: adachij <adachij@nttdata.co.jp>
Closes#5345 from adachij2002/SPARK-6694 and squashes the following commits:
8659084 [Jin Adachi] Merge branch 'master' of https://github.com/apache/spark into SPARK-6694
0301eb9 [Jin Adachi] Merge branch 'master' of https://github.com/apache/spark into SPARK-6694
df81086 [Jin Adachi] Modify code style.
846f83e [Jin Adachi] Merge branch 'master' of https://github.com/apache/spark into SPARK-6694
dbe8c63 [Jin Adachi] Change file permission to 644.
7b58f42 [Jin Adachi] Merge branch 'master' of https://github.com/apache/spark into SPARK-6694
c581d06 [Jin Adachi] Add an option --database test
db56122 [Jin Adachi] Merge branch 'SPARK-6694' of https://github.com/adachij2002/spark into SPARK-6694
ee09fa5 [adachij] Merge branch 'master' into SPARK-6694
c804c03 [adachij] SparkSQL CLI must be able to specify an option --database on the command line.
[SPARK-5277][SQL] - SparkSqlSerializer doesn't always register user specified KryoRegistrators
There were a few places where new SparkSqlSerializer instances were created with new, empty SparkConfs resulting in user specified registrators sometimes not getting initialized.
The fix is to try and pull a conf from the SparkEnv, and construct a new conf (that loads defaults) if one cannot be found.
The changes touched:
1) SparkSqlSerializer's resource pool (this appears to fix the issue in the comment)
2) execution.Exchange (for all of the partitioners)
3) execution.Limit (for the HashPartitioner)
A few tests were added to ColumnTypeSuite, ensuring that a custom registrator and serde is initialized and used when in-memory columns are written.
Author: Max Seiden <max@platfora.com>
This patch had conflicts when merged, resolved by
Committer: Michael Armbrust <michael@databricks.com>
Closes#5237 from mhseiden/sql_udt_kryo and squashes the following commits:
3175c2f [Max Seiden] [SPARK-5277][SQL] - address code review comments
e5011fb [Max Seiden] [SPARK-5277][SQL] - SparkSqlSerializer does not register user specified KryoRegistrators
Thanks for the initial work from Ishiihara in #3173
This PR introduce a new join method of sort merge join, which firstly ensure that keys of same value are in the same partition, and inside each partition the Rows are sorted by key. Then we can run down both sides together, find matched rows using [sort merge join](http://en.wikipedia.org/wiki/Sort-merge_join). In this way, we don't have to store the whole hash table of one side as hash join, thus we have less memory usage. Also, this PR would benefit from #3438 , making the sorting phrase much more efficient.
We introduced a new configuration of "spark.sql.planner.sortMergeJoin" to switch between this(`true`) and ShuffledHashJoin(`false`), probably we want the default value of it be `false` at first.
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Author: Michael Armbrust <michael@databricks.com>
This patch had conflicts when merged, resolved by
Committer: Michael Armbrust <michael@databricks.com>
Closes#5208 from adrian-wang/smj and squashes the following commits:
2493b9f [Daoyuan Wang] fix style
5049d88 [Daoyuan Wang] propagate rowOrdering for RangePartitioning
f91a2ae [Daoyuan Wang] yin's comment: use external sort if option is enabled, add comments
f515cd2 [Daoyuan Wang] yin's comment: outputOrdering, join suite refine
ec8061b [Daoyuan Wang] minor change
413fd24 [Daoyuan Wang] Merge pull request #3 from marmbrus/pr/5208
952168a [Michael Armbrust] add type
5492884 [Michael Armbrust] copy when ordering
7ddd656 [Michael Armbrust] Cleanup addition of ordering requirements
b198278 [Daoyuan Wang] inherit ordering in project
c8e82a3 [Daoyuan Wang] fix style
6e897dd [Daoyuan Wang] hide boundReference from manually construct RowOrdering for key compare in smj
8681d73 [Daoyuan Wang] refactor Exchange and fix copy for sorting
2875ef2 [Daoyuan Wang] fix changed configuration
61d7f49 [Daoyuan Wang] add omitted comment
00a4430 [Daoyuan Wang] fix bug
078d69b [Daoyuan Wang] address comments: add comments, do sort in shuffle, and others
3af6ba5 [Daoyuan Wang] use buffer for only one side
171001f [Daoyuan Wang] change default outputordering
47455c9 [Daoyuan Wang] add apache license ...
a28277f [Daoyuan Wang] fix style
645c70b [Daoyuan Wang] address comments using sort
068c35d [Daoyuan Wang] fix new style and add some tests
925203b [Daoyuan Wang] address comments
07ce92f [Daoyuan Wang] fix ArrayIndexOutOfBound
42fca0e [Daoyuan Wang] code clean
e3ec096 [Daoyuan Wang] fix comment style..
2edd235 [Daoyuan Wang] fix outputpartitioning
57baa40 [Daoyuan Wang] fix sort eval bug
303b6da [Daoyuan Wang] fix several errors
95db7ad [Daoyuan Wang] fix brackets for if-statement
4464f16 [Daoyuan Wang] fix error
880d8e9 [Daoyuan Wang] sort merge join for spark sql
Even if we wrap column names in backticks like `` `a#$b.c` ``, we still handle the "." inside column name specially. I think it's fragile to use a special char to split name parts, why not put name parts in `UnresolvedAttribute` directly?
Author: Wenchen Fan <cloud0fan@outlook.com>
This patch had conflicts when merged, resolved by
Committer: Michael Armbrust <michael@databricks.com>
Closes#5511 from cloud-fan/6898 and squashes the following commits:
48e3e57 [Wenchen Fan] more style fix
820dc45 [Wenchen Fan] do not ignore newName in UnresolvedAttribute
d81ad43 [Wenchen Fan] fix style
11699d6 [Wenchen Fan] completely support special chars in column names
JIRA: https://issues.apache.org/jira/browse/SPARK-6844
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#5475 from viirya/cache_memory_leak and squashes the following commits:
0b41235 [Liang-Chi Hsieh] fix style.
dc1d5d5 [Liang-Chi Hsieh] For comments.
78af229 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into cache_memory_leak
26c9bb6 [Liang-Chi Hsieh] Add configuration to enable in-memory table scan accumulators.
1c3b06e [Liang-Chi Hsieh] Clean up accumulators used in InMemoryRelation when it is uncached.
This PR change the internal representation for StringType from java.lang.String to UTF8String, which is implemented use ArrayByte.
This PR should not break any public API, Row.getString() will still return java.lang.String.
This is the first step of improve the performance of String in SQL.
cc rxin
Author: Davies Liu <davies@databricks.com>
Closes#5350 from davies/string and squashes the following commits:
3b7bfa8 [Davies Liu] fix schema of AddJar
2772f0d [Davies Liu] fix new test failure
6d776a9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into string
59025c8 [Davies Liu] address comments from @marmbrus
341ec2c [Davies Liu] turn off scala style check in UTF8StringSuite
744788f [Davies Liu] Merge branch 'master' of github.com:apache/spark into string
b04a19c [Davies Liu] add comment for getString/setString
08d897b [Davies Liu] Merge branch 'master' of github.com:apache/spark into string
5116b43 [Davies Liu] rollback unrelated changes
1314a37 [Davies Liu] address comments from Yin
867bf50 [Davies Liu] fix String filter push down
13d9d42 [Davies Liu] Merge branch 'master' of github.com:apache/spark into string
2089d24 [Davies Liu] add hashcode check back
ac18ae6 [Davies Liu] address comment
fd11364 [Davies Liu] optimize UTF8String
8d17f21 [Davies Liu] fix hive compatibility tests
e5fa5b8 [Davies Liu] remove clone in UTF8String
28f3d81 [Davies Liu] Merge branch 'master' of github.com:apache/spark into string
28d6f32 [Davies Liu] refactor
537631c [Davies Liu] some comment about Date
9f4c194 [Davies Liu] convert data type for data source
956b0a4 [Davies Liu] fix hive tests
73e4363 [Davies Liu] Merge branch 'master' of github.com:apache/spark into string
9dc32d1 [Davies Liu] fix some hive tests
23a766c [Davies Liu] refactor
8b45864 [Davies Liu] fix codegen with UTF8String
bb52e44 [Davies Liu] fix scala style
c7dd4d2 [Davies Liu] fix some catalyst tests
38c303e [Davies Liu] fix python sql tests
5f9e120 [Davies Liu] fix sql tests
6b499ac [Davies Liu] fix style
a85fb27 [Davies Liu] refactor
d32abd1 [Davies Liu] fix utf8 for python api
4699c3a [Davies Liu] use Array[Byte] in UTF8String
21f67c6 [Davies Liu] cleanup
685fd07 [Davies Liu] use UTF8String instead of String for StringType
JIRA: https://issues.apache.org/jira/browse/SPARK-6730
It is very possible that keyword will be used as identifier in `OPTIONS`, this pr makes it works.
However, another approach is that we can request that `OPTIONS` can't include keywords and has to use alternative identifier (e.g. table -> cassandraTable) if needed.
If so, please let me know to close this pr. Thanks.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#5520 from viirya/relax_options and squashes the following commits:
339fd68 [Liang-Chi Hsieh] Use regex parser.
92be11c [Liang-Chi Hsieh] Allow using keyword as identifier in OPTIONS.
SPARK-6440 #5424 import guava but did not promote guava dependency to compile level.
[INFO] compiler plugin: BasicArtifact(org.scalamacros,paradise_2.10.4,2.0.1,null)
[info] Compiling 8 Scala sources to /root/projects/spark/sql/hive-thriftserver/target/scala-2.10/classes...
[error] bad symbolic reference. A signature in Utils.class refers to term util
[error] in package com.google.common which is not available.
[error] It may be completely missing from the current classpath, or the version on
[error] the classpath might be incompatible with the version used when compiling Utils.class.
[error]
[error] while compiling: /root/projects/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLEnv.scala
[error] during phase: erasure
[error] library version: version 2.10.4
[error] compiler version: version 2.10.4
[error] reconstructed args: -deprecation -classpath
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#5507 from adrian-wang/guava and squashes the following commits:
c337dad [Daoyuan Wang] fix compile error
JIRA https://issues.apache.org/jira/browse/SPARK-6871
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#5480 from viirya/no_cte_after_cte and squashes the following commits:
4da3712 [Liang-Chi Hsieh] Create new test.
40b38ed [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into no_cte_after_cte
0edf568 [Liang-Chi Hsieh] for comments.
6591b79 [Liang-Chi Hsieh] WITH clause in CTE can not following another WITH clause.
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#4586 from adrian-wang/addjar and squashes the following commits:
efdd602 [Daoyuan Wang] move jar to another place
6c707e8 [Daoyuan Wang] restrict hive version for test
32c4fb8 [Daoyuan Wang] fix style and add a test
9957d87 [Daoyuan Wang] use sessionstate classloader in makeRDDforTable
0810e71 [Daoyuan Wang] remove variable substitution
1898309 [Daoyuan Wang] fix classnotfound
95a40da [Daoyuan Wang] support env argus in add jar, and set add jar ret to 0
Currently `min` is not supported in code generation. This pr adds the support for it.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#5487 from viirya/add_min_codegen and squashes the following commits:
0ddec23 [Liang-Chi Hsieh] Add code generation support for Min.
Because `Average` is a `PartialAggregate`, we never get a `Average` node when reaching `HashAggregation` to prepare `GeneratedAggregate`.
That is why in SQLQuerySuite there is already a test for `avg` with codegen. And it works.
But we can find a case in `GeneratedAggregate` to deal with `Average`. Based on the above, we actually never execute this case.
So we can remove this case from `GeneratedAggregate`.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#4996 from viirya/add_average_codegened and squashes the following commits:
621c12f [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into add_average_codegened
368cfbc [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into add_average_codegened
74926d1 [Liang-Chi Hsieh] Add Average in canBeCodeGened lists.
In `leftsemijoin.q`, there is a data loading command for table `sales` already, but in `TestHive`, it also created the table `sales`, which causes duplicated records inserted into the `sales`.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#4506 from chenghao-intel/df_table and squashes the following commits:
0be05f7 [Cheng Hao] Remove the table `sales` creating from TestHive
We need add copy before call externalsort.
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#5481 from adrian-wang/extsort and squashes the following commits:
9611586 [Daoyuan Wang] fix bug in external sort
Add a DirectParquetOutputCommitter class that skips _temporary directory when saving to s3. Add new config value "spark.sql.parquet.useDirectParquetOutputCommitter" (default false) to choose between the default output committer.
Author: Pei-Lun Lee <pllee@appier.com>
Closes#5042 from ypcat/spark-6352 and squashes the following commits:
e17bf47 [Pei-Lun Lee] Merge branch 'master' of https://github.com/apache/spark into spark-6352
9ae7545 [Pei-Lun Lee] [SPARL-6352] [SQL] Change to allow custom parquet output committer.
0d540b9 [Pei-Lun Lee] [SPARK-6352] [SQL] add license
c42468c [Pei-Lun Lee] [SPARK-6352] [SQL] add test case
0fc03ca [Pei-Lun Lee] [SPARK-6532] [SQL] hide class DirectParquetOutputCommitter
769bd67 [Pei-Lun Lee] DirectParquetOutputCommitter
f75e261 [Pei-Lun Lee] DirectParquetOutputCommitter
Author: nyaapa <nyaapa@gmail.com>
Closes#5424 from nyaapa/master and squashes the following commits:
6b717aa [nyaapa] [SPARK-6440][CORE] Remove Utils.localIpAddressHostname, Utils.localIpAddressURI and Utils.getAddressHostName; make Utils.localIpAddress private; rename Utils.localHostURI into Utils.localHostNameForURI; use Utils.localHostName in org.apache.spark.streaming.kinesis.KinesisReceiver and org.apache.spark.sql.hive.thriftserver.SparkSQLEnv
2098081 [nyaapa] [SPARK-6440][CORE] style fixes and use getHostAddress instead of getHostName
84763d7 [nyaapa] [SPARK-6440][CORE]Handle IPv6 addresses properly when constructing URI
Supports replacing values with other values in DataFrames.
Python support should be in a separate pull request.
Author: Reynold Xin <rxin@databricks.com>
Closes#5282 from rxin/df-na-replace and squashes the following commits:
4b72434 [Reynold Xin] Removed println.
c8d9946 [Reynold Xin] col -> cols
fbb3c21 [Reynold Xin] [SPARK-6562][SQL] DataFrame.replace
The method `resolveGetField` isn't belong to `LogicalPlan` logically and didn't access any members of it.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#5435 from cloud-fan/tmp and squashes the following commits:
9a66c83 [Wenchen Fan] code clean up
This PR adds internal UDTs for expressions that are hijacking existing data types.
The following UDTs are added:
* `HyperLogLogUDT` (`BinaryType` as the SQL type) for `ApproxCountDistinctPartition`
* `OpenHashSetUDT` (`ArrayType` as the SQL type) for `CollectHashSet`, `NewSet`, `AddItemToSet`, and `CombineSets`.
I am also adding more unit tests for aggregation with code gen enabled.
JIRA: https://issues.apache.org/jira/browse/SPARK-6367
Author: Yin Huai <yhuai@databricks.com>
Closes#5094 from yhuai/expressionType and squashes the following commits:
8bcd11a [Yin Huai] Return types.
61a1d66 [Yin Huai] Merge remote-tracking branch 'upstream/master' into expressionType
e8b4599 [Yin Huai] Merge remote-tracking branch 'upstream/master' into expressionType
2753156 [Yin Huai] Ignore aggregations having sum functions for now.
b5eb259 [Yin Huai] Case object for HyperLogLog type.
00ebdbd [Yin Huai] deserialize/serialize.
54b87ae [Yin Huai] Add UDTs for expressions that return HyperLogLog and OpenHashSet.
Author: Yin Huai <yhuai@databricks.com>
Closes#5381 from yhuai/parquetPath2 and squashes the following commits:
fe296b4 [Yin Huai] Create new Path to take care special characters in the authority of a Path's URI.
This is useful for using pre-defined UDFs in SQLContext;
val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")
val sqlctx = df.sqlContext
sqlctx.udf.register("simpleUdf", (v: Int) => v * v)
df.select($"id", sqlctx.callUdf("simpleUdf", $"value"))
Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
Closes#5061 from maropu/SupportUDFConversionInSparkContext and squashes the following commits:
f858aff [Takeshi YAMAMURO] Move the function into functions.scala
afd0380 [Takeshi YAMAMURO] Add a return type of callUDF
599b76c [Takeshi YAMAMURO] Remove the implicit conversion and add SqlContext#callUdf
8b56f10 [Takeshi YAMAMURO] Support an implicit conversion from udf"name" to an UDF defined in SQLContext
[SHOW PRINCIPALS role_name]
Lists all roles and users who belong to this role.
Only the admin role has privilege for this.
[SHOW COMPACTIONS]
It returns a list of all tables and partitions currently being compacted or scheduled for compaction when Hive transactions are being used.
[SHOW TRANSACTIONS]
It is for use by administrators when Hive transactions are being used. It returns a list of all currently open and aborted transactions in the system.
Author: DoingDone9 <799203320@qq.com>
Author: Zhongshuai Pei <799203320@qq.com>
Author: Xu Tingjun <xutingjun@huawei.com>
Closes#4902 from DoingDone9/SHOW_PRINCIPALS and squashes the following commits:
4add42f [Zhongshuai Pei] for test
311f806 [Zhongshuai Pei] for test
0c7550a [DoingDone9] Update HiveQl.scala
c8aeb1c [Xu Tingjun] aa
802261c [DoingDone9] Merge pull request #7 from apache/master
d00303b [DoingDone9] Merge pull request #6 from apache/master
98b134f [DoingDone9] Merge pull request #5 from apache/master
161cae3 [DoingDone9] Merge pull request #4 from apache/master
c87e8b6 [DoingDone9] Merge pull request #3 from apache/master
cb1852d [DoingDone9] Merge pull request #2 from apache/master
c3f046f [DoingDone9] Merge pull request #1 from apache/master
This PR follow up PR #3907 & #3891 & #4356.
According to marmbrus liancheng 's comments, I try to use fs.globStatus to retrieve all FileStatus objects under path(s), and then do the filtering locally.
[1]. get pathPattern by path, and put it into pathPatternSet. (hdfs://cluster/user/demo/2016/08/12 -> hdfs://cluster/user/demo/*/*/*)
[2]. retrieve all FileStatus objects ,and cache them by undating existPathSet.
[3]. do the filtering locally
[4]. if we have new pathPattern,do 1,2 step again. (external table maybe have more than one partition pathPattern)
chenghao-intel jeanlyn
Author: lazymam500 <lazyman500@gmail.com>
Author: lazyman <lazyman500@gmail.com>
Closes#5059 from lazyman500/SPARK-5068 and squashes the following commits:
5bfcbfd [lazyman] move spark.sql.hive.verifyPartitionPath to SQLConf,fix scala style
e1d6386 [lazymam500] fix scala style
f23133f [lazymam500] bug fix
47e0023 [lazymam500] fix scala style,add config flag,break the chaining
04c443c [lazyman] SPARK-5068: fix bug when partition path doesn't exists #2
41f60ce [lazymam500] Merge pull request #1 from apache/master
Author: haiyang <huhaiyang@huawei.com>
Closes#4929 from haiyangsea/cte and squashes the following commits:
220b67d [haiyang] add golden files for cte test
d3c7681 [haiyang] Merge branch 'master' into cte-repair
0ba2070 [haiyang] modify code style
9ce6b58 [haiyang] fix conflict
ff74741 [haiyang] add comment for With plan
0d56af4 [haiyang] code indention
776a440 [haiyang] add comments for resolve relation strategy
2fccd7e [haiyang] add comments for resolve relation strategy
241bbe2 [haiyang] fix cte problem of view
e9e1237 [haiyang] fix test case problem
614182f [haiyang] add test cases for CTE feature
32e415b [haiyang] add comment
1cc8c15 [haiyang] support with
03f1097 [haiyang] support with
e960099 [haiyang] support with
9aaa874 [haiyang] support with
0566978 [haiyang] support with
a99ecd2 [haiyang] support with
c3fa4c2 [haiyang] support with
3b6077f [haiyang] support with
5f8abe3 [haiyang] support with
4572b05 [haiyang] support with
f801f54 [haiyang] support with
In this PR, "analyser" is changed to "analyzer" to keep a consistent naming. Some other typos are also fixed.
Author: Guancheng (G.C.) Chen <chenguancheng@gmail.com>
Closes#5474 from gchen/sql-typo and squashes the following commits:
70e6e76 [Guancheng (G.C.) Chen] Merge branch 'sql-typo' of github.com:gchen/spark into sql-typo
fb7a6e2 [Guancheng (G.C.) Chen] fix typo in sql
37e3da1 [Guancheng (G.C.) Chen] fix type in sql
https://issues.apache.org/jira/browse/SPARK-6611
Author: Santiago M. Mola <santiago.mola@sap.com>
Closes#5271 from smola/features/integer-parse and squashes the following commits:
f5c1c64 [Santiago M. Mola] [SPARK-6611] Add support for INTEGER as synonym of INT.
Since now kyro serializer is used for `GeneralHashedRelation` whether kyro is enabled or not, it is better to register Java `HashMap` in `SparkSqlSerializer`.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#5465 from viirya/register_hashmap and squashes the following commits:
9062601 [Liang-Chi Hsieh] Register Java HashMap for SparkSqlSerializer.