## What changes were proposed in this pull request?
Now we should be able to convert all logical plans to SQL string, if they are parsed from hive query. This PR changes the error handling to throw exceptions instead of just log.
We will send new PRs for spotted bugs, and merge this one after all bugs are fixed.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11782 from cloud-fan/test.
## What changes were proposed in this pull request?
Fix some nits discussed in https://github.com/apache/spark/pull/11776#issuecomment-198207419
use !rdd.isEmpty instead of rdd.count > 0
use static instead of AtomicInteger
remove unneeded "throws Exception"
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#11821 from zhengruifeng/je_fix.
## What changes were proposed in this pull request?
The fix is simple, use the existing `CombineUnions` rule to combine adjacent Unions before build SQL string.
## How was this patch tested?
The re-enabled test
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11818 from cloud-fan/bug-fix.
## What changes were proposed in this pull request?
When trainingSummary is None, it should throw ```RuntimeException```.
cc mengxr
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#11784 from yanboliang/fix-summary.
## What changes were proposed in this pull request?
This patch updates documentations for Datasets. I also updated some internal documentation for exchange/broadcast.
## How was this patch tested?
Just documentation/api stability update.
Author: Reynold Xin <rxin@databricks.com>
Closes#11814 from rxin/dataset-docs.
## What changes were proposed in this pull request?
JIRA: https://issues.apache.org/jira/browse/SPARK-13930
Recently the fast serialization has been introduced to collecting DataFrame/Dataset (#11664). The same technology can be used on collect limit operator too.
## How was this patch tested?
Add a benchmark for collect limit to `BenchmarkWholeStageCodegen`.
Without this patch:
model name : Westmere E56xx/L56xx/X56xx (Nehalem-C)
collect limit: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------
collect limit 1 million 3413 / 3768 0.3 3255.0 1.0X
collect limit 2 millions 9728 / 10440 0.1 9277.3 0.4X
With this patch:
model name : Westmere E56xx/L56xx/X56xx (Nehalem-C)
collect limit: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------
collect limit 1 million 833 / 1284 1.3 794.4 1.0X
collect limit 2 millions 3348 / 4005 0.3 3193.3 0.2X
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes#11759 from viirya/execute-take.
## What changes were proposed in this pull request?
This PR revises Dataset API ScalaDoc. All public methods are divided into the following groups
* `groupname basic`: Basic Dataset functions
* `groupname action`: Actions
* `groupname untypedrel`: Untyped Language Integrated Relational Queries
* `groupname typedrel`: Typed Language Integrated Relational Queries
* `groupname func`: Functional Transformations
* `groupname rdd`: RDD Operations
* `groupname output`: Output Operations
`since` tag and sample code are also updated. We may want to add more sample code for typed APIs.
## How was this patch tested?
Documentation change. Checked by building unidoc locally.
Author: Cheng Lian <lian@databricks.com>
Closes#11769 from liancheng/spark-13826-ds-api-doc.
PR #11696 introduced a complex pattern match that broke Scala 2.10 match unreachability check and caused build failure. This PR fixes this issue by expanding this pattern match into several simpler ones.
Note that tuning or turning off `-Dscalac.patmat.analysisBudget` doesn't work for this case.
Compilation against Scala 2.10
Author: tedyu <yuzhihong@gmail.com>
Closes#11798 from yy2016/master.
This patch modifies the BlockManager, MemoryStore, and several other storage components so that serialized cached blocks are stored as multiple small chunks rather than as a single contiguous ByteBuffer.
This change will help to improve the efficiency of memory allocation and the accuracy of memory accounting when serializing blocks. Our current serialization code uses a ByteBufferOutputStream, which doubles and re-allocates its backing byte array; this increases the peak memory requirements during serialization (since we need to hold extra memory while expanding the array). In addition, we currently don't account for the extra wasted space at the end of the ByteBuffer's backing array, so a 129 megabyte serialized block may actually consume 256 megabytes of memory. After switching to storing blocks in multiple chunks, we'll be able to efficiently trim the backing buffers so that no space is wasted.
This change is also a prerequisite to being able to cache blocks which are larger than 2GB (although full support for that depends on several other changes which have not bee implemented yet).
Author: Josh Rosen <joshrosen@databricks.com>
Closes#11748 from JoshRosen/chunked-block-serialization.
## What changes were proposed in this pull request?
We haven't figured out the corrected logical to add sub-queries yet, so we should not clear all sub-queries before generate SQL. This PR changed the logic to only remove sub-queries above table relation.
an example for this bug, original SQL: `SELECT a FROM (SELECT a FROM tbl) t WHERE a = 1`
before this PR, we will generate:
```
SELECT attr_1 AS a FROM
SELECT attr_1 FROM (
SELECT a AS attr_1 FROM tbl
) AS sub_q0
WHERE attr_1 = 1
```
We missed a sub-query and this SQL string is illegal.
After this PR, we will generate:
```
SELECT attr_1 AS a FROM (
SELECT attr_1 FROM (
SELECT a AS attr_1 FROM tbl
) AS sub_q0
WHERE attr_1 = 1
) AS t
```
TODO: for long term, we should find a way to add sub-queries correctly, so that arbitrary logical plans can be converted to SQL string.
## How was this patch tested?
`LogicalPlanToSQLSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11786 from cloud-fan/bug-fix.
## What changes were proposed in this pull request?
We only need to make sub-query names unique every time we generate a SQL string, but not all the time. This PR moves the `newSubqueryName` method to `class SQLBuilder` and remove `object SQLBuilder`.
also addressed 2 minor comments in https://github.com/apache/spark/pull/11696
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11783 from cloud-fan/tmp.
Decision trees in spark.ml (RandomForest.scala) communicate twice as much data as needed for unordered categorical features. Here's an example.
Say there are 3 categories A, B, C. We consider 3 splits:
* A vs. B, C
* A, B vs. C
* A, C vs. B
Currently, we collect statistics for each of the 6 subsets of categories (3 * 2 = 6). However, we could instead collect statistics for the 3 subsets on the left-hand side of the 3 possible splits: A and A,B and A,C. If we also have stats for the entire node, then we can compute the stats for the 3 subsets on the right-hand side of the splits. In pseudomath: stats(B,C) = stats(A,B,C) - stats(A).
This patch adds a parent stats array to the `DTStatsAggregator` so that the right child stats do not need to be stored. The right child stats are computed by subtracting left child stats from the parent stats for unordered categorical features.
Author: sethah <seth.hendrickson16@gmail.com>
Closes#9474 from sethah/SPARK-10788.
## What changes were proposed in this pull request?
Cleanups from [https://github.com/apache/spark/pull/11620]: remove remaining uses of validateParams, and put functionality into transformSchema
## How was this patch tested?
Existing unit tests, modified to check using transformSchema instead of validateParams
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#11790 from jkbradley/SPARK-13761-cleanup.
## What changes were proposed in this pull request?
In PySpark wrapper.py JavaWrapper change _java_obj from an unused static variable to a member variable that is consistent with usage in derived classes.
## How was this patch tested?
Ran python tests for ML and MLlib.
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#11767 from BryanCutler/JavaWrapper-static-_java_obj-SPARK-13937.
## What changes were proposed in this pull request?
Compilation against Scala 2.10 fails with:
```
[error] [warn] /home/jenkins/workspace/spark-master-compile-sbt-scala-2.10/sql/hive/src/main/scala/org/apache/spark/sql/hive/SQLBuilder.scala:483: Cannot check match for unreachability.
[error] (The analysis required more space than allowed. Please try with scalac -Dscalac.patmat.analysisBudget=512 or -Dscalac.patmat.analysisBudget=off.)
[error] [warn] private def addSubqueryIfNeeded(plan: LogicalPlan): LogicalPlan = plan match {
```
## How was this patch tested?
Compilation against Scala 2.10
Author: tedyu <yuzhihong@gmail.com>
Closes#11787 from yy2016/master.
JIRA: https://issues.apache.org/jira/browse/SPARK-13838
## What changes were proposed in this pull request?
We should also clear the variable code in `BoundReference.genCode` to prevent it to be evaluated twice, as we did in `evaluateVariables`.
## How was this patch tested?
Existing tests.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes#11674 from viirya/avoid-reevaluate.
## What changes were proposed in this pull request?
Support queries that JOIN tables with USING clause.
SELECT * from table1 JOIN table2 USING <column_list>
USING clause can be used as a means to simplify the join condition
when :
1) Equijoin semantics is desired and
2) The column names in the equijoin have the same name.
We already have the support for Natural Join in Spark. This PR makes
use of the already existing infrastructure for natural join to
form the join condition and also the projection list.
## How was the this patch tested?
Have added unit tests in SQLQuerySuite, CatalystQlSuite, ResolveNaturalJoinSuite
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes#11297 from dilipbiswal/spark-13427.
## What changes were proposed in this pull request?
As each acceptor/selector in Jetty will use one thread, the number of threads should at least be the number of acceptors and selectors plus 1. Otherwise, the thread pool of Jetty server may be exhausted by acceptors/selectors and not be able to response any request.
To avoid wasting threads, the PR limits the max number of acceptors and selectors and also updates the max thread number if necessary.
## How was this patch tested?
Just make sure we don't break any existing tests
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#11615 from zsxwing/SPARK-13776.
## What changes were proposed in this pull request?
This PR adds SQL generation support for `Generate` operator. It always converts `Generate` operator into `LATERAL VIEW` format as there are many limitations to put UDTF in project list.
This PR is based on https://github.com/apache/spark/pull/11658, please see the last commit to review the real changes.
Thanks dilipbiswal for his initial work! Takes over https://github.com/apache/spark/pull/11596
## How was this patch tested?
new tests in `LogicalPlanToSQLSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11696 from cloud-fan/generate.
## What changes were proposed in this pull request?
Logging was made private in Spark 2.0. If we move it, then users would be able to create a Logging trait themselves to avoid changing their own code.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11764 from cloud-fan/logger.
JIRA Issue:https://issues.apache.org/jira/browse/SPARK-13901
In getAllowedLocalityLevel method of TaskSetManager,we get wrong logDebug information when jump to the next locality level.So we should fix it.
Author: trueyao <501663994@qq.com>
Closes#11719 from trueyao/logDebug-localityWait.
## What changes were proposed in this pull request?
Add the java example of StreamingTest
## How was this patch tested?
manual tests in CLI: bin/run-example mllib.JavaStreamingTestExample dataDir 5 100
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#11776 from zhengruifeng/streaming_je.
MiMa excludes are currently generated using both the current Spark version's classes and Spark 1.2.0's classes, but this doesn't make sense: we should only be ignoring classes which were `private` in the previous Spark version, not classes which became private in the current version.
This patch updates `dev/mima` to only generate excludes with respect to the previous artifacts that MiMa checks against. It also updates `MimaBuild` so that `excludeClass` only applies directly to the class being excluded and not to its companion object (since a class and its companion object can have different accessibility).
Author: Josh Rosen <joshrosen@databricks.com>
Closes#11774 from JoshRosen/SPARK-13948.
This commit updates the HiveContext so that sc.hadoopConfiguration is used to instantiate its internal instances of HiveConf.
I tested this by overriding the S3 FileSystem implementation from spark-defaults.conf as "spark.hadoop.fs.s3.impl" (to avoid [HADOOP-12810](https://issues.apache.org/jira/browse/HADOOP-12810)).
Author: Ryan Blue <blue@apache.org>
Closes#11273 from rdblue/SPARK-13403-new-hive-conf-from-hadoop-conf.
Because ClassTags are available when constructing ShuffledRDD we can use them to automatically use Kryo for shuffle serialization when the RDD's types are known to be compatible with Kryo.
This patch introduces `SerializerManager`, a component which picks the "best" serializer for a shuffle given the elements' ClassTags. It will automatically pick a Kryo serializer for ShuffledRDDs whose key, value, and/or combiner types are primitives, arrays of primitives, or strings. In the future we can use this class as a narrow extension point to integrate specialized serializers for other types, such as ByteBuffers.
In a planned followup patch, I will extend the BlockManager APIs so that we're able to use similar automatic serializer selection when caching RDDs (this is a little trickier because the ClassTags need to be threaded through many more places).
Author: Josh Rosen <joshrosen@databricks.com>
Closes#11755 from JoshRosen/automatically-pick-best-serializer.
## What changes were proposed in this pull request?
Since developer API of plug-able parser has been removed in #10801 , docs should be updated accordingly.
## How was this patch tested?
This patch will not affect the real code path.
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#11758 from adrian-wang/spark12855.
## What changes were proposed in this pull request?
This PR removes three minor duplicated lines. First one is making the following unreachable code warning.
```
JoinSuite.scala:52: unreachable code
[warn] case j: BroadcastHashJoin => j
```
The other two are just consecutive repetitions in `Seq` of MiMa filters.
## How was this patch tested?
Pass the existing Jenkins test.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11773 from dongjoon-hyun/remove_duplicated_line.
## What changes were proposed in this pull request?
Fix expression generation for optional types.
Standard Java reflection causes issues when dealing with synthetic Scala objects (things that do not map to Java and thus contain a dollar sign in their name). This patch introduces Scala reflection in such cases.
This patch also adds a regression test for Dataset's handling of classes defined in package objects (which was the initial purpose of this PR).
## How was this patch tested?
A new test in ExpressionEncoderSuite that tests optional inner classes and a regression test for Dataset's handling of package objects.
Author: Jakob Odersky <jakob@odersky.com>
Closes#11708 from jodersky/SPARK-13118-package-objects.
## What changes were proposed in this pull request?
We need to copy the UnsafeRow since a Join could produce multiple rows from single input rows. We could avoid that if there is no join (or the join will not produce multiple rows) inside WholeStageCodegen.
Updated the benchmark for `collect`, we could see 20-30% speedup.
## How was this patch tested?
existing unit tests.
Author: Davies Liu <davies@databricks.com>
Closes#11740 from davies/avoid_copy2.
## What changes were proposed in this pull request?
This https://github.com/apache/spark/pull/2400 added the support to parse JSON rows wrapped with an array. However, this throws an exception when the given data contains array data and struct data in the same field as below:
```json
{"a": {"b": 1}}
{"a": []}
```
and the schema is given as below:
```scala
val schema =
StructType(
StructField("a", StructType(
StructField("b", StringType) :: Nil
)) :: Nil)
```
- **Before**
```scala
sqlContext.read.schema(schema).json(path).show()
```
```scala
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 0.0 failed 4 times, most recent failure: Lost task 7.3 in stage 0.0 (TID 10, 192.168.1.170): java.lang.ClassCastException: org.apache.spark.sql.types.GenericArrayData cannot be cast to org.apache.spark.sql.catalyst.InternalRow
at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getStruct(rows.scala:50)
at org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getStruct(rows.scala:247)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source)
...
```
- **After**
```scala
sqlContext.read.schema(schema).json(path).show()
```
```bash
+----+
| a|
+----+
| [1]|
|null|
+----+
```
For other data types, in this case it converts the given values are `null` but only this case emits an exception.
This PR makes the support for wrapped rows applied only at the top level.
## How was this patch tested?
Unit tests were used and `./dev/run_tests` for code style tests.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#11752 from HyukjinKwon/SPARK-3308-follow-up.
## What changes were proposed in this pull request?
As part of the effort to merge `SQLContext` and `HiveContext`, this patch implements an internal catalog called `SessionCatalog` that handles temporary functions and tables and delegates metastore operations to `ExternalCatalog`. Currently, this is still dead code, but in the future it will be part of `SessionState` and will replace `o.a.s.sql.catalyst.analysis.Catalog`.
A recent patch #11573 parses Hive commands ourselves in Spark, but still passes the entire query text to Hive. In a future patch, we will use `SessionCatalog` to implement the parsed commands.
## How was this patch tested?
800+ lines of tests in `SessionCatalogSuite`.
Author: Andrew Or <andrew@databricks.com>
Closes#11750 from andrewor14/temp-catalog.
## What changes were proposed in this pull request?
Deprecate validateParams() method here: 035d3acdf3/mllib/src/main/scala/org/apache/spark/ml/param/params.scala (L553)
Move all functionality in overridden methods to transformSchema().
Check docs to make sure they indicate complex Param interaction checks should be done in transformSchema.
## How was this patch tested?
unit tests
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#11620 from hhbyyh/depreValid.
## What changes were proposed in this pull request?
Narrow down the parameter type of `UserDefinedType#serialize()`. Currently, the parameter type is `Any`, however it would logically make more sense to narrow it down to the type of the actual user defined type.
## How was this patch tested?
Existing tests were successfully run on local machine.
Author: Jakob Odersky <jakob@odersky.com>
Closes#11379 from jodersky/SPARK-11011-udt-types.
## What changes were proposed in this pull request?
**[I'll link it to the JIRA once ASF JIRA is back online]**
This PR modifies the existing `CombineFilters` rule to remove redundant conditions while combining individual filter predicates. For instance, queries of the form `table.where('a === 1 && 'b === 1).where('a === 1 && 'c === 1)` will now be optimized to ` table.where('a === 1 && 'b === 1 && 'c === 1)` (instead of ` table.where('a === 1 && 'a === 1 && 'b === 1 && 'c === 1)`)
## How was this patch tested?
Unit test in `FilterPushdownSuite`
Author: Sameer Agarwal <sameer@databricks.com>
Closes#11670 from sameeragarwal/combine-filters.
## What changes were proposed in this pull request?
This PR generalizes the `NullFiltering` optimizer rule in catalyst to `InferFiltersFromConstraints` that can automatically infer all relevant filters based on an operator's constraints while making sure of 2 things:
(a) no redundant filters are generated, and
(b) filters that do not contribute to any further optimizations are not generated.
## How was this patch tested?
Extended all tests in `InferFiltersFromConstraintsSuite` (that were initially based on `NullFilteringSuite` to test filter inference in `Filter` and `Join` operators.
In particular the 2 tests ( `single inner join with pre-existing filters: filter out values on either side` and `multiple inner joins: filter out values on all sides on equi-join keys` attempts to highlight/test the real potential of this rule for join optimization.
Author: Sameer Agarwal <sameer@databricks.com>
Closes#11665 from sameeragarwal/infer-filters.
## What changes were proposed in this pull request?
`Shark` was merged into `Spark SQL` since [July 2014](https://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html). The followings seem to be the only legacy. For Spark 2.x, we had better clean up those docs.
**Migration Guide**
```
- ## Migration Guide for Shark Users
- ...
- ### Scheduling
- ...
- ### Reducer number
- ...
- ### Caching
```
## How was this patch tested?
Pass the Jenkins test.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11770 from dongjoon-hyun/SPARK-13942.
## What changes were proposed in this pull request?
Add export/import for all estimators and transformers(which have Scala implementation) under pyspark/ml/classification.py.
## How was this patch tested?
./python/run-tests
./dev/lint-python
Unit tests added to check persistence in Logistic Regression
Author: GayathriMurali <gayathri.m.softie@gmail.com>
Closes#11707 from GayathriMurali/SPARK-13034.
## What changes were proposed in this pull request?
Add row/column iterator to local matrices to simplify tasks like BlockMatrix => RowMatrix conversion. It handles dense and sparse matrices properly.
## How was this patch tested?
Unit tests on sparse and dense matrix.
cc: dbtsai
Author: Xiangrui Meng <meng@databricks.com>
Closes#11757 from mengxr/SPARK-13927.
### What changes were proposed in this pull request?
Made these MLReadable and MLWritable: DecisionTreeClassifier, DecisionTreeClassificationModel, DecisionTreeRegressor, DecisionTreeRegressionModel
* The shared implementation is in treeModels.scala
* I use case classes to create a DataFrame to save, and I use the Dataset API to parse loaded files.
Other changes:
* Made CategoricalSplit.numCategories public (to use in persistence)
* Fixed a bug in DefaultReadWriteTest.testEstimatorAndModelReadWrite, where it did not call the checkModelData function passed as an argument. This caused an error in LDASuite, which I fixed.
### How was this patch tested?
Persistence is tested via unit tests. For each algorithm, there are 2 non-trivial trees (depth 2). One is built with continuous features, and one with categorical; this ensures that both types of splits are tested.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#11581 from jkbradley/dt-io.
## What changes were proposed in this pull request?
Provide ignored test cases to export the test dataset into CSV format in ```LinearRegressionSuite```, ```LogisticRegressionSuite```, ```AFTSurvivalRegressionSuite``` and ```GeneralizedLinearRegressionSuite```, so users can validate the training accuracy compared with R's glm, glmnet and survival package.
cc mengxr
## How was this patch tested?
The test suite is ignored, but I have enabled all these cases offline and it works as expected.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#11463 from yanboliang/spark-13613.
## What changes were proposed in this pull request?
JIRA issue: https://issues.apache.org/jira/browse/SPARK-13038
1. Add load/save to PySpark Pipeline and PipelineModel
2. Add `_transfer_stage_to_java()` and `_transfer_stage_from_java()` for `JavaWrapper`.
## How was this patch tested?
Test with doctest.
Author: Xusen Yin <yinxusen@gmail.com>
Closes#11683 from yinxusen/SPARK-13038-only.
#### What changes were proposed in this pull request?
This PR is to convert to SQL from analyzed logical plans containing operator `ScriptTransformation`.
For example, below is the SQL containing `Transform`
```
SELECT TRANSFORM (a, b, c, d) USING 'cat' FROM parquet_t2
```
Its logical plan is like
```
ScriptTransformation [a#210L,b#211L,c#212L,d#213L], cat, [key#208,value#209], HiveScriptIOSchema(List(),List(),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),List((field.delim, )),List((field.delim, )),Some(org.apache.hadoop.hive.ql.exec.TextRecordReader),Some(org.apache.hadoop.hive.ql.exec.TextRecordWriter),true)
+- SubqueryAlias parquet_t2
+- Relation[a#210L,b#211L,c#212L,d#213L] ParquetRelation
```
The generated SQL will be like
```
SELECT TRANSFORM (`parquet_t2`.`a`, `parquet_t2`.`b`, `parquet_t2`.`c`, `parquet_t2`.`d`) USING 'cat' AS (`key` string, `value` string) FROM `default`.`parquet_t2`
```
#### How was this patch tested?
Seven test cases are added to `LogicalPlanToSQLSuite`.
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#11503 from gatorsmile/transformToSQL.
## What changes were proposed in this pull request?
This PR tries to solve a fundamental issue in the `SQLBuilder`. When we want to turn a logical plan into SQL string and put it after FROM clause, we need to wrap it with a sub-query. However, a logical plan is allowed to have same-name outputs with different qualifiers(e.g. the `Join` operator), and this kind of plan can't be put under a subquery as we will erase and assign a new qualifier to all outputs and make it impossible to distinguish same-name outputs.
To solve this problem, this PR renames all attributes with globally unique names(using exprId), so that we don't need qualifiers to resolve ambiguity anymore.
For example, `SELECT x.key, MAX(y.key) OVER () FROM t x JOIN t y`, we will parse this SQL to a Window operator and a Project operator, and add a sub-query between them. The generated SQL looks like:
```
SELECT sq_1.key, sq_1.max
FROM (
SELECT sq_0.key, sq_0.key, MAX(sq_0.key) OVER () AS max
FROM (
SELECT x.key, y.key FROM t1 AS x JOIN t2 AS y
) AS sq_0
) AS sq_1
```
You can see, the `key` columns become ambiguous after `sq_0`.
After this PR, it will generate something like:
```
SELECT attr_30 AS key, attr_37 AS max
FROM (
SELECT attr_30, attr_37
FROM (
SELECT attr_30, attr_35, MAX(attr_35) AS attr_37
FROM (
SELECT attr_30, attr_35 FROM
(SELECT key AS attr_30 FROM t1) AS sq_0
INNER JOIN
(SELECT key AS attr_35 FROM t1) AS sq_1
) AS sq_2
) AS sq_3
) AS sq_4
```
The outermost SELECT is used to turn the generated named to real names back, and the innermost SELECT is used to alias real columns to our generated names. Between them, there is no name ambiguity anymore.
## How was this patch tested?
existing tests and new tests in LogicalPlanToSQLSuite.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11658 from cloud-fan/gensql.
JIRA: https://issues.apache.org/jira/browse/SPARK-13816
## What changes were proposed in this pull request?
Add parameter checks for algorithms in Graphx: Pregel,LabelPropagation,PageRank,SVDPlusPlus
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#11655 from zhengruifeng/graphx_param_check.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-13894
Change the return type of the `SQLContext.range` API from `DataFrame` to `Dataset`.
## How was this patch tested?
No additional unit test required.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#11730 from chenghao-intel/range.