Data Spill with UnsafeRow causes assert failure.
```
java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:165)
at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75)
at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180)
at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688)
at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687)
at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
```
To reproduce that with code (thanks andrewor14):
```scala
bin/spark-shell --master local \
  --conf spark.shuffle.memoryFraction=0.005 \
  --conf spark.shuffle.sort.bypassMergeThreshold=0

// then, at the scala> prompt:
sc.parallelize(1 to 2 * 1000 * 1000, 10)
  .map { i => (i, i) }.toDF("a", "b").groupBy("b").avg().count()
```
Author: Cheng Hao <hao.cheng@intel.com>
Closes #8635 from chenghao-intel/unsafe_spill.
This PR is based on #8383, thanks to viirya.
JIRA: https://issues.apache.org/jira/browse/SPARK-9730
This patch adds full outer join support for SortMergeJoin. A new class SortMergeFullJoinScanner is added to scan rows from the left and right iterators. FullOuterIterator is simply a wrapper of type RowIterator to consume joined rows from SortMergeFullJoinScanner.
Closes #8383
Author: Liang-Chi Hsieh <viirya@appier.com>
Author: Davies Liu <davies@databricks.com>
Closes #8579 from davies/smj_fullouter.
The bulk of the changes concern the `transient` annotation on class parameters. Often the compiler doesn't generate a field for such a parameter, so the `transient` annotation would be unnecessary.
But if the class parameter is used in methods, then a field is created. So it is safer to keep the annotations.
The remainder are fixes for some potential bugs and for deprecated syntax.
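As a rough illustration of why the annotation matters once a field exists, here is a minimal sketch. `Holder` and `TransientDemo` are hypothetical names, not classes from this patch, and the parameter is declared as a `private val` so that a backing field is guaranteed.

```scala
import java.lang.reflect.Modifier

// Hypothetical class for illustration only; it is not part of the Spark code base.
// `private val` guarantees a backing field, so @transient always has a target here.
class Holder(@transient private val payload: Array[Byte], val name: String)
    extends Serializable {
  // The method reads the field, so the field cannot be optimized away.
  def payloadSize: Int = if (payload == null) -1 else payload.length
}

object TransientDemo {
  /** Names of the JVM fields declared by `cls` that carry the `transient` modifier. */
  def transientFields(cls: Class[_]): Seq[String] =
    cls.getDeclaredFields.toSeq
      .filter(f => Modifier.isTransient(f.getModifiers))
      .map(_.getName)
}
```

`TransientDemo.transientFields(classOf[Holder])` should list the backing field of `payload` but not the one of `name`, which is what keeps `payload` out of Java serialization.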
Author: Luc Bourlier <luc.bourlier@typesafe.com>
Closes #8433 from skyluc/issue/sbt-2.11.
To keep full compatibility of Parquet write path with Spark 1.4, we should rename the innermost field name of arrays that may contain null from "array_element" to "array".
Please refer to [SPARK-10434] [1] for more details.
[1]: https://issues.apache.org/jira/browse/SPARK-10434
Author: Cheng Lian <lian@databricks.com>
Closes #8586 from liancheng/spark-10434/fix-parquet-array-type.
This PR can be quite challenging to review. I'm trying to give a detailed description of the problem as well as its solution here.
When reading Parquet files, we need to specify a potentially nested Parquet schema (of type `MessageType`) as requested schema for column pruning. This Parquet schema is translated from a Catalyst schema (of type `StructType`), which is generated by the query planner and represents all requested columns. However, this translation can be fairly complicated for several reasons:
1. Requested schema must conform to the real schema of the physical file to be read.
This means we have to tailor the actual file schema of every individual physical Parquet file to be read according to the given Catalyst schema. Fortunately we are already doing this in Spark 1.5 by pushing request schema conversion to executor side in PR #7231.
1. Support for schema merging.
A single Parquet dataset may consist of multiple physical Parquet files that come with different but compatible schemas. This means we may request a column path that doesn't exist in a physical Parquet file. All requested column paths can be nested. For example, for a Parquet file schema
```
message root {
required group f0 {
required group f00 {
required int32 f000;
required binary f001 (UTF8);
}
}
}
```
we may request column paths defined in the following schema:
```
message root {
required group f0 {
required group f00 {
required binary f001 (UTF8);
required float f002;
}
}
optional double f1;
}
```
Notice that we pruned column path `f0.f00.f000`, but added `f0.f00.f002` and `f1`.
The good news is that Parquet handles non-existing column paths properly and always returns null for them.
1. The map from `StructType` to `MessageType` is a one-to-many map.
This is the most unfortunate part.
Due to historical reasons (dark histories!), schemas of Parquet files generated by different libraries have different "flavors". For example, to handle a schema with a single non-nullable column, whose type is an array of non-nullable integers, parquet-protobuf generates the following Parquet schema:
```
message m0 {
repeated int32 f;
}
```
while parquet-avro generates another version:
```
message m1 {
required group f (LIST) {
repeated int32 array;
}
}
```
and parquet-thrift spills this:
```
message m2 {
required group f (LIST) {
repeated int32 f_tuple;
}
}
```
All of them can be mapped to the following _unique_ Catalyst schema:
```
StructType(
StructField(
"f",
ArrayType(IntegerType, containsNull = false),
nullable = false))
```
This greatly complicates Parquet requested schema construction, since the path of a given column varies in different cases. To read the array elements from files with the above schemas, we must use `f` for `m0`, `f.array` for `m1`, and `f.f_tuple` for `m2`.
In earlier Spark versions, we didn't try to fix this issue properly. Spark 1.4 and prior versions simply translate the Catalyst schema in a way that is more or less compatible with parquet-hive and parquet-avro, but broken in many other cases. Earlier revisions of Spark 1.5 only try to tailor the Parquet file schema at the first level, and ignore nested ones. This caused [SPARK-10301] [spark-10301] as well as [SPARK-10005] [spark-10005]. In PR #8228, I tried to avoid the hard part of the problem and made a minimum change in `CatalystRowConverter` to fix SPARK-10005. However, when taking SPARK-10301 into consideration, continuing to hack `CatalystRowConverter` doesn't seem to be a good idea. So this PR is an attempt to fix the problem in a proper way.
For a given physical Parquet file with schema `ps` and a compatible Catalyst requested schema `cs`, we use the following algorithm to tailor `ps` to get the result Parquet requested schema `ps'`:
For a leaf column path `c` in `cs`:
- if `c` exists in `cs` and a corresponding Parquet column path `c'` can be found in `ps`, `c'` should be included in `ps'`;
- otherwise, we convert `c` to a Parquet column path `c"` using `CatalystSchemaConverter`, and include `c"` in `ps'`;
- no other column paths should exist in `ps'`.
Then comes the most tedious part:
> Given `cs`, `ps`, and `c`, how to locate `c'` in `ps`?
Unfortunately, there's no quick answer, and we have to enumerate all possible structures defined in the parquet-format spec. They are:
1. the standard structure of nested types, and
1. cases defined in all backwards-compatibility rules for `LIST` and `MAP`.
The core part of this PR is `CatalystReadSupport.clipParquetType()`, which tailors a given Parquet file schema according to a requested schema in its Catalyst form. Backwards-compatibility rules of `LIST` and `MAP` are covered in `clipParquetListType()` and `clipParquetMapType()` respectively. The column path selection algorithm is implemented in `clipParquetGroupFields()`.
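To make the column path selection concrete, here is a heavily simplified sketch. It uses toy stand-ins (`PType`, `CType`, `ClipSketch`, all hypothetical) instead of Parquet's `MessageType` and Catalyst's `StructType`, ignores repetition levels and the `LIST`/`MAP` rules entirely, and assumes the two schemas are compatible. It is not the code in `clipParquetGroupFields()`, only the keep-or-convert idea.

```scala
// Toy stand-ins for Parquet types; hypothetical, not parquet-mr classes.
sealed trait PType { def name: String }
final case class PPrimitive(name: String, tpe: String) extends PType
final case class PGroup(name: String, fields: List[PType]) extends PType

// Toy stand-ins for Catalyst types; hypothetical as well.
sealed trait CType
final case class CPrimitive(tpe: String) extends CType
final case class CStruct(fields: List[(String, CType)]) extends CType

object ClipSketch {
  // Stand-in for the Catalyst-to-Parquet conversion used for columns missing from the file.
  def convert(name: String, c: CType): PType = c match {
    case CPrimitive(t) => PPrimitive(name, t)
    case CStruct(fs)   => PGroup(name, fs.map { case (n, t) => convert(n, t) })
  }

  // Tailor one file type against the requested Catalyst type.
  def clip(file: PType, requested: CType): PType = (file, requested) match {
    case (g: PGroup, s: CStruct) => PGroup(g.name, clipFields(g, s))
    case _                       => file // leaf found in the file: keep the file's definition
  }

  // Keep requested fields found in the file, convert the missing ones, drop everything else.
  def clipFields(file: PGroup, requested: CStruct): List[PType] = {
    val byName = file.fields.map(f => f.name -> f).toMap
    requested.fields.map { case (n, t) =>
      byName.get(n).map(clip(_, t)).getOrElse(convert(n, t))
    }
  }

  // The file schema and requested schema from the schema merging example above.
  val fileSchema: PGroup = PGroup("root", List(
    PGroup("f0", List(
      PGroup("f00", List(PPrimitive("f000", "int32"), PPrimitive("f001", "binary")))))))

  val requested: CStruct = CStruct(List(
    "f0" -> CStruct(List(
      "f00" -> CStruct(List("f001" -> CPrimitive("binary"), "f002" -> CPrimitive("float"))))),
    "f1" -> CPrimitive("double")))
}
```

`ClipSketch.clip(ClipSketch.fileSchema, ClipSketch.requested)` drops `f0.f00.f000`, keeps `f0.f00.f001` as defined in the file, and adds converted entries for `f0.f00.f002` and `f1`.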
With this PR, we no longer need to do schema tailoring in `CatalystReadSupport` and `CatalystRowConverter`. Another benefit is that we can now also read Parquet datasets consisting of files that have different physical Parquet schemas but share the same logical schema, for example, files generated by different Parquet libraries. This situation is illustrated by [this test case] [test-case].
[spark-10301]: https://issues.apache.org/jira/browse/SPARK-10301
[spark-10005]: https://issues.apache.org/jira/browse/SPARK-10005
[test-case]: 38644d8a45 (diff-a9b98e28ce3ae30641829dffd1173be2R26)
Author: Cheng Lian <lian@databricks.com>
Closes #8509 from liancheng/spark-10301/fix-parquet-requested-schema.
Writing a DataFrame to a DB2 database fails because, by default, the JDBC data source implementation generates a table schema with data types that DB2 doesn't support: TEXT for String and BIT(1) for Boolean.
This patch registers a DB2 JDBC dialect that maps String and Boolean to valid DB2 data types.
Author: sureshthalamati <suresh.thalamati@gmail.com>
Closes #8393 from sureshthalamati/db2_dialect_spark-10170.
This PR includes the following changes:
- Add `LocalNodeTest` for local operator tests and add unit tests for FilterNode and ProjectNode.
- Add `LimitNode` and `UnionNode` and their unit tests to show how to use `LocalNodeTest`. (SPARK-9991, SPARK-9993)
Author: zsxwing <zsxwing@gmail.com>
Closes #8464 from zsxwing/local-execution.
This fixes the problem that scanning a partitioned table puts the driver under high memory pressure and takes down the cluster. Also, with this fix, we will be able to correctly show the query plan of a query consuming partitioned tables.
https://issues.apache.org/jira/browse/SPARK-10339
https://issues.apache.org/jira/browse/SPARK-10334
Finally, this PR squeezes in a "quick fix" for SPARK-10301. It is not a real fix; it just throws a better error message to let the user know what to do.
Author: Yin Huai <yhuai@databricks.com>
Closes #8515 from yhuai/partitionedTableScan.
SparkHadoopUtil contains methods that use reflection to work around TaskAttemptContext binary incompatibilities between Hadoop 1.x and 2.x. We should use these methods in more places.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #8499 from JoshRosen/use-hadoop-reflection-in-more-places.
Having sizeInBytes in HadoopFsRelation to enable broadcast join.
cc marmbrus
Author: Davies Liu <davies@databricks.com>
Closes #8490 from davies/sizeInByte.
https://issues.apache.org/jira/browse/SPARK-10287
After porting json to HadoopFsRelation, it seems hard to keep the behavior of picking up new files automatically for JSON. This PR removes this behavior, so JSON is consistent with others (ORC and Parquet).
Author: Yin Huai <yhuai@databricks.com>
Closes #8469 from yhuai/jsonRefresh.
This PR:
1. supports transferring arbitrary nested array from JVM to R side in SerDe;
2. based on 1, the collect() implementation is improved. It can now collect data of complex types from a DataFrame.
Author: Sun Rui <rui.sun@intel.com>
Closes #8276 from sun-rui/SPARK-10048.
Replace `JavaConversions` implicits with `JavaConverters`
Most occurrences I've seen so far are necessary conversions; a few have been avoidable. None are in critical code as far as I see, yet.
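For readers unfamiliar with the difference, a small sketch of the explicit style follows. `ConvertersDemo` and its methods are made-up names for illustration. The import is the standard library's `scala.collection.JavaConverters` (on Scala 2.13 and later, `scala.jdk.CollectionConverters` is the non-deprecated equivalent).

```scala
import java.util.{ArrayList => JArrayList, List => JList}
import scala.collection.JavaConverters._

// Hypothetical helper object; it only demonstrates the explicit conversion style.
object ConvertersDemo {
  // Builds a plain java.util.List, as a Java API might return it.
  def javaNames(): JList[String] = {
    val names = new JArrayList[String]()
    names.add("a")
    names.add("b")
    names
  }

  // With JavaConverters the wrapping is visible at the call site via `.asScala`,
  // whereas JavaConversions would convert silently through an implicit view.
  def upperCased(names: JList[String]): List[String] =
    names.asScala.map(_.toUpperCase).toList

  // The opposite direction is just as explicit.
  def toJava(xs: Seq[Int]): JList[Int] = xs.asJava
}
```

The explicit `.asScala` / `.asJava` calls make it easy to spot conversions that could be avoided, which is the kind of occurrence mentioned above.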
Author: Sean Owen <sowen@cloudera.com>
Closes #8033 from srowen/SPARK-9613.
Spark SQL's data sources API exposes Catalyst's internal types through its Filter interfaces. This is a problem because types like UTF8String are not stable developer APIs and should not be exposed to third-parties.
This issue caused incompatibilities when upgrading our `spark-redshift` library to work against Spark 1.5.0. To avoid these issues in the future we should only expose public types through these Filter objects. This patch accomplishes this by using CatalystTypeConverters to add the appropriate conversions.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #8403 from JoshRosen/datasources-internal-vs-external-types.
PR #8341 is a valid fix for SPARK-10136, but it didn't catch the real root cause. The real problem can be rather tricky to explain, and requires readers to be pretty familiar with the parquet-format spec, especially the details of the `LIST` backwards-compatibility rules. Let me try to explain it here.
The structure of the problematic Parquet schema generated by parquet-avro is something like this:
```
message m {
<repetition> group f (LIST) { // Level 1
repeated group array (LIST) { // Level 2
repeated <primitive-type> array; // Level 3
}
}
}
```
(The schema generated by parquet-thrift is structurally similar, just replace the `array` at level 2 with `f_tuple`, and the other one at level 3 with `f_tuple_tuple`.)
This structure consists of two nested legacy 2-level `LIST`-like structures:
1. The repeated group type at level 2 is the element type of the outer array defined at level 1
This group should map to a `CatalystArrayConverter.ElementConverter` when building converters.
2. The repeated primitive type at level 3 is the element type of the inner array defined at level 2
This primitive should also map to a `CatalystArrayConverter.ElementConverter`.
The root cause of SPARK-10136 is that the group at level 2 isn't properly recognized as the element type of level 1. Thus, according to the parquet-format spec, the repeated primitive at level 3 is left as a so-called "unannotated repeated primitive type", and is recognized as a required list of required primitive type, so a `RepeatedPrimitiveConverter` instead of a `CatalystArrayConverter.ElementConverter` is created for it.
According to the parquet-format spec, an unannotated repeated type shouldn't appear in a `LIST`- or `MAP`-annotated group. PR #8341 fixed this issue by allowing such unannotated repeated types to appear in `LIST`-annotated groups, which is a non-standard, hacky, but valid fix. (I didn't realize this when authoring #8341 though.)
As for the reason why level 2 isn't recognized as a list element type, it's because of the following `LIST` backwards-compatibility rule defined in the parquet-format spec:
> If the repeated field is a group with one field and is named either `array` or uses the `LIST`-annotated group's name with `_tuple` appended then the repeated type is the element type and elements are required.
(The `array` part is for parquet-avro compatibility, while the `_tuple` part is for parquet-thrift.)
This rule is implemented in [`CatalystSchemaConverter.isElementType`] [1], but neglected in [`CatalystRowConverter.isElementType`] [2]. This PR delivers a more robust fix by adding this rule in the latter method.
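A toy version of the element-type check may help here. The types below (`Field`, `Primitive`, `Group`, `ListCompat`) are hypothetical stand-ins rather than parquet-mr or Spark classes. The first two cases come from the other backwards-compatibility rules in the parquet-format spec (a repeated non-group field, and a repeated group with multiple fields) and are not quoted in this description.

```scala
// Toy model of a Parquet field; hypothetical, not parquet-mr classes.
sealed trait Field { def name: String }
final case class Primitive(name: String) extends Field
final case class Group(name: String, fields: List[Field]) extends Field

object ListCompat {
  /**
   * Decides whether the repeated field inside a LIST-annotated group named
   * `parentName` is itself the element type (legacy 2-level layout) rather than
   * the wrapper group of the standard 3-level layout.
   */
  def isElementType(repeated: Field, parentName: String): Boolean = repeated match {
    case _: Primitive                              => true  // repeated primitive is the element
    case Group(_, fields) if fields.size > 1       => true  // multi-field group is a struct element
    case Group("array", _)                         => true  // parquet-avro legacy name
    case Group(n, _) if n == parentName + "_tuple" => true  // parquet-thrift legacy name
    case _                                         => false // standard 3-level wrapper group
  }
}
```

For the problematic schema above, `ListCompat.isElementType(Group("array", List(Primitive("array"))), "f")` is `true`, so the level 2 group is treated as the element type of level 1, while a standard wrapper such as `Group("list", List(Primitive("element")))` is not.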
Note that parquet-avro 1.7.0 also suffers from this issue. Details can be found at [PARQUET-364] [3].
[1]: 85f9a61357/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystSchemaConverter.scala (L259-L305)
[2]: 85f9a61357/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystRowConverter.scala (L456-L463)
[3]: https://issues.apache.org/jira/browse/PARQUET-364
Author: Cheng Lian <lian@databricks.com>
Closes #8361 from liancheng/spark-10136/proper-version.
* Makes `SQLImplicits.rddToDataFrameHolder` scaladoc consistent with `SQLContext.createDataFrame[A <: Product](rdd: RDD[A])` since the former is essentially a wrapper for the latter
* Clarifies `createDataFrame[A <: Product]` scaladoc to apply for any `RDD[Product]`, not just case classes
Author: Feynman Liang <fliang@databricks.com>
Closes #8406 from feynmanliang/sql-doc-fixes.
This PR contains examples on how to use some of the Stat Functions available for DataFrames under `df.stat`.
rxin
Author: Burak Yavuz <brkyvz@gmail.com>
Closes #8378 from brkyvz/update-sql-docs.
https://issues.apache.org/jira/browse/SPARK-10143
With this PR, we will set the min split size to Parquet's block size (row group size) set in the conf if the min split size is smaller. This way we avoid having too many tasks, including useless ones, when reading Parquet data.
I tested it locally. The table I have has 343MB and it is in my local FS. Because I did not set any min/max split size, the default split size was 32MB and the map stage had 11 tasks. But there were only three tasks that actually read data. With my PR, there were only three tasks in the map stage. Here is the difference.
Without this PR:
![image](https://cloud.githubusercontent.com/assets/2072857/9399179/8587dba6-4765-11e5-9189-7ebba52a2b6d.png)
With this PR:
![image](https://cloud.githubusercontent.com/assets/2072857/9399185/a4735d74-4765-11e5-8848-1f1e361a6b4b.png)
Even if the block size setting does not match the actual block size of the Parquet file, I think it is still generally good to use Parquet's block size setting if the min split size is smaller than this block size.
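The arithmetic behind the 11 versus 3 tasks can be sketched as follows. The split size formula is the one used by Hadoop's `FileInputFormat` (`max(minSize, min(maxSize, blockSize))`). Everything else is an assumption for illustration: `SplitSizeSketch` is a made-up name, the Parquet block size is taken to be 128MB, the 343MB table is treated as a single file, and the split count is a plain ceiling division.

```scala
// Hypothetical helper; only the splitSize formula mirrors Hadoop's FileInputFormat.
object SplitSizeSketch {
  val MB: Long = 1024L * 1024L

  // What this PR does conceptually: raise the min split size to Parquet's block size if smaller.
  def effectiveMinSplitSize(minSplitSize: Long, parquetBlockSize: Long): Long =
    math.max(minSplitSize, parquetBlockSize)

  // Hadoop's rule for choosing the split size.
  def splitSize(minSize: Long, maxSize: Long, fsBlockSize: Long): Long =
    math.max(minSize, math.min(maxSize, fsBlockSize))

  // Rough number of splits for one file (ceiling division, ignoring Hadoop's slop factor).
  def approxSplits(fileSize: Long, split: Long): Long =
    (fileSize + split - 1) / split

  // Local FS with 32MB blocks and no min/max split size configured.
  val before: Long = approxSplits(343 * MB, splitSize(1L, Long.MaxValue, 32 * MB))

  // Same setup, but the min split size is raised to an assumed 128MB Parquet block size.
  val after: Long =
    approxSplits(343 * MB, splitSize(effectiveMinSplitSize(1L, 128 * MB), Long.MaxValue, 32 * MB))
}
```

Under these assumptions `before` is 11 and `after` is 3, which lines up with the task counts reported for the local test.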
Tested it on a cluster using
```
val count = sqlContext.table("""store_sales""").groupBy().count().queryExecution.executedPlan(3).execute().count
```
Basically, it reads zero columns of table `store_sales`. My table has 1824 Parquet files with sizes from 80MB to 280MB (1 to 3 row group sizes). Without this patch, in a 16-worker cluster, the job had 5023 tasks and took 102s. With this patch, the job had 2893 tasks and took 64s. It is still not as good as using one mapper per file (1824 tasks and 42s), but it is much better than our master.
Author: Yin Huai <yhuai@databricks.com>
Closes #8346 from yhuai/parquetMinSplit.
This class is identical to `org.apache.spark.sql.execution.datasources.jdbc.DefaultSource` and is not needed.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes #8334 from cloud-fan/minor.
I caught SPARK-10136 while adding more test cases to `ParquetAvroCompatibilitySuite`. Actual bug fix code lies in `CatalystRowConverter.scala`.
Author: Cheng Lian <lian@databricks.com>
Closes #8341 from liancheng/spark-10136/parquet-avro-nested-primitive-array.
This improves performance by roughly 20-30% in one of my local tests and should fix the performance regression from 1.4 to 1.5 on ss_max.
Author: Reynold Xin <rxin@databricks.com>
Closes #8332 from rxin/SPARK-10100.
https://issues.apache.org/jira/browse/SPARK-10092
This PR is a follow-up for multi-DB support. It has the following changes:
* `HiveContext.refreshTable` now accepts `dbName.tableName`.
* `HiveContext.analyze` now accepts `dbName.tableName`.
* `CreateTableUsing`, `CreateTableUsingAsSelect`, `CreateTempTableUsing`, `CreateTempTableUsingAsSelect`, `CreateMetastoreDataSource`, and `CreateMetastoreDataSourceAsSelect` all take `TableIdentifier` instead of the string representation of table name.
* When you call `saveAsTable` with a specified database, the data will be saved to the correct location.
* Explicitly do not allow users to create a temporary table with a specified database name (users could not do this before either).
* When we save a table to the metastore, we also check whether the db name and table name are accepted by Hive (using `MetaStoreUtils.validateName`).
Author: Yin Huai <yhuai@databricks.com>
Closes #8324 from yhuai/saveAsTableDB.
A few minor changes:
1. Improved documentation
2. Renamed apply(distinct....) to distinct.
3. Changed MutableAggregationBuffer from a trait to an abstract class.
4. Renamed returnDataType to dataType to be more consistent with other expressions.
And unrelated to UDAFs:
1. Renamed file names in expressions to use suffix "Expressions" to be more consistent.
2. Moved regexp-related expressions out to their own file.
3. Renamed StringComparison => StringPredicate.
Author: Reynold Xin <rxin@databricks.com>
Closes #8321 from rxin/SPARK-9242.
As I talked with Lian,
1. I added EqualNullSafe to ParquetFilters
- It uses the same equality comparison filter as EqualTo, since the Parquet filter actually performs a null-safe equality comparison.
2. Updated the test code (ParquetFilterSuite)
- Convert catalyst.Expression to sources.Filter
- Removed Cast since only Literal is picked up as a proper Filter in DataSourceStrategy
- Added EqualNullSafe comparison
3. Removed deprecated createFilter for catalyst.Expression
Author: hyukjinkwon <gurwls223@gmail.com>
Author: 권혁진 <gurwls223@gmail.com>
Closes #8275 from HyukjinKwon/master.
Speculation hates direct output committer, as there are multiple corner cases that may cause data corruption and/or data loss.
Please see this [PR comment] [1] for more details.
[1]: https://github.com/apache/spark/pull/8191#issuecomment-131598385
Author: Cheng Lian <lian@databricks.com>
Closes #8317 from liancheng/spark-9899/speculation-hates-direct-output-committer.
`DictionaryEncoding` uses Scala runtime reflection to avoid boxing costs while building the dictionary array. However, this code path may hit [SI-6240] [1] and throw an exception.
[1]: https://issues.scala-lang.org/browse/SI-6240
Author: Cheng Lian <lian@databricks.com>
Closes #8306 from liancheng/spark-9627/in-memory-cache-scala-reflection.
DataFrame.withColumn in Python should be consistent with the Scala one (replacing the existing column that has the same name).
cc marmbrus
Author: Davies Liu <davies@databricks.com>
Closes #8300 from davies/with_column.
This is kind of a weird case, but given a sufficiently complex query plan (in this case a TungstenProject with an Exchange underneath), we could have NPEs on the executors because of when we were calling transformAllExpressions.
In general we should ensure that all transformations occur on the driver and not on the executors. Some reasons for avoiding executor-side transformations include:
* (this case) Some operator constructors require state such as access to the Spark/SQL conf, so doing a makeCopy on the executor can fail.
* (unrelated reason for avoiding executor transformations) ExprIds are calculated using an atomic integer, so you can violate their uniqueness constraint by constructing them anywhere other than the driver.
This subsumes #8285.
Author: Reynold Xin <rxin@databricks.com>
Author: Michael Armbrust <michael@databricks.com>
Closes #8295 from rxin/SPARK-10096.
It turns out that inner classes of inner objects are referenced directly, and thus moving them will break binary compatibility.
Author: Michael Armbrust <michael@databricks.com>
Closes #8281 from marmbrus/binaryCompat.
Parquet hard-codes a JUL logger which always writes to stdout. This PR redirects it via the SLF4J JUL bridge handler, so that we can control Parquet logs via `log4j.properties`.
This solution is inspired by https://github.com/Parquet/parquet-mr/issues/390#issuecomment-46064909.
Author: Cheng Lian <lian@databricks.com>
Closes #8196 from liancheng/spark-8118/redirect-parquet-jul.
This PR uses `JDBCRDD.getConnector` to load JDBC driver before creating connection in `DataFrameReader.jdbc` and `DataFrameWriter.jdbc`.
Author: zsxwing <zsxwing@gmail.com>
Closes #8232 from zsxwing/SPARK-10036 and squashes the following commits:
adf75de [zsxwing] Add extraOptions to the connection properties
57f59d4 [zsxwing] Load JDBC driver in DataFrameReader.jdbc and DataFrameWriter.jdbc
When inserting data into a `HadoopFsRelation`, if `commitTask()` of the writer container fails, `abortTask()` will be invoked. However, both `commitTask()` and `abortTask()` try to close the output writer(s). The problem is that closing underlying writers may not be an idempotent operation. E.g., `ParquetRecordWriter.close()` throws NPE when called twice.
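One possible way to guard against double closing is sketched below. This is not necessarily how the patch does it. `FragileWriter` and `CloseOnce` are hypothetical classes, with `FragileWriter` merely imitating a writer whose second `close()` throws an NPE.

```scala
// Hypothetical writer whose close() is not idempotent: the second call hits a null buffer.
class FragileWriter {
  private var buffer: java.lang.StringBuilder = new java.lang.StringBuilder
  def write(s: String): Unit = { buffer.append(s) }
  def close(): Unit = {
    buffer.setLength(0) // throws NullPointerException once buffer has been nulled out
    buffer = null
  }
}

// Hypothetical guard: forwards close() to the underlying writer at most once.
class CloseOnce(underlying: FragileWriter) {
  private var closed = false
  def write(s: String): Unit = underlying.write(s)
  def close(): Unit = {
    if (!closed) {
      // Flip the flag first so that a failing close() is not retried by a later abort.
      closed = true
      underlying.close()
    }
  }
}
```

With such a guard, a commit path and an abort path can both call `close()` without tripping over the non-idempotent writer. Flipping the flag before delegating is a design choice: a failed close is then never attempted again.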
Author: Cheng Lian <lian@databricks.com>
Closes #8236 from liancheng/spark-7837/double-closing.
In case of schema merging, we only handled first level fields when converting Parquet groups to `InternalRow`s. Nested struct fields are not properly handled.
For example, the schema of a Parquet file to be read can be:
```
message individual {
required group f1 {
optional binary f11 (utf8);
}
}
```
while the global schema is:
```
message global {
required group f1 {
optional binary f11 (utf8);
optional int32 f12;
}
}
```
This PR fixes this issue by padding missing fields when creating actual converters.
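The padding idea can be illustrated with a toy model. This is not the converter code: `Schema`, `Leaf`, `Struct`, and `PadSketch` are hypothetical, and rows read from a file are modelled as nested `Map`s keyed by field name.

```scala
// Toy schema model; hypothetical, unrelated to the real converter classes.
sealed trait Schema
case object Leaf extends Schema
final case class Struct(fields: List[(String, Schema)]) extends Schema

object PadSketch {
  /**
   * Lays out `value` (nested Maps as read from one file) according to the global
   * schema, filling fields that the file does not contain with null at every level.
   */
  def pad(global: Schema, value: Any): Any = (global, value) match {
    case (Struct(fields), m: Map[_, _]) =>
      val row = m.asInstanceOf[Map[String, Any]]
      fields.map { case (name, schema) =>
        row.get(name).map(pad(schema, _)).getOrElse(null)
      }
    case _ => value // leaf value: nothing to pad
  }

  // Global schema from the example above: f1 contains f11 and f12.
  val global: Schema = Struct(List("f1" -> Struct(List("f11" -> Leaf, "f12" -> Leaf))))

  // A row from the individual file, which only knows about f1.f11.
  val fileRow: Map[String, Any] = Map("f1" -> Map("f11" -> "a"))
}
```

`PadSketch.pad(PadSketch.global, PadSketch.fileRow)` yields `List(List("a", null))`: the nested `f12` slot is present and null instead of being missing.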
Author: Cheng Lian <lian@databricks.com>
Closes #8228 from liancheng/spark-10005/nested-schema-merging.
The `initialSize` argument of `ColumnBuilder.initialize()` should be the
number of rows rather than bytes. However `InMemoryColumnarTableScan`
passes in a byte size, which makes Spark SQL allocate more memory than
necessary when building in-memory columnar buffers.
Author: Kun Xu <viper_kun@163.com>
Closes #8189 from viper-kun/errorSize.
This pull request creates a new operator interface that is more similar to traditional database query iterators (with open/close/next/get).
These local operators are not currently used anywhere, but will become the basis for SPARK-9983 (local physical operators for query execution).
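For a feel of what such an iterator-style interface looks like, here is a small sketch. The trait and class names (`LocalOp`, `SeqScan`, `Filter`, `LocalOps`) and the exact method signatures are assumptions for illustration, not the interface added by this pull request.

```scala
// Hypothetical operator interface in the open/next/get/close style.
trait LocalOp[A] {
  def open(): Unit    // prepare for iteration
  def next(): Boolean // advance; false once the input is exhausted
  def get(): A        // current row, valid only after next() returned true
  def close(): Unit   // release resources
}

// Leaf operator that scans an in-memory sequence.
final class SeqScan[A](data: Seq[A]) extends LocalOp[A] {
  private var it: Iterator[A] = Iterator.empty
  private var current: Option[A] = None
  def open(): Unit = { it = data.iterator }
  def next(): Boolean = {
    current = if (it.hasNext) Some(it.next()) else None
    current.isDefined
  }
  def get(): A = current.get
  def close(): Unit = { it = Iterator.empty; current = None }
}

// Operator that only passes on the child rows satisfying `p`.
final class Filter[A](child: LocalOp[A], p: A => Boolean) extends LocalOp[A] {
  def open(): Unit = child.open()
  def next(): Boolean = {
    var found = false
    while (!found && child.next()) found = p(child.get())
    found
  }
  def get(): A = child.get()
  def close(): Unit = child.close()
}

object LocalOps {
  // Runs an operator tree to completion and collects its output.
  def drain[A](op: LocalOp[A]): List[A] = {
    op.open()
    val out = List.newBuilder[A]
    try { while (op.next()) out += op.get() } finally op.close()
    out.result()
  }
}
```

Draining `new Filter[Int](new SeqScan(Seq(1, 2, 3, 4, 5, 6)), _ % 2 == 0)` with `LocalOps.drain` returns `List(2, 4, 6)`. Operators compose by wrapping a child, as in a traditional database query iterator.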
cc zsxwing
Author: Reynold Xin <rxin@databricks.com>
Closes #8212 from rxin/SPARK-9984.
This PR enforces dynamic partition column data type requirements by adding analysis rules.
JIRA: https://issues.apache.org/jira/browse/SPARK-8887
Author: Yijie Shen <henry.yijieshen@gmail.com>
Closes #8201 from yjshen/dynamic_partition_columns.
Also alias the ExtractValue instead of wrapping it with UnresolvedAlias when resolving attributes in LogicalPlan, as this alias will be trimmed if it's unnecessary.
Based on #7957 without the changes to mllib, but instead maintaining earlier behavior when using `withColumn` on expressions that already have metadata.
Author: Wenchen Fan <cloud0fan@outlook.com>
Author: Michael Armbrust <michael@databricks.com>
Closes #8215 from marmbrus/pr/7957.
This bug is caused by a wrong column-exist-check in `__getitem__` of the PySpark DataFrame. `DataFrame.apply` accepts not only top-level column names but also nested column names like `a.b`, so we should remove that check from `__getitem__`.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes #8202 from cloud-fan/nested.
In MLlib we sometimes need to set metadata for the new column, so we alias the new column with metadata before calling `withColumn`, and `withColumn` then aliases this column again. Here I overloaded `withColumn` to allow the user to set metadata, just like what we did for `Column.as`.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes #8159 from cloud-fan/withColumn.