Author: Cheng Hao <hao.cheng@intel.com>
Closes#3595 from chenghao-intel/udf0 and squashes the following commits:
a858973 [Cheng Hao] Add 0 arguments support for udf
This is a follow-up of SPARK-4593 (#3443).
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#3581 from ueshin/issues/SPARK-4720 and squashes the following commits:
c3959d4 [Takuya UESHIN] Make Remainder return null if the divider is 0.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#3606 from chenghao-intel/codegen_short_circuit and squashes the following commits:
f466303 [Cheng Hao] short circuit for AND & OR
This PR provides a set Parquet testing API (see trait `ParquetTest`) that enables developers to write more concise test cases. A new set of Parquet test suites built upon this API are added and aim to replace the old `ParquetQuerySuite`. To avoid potential merge conflicts, old testing code are not removed yet. The following classes can be safely removed after most Parquet related PRs are handled:
- `ParquetQuerySuite`
- `ParquetTestData`
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3644)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes#3644 from liancheng/parquet-tests and squashes the following commits:
800e745 [Cheng Lian] Enforces ordering of test output
3bb8731 [Cheng Lian] Refactors HiveParquetSuite
aa2cb2e [Cheng Lian] Decouples ParquetTest and TestSQLContext
7b43a68 [Cheng Lian] Updates ParquetTest Scaladoc
7f07af0 [Cheng Lian] Adds a new set of Parquet test suites
In BroadcastHashJoin, currently it is using a hard coded value (5 minutes) to wait for the execution and broadcast of the small table.
In my opinion, it should be a configurable value since broadcast may exceed 5 minutes in some case, like in a busy/congested network environment.
Author: Jacky Li <jacky.likun@huawei.com>
Closes#3133 from jackylk/timeout-config and squashes the following commits:
733ac08 [Jacky Li] add spark.sql.broadcastTimeout in SQLConf.scala
557acd4 [Jacky Li] switch to sqlContext.getConf
81a5e20 [Jacky Li] make wait time configurable in BroadcastHashJoin
Since `AttributeReference` resolution and `*` expansion are currently in separate rules, each pair requires a full iteration instead of being able to resolve in a single pass. Since its pretty easy to construct queries that have many of these in a row, I combine them into a single rule in this PR.
Author: Michael Armbrust <michael@databricks.com>
Closes#3674 from marmbrus/projectStars and squashes the following commits:
d83d6a1 [Michael Armbrust] Fix resolution of deeply nested Project(attr, Project(Star,...)).
In `HashOuterJoin.scala`, spark read data from both side of join operation before zip them together. It is a waste for memory. We are trying to read data from only one side, put them into a hashmap, and then generate the `JoinedRow` with data from other side one by one.
Currently, we could only do this optimization for `left outer join` and `right outer join`. For `full outer join`, we will do something in another issue.
for
table test_csv contains 1 million records
table dim_csv contains 10 thousand records
SQL:
`select * from test_csv a left outer join dim_csv b on a.key = b.key`
the result is:
master:
```
CSV: 12671 ms
CSV: 9021 ms
CSV: 9200 ms
Current Mem Usage:787788984
```
after patch:
```
CSV: 10382 ms
CSV: 7543 ms
CSV: 7469 ms
Current Mem Usage:208145728
```
Author: tianyi <tianyi@asiainfo-linkage.com>
Author: tianyi <tianyi.asiainfo@gmail.com>
Closes#3375 from tianyi/SPARK-4483 and squashes the following commits:
72a8aec [tianyi] avoid having mutable state stored inside of the task
99c5c97 [tianyi] performance optimization
d2f94d7 [tianyi] fix bug: missing output when the join-key is null.
2be45d1 [tianyi] fix spell bug
1f2c6f1 [tianyi] remove commented codes
a676de6 [tianyi] optimize some codes
9e7d5b5 [tianyi] remove commented old codes
838707d [tianyi] Optimization about reduce memory costs during the HashOuterJoin
The problem is `codegenEnabled` is `val`, but it uses a `val` `sqlContext`, which can be override by subclasses. Here is a simple example to show this issue.
```Scala
scala> :paste
// Entering paste mode (ctrl-D to finish)
abstract class Foo {
protected val sqlContext = "Foo"
val codegenEnabled: Boolean = {
println(sqlContext) // it will call subclass's `sqlContext` which has not yet been initialized.
if (sqlContext != null) {
true
} else {
false
}
}
}
class Bar extends Foo {
override val sqlContext = "Bar"
}
println(new Bar().codegenEnabled)
// Exiting paste mode, now interpreting.
null
false
defined class Foo
defined class Bar
```
We should make `sqlContext` `final` to prevent subclasses from overriding it incorrectly.
Author: zsxwing <zsxwing@gmail.com>
Closes#3660 from zsxwing/SPARK-4812 and squashes the following commits:
1cbb623 [zsxwing] Make `sqlContext` final to prevent subclasses from overriding it incorrectly
Author: jerryshao <saisai.shao@intel.com>
Closes#3698 from jerryshao/SPARK-4847 and squashes the following commits:
4741130 [jerryshao] Make later added extraStrategies effect when calling strategies
Add HTTP protocol support and test cases to spark thrift server, so users can deploy thrift server in both TCP and http mode.
Author: Judy Nash <judynash@microsoft.com>
Author: judynash <judynash@microsoft.com>
Closes#3672 from judynash/master and squashes the following commits:
526315d [Judy Nash] correct spacing on startThriftServer method
31a6520 [Judy Nash] fix code style issues and update sql programming guide format issue
47bf87e [Judy Nash] modify withJdbcStatement method definition to meet less than 100 line length
2e9c11c [Judy Nash] add thrift server in http mode documentation on sql programming guide
1cbd305 [Judy Nash] Merge remote-tracking branch 'upstream/master'
2b1d312 [Judy Nash] updated http thrift server support based on feedback
377532c [judynash] add HTTP protocol spark thrift server
This enables assertions for the Maven and SBT build, but overrides the Hive module to not enable assertions.
Author: Sean Owen <sowen@cloudera.com>
Closes#3692 from srowen/SPARK-4814 and squashes the following commits:
caca704 [Sean Owen] Disable assertions just for Hive
f71e783 [Sean Owen] Enable assertions for SBT and Maven build
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#3676 from adrian-wang/countexpr and squashes the following commits:
dc5765b [Daoyuan Wang] add rule to fold count(expr) if expr is not null
When I use Parquet File as a output file using ParquetOutputFormat#getDefaultWorkFile, the file name is not zero padded while RDD#saveAsText does zero padding.
Author: Sasaki Toru <sasakitoa@nttdata.co.jp>
Closes#3602 from sasakitoa/parquet-zeroPadding and squashes the following commits:
6b0e58f [Sasaki Toru] Merge branch 'master' of git://github.com/apache/spark into parquet-zeroPadding
20dc79d [Sasaki Toru] Fixed the name of Parquet File generated by AppendingParquetOutputFormat
Fix bug when query like:
```
test("save join to table") {
val testData = sparkContext.parallelize(1 to 10).map(i => TestData(i, i.toString))
sql("CREATE TABLE test1 (key INT, value STRING)")
testData.insertInto("test1")
sql("CREATE TABLE test2 (key INT, value STRING)")
testData.insertInto("test2")
testData.insertInto("test2")
sql("SELECT COUNT(a.value) FROM test1 a JOIN test2 b ON a.key = b.key").saveAsTable("test")
checkAnswer(
table("test"),
sql("SELECT COUNT(a.value) FROM test1 a JOIN test2 b ON a.key = b.key").collect().toSeq)
}
```
Author: Cheng Hao <hao.cheng@intel.com>
Closes#3673 from chenghao-intel/spark_4825 and squashes the following commits:
e8cbd56 [Cheng Hao] alternate the pattern matching order for logical plan:CTAS
e004895 [Cheng Hao] fix bug
This is fixed by SPARK-4318 #3184
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#3445 from adrian-wang/emptyaggr and squashes the following commits:
982575e [Daoyuan Wang] enable empty aggr test case
So the optimizations are not valid. Also I think the optimization here is rarely encounter, so removing them will not have influence on performance.
Can we merge #3445 before I add a comparison test case from this?
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#3675 from adrian-wang/sumempty and squashes the following commits:
42df763 [Daoyuan Wang] sum and avg on empty table should always return null
a follow up of #3547
/cc marmbrus
Author: scwf <wangfei1@huawei.com>
Closes#3563 from scwf/rnc and squashes the following commits:
9395661 [scwf] remove unnecessary condition
Inserting data of type including `ArrayType.containsNull == false` or `MapType.valueContainsNull == false` or `StructType.fields.exists(_.nullable == false)` into Hive table will fail because `Cast` inserted by `HiveMetastoreCatalog.PreInsertionCasts` rule of `Analyzer` can't handle these types correctly.
Complex type cast rule proposal:
- Cast for non-complex types should be able to cast the same as before.
- Cast for `ArrayType` can evaluate if
- Element type can cast
- Nullability rule doesn't break
- Cast for `MapType` can evaluate if
- Key type can cast
- Nullability for casted key type is `false`
- Value type can cast
- Nullability rule for value type doesn't break
- Cast for `StructType` can evaluate if
- The field size is the same
- Each field can cast
- Nullability rule for each field doesn't break
- The nested structure should be the same.
Nullability rule:
- If the casted type is `nullable == true`, the target nullability should be `true`
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#3150 from ueshin/issues/SPARK-4293 and squashes the following commits:
e935939 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-4293
ba14003 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-4293
8999868 [Takuya UESHIN] Fix a test title.
f677c30 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-4293
287f410 [Takuya UESHIN] Add tests to insert data of types ArrayType / MapType / StructType with nullability is false into Hive table.
4f71bb8 [Takuya UESHIN] Make Cast be able to handle complex types.
fix a TODO in Analyzer:
// TODO: pass this in as a parameter
val fixedPoint = FixedPoint(100)
Author: Jacky Li <jacky.likun@huawei.com>
Closes#3499 from jackylk/config and squashes the following commits:
4c1252c [Jacky Li] fix scalastyle
820f460 [Jacky Li] pass maxIterations in as a parameter
Whitelist more hive unit test:
"create_like_tbl_props"
"udf5"
"udf_java_method"
"decimal_1"
"udf_pmod"
"udf_to_double"
"udf_to_float"
"udf7" (this will fail in Hive 0.12)
Author: Cheng Hao <hao.cheng@intel.com>
Closes#3522 from chenghao-intel/unittest and squashes the following commits:
f54e4c7 [Cheng Hao] work around to clean up the hive.table.parameters.default in reset
16fee22 [Cheng Hao] Whitelist more unittest
Unpersist a uncached RDD, will not raise exception, for example:
```
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
distData.unpersist(true)
```
But the `SchemaRDD` will raise exception if the `SchemaRDD` is not cached. Since `SchemaRDD` is the subclasses of the `RDD`, we should follow the same behavior.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#3572 from chenghao-intel/try_uncache and squashes the following commits:
50a7a89 [Cheng Hao] SchemaRDD.unpersist() should not raise exception if it is not persisted
Modified ScalaReflection.schemaFor to take primary constructor of Product when there are multiple constructors. Added test to suite which failed before but works now.
Needed for [https://github.com/apache/spark/pull/3637]
CC: marmbrus
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#3646 from jkbradley/sql-reflection and squashes the following commits:
796b2e4 [Joseph K. Bradley] Modified ScalaReflection.schemaFor to take primary constructor of Product when there are multiple constructors. Added test to suite which failed before but works now.
Different from Hive 0.12.0, in Hive 0.13.1 UDF/UDAF/UDTF (aka Hive function) objects should only be initialized once on the driver side and then serialized to executors. However, not all function objects are serializable (e.g. GenericUDF doesn't implement Serializable). Hive 0.13.1 solves this issue with Kryo or XML serializer. Several utility ser/de methods are provided in class o.a.h.h.q.e.Utilities for this purpose. In this PR we chose Kryo for efficiency. The Kryo serializer used here is created in Hive. Spark Kryo serializer wasn't used because there's no available SparkConf instance.
Author: Cheng Hao <hao.cheng@intel.com>
Author: Cheng Lian <lian@databricks.com>
Closes#3640 from chenghao-intel/udf_serde and squashes the following commits:
8e13756 [Cheng Hao] Update the comment
74466a3 [Cheng Hao] refactor as feedbacks
396c0e1 [Cheng Hao] avoid Simple UDF to be serialized
e9c3212 [Cheng Hao] update the comment
19cbd46 [Cheng Hao] support udf instance ser/de after initialization
This is the code refactor and follow ups for #2570
Author: Cheng Hao <hao.cheng@intel.com>
Closes#3336 from chenghao-intel/createtbl and squashes the following commits:
3563142 [Cheng Hao] remove the unused variable
e215187 [Cheng Hao] eliminate the compiling warning
4f97f14 [Cheng Hao] fix bug in unittest
5d58812 [Cheng Hao] revert the API changes
b85b620 [Cheng Hao] fix the regression of temp tabl not found in CTAS
Author: Jacky Li <jacky.likun@huawei.com>
Closes#3630 from jackylk/remove and squashes the following commits:
150e7e0 [Jacky Li] remove unnecessary import
Enables Kryo and disables reference tracking by default in Spark SQL Thrift server. Configurations explicitly defined by users in `spark-defaults.conf` are respected (the Thrift server is started by `spark-submit`, which handles configuration properties properly).
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3621)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes#3621 from liancheng/kryo-by-default and squashes the following commits:
70c2775 [Cheng Lian] Enables Kryo by default in Spark SQL Thrift server
Author: Michael Armbrust <michael@databricks.com>
Closes#3613 from marmbrus/parquetPartitionPruning and squashes the following commits:
4f138f8 [Michael Armbrust] Use catalyst for partition pruning in newParquet.
Just found this instance while doing some jstack-based profiling of a Spark SQL job. It is very unlikely that this is causing much of a perf issue anywhere, but it is unnecessarily suboptimal.
Author: Aaron Davidson <aaron@databricks.com>
Closes#3593 from aarondav/seq-opt and squashes the following commits:
962cdfc [Aaron Davidson] [SQL] Minor: Avoid calling Seq#size in a loop
Author: Jacky Li <jacky.likun@huawei.com>
Closes#3585 from jackylk/remove and squashes the following commits:
045423d [Jacky Li] remove unnecessary import
This is a very small fix that catches one specific exception and returns an empty table. #3441 will address this in a more principled way.
Author: Michael Armbrust <michael@databricks.com>
Closes#3586 from marmbrus/fixEmptyParquet and squashes the following commits:
2781d9f [Michael Armbrust] Handle empty lists for newParquet
04dd376 [Michael Armbrust] Avoid exception when reading empty parquet data through Hive
Using ```executeCollect``` to collect the result, because executeCollect is a custom implementation of collect in spark sql which better than rdd's collect
Author: wangfei <wangfei1@huawei.com>
Closes#3547 from scwf/executeCollect and squashes the following commits:
a5ab68e [wangfei] Revert "adding debug info"
a60d680 [wangfei] fix test failure
0db7ce8 [wangfei] adding debug info
184c594 [wangfei] using executeCollect instead collect
We should use `~` instead of `-` for bitwise NOT.
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#3528 from adrian-wang/symbol and squashes the following commits:
affd4ad [Daoyuan Wang] fix code gen test case
56efb79 [Daoyuan Wang] ensure bitwise NOT over byte and short persist data type
f55fbae [Daoyuan Wang] wrong symbol for bitwise not
SELECT max(1/0) FROM src
would return a very large number, which is obviously not right.
For hive-0.12, hive would return `Infinity` for 1/0, while for hive-0.13.1, it is `NULL` for 1/0.
I think it is better to keep our behavior with newer Hive version.
This PR ensures that when the divider is 0, the result of expression should be NULL, same with hive-0.13.1
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#3443 from adrian-wang/div and squashes the following commits:
2e98677 [Daoyuan Wang] fix code gen for divide 0
85c28ba [Daoyuan Wang] temp
36236a5 [Daoyuan Wang] add test cases
6f5716f [Daoyuan Wang] fix comments
cee92bd [Daoyuan Wang] avoid evaluation 2 times
22ecd9a [Daoyuan Wang] fix style
cf28c58 [Daoyuan Wang] divide fix
2dfe50f [Daoyuan Wang] return null when divider is 0 of Double type
val jsc = new org.apache.spark.api.java.JavaSparkContext(sc)
val jhc = new org.apache.spark.sql.hive.api.java.JavaHiveContext(jsc)
val nrdd = jhc.hql("select null from spark_test.for_test")
println(nrdd.schema)
Then the error is thrown as follows:
scala.MatchError: NullType (of class org.apache.spark.sql.catalyst.types.NullType$)
at org.apache.spark.sql.types.util.DataTypeConversions$.asJavaDataType(DataTypeConversions.scala:43)
Author: YanTangZhai <hakeemzhai@tencent.com>
Author: yantangzhai <tyz0303@163.com>
Author: Michael Armbrust <michael@databricks.com>
Closes#3538 from YanTangZhai/MatchNullType and squashes the following commits:
e052dff [yantangzhai] [SPARK-4676] [SQL] JavaSchemaRDD.schema may throw NullType MatchError if sql has null
4b4bb34 [yantangzhai] [SPARK-4676] [SQL] JavaSchemaRDD.schema may throw NullType MatchError if sql has null
896c7b7 [yantangzhai] fix NullType MatchError in JavaSchemaRDD when sql has null
6e643f8 [YanTangZhai] Merge pull request #11 from apache/master
e249846 [YanTangZhai] Merge pull request #10 from apache/master
d26d982 [YanTangZhai] Merge pull request #9 from apache/master
76d4027 [YanTangZhai] Merge pull request #8 from apache/master
03b62b0 [YanTangZhai] Merge pull request #7 from apache/master
8a00106 [YanTangZhai] Merge pull request #6 from apache/master
cbcba66 [YanTangZhai] Merge pull request #3 from apache/master
cdef539 [YanTangZhai] Merge pull request #1 from apache/master
Author: baishuo <vc_java@hotmail.com>
Closes#3526 from baishuo/master-trycatch and squashes the following commits:
d446e14 [baishuo] correct the code style
b36bf96 [baishuo] correct the code style
ae0e447 [baishuo] add finally to avoid resource leak
Spark SQL has embeded sqrt and abs but DSL doesn't support those functions.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#3401 from sarutak/dsl-missing-operator and squashes the following commits:
07700cf [Kousuke Saruta] Modified Literal(null, NullType) to Literal(null) in DslQuerySuite
8f366f8 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into dsl-missing-operator
1b88e2e [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into dsl-missing-operator
0396f89 [Kousuke Saruta] Added sqrt and abs to Spark SQL DSL
A very small nit update.
Author: Reynold Xin <rxin@databricks.com>
Closes#3552 from rxin/license-header and squashes the following commits:
df8d1a4 [Reynold Xin] Indent license header properly for interfaces.scala.
In addition, using `s.isEmpty` to eliminate the string comparison.
Author: zsxwing <zsxwing@gmail.com>
Closes#3132 from zsxwing/SPARK-4268 and squashes the following commits:
358e235 [zsxwing] Improvement of allCaseVersions
Support view definition like
CREATE VIEW view3(valoo)
TBLPROPERTIES ("fear" = "factor")
AS SELECT upper(value) FROM src WHERE key=86;
[valoo as the alias of upper(value)]. This is missing part of SPARK-4239, for a fully view support.
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#3396 from adrian-wang/viewcolumn and squashes the following commits:
4d001d0 [Daoyuan Wang] support view with column alias
Author: ravipesala <ravindra.pesala@huawei.com>
Closes#3516 from ravipesala/ddl_doc and squashes the following commits:
d101fdf [ravipesala] Style issues fixed
d2238cd [ravipesala] Corrected documentation
Supporting multi column support in countDistinct function like count(distinct c1,c2..) in Spark SQL
Author: ravipesala <ravindra.pesala@huawei.com>
Author: Michael Armbrust <michael@databricks.com>
Closes#3511 from ravipesala/countdistinct and squashes the following commits:
cc4dbb1 [ravipesala] style
070e12a [ravipesala] Supporting multi column support in count(distinct c1,c2..) in Spark SQL
Remove hardcoding max and min values for types. Let BigDecimal do checking type compatibility.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#3208 from viirya/more_numericLit and squashes the following commits:
e9834b4 [Liang-Chi Hsieh] Remove byte and short types for number literal.
1bd1825 [Liang-Chi Hsieh] Fix Indentation and make the modification clearer.
cf1a997 [Liang-Chi Hsieh] Modified for comment to add a rule of analysis that adds a cast.
91fe489 [Liang-Chi Hsieh] add Byte and Short.
1bdc69d [Liang-Chi Hsieh] Let BigDecimal do checking type compatibility.
group tab is missing for scaladoc
Author: Jacky Li <jacky.likun@gmail.com>
Closes#3458 from jackylk/patch-7 and squashes the following commits:
0121a70 [Jacky Li] add @group tab in limit() and count()
Author: zsxwing <zsxwing@gmail.com>
Closes#3521 from zsxwing/SPARK-4661 and squashes the following commits:
03cbe3f [zsxwing] Minor code and docs cleanup
This PR disables HiveThriftServer2 asynchronous execution by setting `runInBackground` argument in `ExecuteStatementOperation` to `false`, and reverting `SparkExecuteStatementOperation.run` in Hive 13 shim to Hive 12 version. This change makes Simba ODBC driver v1.0.0.1000 work.
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3506)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes#3506 from liancheng/disable-async-exec and squashes the following commits:
593804d [Cheng Lian] Disables asynchronous execution in Hive 0.13.1 HiveThriftServer2
```timeTaken``` should not count the time of printing result.
Author: w00228970 <wangfei1@huawei.com>
Closes#3423 from scwf/time-taken-bug and squashes the following commits:
da7e102 [w00228970] compute time taken correctly
Re-implement the Python broadcast using file:
1) serialize the python object using cPickle, write into disks.
2) Create a wrapper in JVM (for the dumped file), it read data from during serialization
3) Using TorrentBroadcast or HttpBroadcast to transfer the data (compressed) into executors
4) During deserialization, writing the data into disk.
5) Passing the path into Python worker, read data from disk and unpickle it into python object, until the first access.
It fixes the performance regression introduced in #2659, has similar performance as 1.1, but support object larger than 2G, also improve the memory efficiency (only one compressed copy in driver and executor).
Testing with a 500M broadcast and 4 tasks (excluding the benefit from reused worker in 1.2):
name | 1.1 | 1.2 with this patch | improvement
---------|--------|---------|--------
python-broadcast-w-bytes | 25.20 | 9.33 | 170.13% |
python-broadcast-w-set | 4.13 | 4.50 | -8.35% |
Testing with 100 tasks (16 CPUs):
name | 1.1 | 1.2 with this patch | improvement
---------|--------|---------|--------
python-broadcast-w-bytes | 38.16 | 8.40 | 353.98%
python-broadcast-w-set | 23.29 | 9.59 | 142.80%
Author: Davies Liu <davies@databricks.com>
Closes#3417 from davies/pybroadcast and squashes the following commits:
50a58e0 [Davies Liu] address comments
b98de1d [Davies Liu] disable gc while unpickle
e5ee6b9 [Davies Liu] support large string
09303b8 [Davies Liu] read all data into memory
dde02dd [Davies Liu] improve performance of python broadcast
When we use ORDER BY clause, at first, attributes referenced by projection are resolved (1).
And then, attributes referenced at ORDER BY clause are resolved (2).
But when resolving attributes referenced at ORDER BY clause, the resolution result generated in (1) is discarded so for example, following query fails.
SELECT c1 + c2 FROM mytable ORDER BY c1;
The query above fails because when resolving the attribute reference 'c1', the resolution result of 'c2' is discarded.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#3363 from sarutak/SPARK-4487 and squashes the following commits:
fd314f3 [Kousuke Saruta] Fixed attribute resolution logic in Analyzer
6e60c20 [Kousuke Saruta] Fixed conflicts
cb5b7e9 [Kousuke Saruta] Added test case for SPARK-4487
282d529 [Kousuke Saruta] Fixed attributes reference resolution error
b6123e6 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into concat-feature
317b7fb [Kousuke Saruta] WIP