Author: Michael Armbrust <michael@databricks.com>
Closes#2515 from marmbrus/jdbcExistingContext and squashes the following commits:
7866fad [Michael Armbrust] Allows starting a JDBC server on an existing context.
User may be confused for the HQL logging & configurations, we'd better provide a default templates.
Both files are copied from Hive.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#2263 from chenghao-intel/hive_template and squashes the following commits:
53bffa9 [Cheng Hao] Remove the hive-log4j.properties initialization
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#2396 from adrian-wang/selectnull and squashes the following commits:
2458229 [Daoyuan Wang] rebase solution
This will allow us to take advantage of things like the spark.defaults file.
Author: Michael Armbrust <michael@databricks.com>
Closes#2493 from marmbrus/copySparkConf and squashes the following commits:
0bd1377 [Michael Armbrust] Copy SQL configuration from SparkConf when a SQLContext is created.
It returns null metadata from parquet if querying on empty parquet file while calculating splits.So added null check and returns the empty splits.
Author : ravipesala ravindra.pesalahuawei.com
Author: ravipesala <ravindra.pesala@huawei.com>
Closes#2456 from ravipesala/SPARK-3536 and squashes the following commits:
1e81a50 [ravipesala] Fixed the issue when querying on empty parquet file.
Since we have moved to `ConventionHelper`, it is quite easy to avoid call `javaClassToDataType` in hive simple udf. This will solve SPARK-3582.
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#2506 from adrian-wang/spark3582 and squashes the following commits:
450c28e [Daoyuan Wang] not limit argument type for hive simple udf
this patch fixes timestamp smaller than 0 and cast int as timestamp
select cast(1000 as timestamp) from src limit 1;
should return 1970-01-01 00:00:01, but we now take it as 1000 seconds.
also, current implementation has bug when the time is before 1970-01-01 00:00:00.
rxin marmbrus chenghao-intel
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#2458 from adrian-wang/timestamp and squashes the following commits:
4274b1d [Daoyuan Wang] set test not related to timezone
1234f66 [Daoyuan Wang] fix timestamp smaller than 0 and cast int as timestamp
**This PR introduces a subtle change in semantics for HiveContext when using the results in Python or Scala. Specifically, while resolution remains case insensitive, it is now case preserving.**
_This PR is a follow up to #2293 (and to a lesser extent #2262#2334)._
In #2293 the catalog was changed to store analyzed logical plans instead of unresolved ones. While this change fixed the reported bug (which was caused by yet another instance of us forgetting to put in a `LowerCaseSchema` operator) it had the consequence of breaking assumptions made by `MultiInstanceRelation`. Specifically, we can't replace swap out leaf operators in a tree without rewriting changed expression ids (which happens when you self join the same RDD that has been registered as a temp table).
In this PR, I instead remove the need to insert `LowerCaseSchema` operators at all, by moving the concern of matching up identifiers completely into analysis. Doing so allows the test cases from both #2293 and #2262 to pass at the same time (and likely fixes a slew of other "unknown unknown" bugs).
While it is rolled back in this PR, storing the analyzed plan might actually be a good idea. For instance, it is kind of confusing if you register a temporary table, change the case sensitivity of resolution and now you can't query that table anymore. This can be addressed in a follow up PR.
Follow-ups:
- Configurable case sensitivity
- Consider storing analyzed plans for temp tables
Author: Michael Armbrust <michael@databricks.com>
Closes#2382 from marmbrus/lowercase and squashes the following commits:
c21171e [Michael Armbrust] Ensure the resolver is used for field lookups and ensure that case insensitive resolution is still case preserving.
d4320f1 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into lowercase
2de881e [Michael Armbrust] Address comments.
219805a [Michael Armbrust] style
5b93711 [Michael Armbrust] Replace LowerCaseSchema with Resolver.
This helps to replace shuffled hash joins with broadcast hash joins in some cases.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes#2468 from liancheng/more-stats and squashes the following commits:
32687dc [Cheng Lian] Moved the test case to PlannerSuite
5595a91 [Cheng Lian] Removes debugging code
73faf69 [Cheng Lian] Test case for auto choosing broadcast hash join
f30fe1d [Cheng Lian] Adds sizeInBytes estimation for Limit when all output types are native types
This is just another solution to SPARK-3485, in addition to PR #2355
In this patch, we will use ConventionHelper and FunctionRegistry to invoke a simple udf evaluation, which rely more on hive, but much cleaner and safer.
We can discuss which one is better.
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#2407 from adrian-wang/simpleudf and squashes the following commits:
15762d2 [Daoyuan Wang] add posmod test which would fail the test but now ok
0d69eb4 [Daoyuan Wang] another way to pass to hive simple udf
Author: Sandy Ryza <sandy@cloudera.com>
Closes#2460 from sryza/sandy-spark-3605 and squashes the following commits:
09d940b [Sandy Ryza] SPARK-3605. Fix typo in SchemaRDD.
This feature allows user to add cache table from the select query.
Example : ```CACHE TABLE testCacheTable AS SELECT * FROM TEST_TABLE```
Spark takes this type of SQL as command and it does lazy caching just like ```SQLContext.cacheTable```, ```CACHE TABLE <name>``` does.
It can be executed from both SQLContext and HiveContext.
Recreated the pull request after rebasing with master.And fixed all the comments raised in previous pull requests.
https://github.com/apache/spark/pull/2381https://github.com/apache/spark/pull/2390
Author : ravipesala ravindra.pesalahuawei.com
Author: ravipesala <ravindra.pesala@huawei.com>
Closes#2397 from ravipesala/SPARK-2594 and squashes the following commits:
a5f0beb [ravipesala] Simplified the code as per Admin comment.
8059cd2 [ravipesala] Changed the behaviour from eager caching to lazy caching.
d6e469d [ravipesala] Code review comments by Admin are handled.
c18aa38 [ravipesala] Merge remote-tracking branch 'remotes/ravipesala/Add-Cache-table-as' into SPARK-2594
394d5ca [ravipesala] Changed style
fb1759b [ravipesala] Updated as per Admin comments
8c9993c [ravipesala] Changed the style
d8b37b2 [ravipesala] Updated as per the comments by Admin
bc0bffc [ravipesala] Merge remote-tracking branch 'ravipesala/Add-Cache-table-as' into Add-Cache-table-as
e3265d0 [ravipesala] Updated the code as per the comments by Admin in pull request.
724b9db [ravipesala] Changed style
aaf5b59 [ravipesala] Added comment
dc33895 [ravipesala] Updated parser to support add cache table command
b5276b2 [ravipesala] Updated parser to support add cache table command
eebc0c1 [ravipesala] Add CACHE TABLE <name> AS SELECT ...
6758f80 [ravipesala] Changed style
7459ce3 [ravipesala] Added comment
13c8e27 [ravipesala] Updated parser to support add cache table command
4e858d8 [ravipesala] Updated parser to support add cache table command
b803fc8 [ravipesala] Add CACHE TABLE <name> AS SELECT ...
When do the query like:
```
select datediff(cast(value as timestamp), cast('2002-03-21 00:00:00' as timestamp)) from src;
```
SparkSQL will raise exception:
```
[info] scala.MatchError: TimestampType (of class org.apache.spark.sql.catalyst.types.TimestampType$)
[info] at org.apache.spark.sql.catalyst.expressions.Cast.castToTimestamp(Cast.scala:77)
[info] at org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:251)
[info] at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247)
[info] at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263)
[info] at org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$5$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:217)
[info] at org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$5$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:210)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$2.apply(TreeNode.scala:180)
[info] at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[info] at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
```
Author: Cheng Hao <hao.cheng@intel.com>
Closes#2368 from chenghao-intel/cast_exception and squashes the following commits:
5c9c3a5 [Cheng Hao] make more clear code
49dfc50 [Cheng Hao] Add no-op for Cast and revert the position of SimplifyCasts
b804abd [Cheng Hao] Add unit test to show the failure in identical data type casting
330a5c8 [Cheng Hao] Update Code based on comments
b834ed4 [Cheng Hao] Fix bug of HiveSimpleUDF with unnecessary type cast which cause exception in constant folding
SchemaRDD overrides RDD functions, including collect, count, and take, with optimized versions making use of the query optimizer. The java and python interface classes wrapping SchemaRDD need to ensure the optimized versions are called as well. This patch overrides relevant calls in the python and java interfaces with optimized versions.
Adds a new Row serialization pathway between python and java, based on JList[Array[Byte]] versus the existing RDD[Array[Byte]]. I wasn’t overjoyed about doing this, but I noticed that some QueryPlans implement optimizations in executeCollect(), which outputs an Array[Row] rather than the typical RDD[Row] that can be shipped to python using the existing serialization code. To me it made sense to ship the Array[Row] over to python directly instead of converting it back to an RDD[Row] just for the purpose of sending the Rows to python using the existing serialization code.
Author: Aaron Staple <aaron.staple@gmail.com>
Closes#1592 from staple/SPARK-2314 and squashes the following commits:
89ff550 [Aaron Staple] Merge with master.
6bb7b6c [Aaron Staple] Fix typo.
b56d0ac [Aaron Staple] [SPARK-2314][SQL] Override count in JavaSchemaRDD, forwarding to SchemaRDD's count.
0fc9d40 [Aaron Staple] Fix comment typos.
f03cdfa [Aaron Staple] [SPARK-2314][SQL] Override collect and take in sql.py, forwarding to SchemaRDD's collect.
Throwing an error in the constructor makes it possible to run queries, even when there is no actual ambiguity. Remove this check in favor of throwing an error in analysis when they query is actually is ambiguous.
Also took the opportunity to add test cases that would have caught a subtle bug in my first attempt at fixing this and refactor some other test code.
Author: Michael Armbrust <michael@databricks.com>
Closes#2209 from marmbrus/sameNameStruct and squashes the following commits:
729cca4 [Michael Armbrust] Better tests.
a003aeb [Michael Armbrust] Remove error (it'll be caught in analysis).
This PR aims to support reading top level JSON arrays and take every element in such an array as a row (an empty array will not generate a row).
JIRA: https://issues.apache.org/jira/browse/SPARK-3308
Author: Yin Huai <huai@cse.ohio-state.edu>
Closes#2400 from yhuai/SPARK-3308 and squashes the following commits:
990077a [Yin Huai] Handle top level JSON arrays.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#2392 from chenghao-intel/trim and squashes the following commits:
e52024f [Cheng Hao] trim the string message
SPARK-3039: Adds the maven property "avro.mapred.classifier" to build spark-assembly with avro-mapred with support for the new Hadoop API. Sets this property to hadoop2 for Hadoop 2 profiles.
I am not very familiar with maven, nor do I know whether this potentially breaks something in the hive part of spark. There might be a more elegant way of doing this.
Author: Bertrand Bossy <bertrandbossy@gmail.com>
Closes#1945 from bbossy/SPARK-3039 and squashes the following commits:
c32ce59 [Bertrand Bossy] SPARK-3039: Allow spark to be built using avro-mapred for hadoop2
Author: Michael Armbrust <michael@databricks.com>
Closes#2164 from marmbrus/shufflePartitions and squashes the following commits:
0da1e8c [Michael Armbrust] test hax
ef2d985 [Michael Armbrust] more test hacks.
2dabae3 [Michael Armbrust] more test fixes
0bdbf21 [Michael Armbrust] Make parquet tests less order dependent
b42eeab [Michael Armbrust] increase test parallelism
80453d5 [Michael Armbrust] Decrease partitions when testing
This is a major refactoring of the in-memory columnar storage implementation, aims to eliminate boxing costs from critical paths (building/accessing column buffers) as much as possible. The basic idea is to refactor all major interfaces into a row-based form and use them together with `SpecificMutableRow`. The difficult part is how to adapt all compression schemes, esp. `RunLengthEncoding` and `DictionaryEncoding`, to this design. Since in-memory compression is disabled by default for now, and this PR should be strictly better than before no matter in-memory compression is enabled or not, maybe I'll finish that part in another PR.
**UPDATE** This PR also took the chance to optimize `HiveTableScan` by
1. leveraging `SpecificMutableRow` to avoid boxing cost, and
1. building specific `Writable` unwrapper functions a head of time to avoid per row pattern matching and branching costs.
TODO
- [x] Benchmark
- [ ] ~~Eliminate boxing costs in `RunLengthEncoding`~~ (left to future PRs)
- [ ] ~~Eliminate boxing costs in `DictionaryEncoding` (seems not easy to do without specializing `DictionaryEncoding` for every supported column type)~~ (left to future PRs)
## Micro benchmark
The benchmark uses a 10 million line CSV table consists of bytes, shorts, integers, longs, floats and doubles, measures the time to build the in-memory version of this table, and the time to scan the whole in-memory table.
Benchmark code can be found [here](https://gist.github.com/liancheng/fe70a148de82e77bd2c8#file-hivetablescanbenchmark-scala). Script used to generate the input table can be found [here](https://gist.github.com/liancheng/fe70a148de82e77bd2c8#file-tablegen-scala).
Speedup:
- Hive table scanning + column buffer building: **18.74%**
The original benchmark uses 1K as in-memory batch size, when increased to 10K, it can be 28.32% faster.
- In-memory table scanning: **7.95%**
Before:
| Building | Scanning
------- | -------- | --------
1 | 16472 | 525
2 | 16168 | 530
3 | 16386 | 529
4 | 16184 | 538
5 | 16209 | 521
Average | 16283.8 | 528.6
After:
| Building | Scanning
------- | -------- | --------
1 | 13124 | 458
2 | 13260 | 529
3 | 12981 | 463
4 | 13214 | 483
5 | 13583 | 500
Average | 13232.4 | 486.6
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes#2327 from liancheng/prevent-boxing/unboxing and squashes the following commits:
4419fe4 [Cheng Lian] Addressing comments
e5d2cf2 [Cheng Lian] Bug fix: should call setNullAt when field value is null to avoid NPE
8b8552b [Cheng Lian] Only checks for partition batch pruning flag once
489f97b [Cheng Lian] Bug fix: TableReader.fillObject uses wrong ordinals
97bbc4e [Cheng Lian] Optimizes hive.TableReader by by providing specific Writable unwrappers a head of time
3dc1f94 [Cheng Lian] Minor changes to eliminate row object creation
5b39cb9 [Cheng Lian] Lowers log level of compression scheme details
f2a7890 [Cheng Lian] Use SpecificMutableRow in InMemoryColumnarTableScan to avoid boxing
9cf30b0 [Cheng Lian] Added row based ColumnType.append/extract
456c366 [Cheng Lian] Made compression decoder row based
edac3cd [Cheng Lian] Makes ColumnAccessor.extractSingle row based
8216936 [Cheng Lian] Removes boxing cost in IntDelta and LongDelta by providing specialized implementations
b70d519 [Cheng Lian] Made some in-memory columnar storage interfaces row-based
This is a follow up of #2352. Now we can finally remove the evil "MINOR HACK", which covered up the eldest bug in the history of Spark SQL (see details [here](https://github.com/apache/spark/pull/2352#issuecomment-55440621)).
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes#2377 from liancheng/remove-evil-minor-hack and squashes the following commits:
0869c78 [Cheng Lian] Removes the evil MINOR HACK
Please refer to the JIRA ticket for details.
**NOTE** We should check all test suites that do similar initialization-like side effects in their constructors. This PR only fixes `ParquetMetastoreSuite` because it breaks our Jenkins Maven build.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes#2375 from liancheng/say-no-to-constructor and squashes the following commits:
0ceb75b [Cheng Lian] Moves test suite setup code to beforeAll rather than in constructor
Logically, we should remove the Hive Table/Database first and then reset the Hive configuration, repoint to the new data warehouse directory etc.
Otherwise it raised exceptions like "Database doesn't not exists: default" in the local testing.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#2352 from chenghao-intel/test_hive and squashes the following commits:
74fd76b [Cheng Hao] eliminate the error log
Author: Cody Koeninger <cody.koeninger@mediacrossing.com>
Closes#2345 from koeninger/SPARK-3462 and squashes the following commits:
5c8d24d [Cody Koeninger] SPARK-3462 remove now-unused parameter
0788691 [Cody Koeninger] SPARK-3462 add tests, handle compatible schema with different aliases, per marmbrus feedback
ef47b3b [Cody Koeninger] SPARK-3462 push down filters and projections into Unions
This PR aims to correctly handle JSON arrays in the type of `ArrayType(...(ArrayType(StructType)))`.
JIRA: https://issues.apache.org/jira/browse/SPARK-3390.
Author: Yin Huai <huai@cse.ohio-state.edu>
Closes#2364 from yhuai/SPARK-3390 and squashes the following commits:
46db418 [Yin Huai] Handle JSON arrays in the type of ArrayType(...(ArrayType(StructType))).
Author: Cheng Hao <hao.cheng@intel.com>
Closes#1846 from chenghao-intel/ctas and squashes the following commits:
56a0578 [Cheng Hao] remove the unused imports
9a57abc [Cheng Hao] Avoid table creation in logical plan analyzing
LogicalPlan contains a ‘resolved’ attribute indicating that all of its execution requirements have been resolved. This attribute is not checked before query execution. The analyzer contains a step to check that all Expressions are resolved, but this is not equivalent to checking all LogicalPlans. In particular, the Union plan’s implementation of ‘resolved’ verifies that the types of its children’s columns are compatible. Because the analyzer does not check that a Union plan is resolved, it is possible to execute a Union plan that outputs different types in the same column. See SPARK-2781 for an example.
This patch adds two checks to the analyzer’s CheckResolution rule. First, each logical plan is checked to see if it is not resolved despite its children being resolved. This allows the ‘problem’ unresolved plan to be included in the TreeNodeException for reporting. Then as a backstop the root plan is checked to see if it is resolved, which recursively checks that the entire plan tree is resolved. Note that the resolved attribute is implemented recursively, and this patch also explicitly checks the resolved attribute on each logical plan in the tree. I assume the query plan trees will not be large enough for this redundant checking to meaningfully impact performance.
Because this patch starts validating that LogicalPlans are resolved before execution, I had to fix some cases where unresolved plans were passing through the analyzer as part of the implementation of the hive query system. In particular, HiveContext applies the CreateTables and PreInsertionCasts, and ExtractPythonUdfs rules manually after the analyzer runs. I moved these rules to the analyzer stage (for hive queries only), in the process completing a code TODO indicating the rules should be moved to the analyzer.
It’s worth noting that moving the CreateTables rule means introducing an analyzer rule with a significant side effect - in this case the side effect is creating a hive table. The rule will only attempt to create a table once even if its batch is executed multiple times, because it converts the InsertIntoCreatedTable plan it matches against into an InsertIntoTable. Additionally, these hive rules must be added to the Resolution batch rather than as a separate batch because hive rules rules may be needed to resolve non-root nodes, leaving the root to be resolved on a subsequent batch iteration. For example, the hive compatibility test auto_smb_mapjoin_14, and others, make use of a query plan where the root is a Union and its children are each a hive InsertIntoTable.
Mixing the custom hive rules with standard analyzer rules initially resulted in an additional failure because of policy differences between spark sql and hive when casting a boolean to a string. Hive casts booleans to strings as “true” / “false” while spark sql casts booleans to strings as “1” / “0” (causing the cast1.q test to fail). This behavior is a result of the BooleanCasts rule in HiveTypeCoercion.scala, and from looking at the implementation of BooleanCasts I think converting to to “1”/“0” is potentially a programming mistake. (If the BooleanCasts rule is disabled, casting produces “true”/“false” instead.) I believe “true” / “false” should be the behavior for spark sql - I changed the behavior so bools are converted to “true”/“false” to be consistent with hive, and none of the existing spark tests failed.
Finally, in some initial testing with hive it appears that an implicit type coercion of boolean to string results in a lowercase string, e.g. CONCAT( TRUE, “” ) -> “true” while an explicit cast produces an all caps string, e.g. CAST( TRUE AS STRING ) -> “TRUE”. The change I’ve made just converts to lowercase strings in all cases. I believe it is at least more correct than the existing spark sql implementation where all Cast expressions become “1” / “0”.
Author: Aaron Staple <aaron.staple@gmail.com>
Closes#1706 from staple/SPARK-2781 and squashes the following commits:
32683c4 [Aaron Staple] Fix compilation failure due to merge.
7c77fda [Aaron Staple] Move ExtractPythonUdfs to Analyzer's extendedRules in HiveContext.
d49bfb3 [Aaron Staple] Address review comments.
915b690 [Aaron Staple] Fix merge issue causing compilation failure.
701dcd2 [Aaron Staple] [SPARK-2781][SQL] Check resolution of LogicalPlans in Analyzer.
In order to read from partitioned Avro files we need to also set the `SERDEPROPERTIES` since `TBLPROPERTIES` are not passed to the initialization. This PR simply adds a test to make sure we don't break this workaround.
Author: Michael Armbrust <michael@databricks.com>
Closes#2340 from marmbrus/avroPartitioned and squashes the following commits:
6b969d6 [Michael Armbrust] fix style
fea2124 [Michael Armbrust] Add test case with workaround for reading partitioned avro files.
First let me write down the current `projections` grammar of spark sql:
expression : orExpression
orExpression : andExpression {"or" andExpression}
andExpression : comparisonExpression {"and" comparisonExpression}
comparisonExpression : termExpression | termExpression "=" termExpression | termExpression ">" termExpression | ...
termExpression : productExpression {"+"|"-" productExpression}
productExpression : baseExpression {"*"|"/"|"%" baseExpression}
baseExpression : expression "[" expression "]" | ... | ident | ...
ident : identChar {identChar | digit} | delimiters | ...
identChar : letter | "_" | "."
delimiters : "," | ";" | "(" | ")" | "[" | "]" | ...
projection : expression [["AS"] ident]
projections : projection { "," projection}
For something like `a.b.c[1]`, it will be parsed as:
<img src="http://img51.imgspice.com/i/03008/4iltjsnqgmtt_t.jpg" border=0>
But for something like `a[1].b`, the current grammar can't parse it correctly.
A simple solution is written in `ParquetQuerySuite#NestedSqlParser`, changed grammars are:
delimiters : "." | "," | ";" | "(" | ")" | "[" | "]" | ...
identChar : letter | "_"
baseExpression : expression "[" expression "]" | expression "." ident | ... | ident | ...
This works well, but can't cover some corner case like `select t.a.b from table as t`:
<img src="http://img51.imgspice.com/i/03008/v2iau3hoxoxg_t.jpg" border=0>
`t.a.b` parsed as `GetField(GetField(UnResolved("t"), "a"), "b")` instead of `GetField(UnResolved("t.a"), "b")` using this new grammar.
However, we can't resolve `t` as it's not a filed, but the whole table.(if we could do this, then `select t from table as t` is legal, which is unexpected)
My solution is:
dotExpressionHeader : ident "." ident
baseExpression : expression "[" expression "]" | expression "." ident | ... | dotExpressionHeader | ident | ...
I passed all test cases under sql locally and add a more complex case.
"arrayOfStruct.field1 to access all values of field1" is not supported yet. Since this PR has changed a lot of code, I will open another PR for it.
I'm not familiar with the latter optimize phase, please correct me if I missed something.
Author: Wenchen Fan <cloud0fan@163.com>
Author: Michael Armbrust <michael@databricks.com>
Closes#2230 from cloud-fan/dot and squashes the following commits:
e1a8898 [Wenchen Fan] remove support for arbitrary nested arrays
ee8a724 [Wenchen Fan] rollback LogicalPlan, support dot operation on nested array type
a58df40 [Michael Armbrust] add regression test for doubly nested data
16bc4c6 [Wenchen Fan] some enhance
95d733f [Wenchen Fan] split long line
dc31698 [Wenchen Fan] SPARK-2096 Correctly parse dot notations
Type Coercion should support every type to have null value
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Author: Michael Armbrust <michael@databricks.com>
Closes#2246 from adrian-wang/spark3363-0 and squashes the following commits:
c6241de [Daoyuan Wang] minor code clean
595b417 [Daoyuan Wang] Merge pull request #2 from marmbrus/pr/2246
832e640 [Michael Armbrust] reduce code duplication
ef6f986 [Daoyuan Wang] make double boolean miss in jsonRDD compatibleType
c619f0a [Daoyuan Wang] Type Coercion should support every type to have null value
Current implementation will ignore else val type.
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#2245 from adrian-wang/casewhenbug and squashes the following commits:
3332f6e [Daoyuan Wang] remove wrong comment
83b536c [Daoyuan Wang] a comment to trigger retest
d7315b3 [Daoyuan Wang] code improve
eed35fc [Daoyuan Wang] bug in casewhen resolve
This resolves https://issues.apache.org/jira/browse/SPARK-3395
Author: Eric Liang <ekl@google.com>
Closes#2266 from ericl/spark-3395 and squashes the following commits:
7f2b6f0 [Eric Liang] add regression test
05bd1e4 [Eric Liang] in the dsl, create a new schema instance in each applySchema
`SpecificMutableRow.update` doesn't check for null, and breaks existing `MutableRow` contract.
The tricky part here is that for performance considerations, the `update` method of all subclasses of `MutableValue` doesn't check for null and sets the null bit to false.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes#2325 from liancheng/check-for-null and squashes the following commits:
9366c44 [Cheng Lian] Check for null in SpecificMutableRow.update
Add support for the mathematical function"ABS" and the analytic function "last" to return a subset of the rows satisfying a query within spark sql. Test-cases included.
Author: xinyunh <xinyun.huang@huawei.com>
Author: bomeng <golf8lover>
Closes#2099 from xinyunh/sqlTest and squashes the following commits:
71d15e7 [xinyunh] remove POWER part
8843643 [xinyunh] fix the code style issue
39f0309 [bomeng] Modify the code of POWER and ABS. Move them to the file arithmetic
ff8e51e [bomeng] add abs() function support
7f6980a [xinyunh] fix the bug in 'Last' component
b3df91b [xinyunh] add 'Last' component
Unit test failed due to can not resolve the attribute references. Temporally disable this test case for a quick fixing, otherwise it will block the others.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#2334 from chenghao-intel/unit_test_failure and squashes the following commits:
661f784 [Cheng Hao] temporally disable the failed test case
This fixes some possible spurious test failures in `HiveQuerySuite` by comparing sets of key-value pairs as sets, rather than as lists.
Author: William Benton <willb@redhat.com>
Author: Aaron Davidson <aaron@databricks.com>
Closes#2220 from willb/spark-3329 and squashes the following commits:
3b3e205 [William Benton] Collapse collectResults case match in HiveQuerySuite
6525d8e [William Benton] Handle cases where SET returns Rows of (single) strings
cf11b0e [Aaron Davidson] Fix flakey HiveQuerySuite test
Case insensitivity breaks when unresolved relation contains attributes with uppercase letters in their names, because we store unanalyzed logical plan when registering temp tables while the `CaseInsensitivityAttributeReferences` batch runs before the `Resolution` batch. To fix this issue, we need to store analyzed logical plan.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes#2293 from liancheng/spark-3414 and squashes the following commits:
d9fa1d6 [Cheng Lian] Stores analyzed logical plan when registering a temp table
This patch improves the SQLParser by adding support for BETWEEN conditions
Author: William Benton <willb@redhat.com>
Closes#2295 from willb/sql-between and squashes the following commits:
0016d30 [William Benton] Implement BETWEEN for SQLParser
This resolves https://issues.apache.org/jira/browse/SPARK-3349
Author: Eric Liang <ekl@google.com>
Closes#2262 from ericl/spark-3349 and squashes the following commits:
3e1b05c [Eric Liang] add regression test
ac32723 [Eric Liang] make limit/takeOrdered output SinglePartition
Author: Reynold Xin <rxin@apache.org>
Closes#2281 from rxin/sql-limit-sort and squashes the following commits:
1ef7780 [Reynold Xin] [SPARK-3408] Fixed Limit operator so it works with sort-based shuffle.
Author: GuoQiang Li <witgo@qq.com>
Closes#2268 from witgo/SPARK-3397 and squashes the following commits:
eaf913f [GuoQiang Li] Bump pom.xml version number of master branch to 1.2.0-SNAPSHOT
This is a tiny teeny optimization to move the if check of sortBasedShuffledOn to outside the closures so the closures don't need to pull in the entire Exchange operator object.
Author: Reynold Xin <rxin@apache.org>
Closes#2282 from rxin/SPARK-3409 and squashes the following commits:
1de3f88 [Reynold Xin] [SPARK-3409][SQL] Avoid pulling in Exchange operator itself in Exchange's closures.
This is a tiny fix for getting the value of "mapred.reduce.tasks", which make more sense for the hive user.
As well as the command "set -v", which should output verbose information for all of the key/values.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#2261 from chenghao-intel/set_mapreduce_tasks and squashes the following commits:
653858a [Cheng Hao] show value spark.sql.shuffle.partitions for mapred.reduce.tasks
Adds logical and physical command classes for the "add jar" command.
Note that this PR conflicts with and should be merged after #2215.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes#2242 from liancheng/add-jar and squashes the following commits:
e43a2f1 [Cheng Lian] Updates AddJar according to conventions introduced in #2215
b99107f [Cheng Lian] Added test case for ADD JAR command
095b2c7 [Cheng Lian] Also forward ADD JAR command to Hive
9be031b [Cheng Lian] Trims Jar path string
8195056 [Cheng Lian] Added support for the "add jar" command
We can directly use currentTable there without unnecessary implicit conversion.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#2203 from viirya/direct_use_inmemoryrelation and squashes the following commits:
4741d02 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into direct_use_inmemoryrelation
b671f67 [Liang-Chi Hsieh] Can directly use currentTable there without unnecessary implicit conversion.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#2251 from sarutak/SPARK-3378 and squashes the following commits:
0bfe234 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3378
bb5938f [Kousuke Saruta] Replaced rest of "SparkSQL" with "Spark SQL"
6df66de [Kousuke Saruta] Replaced "SparkSQL" with "Spark SQL"