Commit graph

396 commits

Author SHA1 Message Date
Daoyuan Wang f0c87dc86a [SPARK-3363][SQL] Type Coercion should promote null to all other types.
Type coercion should allow every type to have a null value.

Author: Daoyuan Wang <daoyuan.wang@intel.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #2246 from adrian-wang/spark3363-0 and squashes the following commits:

c6241de [Daoyuan Wang] minor code clean
595b417 [Daoyuan Wang] Merge pull request #2 from marmbrus/pr/2246
832e640 [Michael Armbrust] reduce code duplication
ef6f986 [Daoyuan Wang] make double boolean miss in jsonRDD compatibleType
c619f0a [Daoyuan Wang] Type Coercion should support every type to have null value
2014-09-10 10:48:36 -07:00
Daoyuan Wang a0283300c4 [SPARK-3362][SQL] Fix resolution for casewhen with nulls.
The current implementation ignores the type of the else value.

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #2245 from adrian-wang/casewhenbug and squashes the following commits:

3332f6e [Daoyuan Wang] remove wrong comment
83b536c [Daoyuan Wang] a comment to trigger retest
d7315b3 [Daoyuan Wang] code improve
eed35fc [Daoyuan Wang] bug in casewhen resolve
2014-09-10 10:45:24 -07:00
Eric Liang b734ed0c22 [SPARK-3395] [SQL] DSL sometimes incorrectly reuses attribute ids, breaking queries
This resolves https://issues.apache.org/jira/browse/SPARK-3395

Author: Eric Liang <ekl@google.com>

Closes #2266 from ericl/spark-3395 and squashes the following commits:

7f2b6f0 [Eric Liang] add regression test
05bd1e4 [Eric Liang] in the dsl, create a new schema instance in each applySchema
2014-09-09 23:47:12 -07:00
Cheng Lian c110614b33 [SPARK-3448][SQL] Check for null in SpecificMutableRow.update
`SpecificMutableRow.update` doesn't check for null, and breaks existing `MutableRow` contract.

The tricky part here is that for performance considerations, the `update` method of all subclasses of `MutableValue` doesn't check for null and sets the null bit to false.
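
The split of responsibilities described above can be sketched roughly as follows. `MutableIntSketch` and `RowSketch` are hypothetical stand-ins invented for illustration, not the actual Spark classes: the per-value `update` stays branch-free and always clears the null bit, while the row-level `update` does the null check.

```scala
// Hypothetical stand-ins for illustration; not the real Spark classes.
final class MutableIntSketch {
  var isNull: Boolean = true
  var value: Int = 0

  // Fast path: no null check, always clears the null bit.
  def update(v: Any): Unit = {
    isNull = false
    value = v.asInstanceOf[Int]
  }
}

final class RowSketch(size: Int) {
  private val values = Array.fill(size)(new MutableIntSketch)

  def setNullAt(i: Int): Unit = values(i).isNull = true
  def isNullAt(i: Int): Boolean = values(i).isNull

  // The row-level update performs the null check, so the
  // per-value update can remain free of branches.
  def update(i: Int, v: Any): Unit =
    if (v == null) setNullAt(i) else values(i).update(v)
}
```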

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #2325 from liancheng/check-for-null and squashes the following commits:

9366c44 [Cheng Lian] Check for null in SpecificMutableRow.update
2014-09-09 18:39:33 -07:00
xinyunh 07ee4a28c3 [SPARK-3176] Implement 'ABS' and 'LAST' for sql
Add support for the mathematical function "ABS" and the analytic function "last" to return a subset of the rows satisfying a query within Spark SQL. Test cases included.

Author: xinyunh <xinyun.huang@huawei.com>
Author: bomeng <golf8lover>

Closes #2099 from xinyunh/sqlTest and squashes the following commits:

71d15e7 [xinyunh] remove POWER part
8843643 [xinyunh] fix the code style issue
39f0309 [bomeng] Modify the code of POWER and ABS. Move them to the file arithmetic
ff8e51e [bomeng] add abs() function support
7f6980a [xinyunh] fix the bug in 'Last' component
b3df91b [xinyunh] add 'Last' component
2014-09-09 16:55:39 -07:00
Cheng Hao 1e03cf79f8 [SPARK-3455] [SQL] **HOT FIX** Fix the unit test failure
The unit test failed because the attribute references could not be resolved. Temporarily disable this test case as a quick fix; otherwise it will block other work.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #2334 from chenghao-intel/unit_test_failure and squashes the following commits:

661f784 [Cheng Hao] temporally disable the failed test case
2014-09-09 10:18:25 -07:00
William Benton 2b7ab814f9 [SPARK-3329][SQL] Don't depend on Hive SET pair ordering in tests.
This fixes some possible spurious test failures in `HiveQuerySuite` by comparing sets of key-value pairs as sets, rather than as lists.
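
A minimal illustration of why comparing as sets removes the ordering dependency. The key-value pairs below are made-up sample data, not actual `SET` output:

```scala
object SetComparisonSketch {
  // Hypothetical SET output, returned in two different orders.
  val expected: Seq[(String, String)] = Seq("a" -> "1", "b" -> "2")
  val actual: Seq[(String, String)]   = Seq("b" -> "2", "a" -> "1")

  // List equality is order-sensitive, so this comparison fails spuriously.
  val equalAsLists: Boolean = expected == actual

  // Set equality ignores ordering, so this comparison succeeds.
  val equalAsSets: Boolean = expected.toSet == actual.toSet
}
```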

Author: William Benton <willb@redhat.com>
Author: Aaron Davidson <aaron@databricks.com>

Closes #2220 from willb/spark-3329 and squashes the following commits:

3b3e205 [William Benton] Collapse collectResults case match in HiveQuerySuite
6525d8e [William Benton] Handle cases where SET returns Rows of (single) strings
cf11b0e [Aaron Davidson] Fix flakey HiveQuerySuite test
2014-09-08 19:29:23 -07:00
Cheng Lian dc1dbf206e [SPARK-3414][SQL] Stores analyzed logical plan when registering a temp table
Case insensitivity breaks when unresolved relation contains attributes with uppercase letters in their names, because we store unanalyzed logical plan when registering temp tables while the `CaseInsensitivityAttributeReferences` batch runs before the `Resolution` batch. To fix this issue, we need to store analyzed logical plan.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #2293 from liancheng/spark-3414 and squashes the following commits:

d9fa1d6 [Cheng Lian] Stores analyzed logical plan when registering a temp table
2014-09-08 19:08:05 -07:00
William Benton ca0348e682 SPARK-3423: [SQL] Implement BETWEEN for SQLParser
This patch improves the SQLParser by adding support for BETWEEN conditions

Author: William Benton <willb@redhat.com>

Closes #2295 from willb/sql-between and squashes the following commits:

0016d30 [William Benton] Implement BETWEEN for SQLParser
2014-09-08 19:05:02 -07:00
Eric Liang 7db53391f1 [SPARK-3349][SQL] Output partitioning of limit should not be inherited from child
This resolves https://issues.apache.org/jira/browse/SPARK-3349

Author: Eric Liang <ekl@google.com>

Closes #2262 from ericl/spark-3349 and squashes the following commits:

3e1b05c [Eric Liang] add regression test
ac32723 [Eric Liang] make limit/takeOrdered output SinglePartition
2014-09-08 16:14:36 -07:00
Reynold Xin e2614038e7 [SPARK-3408] Fixed Limit operator so it works with sort-based shuffle.
Author: Reynold Xin <rxin@apache.org>

Closes #2281 from rxin/sql-limit-sort and squashes the following commits:

1ef7780 [Reynold Xin] [SPARK-3408] Fixed Limit operator so it works with sort-based shuffle.
2014-09-07 18:42:24 -07:00
Michael Armbrust 39db1bfdab [SQL] Update SQL Programming Guide
Author: Michael Armbrust <michael@databricks.com>
Author: Yin Huai <huai@cse.ohio-state.edu>

Closes #2258 from marmbrus/sqlDocUpdate and squashes the following commits:

f3d450b [Michael Armbrust] fix brackets
bea3bfa [Michael Armbrust] Davies suggestions
3a29fe2 [Michael Armbrust] tighten visibility
a71aa36 [Michael Armbrust] Draft of doc updates
52932c0 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into sqlDocUpdate
1e8c849 [Yin Huai] Update the example used for applySchema.
9457c39 [Yin Huai] Update doc.
31ba240 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeDoc
29bc668 [Yin Huai] Draft doc for data type and schema APIs.
2014-09-07 21:34:46 -04:00
GuoQiang Li 607ae39c22 [SPARK-3397] Bump pom.xml version number of master branch to 1.2.0-SNAPSHOT
Author: GuoQiang Li <witgo@qq.com>

Closes #2268 from witgo/SPARK-3397 and squashes the following commits:

eaf913f [GuoQiang Li] Bump pom.xml version number of master branch to 1.2.0-SNAPSHOT
2014-09-06 15:04:50 -07:00
Reynold Xin 1b9001f78d [SPARK-3409][SQL] Avoid pulling in Exchange operator itself in Exchange's closures.
This is a tiny teeny optimization to move the if check of sortBasedShuffledOn to outside the closures so the closures don't need to pull in the entire Exchange operator object.
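
The general idea can be sketched as below. `OperatorSketch` and its `sortBased` field are hypothetical names for illustration, not the actual `Exchange` code: a lambda that reads a field captures the enclosing instance, while deciding first and then building the lambda does not.

```scala
// Hypothetical operator for illustration; not Spark's Exchange.
class OperatorSketch(val sortBased: Boolean) {
  // Reading a field inside the lambda captures `this`, so the whole
  // operator would travel with the closure when it is serialized.
  def capturingThis: Int => Int =
    x => if (sortBased) x + 1 else x

  // The condition is evaluated outside the lambdas, so neither lambda
  // refers to anything from the enclosing instance.
  def notCapturingThis: Int => Int =
    if (sortBased) (x: Int) => x + 1
    else (x: Int) => x
}
```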

Author: Reynold Xin <rxin@apache.org>

Closes #2282 from rxin/SPARK-3409 and squashes the following commits:

1de3f88 [Reynold Xin] [SPARK-3409][SQL] Avoid pulling in Exchange operator itself in Exchange's closures.
2014-09-06 00:33:00 -07:00
Cheng Hao 1904bac38d [SPARK-3392] [SQL] Show value spark.sql.shuffle.partitions for mapred.reduce.tasks
This is a tiny fix for getting the value of "mapred.reduce.tasks", which makes more sense for Hive users.
It also covers the command "set -v", which should output verbose information for all of the key/values.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #2261 from chenghao-intel/set_mapreduce_tasks and squashes the following commits:

653858a [Cheng Hao] show value spark.sql.shuffle.partitions for mapred.reduce.tasks
2014-09-04 19:16:12 -07:00
Cheng Lian ee575f12f2 [SPARK-2219][SQL] Added support for the "add jar" command
Adds logical and physical command classes for the "add jar" command.

Note that this PR conflicts with and should be merged after #2215.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #2242 from liancheng/add-jar and squashes the following commits:

e43a2f1 [Cheng Lian] Updates AddJar according to conventions introduced in #2215
b99107f [Cheng Lian] Added test case for ADD JAR command
095b2c7 [Cheng Lian] Also forward ADD JAR command to Hive
9be031b [Cheng Lian] Trims Jar path string
8195056 [Cheng Lian] Added support for the "add jar" command
2014-09-04 18:47:45 -07:00
Liang-Chi Hsieh 3eb6ef316c [SPARK-3310][SQL] Directly use currentTable without unnecessary implicit conversion
We can directly use currentTable there without unnecessary implicit conversion.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #2203 from viirya/direct_use_inmemoryrelation and squashes the following commits:

4741d02 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into direct_use_inmemoryrelation
b671f67 [Liang-Chi Hsieh] Can directly use currentTable there without unnecessary implicit conversion.
2014-09-04 18:46:09 -07:00
Kousuke Saruta dc1ba9e9fc [SPARK-3378] [DOCS] Replace the word "SparkSQL" with right word "Spark SQL"
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2251 from sarutak/SPARK-3378 and squashes the following commits:

0bfe234 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3378
bb5938f [Kousuke Saruta] Replaced rest of "SparkSQL" with "Spark SQL"
6df66de [Kousuke Saruta] Replaced "SparkSQL" with "Spark SQL"
2014-09-04 15:06:08 -07:00
Davies Liu c5cbc49233 [SPARK-3335] [SQL] [PySpark] support broadcast in Python UDF
After this patch, broadcast can be used in Python UDF.

Author: Davies Liu <davies.liu@gmail.com>

Closes #2243 from davies/udf_broadcast and squashes the following commits:

7b88861 [Davies Liu] support broadcast in UDF
2014-09-03 19:08:39 -07:00
Cheng Lian 248067adbe [SPARK-2961][SQL] Use statistics to prune batches within cached partitions
This PR is based on #1883 authored by marmbrus. Key differences:

1. Batch pruning instead of partition pruning

   When #1883 was authored, batched column buffer building (#1880) hadn't been introduced. This PR combines these two and provides partition-batch-level pruning, which leads to smaller memory footprints and can generally skip more elements. The cost is that the pruning predicates are evaluated more frequently (the number of partitions multiplied by the number of batches per partition).

1. More filters are supported

   Filter predicates consisting of `=`, `<`, `<=`, `>`, `>=` and their conjunctions and disjunctions are supported.
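
To make the pruning idea concrete, here is a rough sketch assuming each batch keeps only min/max statistics for a single `Int` column. All names (`BatchStats`, `Pred`, `mightMatch`) are hypothetical and are not taken from the PR. A batch is skipped when the statistics prove that no row can satisfy the predicate.

```scala
object BatchPruningSketch {
  // Hypothetical per-batch statistics for one Int column.
  final case class BatchStats(min: Int, max: Int)

  sealed trait Pred
  final case class Equal(v: Int) extends Pred
  final case class Less(v: Int) extends Pred
  final case class LessEq(v: Int) extends Pred
  final case class Greater(v: Int) extends Pred
  final case class GreaterEq(v: Int) extends Pred
  final case class And(l: Pred, r: Pred) extends Pred
  final case class Or(l: Pred, r: Pred) extends Pred

  // True when the batch might contain a matching row and must be scanned;
  // false when the statistics rule out every row in the batch.
  def mightMatch(p: Pred, s: BatchStats): Boolean = p match {
    case Equal(v)     => s.min <= v && v <= s.max
    case Less(v)      => s.min < v
    case LessEq(v)    => s.min <= v
    case Greater(v)   => s.max > v
    case GreaterEq(v) => s.max >= v
    case And(l, r)    => mightMatch(l, s) && mightMatch(r, s)
    case Or(l, r)     => mightMatch(l, s) || mightMatch(r, s)
  }
}
```

Because the check runs once per batch rather than once per partition, it is evaluated more often, which matches the cost noted in item 1.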

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #2188 from liancheng/in-mem-batch-pruning and squashes the following commits:

68cf019 [Cheng Lian] Marked sqlContext as @transient
4254f6c [Cheng Lian] Enables in-memory partition pruning in PartitionBatchPruningSuite
3784105 [Cheng Lian] Overrides InMemoryColumnarTableScan.sqlContext
d2a1d66 [Cheng Lian] Disables in-memory partition pruning by default
062c315 [Cheng Lian] HiveCompatibilitySuite code cleanup
16b77bf [Cheng Lian] Fixed pruning predication conjunctions and disjunctions
16195c5 [Cheng Lian] Enabled both disjunction and conjunction
89950d0 [Cheng Lian] Worked around Scala style check
9c167f6 [Cheng Lian] Minor code cleanup
3c4d5c7 [Cheng Lian] Minor code cleanup
ea59ee5 [Cheng Lian] Renamed PartitionSkippingSuite to PartitionBatchPruningSuite
fc517d0 [Cheng Lian] More test cases
1868c18 [Cheng Lian] Code cleanup, bugfix, and adding tests
cb76da4 [Cheng Lian] Added more predicate filters, fixed table scan stats for testing purposes
385474a [Cheng Lian] Merge branch 'inMemStats' into in-mem-batch-pruning
2014-09-03 18:59:26 -07:00
Cheng Lian f48420fde5 [SPARK-2973][SQL] Lightweight SQL commands without distributed jobs when calling .collect()
By overriding `executeCollect()` in the physical plan classes of all commands, we can avoid kicking off a distributed job when collecting the result of a SQL command, e.g. `sql("SET").collect()`.

Previously, `Command.sideEffectResult` returned a `Seq[Any]`, and the `execute()` method in subclasses of `Command` typically converted that to a `Seq[Row]` and then parallelized it into an RDD. With this PR, `sideEffectResult` is required to return a `Seq[Row]` directly, so that `executeCollect()` can directly leverage that and be factored into the `Command` parent class.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #2215 from liancheng/lightweight-commands and squashes the following commits:

3fbef60 [Cheng Lian] Factored execute() method of physical commands to parent class Command
5a0e16c [Cheng Lian] Passes test suites
e0e12e9 [Cheng Lian] Refactored Command.sideEffectResult and Command.executeCollect
995bdd8 [Cheng Lian] Cleaned up DescribeHiveTableCommand
542977c [Cheng Lian] Avoids confusion between logical and physical plan by adding package prefixes
55b2aa5 [Cheng Lian] Avoids distributed jobs when execution SQL commands
2014-09-03 18:57:20 -07:00
Liang-Chi Hsieh 24ab384018 [SPARK-3300][SQL] No need to call clear() and shorten build()
The function `ensureFreeSpace` in object `ColumnBuilder` clears old buffer before copying its content to new buffer. This PR fixes it.
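
A rough sketch of growing a `java.nio.ByteBuffer` without losing its content is shown below. The growth policy and the object name are assumptions made for illustration, not the code from this PR. The key point is that `flip()` exposes exactly the bytes written so far, whereas calling `clear()` before the copy would reset the position and limit of the old buffer first.

```scala
import java.nio.ByteBuffer

object BufferGrowthSketch {
  // Returns a buffer with at least `needed` free bytes while keeping
  // the bytes that were already written to `orig`.
  def ensureFreeSpace(orig: ByteBuffer, needed: Int): ByteBuffer =
    if (orig.remaining() >= needed) {
      orig
    } else {
      val written = orig.position()
      // Assumed growth policy: double the capacity, or more if required.
      val newCapacity = math.max(orig.capacity() * 2, written + needed)
      // flip() sets limit = position and position = 0, so the put below
      // copies exactly the written bytes.
      orig.flip()
      ByteBuffer.allocate(newCapacity).put(orig)
    }
}
```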

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #2195 from viirya/fix_buffer_clear and squashes the following commits:

792f009 [Liang-Chi Hsieh] no need to call clear(). use flip() instead of calling limit(), position() and rewind().
df2169f [Liang-Chi Hsieh] should clean old buffer after copying its content.
2014-09-02 20:51:25 -07:00
Cheng Lian 19d3e1e8e9 [SQL] Renamed ColumnStat to ColumnMetrics to avoid confusion between ColumnStats
Class names of these two are just too similar.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #2189 from liancheng/column-metrics and squashes the following commits:

8bb3b21 [Cheng Lian] Renamed ColumnStat to ColumnMetrics to avoid confusion between ColumnStats
2014-09-02 20:49:36 -07:00
Takuya UESHIN 0cd91f666d [SPARK-3341][SQL] The dataType of Sqrt expression should be DoubleType.
Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #2233 from ueshin/issues/SPARK-3341 and squashes the following commits:

e497320 [Takuya UESHIN] Fix data type of Sqrt expression.
2014-09-02 20:31:15 -07:00
scwf 725715cbf3 [SPARK-3010] fix redundant conditional
https://issues.apache.org/jira/browse/SPARK-3010

This PR fixes redundant conditionals in Spark, such as:
1.
private[spark] def codegenEnabled: Boolean =
if (getConf(CODEGEN_ENABLED, "false") == "true") true else false
2.
x => if (x == 2) true else false
...
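
The simplification applied to snippets like the ones above can be illustrated with a small standalone example (the function names are made up for illustration):

```scala
object RedundantConditionalSketch {
  // Before: the if/else only restates the Boolean it tests.
  def isTwoVerbose(x: Int): Boolean = if (x == 2) true else false

  // After: return the comparison result directly.
  def isTwo(x: Int): Boolean = x == 2
}
```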

Author: scwf <wangfei1@huawei.com>
Author: wangfei <wangfei_hello@126.com>

Closes #1992 from scwf/condition and squashes the following commits:

b2a044a [scwf] merge SecurityManager
e16239c [scwf] fix confilct
6811401 [scwf] fix merge confilct
0824df4 [scwf] Merge branch 'master' of https://github.com/apache/spark into patch-4
e274515 [scwf] fix redundant conditions
d032bf9 [wangfei] [SQL]Excess judgment
2014-08-31 14:02:11 -07:00
Cheng Lian 32b18dd52c [SPARK-3320][SQL] Made batched in-memory column buffer building work for SchemaRDDs with empty partitions
Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #2213 from liancheng/spark-3320 and squashes the following commits:

45a0139 [Cheng Lian] Fixed typo in InMemoryColumnarQuerySuite
f67067d [Cheng Lian] Fixed SPARK-3320
2014-08-29 18:16:47 -07:00
Zdenek Farana 98ddbe6cdb [SPARK-3173][SQL] Timestamp support in the parser
If you have a table with a TIMESTAMP column, that column can't be used properly in a WHERE clause because it is not evaluated correctly. [More](https://issues.apache.org/jira/browse/SPARK-3173)

Motivation: http://www.aproint.com/aggregation-with-spark-sql/

- [x] modify SqlParser so it supports casting to TIMESTAMP (workaround for item 2)
- [x] the string literal should be converted into Timestamp if the column is Timestamp.

Author: Zdenek Farana <zdenek.farana@gmail.com>
Author: Zdenek Farana <zdenek.farana@aproint.com>

Closes #2084 from byF/SPARK-3173 and squashes the following commits:

442b59d [Zdenek Farana] Fixed test merge conflict
2dbf4f6 [Zdenek Farana] Merge remote-tracking branch 'origin/SPARK-3173' into SPARK-3173
65b6215 [Zdenek Farana] Fixed timezone sensitivity in the test
47b27b4 [Zdenek Farana] Now works in the case of "StringLiteral=TimestampColumn"
96a661b [Zdenek Farana] Code style change
491dfcf [Zdenek Farana] Added test cases for SPARK-3173
4446b1e [Zdenek Farana] A string literal is casted into Timestamp when the column is Timestamp.
59af397 [Zdenek Farana] Added a new TIMESTAMP keyword; CAST to TIMESTAMP now can be used in SQL expression.
2014-08-29 15:39:15 -07:00
qiping.lqp 634d04b87c [SPARK-3291][SQL]TestcaseName in createQueryTest should not contain ":"
":" is not allowed in file names on Windows. If a file name contains ":", the file can't be checked out on a Windows system, and developers using Windows must be careful not to commit the deletion of such files, which is very inconvenient.

Author: qiping.lqp <qiping.lqp@alibaba-inc.com>

Closes #2191 from chouqin/querytest and squashes the following commits:

0e943a1 [qiping.lqp] rename golden file
60a863f [qiping.lqp] TestcaseName in createQueryTest should not contain ":"
2014-08-29 15:37:43 -07:00
Cheng Lian d94a44d7ca [SPARK-3269][SQL] Decreases initial buffer size for row set to prevent OOM
When a large batch size is specified, `SparkSQLOperationManager` OOMs even if the whole result set is much smaller than the batch size.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #2171 from liancheng/jdbc-fetch-size and squashes the following commits:

5e1623b [Cheng Lian] Decreases initial buffer size for row set to prevent OOM
2014-08-29 15:36:04 -07:00
Cheng Lian b1eccfc88a [SQL] Turns on in-memory columnar compression in HiveCompatibilitySuite
`HiveCompatibilitySuite` already turns on in-memory columnar caching; it would be good to also enable compression to improve test coverage.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #2190 from liancheng/compression-on and squashes the following commits:

88b536c [Cheng Lian] Code cleanup, narrowed field visibility
d13efd2 [Cheng Lian] Turns on in-memory columnar compression in HiveCompatibilitySuite
2014-08-29 15:34:59 -07:00
Cheng Hao dc4d577c65 [SPARK-3198] [SQL] Remove the TreeNode.id
Although the id property of the TreeNode API provides a faster way to compare two TreeNodes, it is something of a performance bottleneck during expression object creation in a multi-threaded environment (because of the memory barrier).
Fortunately, tree node comparison happens only once in master, so even if we remove it, overall performance will not be affected.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #2155 from chenghao-intel/treenode and squashes the following commits:

7cf2cd2 [Cheng Hao] Remove the implicit keyword for TreeNodeRef and some other small issues
5873415 [Cheng Hao] Remove the TreeNode.id
2014-08-29 15:32:26 -07:00
William Benton 2f1519defa SPARK-2813: [SQL] Implement SQRT() directly in Spark SQL
This PR adds a native implementation for SQL SQRT() and thus avoids delegating this function to Hive.

Author: William Benton <willb@redhat.com>

Closes #1750 from willb/spark-2813 and squashes the following commits:

22c8a79 [William Benton] Fixed missed newline from rebase
d673861 [William Benton] Added string coercions for SQRT and associated test case
e125df4 [William Benton] Added ExpressionEvaluationSuite test cases for SQRT
7b84bcd [William Benton] SQL SQRT now properly returns NULL for NULL inputs
8256971 [William Benton] added SQRT test to SqlQuerySuite
504d2e5 [William Benton] Added native SQRT implementation
2014-08-29 15:26:59 -07:00
Michael Armbrust 76e3ba4264 [SPARK-3230][SQL] Fix udfs that return structs
We need to convert the case classes into Rows.

Author: Michael Armbrust <michael@databricks.com>

Closes #2133 from marmbrus/structUdfs and squashes the following commits:

189722f [Michael Armbrust] Merge remote-tracking branch 'origin/master' into structUdfs
8e29b1c [Michael Armbrust] Use existing function
d8d0b76 [Michael Armbrust] Fix udfs that return structs
2014-08-28 00:15:23 -07:00
Cheng Lian 68f75dcdfe [SQL] Fixed 2 comment typos in SQLConf
Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #2172 from liancheng/sqlconf-typo and squashes the following commits:

115cc71 [Cheng Lian] Fixed 2 comment typos in SQLConf
2014-08-28 00:08:09 -07:00
Michael Armbrust 024178c574 [HOTFIX][SQL] Remove cleaning of UDFs
It is not safe to run the closure cleaner on slaves.  #2153 introduced this which broke all UDF execution on slaves.  Will re-add cleaning of UDF closures in a follow-up PR.

Author: Michael Armbrust <michael@databricks.com>

Closes #2174 from marmbrus/fixUdfs and squashes the following commits:

55406de [Michael Armbrust] [HOTFIX] Remove cleaning of UDFs
2014-08-27 23:05:34 -07:00
Michael Armbrust 7d2a7a91f2 [SPARK-3235][SQL] Ensure in-memory tables don't always broadcast.
Author: Michael Armbrust <michael@databricks.com>

Closes #2147 from marmbrus/inMemDefaultSize and squashes the following commits:

5390360 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into inMemDefaultSize
14204d3 [Michael Armbrust] Set the context before creating SparkLogicalPlans.
8da4414 [Michael Armbrust] Make sure we throw errors when leaf nodes fail to provide statistcs
18ce029 [Michael Armbrust] Ensure in-memory tables don't always broadcast.
2014-08-27 15:14:08 -07:00
luogankun 65253502b9 [SPARK-3065][SQL] Add locale setting to fix results do not match for udf_unix_timestamp format "yyyy MMM dd h:mm:ss a" run with not "America/Los_Angeles" TimeZone in HiveCompatibilitySuite
Running the udf_unix_timestamp test case of org.apache.spark.sql.hive.execution.HiveCompatibilitySuite
with a time zone other than "America/Los_Angeles" throws an error. [https://issues.apache.org/jira/browse/SPARK-3065]
This adds a locale setting in the beforeAll and afterAll methods to fix the HiveCompatibilitySuite test case.

Author: luogankun <luogankun@gmail.com>

Closes #1968 from luogankun/SPARK-3065 and squashes the following commits:

c167832 [luogankun] [SPARK-3065][SQL] Add Locale setting to HiveCompatibilitySuite
0a25e3a [luogankun] [SPARK-3065][SQL] Add Locale setting to HiveCompatibilitySuite
2014-08-27 15:08:22 -07:00
Aaron Davidson cc275f4b79 [SQL] [SPARK-3236] Reading Parquet tables from Metastore mangles location
Currently we do `relation.hiveQlTable.getDataLocation.getPath`, which returns the path-part of the URI (e.g., "s3n://my-bucket/my-path" => "/my-path"). We should do `relation.hiveQlTable.getDataLocation.toString` instead, as a URI's toString returns a faithful representation of the full URI, which can later be passed into a Hadoop Path.
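
The difference can be reproduced with plain `java.net.URI`, assuming the data location behaves like a standard URI as the description implies. The bucket and path are the example values from the message above:

```scala
import java.net.URI

object UriPathSketch {
  val location = new URI("s3n://my-bucket/my-path")

  // getPath keeps only the path component, dropping scheme and bucket.
  val pathOnly: String = location.getPath

  // toString keeps the full URI, which can be handed to a Hadoop Path later.
  val full: String = location.toString
}
```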

Author: Aaron Davidson <aaron@databricks.com>

Closes #2150 from aarondav/parquet-location and squashes the following commits:

459f72c [Aaron Davidson] [SQL] [SPARK-3236] Reading Parquet tables from Metastore mangles location
2014-08-27 15:05:47 -07:00
viirya 28d41d6279 [SPARK-3252][SQL] Add missing condition for test
According to the text message, both relations should be tested. So add the missing condition.

Author: viirya <viirya@gmail.com>

Closes #2159 from viirya/fix_test and squashes the following commits:

b1c0f52 [viirya] add missing condition.
2014-08-27 14:55:05 -07:00
chutium 48f42781de [SPARK-3138][SQL] sqlContext.parquetFile should be able to take a single file as parameter
```if (!fs.getFileStatus(path).isDir) throw Exception``` makes no sense after commit #1370.

Be careful if you are working on SPARK-2551: make sure the new change passes the test case ```test("Read a parquet file instead of a directory")```.

Author: chutium <teng.qiu@gmail.com>

Closes #2044 from chutium/parquet-singlefile and squashes the following commits:

4ae477f [chutium] [SPARK-3138][SQL] sqlContext.parquetFile should be able to take a single file as parameter
2014-08-27 13:13:04 -07:00
Cheng Hao 4238c17dc9 [SPARK-3197] [SQL] Reduce the Expression tree object creations for aggregation function (min/max)
The min/max aggregation functions in Catalyst create an expression tree for every single row, but expression tree creation is currently quite expensive in a multithreaded environment, so min/max performed very badly.
Here is the benchmark I ran locally.

Master | Previous Result (ms) | Current Result (ms)
------------ | ------------- | -------------
local | 3645 | 3416
local[6] | 3602 | 1002

The Benchmark source code.
```
case class Record(key: Int, value: Int)

object TestHive2 extends HiveContext(new SparkContext("local[6]", "TestSQLContext", new SparkConf()))

object DataPrepare extends App {
  import TestHive2._

  val rdd = sparkContext.parallelize((1 to 10000000).map(i => Record(i % 3000, i)), 12)

  runSqlHive("SHOW TABLES")
  runSqlHive("DROP TABLE if exists a")
  runSqlHive("DROP TABLE if exists result")
  rdd.registerAsTable("records")

  runSqlHive("""CREATE TABLE a (key INT, value INT)
                 | ROW FORMAT SERDE
                 | 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
                 | STORED AS RCFILE
               """.stripMargin)
  runSqlHive("""CREATE TABLE result (key INT, value INT)
                 | ROW FORMAT SERDE
                 | 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
                 | STORED AS RCFILE
               """.stripMargin)

  hql(s"""from records
             | insert into table a
             | select key, value
           """.stripMargin)
}

object PerformanceTest extends App {
  import TestHive2._

  hql("SHOW TABLES")
  hql("set spark.sql.shuffle.partitions=12")

  val cmd = "select min(value), max(value) from a group by key"

  val results = ("Result1", benchmark(cmd)) ::
                ("Result2", benchmark(cmd)) ::
                ("Result3", benchmark(cmd)) :: Nil
  results.foreach { case (prompt, result) => {
      println(s"$prompt: took ${result._1} ms (${result._2} records)")
    }
  }

  def benchmark(cmd: String) = {
    val begin = System.currentTimeMillis()
    val count = hql(cmd).count
    val end = System.currentTimeMillis()
    ((end - begin), count)
  }
}
```

Author: Cheng Hao <hao.cheng@intel.com>

Closes #2113 from chenghao-intel/aggregation_expression_optimization and squashes the following commits:

db40395 [Cheng Hao] remove the transient and add val for the expression property
d56167d [Cheng Hao] Reduce the Expressions creation
2014-08-27 12:50:47 -07:00
u0jing 3b5eb7083d [SPARK-3118][SQL]add "SHOW TBLPROPERTIES tblname;" and "SHOW COLUMNS (FROM|IN) table_name [(FROM|IN) db_name]" support
JIRA issue: [SPARK-3118] https://issues.apache.org/jira/browse/SPARK-3118

eg:
> SHOW TBLPROPERTIES test;
SHOW TBLPROPERTIES test;
numPartitions	0
numFiles	1
transient_lastDdlTime	1407923642
numRows	0
totalSize	82
rawDataSize	0

eg:
> SHOW COLUMNS  in test;
SHOW COLUMNS  in test;
OK
Time taken: 0.304 seconds
id
stid
bo

Author: u0jing <u9jing@gmail.com>

Closes #2034 from u0jing/spark-3118 and squashes the following commits:

b231d87 [u0jing] add golden answer files
35f4885 [u0jing] add 'show columns' and 'show tblproperties' support
2014-08-27 12:47:14 -07:00
Michael Armbrust e1139dd60e [SPARK-3237][SQL] Fix parquet filters with UDFs
Author: Michael Armbrust <michael@databricks.com>

Closes #2153 from marmbrus/parquetFilters and squashes the following commits:

712731a [Michael Armbrust] Use closure serializer for sending filters.
1e83f80 [Michael Armbrust] Clean udf functions.
2014-08-27 00:59:23 -07:00
Takuya UESHIN 727cb25bcc [SPARK-3036][SPARK-3037][SQL] Add MapType/ArrayType containing null value support to Parquet.
JIRA:
- https://issues.apache.org/jira/browse/SPARK-3036
- https://issues.apache.org/jira/browse/SPARK-3037

Currently this uses the following Parquet schema for `MapType` when `valueContainsNull` is `true`:

```
message root {
  optional group a (MAP) {
    repeated group map (MAP_KEY_VALUE) {
      required int32 key;
      optional int32 value;
    }
  }
}
```

for `ArrayType` when `containsNull` is `true`:

```
message root {
  optional group a (LIST) {
    repeated group bag {
      optional int32 array;
    }
  }
}
```

We have to think about compatibility with older versions of Spark, Hive, and the other systems I mentioned in the JIRA issues.

Notice:
This PR is based on #1963 and #1889.
Please check them first.

/cc marmbrus, yhuai

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #2032 from ueshin/issues/SPARK-3036_3037 and squashes the following commits:

4e8e9e7 [Takuya UESHIN] Add ArrayType containing null value support to Parquet.
013c2ca [Takuya UESHIN] Add MapType containing null value support to Parquet.
62989de [Takuya UESHIN] Merge branch 'issues/SPARK-2969' into issues/SPARK-3036_3037
8e38b53 [Takuya UESHIN] Merge branch 'issues/SPARK-3063' into issues/SPARK-3036_3037
2014-08-26 18:28:41 -07:00
Michael Armbrust c4787a3690 [SPARK-3194][SQL] Add AttributeSet to fix bugs with invalid comparisons of AttributeReferences
It is common to want to describe sets of attributes that are in various parts of a query plan.  However, the semantics of putting `AttributeReference` objects into a standard Scala `Set` result in subtle bugs when references differ cosmetically.  For example, with case insensitive resolution it is possible to have two references to the same attribute whose names are not equal.

In this PR I introduce a new abstraction, an `AttributeSet`, which performs all comparisons using the globally unique `ExpressionId` instead of case class equality.  (There is already a related class, [`AttributeMap`](https://github.com/marmbrus/spark/blob/inMemStats/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AttributeMap.scala#L32))  This new type of set is used to fix a bug in the optimizer where needed attributes were getting projected away underneath join operators.
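
A simplified sketch of the idea follows. `Attr` and `AttrSet` are hypothetical stand-ins, not the real `AttributeReference` and `AttributeSet`: a standard `Set` uses case class equality and so treats `key` and `KEY` as different, while a set keyed on the unique id treats them as the same attribute.

```scala
object AttributeSetSketch {
  // Hypothetical, simplified attribute; `id` plays the role of the
  // globally unique expression id.
  final case class Attr(name: String, id: Long)

  // Set semantics keyed on id only, ignoring cosmetic differences
  // such as the case of the name.
  final class AttrSet private (private val byId: Map[Long, Attr]) {
    def contains(a: Attr): Boolean = byId.contains(a.id)
    def +(a: Attr): AttrSet = new AttrSet(byId.updated(a.id, a))
    def size: Int = byId.size
  }

  object AttrSet {
    def apply(attrs: Attr*): AttrSet =
      new AttrSet(attrs.map(a => a.id -> a).toMap)
  }
}
```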

I also took this opportunity to refactor the expression and query plan base classes.  In all but one instance the logic for computing the `references` of an `Expression` were the same.  Thus, I moved this logic into the base class.

For query plans the semantics of  the `references` method were ill defined (is it the references output? or is it those used by expression evaluation? or what?).  As a result, this method wasn't really used very much.  So, I removed it.

TODO:
 - [x] Finish scala doc for `AttributeSet`
 - [x] Scan the code for other instances of `Set[Attribute]` and refactor them.
 - [x] Finish removing `references` from `QueryPlan`

Author: Michael Armbrust <michael@databricks.com>

Closes #2109 from marmbrus/attributeSets and squashes the following commits:

1c0dae5 [Michael Armbrust] work on serialization bug.
9ba868d [Michael Armbrust] Merge remote-tracking branch 'origin/master' into attributeSets
3ae5288 [Michael Armbrust] review comments
40ce7f6 [Michael Armbrust] style
d577cc7 [Michael Armbrust] Scaladoc
cae5d22 [Michael Armbrust] remove more references implementations
d6e16be [Michael Armbrust] Remove more instances of "def references" and normal sets of attributes.
fc26b49 [Michael Armbrust] Add AttributeSet class, remove references from Expression.
2014-08-26 16:29:14 -07:00
Takuya UESHIN 6b5584ef1c [SPARK-3063][SQL] ExistingRdd should convert Map to catalyst Map.
Currently `ExistingRdd.convertToCatalyst` doesn't convert `Map` value.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #1963 from ueshin/issues/SPARK-3063 and squashes the following commits:

3ba41f2 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-3063
4d7bae2 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-3063
9321379 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-3063
d8a900a [Takuya UESHIN] Make ExistingRdd.convertToCatalyst be able to convert Map value.
2014-08-26 15:04:08 -07:00
Takuya UESHIN 98c2bb0bbd [SPARK-2969][SQL] Make ScalaReflection be able to handle ArrayType.containsNull and MapType.valueContainsNull.
Make `ScalaReflection` be able to handle like:

- `Seq[Int]` as `ArrayType(IntegerType, containsNull = false)`
- `Seq[java.lang.Integer]` as `ArrayType(IntegerType, containsNull = true)`
- `Map[Int, Long]` as `MapType(IntegerType, LongType, valueContainsNull = false)`
- `Map[Int, java.lang.Long]` as `MapType(IntegerType, LongType, valueContainsNull = true)`

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #1889 from ueshin/issues/SPARK-2969 and squashes the following commits:

24f1c5c [Takuya UESHIN] Change the default value of ArrayType.containsNull to true in Python API.
79f5b65 [Takuya UESHIN] Change the default value of ArrayType.containsNull to true in Java API.
7cd1a7a [Takuya UESHIN] Fix json test failures.
2cfb862 [Takuya UESHIN] Change the default value of ArrayType.containsNull to true.
2f38e61 [Takuya UESHIN] Revert the default value of MapTypes.valueContainsNull.
9fa02f5 [Takuya UESHIN] Fix a test failure.
1a9a96b [Takuya UESHIN] Modify ScalaReflection to handle ArrayType.containsNull and MapType.valueContainsNull.
2014-08-26 13:22:55 -07:00
chutium 8856c3d860 [SPARK-3131][SQL] Allow user to set parquet compression codec for writing ParquetFile in SQLContext
There are 4 different compression codecs available for ```ParquetOutputFormat```.

In Spark SQL, the codec was set as a hard-coded value in ```ParquetRelation.defaultCompression```.

Original discussion:
https://github.com/apache/spark/pull/195#discussion-diff-11002083

I added a new config property in SQLConf to allow users to change this compression codec, and I used a short-name syntax similar to the one described in SPARK-2953 #1873 (https://github.com/apache/spark/pull/1873/files#diff-0)

By the way, which codec should we use as the default? It was set to GZIP (https://github.com/apache/spark/pull/195/files#diff-4), but I think maybe we should change this to SNAPPY, since SNAPPY is already the default codec for shuffling in spark-core (SPARK-2469, #1415), and parquet-mr supports the Snappy codec natively (e440108de5).

Author: chutium <teng.qiu@gmail.com>

Closes #2039 from chutium/parquet-compression and squashes the following commits:

2f44964 [chutium] [SPARK-3131][SQL] parquet compression default codec set to snappy, also in test suite
e578e21 [chutium] [SPARK-3131][SQL] compression codec config property name and default codec set to snappy
21235dc [chutium] [SPARK-3131][SQL] Allow user to set parquet compression codec for writing ParquetFile in SQLContext
2014-08-26 11:51:26 -07:00
Daoyuan Wang 52fbdc2ded [Spark-3222] [SQL] Cross join support in HiveQL
We can simply treat a cross join as an inner join without join conditions.
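
The equivalence can be sketched over plain in-memory sequences. `innerJoin` and `crossJoin` below are hypothetical helpers for illustration, not the HiveQL parser code: with no condition, the inner join keeps every pair, which is the Cartesian product.

```scala
object CrossJoinSketch {
  // Inner join over in-memory sequences; the join condition is optional.
  def innerJoin[A, B](left: Seq[A], right: Seq[B])(
      condition: Option[(A, B) => Boolean]): Seq[(A, B)] =
    for {
      l <- left
      r <- right
      // With no condition, forall on None is true and every pair is kept.
      if condition.forall(c => c(l, r))
    } yield (l, r)

  // A cross join is the same join with the condition left out.
  def crossJoin[A, B](left: Seq[A], right: Seq[B]): Seq[(A, B)] =
    innerJoin(left, right)(None)
}
```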

Author: Daoyuan Wang <daoyuan.wang@intel.com>
Author: adrian-wang <daoyuanwong@gmail.com>

Closes #2124 from adrian-wang/crossjoin and squashes the following commits:

8c9b7c5 [Daoyuan Wang] add a test
7d47bbb [adrian-wang] add cross join support for hql
2014-08-25 22:56:35 -07:00
Chia-Yung Su 4243bb6634 [SPARK-3011][SQL] _temporary directory should be filtered out by sqlContext.parquetFile
Fixes a compile error on Hadoop 0.23 for pull request #1924.

Author: Chia-Yung Su <chiayung@appier.com>

Closes #1959 from joesu/bugfix-spark3011 and squashes the following commits:

be30793 [Chia-Yung Su] remove .* and _* except _metadata
8fe2398 [Chia-Yung Su] add note to explain
40ea9bd [Chia-Yung Su] fix hadoop-0.23 compile error
c7e44f2 [Chia-Yung Su] match syntax
f8fc32a [Chia-Yung Su] filter out tmp dir
2014-08-25 18:20:19 -07:00