Commit graph

348 commits

Author SHA1 Message Date
Daoyuan Wang 52fbdc2ded [Spark-3222] [SQL] Cross join support in HiveQL
We can simple treat cross join as inner join without join conditions.

Author: Daoyuan Wang <daoyuan.wang@intel.com>
Author: adrian-wang <daoyuanwong@gmail.com>

Closes #2124 from adrian-wang/crossjoin and squashes the following commits:

8c9b7c5 [Daoyuan Wang] add a test
7d47bbb [adrian-wang] add cross join support for hql
2014-08-25 22:56:35 -07:00
Chia-Yung Su 4243bb6634 [SPARK-3011][SQL] _temporary directory should be filtered out by sqlContext.parquetFile
fix compile error on hadoop 0.23 for the pull request #1924.

Author: Chia-Yung Su <chiayung@appier.com>

Closes #1959 from joesu/bugfix-spark3011 and squashes the following commits:

be30793 [Chia-Yung Su] remove .* and _* except _metadata
8fe2398 [Chia-Yung Su] add note to explain
40ea9bd [Chia-Yung Su] fix hadoop-0.23 compile error
c7e44f2 [Chia-Yung Su] match syntax
f8fc32a [Chia-Yung Su] filter out tmp dir
2014-08-25 18:20:19 -07:00
wangfei 507a1b5200 [SQL] logWarning should be logInfo in getResultSetSchema
Author: wangfei <wangfei_hello@126.com>

Closes #1939 from scwf/patch-5 and squashes the following commits:

f952d10 [wangfei] [SQL] logWarning should be logInfo in getResultSetSchema
2014-08-25 17:46:43 -07:00
Cheng Hao 156eb39661 [SPARK-3058] [SQL] Support EXTENDED for EXPLAIN
Provide `extended` keyword support for `explain` command in SQL. e.g.
```
explain extended select key as a1, value as a2 from src where key=1;
== Parsed Logical Plan ==
Project ['key AS a1#3,'value AS a2#4]
 Filter ('key = 1)
  UnresolvedRelation None, src, None

== Analyzed Logical Plan ==
Project [key#8 AS a1#3,value#9 AS a2#4]
 Filter (CAST(key#8, DoubleType) = CAST(1, DoubleType))
  MetastoreRelation default, src, None

== Optimized Logical Plan ==
Project [key#8 AS a1#3,value#9 AS a2#4]
 Filter (CAST(key#8, DoubleType) = 1.0)
  MetastoreRelation default, src, None

== Physical Plan ==
Project [key#8 AS a1#3,value#9 AS a2#4]
 Filter (CAST(key#8, DoubleType) = 1.0)
  HiveTableScan [key#8,value#9], (MetastoreRelation default, src, None), None

Code Generation: false
== RDD ==
(2) MappedRDD[14] at map at HiveContext.scala:350
  MapPartitionsRDD[13] at mapPartitions at basicOperators.scala:42
  MapPartitionsRDD[12] at mapPartitions at basicOperators.scala:57
  MapPartitionsRDD[11] at mapPartitions at TableReader.scala:112
  MappedRDD[10] at map at TableReader.scala:240
  HadoopRDD[9] at HadoopRDD at TableReader.scala:230
```

It's the sub task of #1847. But can go without any dependency.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #1962 from chenghao-intel/explain_extended and squashes the following commits:

295db74 [Cheng Hao] Fix bug in printing the simple execution plan
48bc989 [Cheng Hao] Support EXTENDED for EXPLAIN
2014-08-25 17:43:56 -07:00
Cheng Lian cae9414d38 [SPARK-2929][SQL] Refactored Thrift server and CLI suites
Removed most hard coded timeout, timing assumptions and all `Thread.sleep`. Simplified IPC and synchronization with `scala.sys.process` and future/promise so that the test suites can run more robustly and faster.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1856 from liancheng/thriftserver-tests and squashes the following commits:

2d914ca [Cheng Lian] Minor refactoring
0e12e71 [Cheng Lian] Cleaned up test output
0ee921d [Cheng Lian] Refactored Thrift server and CLI suites
2014-08-25 16:29:59 -07:00
Takuya UESHIN d299e2bf2f [SPARK-3204][SQL] MaxOf would be foldable if both left and right are foldable.
Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #2116 from ueshin/issues/SPARK-3204 and squashes the following commits:

7d9b107 [Takuya UESHIN] Make MaxOf foldable if both left and right are foldable.
2014-08-25 16:27:00 -07:00
Michael Armbrust 3519b5e8e5 [SPARK-2967][SQL] Follow-up: Also copy hash expressions in sort based shuffle fix.
Follow-up to #2066

Author: Michael Armbrust <michael@databricks.com>

Closes #2072 from marmbrus/sortShuffle and squashes the following commits:

2ff8114 [Michael Armbrust] Fix bug
2014-08-23 16:21:08 -07:00
Michael Armbrust 7e191fe29b [SPARK-2554][SQL] CountDistinct partial aggregation and object allocation improvements
Author: Michael Armbrust <michael@databricks.com>
Author: Gregory Owen <greowen@gmail.com>

Closes #1935 from marmbrus/countDistinctPartial and squashes the following commits:

5c7848d [Michael Armbrust] turn off caching in the constructor
8074a80 [Michael Armbrust] fix tests
32d216f [Michael Armbrust] reynolds comments
c122cca [Michael Armbrust] Address comments, add tests
b2e8ef3 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into countDistinctPartial
fae38f4 [Michael Armbrust] Fix style
fdca896 [Michael Armbrust] cleanup
93d0f64 [Michael Armbrust] metastore concurrency fix.
db44a30 [Michael Armbrust] JIT hax.
3868f6c [Michael Armbrust] Merge pull request #9 from GregOwen/countDistinctPartial
c9e67de [Gregory Owen] Made SpecificRow and types serializable by Kryo
2b46c4b [Michael Armbrust] Merge remote-tracking branch 'origin/master' into countDistinctPartial
8ff6402 [Michael Armbrust] Add specific row.
58d15f1 [Michael Armbrust] disable codegen logging
87d101d [Michael Armbrust] Fix isNullAt bug
abee26d [Michael Armbrust] WIP
27984d0 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into countDistinctPartial
57ae3b1 [Michael Armbrust] Fix order dependent test
b3d0f64 [Michael Armbrust] Add golden files.
c1f7114 [Michael Armbrust] Improve tests / fix serialization.
f31b8ad [Michael Armbrust] more fixes
38c7449 [Michael Armbrust] comments and style
9153652 [Michael Armbrust] better toString
d494598 [Michael Armbrust] Fix tests now that the planner is better
41fbd1d [Michael Armbrust] Never try and create an empty hash set.
050bb97 [Michael Armbrust] Skip no-arg constructors for kryo,
bd08239 [Michael Armbrust] WIP
213ada8 [Michael Armbrust] First draft of partially aggregated and code generated count distinct / max
2014-08-23 16:19:10 -07:00
Yin Huai 2fb1c72ea2 [SQL] Make functionRegistry in HiveContext transient.
Seems we missed `transient` for the `functionRegistry` in `HiveContext`.

cc: marmbrus

Author: Yin Huai <huaiyin.thu@gmail.com>

Closes #2074 from yhuai/makeFunctionRegistryTransient and squashes the following commits:

6534e7d [Yin Huai] Make functionRegistry transient.
2014-08-23 12:46:41 -07:00
Alex Liu d9e94146a6 [SPARK-2846][SQL] Add configureInputJobPropertiesForStorageHandler to initialization of job conf
...al job conf

Author: Alex Liu <alex_liu68@yahoo.com>

Closes #1927 from alexliu68/SPARK-SQL-2846 and squashes the following commits:

e4bdc4c [Alex Liu] SPARK-SQL-2846 add configureInputJobPropertiesForStorageHandler to initial job conf
2014-08-20 16:14:06 -07:00
Michael Armbrust a2e658dcda [SPARK-2967][SQL] Fix sort based shuffle for spark sql.
Add explicit row copies when sort based shuffle is on.

Author: Michael Armbrust <michael@databricks.com>

Closes #2066 from marmbrus/sortShuffle and squashes the following commits:

fcd7bb2 [Michael Armbrust] Fix sort based shuffle for spark sql.
2014-08-20 15:51:14 -07:00
Cheng Lian cf46e72581 [SPARK-3126][SPARK-3127][SQL] Fixed HiveThriftServer2Suite
This PR fixes two issues:

1. Fixes wrongly quoted command line option in `HiveThriftServer2Suite` that makes test cases hang until timeout.
1. Asks `dev/run-test` to run Spark SQL tests when `bin/spark-sql` and/or `sbin/start-thriftserver.sh` are modified.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #2036 from liancheng/fix-thriftserver-test and squashes the following commits:

f38c4eb [Cheng Lian] Fixed the same quotation issue in CliSuite
26b82a0 [Cheng Lian] Run SQL tests when dff contains bin/spark-sql and/or sbin/start-thriftserver.sh
a87f83d [Cheng Lian] Extended timeout
e5aa31a [Cheng Lian] Fixed metastore JDBC URI quotation
2014-08-20 12:57:39 -07:00
wangfei 0e3ab94d41 [SQL] add note of use synchronizedMap in SQLConf
Refer to:
http://stackoverflow.com/questions/510632/whats-the-difference-between-concurrenthashmap-and-collections-synchronizedmap
Collections.synchronizedMap(map) creates a blocking Map which will degrade performance, albeit ensure consistency. So use ConcurrentHashMap(a more effective thread-safe hashmap) instead.

also update HiveQuerySuite to fix test error when changed to ConcurrentHashMap.

Author: wangfei <wangfei_hello@126.com>
Author: scwf <wangfei1@huawei.com>

Closes #1996 from scwf/sqlconf and squashes the following commits:

93bc0c5 [wangfei] revert change of HiveQuerySuite
0cc05dd [wangfei] add note for use synchronizedMap
3c224d31 [scwf] fix formate
a7bcb98 [scwf] use ConcurrentHashMap in sql conf, intead synchronizedMap
2014-08-19 19:37:02 -07:00
Michael Armbrust 3abd0c1cda [SPARK-2406][SQL] Initial support for using ParquetTableScan to read HiveMetaStore tables.
This PR adds an experimental flag `spark.sql.hive.convertMetastoreParquet` that when true causes the planner to detects tables that use Hive's Parquet SerDe and instead plans them using Spark SQL's native `ParquetTableScan`.

Author: Michael Armbrust <michael@databricks.com>
Author: Yin Huai <huai@cse.ohio-state.edu>

Closes #1819 from marmbrus/parquetMetastore and squashes the following commits:

1620079 [Michael Armbrust] Revert "remove hive parquet bundle"
cc30430 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into parquetMetastore
4f3d54f [Michael Armbrust] fix style
41ebc5f [Michael Armbrust] remove hive parquet bundle
a43e0da [Michael Armbrust] Merge remote-tracking branch 'origin/master' into parquetMetastore
4c4dc19 [Michael Armbrust] Fix bug with tree splicing.
ebb267e [Michael Armbrust] include parquet hive to tests pass (Remove this later).
c0d9b72 [Michael Armbrust] Avoid creating a HadoopRDD per partition.  Add dirty hacks to retrieve partition values from the InputSplit.
8cdc93c [Michael Armbrust] Merge pull request #8 from yhuai/parquetMetastore
a0baec7 [Yin Huai] Partitioning columns can be resolved.
1161338 [Michael Armbrust] Add a test to make sure conversion is actually happening
212d5cd [Michael Armbrust] Initial support for using ParquetTableScan to read HiveMetaStore tables.
2014-08-18 13:17:10 -07:00
Matei Zaharia 9eb74c7d2c [SPARK-3091] [SQL] Add support for caching metadata on Parquet files
For larger Parquet files, reading the file footers (which is done in parallel on up to 5 threads) and HDFS block locations (which is serial) can take multiple seconds. We can add an option to cache this data within FilteringParquetInputFormat. Unfortunately ParquetInputFormat only caches footers within each instance of ParquetInputFormat, not across them.

Note: this PR leaves this turned off by default for 1.1, but I believe it's safe to turn it on after. The keys in the hash maps are FileStatus objects that include a modification time, so this will work fine if files are modified. The location cache could become invalid if files have moved within HDFS, but that's rare so I just made it invalidate entries every 15 minutes.

Author: Matei Zaharia <matei@databricks.com>

Closes #2005 from mateiz/parquet-cache and squashes the following commits:

dae8efe [Matei Zaharia] Bug fix
c71e9ed [Matei Zaharia] Handle empty statuses directly
22072b0 [Matei Zaharia] Use Guava caches and add a config option for caching metadata
8fb56ce [Matei Zaharia] Cache file block locations too
453bd21 [Matei Zaharia] Bug fix
4094df6 [Matei Zaharia] First attempt at caching Parquet footers
2014-08-18 11:00:10 -07:00
Patrick Wendell 6bca8898a1 SPARK-3025 [SQL]: Allow JDBC clients to set a fair scheduler pool
This definitely needs review as I am not familiar with this part of Spark.
I tested this locally and it did seem to work.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #1937 from pwendell/scheduler and squashes the following commits:

b858e33 [Patrick Wendell] SPARK-3025: Allow JDBC clients to set a fair scheduler pool
2014-08-18 10:52:20 -07:00
Matei Zaharia 4bf3de7107 [SPARK-3085] [SQL] Use compact data structures in SQL joins
This reuses the CompactBuffer from Spark Core to save memory and pointer
dereferences. I also tried AppendOnlyMap instead of java.util.HashMap
but unfortunately that slows things down because it seems to do more
equals() calls and the equals on GenericRow, and especially JoinedRow,
is pretty expensive.

Author: Matei Zaharia <matei@databricks.com>

Closes #1993 from mateiz/spark-3085 and squashes the following commits:

188221e [Matei Zaharia] Remove unneeded import
5f903ee [Matei Zaharia] [SPARK-3085] [SQL] Use compact data structures in SQL joins
2014-08-18 10:45:24 -07:00
Matei Zaharia 6a13dca12f [SPARK-3084] [SQL] Collect broadcasted tables in parallel in joins
BroadcastHashJoin has a broadcastFuture variable that tries to collect
the broadcasted table in a separate thread, but this doesn't help
because it's a lazy val that only gets initialized when you attempt to
build the RDD. Thus queries that broadcast multiple tables would collect
and broadcast them sequentially. I changed this to a val to let it start
collecting right when the operator is created.

Author: Matei Zaharia <matei@databricks.com>

Closes #1990 from mateiz/spark-3084 and squashes the following commits:

f468766 [Matei Zaharia] [SPARK-3084] Collect broadcasted tables in parallel in joins
2014-08-18 10:05:52 -07:00
Patrick Wendell 7ae28d1247 SPARK-3096: Include parquet hive serde by default in build
A small change - we should just add this dependency. It doesn't have any recursive deps and it's needed for reading have parquet tables.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #2009 from pwendell/parquet and squashes the following commits:

e411f9f [Patrick Wendell] SPARk-309: Include parquet hive serde by default in build
2014-08-18 10:00:46 -07:00
Michael Armbrust bfa09b01d7 [SQL] Improve debug logging and toStrings.
Author: Michael Armbrust <michael@databricks.com>

Closes #2004 from marmbrus/codgenDebugging and squashes the following commits:

b7a7e41 [Michael Armbrust] Improve debug logging and toStrings.
2014-08-17 19:00:38 -07:00
Michael Armbrust 5ecb08ea06 Revert "[SPARK-2970] [SQL] spark-sql script ends with IOException when EventLogging is enabled"
Revert #1891 due to issues with hadoop 1 compatibility.

Author: Michael Armbrust <michael@databricks.com>

Closes #2007 from marmbrus/revert1891 and squashes the following commits:

68706c0 [Michael Armbrust] Revert "[SPARK-2970] [SQL] spark-sql script ends with IOException when EventLogging is enabled"
2014-08-17 18:10:45 -07:00
Sean Owen e1b85f3102 SPARK-2955 [BUILD] Test code fails to compile with "mvn compile" without "install"
(This is the corrected follow-up to https://issues.apache.org/jira/browse/SPARK-2903)

Right now, `mvn compile test-compile` fails to compile Spark. (Don't worry; `mvn package` works, so this is not major.) The issue stems from test code in some modules depending on test code in other modules. That is perfectly fine and supported by Maven.

It takes extra work to get this to work with scalatest, and this has been attempted: https://github.com/apache/spark/blob/master/sql/catalyst/pom.xml#L86

This formulation is not quite enough, since the SQL Core module's tests fail to compile for lack of finding test classes in SQL Catalyst, and likewise for most Streaming integration modules depending on core Streaming test code. Example:

```
[error] /Users/srowen/Documents/spark/sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala:23: not found: type PlanTest
[error] class QueryTest extends PlanTest {
[error]                         ^
[error] /Users/srowen/Documents/spark/sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala:28: package org.apache.spark.sql.test is not a value
[error]   test("SPARK-1669: cacheTable should be idempotent") {
[error]   ^
...
```

The issue I believe is that generation of a `test-jar` is bound here to the `compile` phase, but the test classes are not being compiled in this phase. It should bind to the `test-compile` phase.

It works when executing `mvn package` or `mvn install` since test-jar artifacts are actually generated available through normal Maven mechanisms as each module is built. They are then found normally, regardless of scalatest configuration.

It would be nice for a simple `mvn compile test-compile` to work since the test code is perfectly compilable given the Maven declarations.

On the plus side, this change is low-risk as it only affects tests.
yhuai made the original scalatest change and has glanced at this and thinks it makes sense.

Author: Sean Owen <srowen@gmail.com>

Closes #1879 from srowen/SPARK-2955 and squashes the following commits:

ad8242f [Sean Owen] Generate test-jar on test-compile for modules whose tests are needed by others' tests
2014-08-14 22:08:44 -07:00
Michael Armbrust a7f8a4f5ee Revert [SPARK-3011][SQL] _temporary directory should be filtered out by sqlContext.parquetFile
Reverts #1924 due to build failures with hadoop 0.23.

Author: Michael Armbrust <michael@databricks.com>

Closes #1949 from marmbrus/revert1924 and squashes the following commits:

6bff940 [Michael Armbrust] Revert "[SPARK-3011][SQL] _temporary directory should be filtered out by sqlContext.parquetFile"
2014-08-14 13:00:21 -07:00
Yin Huai add75d4831 [SPARK-2927][SQL] Add a conf to configure if we always read Binary columns stored in Parquet as String columns
This PR adds a new conf flag `spark.sql.parquet.binaryAsString`. When it is `true`, if there is no parquet metadata file available to provide the schema of the data, we will always treat binary fields stored in parquet as string fields. This conf is used to provide a way to read string fields generated without UTF8 decoration.

JIRA: https://issues.apache.org/jira/browse/SPARK-2927

Author: Yin Huai <huai@cse.ohio-state.edu>

Closes #1855 from yhuai/parquetBinaryAsString and squashes the following commits:

689ffa9 [Yin Huai] Add missing "=".
80827de [Yin Huai] Unit test.
1765ca4 [Yin Huai] Use .toBoolean.
9d3f199 [Yin Huai] Merge remote-tracking branch 'upstream/master' into parquetBinaryAsString
5d436a1 [Yin Huai] The initial support of adding a conf to treat binary columns stored in Parquet as string columns.
2014-08-14 10:46:33 -07:00
Chia-Yung Su 078f3fbda8 [SPARK-3011][SQL] _temporary directory should be filtered out by sqlContext.parquetFile
Author: Chia-Yung Su <chiayung@appier.com>

Closes #1924 from joesu/bugfix-spark3011 and squashes the following commits:

c7e44f2 [Chia-Yung Su] match syntax
f8fc32a [Chia-Yung Su] filter out tmp dir
2014-08-14 10:43:08 -07:00
guowei 63d6777737 [SPARK-2986] [SQL] fixed: setting properties does not effect
it seems that set command does not run by SparkSQLDriver. it runs on hive api.
user can not change reduce number by setting spark.sql.shuffle.partitions

but i think setting hive properties seems just a role to spark sql.

Author: guowei <guowei@upyoo.com>

Closes #1904 from guowei2/temp-branch and squashes the following commits:

7d47dde [guowei] fixed: setting properties like spark.sql.shuffle.partitions does not effective
2014-08-13 17:45:24 -07:00
Kousuke Saruta 905dc4b405 [SPARK-2970] [SQL] spark-sql script ends with IOException when EventLogging is enabled
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #1891 from sarutak/SPARK-2970 and squashes the following commits:

4a2d2fe [Kousuke Saruta] Modified comment style
8bd833c [Kousuke Saruta] Modified style
6c0997c [Kousuke Saruta] Modified the timing of shutdown hook execution. It should be executed before shutdown hook of o.a.h.f.FileSystem
2014-08-13 17:42:38 -07:00
Michael Armbrust 9fde1ff5fc [SPARK-2935][SQL]Fix parquet predicate push down bug
Author: Michael Armbrust <michael@databricks.com>

Closes #1863 from marmbrus/parquetPredicates and squashes the following commits:

10ad202 [Michael Armbrust] left <=> right
f249158 [Michael Armbrust] quiet parquet tests.
802da5b [Michael Armbrust] Add test case.
eab2eda [Michael Armbrust] Fix parquet predicate push down bug
2014-08-13 17:40:59 -07:00
Cheng Lian 376a82e196 [SPARK-2650][SQL] More precise initial buffer size estimation for in-memory column buffer
This is a follow up of #1880.

Since the row number within a single batch is known, we can estimate a much more precise initial buffer size when building an in-memory column buffer.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1901 from liancheng/precise-init-buffer-size and squashes the following commits:

d5501fa [Cheng Lian] More precise initial buffer size estimation for in-memory column buffer
2014-08-13 17:37:55 -07:00
Michael Armbrust 9256d4a9c8 [SPARK-2994][SQL] Support for udfs that take complex types
Author: Michael Armbrust <michael@databricks.com>

Closes #1915 from marmbrus/arrayUDF and squashes the following commits:

a1c503d [Michael Armbrust] Support for udfs that take complex types
2014-08-13 17:35:38 -07:00
tianyi 13f54e2b97 [SPARK-2817] [SQL] add "show create table" support
In spark sql component, the "show create table" syntax had been disabled.
We thought it is a useful funciton to describe a hive table.

Author: tianyi <tianyi@asiainfo-linkage.com>
Author: tianyi <tianyi@asiainfo.com>
Author: tianyi <tianyi.asiainfo@gmail.com>

Closes #1760 from tianyi/spark-2817 and squashes the following commits:

7d28b15 [tianyi] [SPARK-2817] fix too short prefix problem
cbffe8b [tianyi] [SPARK-2817] fix the case problem
565ec14 [tianyi] [SPARK-2817] fix the case problem
60d48a9 [tianyi] [SPARK-2817] use system temporary folder instead of temporary files in the source tree, and also clean some empty line
dbe1031 [tianyi] [SPARK-2817] move some code out of function rewritePaths, as it may be called multiple times
9b2ba11 [tianyi] [SPARK-2817] fix the line length problem
9f97586 [tianyi] [SPARK-2817] remove test.tmp.dir from pom.xml
bfc2999 [tianyi] [SPARK-2817] add "File.separator" support, create a "testTmpDir" outside the rewritePaths
bde800a [tianyi] [SPARK-2817] add "${system:test.tmp.dir}" support add "last_modified_by" to nonDeterministicLineIndicators in HiveComparisonTest
bb82726 [tianyi] [SPARK-2817] remove test which requires a system from the whitelist.
bbf6b42 [tianyi] [SPARK-2817] add a systemProperties named "test.tmp.dir" to pass the test which contains "${system:test.tmp.dir}"
a337bd6 [tianyi] [SPARK-2817] add "show create table" support
a03db77 [tianyi] [SPARK-2817] add "show create table" support
2014-08-13 16:50:02 -07:00
Cheng Lian bdc7a1a474 [SPARK-3004][SQL] Added null checking when retrieving row set
JIRA issue: [SPARK-3004](https://issues.apache.org/jira/browse/SPARK-3004)

HiveThriftServer2 throws exception when the result set contains `NULL`. Should check `isNullAt` in `SparkSQLOperationManager.getNextRowSet`.

Note that simply using `row.addColumnValue(null)` doesn't work, since Hive set the column type of a null `ColumnValue` to String by default.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1920 from liancheng/spark-3004 and squashes the following commits:

1b1db1c [Cheng Lian] Adding NULL column values in the Hive way
2217722 [Cheng Lian] Fixed SPARK-3004: added null checking when retrieving row set
2014-08-13 16:27:50 -07:00
Cheng Hao 5d54d71ddb [SQL] [SPARK-2826] Reduce the memory copy while building the hashmap for HashOuterJoin
This is a follow up for #1147 , this PR will improve the performance about 10% - 15% in my local tests.
```
Before:
LeftOuterJoin: took 16750 ms ([3000000] records)
LeftOuterJoin: took 15179 ms ([3000000] records)
RightOuterJoin: took 15515 ms ([3000000] records)
RightOuterJoin: took 15276 ms ([3000000] records)
FullOuterJoin: took 19150 ms ([6000000] records)
FullOuterJoin: took 18935 ms ([6000000] records)

After:
LeftOuterJoin: took 15218 ms ([3000000] records)
LeftOuterJoin: took 13503 ms ([3000000] records)
RightOuterJoin: took 13663 ms ([3000000] records)
RightOuterJoin: took 14025 ms ([3000000] records)
FullOuterJoin: took 16624 ms ([6000000] records)
FullOuterJoin: took 16578 ms ([6000000] records)
```

Besides the performance improvement, I also do some clean up as suggested in #1147

Author: Cheng Hao <hao.cheng@intel.com>

Closes #1765 from chenghao-intel/hash_outer_join_fixing and squashes the following commits:

ab1f9e0 [Cheng Hao] Reduce the memory copy while building the hashmap
2014-08-11 20:45:14 -07:00
Michael Armbrust bad21ed085 [SPARK-2650][SQL] Build column buffers in smaller batches
Author: Michael Armbrust <michael@databricks.com>

Closes #1880 from marmbrus/columnBatches and squashes the following commits:

0649987 [Michael Armbrust] add test
4756fad [Michael Armbrust] fix compilation
2314532 [Michael Armbrust] Build column buffers in smaller batches
2014-08-11 20:21:56 -07:00
Takuya UESHIN c686b7dd46 [SPARK-2968][SQL] Fix nullabilities of Explode.
Output nullabilities of `Explode` could be detemined by `ArrayType.containsNull` or `MapType.valueContainsNull`.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #1888 from ueshin/issues/SPARK-2968 and squashes the following commits:

d128c95 [Takuya UESHIN] Fix nullability of Explode.
2014-08-11 20:18:03 -07:00
Takuya UESHIN c9c89c31b6 [SPARK-2965][SQL] Fix HashOuterJoin output nullabilities.
Output attributes of opposite side of `OuterJoin` should be nullable.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #1887 from ueshin/issues/SPARK-2965 and squashes the following commits:

bcb2d37 [Takuya UESHIN] Fix HashOuterJoin output nullabilities.
2014-08-11 20:15:01 -07:00
Yin Huai 647aeba3a9 [SQL] A tiny refactoring in HiveContext#analyze
I should use `EliminateAnalysisOperators` in  `analyze` instead of manually pattern matching.

Author: Yin Huai <huaiyin.thu@gmail.com>

Closes #1881 from yhuai/useEliminateAnalysisOperators and squashes the following commits:

f3e1e7f [Yin Huai] Use EliminateAnalysisOperators.
2014-08-11 20:11:29 -07:00
wangfei e83fdcd421 [sql]use SparkSQLEnv.stop() in ShutdownHook
Author: wangfei <wangfei1@huawei.com>

Closes #1852 from scwf/patch-3 and squashes the following commits:

ae28c29 [wangfei] use SparkSQLEnv.stop() in ShutdownHook
2014-08-11 20:10:13 -07:00
Cheng Lian 21a95ef051 [SPARK-2590][SQL] Added option to handle incremental collection, disabled by default
JIRA issue: [SPARK-2590](https://issues.apache.org/jira/browse/SPARK-2590)

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1853 from liancheng/inc-collect-option and squashes the following commits:

cb3ea45 [Cheng Lian] Moved incremental collection option to Thrift server
43ce3aa [Cheng Lian] Changed incremental collect option name
623abde [Cheng Lian] Added option to handle incremental collection, disabled by default
2014-08-11 20:08:06 -07:00
Reynold Xin 5b6585de6b Updated Spark SQL README to include the hive-thriftserver module
Author: Reynold Xin <rxin@apache.org>

Closes #1867 from rxin/sql-readme and squashes the following commits:

42a5307 [Reynold Xin] Updated Spark SQL README to include the hive-thriftserver module
2014-08-09 22:05:36 -07:00
chutium b7c89a7f0c [SPARK-2700] [SQL] Hidden files (such as .impala_insert_staging) should be filtered out by sqlContext.parquetFile
Author: chutium <teng.qiu@gmail.com>

Closes #1691 from chutium/SPARK-2700 and squashes the following commits:

b76ae8c [chutium] [SPARK-2700] [SQL] fixed styling issue
d75a8bd [chutium] [SPARK-2700] [SQL] Hidden files (such as .impala_insert_staging) should be filtered out by sqlContext.parquetFile
2014-08-08 13:31:08 -07:00
Yin Huai 45d8f4deab [SPARK-2919] [SQL] Basic support for analyze command in HiveQl
The command we will support is
```
ANALYZE TABLE tablename COMPUTE STATISTICS noscan
```
Other cases shown in https://cwiki.apache.org/confluence/display/Hive/StatsDev#StatsDev-ExistingTables will still be treated as Hive native commands.

JIRA: https://issues.apache.org/jira/browse/SPARK-2919

Author: Yin Huai <huai@cse.ohio-state.edu>

Closes #1848 from yhuai/sqlAnalyze and squashes the following commits:

0b79d36 [Yin Huai] Typo and format.
c59d94b [Yin Huai] Support "ANALYZE TABLE tableName COMPUTE STATISTICS noscan".
2014-08-08 11:23:58 -07:00
Yin Huai c874723fa8 [SPARK-2877] [SQL] MetastoreRelation should use SparkClassLoader when creating the tableDesc
JIRA: https://issues.apache.org/jira/browse/SPARK-2877

Author: Yin Huai <huai@cse.ohio-state.edu>

Closes #1806 from yhuai/SPARK-2877 and squashes the following commits:

4142bcb [Yin Huai] Use Spark's classloader.
2014-08-08 11:15:16 -07:00
Yin Huai 0489cee6b2 [SPARK-2908] [SQL] JsonRDD.nullTypeToStringType does not convert all NullType to StringType
JIRA: https://issues.apache.org/jira/browse/SPARK-2908

Author: Yin Huai <huai@cse.ohio-state.edu>

Closes #1840 from yhuai/SPARK-2908 and squashes the following commits:

86e833e [Yin Huai] Update test.
cb11759 [Yin Huai] nullTypeToStringType should check columns with the type of array of structs.
2014-08-08 11:10:11 -07:00
Yin Huai 9016af3f27 [SPARK-2888] [SQL] Fix addColumnMetadataToConf in HiveTableScan
JIRA: https://issues.apache.org/jira/browse/SPARK-2888

Author: Yin Huai <huai@cse.ohio-state.edu>

Closes #1817 from yhuai/fixAddColumnMetadataToConf and squashes the following commits:

fba728c [Yin Huai] Fix addColumnMetadataToConf.
2014-08-08 11:01:51 -07:00
Cheng Lian a6cd31108f [SPARK-2678][Core][SQL] A workaround for SPARK-2678
JIRA issues:

- Main: [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678)
- Related: [SPARK-2874](https://issues.apache.org/jira/browse/SPARK-2874)

Related PR:

- #1715

This PR is both a fix for SPARK-2874 and a workaround for SPARK-2678. Fixing SPARK-2678 completely requires some API level changes that need further discussion, and we decided not to include it in Spark 1.1 release. As currently SPARK-2678 only affects Spark SQL scripts, this workaround is enough for Spark 1.1. Command line option handling logic in bash scripts looks somewhat dirty and duplicated, but it helps to provide a cleaner user interface as well as retain full downward compatibility for now.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1801 from liancheng/spark-2874 and squashes the following commits:

8045d7a [Cheng Lian] Make sure test suites pass
8493a9e [Cheng Lian] Using eval to retain quoted arguments
aed523f [Cheng Lian] Fixed typo in bin/spark-sql
f12a0b1 [Cheng Lian] Worked arount SPARK-2678
daee105 [Cheng Lian] Fixed usage messages of all Spark SQL related scripts
2014-08-06 12:28:35 -07:00
Davies Liu 48789117c2 [SPARK-2875] [PySpark] [SQL] handle null in schemaRDD()
Handle null in schemaRDD during converting them into Python.

Author: Davies Liu <davies.liu@gmail.com>

Closes #1802 from davies/json and squashes the following commits:

88e6b1f [Davies Liu] handle null in schemaRDD()
2014-08-06 11:08:12 -07:00
Michael Armbrust 5a826c00c3 [SQL] Fix logging warn -> debug
Author: Michael Armbrust <michael@databricks.com>

Closes #1800 from marmbrus/warning and squashes the following commits:

8ea9cf1 [Michael Armbrust] [SQL] Fix logging warn -> debug.
2014-08-05 22:30:32 -07:00
Reynold Xin b70bae40eb [SQL] Tighten the visibility of various SQLConf methods and renamed setter/getters
Author: Reynold Xin <rxin@apache.org>

Closes #1794 from rxin/sql-conf and squashes the following commits:

3ac11ef [Reynold Xin] getAllConfs return an immutable Map instead of an Array.
4b19d6c [Reynold Xin] Tighten the visibility of various SQLConf methods and renamed setter/getters.
2014-08-05 22:29:19 -07:00
Michael Armbrust 1d70c4f66d [SPARK-2866][SQL] Support attributes in ORDER BY that aren't in SELECT
Minor refactoring to allow resolution either using a nodes input or output.

Author: Michael Armbrust <michael@databricks.com>

Closes #1795 from marmbrus/ordering and squashes the following commits:

237f580 [Michael Armbrust] style
74d833b [Michael Armbrust] newline
705d963 [Michael Armbrust] Add a rule for resolving ORDER BY expressions that reference attributes not present in the SELECT clause.
82cabda [Michael Armbrust] Generalize attribute resolution.
2014-08-05 20:55:02 -07:00