We can simply treat a cross join as an inner join without join conditions.
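The equivalence can be sketched in plain Java (no Spark involved): an inner join whose predicate always holds is exactly the Cartesian product, which is why a cross join can be planned as an inner join with no join condition. `JoinSketch` is an illustrative name, not a Spark class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

// Sketch: an inner join whose predicate is always true degenerates into
// a cross join (Cartesian product) of the two inputs.
class JoinSketch {
    static <A, B> List<Object[]> innerJoin(List<A> left, List<B> right,
                                           BiPredicate<A, B> cond) {
        List<Object[]> out = new ArrayList<>();
        for (A l : left)
            for (B r : right)
                if (cond.test(l, r)) out.add(new Object[]{l, r});
        return out;
    }

    static <A, B> List<Object[]> crossJoin(List<A> left, List<B> right) {
        // No join condition: keep every (left, right) pair.
        return innerJoin(left, right, (l, r) -> true);
    }
}
```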
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Author: adrian-wang <daoyuanwong@gmail.com>
Closes#2124 from adrian-wang/crossjoin and squashes the following commits:
8c9b7c5 [Daoyuan Wang] add a test
7d47bbb [adrian-wang] add cross join support for hql
Fix compile error on Hadoop 0.23 for pull request #1924.
Author: Chia-Yung Su <chiayung@appier.com>
Closes#1959 from joesu/bugfix-spark3011 and squashes the following commits:
be30793 [Chia-Yung Su] remove .* and _* except _metadata
8fe2398 [Chia-Yung Su] add note to explain
40ea9bd [Chia-Yung Su] fix hadoop-0.23 compile error
c7e44f2 [Chia-Yung Su] match syntax
f8fc32a [Chia-Yung Su] filter out tmp dir
Author: wangfei <wangfei_hello@126.com>
Closes#1939 from scwf/patch-5 and squashes the following commits:
f952d10 [wangfei] [SQL] logWarning should be logInfo in getResultSetSchema
Provide `extended` keyword support for the `explain` command in SQL, e.g.
```
explain extended select key as a1, value as a2 from src where key=1;
== Parsed Logical Plan ==
Project ['key AS a1#3,'value AS a2#4]
Filter ('key = 1)
UnresolvedRelation None, src, None
== Analyzed Logical Plan ==
Project [key#8 AS a1#3,value#9 AS a2#4]
Filter (CAST(key#8, DoubleType) = CAST(1, DoubleType))
MetastoreRelation default, src, None
== Optimized Logical Plan ==
Project [key#8 AS a1#3,value#9 AS a2#4]
Filter (CAST(key#8, DoubleType) = 1.0)
MetastoreRelation default, src, None
== Physical Plan ==
Project [key#8 AS a1#3,value#9 AS a2#4]
Filter (CAST(key#8, DoubleType) = 1.0)
HiveTableScan [key#8,value#9], (MetastoreRelation default, src, None), None
Code Generation: false
== RDD ==
(2) MappedRDD[14] at map at HiveContext.scala:350
MapPartitionsRDD[13] at mapPartitions at basicOperators.scala:42
MapPartitionsRDD[12] at mapPartitions at basicOperators.scala:57
MapPartitionsRDD[11] at mapPartitions at TableReader.scala:112
MappedRDD[10] at map at TableReader.scala:240
HadoopRDD[9] at HadoopRDD at TableReader.scala:230
```
This is a sub-task of #1847, but it can proceed without any dependency.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#1962 from chenghao-intel/explain_extended and squashes the following commits:
295db74 [Cheng Hao] Fix bug in printing the simple execution plan
48bc989 [Cheng Hao] Support EXTENDED for EXPLAIN
Removed most hard-coded timeouts, timing assumptions, and all `Thread.sleep` calls. Simplified IPC and synchronization with `scala.sys.process` and future/promise so that the test suites run more robustly and faster.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes#1856 from liancheng/thriftserver-tests and squashes the following commits:
2d914ca [Cheng Lian] Minor refactoring
0e12e71 [Cheng Lian] Cleaned up test output
0ee921d [Cheng Lian] Refactored Thrift server and CLI suites
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#2116 from ueshin/issues/SPARK-3204 and squashes the following commits:
7d9b107 [Takuya UESHIN] Make MaxOf foldable if both left and right are foldable.
Follow-up to #2066
Author: Michael Armbrust <michael@databricks.com>
Closes#2072 from marmbrus/sortShuffle and squashes the following commits:
2ff8114 [Michael Armbrust] Fix bug
It seems we missed marking `functionRegistry` as `transient` in `HiveContext`.
cc: marmbrus
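A minimal Java sketch of why the `transient` marker matters here: Java serialization skips transient fields, so a non-serializable (or heavyweight) member does not get dragged into the serialized object. The `Context`/`functionRegistry` names below are illustrative stand-ins, not Spark's actual classes.

```java
import java.io.*;

// Sketch: a transient field is skipped during serialization, so serializing
// the enclosing object succeeds even though the field itself (a plain Object
// here) is not Serializable. After deserialization the field comes back null.
class Context implements Serializable {
    transient Object functionRegistry = new Object(); // skipped by serialization
    String name = "ctx";

    static Context roundTrip(Context c) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(c); // would throw NotSerializableException without `transient`
            }
            try (ObjectInputStream ois = new ObjectInputStream(
                    new ByteArrayInputStream(bos.toByteArray()))) {
                return (Context) ois.readObject();
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```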
Author: Yin Huai <huaiyin.thu@gmail.com>
Closes#2074 from yhuai/makeFunctionRegistryTransient and squashes the following commits:
6534e7d [Yin Huai] Make functionRegistry transient.
...al job conf
Author: Alex Liu <alex_liu68@yahoo.com>
Closes#1927 from alexliu68/SPARK-SQL-2846 and squashes the following commits:
e4bdc4c [Alex Liu] SPARK-SQL-2846 add configureInputJobPropertiesForStorageHandler to initial job conf
Add explicit row copies when sort based shuffle is on.
Author: Michael Armbrust <michael@databricks.com>
Closes#2066 from marmbrus/sortShuffle and squashes the following commits:
fcd7bb2 [Michael Armbrust] Fix sort based shuffle for spark sql.
This PR fixes two issues:
1. Fixes wrongly quoted command line option in `HiveThriftServer2Suite` that makes test cases hang until timeout.
1. Asks `dev/run-test` to run Spark SQL tests when `bin/spark-sql` and/or `sbin/start-thriftserver.sh` are modified.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes#2036 from liancheng/fix-thriftserver-test and squashes the following commits:
f38c4eb [Cheng Lian] Fixed the same quotation issue in CliSuite
26b82a0 [Cheng Lian] Run SQL tests when dff contains bin/spark-sql and/or sbin/start-thriftserver.sh
a87f83d [Cheng Lian] Extended timeout
e5aa31a [Cheng Lian] Fixed metastore JDBC URI quotation
Refer to:
http://stackoverflow.com/questions/510632/whats-the-difference-between-concurrenthashmap-and-collections-synchronizedmap
`Collections.synchronizedMap(map)` creates a blocking Map, which ensures consistency but degrades performance. Use `ConcurrentHashMap` (a more efficient thread-safe hash map) instead.
Also updates HiveQuerySuite to fix a test error caused by the change to ConcurrentHashMap.
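The trade-off can be sketched in plain Java; `ConfMaps` is an illustrative name, not Spark's `SQLConf`. Both maps are thread-safe, but the synchronized wrapper takes one lock for every operation, while `ConcurrentHashMap` allows lock-free reads and fine-grained locking on writes.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the two thread-safe options discussed above.
class ConfMaps {
    static Map<String, String> blocking() {
        // Every get/put synchronizes on the whole wrapper object.
        return Collections.synchronizedMap(new HashMap<>());
    }

    static Map<String, String> concurrent() {
        // Reads are lock-free; writes lock only a portion of the table.
        // Note: ConcurrentHashMap rejects null keys and values.
        return new ConcurrentHashMap<>();
    }
}
```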
Author: wangfei <wangfei_hello@126.com>
Author: scwf <wangfei1@huawei.com>
Closes#1996 from scwf/sqlconf and squashes the following commits:
93bc0c5 [wangfei] revert change of HiveQuerySuite
0cc05dd [wangfei] add note for use synchronizedMap
3c224d31 [scwf] fix formate
a7bcb98 [scwf] use ConcurrentHashMap in sql conf, intead synchronizedMap
This PR adds an experimental flag `spark.sql.hive.convertMetastoreParquet` that, when true, causes the planner to detect tables that use Hive's Parquet SerDe and instead plan them using Spark SQL's native `ParquetTableScan`.
Author: Michael Armbrust <michael@databricks.com>
Author: Yin Huai <huai@cse.ohio-state.edu>
Closes#1819 from marmbrus/parquetMetastore and squashes the following commits:
1620079 [Michael Armbrust] Revert "remove hive parquet bundle"
cc30430 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into parquetMetastore
4f3d54f [Michael Armbrust] fix style
41ebc5f [Michael Armbrust] remove hive parquet bundle
a43e0da [Michael Armbrust] Merge remote-tracking branch 'origin/master' into parquetMetastore
4c4dc19 [Michael Armbrust] Fix bug with tree splicing.
ebb267e [Michael Armbrust] include parquet hive to tests pass (Remove this later).
c0d9b72 [Michael Armbrust] Avoid creating a HadoopRDD per partition. Add dirty hacks to retrieve partition values from the InputSplit.
8cdc93c [Michael Armbrust] Merge pull request #8 from yhuai/parquetMetastore
a0baec7 [Yin Huai] Partitioning columns can be resolved.
1161338 [Michael Armbrust] Add a test to make sure conversion is actually happening
212d5cd [Michael Armbrust] Initial support for using ParquetTableScan to read HiveMetaStore tables.
For larger Parquet files, reading the file footers (which is done in parallel on up to 5 threads) and HDFS block locations (which is serial) can take multiple seconds. We can add an option to cache this data within FilteringParquetInputFormat. Unfortunately ParquetInputFormat only caches footers within each instance of ParquetInputFormat, not across them.
Note: this PR leaves this turned off by default for 1.1, but I believe it's safe to turn it on after. The keys in the hash maps are FileStatus objects that include a modification time, so this will work fine if files are modified. The location cache could become invalid if files have moved within HDFS, but that's rare so I just made it invalidate entries every 15 minutes.
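The invalidation idea can be sketched without Guava as a small TTL cache (the real patch uses Guava caches; `TtlCache` and the explicit clock parameter below are illustrative, chosen so the behavior is testable without sleeping):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch: remember expensive per-file metadata (footers, block locations) and
// invalidate entries after a TTL, so files moved within HDFS only serve stale
// locations for a bounded time.
class TtlCache<K, V> {
    private static class Entry<V> {
        final V value;
        final long at;
        Entry(V v, long at) { this.value = v; this.at = at; }
    }

    private final Map<K, Entry<V>> map = new ConcurrentHashMap<>();
    private final long ttlNanos;

    TtlCache(long ttlNanos) { this.ttlNanos = ttlNanos; }

    V get(K key, Function<K, V> loader, long nowNanos) {
        Entry<V> e = map.get(key);
        if (e == null || nowNanos - e.at > ttlNanos) { // missing or expired
            e = new Entry<>(loader.apply(key), nowNanos);
            map.put(key, e);
        }
        return e.value;
    }
}
```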
Author: Matei Zaharia <matei@databricks.com>
Closes#2005 from mateiz/parquet-cache and squashes the following commits:
dae8efe [Matei Zaharia] Bug fix
c71e9ed [Matei Zaharia] Handle empty statuses directly
22072b0 [Matei Zaharia] Use Guava caches and add a config option for caching metadata
8fb56ce [Matei Zaharia] Cache file block locations too
453bd21 [Matei Zaharia] Bug fix
4094df6 [Matei Zaharia] First attempt at caching Parquet footers
This definitely needs review as I am not familiar with this part of Spark.
I tested this locally and it did seem to work.
Author: Patrick Wendell <pwendell@gmail.com>
Closes#1937 from pwendell/scheduler and squashes the following commits:
b858e33 [Patrick Wendell] SPARK-3025: Allow JDBC clients to set a fair scheduler pool
This reuses the CompactBuffer from Spark Core to save memory and pointer dereferences. I also tried AppendOnlyMap instead of java.util.HashMap, but unfortunately that slows things down because it seems to make more equals() calls, and equals() on GenericRow, and especially JoinedRow, is pretty expensive.
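The memory-saving idea behind CompactBuffer can be sketched in Java (this is an illustration of the technique, not Spark Core's implementation): keep the first couple of elements in fields, so the common case of one or two rows per join key allocates no backing array at all.

```java
// Sketch of the CompactBuffer idea: inline storage for the first two elements,
// a lazily allocated, geometrically grown array for the rest.
class CompactBufferSketch<T> {
    private T first, second;   // inline storage for the common case
    private Object[] rest;     // allocated only once a 3rd element arrives
    private int size = 0;

    void append(T value) {
        if (size == 0) first = value;
        else if (size == 1) second = value;
        else {
            if (rest == null) rest = new Object[4];
            else if (size - 2 == rest.length) {        // grow geometrically
                Object[] bigger = new Object[rest.length * 2];
                System.arraycopy(rest, 0, bigger, 0, rest.length);
                rest = bigger;
            }
            rest[size - 2] = value;
        }
        size++;
    }

    @SuppressWarnings("unchecked")
    T get(int i) {
        if (i == 0) return first;
        if (i == 1) return second;
        return (T) rest[i - 2];
    }

    int size() { return size; }
}
```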
Author: Matei Zaharia <matei@databricks.com>
Closes#1993 from mateiz/spark-3085 and squashes the following commits:
188221e [Matei Zaharia] Remove unneeded import
5f903ee [Matei Zaharia] [SPARK-3085] [SQL] Use compact data structures in SQL joins
BroadcastHashJoin has a broadcastFuture variable that tries to collect
the broadcasted table in a separate thread, but this doesn't help
because it's a lazy val that only gets initialized when you attempt to
build the RDD. Thus queries that broadcast multiple tables would collect
and broadcast them sequentially. I changed this to a val to let it start
collecting right when the operator is created.
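The difference can be sketched in Java with `CompletableFuture` (illustrative names, not Spark's operator): submitting the work in the constructor plays the role of the eager `val`, so every broadcast side is already collecting before any of them is awaited, whereas a `lazy val` would only start each collection on first access.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: kicking off the async collect at construction time means multiple
// broadcast sides run in parallel; awaiting happens later, when needed.
class EagerBroadcast {
    static final AtomicInteger submitted = new AtomicInteger();

    private final CompletableFuture<String> future;

    EagerBroadcast(String table) {
        submitted.incrementAndGet();             // submission happens at construction
        future = CompletableFuture.supplyAsync(() -> table + "-collected");
    }

    String await() { return future.join(); }     // blocks only when the result is needed
}
```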
Author: Matei Zaharia <matei@databricks.com>
Closes#1990 from mateiz/spark-3084 and squashes the following commits:
f468766 [Matei Zaharia] [SPARK-3084] Collect broadcasted tables in parallel in joins
A small change: we should just add this dependency. It doesn't have any recursive deps, and it's needed for reading Hive Parquet tables.
Author: Patrick Wendell <pwendell@gmail.com>
Closes#2009 from pwendell/parquet and squashes the following commits:
e411f9f [Patrick Wendell] SPARk-309: Include parquet hive serde by default in build
Author: Michael Armbrust <michael@databricks.com>
Closes#2004 from marmbrus/codgenDebugging and squashes the following commits:
b7a7e41 [Michael Armbrust] Improve debug logging and toStrings.
Revert #1891 due to issues with hadoop 1 compatibility.
Author: Michael Armbrust <michael@databricks.com>
Closes#2007 from marmbrus/revert1891 and squashes the following commits:
68706c0 [Michael Armbrust] Revert "[SPARK-2970] [SQL] spark-sql script ends with IOException when EventLogging is enabled"
(This is the corrected follow-up to https://issues.apache.org/jira/browse/SPARK-2903)
Right now, `mvn compile test-compile` fails to compile Spark. (Don't worry; `mvn package` works, so this is not major.) The issue stems from test code in some modules depending on test code in other modules. That is perfectly fine and supported by Maven.
It takes extra work to get this to work with scalatest, and this has been attempted: https://github.com/apache/spark/blob/master/sql/catalyst/pom.xml#L86
This formulation is not quite enough, since the SQL Core module's tests fail to compile for lack of finding test classes in SQL Catalyst, and likewise for most Streaming integration modules depending on core Streaming test code. Example:
```
[error] /Users/srowen/Documents/spark/sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala:23: not found: type PlanTest
[error] class QueryTest extends PlanTest {
[error] ^
[error] /Users/srowen/Documents/spark/sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala:28: package org.apache.spark.sql.test is not a value
[error] test("SPARK-1669: cacheTable should be idempotent") {
[error] ^
...
```
I believe the issue is that generation of a `test-jar` is bound here to the `compile` phase, but the test classes are not compiled in that phase. It should bind to the `test-compile` phase instead.
It works when executing `mvn package` or `mvn install`, since test-jar artifacts are actually generated and made available through normal Maven mechanisms as each module is built. They are then found normally, regardless of scalatest configuration.
It would be nice for a simple `mvn compile test-compile` to work since the test code is perfectly compilable given the Maven declarations.
On the plus side, this change is low-risk as it only affects tests.
yhuai made the original scalatest change and has glanced at this and thinks it makes sense.
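A sketch of the kind of `pom.xml` change described above, binding the `test-jar` goal to the `test-compile` phase so the test classes exist by the time the jar is built (the exact plugin configuration may differ per module):

```xml
<!-- Sketch: bind test-jar generation to test-compile instead of compile. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-jar-plugin</artifactId>
  <executions>
    <execution>
      <goals>
        <goal>test-jar</goal>
      </goals>
      <phase>test-compile</phase>
    </execution>
  </executions>
</plugin>
```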
Author: Sean Owen <srowen@gmail.com>
Closes#1879 from srowen/SPARK-2955 and squashes the following commits:
ad8242f [Sean Owen] Generate test-jar on test-compile for modules whose tests are needed by others' tests
Reverts #1924 due to build failures with hadoop 0.23.
Author: Michael Armbrust <michael@databricks.com>
Closes#1949 from marmbrus/revert1924 and squashes the following commits:
6bff940 [Michael Armbrust] Revert "[SPARK-3011][SQL] _temporary directory should be filtered out by sqlContext.parquetFile"
This PR adds a new conf flag `spark.sql.parquet.binaryAsString`. When it is `true`, if there is no parquet metadata file available to provide the schema of the data, we will always treat binary fields stored in parquet as string fields. This conf is used to provide a way to read string fields generated without UTF8 decoration.
JIRA: https://issues.apache.org/jira/browse/SPARK-2927
Author: Yin Huai <huai@cse.ohio-state.edu>
Closes#1855 from yhuai/parquetBinaryAsString and squashes the following commits:
689ffa9 [Yin Huai] Add missing "=".
80827de [Yin Huai] Unit test.
1765ca4 [Yin Huai] Use .toBoolean.
9d3f199 [Yin Huai] Merge remote-tracking branch 'upstream/master' into parquetBinaryAsString
5d436a1 [Yin Huai] The initial support of adding a conf to treat binary columns stored in Parquet as string columns.
Author: Chia-Yung Su <chiayung@appier.com>
Closes#1924 from joesu/bugfix-spark3011 and squashes the following commits:
c7e44f2 [Chia-Yung Su] match syntax
f8fc32a [Chia-Yung Su] filter out tmp dir
It seems that the SET command is not run by SparkSQLDriver; it runs through the Hive API. As a result, users cannot change the number of reducers by setting spark.sql.shuffle.partitions, even though handling such properties should be Spark SQL's job.
Author: guowei <guowei@upyoo.com>
Closes#1904 from guowei2/temp-branch and squashes the following commits:
7d47dde [guowei] fixed: setting properties like spark.sql.shuffle.partitions does not effective
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#1891 from sarutak/SPARK-2970 and squashes the following commits:
4a2d2fe [Kousuke Saruta] Modified comment style
8bd833c [Kousuke Saruta] Modified style
6c0997c [Kousuke Saruta] Modified the timing of shutdown hook execution. It should be executed before shutdown hook of o.a.h.f.FileSystem
Author: Michael Armbrust <michael@databricks.com>
Closes#1863 from marmbrus/parquetPredicates and squashes the following commits:
10ad202 [Michael Armbrust] left <=> right
f249158 [Michael Armbrust] quiet parquet tests.
802da5b [Michael Armbrust] Add test case.
eab2eda [Michael Armbrust] Fix parquet predicate push down bug
This is a follow up of #1880.
Since the number of rows within a single batch is known, we can estimate a much more precise initial buffer size when building an in-memory column buffer.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes#1901 from liancheng/precise-init-buffer-size and squashes the following commits:
d5501fa [Cheng Lian] More precise initial buffer size estimation for in-memory column buffer
Author: Michael Armbrust <michael@databricks.com>
Closes#1915 from marmbrus/arrayUDF and squashes the following commits:
a1c503d [Michael Armbrust] Support for udfs that take complex types
In the Spark SQL component, the "show create table" syntax had been disabled.
We think it is a useful function for describing a Hive table.
Author: tianyi <tianyi@asiainfo-linkage.com>
Author: tianyi <tianyi@asiainfo.com>
Author: tianyi <tianyi.asiainfo@gmail.com>
Closes#1760 from tianyi/spark-2817 and squashes the following commits:
7d28b15 [tianyi] [SPARK-2817] fix too short prefix problem
cbffe8b [tianyi] [SPARK-2817] fix the case problem
565ec14 [tianyi] [SPARK-2817] fix the case problem
60d48a9 [tianyi] [SPARK-2817] use system temporary folder instead of temporary files in the source tree, and also clean some empty line
dbe1031 [tianyi] [SPARK-2817] move some code out of function rewritePaths, as it may be called multiple times
9b2ba11 [tianyi] [SPARK-2817] fix the line length problem
9f97586 [tianyi] [SPARK-2817] remove test.tmp.dir from pom.xml
bfc2999 [tianyi] [SPARK-2817] add "File.separator" support, create a "testTmpDir" outside the rewritePaths
bde800a [tianyi] [SPARK-2817] add "${system:test.tmp.dir}" support add "last_modified_by" to nonDeterministicLineIndicators in HiveComparisonTest
bb82726 [tianyi] [SPARK-2817] remove test which requires a system from the whitelist.
bbf6b42 [tianyi] [SPARK-2817] add a systemProperties named "test.tmp.dir" to pass the test which contains "${system:test.tmp.dir}"
a337bd6 [tianyi] [SPARK-2817] add "show create table" support
a03db77 [tianyi] [SPARK-2817] add "show create table" support
JIRA issue: [SPARK-3004](https://issues.apache.org/jira/browse/SPARK-3004)
HiveThriftServer2 throws an exception when the result set contains `NULL`. We should check `isNullAt` in `SparkSQLOperationManager.getNextRowSet`.
Note that simply using `row.addColumnValue(null)` doesn't work, since Hive sets the column type of a null `ColumnValue` to String by default.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes#1920 from liancheng/spark-3004 and squashes the following commits:
1b1db1c [Cheng Lian] Adding NULL column values in the Hive way
2217722 [Cheng Lian] Fixed SPARK-3004: added null checking when retrieving row set
This is a follow-up to #1147; this PR improves performance by about 10%-15% in my local tests.
```
Before:
LeftOuterJoin: took 16750 ms ([3000000] records)
LeftOuterJoin: took 15179 ms ([3000000] records)
RightOuterJoin: took 15515 ms ([3000000] records)
RightOuterJoin: took 15276 ms ([3000000] records)
FullOuterJoin: took 19150 ms ([6000000] records)
FullOuterJoin: took 18935 ms ([6000000] records)
After:
LeftOuterJoin: took 15218 ms ([3000000] records)
LeftOuterJoin: took 13503 ms ([3000000] records)
RightOuterJoin: took 13663 ms ([3000000] records)
RightOuterJoin: took 14025 ms ([3000000] records)
FullOuterJoin: took 16624 ms ([6000000] records)
FullOuterJoin: took 16578 ms ([6000000] records)
```
Besides the performance improvement, I also did some cleanup as suggested in #1147.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#1765 from chenghao-intel/hash_outer_join_fixing and squashes the following commits:
ab1f9e0 [Cheng Hao] Reduce the memory copy while building the hashmap
Author: Michael Armbrust <michael@databricks.com>
Closes#1880 from marmbrus/columnBatches and squashes the following commits:
0649987 [Michael Armbrust] add test
4756fad [Michael Armbrust] fix compilation
2314532 [Michael Armbrust] Build column buffers in smaller batches
Output nullabilities of `Explode` can be determined by `ArrayType.containsNull` or `MapType.valueContainsNull`.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#1888 from ueshin/issues/SPARK-2968 and squashes the following commits:
d128c95 [Takuya UESHIN] Fix nullability of Explode.
Output attributes of the opposite side of an `OuterJoin` should be nullable.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#1887 from ueshin/issues/SPARK-2965 and squashes the following commits:
bcb2d37 [Takuya UESHIN] Fix HashOuterJoin output nullabilities.
I should use `EliminateAnalysisOperators` in `analyze` instead of manually pattern matching.
Author: Yin Huai <huaiyin.thu@gmail.com>
Closes#1881 from yhuai/useEliminateAnalysisOperators and squashes the following commits:
f3e1e7f [Yin Huai] Use EliminateAnalysisOperators.
Author: wangfei <wangfei1@huawei.com>
Closes#1852 from scwf/patch-3 and squashes the following commits:
ae28c29 [wangfei] use SparkSQLEnv.stop() in ShutdownHook
JIRA issue: [SPARK-2590](https://issues.apache.org/jira/browse/SPARK-2590)
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes#1853 from liancheng/inc-collect-option and squashes the following commits:
cb3ea45 [Cheng Lian] Moved incremental collection option to Thrift server
43ce3aa [Cheng Lian] Changed incremental collect option name
623abde [Cheng Lian] Added option to handle incremental collection, disabled by default
Author: Reynold Xin <rxin@apache.org>
Closes#1867 from rxin/sql-readme and squashes the following commits:
42a5307 [Reynold Xin] Updated Spark SQL README to include the hive-thriftserver module
Author: chutium <teng.qiu@gmail.com>
Closes#1691 from chutium/SPARK-2700 and squashes the following commits:
b76ae8c [chutium] [SPARK-2700] [SQL] fixed styling issue
d75a8bd [chutium] [SPARK-2700] [SQL] Hidden files (such as .impala_insert_staging) should be filtered out by sqlContext.parquetFile
JIRA: https://issues.apache.org/jira/browse/SPARK-2908
Author: Yin Huai <huai@cse.ohio-state.edu>
Closes#1840 from yhuai/SPARK-2908 and squashes the following commits:
86e833e [Yin Huai] Update test.
cb11759 [Yin Huai] nullTypeToStringType should check columns with the type of array of structs.
JIRA: https://issues.apache.org/jira/browse/SPARK-2888
Author: Yin Huai <huai@cse.ohio-state.edu>
Closes#1817 from yhuai/fixAddColumnMetadataToConf and squashes the following commits:
fba728c [Yin Huai] Fix addColumnMetadataToConf.
JIRA issues:
- Main: [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678)
- Related: [SPARK-2874](https://issues.apache.org/jira/browse/SPARK-2874)
Related PR:
- #1715
This PR is both a fix for SPARK-2874 and a workaround for SPARK-2678. Fixing SPARK-2678 completely requires some API-level changes that need further discussion, and we decided not to include it in the Spark 1.1 release. As SPARK-2678 currently only affects Spark SQL scripts, this workaround is enough for Spark 1.1. Command line option handling logic in the bash scripts looks somewhat dirty and duplicated, but it helps to provide a cleaner user interface as well as retain full backward compatibility for now.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes#1801 from liancheng/spark-2874 and squashes the following commits:
8045d7a [Cheng Lian] Make sure test suites pass
8493a9e [Cheng Lian] Using eval to retain quoted arguments
aed523f [Cheng Lian] Fixed typo in bin/spark-sql
f12a0b1 [Cheng Lian] Worked arount SPARK-2678
daee105 [Cheng Lian] Fixed usage messages of all Spark SQL related scripts
Handle nulls in SchemaRDD when converting them into Python.
Author: Davies Liu <davies.liu@gmail.com>
Closes#1802 from davies/json and squashes the following commits:
88e6b1f [Davies Liu] handle null in schemaRDD()
Author: Michael Armbrust <michael@databricks.com>
Closes#1800 from marmbrus/warning and squashes the following commits:
8ea9cf1 [Michael Armbrust] [SQL] Fix logging warn -> debug.
Author: Reynold Xin <rxin@apache.org>
Closes#1794 from rxin/sql-conf and squashes the following commits:
3ac11ef [Reynold Xin] getAllConfs return an immutable Map instead of an Array.
4b19d6c [Reynold Xin] Tighten the visibility of various SQLConf methods and renamed setter/getters.
Minor refactoring to allow resolution using either a node's input or output.
Author: Michael Armbrust <michael@databricks.com>
Closes#1795 from marmbrus/ordering and squashes the following commits:
237f580 [Michael Armbrust] style
74d833b [Michael Armbrust] newline
705d963 [Michael Armbrust] Add a rule for resolving ORDER BY expressions that reference attributes not present in the SELECT clause.
82cabda [Michael Armbrust] Generalize attribute resolution.