@yhuai @marmbrus @concretevitamin
Author: Reynold Xin <rxin@apache.org>
Closes#1123 from rxin/explain and squashes the following commits:
def83b0 [Reynold Xin] Update unit tests for explain.
a9d3ba8 [Reynold Xin] [SPARK-2187] Explain should not run the optimizer twice.
...redPartitioning.
Author: Michael Armbrust <michael@databricks.com>
Closes#1122 from marmbrus/fixAddExchange and squashes the following commits:
3417537 [Michael Armbrust] Don't bind partitioning expressions as that breaks comparison with requiredPartitioning.
```
hql("explain select * from src group by key").collect().foreach(println)
[ExplainCommand [plan#27:0]]
[ Aggregate false, [key#25], [key#25,value#26]]
[ Exchange (HashPartitioning [key#25:0], 200)]
[ Exchange (HashPartitioning [key#25:0], 200)]
[ Aggregate true, [key#25], [key#25]]
[ HiveTableScan [key#25,value#26], (MetastoreRelation default, src, None), None]
```
There are two exchange operators.
However, if we do not use explain...
```
hql("select * from src group by key")
res4: org.apache.spark.sql.SchemaRDD =
SchemaRDD[8] at RDD at SchemaRDD.scala:100
== Query Plan ==
Aggregate false, [key#8], [key#8,value#9]
Exchange (HashPartitioning [key#8:0], 200)
Aggregate true, [key#8], [key#8]
HiveTableScan [key#8,value#9], (MetastoreRelation default, src, None), None
```
The plan is fine.
The cause of this bug is explained below.
When we create an `execution.ExplainCommand`, we use the `executedPlan` as the child of this `ExplainCommand`. But, this `executedPlan` is prepared for execution again when we generate the `executedPlan` for the `ExplainCommand`. Basically, `prepareForExecution` is called twice on a physical plan. Because after `prepareForExecution` we have already bounded those references (in `BoundReference`s), `AddExchange` cannot figure out we are using the same partitioning (we use `AttributeReference`s to create an `ExchangeOperator` and then those references will be changed to `BoundReference`s after `prepareForExecution` is called). So, an extra `ExchangeOperator` is inserted.
I think in `CommandStrategy`, we should just use the `sparkPlan` (`sparkPlan` is the input of `prepareForExecution`) to initialize the `ExplainCommand` instead of using `executedPlan`.
The link to JIRA: https://issues.apache.org/jira/browse/SPARK-2176
Author: Yin Huai <huai@cse.ohio-state.edu>
Closes#1116 from yhuai/SPARK-2176 and squashes the following commits:
197c19c [Yin Huai] Use sparkPlan to initialize a Physical Explain Command instead of using executedPlan.
JIRA: https://issues.apache.org/jira/browse/SPARK-2060
Programming guide: http://yhuai.github.io/site/sql-programming-guide.html
Scala doc of SQLContext: http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.sql.SQLContext
Author: Yin Huai <huai@cse.ohio-state.edu>
Closes#999 from yhuai/newJson and squashes the following commits:
227e89e [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
ce8eedd [Yin Huai] rxin's comments.
bc9ac51 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
94ffdaa [Yin Huai] Remove "get" from method names.
ce31c81 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
e2773a6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
79ea9ba [Yin Huai] Fix typos.
5428451 [Yin Huai] Newline
1f908ce [Yin Huai] Remove extra line.
d7a005c [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
7ea750e [Yin Huai] marmbrus's comments.
6a5f5ef [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
83013fb [Yin Huai] Update Java Example.
e7a6c19 [Yin Huai] SchemaRDD.javaToPython should convert a field with the StructType to a Map.
6d20b85 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
4fbddf0 [Yin Huai] Programming guide.
9df8c5a [Yin Huai] Python API.
7027634 [Yin Huai] Java API.
cff84cc [Yin Huai] Use a SchemaRDD for a JSON dataset.
d0bd412 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
ab810b0 [Yin Huai] Make JsonRDD private.
6df0891 [Yin Huai] Apache header.
8347f2e [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
66f9e76 [Yin Huai] Update docs and use the entire dataset to infer the schema.
8ffed79 [Yin Huai] Update the example.
a5a4b52 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
4325475 [Yin Huai] If a sampled dataset is used for schema inferring, update the schema of the JsonTable after first execution.
65b87f0 [Yin Huai] Fix sampling...
8846af5 [Yin Huai] API doc.
52a2275 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
0387523 [Yin Huai] Address PR comments.
666b957 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
a2313a6 [Yin Huai] Address PR comments.
f3ce176 [Yin Huai] After type conflict resolution, if a NullType is found, StringType is used.
0576406 [Yin Huai] Add Apache license header.
af91b23 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
f45583b [Yin Huai] Infer the schema of a JSON dataset (a text file with one JSON object per line or a RDD[String] with one JSON object per string) and returns a SchemaRDD.
f31065f [Yin Huai] A query plan or a SchemaRDD can print out its schema.
Fixed the broken JDBC output. Test from Shark `beeline`:
```
beeline> !connect jdbc:hive2://localhost:10000/
scan complete in 2ms
Connecting to jdbc:hive2://localhost:10000/
Enter username for jdbc:hive2://localhost:10000/: lian
Enter password for jdbc:hive2://localhost:10000/:
Connected to: Hive (version 0.12.0)
Driver: Hive (version 0.12.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000/>
0: jdbc:hive2://localhost:10000/> explain select * from src;
+-------------------------------------------------------------------------------+
| plan |
+-------------------------------------------------------------------------------+
| ExplainCommand [plan#2:0] |
| HiveTableScan [key#0,value#1], (MetastoreRelation default, src, None), None |
+-------------------------------------------------------------------------------+
2 rows selected (1.386 seconds)
```
Before this change, the output looked something like this:
```
+-------------------------------------------------------------------------------+
| plan |
+-------------------------------------------------------------------------------+
| ExplainCommand [plan#2:0]
HiveTableScan [key#0,value#1], (MetastoreRelation default, src, None), None |
+-------------------------------------------------------------------------------+
```
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes#1097 from liancheng/multiLineExplain and squashes the following commits:
eb37967 [Cheng Lian] Made output of "EXPLAIN" play well with JDBC output format
Updated `JavaSQLContext` and `JavaHiveContext` similar to what we've done to `SQLContext` and `HiveContext` in PR #1071. Added corresponding test case for Spark SQL Java API.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes#1085 from liancheng/spark-2094-java and squashes the following commits:
29b8a51 [Cheng Lian] Avoided instantiating JavaSparkContext & JavaHiveContext to workaround test failure
92bb4fb [Cheng Lian] Marked test cases in JavaHiveQLSuite with "ignore"
22aec97 [Cheng Lian] Follow up of PR #1071 for Java API
Added batching with default batch size 10 in SchemaRDD.javaToPython
Author: Kan Zhang <kzhang@apache.org>
Closes#1023 from kanzhang/SPARK-2079 and squashes the following commits:
2d1915e [Kan Zhang] [SPARK-2079] Add batching in SchemaRDD.javaToPython
19b0c09 [Kan Zhang] [SPARK-2079] Removing unnecessary wrapping in SchemaRDD.javaToPython
## Related JIRA issues
- Main issue:
- [SPARK-2094](https://issues.apache.org/jira/browse/SPARK-2094): Ensure exactly once semantics for DDL/Commands
- Issues resolved as dependencies:
- [SPARK-2081](https://issues.apache.org/jira/browse/SPARK-2081): Undefine output() from the abstract class Command and implement it in concrete subclasses
- [SPARK-2128](https://issues.apache.org/jira/browse/SPARK-2128): No plan for DESCRIBE
- [SPARK-1852](https://issues.apache.org/jira/browse/SPARK-1852): SparkSQL Queries with Sorts run before the user asks them to
- Other related issue:
- [SPARK-2129](https://issues.apache.org/jira/browse/SPARK-2129): NPE thrown while lookup a view
Two test cases, `join_view` and `mergejoin_mixed`, within the `HiveCompatibilitySuite` are removed from the whitelist to workaround this issue.
## PR Overview
This PR defines physical plans for DDL statements and commands and wraps their side effects in a lazy field `PhysicalCommand.sideEffectResult`, so that they are executed eagerly and exactly once. Also, as a positive side effect, now DDL statements and commands can be turned into proper `SchemaRDD`s and let user query the execution results.
This PR defines schemas for the following DDL/commands:
- EXPLAIN command
- `plan`: String, the plan explanation
- SET command
- `key`: String, the key(s) of the propert(y/ies) being set or queried
- `value`: String, the value(s) of the propert(y/ies) being queried
- Other Hive native command
- `result`: String, execution result returned by Hive
**NOTE**: We should refine schemas for different native commands by defining physical plans for them in the future.
## Examples
### EXPLAIN command
Take the "EXPLAIN" command as an example, we first execute the command and obtain a `SchemaRDD` at the same time, then query the `plan` field with the schema DSL:
```
scala> loadTestTable("src")
...
scala> val q0 = hql("EXPLAIN SELECT key, COUNT(*) FROM src GROUP BY key")
...
q0: org.apache.spark.sql.SchemaRDD =
SchemaRDD[0] at RDD at SchemaRDD.scala:98
== Query Plan ==
ExplainCommandPhysical [plan#11:0]
Aggregate false, [key#4], [key#4,SUM(PartialCount#6L) AS c_1#2L]
Exchange (HashPartitioning [key#4:0], 200)
Exchange (HashPartitioning [key#4:0], 200)
Aggregate true, [key#4], [key#4,COUNT(1) AS PartialCount#6L]
HiveTableScan [key#4], (MetastoreRelation default, src, None), None
scala> q0.select('plan).collect()
...
[ExplainCommandPhysical [plan#24:0]
Aggregate false, [key#17], [key#17,SUM(PartialCount#19L) AS c_1#2L]
Exchange (HashPartitioning [key#17:0], 200)
Exchange (HashPartitioning [key#17:0], 200)
Aggregate true, [key#17], [key#17,COUNT(1) AS PartialCount#19L]
HiveTableScan [key#17], (MetastoreRelation default, src, None), None]
scala>
```
### SET command
In this example we query all the properties set in `SQLConf`, register the result as a table, and then query the table with HiveQL:
```
scala> val q1 = hql("SET")
...
q1: org.apache.spark.sql.SchemaRDD =
SchemaRDD[7] at RDD at SchemaRDD.scala:98
== Query Plan ==
<SET command: executed by Hive, and noted by SQLContext>
scala> q1.registerAsTable("properties")
scala> hql("SELECT key, value FROM properties ORDER BY key LIMIT 10").foreach(println)
...
== Query Plan ==
TakeOrdered 10, [key#51:0 ASC]
Project [key#51:0,value#52:1]
SetCommandPhysical None, None, [key#55:0,value#56:1]), which has no missing parents
14/06/12 12:19:27 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 5 (SchemaRDD[21] at RDD at SchemaRDD.scala:98
== Query Plan ==
TakeOrdered 10, [key#51:0 ASC]
Project [key#51:0,value#52:1]
SetCommandPhysical None, None, [key#55:0,value#56:1])
...
[datanucleus.autoCreateSchema,true]
[datanucleus.autoStartMechanismMode,checked]
[datanucleus.cache.level2,false]
[datanucleus.cache.level2.type,none]
[datanucleus.connectionPoolingType,BONECP]
[datanucleus.fixedDatastore,false]
[datanucleus.identifierFactory,datanucleus1]
[datanucleus.plugin.pluginRegistryBundleCheck,LOG]
[datanucleus.rdbms.useLegacyNativeValueStrategy,true]
[datanucleus.storeManagerType,rdbms]
scala>
```
### "Exactly once" semantics
At last, an example of the "exactly once" semantics:
```
scala> val q2 = hql("CREATE TABLE t1(key INT, value STRING)")
...
q2: org.apache.spark.sql.SchemaRDD =
SchemaRDD[28] at RDD at SchemaRDD.scala:98
== Query Plan ==
<Native command: executed by Hive>
scala> table("t1")
...
res9: org.apache.spark.sql.SchemaRDD =
SchemaRDD[32] at RDD at SchemaRDD.scala:98
== Query Plan ==
HiveTableScan [key#58,value#59], (MetastoreRelation default, t1, None), None
scala> q2.collect()
...
res10: Array[org.apache.spark.sql.Row] = Array([])
scala>
```
As we can see, the "CREATE TABLE" command is executed eagerly right after the `SchemaRDD` is created, and referencing the `SchemaRDD` again won't trigger a duplicated execution.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes#1071 from liancheng/exactlyOnceCommand and squashes the following commits:
d005b03 [Cheng Lian] Made "SET key=value" returns the newly set key value pair
f6c7715 [Cheng Lian] Added test cases for DDL/command statement RDDs
1d00937 [Cheng Lian] Makes SchemaRDD DSLs work for DDL/command statement RDDs
5c7e680 [Cheng Lian] Bug fix: wrong type used in pattern matching
48aa2e5 [Cheng Lian] Refined SQLContext.emptyResult as an empty RDD[Row]
cc64f32 [Cheng Lian] Renamed physical plan classes for DDL/commands
74789c1 [Cheng Lian] Fixed failing test cases
0ad343a [Cheng Lian] Added physical plan for DDL and commands to ensure the "exactly once" semantics
Author: Michael Armbrust <michael@databricks.com>
Closes#1072 from marmbrus/cachedStars and squashes the following commits:
8757c8e [Michael Armbrust] Use planner for in-memory scans.
This has been messing up the SQL PySpark tests on Jenkins.
Author: Patrick Wendell <pwendell@gmail.com>
Closes#1054 from pwendell/pyspark and squashes the following commits:
1eb5487 [Patrick Wendell] False change
06f062d [Patrick Wendell] HOTFIX: PySpark tests should be order insensitive
Some improvement for PR #837, add another case to white list and use `filter` to build result iterator.
Author: Daoyuan <daoyuan.wang@intel.com>
Closes#1049 from adrian-wang/clean-LeftSemiJoinHash and squashes the following commits:
b314d5a [Daoyuan] change hashSet name
27579a9 [Daoyuan] add semijoin to white list and use filter to create new iterator in LeftSemiJoinBNL
Signed-off-by: Michael Armbrust <michael@databricks.com>
This PR implements `take()` on a `SchemaRDD` by inserting a logical limit that is followed by a `collect()`. This is also accompanied by adding a catalyst optimizer rule for collapsing adjacent limits. Doing so prevents an unnecessary shuffle that is sometimes triggered by `take()`.
Author: Sameer Agarwal <sameer@databricks.com>
Closes#1048 from sameeragarwal/master and squashes the following commits:
3eeb848 [Sameer Agarwal] Fixing Tests
1b76ff1 [Sameer Agarwal] Deprecating limit(limitExpr: Expression) in v1.1.0
b723ac4 [Sameer Agarwal] Added limit folding tests
a0ff7c4 [Sameer Agarwal] Adding catalyst rule to fold two consecutive limits
8d42d03 [Sameer Agarwal] Implement trigger() as limit() followed by collect()
JIRA issue: [SPARK-1968](https://issues.apache.org/jira/browse/SPARK-1968)
This PR added support for SQL/HiveQL command for caching/uncaching tables:
```
scala> sql("CACHE TABLE src")
...
res0: org.apache.spark.sql.SchemaRDD =
SchemaRDD[0] at RDD at SchemaRDD.scala:98
== Query Plan ==
CacheCommandPhysical src, true
scala> table("src")
...
res1: org.apache.spark.sql.SchemaRDD =
SchemaRDD[3] at RDD at SchemaRDD.scala:98
== Query Plan ==
InMemoryColumnarTableScan [key#0,value#1], (HiveTableScan [key#0,value#1], (MetastoreRelation default, src, None), None), false
scala> isCached("src")
res2: Boolean = true
scala> sql("CACHE TABLE src")
...
res3: org.apache.spark.sql.SchemaRDD =
SchemaRDD[4] at RDD at SchemaRDD.scala:98
== Query Plan ==
CacheCommandPhysical src, false
scala> table("src")
...
res4: org.apache.spark.sql.SchemaRDD =
SchemaRDD[11] at RDD at SchemaRDD.scala:98
== Query Plan ==
HiveTableScan [key#2,value#3], (MetastoreRelation default, src, None), None
scala> isCached("src")
res5: Boolean = false
```
Things also work for `hql`.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes#1038 from liancheng/sqlCacheTable and squashes the following commits:
ecb7194 [Cheng Lian] Trimmed the SQL string before parsing special commands
6f4ce42 [Cheng Lian] Moved logical command classes to a separate file
3458a24 [Cheng Lian] Added comment for public API
f0ffacc [Cheng Lian] Added isCached() predicate
15ec6d2 [Cheng Lian] Added "(UN)CACHE TABLE" SQL/HiveQL statements
Thanks goes to @liancheng, who pointed out that `sql/test-only *.SQLConfSuite *.SQLQuerySuite` passed but `sql/test-only *.SQLQuerySuite *.SQLConfSuite` failed. The reason is that some tests use the same test keys and without clear()'ing, they get carried over to other tests. This hotfix simply adds some `clear()` calls.
This problem was not evident on Jenkins before probably because `parallelExecution` is not set to `false` for `sqlCoreSettings`.
Author: Zongheng Yang <zongheng.y@gmail.com>
Closes#1040 from concretevitamin/sqlconf-tests and squashes the following commits:
6d14ceb [Zongheng Yang] HOTFIX: clear() confs in SQLConf related unit tests.
By @egraldlo.
Author: egraldlo <egraldlo@gmail.com>
Author: Michael Armbrust <michael@databricks.com>
Closes#1033 from marmbrus/pr/978 and squashes the following commits:
e228c5e [Michael Armbrust] Remove "test".
762aeaf [Michael Armbrust] Remove unneeded rule. More descriptive name for test table.
d414cd7 [egraldlo] fommatting issues
1153f75 [egraldlo] do best to avoid overflowing in function avg().
This PR (1) introduces a new class SQLConf that stores key-value properties for a SQLContext (2) clean up the semantics of various forms of SET commands.
The SQLConf class unlocks user-controllable optimization opportunities; for example, user can now override the number of partitions used during an Exchange. A SQLConf can be accessed and modified programmatically through its getters and setters. It can also be modified through SET commands executed by `sql()` or `hql()`. Note that users now have the ability to change a particular property for different queries inside the same Spark job, unlike settings configured in SparkConf.
For SET commands: "SET" will return all properties currently set in a SQLConf, "SET key" will return the key-value pair (if set) or an undefined message, and "SET key=value" will call the setter on SQLConf, and if a HiveContext is used, it will be executed in Hive as well.
Author: Zongheng Yang <zongheng.y@gmail.com>
Closes#956 from concretevitamin/sqlconf and squashes the following commits:
4968c11 [Zongheng Yang] Very minor cleanup.
d74dde5 [Zongheng Yang] Remove the redundant mkQueryExecution() method.
c129b86 [Zongheng Yang] Merge remote-tracking branch 'upstream/master' into sqlconf
26c40eb [Zongheng Yang] Make SQLConf a trait and have SQLContext mix it in.
dd19666 [Zongheng Yang] Update a comment.
baa5d29 [Zongheng Yang] Remove default param for shuffle partitions accessor.
5f7e6d8 [Zongheng Yang] Add default num partitions.
22d9ed7 [Zongheng Yang] Fix output() of Set physical. Add SQLConf param accessor method.
e9856c4 [Zongheng Yang] Use java.util.Collections.synchronizedMap on a Java HashMap.
88dd0c8 [Zongheng Yang] Remove redundant SET Keyword.
271f0b1 [Zongheng Yang] Minor change.
f8983d1 [Zongheng Yang] Minor changes per review comments.
1ce8a5e [Zongheng Yang] Invoke runSqlHive() in SQLConf#get for the HiveContext case.
b766af9 [Zongheng Yang] Remove a test.
d52e1bd [Zongheng Yang] De-hardcode number of shuffle partitions for BasicOperators (read from SQLConf).
555599c [Zongheng Yang] Bullet-proof (relatively) parsing SET per review comment.
c2067e8 [Zongheng Yang] Mark SQLContext transient and put it in a second param list.
2ea8cdc [Zongheng Yang] Wrap long line.
41d7f09 [Zongheng Yang] Fix imports.
13279e6 [Zongheng Yang] Refactor the logic of eagerly processing SET commands.
b14b83e [Zongheng Yang] In a HiveContext, make SQLConf a subset of HiveConf.
6983180 [Zongheng Yang] Move a SET test to SQLQuerySuite and make it complete.
5b67985 [Zongheng Yang] New line at EOF.
c651797 [Zongheng Yang] Add commands.scala.
efd82db [Zongheng Yang] Clean up semantics of several cases of SET.
c1017c2 [Zongheng Yang] WIP in changing SetCommand to take two Options (for different semantics of SETs).
0f00d86 [Zongheng Yang] Add a test for singleton set command in SQL.
41acd75 [Zongheng Yang] Add a test for hql() in HiveQuerySuite.
2276929 [Zongheng Yang] Fix default hive result for set commands in HiveComparisonTest.
3b0c71b [Zongheng Yang] Remove Parser for set commands. A few other fixes.
d0c4578 [Zongheng Yang] Tmux typo.
0ecea46 [Zongheng Yang] Changes for HiveQl and HiveContext.
ce22d80 [Zongheng Yang] Fix parsing issues.
cb722c1 [Zongheng Yang] Finish up SQLConf patch.
4ebf362 [Zongheng Yang] First cut at SQLConf inside SQLContext.
This PR attempts to resolve [SPARK-1704](https://issues.apache.org/jira/browse/SPARK-1704) by introducing a physical plan for EXPLAIN commands, which just prints out the debug string (containing various SparkSQL's plans) of the corresponding QueryExecution for the actual query.
Author: Zongheng Yang <zongheng.y@gmail.com>
Closes#1003 from concretevitamin/explain-cmd and squashes the following commits:
5b7911f [Zongheng Yang] Add a regression test.
1bfa379 [Zongheng Yang] Modify output().
719ada9 [Zongheng Yang] Override otherCopyArgs for ExplainCommandPhysical.
4318fd7 [Zongheng Yang] Make all output one Row.
439c6ab [Zongheng Yang] Minor cleanups.
408f574 [Zongheng Yang] SPARK-1704: Add CommandStrategy and ExplainCommandPhysical.
Just submit another solution for #395
Author: Daoyuan <daoyuan.wang@intel.com>
Author: Michael Armbrust <michael@databricks.com>
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#837 from adrian-wang/left-semi-join-support and squashes the following commits:
d39cd12 [Daoyuan Wang] Merge pull request #1 from marmbrus/pr/837
6713c09 [Michael Armbrust] Better debugging for failed query tests.
035b73e [Michael Armbrust] Add test for left semi that can't be done with a hash join.
5ec6fa4 [Michael Armbrust] Add left semi to SQL Parser.
4c726e5 [Daoyuan] improvement according to Michael
8d4a121 [Daoyuan] add golden files for leftsemijoin
83a3c8a [Daoyuan] scala style fix
14cff80 [Daoyuan] add support for left semi join
Basically there is a race condition (possibly a scala bug?) when these values are recomputed on all of the slaves that results in an incorrect projection being generated (possibly because the GUID uniqueness contract is broken?).
In general we should probably enforce that all expression planing occurs on the driver, as is now occurring here.
Author: Michael Armbrust <michael@databricks.com>
Closes#1004 from marmbrus/fixAggBug and squashes the following commits:
e0c116c [Michael Armbrust] Compute aggregate expression during planning instead of lazily on workers.
Author: Michael Armbrust <michael@databricks.com>
Closes#985 from marmbrus/tableName and squashes the following commits:
3caaa27 [Michael Armbrust] Correctly analyze queries where columnName == tableName.
`CaseConversionExpression` should check if the evaluated value is `null`.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#982 from ueshin/issues/SPARK-2036 and squashes the following commits:
61e1c54 [Takuya UESHIN] Add check if the evaluated value is null.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#974 from ueshin/issues/SPARK-2029 and squashes the following commits:
e19e8f4 [Takuya UESHIN] Bump version number to 1.1.0-SNAPSHOT.
I don't know whether it's time to implement system function about string operation in spark sql now.
Author: egraldlo <egraldlo@gmail.com>
Closes#936 from egraldlo/stringoperator and squashes the following commits:
3c6c60a [egraldlo] Add UPPER, LOWER, MAX and MIN into hive parser
ea76d0a [egraldlo] modify the formatting issues
b49f25e [egraldlo] modify the formatting issues
1f0bbb5 [egraldlo] system function upper and lower supported
13d3267 [egraldlo] system function upper and lower supported
In cases like `Limit` and `TakeOrdered`, `executeCollect()` makes optimizations that `execute().collect()` will not.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes#939 from liancheng/spark-1958 and squashes the following commits:
bdc4a14 [Cheng Lian] Copy rows to present immutable data to users
8250976 [Cheng Lian] Added return type explicitly for public API
192a25c [Cheng Lian] [SPARK-1958] Calling .collect() on a SchemaRDD should call executeCollect() on the underlying query plan.
JIRA issue: [SPARK-1368](https://issues.apache.org/jira/browse/SPARK-1368)
This PR introduces two major updates:
- Replaced FP style code with `while` loop and reusable `GenericMutableRow` object in critical path of `HiveTableScan`.
- Using `ColumnProjectionUtils` to help optimizing RCFile and ORC column pruning.
My quick micro benchmark suggests these two optimizations made the optimized version 2x and 2.5x faster when scanning CSV table and RCFile table respectively:
```
Original:
[info] CSV: 27676 ms, RCFile: 26415 ms
[info] CSV: 27703 ms, RCFile: 26029 ms
[info] CSV: 27511 ms, RCFile: 25962 ms
Optimized:
[info] CSV: 13820 ms, RCFile: 10402 ms
[info] CSV: 14158 ms, RCFile: 10691 ms
[info] CSV: 13606 ms, RCFile: 10346 ms
```
The micro benchmark loads a 609MB CVS file (structurally similar to the `src` test table) into a normal Hive table with `LazySimpleSerDe` and a RCFile table, then scans these tables respectively.
Preparation code:
```scala
package org.apache.spark.examples.sql.hive
import org.apache.spark.sql.hive.LocalHiveContext
import org.apache.spark.{SparkConf, SparkContext}
object HiveTableScanPrepare extends App {
val sparkContext = new SparkContext(
new SparkConf()
.setMaster("local")
.setAppName(getClass.getSimpleName.stripSuffix("$")))
val hiveContext = new LocalHiveContext(sparkContext)
import hiveContext._
hql("drop table scan_csv")
hql("drop table scan_rcfile")
hql("""create table scan_csv (key int, value string)
| row format serde 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
| with serdeproperties ('field.delim'=',')
""".stripMargin)
hql(s"""load data local inpath "${args(0)}" into table scan_csv""")
hql("""create table scan_rcfile (key int, value string)
| row format serde 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
|stored as
| inputformat 'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
| outputformat 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'
""".stripMargin)
hql(
"""
|from scan_csv
|insert overwrite table scan_rcfile
|select scan_csv.key, scan_csv.value
""".stripMargin)
}
```
Benchmark code:
```scala
package org.apache.spark.examples.sql.hive
import org.apache.spark.sql.hive.LocalHiveContext
import org.apache.spark.{SparkConf, SparkContext}
object HiveTableScanBenchmark extends App {
val sparkContext = new SparkContext(
new SparkConf()
.setMaster("local")
.setAppName(getClass.getSimpleName.stripSuffix("$")))
val hiveContext = new LocalHiveContext(sparkContext)
import hiveContext._
val scanCsv = hql("select key from scan_csv")
val scanRcfile = hql("select key from scan_rcfile")
val csvDuration = benchmark(scanCsv.count())
val rcfileDuration = benchmark(scanRcfile.count())
println(s"CSV: $csvDuration ms, RCFile: $rcfileDuration ms")
def benchmark(f: => Unit) = {
val begin = System.currentTimeMillis()
f
val end = System.currentTimeMillis()
end - begin
}
}
```
@marmbrus Please help review, thanks!
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes#758 from liancheng/fastHiveTableScan and squashes the following commits:
4241a19 [Cheng Lian] Distinguishes sorted and possibly not sorted operations more accurately in HiveComparisonTest
cf640d8 [Cheng Lian] More HiveTableScan optimisations:
bf0e7dc [Cheng Lian] Added SortedOperation pattern to match *some* definitely sorted operations and avoid some sorting cost in HiveComparisonTest.
6d1c642 [Cheng Lian] Using ColumnProjectionUtils to optimise RCFile and ORC column pruning
eb62fd3 [Cheng Lian] [SPARK-1368] Optimized HiveTableScan
Average values are difference between the calculation is done partially or not partially.
Because `AverageFunction` (in not-partially calculation) counts even if the evaluated value is null.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#862 from ueshin/issues/SPARK-1915 and squashes the following commits:
b1ff3c0 [Takuya UESHIN] Modify AverageFunction not to count if the evaluated value is null.
`CountFunction` should count up only if the child's evaluated value is not null.
Because it traverses to evaluate all child expressions, even if the child is null, it counts up if one of the all children is not null.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#861 from ueshin/issues/SPARK-1914 and squashes the following commits:
3b37315 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-1914
2afa238 [Takuya UESHIN] Simplify CountFunction not to traverse to evaluate all child expressions.
```scala
rdd.aggregate(Sum('val))
```
is just shorthand for
```scala
rdd.groupBy()(Sum('val))
```
but seems be more natural than doing a groupBy with no grouping expressions when you really just want an aggregation over all rows.
Did not add a JavaSchemaRDD or Python API, as these seem to be lacking several other methods like groupBy() already -- leaving that cleanup for future patches.
Author: Aaron Davidson <aaron@databricks.com>
Closes#874 from aarondav/schemardd and squashes the following commits:
e9e68ee [Aaron Davidson] Add comment
db6afe2 [Aaron Davidson] Introduce SchemaRDD#aggregate() for simple aggregations
Minor cleanup following #841.
Author: Reynold Xin <rxin@apache.org>
Closes#868 from rxin/schema-count and squashes the following commits:
5442651 [Reynold Xin] SPARK-1822: Some minor cleanup work on SchemaRDD.count()
Author: Kan Zhang <kzhang@apache.org>
Closes#841 from kanzhang/SPARK-1822 and squashes the following commits:
2f8072a [Kan Zhang] [SPARK-1822] Minor style update
cf4baa4 [Kan Zhang] [SPARK-1822] Adding Scaladoc
e67c910 [Kan Zhang] [SPARK-1822] SchemaRDD.count() should use optimizer
JIRA issue: [SPARK-1913](https://issues.apache.org/jira/browse/SPARK-1913)
When scanning Parquet tables, attributes referenced only in predicates that are pushed down are not passed to the `ParquetTableScan` operator and causes exception.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes#863 from liancheng/spark-1913 and squashes the following commits:
f976b73 [Cheng Lian] Addessed the readability issue commented by @rxin
f5b257d [Cheng Lian] Added back comments deleted by mistake
ae60ab3 [Cheng Lian] [SPARK-1913] Attributes referenced only in predicates pushed down should remain in ParquetTableScan operator
...ys.
When tables are equi-joined by multiple-keys `HashJoin` should be used, but `CartesianProduct` and then `Filter` are used.
The join keys are paired by `And` expression so we need to apply `splitConjunctivePredicates` to join condition while finding join keys.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#836 from ueshin/issues/SPARK-1889 and squashes the following commits:
fe1c387 [Takuya UESHIN] Apply splitConjunctivePredicates to join condition while finding join keys.
Simple filter predicates such as LessThan, GreaterThan, etc., where one side is a literal and the other one a NamedExpression are now pushed down to the underlying ParquetTableScan. Here are some results for a microbenchmark with a simple schema of six fields of different types where most records failed the test:
| Uncompressed | Compressed
-------------| ------------- | -------------
File size | 10 GB | 2 GB
Speedup | 2 | 1.8
Since mileage may vary I added a new option to SparkConf:
`org.apache.spark.sql.parquet.filter.pushdown`
Default value would be `true` and setting it to `false` disables the pushdown. When most rows are expected to pass the filter or when there are few fields performance can be better when pushdown is disabled. The default should fit situations with a reasonable number of (possibly nested) fields where not too many records on average pass the filter.
Because of an issue with Parquet ([see here](https://github.com/Parquet/parquet-mr/issues/371])) currently only predicates on non-nullable attributes are pushed down. If one would know that for a given table no optional fields have missing values one could also allow overriding this.
Author: Andre Schumacher <andre.schumacher@iki.fi>
Closes#511 from AndreSchumacher/parquet_filter and squashes the following commits:
16bfe83 [Andre Schumacher] Removing leftovers from merge during rebase
7b304ca [Andre Schumacher] Fixing formatting
c36d5cb [Andre Schumacher] Scalastyle
3da98db [Andre Schumacher] Second round of review feedback
7a78265 [Andre Schumacher] Fixing broken formatting in ParquetFilter
a86553b [Andre Schumacher] First round of code review feedback
b0f7806 [Andre Schumacher] Optimizing imports in ParquetTestData
85fea2d [Andre Schumacher] Adding SparkConf setting to disable filter predicate pushdown
f0ad3cf [Andre Schumacher] Undoing changes not needed for this PR
210e9cb [Andre Schumacher] Adding disjunctive filter predicates
a93a588 [Andre Schumacher] Adding unit test for filtering
6d22666 [Andre Schumacher] Extending ParquetFilters
93e8192 [Andre Schumacher] First commit Parquet record filtering
...Scala collections.
When I execute `orderBy` or `limit` for `SchemaRDD` including `ArrayType` or `MapType`, `SparkSqlSerializer` throws the following exception:
```
com.esotericsoftware.kryo.KryoException: Class cannot be created (missing no-arg constructor): scala.collection.immutable.$colon$colon
```
or
```
com.esotericsoftware.kryo.KryoException: Class cannot be created (missing no-arg constructor): scala.collection.immutable.Vector
```
or
```
com.esotericsoftware.kryo.KryoException: Class cannot be created (missing no-arg constructor): scala.collection.immutable.HashMap$HashTrieMap
```
and so on.
This is because registrations of serializers for each concrete collections are missing in `SparkSqlSerializer`.
I believe it should use `AllScalaRegistrar`.
`AllScalaRegistrar` covers a lot of serializers for concrete classes of `Seq`, `Map` for `ArrayType`, `MapType`.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#790 from ueshin/issues/SPARK-1845 and squashes the following commits:
d1ed992 [Takuya UESHIN] Use AllScalaRegistrar for SparkSqlSerializer to register serializers of Scala collections.
Add the implementation for ApproximateCountDistinct to SparkSql. We use the HyperLogLog algorithm implemented in stream-lib, and do the count in two phases: 1) counting the number of distinct elements in each partitions, and 2) merge the HyperLogLog results from different partitions.
A simple serializer and test cases are added as well.
Author: larvaboy <larvaboy@gmail.com>
Closes#737 from larvaboy/master and squashes the following commits:
bd8ef3f [larvaboy] Add support of user-provided standard deviation to ApproxCountDistinct.
9ba8360 [larvaboy] Fix alignment and null handling issues.
95b4067 [larvaboy] Add a test case for count distinct and approximate count distinct.
f57917d [larvaboy] Add the parser for the approximate count.
a2d5d10 [larvaboy] Add ApproximateCountDistinct aggregates and functions.
7ad273a [larvaboy] Add SparkSql serializer for HyperLogLog.
1d9aacf [larvaboy] Fix a minor typo in the toString method of the Count case class.
653542b [larvaboy] Fix a couple of minor typos.
Author: Michael Armbrust <michael@databricks.com>
Closes#761 from marmbrus/existingContext and squashes the following commits:
4651051 [Michael Armbrust] Make it possible to create Java/Python SQLContexts from an existing Scala SQLContext.
https://issues.apache.org/jira/browse/SPARK-1757
The first test succeeds, but the second test fails with exception:
```
[info] - save and load case class RDD with Nones as parquet *** FAILED *** (14 milliseconds)
[info] java.lang.RuntimeException: Unsupported datatype StructType(List())
[info] at scala.sys.package$.error(package.scala:27)
[info] at org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType(ParquetRelation.scala:201)
[info] at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$1.apply(ParquetRelation.scala:235)
[info] at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$1.apply(ParquetRelation.scala:235)
[info] at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[info] at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[info] at scala.collection.immutable.List.foreach(List.scala:318)
[info] at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
[info] at scala.collection.AbstractTraversable.map(Traversable.scala:105)
[info] at org.apache.spark.sql.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetRelation.scala:234)
[info] at org.apache.spark.sql.parquet.ParquetTypesConverter$.writeMetaData(ParquetRelation.scala:267)
[info] at org.apache.spark.sql.parquet.ParquetRelation$.createEmpty(ParquetRelation.scala:143)
[info] at org.apache.spark.sql.parquet.ParquetRelation$.create(ParquetRelation.scala:122)
[info] at org.apache.spark.sql.execution.SparkStrategies$ParquetOperations$.apply(SparkStrategies.scala:139)
[info] at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
[info] at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
[info] at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
[info] at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
[info] at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:264)
[info] at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:264)
[info] at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:265)
[info] at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:265)
[info] at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:268)
[info] at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:268)
[info] at org.apache.spark.sql.SchemaRDDLike$class.saveAsParquetFile(SchemaRDDLike.scala:66)
[info] at org.apache.spark.sql.SchemaRDD.saveAsParquetFile(SchemaRDD.scala:98)
```
Author: Andrew Ash <andrew@andrewash.com>
Author: Michael Armbrust <michael@databricks.com>
Closes#690 from ash211/rdd-parquet-save and squashes the following commits:
747a0b9 [Andrew Ash] Merge pull request #1 from marmbrus/pr/690
54bd00e [Michael Armbrust] Need to put Option first since Option <: Seq.
8f3f281 [Andrew Ash] SPARK-1757 Add failing test for saving SparkSQL Schemas with Option[?] fields as parquet
Three issues related to temp files that tests generate – these should be touched up for hygiene but are not urgent.
Modules have a log4j.properties which directs the unit-test.log output file to a directory like `[module]/target/unit-test.log`. But this ends up creating `[module]/[module]/target/unit-test.log` instead of former.
The `work/` directory is not deleted by "mvn clean", in the parent and in modules. Neither is the `checkpoint/` directory created under the various external modules.
Many tests create a temp directory, which is not usually deleted. This can be largely resolved by calling `deleteOnExit()` at creation and trying to call `Utils.deleteRecursively` consistently to clean up, sometimes in an `@After` method.
_If anyone seconds the motion, I can create a more significant change that introduces a new test trait along the lines of `LocalSparkContext`, which provides management of temp directories for subclasses to take advantage of._
Author: Sean Owen <sowen@cloudera.com>
Closes#732 from srowen/SPARK-1798 and squashes the following commits:
5af578e [Sean Owen] Try to consistently delete test temp dirs and files, and set deleteOnExit() for each
b21b356 [Sean Owen] Remove work/ and checkpoint/ dirs with mvn clean
bdd0f41 [Sean Owen] Remove duplicate module dir in log4j.properties output path for tests
This pull request contains a rebased patch from @heathermiller (https://github.com/heathermiller/spark/pull/1) to add ClassTags on Serializer and types that depend on it (Broadcast and AccumulableCollection). Putting these in the public API signatures now will allow us to use Scala Pickling for serialization down the line without breaking binary compatibility.
One question remaining is whether we also want them on Accumulator -- Accumulator is passed as part of a bigger Task or TaskResult object via the closure serializer so it doesn't seem super useful to add the ClassTag there. Broadcast and AccumulableCollection in contrast were being serialized directly.
CC @rxin, @pwendell, @heathermiller
Author: Matei Zaharia <matei@databricks.com>
Closes#700 from mateiz/spark-1708 and squashes the following commits:
1a3d8b0 [Matei Zaharia] Use fake ClassTag in Java
3b449ed [Matei Zaharia] test fix
2209a27 [Matei Zaharia] Code style fixes
9d48830 [Matei Zaharia] Add a ClassTag on Serializer and things that depend on it
Add `limit` transformation to `SchemaRDD`.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#711 from ueshin/issues/SPARK-1778 and squashes the following commits:
33169df [Takuya UESHIN] Add 'limit' transformation to SchemaRDD.
* Add native min/max (was using hive before).
* Handle nulls correctly in Avg and Sum.
Author: Michael Armbrust <michael@databricks.com>
Closes#683 from marmbrus/aggFixes and squashes the following commits:
64fe30b [Michael Armbrust] Improve SparkSQL Aggregates * Add native min/max (was using hive before). * Handle nulls correctly in Avg and Sum.
... that do not change schema
Author: Kan Zhang <kzhang@apache.org>
Closes#448 from kanzhang/SPARK-1460 and squashes the following commits:
111e388 [Kan Zhang] silence MiMa errors in EdgeRDD and VertexRDD
91dc787 [Kan Zhang] Taking into account newly added Ordering param
79ed52a [Kan Zhang] [SPARK-1460] Returning SchemaRDD on Set operations that do not change schema
I also removed a println that I bumped into.
Author: Michael Armbrust <michael@databricks.com>
Closes#658 from marmbrus/nullPrimitives and squashes the following commits:
a3ec4f3 [Michael Armbrust] Remove println.
695606b [Michael Armbrust] Support for null primatives from using scala and java reflection.
In-memory compression is now configurable in `SparkConf` by the `spark.sql.inMemoryCompression.enabled` property, and is disabled by default.
To help code review, the bug fix is in [the first commit](d537a367ed), compression configuration is in [the second one](4ce09aa8aa).
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes#608 from liancheng/spark-1678 and squashes the following commits:
66c3a8d [Cheng Lian] Renamed in-memory compression configuration key
f8fb3a0 [Cheng Lian] Added assertion for testing .hasNext of various decoder
4ce09aa [Cheng Lian] Made in-memory compression configurable via SparkConf
d537a36 [Cheng Lian] Fixed SPARK-1678
Modify spelling errors
Author: ArcherShao <ArcherShao@users.noreply.github.com>
Closes#619 from ArcherShao/patch-1 and squashes the following commits:
2957195 [ArcherShao] Update SchemaRDD.scala