Commit graph

2117 commits

Author SHA1 Message Date
Sun Rui 896edb51ab [SPARK-10050] [SPARKR] Support collecting data of MapType in DataFrame.
1. Support collecting data of MapType from DataFrame.
2. Support data of MapType in createDataFrame.

Author: Sun Rui <rui.sun@intel.com>

Closes #8711 from sun-rui/SPARK-10050.
2015-09-16 13:20:39 -07:00
sureshthalamati 64c29afcb7 [SPARK-9078] [SQL] Allow jdbc dialects to override the query used to check the table.
Current implementation uses query with a LIMIT clause to find if table already exists. This syntax works only in some database systems. This patch changes the default query to the one that is likely to work on most databases, and adds a new method to the  JdbcDialect abstract class to allow  dialects to override the default query.

I looked at using the JDBC meta data calls, it turns out there is no common way to find the current schema, catalog..etc.  There is a new method Connection.getSchema() , but that is available only starting jdk1.7 , and existing jdbc drivers may not have implemented it.  Other option was to use jdbc escape syntax clause for LIMIT, not sure on how well this supported in all the databases also. After looking at all the jdbc metadata options my conclusion was most common way is to use the simple select query with 'where 1 =0' , and allow dialects to customize as needed

Author: sureshthalamati <suresh.thalamati@gmail.com>

Closes #8676 from sureshthalamati/table_exists_spark-9078.
2015-09-15 19:41:38 -07:00
Andrew Or 35a19f3357 [SPARK-10613] [SPARK-10624] [SQL] Reduce LocalNode tests dependency on SQLContext
Instead of relying on `DataFrames` to verify our answers, we can just use simple arrays. This significantly simplifies the test logic for `LocalNode`s and reduces a lot of code duplicated from `SparkPlanTest`.

This also fixes an additional issue [SPARK-10624](https://issues.apache.org/jira/browse/SPARK-10624) where the output of `TakeOrderedAndProjectNode` is not actually ordered.

Author: Andrew Or <andrew@databricks.com>

Closes #8764 from andrewor14/sql-local-tests-cleanup.
2015-09-15 17:24:32 -07:00
Josh Rosen 38700ea40c [SPARK-10381] Fix mixup of taskAttemptNumber & attemptId in OutputCommitCoordinator
When speculative execution is enabled, consider a scenario where the authorized committer of a particular output partition fails during the OutputCommitter.commitTask() call. In this case, the OutputCommitCoordinator is supposed to release that committer's exclusive lock on committing once that task fails. However, due to a unit mismatch (we used task attempt number in one place and task attempt id in another) the lock will not be released, causing Spark to go into an infinite retry loop.

This bug was masked by the fact that the OutputCommitCoordinator does not have enough end-to-end tests (the current tests use many mocks). Other factors contributing to this bug are the fact that we have many similarly-named identifiers that have different semantics but the same data types (e.g. attemptNumber and taskAttemptId, with inconsistent variable naming which makes them difficult to distinguish).

This patch adds a regression test and fixes this bug by always using task attempt numbers throughout this code.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #8544 from JoshRosen/SPARK-10381.
2015-09-15 17:11:21 -07:00
Reynold Xin a63cdc769f [SPARK-10612] [SQL] Add prepare to LocalNode.
The idea is that we should separate the function call that does memory reservation (i.e. prepare) from the function call that consumes the input (e.g. open()), so all operators can be a chance to reserve memory before they are all consumed.

Author: Reynold Xin <rxin@databricks.com>

Closes #8761 from rxin/SPARK-10612.
2015-09-15 16:53:27 -07:00
Andrew Or b6e998634e [SPARK-10548] [SPARK-10563] [SQL] Fix concurrent SQL executions
*Note: this is for master branch only.* The fix for branch-1.5 is at #8721.

The query execution ID is currently passed from a thread to its children, which is not the intended behavior. This led to `IllegalArgumentException: spark.sql.execution.id is already set` when running queries in parallel, e.g.:
```
(1 to 100).par.foreach { _ =>
  sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count()
}
```
The cause is `SparkContext`'s local properties are inherited by default. This patch adds a way to exclude keys we don't want to be inherited, and makes SQL go through that code path.

Author: Andrew Or <andrew@databricks.com>

Closes #8710 from andrewor14/concurrent-sql-executions.
2015-09-15 16:45:47 -07:00
Wenchen Fan 31a229aa73 [SPARK-10475] [SQL] improve column prunning for Project on Sort
Sometimes we can't push down the whole `Project` though `Sort`, but we still have a chance to push down part of it.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #8644 from cloud-fan/column-prune.
2015-09-15 13:36:52 -07:00
Liang-Chi Hsieh 841972e22c [SPARK-10437] [SQL] Support aggregation expressions in Order By
JIRA: https://issues.apache.org/jira/browse/SPARK-10437

If an expression in `SortOrder` is a resolved one, such as `count(1)`, the corresponding rule in `Analyzer` to make it work in order by will not be applied.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #8599 from viirya/orderby-agg.
2015-09-15 13:33:32 -07:00
Marcelo Vanzin b42059d2ef Revert "[SPARK-10300] [BUILD] [TESTS] Add support for test tags in run-tests.py."
This reverts commit 8abef21dac.
2015-09-15 13:03:38 -07:00
Marcelo Vanzin 8abef21dac [SPARK-10300] [BUILD] [TESTS] Add support for test tags in run-tests.py.
This change does two things:

- tag a few tests and adds the mechanism in the build to be able to disable those tags,
  both in maven and sbt, for both junit and scalatest suites.
- add some logic to run-tests.py to disable some tags depending on what files have
  changed; that's used to disable expensive tests when a module hasn't explicitly
  been changed, to speed up testing for changes that don't directly affect those
  modules.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #8437 from vanzin/test-tags.
2015-09-15 10:45:02 -07:00
Reynold Xin 09b7e7c198 Update version to 1.6.0-SNAPSHOT.
Author: Reynold Xin <rxin@databricks.com>

Closes #8350 from rxin/1.6.
2015-09-15 00:54:20 -07:00
Sean Owen 4e2242bb41 [SPARK-10576] [BUILD] Move .java files out of src/main/scala
Move .java files in `src/main/scala` to `src/main/java` root, except for `package-info.java` (to stay next to package.scala)

Author: Sean Owen <sowen@cloudera.com>

Closes #8736 from srowen/SPARK-10576.
2015-09-14 15:03:51 -07:00
zsxwing 217e496444 [SPARK-9996] [SPARK-9997] [SQL] Add local expand and NestedLoopJoin operators
This PR is in conflict with #8535 and #8573. Will update this one when they are merged.

Author: zsxwing <zsxwing@gmail.com>

Closes #8642 from zsxwing/expand-nest-join.
2015-09-14 15:00:27 -07:00
Edoardo Vacchi 64f04154e3 [SPARK-6981] [SQL] Factor out SparkPlanner and QueryExecution from SQLContext
Alternative to PR #6122; in this case the refactored out classes are replaced by inner classes with the same name for backwards binary compatibility

   * process in a lighter-weight, backwards-compatible way

Author: Edoardo Vacchi <uncommonnonsense@gmail.com>

Closes #6356 from evacchi/sqlctx-refactoring-lite.
2015-09-14 14:56:04 -07:00
Davies Liu 7e32387ae6 [SPARK-10522] [SQL] Nanoseconds of Timestamp in Parquet should be positive
Or Hive can't read it back correctly.

Thanks vanzin for report this.

Author: Davies Liu <davies@databricks.com>

Closes #8674 from davies/positive_nano.
2015-09-14 14:20:49 -07:00
Kousuke Saruta cf2821ef5f [SPARK-10584] [DOC] [SQL] Documentation about spark.sql.hive.metastore.version is wrong.
The default value of hive metastore version is 1.2.1 but the documentation says the value of `spark.sql.hive.metastore.version` is 0.13.1.
Also, we cannot get the default value by `sqlContext.getConf("spark.sql.hive.metastore.version")`.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #8739 from sarutak/SPARK-10584.
2015-09-14 12:06:23 -07:00
Wenchen Fan 32407bfd2b [SPARK-9899] [SQL] log warning for direct output committer with speculation enabled
This is a follow-up of https://github.com/apache/spark/pull/8317.

When speculation is enabled, there may be multiply tasks writing to the same path. Generally it's OK as we will write to a temporary directory first and only one task can commit the temporary directory to target path.

However, when we use direct output committer, tasks will write data to target path directly without temporary directory. This causes problems like corrupted data. Please see [PR comment](https://github.com/apache/spark/pull/8191#issuecomment-131598385) for more details.

Unfortunately, we don't have a simple flag to tell if a output committer will write to temporary directory or not, so for safety, we have to disable any customized output committer when `speculation` is true.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #8687 from cloud-fan/direct-committer.
2015-09-14 11:51:39 -07:00
Josh Rosen b3a7480ab0 [SPARK-10330] Add Scalastyle rule to require use of SparkHadoopUtil JobContext methods
This is a followup to #8499 which adds a Scalastyle rule to mandate the use of SparkHadoopUtil's JobContext accessor methods and fixes the existing violations.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #8521 from JoshRosen/SPARK-10330-part2.
2015-09-12 16:23:55 -07:00
JihongMa f4a22808e0 [SPARK-6548] Adding stddev to DataFrame functions
Adding STDDEV support for DataFrame using 1-pass online /parallel algorithm to compute variance. Please review the code change.

Author: JihongMa <linlin200605@gmail.com>
Author: Jihong MA <linlin200605@gmail.com>
Author: Jihong MA <jihongma@jihongs-mbp.usca.ibm.com>
Author: Jihong MA <jihongma@Jihongs-MacBook-Pro.local>

Closes #6297 from JihongMA/SPARK-SQL.
2015-09-12 10:17:15 -07:00
Sean Owen 22730ad54d [SPARK-10547] [TEST] Streamline / improve style of Java API tests
Fix a few Java API test style issues: unused generic types, exceptions, wrong assert argument order

Author: Sean Owen <sowen@cloudera.com>

Closes #8706 from srowen/SPARK-10547.
2015-09-12 10:40:10 +01:00
Andrew Or c2af42b5f3 [SPARK-9990] [SQL] Local hash join follow-ups
1. Hide `LocalNodeIterator` behind the `LocalNode#asIterator` method
2. Add tests for this

Author: Andrew Or <andrew@databricks.com>

Closes #8708 from andrewor14/local-hash-join-follow-up.
2015-09-11 15:01:37 -07:00
zsxwing e626ac5f5c [SPARK-9992] [SPARK-9994] [SPARK-9998] [SQL] Implement the local TopK, sample and intersect operators
This PR is in conflict with #8535. I will update this one when #8535 gets merged.

Author: zsxwing <zsxwing@gmail.com>

Closes #8573 from zsxwing/more-local-operators.
2015-09-11 15:00:13 -07:00
Yash Datta 1eede3b254 [SPARK-7142] [SQL] Minor enhancement to BooleanSimplification Optimizer rule. Incorporate review comments
Adding changes suggested by cloud-fan  in #5700

cc marmbrus

Author: Yash Datta <Yash.Datta@guavus.com>

Closes #8716 from saucam/bool_simp.
2015-09-11 14:55:15 -07:00
Wenchen Fan d5d647380f [SPARK-10442] [SQL] fix string to boolean cast
When we cast string to boolean in hive, it returns `true` if the length of string is > 0, and spark SQL follows this behavior.

However, this behavior is very different from other SQL systems:

1. [presto](https://github.com/facebook/presto/blob/master/presto-main/src/main/java/com/facebook/presto/type/VarcharOperators.java#L89-L118) will return `true` for 't' 'true' '1', `false` for 'f' 'false' '0', throw exception for others.
2. [redshift](http://docs.aws.amazon.com/redshift/latest/dg/r_Boolean_type.html) will return `true` for 't' 'true' 'y' 'yes' '1', `false` for 'f' 'false' 'n' 'no' '0', null for others.
3. [postgresql](http://www.postgresql.org/docs/devel/static/datatype-boolean.html) will return `true` for 't' 'true' 'y' 'yes' 'on' '1', `false` for 'f' 'false' 'n' 'no' 'off' '0', throw exception for others.
4. [vertica](https://my.vertica.com/docs/5.0/HTML/Master/2983.htm) will return `true` for 't' 'true' 'y' 'yes' '1', `false` for 'f' 'false' 'n' 'no' '0', null for others.
5. [impala](http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/latest/topics/impala_boolean.html) throw exception when try to cast string to boolean.
6. mysql, oracle, sqlserver don't have boolean type

Whether we should change the cast behavior according to other SQL system or not is not decided yet, this PR is a test to see if we changed, how many compatibility tests will fail.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #8698 from cloud-fan/string2boolean.
2015-09-11 14:15:16 -07:00
Yin Huai 6ce0886eb0 [SPARK-10540] [SQL] Ignore HadoopFsRelationTest's "test all data types" if it is too flaky
If hadoopFsRelationSuites's "test all data types" is too flaky we can disable it for now.

https://issues.apache.org/jira/browse/SPARK-10540

Author: Yin Huai <yhuai@databricks.com>

Closes #8705 from yhuai/SPARK-10540-ignore.
2015-09-11 09:42:53 -07:00
Cheng Lian e1d7f64296 [SPARK-10472] [SQL] Fixes DataType.typeName for UDT
Before this fix, `MyDenseVectorUDT.typeName` gives `mydensevecto`, which is not desirable.

Author: Cheng Lian <lian@databricks.com>

Closes #8640 from liancheng/spark-10472/udt-type-name.
2015-09-11 18:26:56 +08:00
Andrew Or 3db72554be [SPARK-10443] [SQL] Refactor SortMergeOuterJoin to reduce duplication
`LeftOutputIterator` and `RightOutputIterator` are symmetrically identical and can share a lot of code. If someone makes a change in one but forgets to do the same thing in the other we'll end up with inconsistent behavior. This patch also adds inline comments to clarify the intention of the code.

Author: Andrew Or <andrew@databricks.com>

Closes #8596 from andrewor14/smoj-cleanup.
2015-09-10 13:22:35 -07:00
Sun Rui 45e3be5c13 [SPARK-10049] [SPARKR] Support collecting data of ArraryType in DataFrame.
this PR :
1.  Enhance reflection in RBackend. Automatically matching a Java array to Scala Seq when finding methods. Util functions like seq(), listToSeq() in R side can be removed, as they will conflict with the Serde logic that transferrs a Scala seq to R side.

2.  Enhance the SerDe to support transferring  a Scala seq to R side. Data of ArrayType in DataFrame
after collection is observed to be of Scala Seq type.

3.  Support ArrayType in createDataFrame().

Author: Sun Rui <rui.sun@intel.com>

Closes #8458 from sun-rui/SPARK-10049.
2015-09-10 12:21:13 -07:00
zsxwing d88abb7e21 [SPARK-9990] [SQL] Create local hash join operator
This PR includes the following changes:
- Add SQLConf to LocalNode
- Add HashJoinNode
- Add ConvertToUnsafeNode and ConvertToSafeNode.scala to test unsafe hash join.

Author: zsxwing <zsxwing@gmail.com>

Closes #8535 from zsxwing/SPARK-9990.
2015-09-10 12:06:49 -07:00
Cheng Hao e048111376 [SPARK-10466] [SQL] UnsafeRow SerDe exception with data spill
Data Spill with UnsafeRow causes assert failure.

```
java.lang.AssertionError: assertion failed
	at scala.Predef$.assert(Predef.scala:165)
	at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75)
	at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180)
	at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688)
	at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687)
	at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683)
	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	at org.apache.spark.scheduler.Task.run(Task.scala:88)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
```

To reproduce that with code (thanks andrewor14):
```scala
bin/spark-shell --master local
  --conf spark.shuffle.memoryFraction=0.005
  --conf spark.shuffle.sort.bypassMergeThreshold=0

sc.parallelize(1 to 2 * 1000 * 1000, 10)
  .map { i => (i, i) }.toDF("a", "b").groupBy("b").avg().count()
```

Author: Cheng Hao <hao.cheng@intel.com>

Closes #8635 from chenghao-intel/unsafe_spill.
2015-09-10 11:48:43 -07:00
Cheng Lian 49da38e5f7 [SPARK-10301] [SPARK-10428] [SQL] Addresses comments of PR #8583 and #8509 for master
Author: Cheng Lian <lian@databricks.com>

Closes #8670 from liancheng/spark-10301/address-pr-comments.
2015-09-10 11:01:08 -07:00
Yash Datta f892d927d7 [SPARK-7142] [SQL] Minor enhancement to BooleanSimplification Optimizer rule
Use these in the optimizer as well:

            A and (not(A) or B) => A and B
            not(A and B) => not(A) or not(B)
            not(A or B) => not(A) and not(B)

Author: Yash Datta <Yash.Datta@guavus.com>

Closes #5700 from saucam/bool_simp.
2015-09-10 10:34:00 -07:00
Wenchen Fan 4f1daa1ef6 [SPARK-10065] [SQL] avoid the extra copy when generate unsafe array
The reason for this extra copy is that we iterate the array twice: calculate elements data size and copy elements to array buffer.

A simple solution is to follow `createCodeForStruct`, we can dynamically grow the buffer when needed and thus don't need to know the data size ahead.

This PR also include some typo and style fixes, and did some minor refactor to make sure `input.primitive` is always variable name not code when generate unsafe code.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #8496 from cloud-fan/avoid-copy.
2015-09-10 10:04:38 -07:00
Liang-Chi Hsieh 45de518742 [SPARK-9730] [SQL] Add Full Outer Join support for SortMergeJoin
This PR is based on #8383 , thanks to viirya

JIRA: https://issues.apache.org/jira/browse/SPARK-9730

This patch adds the Full Outer Join support for SortMergeJoin. A new class SortMergeFullJoinScanner is added to scan rows from left and right iterators. FullOuterIterator is simply a wrapper of type RowIterator to consume joined rows from SortMergeFullJoinScanner.

Closes #8383

Author: Liang-Chi Hsieh <viirya@appier.com>
Author: Davies Liu <davies@databricks.com>

Closes #8579 from davies/smj_fullouter.
2015-09-09 16:02:27 -07:00
Wenchen Fan 71da1633c4 [SPARK-10461] [SQL] make sure input.primitive is always variable name not code at GenerateUnsafeProjection
When we generate unsafe code inside `createCodeForXXX`, we always assign the `input.primitive` to a temp variable in case `input.primitive` is expression code.

This PR did some refactor to make sure `input.primitive` is always variable name, and some other typo and style fixes.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #8613 from cloud-fan/minor.
2015-09-09 10:57:29 -07:00
Luc Bourlier c1bc4f439f [SPARK-10227] fatal warnings with sbt on Scala 2.11
The bulk of the changes are on `transient` annotation on class parameter. Often the compiler doesn't generate a field for this parameters, so the the transient annotation would be unnecessary.
But if the class parameter are used in methods, then fields are created. So it is safer to keep the annotations.

The remainder are some potential bugs, and deprecated syntax.

Author: Luc Bourlier <luc.bourlier@typesafe.com>

Closes #8433 from skyluc/issue/sbt-2.11.
2015-09-09 09:57:58 +01:00
Michael Armbrust 2143d592c8 [HOTFIX] Fix build break caused by #8494
Author: Michael Armbrust <michael@databricks.com>

Closes #8659 from marmbrus/testBuildBreak.
2015-09-08 16:51:45 -07:00
Cheng Hao d637a666d5 [SPARK-10327] [SQL] Cache Table is not working while subquery has alias in its project list
```scala
    import org.apache.spark.sql.hive.execution.HiveTableScan
    sql("select key, value, key + 1 from src").registerTempTable("abc")
    cacheTable("abc")

    val sparkPlan = sql(
      """select a.key, b.key, c.key from
        |abc a join abc b on a.key=b.key
        |join abc c on a.key=c.key""".stripMargin).queryExecution.sparkPlan

    assert(sparkPlan.collect { case e: InMemoryColumnarTableScan => e }.size === 3) // failed
    assert(sparkPlan.collect { case e: HiveTableScan => e }.size === 0) // failed
```

The actual plan is:

```
== Parsed Logical Plan ==
'Project [unresolvedalias('a.key),unresolvedalias('b.key),unresolvedalias('c.key)]
 'Join Inner, Some(('a.key = 'c.key))
  'Join Inner, Some(('a.key = 'b.key))
   'UnresolvedRelation [abc], Some(a)
   'UnresolvedRelation [abc], Some(b)
  'UnresolvedRelation [abc], Some(c)

== Analyzed Logical Plan ==
key: int, key: int, key: int
Project [key#14,key#61,key#66]
 Join Inner, Some((key#14 = key#66))
  Join Inner, Some((key#14 = key#61))
   Subquery a
    Subquery abc
     Project [key#14,value#15,(key#14 + 1) AS _c2#16]
      MetastoreRelation default, src, None
   Subquery b
    Subquery abc
     Project [key#61,value#62,(key#61 + 1) AS _c2#58]
      MetastoreRelation default, src, None
  Subquery c
   Subquery abc
    Project [key#66,value#67,(key#66 + 1) AS _c2#63]
     MetastoreRelation default, src, None

== Optimized Logical Plan ==
Project [key#14,key#61,key#66]
 Join Inner, Some((key#14 = key#66))
  Project [key#14,key#61]
   Join Inner, Some((key#14 = key#61))
    Project [key#14]
     InMemoryRelation [key#14,value#15,_c2#16], true, 10000, StorageLevel(true, true, false, true, 1), (Project [key#14,value#15,(key#14 + 1) AS _c2#16]), Some(abc)
    Project [key#61]
     MetastoreRelation default, src, None
  Project [key#66]
   MetastoreRelation default, src, None

== Physical Plan ==
TungstenProject [key#14,key#61,key#66]
 BroadcastHashJoin [key#14], [key#66], BuildRight
  TungstenProject [key#14,key#61]
   BroadcastHashJoin [key#14], [key#61], BuildRight
    ConvertToUnsafe
     InMemoryColumnarTableScan [key#14], (InMemoryRelation [key#14,value#15,_c2#16], true, 10000, StorageLevel(true, true, false, true, 1), (Project [key#14,value#15,(key#14 + 1) AS _c2#16]), Some(abc))
    ConvertToUnsafe
     HiveTableScan [key#61], (MetastoreRelation default, src, None)
  ConvertToUnsafe
   HiveTableScan [key#66], (MetastoreRelation default, src, None)
```

Author: Cheng Hao <hao.cheng@intel.com>

Closes #8494 from chenghao-intel/weird_cache.
2015-09-08 16:16:50 -07:00
Yin Huai 7a9dcbc91d [SPARK-10441] [SQL] Save data correctly to json.
https://issues.apache.org/jira/browse/SPARK-10441

Author: Yin Huai <yhuai@databricks.com>

Closes #8597 from yhuai/timestampJson.
2015-09-08 14:10:12 -07:00
Wenchen Fan 5fd57955ef [SPARK-10316] [SQL] respect nondeterministic expressions in PhysicalOperation
We did a lot of special handling for non-deterministic expressions in `Optimizer`. However, `PhysicalOperation` just collects all Projects and Filters and mess it up. We should respect the operators order caused by non-deterministic expressions in `PhysicalOperation`.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #8486 from cloud-fan/fix.
2015-09-08 12:05:41 -07:00
Liang-Chi Hsieh 990c9f79c2 [SPARK-9170] [SQL] Use OrcStructInspector to be case preserving when writing ORC files
JIRA: https://issues.apache.org/jira/browse/SPARK-9170

`StandardStructObjectInspector` will implicitly lowercase column names. But I think Orc format doesn't have such requirement. In fact, there is a `OrcStructInspector` specified for Orc format. We should use it when serialize rows to Orc file. It can be case preserving when writing ORC files.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #7520 from viirya/use_orcstruct.
2015-09-08 23:07:34 +08:00
Cheng Lian bca8c072bd [SPARK-10434] [SQL] Fixes Parquet schema of arrays that may contain null
To keep full compatibility of Parquet write path with Spark 1.4, we should rename the innermost field name of arrays that may contain null from "array_element" to "array".

Please refer to [SPARK-10434] [1] for more details.

[1]: https://issues.apache.org/jira/browse/SPARK-10434

Author: Cheng Lian <lian@databricks.com>

Closes #8586 from liancheng/spark-10434/fix-parquet-array-type.
2015-09-05 17:50:12 +08:00
Cheng Lian 6c751940ea [HOTFIX] [SQL] Fixes compilation error
Jenkins master builders are currently broken by a merge conflict between PR #8584 and PR #8155.

Author: Cheng Lian <lian@databricks.com>

Closes #8614 from liancheng/hotfix/fix-pr-8155-8584-conflict.
2015-09-04 22:57:52 -10:00
Yin Huai 47058ca5db [SPARK-9925] [SQL] [TESTS] Set SQLConf.SHUFFLE_PARTITIONS.key correctly for tests
This PR fix the failed test and conflict for #8155

https://issues.apache.org/jira/browse/SPARK-9925

Closes #8155

Author: Yin Huai <yhuai@databricks.com>
Author: Davies Liu <davies@databricks.com>

Closes #8602 from davies/shuffle_partitions.
2015-09-04 18:58:25 -07:00
Andrew Or 3339e6f674 [SPARK-10450] [SQL] Minor improvements to readability / style / typos etc.
Author: Andrew Or <andrew@databricks.com>

Closes #8603 from andrewor14/minor-sql-changes.
2015-09-04 15:20:20 -07:00
Wenchen Fan c3c0e431a6 [SPARK-10176] [SQL] Show partially analyzed plans when checkAnswer fails to analyze
This PR takes over https://github.com/apache/spark/pull/8389.

This PR improves `checkAnswer` to print the partially analyzed plan in addition to the user friendly error message, in order to aid debugging failing tests.

In doing so, I ran into a conflict with the various ways that we bring a SQLContext into the tests. Depending on the trait we refer to the current context as `sqlContext`, `_sqlContext`, `ctx` or `hiveContext` with access modifiers `public`, `protected` and `private` depending on the defining class.

I propose we refactor as follows:

1. All tests should only refer to a `protected sqlContext` when testing general features, and `protected hiveContext` when it is a method that only exists on a `HiveContext`.
2. All tests should only import `testImplicits._` (i.e., don't import `TestHive.implicits._`)

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #8584 from cloud-fan/cleanupTests.
2015-09-04 15:17:37 -07:00
WangTaoTheTonic 3abc0d5125 [SPARK-9596] [SQL] treat hadoop classes as shared one in IsolatedClientLoader
https://issues.apache.org/jira/browse/SPARK-9596

Author: WangTaoTheTonic <wangtao111@huawei.com>

Closes #7931 from WangTaoTheTonic/SPARK-9596.
2015-09-03 12:56:36 -07:00
zsxwing 0349b5b438 [SPARK-10411] [SQL] Move visualization above explain output and hide explain by default
New screenshots after this fix:

<img width="627" alt="s1" src="https://cloud.githubusercontent.com/assets/1000778/9625782/4b2dba36-518b-11e5-9104-c713ff026e3d.png">

Default:
<img width="462" alt="s2" src="https://cloud.githubusercontent.com/assets/1000778/9625817/92366e50-518b-11e5-9981-cdfb774d66b8.png">

After clicking `+details`:
<img width="377" alt="s3" src="https://cloud.githubusercontent.com/assets/1000778/9625784/4ba24342-518b-11e5-8522-846a16a95d44.png">

Author: zsxwing <zsxwing@gmail.com>

Closes #8570 from zsxwing/SPARK-10411.
2015-09-02 22:17:39 -07:00
Yin Huai 03f3e91ff2 [SPARK-10422] [SQL] String column in InMemoryColumnarCache needs to override clone method
https://issues.apache.org/jira/browse/SPARK-10422

Author: Yin Huai <yhuai@databricks.com>

Closes #8578 from yhuai/SPARK-10422.
2015-09-02 21:00:13 -07:00
Wenchen Fan fc48307797 [SPARK-10389] [SQL] support order by non-attribute grouping expression on Aggregate
For example, we can write `SELECT MAX(value) FROM src GROUP BY key + 1 ORDER BY key + 1` in PostgreSQL, and we should support this in Spark SQL.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #8548 from cloud-fan/support-order-by-non-attribute.
2015-09-02 11:32:27 -07:00
Wenchen Fan 56c4c172e9 [SPARK-10034] [SQL] add regression test for Sort on Aggregate
Before #8371, there was a bug for `Sort` on `Aggregate` that we can't use aggregate expressions named `_aggOrdering` and can't use more than one ordering expressions which contains aggregate functions. The reason of this bug is that: The aggregate expression in `SortOrder` never get resolved, we alias it with `_aggOrdering` and call `toAttribute` which gives us an `UnresolvedAttribute`. So actually we are referencing aggregate expression by name, not by exprId like we thought. And if there is already an aggregate expression named `_aggOrdering` or there are more than one ordering expressions having aggregate functions, we will have conflict names and can't search by name.

However, after #8371 got merged, the `SortOrder`s are guaranteed to be resolved and we are always referencing aggregate expression by exprId. The Bug doesn't exist anymore and this PR add regression tests for it.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #8231 from cloud-fan/sort-agg.
2015-09-02 11:13:17 -07:00
Cheng Lian 391e6be0ae [SPARK-10301] [SQL] Fixes schema merging for nested structs
This PR can be quite challenging to review.  I'm trying to give a detailed description of the problem as well as its solution here.

When reading Parquet files, we need to specify a potentially nested Parquet schema (of type `MessageType`) as requested schema for column pruning.  This Parquet schema is translated from a Catalyst schema (of type `StructType`), which is generated by the query planner and represents all requested columns.  However, this translation can be fairly complicated because of several reasons:

1.  Requested schema must conform to the real schema of the physical file to be read.

    This means we have to tailor the actual file schema of every individual physical Parquet file to be read according to the given Catalyst schema.  Fortunately we are already doing this in Spark 1.5 by pushing request schema conversion to executor side in PR #7231.

1.  Support for schema merging.

    A single Parquet dataset may consist of multiple physical Parquet files come with different but compatible schemas.  This means we may request for a column path that doesn't exist in a physical Parquet file.  All requested column paths can be nested.  For example, for a Parquet file schema

    ```
    message root {
      required group f0 {
        required group f00 {
          required int32 f000;
          required binary f001 (UTF8);
        }
      }
    }
    ```

    we may request for column paths defined in the following schema:

    ```
    message root {
      required group f0 {
        required group f00 {
          required binary f001 (UTF8);
          required float f002;
        }
      }

      optional double f1;
    }
    ```

    Notice that we pruned column path `f0.f00.f000`, but added `f0.f00.f002` and `f1`.

    The good news is that Parquet handles non-existing column paths properly and always returns null for them.

1.  The map from `StructType` to `MessageType` is a one-to-many map.

    This is the most unfortunate part.

    Due to historical reasons (dark histories!), schemas of Parquet files generated by different libraries have different "flavors".  For example, to handle a schema with a single non-nullable column, whose type is an array of non-nullable integers, parquet-protobuf generates the following Parquet schema:

    ```
    message m0 {
      repeated int32 f;
    }
    ```

    while parquet-avro generates another version:

    ```
    message m1 {
      required group f (LIST) {
        repeated int32 array;
      }
    }
    ```

    and parquet-thrift spills this:

    ```
    message m1 {
      required group f (LIST) {
        repeated int32 f_tuple;
      }
    }
    ```

    All of them can be mapped to the following _unique_ Catalyst schema:

    ```
    StructType(
      StructField(
        "f",
        ArrayType(IntegerType, containsNull = false),
        nullable = false))
    ```

    This greatly complicates Parquet requested schema construction, since the path of a given column varies in different cases.  To read the array elements from files with the above schemas, we must use `f` for `m0`, `f.array` for `m1`, and `f.f_tuple` for `m2`.

In earlier Spark versions, we didn't try to fix this issue properly.  Spark 1.4 and prior versions simply translate the Catalyst schema in a way more or less compatible with parquet-hive and parquet-avro, but is broken in many other cases.  Earlier revisions of Spark 1.5 only try to tailor the Parquet file schema at the first level, and ignore nested ones.  This caused [SPARK-10301] [spark-10301] as well as [SPARK-10005] [spark-10005].  In PR #8228, I tried to avoid the hard part of the problem and made a minimum change in `CatalystRowConverter` to fix SPARK-10005.  However, when taking SPARK-10301 into consideration, keeping hacking `CatalystRowConverter` doesn't seem to be a good idea.  So this PR is an attempt to fix the problem in a proper way.

For a given physical Parquet file with schema `ps` and a compatible Catalyst requested schema `cs`, we use the following algorithm to tailor `ps` to get the result Parquet requested schema `ps'`:

For a leaf column path `c` in `cs`:

- if `c` exists in `cs` and a corresponding Parquet column path `c'` can be found in `ps`, `c'` should be included in `ps'`;
- otherwise, we convert `c` to a Parquet column path `c"` using `CatalystSchemaConverter`, and include `c"` in `ps'`;
- no other column paths should exist in `ps'`.

Then comes the most tedious part:

> Given `cs`, `ps`, and `c`, how to locate `c'` in `ps`?

Unfortunately, there's no quick answer, and we have to enumerate all possible structures defined in parquet-format spec.  They are:

1.  the standard structure of nested types, and
1.  cases defined in all backwards-compatibility rules for `LIST` and `MAP`.

The core part of this PR is `CatalystReadSupport.clipParquetType()`, which tailors a given Parquet file schema according to a requested schema in its Catalyst form.  Backwards-compatibility rules of `LIST` and `MAP` are covered in `clipParquetListType()` and `clipParquetMapType()` respectively.  The column path selection algorithm is implemented in `clipParquetGroupFields()`.

With this PR, we no longer need to do schema tailoring in `CatalystReadSupport` and `CatalystRowConverter`.  Another benefit is that, now we can also read Parquet datasets consist of files with different physical Parquet schema but share the same logical schema, for example, files generated by different Parquet libraries.  This situation is illustrated by [this test case] [test-case].

[spark-10301]: https://issues.apache.org/jira/browse/SPARK-10301
[spark-10005]: https://issues.apache.org/jira/browse/SPARK-10005
[test-case]: 38644d8a45 (diff-a9b98e28ce3ae30641829dffd1173be2R26)

Author: Cheng Lian <lian@databricks.com>

Closes #8509 from liancheng/spark-10301/fix-parquet-requested-schema.
2015-09-01 16:52:59 +08:00
Reynold Xin d65656c455 [SPARK-10378][SQL][Test] Remove HashJoinCompatibilitySuite.
They don't bring much value since we now have better unit test coverage for hash joins. This will also help reduce the test time.

Author: Reynold Xin <rxin@databricks.com>

Closes #8542 from rxin/SPARK-10378.
2015-08-31 18:09:24 -07:00
sureshthalamati a2d5c72091 [SPARK-10170] [SQL] Add DB2 JDBC dialect support.
Data frame write to DB2 database is failing because by default JDBC data source implementation is generating a table schema with DB2 unsupported data types TEXT for String, and BIT1(1) for Boolean.

This patch registers DB2 JDBC Dialect that maps String, Boolean to valid DB2 data types.

Author: sureshthalamati <suresh.thalamati@gmail.com>

Closes #8393 from sureshthalamati/db2_dialect_spark-10170.
2015-08-31 12:39:58 -07:00
Feynman Liang 8694c3ad7d [SPARK-10351] [SQL] Fixes UTF8String.fromAddress to handle off-heap memory
CC rxin marmbrus

Author: Feynman Liang <fliang@databricks.com>

Closes #8523 from feynmanliang/SPARK-10351.
2015-08-30 23:12:56 -07:00
zsxwing 13f5f8ec97 [SPARK-9986] [SPARK-9991] [SPARK-9993] [SQL] Create a simple test framework for local operators
This PR includes the following changes:
- Add `LocalNodeTest` for local operator tests and add unit tests for FilterNode and ProjectNode.
- Add `LimitNode` and `UnionNode` and their unit tests to show how to use `LocalNodeTest`. (SPARK-9991, SPARK-9993)

Author: zsxwing <zsxwing@gmail.com>

Closes #8464 from zsxwing/local-execution.
2015-08-29 18:10:44 -07:00
Yin Huai 097a7e36e0 [SPARK-10339] [SPARK-10334] [SPARK-10301] [SQL] Partitioned table scan can OOM driver and throw a better error message when users need to enable parquet schema merging
This fixes the problem that scanning partitioned table causes driver have a high memory pressure and takes down the cluster. Also, with this fix, we will be able to correctly show the query plan of a query consuming partitioned tables.

https://issues.apache.org/jira/browse/SPARK-10339
https://issues.apache.org/jira/browse/SPARK-10334

Finally, this PR squeeze in a "quick fix" for SPARK-10301. It is not a real fix, but it just throw a better error message to let user know what to do.

Author: Yin Huai <yhuai@databricks.com>

Closes #8515 from yhuai/partitionedTableScan.
2015-08-29 16:39:40 -07:00
Josh Rosen 6a6f3c91ee [SPARK-10330] Use SparkHadoopUtil TaskAttemptContext reflection methods in more places
SparkHadoopUtil contains methods that use reflection to work around TaskAttemptContext binary incompatibilities between Hadoop 1.x and 2.x. We should use these methods in more places.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #8499 from JoshRosen/use-hadoop-reflection-in-more-places.
2015-08-29 13:36:25 -07:00
wangwei 277148b285 [SPARK-10226] [SQL] Fix exclamation mark issue in SparkSQL
When I tested the latest version of spark with exclamation mark, I got some errors. Then I reseted the spark version and found that commit id "a2409d1c8e8ddec04b529ac6f6a12b5993f0eeda" brought the bug. With jline version changing from 0.9.94 to 2.12 after this commit, exclamation mark would be treated as a special character in ConsoleReader.

Author: wangwei <wangwei82@huawei.com>

Closes #8420 from small-wang/jline-SPARK-10226.
2015-08-29 13:29:50 -07:00
Michael Armbrust 5c3d16a9b9 [SPARK-10344] [SQL] Add tests for extraStrategies
Actually using this API requires access to a lot of classes that we might make private by accident.  I've added some tests to prevent this.

Author: Michael Armbrust <michael@databricks.com>

Closes #8516 from marmbrus/extraStrategiesTests.
2015-08-29 13:26:01 -07:00
Cheng Lian 24ffa85c00 [SPARK-10289] [SQL] A direct write API for testing Parquet
This PR introduces a direct write API for testing Parquet. It's a DSL flavored version of the [`writeDirect` method] [1] comes with parquet-avro testing code. With this API, it's much easier to construct arbitrary Parquet structures. It's especially useful when adding regression tests for various compatibility corner cases.

Sample usage of this API can be found in the new test case added in `ParquetThriftCompatibilitySuite`.

[1]: https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.1/parquet-avro/src/test/java/org/apache/parquet/avro/TestArrayCompatibility.java#L945-L972

Author: Cheng Lian <lian@databricks.com>

Closes #8454 from liancheng/spark-10289/parquet-testing-direct-write-api.
2015-08-29 13:24:32 -07:00
Davies Liu bb7f352393 [SPARK-10323] [SQL] fix nullability of In/InSet/ArrayContain
After this PR, In/InSet/ArrayContain will return null if value is null, instead of false. They also will return null even if there is a null in the set/array.

Author: Davies Liu <davies@databricks.com>

Closes #8492 from davies/fix_in.
2015-08-28 14:38:20 -07:00
Josh Rosen d3f87dc394 [SPARK-10325] Override hashCode() for public Row
This commit fixes an issue where the public SQL `Row` class did not override `hashCode`, causing it to violate the hashCode() + equals() contract. To fix this, I simply ported the `hashCode` implementation from the 1.4.x version of `Row`.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #8500 from JoshRosen/SPARK-10325 and squashes the following commits:

51ffea1 [Josh Rosen] Override hashCode() for public Row.
2015-08-28 11:51:42 -07:00
Cheng Lian 89b9434385 [SPARK-SQL] [MINOR] Fixes some typos in HiveContext
Author: Cheng Lian <lian@databricks.com>

Closes #8481 from liancheng/hive-context-typo.
2015-08-27 22:30:01 -07:00
Davies Liu 54cda0deb6 [SPARK-10321] sizeInBytes in HadoopFsRelation
Having sizeInBytes in HadoopFsRelation to enable broadcast join.

cc marmbrus

Author: Davies Liu <davies@databricks.com>

Closes #8490 from davies/sizeInByte.
2015-08-27 16:38:00 -07:00
Yin Huai b3dd569ad4 [SPARK-10287] [SQL] Fixes JSONRelation refreshing on read path
https://issues.apache.org/jira/browse/SPARK-10287

After porting json to HadoopFsRelation, it seems hard to keep the behavior of picking up new files automatically for JSON. This PR removes this behavior, so JSON is consistent with others (ORC and Parquet).

Author: Yin Huai <yhuai@databricks.com>

Closes #8469 from yhuai/jsonRefresh.
2015-08-27 16:11:25 -07:00
Davies Liu 7467b52ed0 [SPARK-10215] [SQL] Fix precision of division (follow the rule in Hive)
Follow the rule in Hive for decimal division. see ac755ebe26/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFOPDivide.java (L113)

cc chenghao-intel

Author: Davies Liu <davies@databricks.com>

Closes #8415 from davies/decimal_div2.
2015-08-25 15:20:24 -07:00
Davies Liu ec89bd840a [SPARK-10245] [SQL] Fix decimal literals with precision < scale
In BigDecimal or java.math.BigDecimal, the precision could be smaller than scale, for example, BigDecimal("0.001") has precision = 1 and scale = 3. But DecimalType require that the precision should be larger than scale, so we should use the maximum of precision and scale when inferring the schema from decimal literal.

Author: Davies Liu <davies@databricks.com>

Closes #8428 from davies/smaller_decimal.
2015-08-25 14:55:34 -07:00
Sun Rui 71a138cd0e [SPARK-10048] [SPARKR] Support arbitrary nested Java array in serde.
This PR:
1. supports transferring arbitrary nested array from JVM to R side in SerDe;
2. based on 1, collect() implemenation is improved. Now it can support collecting data of complex types
   from a DataFrame.

Author: Sun Rui <rui.sun@intel.com>

Closes #8276 from sun-rui/SPARK-10048.
2015-08-25 13:14:10 -07:00
Michael Armbrust 5c08c86bfa [SPARK-10198] [SQL] Turn off partition verification by default
Author: Michael Armbrust <michael@databricks.com>

Closes #8404 from marmbrus/turnOffPartitionVerification.
2015-08-25 10:22:54 -07:00
Sean Owen 69c9c17716 [SPARK-9613] [CORE] Ban use of JavaConversions and migrate all existing uses to JavaConverters
Replace `JavaConversions` implicits with `JavaConverters`

Most occurrences I've seen so far are necessary conversions; a few have been avoidable. None are in critical code as far as I see, yet.

Author: Sean Owen <sowen@cloudera.com>

Closes #8033 from srowen/SPARK-9613.
2015-08-25 12:33:13 +01:00
Yin Huai 0e6368ffae [SPARK-10197] [SQL] Add null check in wrapperFor (inside HiveInspectors).
https://issues.apache.org/jira/browse/SPARK-10197

Author: Yin Huai <yhuai@databricks.com>

Closes #8407 from yhuai/ORCSPARK-10197.
2015-08-25 16:19:34 +08:00
Josh Rosen 7bc9a8c624 [SPARK-10195] [SQL] Data sources Filter should not expose internal types
Spark SQL's data sources API exposes Catalyst's internal types through its Filter interfaces. This is a problem because types like UTF8String are not stable developer APIs and should not be exposed to third-parties.

This issue caused incompatibilities when upgrading our `spark-redshift` library to work against Spark 1.5.0.  To avoid these issues in the future we should only expose public types through these Filter objects. This patch accomplishes this by using CatalystTypeConverters to add the appropriate conversions.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #8403 from JoshRosen/datasources-internal-vs-external-types.
2015-08-25 01:06:36 -07:00
Davies Liu 2f493f7e39 [SPARK-10177] [SQL] fix reading Timestamp in parquet from Hive
We misunderstood the Julian days and nanoseconds of the day in parquet (as TimestampType) from Hive/Impala, they are overlapped, so can't be added together directly.

In order to avoid the confusing rounding when do the converting, we use `2440588` as the Julian Day of epoch of unix timestamp (which should be 2440587.5).

Author: Davies Liu <davies@databricks.com>
Author: Cheng Lian <lian@databricks.com>

Closes #8400 from davies/timestamp_parquet.
2015-08-25 16:00:44 +08:00
Josh Rosen 82268f07ab [SPARK-9293] [SPARK-9813] Analysis should check that set operations are only performed on tables with equal numbers of columns
This patch adds an analyzer rule to ensure that set operations (union, intersect, and except) are only applied to tables with the same number of columns. Without this rule, there are scenarios where invalid queries can return incorrect results instead of failing with error messages; SPARK-9813 provides one example of this problem. In other cases, the invalid query can crash at runtime with extremely confusing exceptions.

I also performed a bit of cleanup to refactor some of those logical operators' code into a common `SetOperation` base class.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7631 from JoshRosen/SPARK-9293.
2015-08-25 00:04:10 -07:00
Cheng Lian bf03fe68d6 [SPARK-10136] [SQL] A more robust fix for SPARK-10136
PR #8341 is a valid fix for SPARK-10136, but it didn't catch the real root cause.  The real problem can be rather tricky to explain, and requires audiences to be pretty familiar with parquet-format spec, especially details of `LIST` backwards-compatibility rules.  Let me have a try to give an explanation here.

The structure of the problematic Parquet schema generated by parquet-avro is something like this:

```
message m {
  <repetition> group f (LIST) {         // Level 1
    repeated group array (LIST) {       // Level 2
      repeated <primitive-type> array;  // Level 3
    }
  }
}
```

(The schema generated by parquet-thrift is structurally similar, just replace the `array` at level 2 with `f_tuple`, and the other one at level 3 with `f_tuple_tuple`.)

This structure consists of two nested legacy 2-level `LIST`-like structures:

1. The repeated group type at level 2 is the element type of the outer array defined at level 1

   This group should map to an `CatalystArrayConverter.ElementConverter` when building converters.

2. The repeated primitive type at level 3 is the element type of the inner array defined at level 2

   This group should also map to an `CatalystArrayConverter.ElementConverter`.

The root cause of SPARK-10136 is that, the group at level 2 isn't properly recognized as the element type of level 1.  Thus, according to parquet-format spec, the repeated primitive at level 3 is left as a so called "unannotated repeated primitive type", and is recognized as a required list of required primitive type, thus a `RepeatedPrimitiveConverter` instead of a `CatalystArrayConverter.ElementConverter` is created for it.

According to  parquet-format spec, unannotated repeated type shouldn't appear in a `LIST`- or `MAP`-annotated group.  PR #8341 fixed this issue by allowing such unannotated repeated type appear in `LIST`-annotated groups, which is a non-standard, hacky, but valid fix.  (I didn't realize this when authoring #8341 though.)

As for the reason why level 2 isn't recognized as a list element type, it's because of the following `LIST` backwards-compatibility rule defined in the parquet-format spec:

> If the repeated field is a group with one field and is named either `array` or uses the `LIST`-annotated group's name with `_tuple` appended then the repeated type is the element type and elements are required.

(The `array` part is for parquet-avro compatibility, while the `_tuple` part is for parquet-thrift.)

This rule is implemented in [`CatalystSchemaConverter.isElementType`] [1], but neglected in [`CatalystRowConverter.isElementType`] [2].  This PR delivers a more robust fix by adding this rule in the latter method.

Note that parquet-avro 1.7.0 also suffers from this issue. Details can be found at [PARQUET-364] [3].

[1]: 85f9a61357/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystSchemaConverter.scala (L259-L305)
[2]: 85f9a61357/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystRowConverter.scala (L456-L463)
[3]: https://issues.apache.org/jira/browse/PARQUET-364

Author: Cheng Lian <lian@databricks.com>

Closes #8361 from liancheng/spark-10136/proper-version.
2015-08-25 14:58:42 +08:00
Yin Huai df7041d02d [SPARK-10196] [SQL] Correctly saving decimals in internal rows to JSON.
https://issues.apache.org/jira/browse/SPARK-10196

Author: Yin Huai <yhuai@databricks.com>

Closes #8408 from yhuai/DecimalJsonSPARK-10196.
2015-08-24 23:38:32 -07:00
Michael Armbrust 5175ca0c85 [SPARK-10178] [SQL] HiveComparisionTest should print out dependent tables
In `HiveComparisionTest`s it is possible to fail a query of the form `SELECT * FROM dest1`, where `dest1` is the query that is actually computing the incorrect results.  To aid debugging this patch improves the harness to also print these query plans and their results.

Author: Michael Armbrust <michael@databricks.com>

Closes #8388 from marmbrus/generatedTables.
2015-08-24 23:15:27 -07:00
Yin Huai a0c0aae1de [SPARK-10121] [SQL] Thrift server always use the latest class loader provided by the conf of executionHive's state
https://issues.apache.org/jira/browse/SPARK-10121

Looks like the problem is that if we add a jar through another thread, the thread handling the JDBC session will not get the latest classloader.

Author: Yin Huai <yhuai@databricks.com>

Closes #8368 from yhuai/SPARK-10121.
2015-08-25 12:49:50 +08:00
Feynman Liang 642c43c81c [SQL] [MINOR] [DOC] Clarify docs for inferring DataFrame from RDD of Products
* Makes `SQLImplicits.rddToDataFrameHolder` scaladoc consistent with `SQLContext.createDataFrame[A <: Product](rdd: RDD[A])` since the former is essentially a wrapper for the latter
 * Clarifies `createDataFrame[A <: Product]` scaladoc to apply for any `RDD[Product]`, not just case classes

Author: Feynman Liang <fliang@databricks.com>

Closes #8406 from feynmanliang/sql-doc-fixes.
2015-08-24 19:45:41 -07:00
Michael Armbrust 2bf338c626 [SPARK-10165] [SQL] Await child resolution in ResolveFunctions
Currently, we eagerly attempt to resolve functions, even before their children are resolved.  However, this is not valid in cases where we need to know the types of the input arguments (i.e. when resolving Hive UDFs).

As a fix, this PR delays function resolution until the functions children are resolved.  This change also necessitates a change to the way we resolve aggregate expressions that are not in aggregate operators (e.g., in `HAVING` or `ORDER BY` clauses).  Specifically, we can't assume that these misplaced functions will be resolved, allowing us to differentiate aggregate functions from normal functions.  To compensate for this change we now attempt to resolve these unresolved expressions in the context of the aggregate operator, before checking to see if any aggregate expressions are present.

Author: Michael Armbrust <michael@databricks.com>

Closes #8371 from marmbrus/hiveUDFResolution.
2015-08-24 18:10:51 -07:00
Josh Rosen d7b4c09527 [SPARK-10190] Fix NPE in CatalystTypeConverters Decimal toScala converter
This adds a missing null check to the Decimal `toScala` converter in `CatalystTypeConverters`, fixing an NPE.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #8401 from JoshRosen/SPARK-10190.
2015-08-24 16:17:45 -07:00
Sean Owen cb2d2e1584 [SPARK-9758] [TEST] [SQL] Compilation issue for hive test / wrong package?
Move `test.org.apache.spark.sql.hive` package tests to apparent intended `org.apache.spark.sql.hive` as they don't intend to test behavior from outside org.apache.spark.*

Alternate take, per discussion at https://github.com/apache/spark/pull/8051
I think this is what vanzin and I had in mind but also CC rxin to cross-check, as this does indeed depend on whether these tests were accidentally in this package or not. Testing from a `test.org.apache.spark` package is legitimate but didn't seem to be the intent here.

Author: Sean Owen <sowen@cloudera.com>

Closes #8307 from srowen/SPARK-9758.
2015-08-24 22:35:21 +01:00
Cheng Lian a2f4cdceba [SPARK-8580] [SQL] Refactors ParquetHiveCompatibilitySuite and adds more test cases
This PR refactors `ParquetHiveCompatibilitySuite` so that it's easier to add new test cases.

Hit two bugs, SPARK-10177 and HIVE-11625, while working on this, added test cases for them and marked as ignored for now. SPARK-10177 will be addressed in a separate PR.

Author: Cheng Lian <lian@databricks.com>

Closes #8392 from liancheng/spark-8580/parquet-hive-compat-tests.
2015-08-24 14:11:19 -07:00
Burak Yavuz 9ce0c7ad33 [SPARK-7710] [SPARK-7998] [DOCS] Docs for DataFrameStatFunctions
This PR contains examples on how to use some of the Stat Functions available for DataFrames under `df.stat`.

rxin

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #8378 from brkyvz/update-sql-docs.
2015-08-24 13:48:01 -07:00
Yijie Shen 90cb9f0565 [SPARK-9401] [SQL] Fully implement code generation for ConcatWs
This PR adds full codegen support for ConcatWs, is a substitute of #7782

JIRA: https://issues.apache.org/jira/browse/SPARK-9401

cc davies

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #8353 from yjshen/concatws.
2015-08-22 10:16:35 -07:00
Yin Huai e3355090d4 [SPARK-10143] [SQL] Use parquet's block size (row group size) setting as the min split size if necessary.
https://issues.apache.org/jira/browse/SPARK-10143

With this PR, we will set min split size to parquet's block size (row group size) set in the conf if the min split size is smaller. So, we can avoid have too many tasks and even useless tasks for reading parquet data.

I tested it locally. The table I have has 343MB and it is in my local FS. Because I did not set any min/max split size, the default split size was 32MB and the map stage had 11 tasks. But there were only three tasks that actually read data. With my PR, there were only three tasks in the map stage. Here is the difference.

Without this PR:
![image](https://cloud.githubusercontent.com/assets/2072857/9399179/8587dba6-4765-11e5-9189-7ebba52a2b6d.png)

With this PR:
![image](https://cloud.githubusercontent.com/assets/2072857/9399185/a4735d74-4765-11e5-8848-1f1e361a6b4b.png)

Even if the block size setting does match the actual block size of parquet file, I think it is still generally good to use parquet's block size setting if min split size is smaller than this block size.

Tested it on a cluster using
```
val count = sqlContext.table("""store_sales""").groupBy().count().queryExecution.executedPlan(3).execute().count
```
Basically, it reads 0 column of table `store_sales`. My table has 1824 parquet files with size from 80MB to 280MB (1 to 3 row group sizes). Without this patch, in a 16 worker cluster, the job had 5023 tasks and spent 102s. With this patch, the job had 2893 tasks and spent 64s. It is still not as good as using one mapper per file (1824 tasks and 42s), but it is much better than our master.

Author: Yin Huai <yhuai@databricks.com>

Closes #8346 from yhuai/parquetMinSplit.
2015-08-21 14:30:00 -07:00
Daoyuan Wang 3c462f5d87 [SPARK-10130] [SQL] type coercion for IF should have children resolved first
Type coercion for IF should have children resolved first, or we could meet unresolved exception.

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #8331 from adrian-wang/spark10130.
2015-08-21 12:21:51 -07:00
Liang-Chi Hsieh bb220f6570 [SPARK-10040] [SQL] Use batch insert for JDBC writing
JIRA: https://issues.apache.org/jira/browse/SPARK-10040

We should use batch insert instead of single row in JDBC.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #8273 from viirya/jdbc-insert-batch.
2015-08-21 01:43:49 -07:00
Tarek Auel afe9f03fd9 [SPARK-9400] [SQL] codegen for StringLocate
This is based on #7779 , thanks to tarekauel . Fix the conflict and nullability.

Closes #7779 and #8274 .

Author: Tarek Auel <tarek.auel@googlemail.com>
Author: Davies Liu <davies@databricks.com>

Closes #8330 from davies/stringLocate.
2015-08-20 15:10:13 -07:00
Wenchen Fan 907df2fce0 [SQL] [MINOR] remove unnecessary class
This class is identical to `org.apache.spark.sql.execution.datasources.jdbc. DefaultSource` and is not needed.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #8334 from cloud-fan/minor.
2015-08-20 13:51:54 -07:00
Cheng Lian 85f9a61357 [SPARK-10136] [SQL] Fixes Parquet support for Avro array of primitive array
I caught SPARK-10136 while adding more test cases to `ParquetAvroCompatibilitySuite`. Actual bug fix code lies in `CatalystRowConverter.scala`.

Author: Cheng Lian <lian@databricks.com>

Closes #8341 from liancheng/spark-10136/parquet-avro-nested-primitive-array.
2015-08-20 11:00:29 -07:00
Reynold Xin b4f4e91c39 [SPARK-10100] [SQL] Eliminate hash table lookup if there is no grouping key in aggregation.
This improves performance by ~ 20 - 30% in one of my local test and should fix the performance regression from 1.4 to 1.5 on ss_max.

Author: Reynold Xin <rxin@databricks.com>

Closes #8332 from rxin/SPARK-10100.
2015-08-20 07:53:27 -07:00
Yin Huai 43e0135421 [SPARK-10092] [SQL] Multi-DB support follow up.
https://issues.apache.org/jira/browse/SPARK-10092

This pr is a follow-up one for Multi-DB support. It has the following changes:

* `HiveContext.refreshTable` now accepts `dbName.tableName`.
* `HiveContext.analyze` now accepts `dbName.tableName`.
* `CreateTableUsing`, `CreateTableUsingAsSelect`, `CreateTempTableUsing`, `CreateTempTableUsingAsSelect`, `CreateMetastoreDataSource`, and `CreateMetastoreDataSourceAsSelect` all take `TableIdentifier` instead of the string representation of table name.
* When you call `saveAsTable` with a specified database, the data will be saved to the correct location.
* Explicitly do not allow users to create a temporary with a specified database name (users cannot do it before).
* When we save table to metastore, we also check if db name and table name can be accepted by hive (using `MetaStoreUtils.validateName`).

Author: Yin Huai <yhuai@databricks.com>

Closes #8324 from yhuai/saveAsTableDB.
2015-08-20 15:30:31 +08:00
Reynold Xin 2f2686a73f [SPARK-9242] [SQL] Audit UDAF interface.
A few minor changes:

1. Improved documentation
2. Rename apply(distinct....) to distinct.
3. Changed MutableAggregationBuffer from a trait to an abstract class.
4. Renamed returnDataType to dataType to be more consistent with other expressions.

And unrelated to UDAFs:

1. Renamed file names in expressions to use suffix "Expressions" to be more consistent.
2. Moved regexp related expressions out to its own file.
3. Renamed StringComparison => StringPredicate.

Author: Reynold Xin <rxin@databricks.com>

Closes #8321 from rxin/SPARK-9242.
2015-08-19 17:35:41 -07:00
hyukjinkwon ba5f7e1842 [SPARK-10035] [SQL] Parquet filters does not process EqualNullSafe filter.
As I talked with Lian,

1. I added EquelNullSafe to ParquetFilters
 - It uses the same equality comparison filter with EqualTo since the Parquet filter performs actually null-safe equality comparison.

2. Updated the test code (ParquetFilterSuite)
 - Convert catalyst.Expression to sources.Filter
 - Removed Cast since only Literal is picked up as a proper Filter in DataSourceStrategy
 - Added EquelNullSafe comparison

3. Removed deprecated createFilter for catalyst.Expression

Author: hyukjinkwon <gurwls223@gmail.com>
Author: 권혁진 <gurwls223@gmail.com>

Closes #8275 from HyukjinKwon/master.
2015-08-20 08:13:25 +08:00
Wenchen Fan b0dbaec4f9 [SPARK-6489] [SQL] add column pruning for Generate
This PR takes over https://github.com/apache/spark/pull/5358

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #8268 from cloud-fan/6489.
2015-08-19 15:05:06 -07:00
Daoyuan Wang 373a376c04 [SPARK-10083] [SQL] CaseWhen should support type coercion of DecimalType and FractionalType
create t1 (a decimal(7, 2), b long);
select case when 1=1 then a else 1.0 end from t1;
select case when 1=1 then a else b end from t1;

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #8270 from adrian-wang/casewhenfractional.
2015-08-19 14:31:51 -07:00
Cheng Lian f3ff4c41d2 [SPARK-9899] [SQL] Disables customized output committer when speculation is on
Speculation hates direct output committer, as there are multiple corner cases that may cause data corruption and/or data loss.

Please see this [PR comment] [1] for more details.

[1]: https://github.com/apache/spark/pull/8191#issuecomment-131598385

Author: Cheng Lian <lian@databricks.com>

Closes #8317 from liancheng/spark-9899/speculation-hates-direct-output-committer.
2015-08-19 14:15:28 -07:00
Davies Liu 1f4c4fe6df [SPARK-10090] [SQL] fix decimal scale of division
We should rounding the result of multiply/division of decimal to expected precision/scale, also check overflow.

Author: Davies Liu <davies@databricks.com>

Closes #8287 from davies/decimal_division.
2015-08-19 14:03:47 -07:00