Commit graph

17750 commits

Author SHA1 Message Date
Sean Owen befab9c1c6 [SPARK-17264][SQL] DataStreamWriter should document that it only supports Parquet for now
## What changes were proposed in this pull request?

Clarify that only parquet files are supported by DataStreamWriter now

## How was this patch tested?

(Doc build -- no functional changes to test)

Author: Sean Owen <sowen@cloudera.com>

Closes #14860 from srowen/SPARK-17264.
2016-08-30 11:19:45 +01:00
Xin Ren 2d76cb11f5 [SPARK-17276][CORE][TEST] Stop env params output on Jenkins job page
https://issues.apache.org/jira/browse/SPARK-17276

## What changes were proposed in this pull request?

When trying to find error msg in a failed Jenkins build job, I'm annoyed by the huge env output.
The env parameter output should be muted.

![screen shot 2016-08-26 at 10 52 07 pm](https://cloud.githubusercontent.com/assets/3925641/18025581/b8d567ba-6be2-11e6-9eeb-6aec223f1730.png)

## How was this patch tested?

Tested manually on local laptop.

Author: Xin Ren <iamshrek@126.com>

Closes #14848 from keypointt/SPARK-17276.
2016-08-30 11:18:29 +01:00
gatorsmile bca79c8230 [SPARK-17234][SQL] Table Existence Checking when Index Table with the Same Name Exists
### What changes were proposed in this pull request?
Hive Index tables are not supported by Spark SQL. Thus, we issue an exception when users try to access Hive Index tables. When the internal function `tableExists` tries to access Hive Index tables, it always gets the same error message: ```Hive index table is not supported```. This message could be confusing to users, since their SQL operations could be completely unrelated to Hive Index tables. For example, when users try to alter a table to a new name and there exists an index table with the same name, the expected exception should be a `TableAlreadyExistsException`.

This PR made the following changes:
- Introduced a new `AnalysisException` type: `SQLFeatureNotSupportedException`. When users try to access an `Index Table`, we will issue a `SQLFeatureNotSupportedException`.
- `tableExists` returns `true` when hitting a `SQLFeatureNotSupportedException` and the feature is `Hive index table`.
- Add a checking `requireTableNotExists` for `SessionCatalog`'s `createTable` API; otherwise, the current implementation relies on the Hive's internal checking.

### How was this patch tested?
Added a test case

Author: gatorsmile <gatorsmile@gmail.com>

Closes #14801 from gatorsmile/tableExists.
2016-08-30 17:27:00 +08:00
Takeshi YAMAMURO 94922d79e9 [SPARK-17289][SQL] Fix a bug to satisfy sort requirements in partial aggregations
## What changes were proposed in this pull request?
Partial aggregations are generated in `EnsureRequirements`, but the planner fails to
check if partial aggregation satisfies sort requirements.
For the following query:
```
val df2 = (0 to 1000).map(x => (x % 2, x.toString)).toDF("a", "b").createOrReplaceTempView("t2")
spark.sql("select max(b) from t2 group by a").explain(true)
```
Now, the SortAggregator won't insert Sort operator before partial aggregation, this will break sort-based partial aggregation.
```
== Physical Plan ==
SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
+- *Sort [a#5 ASC], false, 0
   +- Exchange hashpartitioning(a#5, 200)
      +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, max#19])
         +- LocalTableScan [a#5, b#6]
```
Actually, a correct plan is:
```
== Physical Plan ==
SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
+- *Sort [a#5 ASC], false, 0
   +- Exchange hashpartitioning(a#5, 200)
      +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, max#19])
         +- *Sort [a#5 ASC], false, 0
            +- LocalTableScan [a#5, b#6]
```

## How was this patch tested?
Added tests in `PlannerSuite`.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #14865 from maropu/SPARK-17289.
2016-08-30 16:43:47 +08:00
frreiss 8fb445d9bd [SPARK-17303] Added spark-warehouse to dev/.rat-excludes
## What changes were proposed in this pull request?

Excludes the `spark-warehouse` directory from the Apache RAT checks that src/run-tests performs. `spark-warehouse` is created by some of the Spark SQL tests, as well as by `bin/spark-sql`.

## How was this patch tested?

Ran src/run-tests twice. The second time, the script failed because the first iteration
Made the change in this PR.
Ran src/run-tests a third time; RAT checks succeeded.

Author: frreiss <frreiss@us.ibm.com>

Closes #14870 from frreiss/fred-17303.
2016-08-29 23:33:00 -07:00
Josh Rosen 48b459ddd5 [SPARK-17301][SQL] Remove unused classTag field from AtomicType base class
There's an unused `classTag` val in the AtomicType base class which is causing unnecessary slowness in deserialization because it needs to grab ScalaReflectionLock and create a new runtime reflection mirror. Removing this unused code gives a small but measurable performance boost in SQL task deserialization.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #14869 from JoshRosen/remove-unused-classtag.
2016-08-30 09:58:00 +08:00
Shivaram Venkataraman 736a7911cb [SPARK-16581][SPARKR] Make JVM backend calling functions public
## What changes were proposed in this pull request?

This change exposes a public API in SparkR to create objects, call methods on the Spark driver JVM

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

Unit tests, CRAN checks

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #14775 from shivaram/sparkr-java-api.
2016-08-29 12:55:32 -07:00
Davies Liu 48caec2516 [SPARK-17063] [SQL] Improve performance of MSCK REPAIR TABLE with Hive metastore
## What changes were proposed in this pull request?

This PR split the the single `createPartitions()` call into smaller batches, which could prevent Hive metastore from OOM (caused by millions of partitions).

It will also try to gather all the fast stats (number of files and total size of all files) in parallel to avoid the bottle neck of listing the files in metastore sequential, which is controlled by spark.sql.gatherFastStats (enabled by default).

## How was this patch tested?

Tested locally with 10000 partitions and 100 files with embedded metastore, without gathering fast stats in parallel, adding partitions took 153 seconds, after enable that, gathering the fast stats took about 34 seconds, adding these partitions took 25 seconds (most of the time spent in object store), 59 seconds in total, 2.5X faster (with larger cluster, gathering will much faster).

Author: Davies Liu <davies@databricks.com>

Closes #14607 from davies/repair_batch.
2016-08-29 11:23:53 -07:00
Junyang Qian 6a0fda2c05 [SPARKR][MINOR] Fix LDA doc
## What changes were proposed in this pull request?

This PR tries to fix the name of the `SparkDataFrame` used in the example. Also, it gives a reference url of an example data file so that users can play with.

## How was this patch tested?

Manual test.

Author: Junyang Qian <junyangq@databricks.com>

Closes #14853 from junyangq/SPARKR-FixLDADoc.
2016-08-29 10:23:10 -07:00
Seigneurin, Alexis (CONT) 08913ce000 fixed a typo
idempotant -> idempotent

Author: Seigneurin, Alexis (CONT) <Alexis.Seigneurin@capitalone.com>

Closes #14833 from aseigneurin/fix-typo.
2016-08-29 13:12:10 +01:00
Sean Owen 1a48c0047b [BUILD] Closes some stale PRs.
## What changes were proposed in this pull request?

Closes #10995
Closes #13658
Closes #14505
Closes #14536
Closes #12753
Closes #14449
Closes #12694
Closes #12695
Closes #14810
Closes #10572

## How was this patch tested?

N/A

Author: Sean Owen <sowen@cloudera.com>

Closes #14849 from srowen/CloseStalePRs.
2016-08-29 10:46:26 +01:00
Tejas Patil 095862a3cf [SPARK-17271][SQL] Planner adds un-necessary Sort even if child ordering is semantically same as required ordering
## What changes were proposed in this pull request?

Jira : https://issues.apache.org/jira/browse/SPARK-17271

Planner is adding un-needed SORT operation due to bug in the way comparison for `SortOrder` is done at https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala#L253
`SortOrder` needs to be compared semantically because `Expression` within two `SortOrder` can be "semantically equal" but not literally equal objects.

eg. In case of `sql("SELECT * FROM table1 a JOIN table2 b ON a.col1=b.col1")`

Expression in required SortOrder:
```
      AttributeReference(
        name = "col1",
        dataType = LongType,
        nullable = false
      ) (exprId = exprId,
        qualifier = Some("a")
      )
```

Expression in child SortOrder:
```
      AttributeReference(
        name = "col1",
        dataType = LongType,
        nullable = false
      ) (exprId = exprId)
```

Notice that the output column has a qualifier but the child attribute does not but the inherent expression is the same and hence in this case we can say that the child satisfies the required sort order.

This PR includes following changes:
- Added a `semanticEquals` method to `SortOrder` so that it can compare underlying child expressions semantically (and not using default Object.equals)
- Fixed `EnsureRequirements` to use semantic comparison of SortOrder

## How was this patch tested?

- Added a test case to `PlannerSuite`. Ran rest tests in `PlannerSuite`

Author: Tejas Patil <tejasp@fb.com>

Closes #14841 from tejasapatil/SPARK-17271_sort_order_equals_bug.
2016-08-28 19:14:58 +02:00
Sean Owen e07baf1412 [SPARK-17001][ML] Enable standardScaler to standardize sparse vectors when withMean=True
## What changes were proposed in this pull request?

Allow centering / mean scaling of sparse vectors in StandardScaler, if requested. This is for compatibility with `VectorAssembler` in common usages.

## How was this patch tested?

Jenkins tests, including new caes to reflect the new behavior.

Author: Sean Owen <sowen@cloudera.com>

Closes #14663 from srowen/SPARK-17001.
2016-08-27 08:48:56 +01:00
Robert Kruszewski 9fbced5b25 [SPARK-17216][UI] fix event timeline bars length
## What changes were proposed in this pull request?

Make event timeline bar expand to full length of the bar (which is total time)

This issue occurs only on chrome, firefox looks fine. Haven't tested other browsers.

## How was this patch tested?
Inspection in browsers

Before
![screen shot 2016-08-24 at 3 38 24 pm](https://cloud.githubusercontent.com/assets/512084/17935104/0d6cda74-6a12-11e6-9c66-e00cfa855606.png)

After
![screen shot 2016-08-24 at 3 36 39 pm](https://cloud.githubusercontent.com/assets/512084/17935114/15740ea4-6a12-11e6-83a1-7c06eef6abb8.png)

Author: Robert Kruszewski <robertk@palantir.com>

Closes #14791 from robert3005/robertk/event-timeline.
2016-08-27 08:47:15 +01:00
Peng, Meng 40168dbe77 [ML][MLLIB] The require condition and message doesn't match in SparseMatrix.
## What changes were proposed in this pull request?
The require condition and message doesn't match, and the condition also should be optimized.
Small change.  Please kindly let me know if JIRA required.

## How was this patch tested?
No additional test required.

Author: Peng, Meng <peng.meng@intel.com>

Closes #14824 from mpjlu/smallChangeForMatrixRequire.
2016-08-27 08:46:01 +01:00
Takeshi YAMAMURO cd0ed31ea9 [SPARK-15382][SQL] Fix a bug in sampling with replacement
## What changes were proposed in this pull request?
This pr to fix a bug below in sampling with replacement
```
val df = Seq((1, 0), (2, 0), (3, 0)).toDF("a", "b")
df.sample(true, 2.0).withColumn("c", monotonically_increasing_id).select($"c").show
+---+
|  c|
+---+
|  0|
|  1|
|  1|
|  1|
|  2|
+---+
```

## How was this patch tested?
Added a test in `DataFrameSuite`.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #14800 from maropu/FixSampleBug.
2016-08-27 08:42:41 +01:00
Reynold Xin 718b6bad2d [SPARK-17274][SQL] Move join optimizer rules into a separate file
## What changes were proposed in this pull request?
As part of breaking Optimizer.scala apart, this patch moves various join rules into a single file.

## How was this patch tested?
This should be covered by existing tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #14846 from rxin/SPARK-17274.
2016-08-27 00:36:18 -07:00
Reynold Xin 5aad4509c1 [SPARK-17273][SQL] Move expression optimizer rules into a separate file
## What changes were proposed in this pull request?
As part of breaking Optimizer.scala apart, this patch moves various expression optimization rules into a single file.

## How was this patch tested?
This should be covered by existing tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #14845 from rxin/SPARK-17273.
2016-08-27 00:34:35 -07:00
Reynold Xin 0243b32873 [SPARK-17272][SQL] Move subquery optimizer rules into its own file
## What changes were proposed in this pull request?
As part of breaking Optimizer.scala apart, this patch moves various subquery rules into a single file.

## How was this patch tested?
This should be covered by existing tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #14844 from rxin/SPARK-17272.
2016-08-27 00:32:57 -07:00
Reynold Xin dcefac4387 [SPARK-17269][SQL] Move finish analysis optimization stage into its own file
## What changes were proposed in this pull request?
As part of breaking Optimizer.scala apart, this patch moves various finish analysis optimization stage rules into a single file. I'm submitting separate pull requests so we can more easily merge this in branch-2.0 to simplify optimizer backports.

## How was this patch tested?
This should be covered by existing tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #14838 from rxin/SPARK-17269.
2016-08-26 22:10:28 -07:00
Reynold Xin cc0caa690b [SPARK-17270][SQL] Move object optimization rules into its own file
## What changes were proposed in this pull request?
As part of breaking Optimizer.scala apart, this patch moves various Dataset object optimization rules into a single file. I'm submitting separate pull requests so we can more easily merge this in branch-2.0 to simplify optimizer backports.

## How was this patch tested?
This should be covered by existing tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #14839 from rxin/SPARK-17270.
2016-08-26 21:41:58 -07:00
Yin Huai a6bca3ad02 [SPARK-17266][TEST] Add empty strings to the regressionTests of PrefixComparatorsSuite
## What changes were proposed in this pull request?
This PR adds a regression test to PrefixComparatorsSuite's "String prefix comparator" because this test failed on jenkins once (https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.4/1620/testReport/junit/org.apache.spark.util.collection.unsafe.sort/PrefixComparatorsSuite/String_prefix_comparator/).

I could not reproduce it locally. But, let's this test case in the regressionTests.

Author: Yin Huai <yhuai@databricks.com>

Closes #14837 from yhuai/SPARK-17266.
2016-08-26 19:38:52 -07:00
Sameer Agarwal 540e912801 [SPARK-17244] Catalyst should not pushdown non-deterministic join conditions
## What changes were proposed in this pull request?

Given that non-deterministic expressions can be stateful, pushing them down the query plan during the optimization phase can cause incorrect behavior. This patch fixes that issue by explicitly disabling that.

## How was this patch tested?

A new test in `FilterPushdownSuite` that checks catalyst behavior for both deterministic and non-deterministic join conditions.

Author: Sameer Agarwal <sameerag@cs.berkeley.edu>

Closes #14815 from sameeragarwal/constraint-inputfile.
2016-08-26 16:40:59 -07:00
petermaxlee f64a1ddd09 [SPARK-17235][SQL] Support purging of old logs in MetadataLog
## What changes were proposed in this pull request?
This patch adds a purge interface to MetadataLog, and an implementation in HDFSMetadataLog. The purge function is currently unused, but I will use it to purge old execution and file source logs in follow-up patches. These changes are required in a production structured streaming job that runs for a long period of time.

## How was this patch tested?
Added a unit test case in HDFSMetadataLogSuite.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14802 from petermaxlee/SPARK-17235.
2016-08-26 16:05:34 -07:00
Herman van Hovell a11d10f182 [SPARK-17246][SQL] Add BigDecimal literal
## What changes were proposed in this pull request?
This PR adds parser support for `BigDecimal` literals. If you append the suffix `BD` to a valid number then this will be interpreted as a `BigDecimal`, for example `12.0E10BD` will interpreted into a BigDecimal with scale -9 and precision 3. This is useful in situations where you need exact values.

## How was this patch tested?
Added tests to `ExpressionParserSuite`, `ExpressionSQLBuilderSuite` and `SQLQueryTestSuite`.

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #14819 from hvanhovell/SPARK-17246.
2016-08-26 13:29:22 -07:00
Michael Gummelt 8e5475be3c [SPARK-16967] move mesos to module
## What changes were proposed in this pull request?

Move Mesos code into a mvn module

## How was this patch tested?

unit tests
manually submitting a client mode and cluster mode job
spark/mesos integration test suite

Author: Michael Gummelt <mgummelt@mesosphere.io>

Closes #14637 from mgummelt/mesos-module.
2016-08-26 12:25:22 -07:00
Peng, Meng c0949dc944 [SPARK-17207][MLLIB] fix comparing Vector bug in TestingUtils
## What changes were proposed in this pull request?

fix comparing Vector bug in TestingUtils.
There is the same bug for Matrix comparing. How to check the length of Matrix should be discussed first.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Author: Peng, Meng <peng.meng@intel.com>

Closes #14785 from mpjlu/testUtils.
2016-08-26 11:54:10 -07:00
petermaxlee 9812f7d538 [SPARK-17165][SQL] FileStreamSource should not track the list of seen files indefinitely
## What changes were proposed in this pull request?
Before this change, FileStreamSource uses an in-memory hash set to track the list of files processed by the engine. The list can grow indefinitely, leading to OOM or overflow of the hash set.

This patch introduces a new user-defined option called "maxFileAge", default to 24 hours. If a file is older than this age, FileStreamSource will purge it from the in-memory map that was used to track the list of files that have been processed.

## How was this patch tested?
Added unit tests for the underlying utility, and also added an end-to-end test to validate the purge in FileStreamSourceSuite. Also verified the new test cases would fail when the timeout was set to a very large number.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14728 from petermaxlee/SPARK-17165.
2016-08-26 11:30:23 -07:00
gatorsmile 261c55dd88 [SPARK-17250][SQL] Remove HiveClient and setCurrentDatabase from HiveSessionCatalog
### What changes were proposed in this pull request?
This is the first step to remove `HiveClient` from `HiveSessionState`. In the metastore interaction, we always use the fully qualified table name when accessing/operating a table. That means, we always specify the database. Thus, it is not necessary to use `HiveClient` to change the active database in Hive metastore.

In `HiveSessionCatalog `, `setCurrentDatabase` is the only function that uses `HiveClient`. Thus, we can remove it after removing `setCurrentDatabase`

### How was this patch tested?
The existing test cases.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #14821 from gatorsmile/setCurrentDB.
2016-08-26 11:19:03 -07:00
gatorsmile fd4ba3f626 [SPARK-17192][SQL] Issue Exception when Users Specify the Partitioning Columns without a Given Schema
### What changes were proposed in this pull request?
Address the comments by yhuai in the original PR: https://github.com/apache/spark/pull/14207

First, issue an exception instead of logging a warning when users specify the partitioning columns without a given schema.

Second, refactor the codes a little.

### How was this patch tested?
Fixed the test cases.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #14572 from gatorsmile/followup16552.
2016-08-26 11:13:38 -07:00
Junyang Qian 1883216235 [SPARKR][MINOR] Fix example of spark.naiveBayes
## What changes were proposed in this pull request?

The original example doesn't work because the features are not categorical. This PR fixes this by changing to another dataset.

## How was this patch tested?

Manual test.

Author: Junyang Qian <junyangq@databricks.com>

Closes #14820 from junyangq/SPARK-FixNaiveBayes.
2016-08-26 11:01:48 -07:00
Wenchen Fan 970ab8f6dd [SPARK-17187][SQL][FOLLOW-UP] improve document of TypedImperativeAggregate
## What changes were proposed in this pull request?

improve the document to make it easier to understand and also mention window operator.

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14822 from cloud-fan/object-agg.
2016-08-26 10:56:57 -07:00
Wenchen Fan 28ab17922a [SPARK-17260][MINOR] move CreateTables to HiveStrategies
## What changes were proposed in this pull request?

`CreateTables` rule turns a general `CreateTable` plan to `CreateHiveTableAsSelectCommand` for hive serde table. However, this rule is logically a planner strategy, we should move it to `HiveStrategies`, to be consistent with other DDL commands.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14825 from cloud-fan/ctas.
2016-08-26 08:52:10 -07:00
hyukjinkwon 6063d5963f [SPARK-16216][SQL][FOLLOWUP] Enable timestamp type tests for JSON and verify all unsupported types in CSV
## What changes were proposed in this pull request?

This PR enables the tests for `TimestampType` for JSON and unifies the logics for verifying schema when writing in CSV.

In more details, this PR,

- Enables the tests for `TimestampType` for JSON and

  This was disabled due to an issue in `DatatypeConverter.parseDateTime` which parses dates incorrectly, for example as below:

  ```scala
   val d = javax.xml.bind.DatatypeConverter.parseDateTime("0900-01-01T00:00:00.000").getTime
  println(d.toString)
  ```
  ```
  Fri Dec 28 00:00:00 KST 899
  ```

  However, since we use `FastDateFormat`, it seems we are safe now.

  ```scala
  val d = FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSS").parse("0900-01-01T00:00:00.000")
  println(d)
  ```
  ```
  Tue Jan 01 00:00:00 PST 900
  ```

- Verifies all unsupported types in CSV

  There is a separate logics to verify the schemas in `CSVFileFormat`. This is actually not quite correct enough because we don't support `NullType` and `CalanderIntervalType` as well `StructType`, `ArrayType`, `MapType`. So, this PR adds both types.

## How was this patch tested?

Tests in `JsonHadoopFsRelation` and `CSVSuite`

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14829 from HyukjinKwon/SPARK-16216-followup.
2016-08-26 17:29:37 +02:00
Shixiong Zhu 341e0e778d [SPARK-17242][DOCUMENT] Update links of external dstream projects
## What changes were proposed in this pull request?

Updated links of external dstream projects.

## How was this patch tested?

Just document changes.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #14814 from zsxwing/dstream-link.
2016-08-25 21:08:42 -07:00
hyukjinkwon b964a172a8 [SPARK-17212][SQL] TypeCoercion supports widening conversion between DateType and TimestampType
## What changes were proposed in this pull request?

Currently, type-widening does not work between `TimestampType` and `DateType`.

This applies to `SetOperation`, `Union`, `In`, `CaseWhen`, `Greatest`,  `Leatest`, `CreateArray`, `CreateMap`, `Coalesce`, `NullIf`, `IfNull`, `Nvl` and `Nvl2`, .

This PR adds the support for widening `DateType` to `TimestampType` for them.

For a simple example,

**Before**

```scala
Seq(Tuple2(new Timestamp(0), new Date(0))).toDF("a", "b").selectExpr("greatest(a, b)").show()
```

shows below:

```
cannot resolve 'greatest(`a`, `b`)' due to data type mismatch: The expressions should all have the same type, got GREATEST(timestamp, date)
```

or union as below:

```scala
val a = Seq(Tuple1(new Timestamp(0))).toDF()
val b = Seq(Tuple1(new Date(0))).toDF()
a.union(b).show()
```

shows below:

```
Union can only be performed on tables with the compatible column types. DateType <> TimestampType at the first column of the second table;
```

**After**

```scala
Seq(Tuple2(new Timestamp(0), new Date(0))).toDF("a", "b").selectExpr("greatest(a, b)").show()
```

shows below:

```
+----------------------------------------------------+
|greatest(CAST(a AS TIMESTAMP), CAST(b AS TIMESTAMP))|
+----------------------------------------------------+
|                                1969-12-31 16:00:...|
+----------------------------------------------------+
```

or union as below:

```scala
val a = Seq(Tuple1(new Timestamp(0))).toDF()
val b = Seq(Tuple1(new Date(0))).toDF()
a.union(b).show()
```

shows below:

```
+--------------------+
|                  _1|
+--------------------+
|1969-12-31 16:00:...|
|1969-12-31 00:00:...|
+--------------------+
```

## How was this patch tested?

Unit tests in `TypeCoercionSuite`.

Author: hyukjinkwon <gurwls223@gmail.com>
Author: HyukjinKwon <gurwls223@gmail.com>

Closes #14786 from HyukjinKwon/SPARK-17212.
2016-08-26 08:58:43 +08:00
Sean Zhong d96d151563 [SPARK-17187][SQL] Supports using arbitrary Java object as internal aggregation buffer object
## What changes were proposed in this pull request?

This PR introduces an abstract class `TypedImperativeAggregate` so that an aggregation function of TypedImperativeAggregate can use  **arbitrary** user-defined Java object as intermediate aggregation buffer object.

**This has advantages like:**
1. It now can support larger category of aggregation functions. For example, it will be much easier to implement aggregation function `percentile_approx`, which has a complex aggregation buffer definition.
2. It can be used to avoid doing serialization/de-serialization for every call of `update` or `merge` when converting domain specific aggregation object to internal Spark-Sql storage format.
3. It is easier to integrate with other existing monoid libraries like algebird, and supports more aggregation functions with high performance.

Please see `org.apache.spark.sql.TypedImperativeAggregateSuite.TypedMaxAggregate` to find an example of how to defined a `TypedImperativeAggregate` aggregation function.
Please see Java doc of `TypedImperativeAggregate` and Jira ticket SPARK-17187 for more information.

## How was this patch tested?

Unit tests.

Author: Sean Zhong <seanzhong@databricks.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #14753 from clockfly/object_aggregation_buffer_try_2.
2016-08-25 16:36:16 -07:00
Marcelo Vanzin 9b5a1d1d53 [SPARK-17240][CORE] Make SparkConf serializable again.
Make the config reader transient, and initialize it lazily so that
serialization works with both java and kryo (and hopefully any other
custom serializer).

Added unit test to make sure SparkConf remains serializable and the
reader works with both built-in serializers.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #14813 from vanzin/SPARK-17240.
2016-08-25 16:11:42 -07:00
Josh Rosen 3e4c7db4d1 [SPARK-17205] Literal.sql should handle Infinity and NaN
This patch updates `Literal.sql` to properly generate SQL for `NaN` and `Infinity` float and double literals: these special values need to be handled differently from regular values, since simply appending a suffix to the value's `toString()` representation will not work for these values.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #14777 from JoshRosen/SPARK-17205.
2016-08-26 00:15:01 +02:00
Josh Rosen a133057ce5 [SPARK-17229][SQL] PostgresDialect shouldn't widen float and short types during reads
## What changes were proposed in this pull request?

When reading float4 and smallint columns from PostgreSQL, Spark's `PostgresDialect` widens these types to Decimal and Integer rather than using the narrower Float and Short types. According to https://www.postgresql.org/docs/7.1/static/datatype.html#DATATYPE-TABLE, Postgres maps the `smallint` type to a signed two-byte integer and the `real` / `float4` types to single precision floating point numbers.

This patch fixes this by adding more special-cases to `getCatalystType`, similar to what was done for the Derby JDBC dialect. I also fixed a similar problem in the write path which causes Spark to create integer columns in Postgres for what should have been ShortType columns.

## How was this patch tested?

New test cases in `PostgresIntegrationSuite` (which I ran manually because Jenkins can't run it right now).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #14796 from JoshRosen/postgres-jdbc-type-fixes.
2016-08-25 23:22:40 +02:00
wm624@hotmail.com 9958ac0ce2 [SPARKR][BUILD] ignore cran-check.out under R folder
## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)
R add cran check which will generate the cran-check.out. This file should be ignored in git.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
Manual test it. Run clean test and git status to make sure the file is not included in git.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #14774 from wangmiao1981/ignore.
2016-08-25 12:11:27 -07:00
Michael Allman f209310719 [SPARK-17231][CORE] Avoid building debug or trace log messages unless the respective log level is enabled
(This PR addresses https://issues.apache.org/jira/browse/SPARK-17231)

## What changes were proposed in this pull request?

While debugging the performance of a large GraphX connected components computation, we found several places in the `network-common` and `network-shuffle` code bases where trace or debug log messages are constructed even if the respective log level is disabled. According to YourKit, these constructions were creating substantial churn in the eden region. Refactoring the respective code to avoid these unnecessary constructions except where necessary led to a modest but measurable reduction in our job's task time, GC time and the ratio thereof.

## How was this patch tested?

We computed the connected components of a graph with about 2.6 billion vertices and 1.7 billion edges four times. We used four different EC2 clusters each with 8 r3.8xl worker nodes. Two test runs used Spark master. Two used Spark master + this PR. The results from the first test run, master and master+PR:
![master](https://cloud.githubusercontent.com/assets/833693/17951634/7471cbca-6a18-11e6-9c26-78afe9319685.jpg)
![logging_perf_improvements](https://cloud.githubusercontent.com/assets/833693/17951632/7467844e-6a18-11e6-9a0e-053dc7650413.jpg)

The results from the second test run, master and master+PR:
![master 2](https://cloud.githubusercontent.com/assets/833693/17951633/746dd6aa-6a18-11e6-8e27-606680b3f105.jpg)
![logging_perf_improvements 2](https://cloud.githubusercontent.com/assets/833693/17951631/74488710-6a18-11e6-8a32-08692f373386.jpg)

Though modest, I believe these results are significant.

Author: Michael Allman <michael@videoamp.com>

Closes #14798 from mallman/spark-17231-logging_perf_improvements.
2016-08-25 11:57:38 -07:00
gatorsmile d2ae6399ee [SPARK-16991][SPARK-17099][SPARK-17120][SQL] Fix Outer Join Elimination when Filter's isNotNull Constraints Unable to Filter Out All Null-supplying Rows
### What changes were proposed in this pull request?
This PR is to fix an incorrect outer join elimination when filter's `isNotNull` constraints is unable to filter out all null-supplying rows. For example, `isnotnull(coalesce(b#227, c#238))`.

Users can hit this error when they try to use `using/natural outer join`, which is converted to a normal outer join with a `coalesce` expression on the `using columns`. For example,
```Scala
    val a = Seq((1, 2), (2, 3)).toDF("a", "b")
    val b = Seq((2, 5), (3, 4)).toDF("a", "c")
    val c = Seq((3, 1)).toDF("a", "d")
    val ab = a.join(b, Seq("a"), "fullouter")
    ab.join(c, "a").explain(true)
```
The dataframe `ab` is doing `using full-outer join`, which is converted to a normal outer join with a `coalesce` expression. Constraints inference generates a `Filter` with constraints `isnotnull(coalesce(b#227, c#238))`. Then, it triggers a wrong outer join elimination and generates a wrong result.
```
Project [a#251, b#227, c#237, d#247]
+- Join Inner, (a#251 = a#246)
   :- Project [coalesce(a#226, a#236) AS a#251, b#227, c#237]
   :  +- Join FullOuter, (a#226 = a#236)
   :     :- Project [_1#223 AS a#226, _2#224 AS b#227]
   :     :  +- LocalRelation [_1#223, _2#224]
   :     +- Project [_1#233 AS a#236, _2#234 AS c#237]
   :        +- LocalRelation [_1#233, _2#234]
   +- Project [_1#243 AS a#246, _2#244 AS d#247]
      +- LocalRelation [_1#243, _2#244]

== Optimized Logical Plan ==
Project [a#251, b#227, c#237, d#247]
+- Join Inner, (a#251 = a#246)
   :- Project [coalesce(a#226, a#236) AS a#251, b#227, c#237]
   :  +- Filter isnotnull(coalesce(a#226, a#236))
   :     +- Join FullOuter, (a#226 = a#236)
   :        :- LocalRelation [a#226, b#227]
   :        +- LocalRelation [a#236, c#237]
   +- LocalRelation [a#246, d#247]
```

**A note to the `Committer`**, please also give the credit to dongjoon-hyun who submitted another PR for fixing this issue. https://github.com/apache/spark/pull/14580

### How was this patch tested?
Added test cases

Author: gatorsmile <gatorsmile@gmail.com>

Closes #14661 from gatorsmile/fixOuterJoinElimination.
2016-08-25 14:18:58 +02:00
Takeshi YAMAMURO 2b0cc4e0df [SPARK-12978][SQL] Skip unnecessary final group-by when input data already clustered with group-by keys
This ticket targets the optimization to skip an unnecessary group-by operation below;

Without opt.:
```
== Physical Plan ==
TungstenAggregate(key=[col0#159], functions=[(sum(col1#160),mode=Final,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)], output=[col0#159,sum(col1)#177,avg(col2)#178])
+- TungstenAggregate(key=[col0#159], functions=[(sum(col1#160),mode=Partial,isDistinct=false),(avg(col2#161),mode=Partial,isDistinct=false)], output=[col0#159,sum#200,sum#201,count#202L])
   +- TungstenExchange hashpartitioning(col0#159,200), None
      +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], InMemoryRelation [col0#159,col1#160,col2#161], true, 10000, StorageLevel(true, true, false, true, 1), ConvertToUnsafe, None
```

With opt.:
```
== Physical Plan ==
TungstenAggregate(key=[col0#159], functions=[(sum(col1#160),mode=Complete,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)], output=[col0#159,sum(col1)#177,avg(col2)#178])
+- TungstenExchange hashpartitioning(col0#159,200), None
  +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], InMemoryRelation [col0#159,col1#160,col2#161], true, 10000, StorageLevel(true, true, false, true, 1), ConvertToUnsafe, None
```

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #10896 from maropu/SkipGroupbySpike.
2016-08-25 12:39:58 +02:00
Yanbo Liang 6b8cb1fe52 [SPARK-17197][ML][PYSPARK] PySpark LiR/LoR supports tree aggregation level configurable.
## What changes were proposed in this pull request?
[SPARK-17090](https://issues.apache.org/jira/browse/SPARK-17090) makes tree aggregation level in LiR/LoR configurable, this PR makes PySpark support this function.

## How was this patch tested?
Since ```aggregationDepth``` is an expert param, I'm not prefer to test it in doctest which is also used for example. Here is the offline test result:
![image](https://cloud.githubusercontent.com/assets/1962026/17879457/f83d7760-68a6-11e6-9936-d0a884d5d6ec.png)

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #14766 from yanboliang/spark-17197.
2016-08-25 02:26:33 -07:00
Liwei Lin e0b20f9f24 [SPARK-17061][SPARK-17093][SQL] MapObjects` should make copies of unsafe-backed data
## What changes were proposed in this pull request?

Currently `MapObjects` does not make copies of unsafe-backed data, leading to problems like [SPARK-17061](https://issues.apache.org/jira/browse/SPARK-17061) [SPARK-17093](https://issues.apache.org/jira/browse/SPARK-17093).

This patch makes `MapObjects` make copies of unsafe-backed data.

Generated code - prior to this patch:
```java
...
/* 295 */ if (isNull12) {
/* 296 */   convertedArray1[loopIndex1] = null;
/* 297 */ } else {
/* 298 */   convertedArray1[loopIndex1] = value12;
/* 299 */ }
...
```

Generated code - after this patch:
```java
...
/* 295 */ if (isNull12) {
/* 296 */   convertedArray1[loopIndex1] = null;
/* 297 */ } else {
/* 298 */   convertedArray1[loopIndex1] = value12 instanceof UnsafeRow? value12.copy() : value12;
/* 299 */ }
...
```

## How was this patch tested?

Add a new test case which would fail without this patch.

Author: Liwei Lin <lwlin7@gmail.com>

Closes #14698 from lw-lin/mapobjects-copy.
2016-08-25 11:24:40 +02:00
Sean Owen 2bcd5d5ce3 [SPARK-17193][CORE] HadoopRDD NPE at DEBUG log level when getLocationInfo == null
## What changes were proposed in this pull request?

Handle null from Hadoop getLocationInfo directly instead of catching (and logging) exception

## How was this patch tested?

Jenkins tests

Author: Sean Owen <sowen@cloudera.com>

Closes #14760 from srowen/SPARK-17193.
2016-08-25 09:45:49 +01:00
jiangxingbo 5f02d2e5b4 [SPARK-17215][SQL] Method SQLContext.parseDataType(dataTypeString: String) could be removed.
## What changes were proposed in this pull request?

Method `SQLContext.parseDataType(dataTypeString: String)` could be removed, we should use `SparkSession.parseDataType(dataTypeString: String)` instead.
This require updating PySpark.

## How was this patch tested?

Existing test cases.

Author: jiangxingbo <jiangxb1987@gmail.com>

Closes #14790 from jiangxb1987/parseDataType.
2016-08-24 23:36:04 -07:00
gatorsmile 4d0706d616 [SPARK-17190][SQL] Removal of HiveSharedState
### What changes were proposed in this pull request?
Since `HiveClient` is used to interact with the Hive metastore, it should be hidden in `HiveExternalCatalog`. After moving `HiveClient` into `HiveExternalCatalog`, `HiveSharedState` becomes a wrapper of `HiveExternalCatalog`. Thus, removal of `HiveSharedState` becomes straightforward. After removal of `HiveSharedState`, the reflection logic is directly applied on the choice of `ExternalCatalog` types, based on the configuration of `CATALOG_IMPLEMENTATION`.

~~`HiveClient` is also used/invoked by the other entities besides HiveExternalCatalog, we defines the following two APIs: getClient and getNewClient~~

### How was this patch tested?
The existing test cases

Author: gatorsmile <gatorsmile@gmail.com>

Closes #14757 from gatorsmile/removeHiveClient.
2016-08-25 12:50:03 +08:00
Sameer Agarwal ac27557eb6 [SPARK-17228][SQL] Not infer/propagate non-deterministic constraints
## What changes were proposed in this pull request?

Given that filters based on non-deterministic constraints shouldn't be pushed down in the query plan, unnecessarily inferring them is confusing and a source of potential bugs. This patch simplifies the inferring logic by simply ignoring them.

## How was this patch tested?

Added a new test in `ConstraintPropagationSuite`.

Author: Sameer Agarwal <sameerag@cs.berkeley.edu>

Closes #14795 from sameeragarwal/deterministic-constraints.
2016-08-24 21:24:24 -07:00