Commit graph

17124 commits

Author SHA1 Message Date
Wenchen Fan a2abb583ca [SPARK-16663][SQL] desc table should be consistent between data source and hive serde tables
## What changes were proposed in this pull request?

Currently there are 2 inconsistence:

1. for data source table, we only print partition names, for hive table, we also print partition schema. After this PR, we will always print schema
2. if column doesn't have comment, data source table will print empty string, hive table will print null. After this PR, we will always print null

## How was this patch tested?

new test in `HiveDDLSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14302 from cloud-fan/minor3.
2016-07-26 18:46:12 +08:00
WeichenXu 4c9695598e [SPARK-16697][ML][MLLIB] improve LDA submitMiniBatch method to avoid redundant RDD computation
## What changes were proposed in this pull request?

In `LDAOptimizer.submitMiniBatch`, do persist on `stats: RDD[(BDM[Double], List[BDV[Double]])]`
and also move the place of unpersisting `expElogbetaBc` broadcast variable,
to avoid the `expElogbetaBc` broadcast variable to be unpersisted too early,
and update previous `expElogbetaBc.unpersist()` into `expElogbetaBc.destroy(false)`

## How was this patch tested?

Existing test.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #14335 from WeichenXu123/improve_LDA.
2016-07-26 10:41:41 +01:00
hyukjinkwon 3b2b785ece [SPARK-16675][SQL] Avoid per-record type dispatch in JDBC when writing
## What changes were proposed in this pull request?

Currently, `JdbcUtils.savePartition` is doing type-based dispatch for each row to write appropriate values.

So, appropriate setters for `PreparedStatement` can be created first according to the schema, and then apply them to each row. This approach is similar with `CatalystWriteSupport`.

This PR simply make the setters to avoid this.

## How was this patch tested?

Existing tests should cover this.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14323 from HyukjinKwon/SPARK-16675.
2016-07-26 17:14:58 +08:00
Tathagata Das 03c27435ae [TEST][STREAMING] Fix flaky Kafka rate controlling test
## What changes were proposed in this pull request?

The current test is incorrect, because
- The expected number of messages does not take into account that the topic has 2 partitions, and rate is set per partition.
- Also in some cases, the test ran out of data in Kafka while waiting for the right amount of data per batch.

The PR
- Reduces the number of partitions to 1
- Adds more data to Kafka
- Runs with 0.5 second so that batches are created slowly

## How was this patch tested?
Ran many times locally, going to run it many times in Jenkins

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #14361 from tdas/kafka-rate-test-fix.
2016-07-26 00:41:46 -07:00
Wenchen Fan 6959061f02 [SPARK-16706][SQL] support java map in encoder
## What changes were proposed in this pull request?

finish the TODO, create a new expression `ExternalMapToCatalyst` to iterate the map directly.

## How was this patch tested?

new test in `JavaDatasetSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14344 from cloud-fan/java-map.
2016-07-26 15:33:05 +08:00
Liang-Chi Hsieh 7b06a8948f [SPARK-16686][SQL] Remove PushProjectThroughSample since it is handled by ColumnPruning
## What changes were proposed in this pull request?

We push down `Project` through `Sample` in `Optimizer` by the rule `PushProjectThroughSample`. However, if the projected columns produce new output, they will encounter whole data instead of sampled data. It will bring some inconsistency between original plan (Sample then Project) and optimized plan (Project then Sample). In the extreme case such as attached in the JIRA, if the projected column is an UDF which is supposed to not see the sampled out data, the result of UDF will be incorrect.

Since the rule `ColumnPruning` already handles general `Project` pushdown. We don't need  `PushProjectThroughSample` anymore. The rule `ColumnPruning` also avoids the described issue.

## How was this patch tested?

Jenkins tests.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #14327 from viirya/fix-sample-pushdown.
2016-07-26 12:00:01 +08:00
Yin Huai 815f3eece5 [SPARK-16633][SPARK-16642][SPARK-16721][SQL] Fixes three issues related to lead and lag functions
## What changes were proposed in this pull request?
This PR contains three changes.

First, this PR changes the behavior of lead/lag back to Spark 1.6's behavior, which is described as below:
1. lead/lag respect null input values, which means that if the offset row exists and the input value is null, the result will be null instead of the default value.
2. If the offset row does not exist, the default value will be used.
3. OffsetWindowFunction's nullable setting also considers the nullability of its input (because of the first change).

Second, this PR fixes the evaluation of lead/lag when the input expression is a literal. This fix is a result of the first change. In current master, if a literal is used as the input expression of a lead or lag function, the result will be this literal even if the offset row does not exist.

Third, this PR makes ResolveWindowFrame not fire if a window function is not resolved.

## How was this patch tested?
New tests in SQLWindowFunctionSuite

Author: Yin Huai <yhuai@databricks.com>

Closes #14284 from yhuai/lead-lag.
2016-07-25 20:58:07 -07:00
Michael Armbrust f99e34e8e5 [SPARK-16724] Expose DefinedByConstructorParams
We don't generally make things in catalyst/execution private.  Instead they are just undocumented due to their lack of stability guarantees.

Author: Michael Armbrust <michael@databricks.com>

Closes #14356 from marmbrus/patch-1.
2016-07-25 20:41:24 -07:00
Dongjoon Hyun 8a8d26f1e2 [SPARK-16672][SQL] SQLBuilder should not raise exceptions on EXISTS queries
## What changes were proposed in this pull request?

Currently, `SQLBuilder` raises `empty.reduceLeft` exceptions on *unoptimized* `EXISTS` queries. We had better prevent this.
```scala
scala> sql("CREATE TABLE t1(a int)")
scala> val df = sql("select * from t1 b where exists (select * from t1 a)")
scala> new org.apache.spark.sql.catalyst.SQLBuilder(df).toSQL
java.lang.UnsupportedOperationException: empty.reduceLeft
```

## How was this patch tested?

Pass the Jenkins tests with a new test suite.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14307 from dongjoon-hyun/SPARK-16672.
2016-07-25 19:52:17 -07:00
Nicholas Brown ba0aade6d5 Fix description of spark.speculation.quantile
## What changes were proposed in this pull request?

Minor doc fix regarding the spark.speculation.quantile configuration parameter.  It incorrectly states it should be a percentage, when it should be a fraction.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

I tried building the documentation but got some unidoc errors.  I also got them when building off origin/master, so I don't think I caused that problem.  I did run the web app and saw the changes reflected as expected.

Author: Nicholas Brown <nbrown@adroitdigital.com>

Closes #14352 from nwbvt/master.
2016-07-25 19:18:27 -07:00
gatorsmile 3fc4566941 [SPARK-16678][SPARK-16677][SQL] Fix two View-related bugs
## What changes were proposed in this pull request?
**Issue 1: Disallow Creating/Altering a View when the same-name Table Exists (without IF NOT EXISTS)**
When we create OR alter a view, we check whether the view already exists. In the current implementation, if a table with the same name exists, we treat it as a view. However, this is not the right behavior. We should follow what Hive does. For example,
```
hive> CREATE TABLE tab1 (id int);
OK
Time taken: 0.196 seconds
hive> CREATE OR REPLACE VIEW tab1 AS SELECT * FROM t1;
FAILED: SemanticException [Error 10218]: Existing table is not a view
 The following is an existing table, not a view: default.tab1
hive> ALTER VIEW tab1 AS SELECT * FROM t1;
FAILED: SemanticException [Error 10218]: Existing table is not a view
 The following is an existing table, not a view: default.tab1
hive> CREATE VIEW IF NOT EXISTS tab1 AS SELECT * FROM t1;
OK
Time taken: 0.678 seconds
```

**Issue 2: Strange Error when Issuing Load Table Against A View**
Users should not be allowed to issue LOAD DATA against a view. Currently, when users doing it, we got a very strange runtime error. For example,
```SQL
LOAD DATA LOCAL INPATH "$testData" INTO TABLE $viewName
```
```
java.lang.reflect.InvocationTargetException was thrown.
java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.sql.hive.client.Shim_v0_14.loadTable(HiveShim.scala:680)
```
## How was this patch tested?
Added test cases

Author: gatorsmile <gatorsmile@gmail.com>

Closes #14314 from gatorsmile/tableDDLAgainstView.
2016-07-26 09:32:29 +08:00
Shixiong Zhu e164a04b2b [SPARK-16722][TESTS] Fix a StreamingContext leak in StreamingContextSuite when eventually fails
## What changes were proposed in this pull request?

This PR moves `ssc.stop()` into `finally` for `StreamingContextSuite.createValidCheckpoint` to avoid leaking a StreamingContext since leaking a StreamingContext will fail a lot of tests and make us hard to find the real failure one.

## How was this patch tested?

Jenkins unit tests

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #14354 from zsxwing/ssc-leak.
2016-07-25 18:26:29 -07:00
Tao Lin db36e1e75d [SPARK-15590][WEBUI] Paginate Job Table in Jobs tab
## What changes were proposed in this pull request?

This patch adds pagination support for the Job Tables in the Jobs tab. Pagination is provided for all of the three Job Tables (active, completed, and failed). Interactions (jumping, sorting, and setting page size) for paged tables are also included.

The diff didn't keep track of some lines based on the original ones. The function `makeRow`of the original `AllJobsPage.scala` is reused. They are separated at the beginning of the function `jobRow` (L427-439) and the function `row`(L594-618) in the new `AllJobsPage.scala`.

## How was this patch tested?

Tested manually by using checking the Web UI after completing and failing hundreds of jobs.
Generate completed jobs by:
```scala
val d = sc.parallelize(Array(1,2,3,4,5))
for(i <- 1 to 255){ var b = d.collect() }
```
Generate failed jobs by calling the following code multiple times:
```scala
var b = d.map(_/0).collect()
```
Interactions like jumping, sorting, and setting page size are all tested.

This shows the pagination for completed jobs:
![paginate success jobs](https://cloud.githubusercontent.com/assets/5558370/15986498/efa12ef6-303b-11e6-8b1d-c3382aeb9ad0.png)

This shows the sorting works in job tables:
![sorting](https://cloud.githubusercontent.com/assets/5558370/15986539/98c8a81a-303c-11e6-86f2-8d2bc7924ee9.png)

This shows the pagination for failed jobs and the effect of jumping and setting page size:
![paginate failed jobs](https://cloud.githubusercontent.com/assets/5558370/15986556/d8c1323e-303c-11e6-8e4b-7bdb030ea42b.png)

Author: Tao Lin <nblintao@gmail.com>

Closes #13620 from nblintao/dev.
2016-07-25 17:35:50 -07:00
Tathagata Das c979c8bba0 [SPARK-14131][STREAMING] SQL Improved fix for avoiding potential deadlocks in HDFSMetadataLog
## What changes were proposed in this pull request?
Current fix for deadlock disables interrupts in the StreamExecution which getting offsets for all sources, and when writing to any metadata log, to avoid potential deadlocks in HDFSMetadataLog(see JIRA for more details). However, disabling interrupts can have unintended consequences in other sources. So I am making the fix more narrow, by disabling interrupt it only in the HDFSMetadataLog. This is a narrower fix for something risky like disabling interrupt.

## How was this patch tested?
Existing tests.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #14292 from tdas/SPARK-14131.
2016-07-25 16:09:22 -07:00
Shixiong Zhu 12f490b5c8 [SPARK-16715][TESTS] Fix a potential ExprId conflict for SubexpressionEliminationSuite."Semantic equals and hash"
## What changes were proposed in this pull request?

SubexpressionEliminationSuite."Semantic equals and hash" assumes the default AttributeReference's exprId wont' be "ExprId(1)". However, that depends on when this test runs. It may happen to use "ExprId(1)".

This PR detects the conflict and makes sure we create a different ExprId when the conflict happens.

## How was this patch tested?

Jenkins unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #14350 from zsxwing/SPARK-16715.
2016-07-25 16:08:29 -07:00
jerryshao f5ea7fe539 [SPARK-16166][CORE] Also take off-heap memory usage into consideration in log and webui display
## What changes were proposed in this pull request?

Currently in the log and UI display, only on-heap storage memory is calculated and displayed,

```
16/06/27 13:41:52 INFO MemoryStore: Block rdd_5_0 stored as values in memory (estimated size 17.8 KB, free 665.9 MB)
```
<img width="1232" alt="untitled" src="https://cloud.githubusercontent.com/assets/850797/16369960/53fb614e-3c6e-11e6-8fa3-7ffe65abcb49.png">

With [SPARK-13992](https://issues.apache.org/jira/browse/SPARK-13992) off-heap memory is supported for data persistence, so here change to also take off-heap storage memory into consideration.

## How was this patch tested?

Unit test and local verification.

Author: jerryshao <sshao@hortonworks.com>

Closes #13920 from jerryshao/SPARK-16166.
2016-07-25 15:17:06 -07:00
Takeshi YAMAMURO cda4603de3 [SQL][DOC] Fix a default name for parquet compression
## What changes were proposed in this pull request?
This pr is to fix a wrong description for parquet default compression.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #14351 from maropu/FixParquetDoc.
2016-07-25 15:08:58 -07:00
Josh Rosen fc17121d59 Revert "[SPARK-15271][MESOS] Allow force pulling executor docker images"
This reverts commit 978cd5f125.
2016-07-25 12:43:44 -07:00
Shuai Lin 3b6e1d094e [SPARK-16485][DOC][ML] Fixed several inline formatting in ml features doc
## What changes were proposed in this pull request?

Fixed several inline formatting in ml features doc.

Before:

<img width="475" alt="screen shot 2016-07-14 at 12 24 57 pm" src="https://cloud.githubusercontent.com/assets/717363/16827974/1e1b6e04-49be-11e6-8aa9-4a0cb6cd3b4e.png">

After:

<img width="404" alt="screen shot 2016-07-14 at 12 25 48 pm" src="https://cloud.githubusercontent.com/assets/717363/16827976/2576510a-49be-11e6-96dd-92a1fa464d36.png">

## How was this patch tested?

Genetate the docs locally by `SKIP_API=1 jekyll build` and view it in the browser.

Author: Shuai Lin <linshuai2012@gmail.com>

Closes #14194 from lins05/fix-docs-formatting.
2016-07-25 20:26:55 +01:00
Philipp Hoffmann 978cd5f125 [SPARK-15271][MESOS] Allow force pulling executor docker images
## What changes were proposed in this pull request?

Mesos agents by default will not pull docker images which are cached
locally already. In order to run Spark executors from mutable tags like
`:latest` this commit introduces a Spark setting
`spark.mesos.executor.docker.forcePullImage`. Setting this flag to
true will tell the Mesos agent to force pull the docker image (default is `false` which is consistent with the previous
implementation and Mesos' default
behaviour).

## How was this patch tested?

I ran a sample application including this change on a Mesos cluster and verified the correct behaviour for both, with and without, force pulling the executor image. As expected the image is being force pulled if the flag is set.

Author: Philipp Hoffmann <mail@philipphoffmann.de>

Closes #13051 from philipphoffmann/force-pull-image.
2016-07-25 20:14:47 +01:00
Reynold Xin dd784a8822 [SPARK-16685] Remove audit-release scripts.
## What changes were proposed in this pull request?
This patch removes dev/audit-release. It was initially created to do basic release auditing. They have been unused by for the last one year+.

## How was this patch tested?
N/A

Author: Reynold Xin <rxin@databricks.com>

Closes #14342 from rxin/SPARK-16685.
2016-07-25 20:03:54 +01:00
WeichenXu ad3708e783 [SPARK-16653][ML][OPTIMIZER] update ANN convergence tolerance param default to 1e-6
## What changes were proposed in this pull request?

replace ANN convergence tolerance param default
from 1e-4 to 1e-6

so that it will be the same with other algorithms in MLLib which use LBFGS as optimizer.

## How was this patch tested?

Existing Test.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #14286 from WeichenXu123/update_ann_tol.
2016-07-25 20:00:37 +01:00
Felix Cheung b73defdd79 [SPARKR][DOCS] fix broken url in doc
## What changes were proposed in this pull request?

Fix broken url, also,

sparkR.session.stop doc page should have it in the header, instead of saying "sparkR.stop"
![image](https://cloud.githubusercontent.com/assets/8969467/17080129/26d41308-50d9-11e6-8967-79d6c920313f.png)

Data type section is in the middle of a list of gapply/gapplyCollect subsections:
![image](https://cloud.githubusercontent.com/assets/8969467/17080122/f992d00a-50d8-11e6-8f2c-fd5786213920.png)

## How was this patch tested?

manual test

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #14329 from felixcheung/rdoclinkfix.
2016-07-25 11:25:41 -07:00
Cheng Lian 7ea6d282b9 [SPARK-16703][SQL] Remove extra whitespace in SQL generation for window functions
## What changes were proposed in this pull request?

This PR fixes a minor formatting issue of `WindowSpecDefinition.sql` when no partitioning expressions are present.

Before:

```sql
( ORDER BY `a` ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
```

After:

```sql
(ORDER BY `a` ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
```

## How was this patch tested?

New test case added in `ExpressionSQLBuilderSuite`.

Author: Cheng Lian <lian@databricks.com>

Closes #14334 from liancheng/window-spec-sql-format.
2016-07-25 09:42:39 -07:00
hyukjinkwon 79826f3c79 [SPARK-16698][SQL] Field names having dots should be allowed for datasources based on FileFormat
## What changes were proposed in this pull request?

It seems this is a regression assuming from https://issues.apache.org/jira/browse/SPARK-16698.

Field name having dots throws an exception. For example the codes below:

```scala
val path = "/tmp/path"
val json =""" {"a.b":"data"}"""
spark.sparkContext
  .parallelize(json :: Nil)
  .saveAsTextFile(path)
spark.read.json(path).collect()
```

throws an exception as below:

```
Unable to resolve a.b given [a.b];
org.apache.spark.sql.AnalysisException: Unable to resolve a.b given [a.b];
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
	at scala.Option.getOrElse(Option.scala:121)
```

This problem was introduced in 17eec0a71b (diff-27c76f96a7b2733ecfd6f46a1716e153R121)

When extracting the data columns, it does not count that it can contains dots in field names. Actually, it seems the fields name are not expected as quoted when defining schema. So, It not have to consider whether this is wrapped with quotes because the actual schema (inferred or user-given schema) would not have the quotes for fields.

For example, this throws an exception. (**Loading JSON from RDD is fine**)

```scala
val json =""" {"a.b":"data"}"""
val rdd = spark.sparkContext.parallelize(json :: Nil)
spark.read.schema(StructType(Seq(StructField("`a.b`", StringType, true))))
  .json(rdd).select("`a.b`").printSchema()
```

as below:

```
cannot resolve '```a.b```' given input columns: [`a.b`];
org.apache.spark.sql.AnalysisException: cannot resolve '```a.b```' given input columns: [`a.b`];
	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
```

## How was this patch tested?

Unit tests in `FileSourceStrategySuite`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14339 from HyukjinKwon/SPARK-16698-regression.
2016-07-25 22:51:30 +08:00
Sameer Agarwal d6a52176ad [SPARK-16668][TEST] Test parquet reader for row groups containing both dictionary and plain encoded pages
## What changes were proposed in this pull request?

This patch adds an explicit test for [SPARK-14217] by setting the parquet dictionary and page size the generated parquet file spans across 3 pages (within a single row group) where the first page is dictionary encoded and the remaining two are plain encoded.

## How was this patch tested?

1. ParquetEncodingSuite
2. Also manually tested that this test fails without https://github.com/apache/spark/pull/12279

Author: Sameer Agarwal <sameerag@cs.berkeley.edu>

Closes #14304 from sameeragarwal/hybrid-encoding-test.
2016-07-25 22:31:01 +08:00
Wenchen Fan 64529b186a [SPARK-16691][SQL] move BucketSpec to catalyst module and use it in CatalogTable
## What changes were proposed in this pull request?

It's weird that we have `BucketSpec` to abstract bucket info, but don't use it in `CatalogTable`. This PR moves `BucketSpec` into catalyst module.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14331 from cloud-fan/check.
2016-07-25 22:05:48 +08:00
Wenchen Fan d27d362eba [SPARK-16660][SQL] CreateViewCommand should not take CatalogTable
## What changes were proposed in this pull request?

`CreateViewCommand` only needs some information of a `CatalogTable`, but not all of them. We have some tricks(e.g. we need to check the table type is `VIEW`, we need to make `CatalogColumn.dataType` nullable) to allow it to take a `CatalogTable`.
This PR cleans it up and only pass in necessary information to `CreateViewCommand`.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14297 from cloud-fan/minor2.
2016-07-25 22:02:00 +08:00
hyukjinkwon 7ffd99ec5f [SPARK-16674][SQL] Avoid per-record type dispatch in JDBC when reading
## What changes were proposed in this pull request?

Currently, `JDBCRDD.compute` is doing type dispatch for each row to read appropriate values.
It might not have to be done like this because the schema is already kept in `JDBCRDD`.

So, appropriate converters can be created first according to the schema, and then apply them to each row.

## How was this patch tested?

Existing tests should cover this.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14313 from HyukjinKwon/SPARK-16674.
2016-07-25 19:57:47 +08:00
Cheng Lian 68b4020d0c [SPARK-16648][SQL] Make ignoreNullsExpr a child expression of First and Last
## What changes were proposed in this pull request?

Default `TreeNode.withNewChildren` implementation doesn't work for `Last` and when both constructor arguments are the same, e.g.:

```sql
LAST_VALUE(FALSE) -- The 2nd argument defaults to FALSE
LAST_VALUE(FALSE, FALSE)
LAST_VALUE(TRUE, TRUE)
```

This is because although `Last` is a unary expression, both of its constructor arguments, `child` and `ignoreNullsExpr`, are `Expression`s. When they have the same value, `TreeNode.withNewChildren` treats both of them as child nodes by mistake. `First` is also affected by this issue in exactly the same way.

This PR fixes this issue by making `ignoreNullsExpr` a child expression of `First` and `Last`.

## How was this patch tested?

New test case added in `WindowQuerySuite`.

Author: Cheng Lian <lian@databricks.com>

Closes #14295 from liancheng/spark-16648-last-value.
2016-07-25 17:22:29 +08:00
Qifan Pu 468a3c3ac5 [SPARK-16699][SQL] Fix performance bug in hash aggregate on long string keys
In the following code in `VectorizedHashMapGenerator.scala`:
```
    def hashBytes(b: String): String = {
      val hash = ctx.freshName("hash")
      s"""
         |int $result = 0;
         |for (int i = 0; i < $b.length; i++) {
         |  ${genComputeHash(ctx, s"$b[i]", ByteType, hash)}
         |  $result = ($result ^ (0x9e3779b9)) + $hash + ($result << 6) + ($result >>> 2);
         |}
       """.stripMargin
    }

```
when b=input.getBytes(), the current 2.0 code results in getBytes() being called n times, n being length of input. getBytes() involves memory copy is thus expensive and causes a performance degradation.
Fix is to evaluate getBytes() before the for loop.

Performance bug, no additional test added.

Author: Qifan Pu <qifan.pu@gmail.com>

Closes #14337 from ooq/SPARK-16699.

(cherry picked from commit d226dce12b)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2016-07-24 21:54:42 -07:00
Brian Cho daace60142 [SPARK-5581][CORE] When writing sorted map output file, avoid open / …
…close between each partition

## What changes were proposed in this pull request?

Replace commitAndClose with separate commit and close to avoid opening and closing
the file between partitions.

## How was this patch tested?

Run existing unit tests, add a few unit tests regarding reverts.

Observed a ~20% reduction in total time in tasks on stages with shuffle
writes to many partitions.

JoshRosen

Author: Brian Cho <bcho@fb.com>

Closes #13382 from dafrista/separatecommit-master.
2016-07-24 19:36:58 -07:00
Wenchen Fan 1221ce0402 [SPARK-16645][SQL] rename CatalogStorageFormat.serdeProperties to properties
## What changes were proposed in this pull request?

we also store data source table options in this field, it's unreasonable to call it `serdeProperties`.

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14283 from cloud-fan/minor1.
2016-07-25 09:28:56 +08:00
Mikael Ståldal 23e047f460 [SPARK-16416][CORE] force eager creation of loggers to avoid shutdown hook conflicts
## What changes were proposed in this pull request?

Force eager creation of loggers to avoid shutdown hook conflicts.

## How was this patch tested?

Manually tested with a project using Log4j 2, verified that the shutdown hook conflict issue was solved.

Author: Mikael Ståldal <mikael.staldal@magine.com>

Closes #14320 from mikaelstaldal/shutdown-hook-logging.
2016-07-24 11:16:24 +01:00
WeichenXu 37bed97de5 [PYSPARK] add picklable SparseMatrix in pyspark.ml.common
## What changes were proposed in this pull request?

add `SparseMatrix` class whick support pickler.

## How was this patch tested?

Existing test.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #14265 from WeichenXu123/picklable_py.
2016-07-24 02:29:08 -07:00
Dongjoon Hyun cc1d2dcb61 [SPARK-16463][SQL] Support truncate option in Overwrite mode for JDBC DataFrameWriter
## What changes were proposed in this pull request?

This PR adds a boolean option, `truncate`, for `SaveMode.Overwrite` of JDBC DataFrameWriter. If this option is `true`, it try to take advantage of `TRUNCATE TABLE` instead of `DROP TABLE`. This is a trivial option, but will provide great **convenience** for BI tool users based on RDBMS tables generated by Spark.

**Goal**
- Without `CREATE/DROP` privilege, we can save dataframe to database. Sometime these are not allowed for security.
- It will preserve the existing table information, so users can add and keep some additional `INDEX` and `CONSTRAINT`s for the table.
- Sometime, `TRUNCATE` is faster than the combination of `DROP/CREATE`.

**Supported DBMS**
The following is `truncate`-option support table. Due to the different behavior of `TRUNCATE TABLE` among DBMSs, it's not always safe to use `TRUNCATE TABLE`. Spark will ignore the `truncate` option for **unknown** and **some** DBMS with **default CASCADING** behavior. Newly added JDBCDialect should implement corresponding function to support `truncate` option additionally.

Spark Dialects | `truncate` OPTION SUPPORT
---------------|-------------------------------
MySQLDialect | O
PostgresDialect | X
DB2Dialect | O
MsSqlServerDialect | O
DerbyDialect | O
OracleDialect | O

**Before (TABLE with INDEX case)**: SparkShell & MySQL CLI are interleaved intentionally.
```scala
scala> val (url, prop)=("jdbc:mysql://localhost:3306/temp?useSSL=false", new java.util.Properties)
scala> prop.setProperty("user","root")
scala> df.write.mode("overwrite").jdbc(url, "table_with_index", prop)
scala> spark.range(10).write.mode("overwrite").jdbc(url, "table_with_index", prop)
mysql> DESC table_with_index;
+-------+------------+------+-----+---------+-------+
| Field | Type       | Null | Key | Default | Extra |
+-------+------------+------+-----+---------+-------+
| id    | bigint(20) | NO   |     | NULL    |       |
+-------+------------+------+-----+---------+-------+
mysql> CREATE UNIQUE INDEX idx_id ON table_with_index(id);
mysql> DESC table_with_index;
+-------+------------+------+-----+---------+-------+
| Field | Type       | Null | Key | Default | Extra |
+-------+------------+------+-----+---------+-------+
| id    | bigint(20) | NO   | PRI | NULL    |       |
+-------+------------+------+-----+---------+-------+
scala> spark.range(10).write.mode("overwrite").jdbc(url, "table_with_index", prop)
mysql> DESC table_with_index;
+-------+------------+------+-----+---------+-------+
| Field | Type       | Null | Key | Default | Extra |
+-------+------------+------+-----+---------+-------+
| id    | bigint(20) | NO   |     | NULL    |       |
+-------+------------+------+-----+---------+-------+
```

**After (TABLE with INDEX case)**
```scala
scala> spark.range(10).write.mode("overwrite").option("truncate", true).jdbc(url, "table_with_index", prop)
mysql> DESC table_with_index;
+-------+------------+------+-----+---------+-------+
| Field | Type       | Null | Key | Default | Extra |
+-------+------------+------+-----+---------+-------+
| id    | bigint(20) | NO   | PRI | NULL    |       |
+-------+------------+------+-----+---------+-------+
```

**Error Handling**
- In case of exceptions, Spark will not retry. Users should turn off the `truncate` option.
- In case of schema change:
  - If one of the column names changes, this will raise exceptions intuitively.
  - If there exists only type difference, this will work like Append mode.

## How was this patch tested?

Pass the Jenkins tests with a updated testcase.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14086 from dongjoon-hyun/SPARK-16410.
2016-07-24 09:25:02 +01:00
Liwei Lin d6795c7a25 [SPARK-16515][SQL][FOLLOW-UP] Fix test script on OS X/Windows...
## Problem

The current `sed` in `test_script.sh` is missing a `$`, leading to the failure of `script` test on OS X:
```
== Results ==
!== Correct Answer - 2 ==   == Spark Answer - 2 ==
![x1_y1]                    [x1]
![x2_y2]                    [x2]
```

In addition, this `script` test would also fail on systems like Windows where we couldn't be able to invoke `bash` or `echo | sed`.

## What changes were proposed in this pull request?
This patch
- fixes `sed` in `test_script.sh`
- adds command guards so that the `script` test would pass on systems like Windows

## How was this patch tested?

- Jenkins
- Manually verified tests pass on OS X

Author: Liwei Lin <lwlin7@gmail.com>

Closes #14280 from lw-lin/osx-sed.
2016-07-24 08:35:57 +01:00
Sean Owen e3c7039b45 [MINOR] Close old PRs that should be closed but have not been
Closes #11598
Closes #7278
Closes #13882
Closes #12053
Closes #14125
Closes #8760
Closes #12848
Closes #14224

Author: Sean Owen <sowen@cloudera.com>

Closes #14328 from srowen/CloseOldPRs.
2016-07-24 08:06:13 +01:00
Cheng Lian 53b2456d1d [SPARK-16380][EXAMPLES] Update SQL examples and programming guide for Python language binding
This PR is based on PR #14098 authored by wangmiao1981.

## What changes were proposed in this pull request?

This PR replaces the original Python Spark SQL example file with the following three files:

- `sql/basic.py`

  Demonstrates basic Spark SQL features.

- `sql/datasource.py`

  Demonstrates various Spark SQL data sources.

- `sql/hive.py`

  Demonstrates Spark SQL Hive interaction.

This PR also removes hard-coded Python example snippets in the SQL programming guide by extracting snippets from the above files using the `include_example` Liquid template tag.

## How was this patch tested?

Manually tested.

Author: wm624@hotmail.com <wm624@hotmail.com>
Author: Cheng Lian <lian@databricks.com>

Closes #14317 from liancheng/py-examples-update.
2016-07-23 11:41:24 -07:00
Wenchen Fan 86c2752066 [SPARK-16690][TEST] rename SQLTestUtils.withTempTable to withTempView
## What changes were proposed in this pull request?

after https://github.com/apache/spark/pull/12945, we renamed the `registerTempTable` to `createTempView`, as we do create a view actually. This PR renames `SQLTestUtils.withTempTable` to reflect this change.

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14318 from cloud-fan/minor4.
2016-07-23 11:39:48 -07:00
WeichenXu ab6e4aea5f [SPARK-16662][PYSPARK][SQL] fix HiveContext warning bug
## What changes were proposed in this pull request?

move the `HiveContext` deprecate warning printing statement into `HiveContext` constructor.
so that this warning will appear only when we use `HiveContext`
otherwise this warning will always appear if we reference the pyspark.ml.context code file.

## How was this patch tested?

Manual.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #14301 from WeichenXu123/hiveContext_python_warning_update.
2016-07-23 12:33:47 +01:00
WeichenXu 25db51675f [SPARK-16561][MLLIB] fix multivarOnlineSummary min/max bug
## What changes were proposed in this pull request?

renaming var names to make code more clear:
nnz => weightSum
weightSum => totalWeightSum

and add a new member vector `nnz` (not `nnz` in previous code, which renamed to `weightSum`) to count each dimensions non-zero value number.
using `nnz` which I added above instead of `weightSum` when calculating min/max so that it fix several numerical error in some extreme case.

## How was this patch tested?

A new testcase added.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #14216 from WeichenXu123/multivarOnlineSummary.
2016-07-23 12:32:30 +01:00
Liang-Chi Hsieh e10b8741d8 [SPARK-16622][SQL] Fix NullPointerException when the returned value of the called method in Invoke is null
## What changes were proposed in this pull request?

Currently we don't check the value returned by called method in `Invoke`. When the returned value is null and is assigned to a variable of primitive type, `NullPointerException` will be thrown.

## How was this patch tested?

Jenkins tests.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #14259 from viirya/agg-empty-ds.
2016-07-23 10:27:16 +08:00
Dongjoon Hyun 47f5b88db4 [SPARK-16651][PYSPARK][DOC] Make withColumnRenamed/drop description more consistent with Scala API
## What changes were proposed in this pull request?

`withColumnRenamed` and `drop` is a no-op if the given column name does not exists. Python documentation also describe that, but this PR adds more explicit line consistently with Scala to reduce the ambiguity.

## How was this patch tested?

It's about docs.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14288 from dongjoon-hyun/SPARK-16651.
2016-07-22 13:20:06 +01:00
Tom Graves 6c56fff118 [SPARK-16650] Improve documentation of spark.task.maxFailures
Clarify documentation on spark.task.maxFailures

No tests run as its documentation

Author: Tom Graves <tgraves@yahoo-inc.com>

Closes #14287 from tgravescs/SPARK-16650.
2016-07-22 12:41:38 +01:00
WeichenXu b4e16bd582 [GIT] add pydev & Rstudio project file to gitignore list
## What changes were proposed in this pull request?

Add Pydev & Rstudio project file to gitignore list, I think the two IEDs are used by many developers.
so that won't need personal gitignore_global config.

## How was this patch tested?

N/A

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #14293 from WeichenXu123/update_gitignore.
2016-07-22 12:40:41 +01:00
Ahmed Mahran 2c72a4432b [SPARK-16487][STREAMING] Fix some batches might not get marked as fully processed in JobGenerator
## What changes were proposed in this pull request?

In `JobGenerator`, the code reads like that some batches might not get marked as fully processed. In the following flowchart, the batch should get marked fully processed before endpoint C however it is not. Currently, this does not actually cause an issue, as the condition `(time - zeroTime) is multiple of checkpoint duration?` always evaluates to `true` as the `checkpoint duration` is always set to be equal to the `batch duration`.

![Flowchart](https://s31.postimg.org/udy9lti2j/spark_streaming_job_generator.png)

This PR fixes this issue so as to improve code readability and to avoid any potential issue in case there is any future change making checkpoint duration to be set different from batch duration.

Author: Ahmed Mahran <ahmed.mahran@mashin.io>

Closes #14145 from ahmed-mahran/b-mark-batch-fully-processed.
2016-07-22 12:39:12 +01:00
Jacek Laskowski e1bd70f44b [SPARK-16287][HOTFIX][BUILD][SQL] Fix annotation argument needs to be a constant
## What changes were proposed in this pull request?

Build fix for [SPARK-16287][SQL] Implement str_to_map SQL function that has introduced this compilation error:

```
/Users/jacek/dev/oss/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala:402: error: annotation argument needs to be a constant; found: "_FUNC_(text[, pairDelim, keyValueDelim]) - Creates a map after splitting the text ".+("into key/value pairs using delimiters. ").+("Default delimiters are \',\' for pairDelim and \':\' for keyValueDelim.")
    "into key/value pairs using delimiters. " +
                                              ^
```

## How was this patch tested?

Local build

Author: Jacek Laskowski <jacek@japila.pl>

Closes #14315 from jaceklaskowski/build-fix-complexTypeCreator.
2016-07-22 12:37:30 +01:00
gatorsmile 94f14b52a6 [SPARK-16556][SPARK-16559][SQL] Fix Two Bugs in Bucket Specification
### What changes were proposed in this pull request?

**Issue 1: Silent Ignorance of Bucket Specification When Creating Table Using Schema Inference**

When creating a data source table without explicit specification of schema or SELECT clause, we silently ignore the bucket specification (CLUSTERED BY... SORTED BY...) in [the code](ce3b98bae2/sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala (L339-L354)).

For example,
```SQL
CREATE TABLE jsonTable
USING org.apache.spark.sql.json
OPTIONS (
  path '${tempDir.getCanonicalPath}'
)
CLUSTERED BY (inexistentColumnA) SORTED BY (inexistentColumnB) INTO 2 BUCKETS
```

This PR captures it and issues an error message.

**Issue 2: Got a run-time `java.lang.ArithmeticException` when num of buckets is set to zero.**

For example,
```SQL
CREATE TABLE t USING PARQUET
OPTIONS (PATH '${path.toString}')
CLUSTERED BY (a) SORTED BY (b) INTO 0 BUCKETS
AS SELECT 1 AS a, 2 AS b
```
The exception we got is
```
ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 1.0 (TID 2)
java.lang.ArithmeticException: / by zero
```

This PR captures the misuse and issues an appropriate error message.

### How was this patch tested?
Added a test case in DDLSuite

Author: gatorsmile <gatorsmile@gmail.com>

Closes #14210 from gatorsmile/createTableWithoutSchema.
2016-07-22 13:27:17 +08:00
Sandeep Singh df2c6d59d0 [SPARK-16287][SQL] Implement str_to_map SQL function
## What changes were proposed in this pull request?
This PR adds `str_to_map` SQL function in order to remove Hive fallback.

## How was this patch tested?
Pass the Jenkins tests with newly added.

Author: Sandeep Singh <sandeep@techaddict.me>

Closes #13990 from techaddict/SPARK-16287.
2016-07-22 10:05:21 +08:00