Commit graph

16732 commits

Author SHA1 Message Date
Josh Rosen 921fa40b14 [SPARK-12712] Fix failure in ./dev/test-dependencies when run against empty .m2 cache
This patch fixes a bug in `./dev/test-dependencies.sh` which caused spurious failures when the script was run on a machine with an empty `.m2` cache. The problem was that extra log output from the dependency download was conflicting with the grep / regex used to identify the classpath in the Maven output. This patch fixes this issue by adjusting the regex pattern.

Tested manually with the following reproduction of the bug:

```
rm -rf ~/.m2/repository/org/apache/commons/
./dev/test-dependencies.sh
```

Author: Josh Rosen <joshrosen@databricks.com>

Closes #13568 from JoshRosen/SPARK-12712.
2016-06-09 00:51:24 -07:00
Sandeep Singh d5807def10 [MINOR][DOC] In Dataset docs, remove self link to Dataset and add link to Column
## What changes were proposed in this pull request?
Documentation Fix

## How was this patch tested?

Author: Sandeep Singh <sandeep@techaddict.me>

Closes #13567 from techaddict/minor-4.
2016-06-08 23:41:29 -07:00
Wenchen Fan afbe35cf5b [SPARK-14670] [SQL] allow updating driver side sql metrics
## What changes were proposed in this pull request?

On the SparkUI right now we have this SQLTab that displays accumulator values per operator. However, it only displays metrics updated on the executors, not on the driver. It is useful to also include driver metrics, e.g. broadcast time.

This is a different version from https://github.com/apache/spark/pull/12427. This PR sends driver side accumulator updates right after the updating happens, not at the end of execution, by a new event.

## How was this patch tested?

new test in `SQLListenerSuite`

![qq20160606-0](https://cloud.githubusercontent.com/assets/3182036/15841418/0eb137da-2c06-11e6-9068-5694eeb78530.png)

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13189 from cloud-fan/metrics.
2016-06-08 22:47:29 -07:00
Eric Liang 4e8ac6edd5 [SPARK-15735] Allow specifying min time to run in microbenchmarks
## What changes were proposed in this pull request?

This makes microbenchmarks run for at least 2 seconds by default, to allow some time for jit compilation to kick in.

## How was this patch tested?

Tested manually with existing microbenchmarks. This change is backwards compatible in that existing microbenchmarks which specified numIters per-case will still run exactly that number of iterations. Microbenchmarks which previously overrode defaultNumIters now override minNumIters.

cc hvanhovell

Author: Eric Liang <ekl@databricks.com>
Author: Eric Liang <ekhliang@gmail.com>

Closes #13472 from ericl/spark-15735.
2016-06-08 16:21:41 -07:00
prabs ca70ab27cc [DOCUMENTATION] Fixed target JAR path
## What changes were proposed in this pull request?

Mentioned Scala version in the sbt configuration file is 2.11, so the path of the target JAR should be `/target/scala-2.11/simple-project_2.11-1.0.jar`

## How was this patch tested?

n/a

Author: prabs <prabsmails@gmail.com>
Author: Prabeesh K <prabsmails@gmail.com>

Closes #13554 from prabeesh/master.
2016-06-08 17:22:55 +01:00
Sandeep Singh f958c1c3e2 [MINOR] Fix Java Lint errors introduced by #13286 and #13280
## What changes were proposed in this pull request?

revived #13464

Fix Java Lint errors introduced by #13286 and #13280
Before:
```
Using `mvn` from path: /Users/pichu/Project/spark/build/apache-maven-3.3.9/bin/mvn
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
Checkstyle checks failed at following occurrences:
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[340,5] (whitespace) FileTabCharacter: Line contains a tab character.
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[341,5] (whitespace) FileTabCharacter: Line contains a tab character.
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[342,5] (whitespace) FileTabCharacter: Line contains a tab character.
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[343,5] (whitespace) FileTabCharacter: Line contains a tab character.
[ERROR] src/main/java/org/apache/spark/sql/streaming/OutputMode.java:[41,28] (naming) MethodName: Method name 'Append' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
[ERROR] src/main/java/org/apache/spark/sql/streaming/OutputMode.java:[52,28] (naming) MethodName: Method name 'Complete' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
[ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[61,8] (imports) UnusedImports: Unused import - org.apache.parquet.schema.PrimitiveType.
[ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[62,8] (imports) UnusedImports: Unused import - org.apache.parquet.schema.Type.
```

## How was this patch tested?
ran `dev/lint-java` locally

Author: Sandeep Singh <sandeep@techaddict.me>

Closes #13559 from techaddict/minor-3.
2016-06-08 14:51:00 +01:00
yinxusen 87706eb66c [SPARK-15793][ML] Add maxSentenceLength for ml.Word2Vec
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-15793

Word2vec in ML package should have maxSentenceLength method for feature parity.

## How was this patch tested?

Tested with Spark unit test.

Author: yinxusen <yinxusen@gmail.com>

Closes #13536 from yinxusen/SPARK-15793.
2016-06-08 09:18:04 +01:00
Herman van Hovell 91fbc880b6 [SPARK-15789][SQL] Allow reserved keywords in most places
## What changes were proposed in this pull request?
The parser currently does not allow the use of some SQL keywords as table or field names. This PR adds supports for all keywords as identifier. The exception to this are table aliases, in this case most keywords are allowed except for join keywords (```anti, full, inner, left, semi, right, natural, on, join, cross```) and set-operator keywords (```union, intersect, except```).

## How was this patch tested?
I have added/move/renamed test in the catalyst `*ParserSuite`s.

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #13534 from hvanhovell/SPARK-15789.
2016-06-07 17:01:11 -07:00
Shixiong Zhu 0cfd6192f3 [SPARK-15580][SQL] Add ContinuousQueryInfo to make ContinuousQueryListener events serializable
## What changes were proposed in this pull request?

This PR adds ContinuousQueryInfo to make ContinuousQueryListener events serializable in order to support writing events into the event log.

## How was this patch tested?

Jenkins unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #13335 from zsxwing/query-info.
2016-06-07 16:40:03 -07:00
zhonghaihua 695dbc816a [SPARK-14485][CORE] ignore task finished for executor lost and removed by driver
Now, when executor is removed by driver with heartbeats timeout, driver will re-queue the task on this executor and send a kill command to cluster to kill this executor.
But, in a situation, the running task of this executor is finished and return result to driver before this executor killed by kill command sent by driver. At this situation, driver will accept the task finished event and ignore speculative task and re-queued task.
But, as we know, this executor has removed by driver, the result of this finished task can not save in driver because the BlockManagerId has also removed from BlockManagerMaster by driver. So, the result data of this stage is not complete, and then, it will cause fetch failure. For more details, [link to jira issues SPARK-14485](https://issues.apache.org/jira/browse/SPARK-14485)
This PR introduce a mechanism to ignore this kind of task finished.

N/A

Author: zhonghaihua <793507405@qq.com>

Closes #12258 from zhonghaihua/ignoreTaskFinishForExecutorLostAndRemovedByDriver.
2016-06-07 16:32:27 -07:00
Yanbo Liang 6ecedf39b4 [SPARK-13590][ML][DOC] Document spark.ml LiR, LoR and AFTSurvivalRegression behavior difference
## What changes were proposed in this pull request?
When fitting ```LinearRegressionModel```(by "l-bfgs" solver) and ```LogisticRegressionModel``` w/o intercept on dataset with constant nonzero column, spark.ml produce same model as R glmnet but different from LIBSVM.

When fitting ```AFTSurvivalRegressionModel``` w/o intercept on dataset with constant nonzero column, spark.ml produce different model compared with R survival::survreg.

We should output a warning message and clarify in document for this condition.

## How was this patch tested?
Document change, no unit test.

cc mengxr

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12731 from yanboliang/spark-13590.
2016-06-07 15:25:36 -07:00
Sean Zhong 890baaca50 [SPARK-15674][SQL] Deprecates "CREATE TEMPORARY TABLE USING...", uses "CREAT TEMPORARY VIEW USING..." instead
## What changes were proposed in this pull request?

The current implementation of "CREATE TEMPORARY TABLE USING datasource..." is NOT creating any intermediate temporary data directory like temporary HDFS folder, instead, it only stores a SQL string in memory. Probably we should use "TEMPORARY VIEW" instead.

This PR assumes a temporary table has to link with some temporary intermediate data. It follows the definition of temporary table like this (from [hortonworks doc](https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_dataintegration/content/temp-tables.html)):
> A temporary table is a convenient way for an application to automatically manage intermediate data generated during a complex query

**Example**:

```
scala> spark.sql("CREATE temporary view  my_tab7 (c1: String, c2: String)  USING org.apache.spark.sql.execution.datasources.csv.CSVFileFormat  OPTIONS (PATH '/Users/seanzhong/csv/cars.csv')")
scala> spark.sql("select c1, c2 from my_tab7").show()
+----+-----+
|  c1|   c2|
+----+-----+
|year| make|
|2012|Tesla|
...
```

It NOW prints a **deprecation warning** if "CREATE TEMPORARY TABLE USING..." is used.

```
scala> spark.sql("CREATE temporary table  my_tab7 (c1: String, c2: String)  USING org.apache.spark.sql.execution.datasources.csv.CSVFileFormat  OPTIONS (PATH '/Users/seanzhong/csv/cars.csv')")
16/05/31 10:39:27 WARN SparkStrategies$DDLStrategy: CREATE TEMPORARY TABLE tableName USING... is deprecated, please use CREATE TEMPORARY VIEW viewName USING... instead
```

## How was this patch tested?

Unit test.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13414 from clockfly/create_temp_view_using.
2016-06-07 15:21:55 -07:00
Marcelo Vanzin 200f01c8fb [SPARK-15760][DOCS] Add documentation for package-related configs.
While there, also document spark.files and spark.jars. Text is the
same as the spark-submit help text with some minor adjustments.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #13502 from vanzin/SPARK-15760.
2016-06-07 09:28:39 -07:00
wm624@hotmail.com 3ec4461c46 [SPARK-15684][SPARKR] Not mask startsWith and endsWith in R
## What changes were proposed in this pull request?

In R 3.3.0, startsWith and endsWith are added. In this PR, I make the two work in SparkR.
1. Remove signature in generic.R
2. Add setMethod in column.R
3. Add unit tests

## How was this patch tested?
Manually test it through SparkR shell for both column data and string data, which are added into the unit test file.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #13476 from wangmiao1981/start.
2016-06-07 09:13:18 -07:00
WeichenXu 1e2c931187 [MINOR] fix typo in documents
## What changes were proposed in this pull request?

I use spell check tools checks typo in spark documents and fix them.

## How was this patch tested?

N/A

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #13538 from WeichenXu123/fix_doc_typo.
2016-06-07 13:29:27 +01:00
Sean Zhong 5f731d6859 [SPARK-15792][SQL] Allows operator to change the verbosity in explain output
## What changes were proposed in this pull request?

This PR allows customization of verbosity in explain output. After change, `dataframe.explain()` and `dataframe.explain(true)` has different verbosity output for physical plan.

Currently, this PR only enables verbosity string for operator `HashAggregateExec` and `SortAggregateExec`. We will gradually enable verbosity string for more operators in future.

**Less verbose mode:** dataframe.explain(extended = false)

`output=[count(a)#85L]` is **NOT** displayed for HashAggregate.

```
scala> Seq((1,2,3)).toDF("a", "b", "c").createTempView("df2")
scala> spark.sql("select count(a) from df2").explain()
== Physical Plan ==
*HashAggregate(key=[], functions=[count(1)])
+- Exchange SinglePartition
   +- *HashAggregate(key=[], functions=[partial_count(1)])
      +- LocalTableScan
```

**Verbose mode:** dataframe.explain(extended = true)

`output=[count(a)#85L]` is displayed for HashAggregate.

```
scala> spark.sql("select count(a) from df2").explain(true)  // "output=[count(a)#85L]" is added
...
== Physical Plan ==
*HashAggregate(key=[], functions=[count(1)], output=[count(a)#85L])
+- Exchange SinglePartition
   +- *HashAggregate(key=[], functions=[partial_count(1)], output=[count#87L])
      +- LocalTableScan
```

## How was this patch tested?

Manual test.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13535 from clockfly/verbose_breakdown_2.
2016-06-06 22:59:25 -07:00
Sean Zhong 0e0904a2fc [SPARK-15632][SQL] Typed Filter should NOT change the Dataset schema
## What changes were proposed in this pull request?

This PR makes sure the typed Filter doesn't change the Dataset schema.

**Before the change:**

```
scala> val df = spark.range(0,9)
scala> df.schema
res12: org.apache.spark.sql.types.StructType = StructType(StructField(id,LongType,false))
scala> val afterFilter = df.filter(_=>true)
scala> afterFilter.schema   // !!! schema is CHANGED!!! Column name is changed from id to value, nullable is changed from false to true.
res13: org.apache.spark.sql.types.StructType = StructType(StructField(value,LongType,true))

```

SerializeFromObject and DeserializeToObject are inserted to wrap the Filter, and these two can possibly change the schema of Dataset.

**After the change:**

```
scala> afterFilter.schema   // schema is NOT changed.
res47: org.apache.spark.sql.types.StructType = StructType(StructField(id,LongType,false))
```

## How was this patch tested?

Unit test.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13529 from clockfly/spark-15632.
2016-06-06 22:40:21 -07:00
Subroto Sanyal c409e23abd [SPARK-15652][LAUNCHER] Added a new State (LOST) for the listeners of SparkLauncher
## What changes were proposed in this pull request?
This situation can happen when the LauncherConnection gets an exception while reading through the socket and terminating silently without notifying making the client/listener think that the job is still in previous state.
The fix force sends a notification to client that the job finished with unknown status and let client handle it accordingly.

## How was this patch tested?
Added a unit test.

Author: Subroto Sanyal <ssanyal@datameer.com>

Closes #13497 from subrotosanyal/SPARK-15652-handle-spark-submit-jvm-crash.
2016-06-06 16:05:40 -07:00
Imran Rashid 36d3dfa59a [SPARK-15783][CORE] still some flakiness in these blacklist tests so ignore for now
## What changes were proposed in this pull request?

There is still some flakiness in BlacklistIntegrationSuite, so turning it off for the moment to avoid breaking more builds -- will turn it back with more fixes.

## How was this patch tested?

jenkins.

Author: Imran Rashid <irashid@cloudera.com>

Closes #13528 from squito/ignore_blacklist.
2016-06-06 12:53:11 -07:00
Josh Rosen 0b8d694999 [SPARK-15764][SQL] Replace N^2 loop in BindReferences
BindReferences contains a n^2 loop which causes performance issues when operating over large schemas: to determine the ordinal of an attribute reference, we perform a linear scan over the `input` array. Because input can sometimes be a `List`, the call to `input(ordinal).nullable` can also be O(n).

Instead of performing a linear scan, we can convert the input into an array and build a hash map to map from expression ids to ordinals. The greater up-front cost of the map construction is offset by the fact that an expression can contain multiple attribute references, so the cost of the map construction is amortized across a number of lookups.

Perf. benchmarks to follow. /cc ericl

Author: Josh Rosen <joshrosen@databricks.com>

Closes #13505 from JoshRosen/bind-references-improvement.
2016-06-06 11:44:51 -07:00
Joseph K. Bradley 4c74ee8d8e [SPARK-15721][ML] Make DefaultParamsReadable, DefaultParamsWritable public
## What changes were proposed in this pull request?

Made DefaultParamsReadable, DefaultParamsWritable public.  Also added relevant doc and annotations.  Added UnaryTransformerExample to demonstrate use of UnaryTransformer and DefaultParamsReadable,Writable.

## How was this patch tested?

Wrote example making use of the now-public APIs.  Compiled and ran locally

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #13461 from jkbradley/defaultparamswritable.
2016-06-06 09:49:45 -07:00
Dhruve Ashar fa4bc8ea8b [SPARK-14279][BUILD] Pick the spark version from pom
## What changes were proposed in this pull request?
Change the way spark picks up version information. Also embed the build information to better identify the spark version running.

More context can be found here : https://github.com/apache/spark/pull/12152

## How was this patch tested?
Ran the mvn and sbt builds to verify the version information was being displayed correctly on executing <code>spark-submit --version </code>

![image](https://cloud.githubusercontent.com/assets/7732317/15197251/f7c673a2-1795-11e6-8b2f-88f2a70cf1c1.png)

Author: Dhruve Ashar <dhruveashar@gmail.com>

Closes #13061 from dhruve/impr/SPARK-14279.
2016-06-06 09:42:50 -07:00
Zheng RuiFeng 00ad4f054c [SPARK-14900][ML][PYSPARK] Add accuracy and deprecate precison,recall,f1
## What changes were proposed in this pull request?
1, add accuracy for MulticlassMetrics
2, deprecate overall precision,recall,f1 and recommend accuracy usage

## How was this patch tested?
manual tests in pyspark shell

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #13511 from zhengruifeng/deprecate_py_precisonrecall.
2016-06-06 15:19:22 +01:00
Yanbo Liang a95252823e [SPARK-15771][ML][EXAMPLES] Use 'accuracy' rather than 'precision' in many ML examples
## What changes were proposed in this pull request?
Since [SPARK-15617](https://issues.apache.org/jira/browse/SPARK-15617) deprecated ```precision``` in ```MulticlassClassificationEvaluator```, many ML examples broken.
```python
pyspark.sql.utils.IllegalArgumentException: u'MulticlassClassificationEvaluator_4c3bb1d73d8cc0cedae6 parameter metricName given invalid value precision.'
```
We should use ```accuracy``` to replace ```precision``` in these examples.

## How was this patch tested?
Offline tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #13519 from yanboliang/spark-15771.
2016-06-06 09:36:34 +01:00
Zheng RuiFeng fd8af39713 [MINOR] Fix Typos 'an -> a'
## What changes were proposed in this pull request?

`an -> a`

Use cmds like `find . -name '*.R' | xargs -i sh -c "grep -in ' an [^aeiou]' {} && echo {}"` to generate candidates, and review them one by one.

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #13515 from zhengruifeng/an_a.
2016-06-06 09:35:47 +01:00
Reynold Xin 32f2f95dbd Revert "[SPARK-15585][SQL] Fix NULL handling along with a spark-csv behaivour"
This reverts commit b7e8d1cb3c.
2016-06-05 23:40:13 -07:00
Takeshi YAMAMURO b7e8d1cb3c [SPARK-15585][SQL] Fix NULL handling along with a spark-csv behaivour
## What changes were proposed in this pull request?
This pr fixes the behaviour of `format("csv").option("quote", null)` along with one of spark-csv.
Also, it explicitly sets default values for CSV options in python.

## How was this patch tested?
Added tests in CSVSuite.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #13372 from maropu/SPARK-15585.
2016-06-05 23:35:04 -07:00
Hiroshi Inoue 79268aa461 [SPARK-15704][SQL] add a test case in DatasetAggregatorSuite for regression testing
## What changes were proposed in this pull request?

This change fixes a crash in TungstenAggregate while executing "Dataset complex Aggregator" test case due to IndexOutOfBoundsException.

jira entry for detail: https://issues.apache.org/jira/browse/SPARK-15704

## How was this patch tested?
Using existing unit tests (including DatasetBenchmark)

Author: Hiroshi Inoue <inouehrs@jp.ibm.com>

Closes #13446 from inouehrs/fix_aggregate.
2016-06-05 20:10:33 -07:00
Josh Rosen 26c1089c37 [SPARK-15748][SQL] Replace inefficient foldLeft() call with flatMap() in PartitionStatistics
`PartitionStatistics` uses `foldLeft` and list concatenation (`++`) to flatten an iterator of lists, but this is extremely inefficient compared to simply doing `flatMap`/`flatten` because it performs many unnecessary object allocations. Simply replacing this `foldLeft` by a `flatMap` results in decent performance gains when constructing PartitionStatistics instances for tables with many columns.

This patch fixes this and also makes two similar changes in MLlib and streaming to try to fix all known occurrences of this pattern.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #13491 from JoshRosen/foldleft-to-flatmap.
2016-06-05 16:51:00 -07:00
Wenchen Fan 30c4774f33 [SPARK-15657][SQL] RowEncoder should validate the data type of input object
## What changes were proposed in this pull request?

This PR improves the error handling of `RowEncoder`. When we create a `RowEncoder` with a given schema, we should validate the data type of input object. e.g. we should throw an exception when a field is boolean but is declared as a string column.

This PR also removes the support to use `Product` as a valid external type of struct type.  This support is added at https://github.com/apache/spark/pull/9712, but is incomplete, e.g. nested product, product in array are both not working.  However, we never officially support this feature and I think it's ok to ban it.

## How was this patch tested?

new tests in `RowEncoderSuite`.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13401 from cloud-fan/bug.
2016-06-05 15:59:52 -07:00
Kai Jiang 8a9110510c [MINOR][R][DOC] Fix R documentation generation instruction.
## What changes were proposed in this pull request?
changes in R/README.md

- Make step of generating SparkR document more clear.
- link R/DOCUMENTATION.md from R/README.md
- turn on some code syntax highlight in R/README.md

## How was this patch tested?
local test

Author: Kai Jiang <jiangkai@gmail.com>

Closes #13488 from vectorijk/R-Readme.
2016-06-05 13:03:02 -07:00
Zheng RuiFeng 372fa61f51 [SPARK-15770][ML] Annotation audit for Experimental and DeveloperApi
## What changes were proposed in this pull request?
1, remove comments `:: Experimental ::` for non-experimental API
2, add comments `:: Experimental ::` for experimental API
3, add comments `:: DeveloperApi ::` for developerApi API

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #13514 from zhengruifeng/del_experimental.
2016-06-05 11:55:25 -07:00
Brett Randall 4e767d0f90 [SPARK-15723] Fixed local-timezone-brittle test where short-timezone form "EST" is …
## What changes were proposed in this pull request?

Stop using the abbreviated and ambiguous timezone "EST" in a test, since it is machine-local default timezone dependent, and fails in different timezones.  Fixed [SPARK-15723](https://issues.apache.org/jira/browse/SPARK-15723).

## How was this patch tested?

Note that to reproduce this problem in any locale/timezone, you can modify the scalatest-maven-plugin argLine to add a timezone:

    <argLine>-ea -Xmx3g -XX:MaxPermSize=${MaxPermGen} -XX:ReservedCodeCacheSize=${CodeCacheSize} -Duser.timezone="Australia/Sydney"</argLine>

and run

    $ mvn test -DwildcardSuites=org.apache.spark.status.api.v1.SimpleDateParamSuite -Dtest=none. Equally this will fix it in an effected timezone:

    <argLine>-ea -Xmx3g -XX:MaxPermSize=${MaxPermGen} -XX:ReservedCodeCacheSize=${CodeCacheSize} -Duser.timezone="America/New_York"</argLine>

To test the fix, apply the above change to `pom.xml` to set test TZ to `Australia/Sydney`, and confirm the test now passes.

Author: Brett Randall <javabrett@gmail.com>

Closes #13462 from javabrett/SPARK-15723-SimpleDateParamSuite.
2016-06-05 15:31:56 +01:00
Weiqing Yang 0f307db5e1 [SPARK-15707][SQL] Make Code Neat - Use map instead of if check.
## What changes were proposed in this pull request?
In forType function of object RandomDataGenerator, the code following:
if (maybeSqlTypeGenerator.isDefined){
  ....
  Some(generator)
} else{
 None
}
will be changed. Instead, maybeSqlTypeGenerator.map will be used.

## How was this patch tested?
All of the current unit tests passed.

Author: Weiqing Yang <yangweiqing001@gmail.com>

Closes #13448 from Sherry302/master.
2016-06-04 22:44:03 +01:00
Josh Rosen 091f81e1f7 [SPARK-15762][SQL] Cache Metadata & StructType hashCodes; use singleton Metadata.empty
We should cache `Metadata.hashCode` and use a singleton for `Metadata.empty` because calculating metadata hashCodes appears to be a bottleneck for certain workloads.

We should also cache `StructType.hashCode`.

In an optimizer stress-test benchmark run by ericl, these `hashCode` calls accounted for roughly 40% of the total CPU time and this bottleneck was completely eliminated by the caching added by this patch.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #13504 from JoshRosen/metadata-fix.
2016-06-04 14:14:50 -07:00
Sean Owen 681387b2dc [MINOR][BUILD] Add modernizr MIT license; specify "2014 and onwards" in license copyright
## What changes were proposed in this pull request?

Per conversation on dev list, add missing modernizr license.
Specify "2014 and onwards" in copyright statement.

## How was this patch tested?

(none required)

Author: Sean Owen <sowen@cloudera.com>

Closes #13510 from srowen/ModernizrLicense.
2016-06-04 21:41:27 +01:00
Ruifeng Zheng 2099e05f93 [SPARK-15617][ML][DOC] Clarify that fMeasure in MulticlassMetrics is "micro" f1_score
## What changes were proposed in this pull request?
1, del precision,recall in  `ml.MulticlassClassificationEvaluator`
2, update user guide for `mlllib.weightedFMeasure`

## How was this patch tested?
local build

Author: Ruifeng Zheng <ruifengz@foxmail.com>

Closes #13390 from zhengruifeng/clarify_f1.
2016-06-04 13:56:04 +01:00
Lianhui Wang 2ca563cc45 [SPARK-15756][SQL] Support command 'create table stored as orcfile/parquetfile/avrofile'
## What changes were proposed in this pull request?
Now Spark SQL can support 'create table src stored as orc/parquet/avro' for orc/parquet/avro table. But Hive can support  both commands: ' stored as orc/parquet/avro' and 'stored as orcfile/parquetfile/avrofile'.
So this PR supports these keywords 'orcfile/parquetfile/avrofile' in Spark SQL.

## How was this patch tested?
add unit tests

Author: Lianhui Wang <lianhuiwang09@gmail.com>

Closes #13500 from lianhuiwang/SPARK-15756.
2016-06-03 22:19:22 -07:00
Subroto Sanyal 61d729abda [SPARK-15754][YARN] Not letting the credentials containing hdfs delegation tokens to be added in current user credential.
## What changes were proposed in this pull request?
The credentials are not added to the credentials of UserGroupInformation.getCurrentUser(). Further if the client has possibility to login using keytab then the updateDelegationToken thread is not started on client.

## How was this patch tested?
ran dev/run-tests

Author: Subroto Sanyal <ssanyal@datameer.com>

Closes #13499 from subrotosanyal/SPARK-15754-save-ugi-from-changing.
2016-06-03 16:50:00 -07:00
Davies Liu 3074f575a3 [SPARK-15391] [SQL] manage the temporary memory of timsort
## What changes were proposed in this pull request?

Currently, the memory for temporary buffer used by TimSort is always allocated as on-heap without bookkeeping, it could cause OOM both in on-heap and off-heap mode.

This PR will try to manage that by preallocate it together with the pointer array, same with RadixSort. It both works for on-heap and off-heap mode.

This PR also change the loadFactor of BytesToBytesMap to 0.5 (it was 0.70), it enables use to radix sort also makes sure that we have enough memory for timsort.

## How was this patch tested?

Existing tests.

Author: Davies Liu <davies@databricks.com>

Closes #13318 from davies/fix_timsort.
2016-06-03 16:45:09 -07:00
Holden Karau 67cc89ff02 [SPARK-15168][PYSPARK][ML] Add missing params to MultilayerPerceptronClassifier
## What changes were proposed in this pull request?

MultilayerPerceptronClassifier is missing step size, solver, and weights. Add these params. Also clarify the scaladoc a bit while we are updating these params.

Eventually we should follow up and unify the HasSolver params (filed https://issues.apache.org/jira/browse/SPARK-15169 )

## How was this patch tested?

Doc tests

Author: Holden Karau <holden@us.ibm.com>

Closes #12943 from holdenk/SPARK-15168-add-missing-params-to-MultilayerPerceptronClassifier.
2016-06-03 15:56:17 -07:00
Andrew Or b1cc7da3e3 [SPARK-15722][SQL] Disallow specifying schema in CTAS statement
## What changes were proposed in this pull request?

As of this patch, the following throws an exception because the schemas may not match:
```
CREATE TABLE students (age INT, name STRING) AS SELECT * FROM boxes
```
but this is OK:
```
CREATE TABLE students AS SELECT * FROM boxes
```

## How was this patch tested?

SQLQuerySuite, HiveDDLCommandSuite

Author: Andrew Or <andrew@databricks.com>

Closes #13490 from andrewor14/ctas-no-column.
2016-06-03 14:39:41 -07:00
Wenchen Fan 11c83f83d5 [SPARK-15140][SQL] make the semantics of null input object for encoder clear
## What changes were proposed in this pull request?

For input object of non-flat type, we can't encode it to row if it's null, as Spark SQL doesn't allow row to be null, only its columns can be null.

This PR explicitly add this constraint and throw exception if users break it.

## How was this patch tested?

several new tests

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13469 from cloud-fan/null-object.
2016-06-03 14:28:19 -07:00
Xin Wu 28ad0f7b0d [SPARK-15681][CORE] allow lowercase or mixed case log level string when calling sc.setLogLevel
## What changes were proposed in this pull request?
Currently `SparkContext API setLogLevel(level: String) `can not handle lower case or mixed case input string. But `org.apache.log4j.Level.toLevel` can take lowercase or mixed case.

This PR is to allow case-insensitive user input for the log level.

## How was this patch tested?
A unit testcase is added.

Author: Xin Wu <xinwu@us.ibm.com>

Closes #13422 from xwu0226/reset_loglevel.
2016-06-03 14:26:48 -07:00
Wenchen Fan 61b80d552a [SPARK-15547][SQL] nested case class in encoder can have different number of fields from the real schema
## What changes were proposed in this pull request?

There are 2 kinds of `GetStructField`:

1. resolved from `UnresolvedExtractValue`, and it will have a `name` property.
2. created when we build deserializer expression for nested tuple, no `name` property.

When we want to validate the ordinals of nested tuple, we should only catch `GetStructField` without the name property.

## How was this patch tested?

new test in `EncoderResolutionSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13474 from cloud-fan/ordinal-check.
2016-06-03 14:26:24 -07:00
gatorsmile eb10b481ca [SPARK-15286][SQL] Make the output readable for EXPLAIN CREATE TABLE and DESC EXTENDED
#### What changes were proposed in this pull request?
Before this PR, the output of EXPLAIN of following SQL is like

```SQL
CREATE EXTERNAL TABLE extTable_with_partitions (key INT, value STRING)
PARTITIONED BY (ds STRING, hr STRING)
LOCATION '/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-b39a6185-8981-403b-a4aa-36fb2f4ca8a9'
```
``ExecutedCommand CreateTableCommand CatalogTable(`extTable_with_partitions`,CatalogTableType(EXTERNAL),CatalogStorageFormat(Some(/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-dd234718-e85d-4c5a-8353-8f1834ac0323),Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(key,int,true,None), CatalogColumn(value,string,true,None), CatalogColumn(ds,string,true,None), CatalogColumn(hr,string,true,None)),List(ds, hr),List(),List(),-1,,1463026413544,-1,Map(),None,None,None), false``

After this PR, the output is like

```
ExecutedCommand
:  +- CreateTableCommand CatalogTable(
	Table:`extTable_with_partitions`
	Created:Thu Jun 02 21:30:54 PDT 2016
	Last Access:Wed Dec 31 15:59:59 PST 1969
	Type:EXTERNAL
	Schema:[`key` int, `value` string, `ds` string, `hr` string]
	Partition Columns:[`ds`, `hr`]
	Storage(Location:/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-a06083b8-8e88-4d07-9ff0-d6bd8d943ad3, InputFormat:org.apache.hadoop.mapred.TextInputFormat, OutputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat)), false
```

This is also applicable to `DESC EXTENDED`. However, this does not have special handling for Data Source Tables. If needed, we need to move the logics of `DDLUtil`. Let me know if we should do it in this PR. Thanks! rxin liancheng

#### How was this patch tested?
Manual testing

Author: gatorsmile <gatorsmile@gmail.com>

Closes #13070 from gatorsmile/betterExplainCatalogTable.
2016-06-03 13:56:22 -07:00
Josh Rosen e526913989 [SPARK-15742][SQL] Reduce temp collections allocations in TreeNode transform methods
In Catalyst's TreeNode transform methods we end up calling `productIterator.map(...).toArray` in a number of places, which is slightly inefficient because it needs to allocate an `ArrayBuilder` and grow a temporary array. Since we already know the size of the final output (`productArity`), we can simply allocate an array up-front and use a while loop to consume the iterator and populate the array.

For most workloads, this performance difference is negligible but it does make a measurable difference in optimizer performance for queries that operate over very wide schemas (such as the benchmark queries in #13456).

### Perf results (from #13456 benchmarks)

**Before**

```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_66-b17 on Mac OS X 10.10.5
Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz

parsing large select:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
1 select expressions                            19 /   22          0.0    19119858.0       1.0X
10 select expressions                           23 /   25          0.0    23208774.0       0.8X
100 select expressions                          55 /   73          0.0    54768402.0       0.3X
1000 select expressions                        229 /  259          0.0   228606373.0       0.1X
2500 select expressions                        530 /  554          0.0   529938178.0       0.0X
```

**After**

```
parsing large select:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
1 select expressions                            15 /   21          0.0    14978203.0       1.0X
10 select expressions                           22 /   27          0.0    22492262.0       0.7X
100 select expressions                          48 /   64          0.0    48449834.0       0.3X
1000 select expressions                        189 /  208          0.0   189346428.0       0.1X
2500 select expressions                        429 /  449          0.0   428943897.0       0.0X
```

###

Author: Josh Rosen <joshrosen@databricks.com>

Closes #13484 from JoshRosen/treenode-productiterator-map.
2016-06-03 13:53:02 -07:00
Devaraj K efd3b11a47 [SPARK-15665][CORE] spark-submit --kill and --status are not working
## What changes were proposed in this pull request?
--kill and --status were not considered while handling in OptionParser and due to that it was failing. Now handling the --kill and --status options as part of OptionParser.handle.

## How was this patch tested?
Added a test org.apache.spark.launcher.SparkSubmitCommandBuilderSuite.testCliKillAndStatus() and also I have verified these manually by running --kill and --status commands.

Author: Devaraj K <devaraj@apache.org>

Closes #13407 from devaraj-kavali/SPARK-15665.
2016-06-03 13:03:28 -07:00
Ioana Delaney 9e2eb13ca5 [SPARK-15677][SQL] Query with scalar sub-query in the SELECT list throws UnsupportedOperationException
## What changes were proposed in this pull request?
Queries with scalar sub-query in the SELECT list run against a local, in-memory relation throw
UnsupportedOperationException exception.

Problem repro:
```SQL
scala> Seq((1, 1), (2, 2)).toDF("c1", "c2").createOrReplaceTempView("t1")
scala> Seq((1, 1), (2, 2)).toDF("c1", "c2").createOrReplaceTempView("t2")
scala> sql("select (select min(c1) from t2) from t1").show()

java.lang.UnsupportedOperationException: Cannot evaluate expression: scalar-subquery#62 []
  at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:215)
  at org.apache.spark.sql.catalyst.expressions.ScalarSubquery.eval(subquery.scala:62)
  at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:142)
  at org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:45)
  at org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:29)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.immutable.List.map(List.scala:285)
  at org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$37.applyOrElse(Optimizer.scala:1473)
```
The problem is specific to local, in memory relations. It is caused by rule ConvertToLocalRelation, which attempts to push down
a scalar-subquery expression to the local tables.

The solution prevents the rule to apply if Project references scalar subqueries.

## How was this patch tested?
Added regression tests to SubquerySuite.scala

Author: Ioana Delaney <ioanamdelaney@gmail.com>

Closes #13418 from ioana-delaney/scalarSubV2.
2016-06-03 12:04:27 -07:00
bomeng 8fa00dd05f [SPARK-15737][CORE] fix jetty warning
## What changes were proposed in this pull request?

After upgrading Jetty to 9.2, we always see "WARN org.eclipse.jetty.server.handler.AbstractHandler: No Server set for org.eclipse.jetty.server.handler.ErrorHandler" while running any test cases.

This PR will fix it.

## How was this patch tested?

The existing test cases will cover it.

Author: bomeng <bmeng@us.ibm.com>

Closes #13475 from bomeng/SPARK-15737.
2016-06-03 09:59:15 -07:00