## What changes were proposed in this pull request?
In HDFS, when we copy a file into the target directory, a temporary `._COPY_` file exists for a period of time; how long depends on the file size. If we do not skip this file, we may read the same data twice.
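As a minimal sketch of the intended behaviour (the helper below is hypothetical, not the actual patch), the fix amounts to filtering such in-flight copy files out when listing new files:

```scala
import org.apache.hadoop.fs.Path

// Hypothetical helper illustrating the idea: ignore HDFS in-flight copy files
// (e.g. "data.txt._COPY_") so the same data is not read again once the copy completes.
def isInFlightCopy(path: Path): Boolean =
  path.getName.endsWith("._COPY_")

def shouldProcess(path: Path): Boolean = !isInFlightCopy(path)
```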
## How was this patch tested?
update unit test
Author: uncleGen <hustyugm@gmail.com>
Closes#16370 from uncleGen/SPARK-18960.
## What changes were proposed in this pull request?
This is to workaround an implicit result of #4947 which suppressed the
original Kryo exception if the overflow happened during serialization.
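A minimal sketch of the intent, assuming the usual buffer-overflow handling path (exact messages and structure may differ from the actual change): keep the original `KryoException` as the cause instead of suppressing it.

```scala
import com.esotericsoftware.kryo.{Kryo, KryoException}
import com.esotericsoftware.kryo.io.Output
import org.apache.spark.SparkException

def serialize(kryo: Kryo, output: Output, obj: AnyRef): Unit =
  try {
    kryo.writeClassAndObject(output, obj)
  } catch {
    case e: KryoException if Option(e.getMessage).exists(_.startsWith("Buffer overflow")) =>
      // Pass `e` as the cause so the original Kryo stack trace is not lost.
      throw new SparkException(
        s"Kryo serialization failed: ${e.getMessage}. " +
          "To avoid this, increase spark.kryoserializer.buffer.max.", e)
  }
```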
## How was this patch tested?
`KryoSerializerSuite` was augmented to reflect this change.
Author: Sergei Lebedev <superbobry@gmail.com>
Closes#16416 from superbobry/patch-1.
## What changes were proposed in this pull request?
Correct test cases of `LogisticRegression` `raw2prediction` & `probability2prediction`.
## How was this patch tested?
Changed unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#16407 from yanboliang/raw-probability.
## What changes were proposed in this pull request?
Univariate feature selection works by selecting the best features based on univariate statistical tests.
FDR and FWE are popular univariate statistical tests for feature selection.
In 2005, the Benjamini and Hochberg paper on FDR was identified as one of the 25 most-cited statistical papers. This PR implements FDR using the Benjamini-Hochberg procedure. https://en.wikipedia.org/wiki/False_discovery_rate.
In statistics, FWE is the probability of making one or more false discoveries, or type I errors, among all the hypotheses when performing multiple hypotheses tests.
https://en.wikipedia.org/wiki/Family-wise_error_rate
This PR adds FDR and FWE selection methods to ChiSqSelector, as implemented in scikit-learn.
http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection
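As a rough sketch of the two selection rules (illustrative only, not the ChiSqSelector implementation; `alpha` is the target error rate applied to per-feature p-values):

```scala
// Benjamini-Hochberg (FDR): find the largest rank k whose sorted p-value satisfies
// p(k) <= (k / m) * alpha, then keep every feature at or below that cutoff.
def selectByFdr(pValues: Array[Double], alpha: Double): Array[Int] = {
  val m = pValues.length
  val sorted = pValues.zipWithIndex.sortBy(_._1)           // (p-value, original feature index)
  val maxRank = sorted.zipWithIndex
    .filter { case ((p, _), rank) => p <= (rank + 1).toDouble / m * alpha }
    .map(_._2)
    .reduceOption(_ max _)
  maxRank match {
    case Some(r) => sorted.take(r + 1).map(_._2).sorted    // original indices of kept features
    case None    => Array.empty[Int]
  }
}

// Family-wise error rate (FWE): Bonferroni-style cutoff alpha / m.
def selectByFwe(pValues: Array[Double], alpha: Double): Array[Int] =
  pValues.zipWithIndex.filter { case (p, _) => p <= alpha / pValues.length }.map(_._2)
```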
## How was this patch tested?
Unit tests will be added soon.
Author: Peng <peng.meng@intel.com>
Author: Peng, Meng <peng.meng@intel.com>
Closes#15212 from mpjlu/fdr_fwe.
## What changes were proposed in this pull request?
Add an example with the `--pip` and `--r` switches, as is actually done in create-release.
## How was this patch tested?
Doc only
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16364 from felixcheung/buildguide.
### What changes were proposed in this pull request?
Since `spark.sql.hive.thriftServer.singleSession` is a configuration of the SQL component, this conf can be moved from `SparkConf` to `StaticSQLConf`.
When we introduced `spark.sql.hive.thriftServer.singleSession`, all SQL configurations were session specific; they could be modified in different sessions.
In Spark 2.1, static SQL configuration was added, and it is a perfect fit for `spark.sql.hive.thriftServer.singleSession`. Previously, we did the same move for `spark.sql.warehouse.dir` from `SparkConf` to `StaticSQLConf`.
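A minimal sketch of the user-visible effect (illustrative only, not the test code): a static SQL conf is fixed when the first `SparkSession` is created and cannot be changed per session afterwards.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  // Set once, before the first session is created; applies to all sessions.
  .config("spark.sql.hive.thriftServer.singleSession", "true")
  .getOrCreate()

// Static confs are read-only at runtime; trying to modify one on a live session
// is expected to fail (in Spark this throws an AnalysisException).
// spark.conf.set("spark.sql.hive.thriftServer.singleSession", "false")
```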
### How was this patch tested?
Added test cases in HiveThriftServer2Suites.scala
Author: gatorsmile <gatorsmile@gmail.com>
Closes#16392 from gatorsmile/hiveThriftServerSingleSession.
## What changes were proposed in this pull request?
On the configuration doc page (https://spark.apache.org/docs/latest/configuration.html) we describe `spark.kryoserializer.buffer.max` as: "Maximum allowable size of Kryo serialization buffer. This must be larger than any object you attempt to serialize. Increase this if you get a 'buffer limit exceeded' exception inside Kryo."
However, the source code has a hard-coded upper limit:
```
val maxBufferSizeMb = conf.getSizeAsMb("spark.kryoserializer.buffer.max", "64m").toInt
if (maxBufferSizeMb >= ByteUnit.GiB.toMiB(2)) {
  throw new IllegalArgumentException("spark.kryoserializer.buffer.max must be less than " +
    s"2048 mb, got: + $maxBufferSizeMb mb.")
}
```
We should mention "this value must be less than 2048 mb" on the configuration doc page as well.
## How was this patch tested?
None, since it's a minor doc change.
Author: Yuexin Zhang <yxzhang@cloudera.com>
Closes#16412 from cnZach/SPARK-19006.
## What changes were proposed in this pull request?
The `ReplSuite` tests hang due to the length limitation on Windows, with the exception below:
```
Spark context available as 'sc' (master = local-cluster[1,1,1024], app id = app-20161223114000-0000).
Spark session available as 'spark'.
Exception in thread "ExecutorRunner for app-20161223114000-0000/26995" java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:622)
at java.lang.StringBuilder.append(StringBuilder.java:202)
at java.lang.ProcessImpl.createCommandLine(ProcessImpl.java:194)
at java.lang.ProcessImpl.<init>(ProcessImpl.java:340)
at java.lang.ProcessImpl.start(ProcessImpl.java:137)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
at org.apache.spark.deploy.worker.ExecutorRunner.org$apache$spark$deploy$worker$ExecutorRunner$$fetchAndRunExecutor(ExecutorRunner.scala:167)
at org.apache.spark.deploy.worker.ExecutorRunner$$anon$1.run(ExecutorRunner.scala:73)
```
The reason is that it keeps failing and goes into an infinite loop. It fails because the tests use paths from URLs (via `getFile`) whereas some paths added afterward are normal local paths.
(`url.getFile` gives `/C:/a/b/c` while paths added later have the form `C:\a\b\c`.)
So, many classpath entries are duplicated because normal local paths and paths from URLs are mixed. The resulting classpath is up to 40K characters, which hits the length limitation (32K) on Windows.
The full command line built here is - https://gist.github.com/HyukjinKwon/46af7946c9a5fd4c6fc70a8a0aba1beb
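A small illustration of the mismatch described above (Windows-only behaviour, with a made-up path):

```scala
import java.io.File

// On Windows, the URL form of a path differs from the plain local form:
val asUrl   = new File("C:\\a\\b\\c").toURI.toURL.getFile  // "/C:/a/b/c"
val asLocal = new File("C:\\a\\b\\c").getAbsolutePath      // "C:\a\b\c"

// A naive string comparison treats them as two distinct classpath entries,
// so the same jar/directory gets appended again and again.
val treatedAsDifferent = asUrl != asLocal                   // true on Windows
```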
## How was this patch tested?
Manually via AppVeyor.
**Before**
https://ci.appveyor.com/project/spark-test/spark/build/395-find-path-issues
**After**
https://ci.appveyor.com/project/spark-test/spark/build/398-find-path-issues
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#16398 from HyukjinKwon/SPARK-18842-more.
## What changes were proposed in this pull request?
Currently `DatasetBenchmark` uses `case class Data(l: Long, s: String)` as the record type of both `RDD` and `Dataset`, which introduces serialization overhead only for `Dataset` and is unfair.
This PR uses `Long` as the record type, to be fairer to `Dataset`.
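For illustration only (not the benchmark code itself), the difference between the two setups:

```scala
import org.apache.spark.sql.SparkSession

// With a case class record the Dataset side pays encoder serialization costs that the
// RDD side does not, so plain Longs level the playing field.
case class Data(l: Long, s: String)   // old, unfair record type

val spark = SparkSession.builder().master("local[*]").appName("bench-sketch").getOrCreate()
val rdd = spark.sparkContext.range(0L, 1000000L)  // RDD[Long], no encoder involved
val ds  = spark.range(0L, 1000000L)               // Dataset[java.lang.Long]
```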
## How was this patch tested?
existing tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes#16391 from cloud-fan/benchmark.
## What changes were proposed in this pull request?
`JDBCSuite` and `JDBCWriterSuite` have their own `testH2Dialect`s for their testing purposes.
This PR fixes `testH2Dialect` in `JDBCWriterSuite` by removing its `getCatalystType` implementation so that correct types are returned. Currently, it always returns `Some(StringType)`, which is incorrect. Note that the `testH2Dialect` in `JDBCSuite` is intentional because of the test case `Remap types via JdbcDialects`.
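An illustrative sketch of the problematic pattern (not the exact test code): a dialect whose `getCatalystType` always answers `StringType` masks the real column types during schema inference.

```scala
import org.apache.spark.sql.jdbc.JdbcDialect
import org.apache.spark.sql.types._

// Hypothetical dialect showing the behaviour being removed from JDBCWriterSuite's testH2Dialect.
object BadH2Dialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:h2")
  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
    Some(StringType)  // every column reads back as a string, hiding INT, DOUBLE, TIMESTAMP, ...
}
```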
## How was this patch tested?
This is a test only update.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#16409 from dongjoon-hyun/SPARK-H2-DIALECT.
## What changes were proposed in this pull request?
`Literal` can use `CodegenContext.addReferenceObj` to implement codegen instead of `CodegenFallback`. This also simplifies the generated code a little: previously we generated `((Expression) references[1]).eval(null)`, now it's just `references[1]`.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes#16402 from cloud-fan/minor.
## What changes were proposed in this pull request?
When we describe a table, we only want to see the table's information, not read its data, so it should work even if the format class is not present on the classpath.
## How was this patch tested?
new regression test
Author: Wenchen Fan <wenchen@databricks.com>
Closes#16388 from cloud-fan/hive.
## What changes were proposed in this pull request?
Currently we implement `Aggregator` with `DeclarativeAggregate`, which serializes/deserializes the buffer object every time an input is processed.
This PR implements `Aggregator` with `TypedImperativeAggregate` and avoids serializing/deserializing the buffer object many times. The benchmark shows about a 2x speedup.
For simple buffer objects that don't need serialization, we still go with `DeclarativeAggregate`, to avoid a performance regression.
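For context, a hypothetical typed `Aggregator` of the kind affected, adapted from the familiar average example: its `(sum, count)` buffer is an object which, under the old `DeclarativeAggregate`-based implementation, had to be serialized and deserialized per input row.

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Hypothetical buffer object: non-trivial enough that (de)serializing it per input row is costly.
case class Average(var sum: Long, var count: Long)

object MyAverage extends Aggregator[Long, Average, Double] {
  def zero: Average = Average(0L, 0L)
  def reduce(buf: Average, value: Long): Average = { buf.sum += value; buf.count += 1; buf }
  def merge(a: Average, b: Average): Average = { a.sum += b.sum; a.count += b.count; a }
  def finish(buf: Average): Double = buf.sum.toDouble / buf.count
  def bufferEncoder: Encoder[Average] = Encoders.product
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}
```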
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes#16383 from cloud-fan/aggregator.
## What changes were proposed in this pull request?
The root cause of this issue is that RegisterWorkerResponse and LaunchExecutor are sent via two different channels (TCP connections) and their order is not guaranteed.
This PR changes the master and worker codes to use `workerRef` to send RegisterWorkerResponse, so that RegisterWorkerResponse and LaunchExecutor are sent via the same connection. Hence `LaunchExecutor` will always be after `RegisterWorkerResponse` and never be ignored.
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#16345 from zsxwing/SPARK-17755.
## What changes were proposed in this pull request?
`CSVRelation.csvParser` does type dispatch for each value in each row. We can avoid this because the schema is already kept in `CSVRelation`.
So, this PR proposes creating the converters first according to the schema, and then applying them to each row (see the illustrative sketch below).
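As a rough illustration of that approach (names are hypothetical, not the `CSVRelation` code): build one converter per field from the schema up front, then apply the pre-built converters to every row, instead of matching on the data type for each value.

```scala
import org.apache.spark.sql.types._

// Hypothetical sketch: per-field converters built once from the schema.
def makeConverter(dt: DataType): String => Any = dt match {
  case IntegerType => (s: String) => s.toInt
  case DoubleType  => (s: String) => s.toDouble
  case BooleanType => (s: String) => s.toBoolean
  case _           => (s: String) => s
}

val schema = StructType(Seq(StructField("a", DoubleType), StructField("b", BooleanType)))
val converters = schema.fields.map(f => makeConverter(f.dataType))  // type dispatch happens here, once

// Applying the pre-built converters involves no per-value type dispatch.
def convertRow(tokens: Array[String]): Array[Any] =
  tokens.zip(converters).map { case (token, convert) => convert(token) }
```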
To test the updated logic, I ran some small benchmarks (shown below) after adapting the logic in 7c33b0fd05/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala (L170-L178).
```scala
test("Benchmark for CSV converter") {
var numMalformedRecords = 0
val N = 500 << 12
val schema = StructType(
StructField("a", StringType) ::
StructField("b", StringType) ::
StructField("c", StringType) ::
StructField("d", StringType) :: Nil)
val row = Array("1.0", "test", "2015-08-20 14:57:00", "FALSE")
val data = spark.sparkContext.parallelize(List.fill(N)(row))
val parser = CSVRelation.csvParser(schema, schema.fieldNames, CSVOptions())
val benchmark = new Benchmark("CSV converter", N)
benchmark.addCase("cast CSV string tokens", 10) { _ =>
data.flatMap { recordTokens =>
parser(recordTokens, numMalformedRecords)
}.collect()
}
benchmark.run()
}
```
**Before**
```
CSV converter: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
cast CSV string tokens 1061 / 1130 1.9 517.9 1.0X
```
**After**
```
CSV converter: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
cast CSV string tokens 940 / 1011 2.2 459.2 1.0X
```
## How was this patch tested?
Tests in `CSVTypeCastSuite` and `CSVRelation`
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#16351 from HyukjinKwon/type-dispatch.
## What changes were proposed in this pull request?
This issue was reported by wangyum.
In the AllJobsPage, JobPage and StagePage, the length of the description column used to be limited, as shown below.
![ui-2 0 0](https://cloud.githubusercontent.com/assets/4736016/21319673/8b225246-c651-11e6-9041-4fcdd04f4dec.gif)
But recently, the limitation seems to have been accidentally removed.
![ui-2 1 0](https://cloud.githubusercontent.com/assets/4736016/21319825/104779f6-c652-11e6-8bfa-dfd800396352.gif)
The cause is that some tables no longer have the `sortable` class although they used to, and the `sortable` class not only marks tables as sortable but also limits the width of their child `td` elements.
The reason some tables no longer have the `sortable` class is that another sorting mechanism was introduced by #13620 and #13708 along with the pagination feature.
To fix this issue, I've introduced a new class `table-cell-width-limited` which limits the description cell width, so the description looks like it did before.
<img width="1260" alt="2016-12-20 1 00 34" src="https://cloud.githubusercontent.com/assets/4736016/21320478/89141c7a-c654-11e6-8494-f8f91325980b.png">
## How was this patch tested?
Tested manually with my browser.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#16338 from sarutak/SPARK-18837.
## What changes were proposed in this pull request?
`UnsafeKVExternalSorter` uses `UnsafeInMemorySorter` to sort the records of `BytesToBytesMap` if it is given a map.
Currently we use the number of keys in `BytesToBytesMap` to determine whether the array used for sorting is big enough. We have an assert that ensures the size of the array is enough: `map.numKeys() <= map.getArray().size() / 2`.
However, each record in the map takes two entries in the array: one is the record pointer, the other is the key prefix. So the correct assert should be `map.numKeys() * 2 <= map.getArray().size() / 2`.
## How was this patch tested?
N/A
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#16232 from viirya/SPARK-18800-fix-UnsafeKVExternalSorter.
## What changes were proposed in this pull request?
Statistics in LogicalPlan should use attributes to refer to columns rather than column names, because two columns from two relations can have the same column name. But CatalogTable doesn't have the concepts of attribute or broadcast hint in Statistics. Therefore, putting Statistics in CatalogTable is confusing.
We define a different statistic structure in CatalogTable, which is only responsible for interacting with metastore, and is converted to statistics in LogicalPlan when it is used.
## How was this patch tested?
add test cases
Author: wangzhenhua <wangzhenhua@huawei.com>
Author: Zhenhua Wang <wzh_zju@163.com>
Closes#16323 from wzhfy/nameToAttr.
## What changes were proposed in this pull request?
The time complexity of ConcurrentHashMap's `remove` is O(1). Changing `ContextCleaner.referenceBuffer`'s type from `ConcurrentLinkedQueue` to a set backed by `ConcurrentHashMap` will make removal much faster.
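A minimal sketch of the data-structure change (element type simplified to `AnyRef`; not the actual `ContextCleaner` code):

```scala
import java.util.Collections
import java.util.concurrent.ConcurrentHashMap

// A Set view backed by a ConcurrentHashMap gives O(1) remove, whereas
// ConcurrentLinkedQueue.remove has to scan the whole queue.
val referenceBuffer: java.util.Set[AnyRef] =
  Collections.newSetFromMap(new ConcurrentHashMap[AnyRef, java.lang.Boolean]())

referenceBuffer.add("some cleanup task reference")
referenceBuffer.remove("some cleanup task reference")  // constant-time removal
```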
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#16390 from zsxwing/SPARK-18991.
The test fails on big endian; the byte order should only be changed on little endian.
## What changes were proposed in this pull request?
Fix the test to only change byte order on little-endian platforms.
## How was this patch tested?
Test run on Big Endian and Little Endian platforms
Author: Pete Robbins <robbinspg@gmail.com>
Closes#16375 from robbinspg/SPARK-18963.
## What changes were proposed in this pull request?
This would make it easier to integrate with other components expecting a row-based JSON format.
It replaces the non-public toJSON RDD API.
## How was this patch tested?
manual, unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16368 from felixcheung/rJSON.
## What changes were proposed in this pull request?
Right now the names of threads created by Netty for Spark RPC are `shuffle-client-**` and `shuffle-server-**`, which is pretty confusing.
This PR just uses the module name in TransportConf to set the thread name. In addition, it also includes the following minor fixes:
- TransportChannelHandler.channelActive and channelInactive should call the corresponding super methods.
- Make ShuffleBlockFetcherIterator throw NoSuchElementException if it has no more elements. Otherwise, if the caller calls `next` without `hasNext`, it will just hang.
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#16380 from zsxwing/SPARK-18972.
## What changes were proposed in this pull request?
Add missing InterfaceStability.Evolving for Structured Streaming APIs
## How was this patch tested?
Compiling the codes.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#16385 from zsxwing/SPARK-18985.
## What changes were proposed in this pull request?
This PR is derived from #16000 and completes #15904.
**Description**
- Augment the `org.apache.spark.status.api.v1` package for serving streaming information.
- Retrieve the streaming information through StreamingJobProgressListener.
> this api should cover exactly the same amount of information as you can get from the web interface
> the implementation is based on the current REST implementation of spark-core
> and will be available for running applications only
>
> https://issues.apache.org/jira/browse/SPARK-18537
## How was this patch tested?
Local test.
Author: saturday_s <shi.indetail@gmail.com>
Author: Chan Chor Pang <ChorPang.Chan@access-company.com>
Author: peterCPChan <universknight@gmail.com>
Closes#16253 from saturday-shi/SPARK-18537.
## What changes were proposed in this pull request?
In current Spark we can add a customized SparkListener through the `SparkContext#addSparkListener` API, but there's no equivalent API to remove the registered one. In our scenario, SparkListeners are added repeatedly as the environment changes. Without the ability to remove listeners, we may end up with many registered listeners, which is unnecessary and can affect performance. So here we propose adding an API to remove a registered listener (see the sketch below).
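A sketch of the proposed usage, assuming the new `removeSparkListener` mirrors the existing `addSparkListener` API (listener body is illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd}

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("listener-demo"))

val listener = new SparkListener {
  override def onApplicationEnd(end: SparkListenerApplicationEnd): Unit =
    println(s"Application ended at ${end.time}")
}

sc.addSparkListener(listener)
// ...later, when the environment changes, unregister instead of accumulating listeners:
sc.removeSparkListener(listener)
```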
## How was this patch tested?
Added a unit test to verify it.
Author: jerryshao <sshao@hortonworks.com>
Closes#16382 from jerryshao/SPARK-18975.
## What changes were proposed in this pull request?
SortPartitions and RedistributeData logical operators are not actually used and can be removed. Note that we do have a Sort operator (with global flag false) that subsumed SortPartitions.
## How was this patch tested?
Also updated test cases to reflect the removal.
Author: Reynold Xin <rxin@databricks.com>
Closes#16381 from rxin/SPARK-18973.
## What changes were proposed in this pull request?
This PR cleans up duplicated checking of file paths in the implemented data sources and prevents attempting to list files twice in the ORC data source.
https://github.com/apache/spark/pull/14585 handles a problem for the partition column name having `_` and the issue itself is resolved correctly. However, it seems the data sources implementing `FileFormat` validate the paths redundantly. Judging from the comment in `CSVFileFormat`, `// TODO: Move filtering.`, I guess we don't have to check this twice.
Currently, this seems to be filtered in `PartitioningAwareFileIndex.shouldFilterOut` and `PartitioningAwareFileIndex.isDataPath`, so `FileFormat.inferSchema` will always receive leaf files. For example, running the code below:
``` scala
spark.range(10).withColumn("_locality_code", $"id").write.partitionBy("_locality_code").save("/tmp/parquet")
spark.read.parquet("/tmp/parquet")
```
gives the paths below without directories but just valid data files:
``` bash
/tmp/parquet/_col=0/part-r-00000-094a8efa-bece-4b50-b54c-7918d1f7b3f8.snappy.parquet
/tmp/parquet/_col=1/part-r-00000-094a8efa-bece-4b50-b54c-7918d1f7b3f8.snappy.parquet
/tmp/parquet/_col=2/part-r-00000-25de2b50-225a-4bcf-a2bc-9eb9ed407ef6.snappy.parquet
...
```
to `FileFormat.inferSchema`.
## How was this patch tested?
Unit test added in `HadoopFsRelationTest` and related existing tests.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#14627 from HyukjinKwon/SPARK-16975.
## What changes were proposed in this pull request?
There are several tests failing due to resource-closing-related and path-related problems on Windows as below.
- `SQLQuerySuite`:
```
- specifying database name for a temporary table is not allowed *** FAILED *** (125 milliseconds)
org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-1f4471ab-aac0-4239-ae35-833d54b37e52;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
```
- `JsonSuite`:
```
- Loading a JSON dataset from a text file with SQL *** FAILED *** (94 milliseconds)
org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-c918a8b7-fc09-433c-b9d0-36c0f78ae918;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
```
- `StateStoreSuite`:
```
- SPARK-18342: commit fails when rename fails *** FAILED *** (16 milliseconds)
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: StateStoreSuite29777261fs://C:%5Cprojects%5Cspark%5Ctarget%5Ctmp%5Cspark-ef349862-7281-4963-aaf3-add0d670a4ad%5C?????-2218c2f8-2cf6-4f80-9cdf-96354e8246a77685899733421033312/0
at org.apache.hadoop.fs.Path.initialize(Path.java:206)
at org.apache.hadoop.fs.Path.<init>(Path.java:116)
at org.apache.hadoop.fs.Path.<init>(Path.java:89)
...
Cause: java.net.URISyntaxException: Relative path in absolute URI: StateStoreSuite29777261fs://C:%5Cprojects%5Cspark%5Ctarget%5Ctmp%5Cspark-ef349862-7281-4963-aaf3-add0d670a4ad%5C?????-2218c2f8-2cf6-4f80-9cdf-96354e8246a77685899733421033312/0
at java.net.URI.checkPath(URI.java:1823)
at java.net.URI.<init>(URI.java:745)
at org.apache.hadoop.fs.Path.initialize(Path.java:203)
```
- `HDFSMetadataLogSuite`:
```
- FileManager: FileContextManager *** FAILED *** (94 milliseconds)
java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-415bb0bd-396b-444d-be82-04599e025f21
at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010)
at org.apache.spark.sql.test.SQLTestUtils$class.withTempDir(SQLTestUtils.scala:127)
at org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.withTempDir(HDFSMetadataLogSuite.scala:38)
- FileManager: FileSystemManager *** FAILED *** (78 milliseconds)
java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-ef8222cd-85aa-47c0-a396-bc7979e15088
at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010)
at org.apache.spark.sql.test.SQLTestUtils$class.withTempDir(SQLTestUtils.scala:127)
at org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.withTempDir(HDFSMetadataLogSuite.scala:38)
```
Also, some tests fail due to the length limitation of cmd on Windows, as below:
- `LauncherBackendSuite`:
```
- local: launcher handle *** FAILED *** (30 seconds, 120 milliseconds)
The code passed to eventually never returned normally. Attempted 283 times over 30.0960053 seconds. Last failure message: The reference was null. (LauncherBackendSuite.scala:56)
org.scalatest.exceptions.TestFailedDueToTimeoutException:
at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
- standalone/client: launcher handle *** FAILED *** (30 seconds, 47 milliseconds)
The code passed to eventually never returned normally. Attempted 282 times over 30.037987100000002 seconds. Last failure message: The reference was null. (LauncherBackendSuite.scala:56)
org.scalatest.exceptions.TestFailedDueToTimeoutException:
at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
```
The executed command is https://gist.github.com/HyukjinKwon/d3fdd2e694e5c022992838a618a516bd, which is 16K characters long; however, the length limitation is 8K on Windows, so it fails to launch.
This PR proposes to fix the test failures on Windows and to skip the tests that fail due to the length limitation.
## How was this patch tested?
Manually tested via AppVeyor
**Before**
`SQLQuerySuite `: https://ci.appveyor.com/project/spark-test/spark/build/306-pr-references
`JsonSuite`: https://ci.appveyor.com/project/spark-test/spark/build/307-pr-references
`StateStoreSuite` : https://ci.appveyor.com/project/spark-test/spark/build/305-pr-references
`HDFSMetadataLogSuite`: https://ci.appveyor.com/project/spark-test/spark/build/304-pr-references
`LauncherBackendSuite`: https://ci.appveyor.com/project/spark-test/spark/build/303-pr-references
**After**
`SQLQuerySuite`: https://ci.appveyor.com/project/spark-test/spark/build/293-SQLQuerySuite
`JsonSuite`: https://ci.appveyor.com/project/spark-test/spark/build/294-JsonSuite
`StateStoreSuite`: https://ci.appveyor.com/project/spark-test/spark/build/297-StateStoreSuite
`HDFSMetadataLogSuite`: https://ci.appveyor.com/project/spark-test/spark/build/319-pr-references
`LauncherBackendSuite`: failed test skipped.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#16335 from HyukjinKwon/more-fixes-on-windows.
## What changes were proposed in this pull request?
Starting with Spark 2.1.0, the bucketing feature is available for all file-based data sources. This patch fixes some function docs that haven't yet been updated to reflect that.
## How was this patch tested?
N/A
Author: Reynold Xin <rxin@databricks.com>
Closes#16349 from rxin/ds-doc.
## What changes were proposed in this pull request?
This patch includes minor changes to improve readability of the partition handling code. I'm in the middle of implementing a new feature and found some of the naming and implicit type inference not very intuitive.
## How was this patch tested?
This patch should have no semantic change and the changes should be covered by existing test cases.
Author: Reynold Xin <rxin@databricks.com>
Closes#16378 from rxin/minor-fix.
## What changes were proposed in this pull request?
This PR audits places using `logicalPlan` in StreamExecution and ensures they all handle the case where `logicalPlan` cannot be created.
In addition, this PR also fixes the following issues in `StreamingQueryException`:
- `StreamingQueryException` and `StreamExecution` are cyclically dependent: the `StreamingQueryException` constructor calls `StreamExecution`'s `toDebugString`, which in turn uses `StreamingQueryException`. Hence it outputs a `null` value in the error message.
- The stack trace is duplicated when calling `Throwable.printStackTrace`, because `StreamingQueryException`'s `toString` contains the stack trace.
## How was this patch tested?
The updated `test("max files per trigger - incorrect values")`. I found this issue when I switched from `testStream` to the real codes to verify the failure in this test.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#16322 from zsxwing/SPARK-18907.
## What changes were proposed in this pull request?
make-distribution.sh should find JAVA_HOME for Ubuntu, Mac and other non-RHEL systems
## How was this patch tested?
Manually
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16363 from felixcheung/buildjava.
## What changes were proposed in this pull request?
API for SparkUI URL from SparkContext
## How was this patch tested?
manual, unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16367 from felixcheung/rwebui.
## What changes were proposed in this pull request?
This PR fixes a `NullPointerException` caused by the following `limit + aggregate` query:
```
scala> val df = Seq(("a", 1), ("b", 2), ("c", 1), ("d", 5)).toDF("id", "value")
scala> df.limit(2).groupBy("id").count().show
WARN TaskSetManager: Lost task 0.0 in stage 9.0 (TID 8204, lvsp20hdn012.stubprod.com): java.lang.NullPointerException
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
```
The root culprit is that [`$doAgg()`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L596) skips the initialization of [the buffer iterator](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L603); `BaseLimitExec` sets `stopEarly=true` and `$doAgg()` exits in the middle without the initialization.
## How was this patch tested?
Added a test to check if no exception happens for limit + aggregates in `DataFrameAggregateSuite.scala`.
Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
Closes#15980 from maropu/SPARK-18528.
## What changes were proposed in this pull request?
Made update mode public. As part of that, here are the changes:
- Updated `DataStreamWriter` to accept "update" (see the usage sketch after this list)
- Changed the package of `InternalOutputModes` from `o.a.s.sql` to `o.a.s.sql.catalyst`
- Added update-mode state removal with watermark to `StateStoreSaveExec`
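A hypothetical end-to-end sketch of the now-public update mode (source and sink chosen only for illustration): only rows updated since the last trigger are emitted to the sink.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("update-mode").getOrCreate()
import spark.implicits._

val lines = spark.readStream.format("socket")
  .option("host", "localhost").option("port", 9999).load()
val counts = lines.as[String].flatMap(_.split(" ")).groupBy("value").count()

val query = counts.writeStream
  .outputMode("update")   // previously internal; accepted by DataStreamWriter after this change
  .format("console")
  .start()
```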
## How was this patch tested?
Added new tests in changed modules
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#16360 from tdas/SPARK-18234.
Remove spark-tags' compile-scope dependency (and, indirectly, spark-core's compile-scope transitive dependency) on scalatest by splitting test-oriented tags into spark-tags' test JAR.
Alternative to #16303.
Author: Ryan Williams <ryan.blake.williams@gmail.com>
Closes#16311 from ryan-williams/tt.
## What changes were proposed in this pull request?
When KafkaSource fails on Kafka errors, we should create a new consumer to retry rather than using the existing broken one because it's possible that the broken one will fail again.
This PR also assigns a new group id to the newly created consumer to avoid a possible race condition: the broken consumer cannot talk to the Kafka cluster in `close` but the new consumer can. I'm not sure if this will happen or not; it's just for safety, to avoid the Kafka cluster thinking there are two consumers with the same group id in a short time window. (Note: CachedKafkaConsumer doesn't need this fix since `assign` never uses the group id.)
## How was this patch tested?
In https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70370/console , it ran this flaky test 120 times and all passed.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#16282 from zsxwing/kafka-fix.
## What changes were proposed in this pull request?
Currently, Spark writes a single file out per task, sometimes leading to very large files. It would be great to have an option to limit the max number of records written per file in a task, to avoid humongous files.
This patch introduces a new write config option `maxRecordsPerFile` (default to a session-wide setting `spark.sql.files.maxRecordsPerFile`) that limits the max number of records written to a single file. A non-positive value indicates there is no limit (same behavior as not having this flag).
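A sketch of the new knob as described above (paths and values are arbitrary): cap each written file at 10,000 records, either per write via an option or session-wide via the SQL conf.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("max-records").getOrCreate()
val df = spark.range(0L, 100000L).toDF("id")

df.write
  .option("maxRecordsPerFile", 10000L)  // per-write limit
  .parquet("/tmp/limited-output")

// Session-wide default used when the per-write option is not set:
spark.conf.set("spark.sql.files.maxRecordsPerFile", 10000L)
```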
## How was this patch tested?
Added test cases in PartitionedWriteSuite for both dynamic partition insert and non-dynamic partition insert.
Author: Reynold Xin <rxin@databricks.com>
Closes#16204 from rxin/SPARK-18775.
## What changes were proposed in this pull request?
The issue in this test is that the cleanup of RDDs may not finish before StreamingContext is stopped. This PR basically just puts the assertions into `eventually` and runs them before stopping StreamingContext.
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#16362 from zsxwing/SPARK-18954.
## What changes were proposed in this pull request?
The failure is because `test("basic functionality")` doesn't block until `ExecutorAllocationManager.manageAllocation` is called. This PR just adds StreamManualClock to allow the tests to block for the expected wait time, making the test deterministic.
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#16321 from zsxwing/SPARK-18031.
## What changes were proposed in this pull request?
Two changes
- Fix how delays specified in months and years are translated to milliseconds
- Following up on #16258, do not show the watermark when there is no watermarking in the query
## How was this patch tested?
Updated and new unit tests
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#16304 from tdas/SPARK-18834-1.
## What changes were proposed in this pull request?
I recently hit a bug in com.thoughtworks.paranamer/paranamer, which causes jackson to fail to handle a byte array defined in a case class. I then found https://github.com/FasterXML/jackson-module-scala/issues/48, which suggests that it is caused by a bug in paranamer. Let's upgrade paranamer. Since we are using jackson 2.6.5, and jackson-module-paranamer 2.6.5 uses com.thoughtworks.paranamer/paranamer 2.6, I suggest that we upgrade paranamer to 2.6.
Author: Yin Huai <yhuai@databricks.com>
Closes#16359 from yhuai/SPARK-18951.
## What changes were proposed in this pull request?
It's a huge waste to call `Catalog.listTables` in `SQLContext.tableNames`, which only needs the table names, while `Catalog.listTables` gets the table metadata for each table name.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes#16352 from cloud-fan/minor.
## What changes were proposed in this pull request?
We can build Python API docs with `cd ./python/docs && make html` and R API docs with `cd ./R && sh create-docs.sh` separately. However, `jekyll` fails in some environments.
This PR aims to support `SKIP_PYTHONDOC` and `SKIP_RDOC` for the documentation build in the `docs` folder. Currently, we can use `SKIP_SCALADOC` or `SKIP_API`. The reason for providing additional options is that the Spark documentation build uses a number of tools to build HTML docs and API docs in Scala, Python and R. Specifically, for Python and R,
- Python API docs requires `sphinx`.
- R API docs requires `R` installation and `knitr` (and more others libraries).
In other words, we cannot generate Python API docs without an R installation, and we cannot generate R API docs without the Python `sphinx` installation. If Spark provided `SKIP_PYTHONDOC` and `SKIP_RDOC`, like `SKIP_SCALADOC`, it would be more convenient.
## How was this patch tested?
Manual.
**Skipping Scala/Java/Python API Doc Build**
```bash
$ cd docs
$ SKIP_SCALADOC=1 SKIP_PYTHONDOC=1 jekyll build
$ ls api
DESCRIPTION R
```
**Skipping Scala/Java/R API Doc Build**
```bash
$ cd docs
$ SKIP_SCALADOC=1 SKIP_RDOC=1 jekyll build
$ ls api
python
```
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#16336 from dongjoon-hyun/SPARK-18923.
### What changes were proposed in this pull request?
Currently, we only have SQL interfaces for recovering all the partitions in the directory of a table and updating the catalog: `MSCK REPAIR TABLE` or `ALTER TABLE table RECOVER PARTITIONS`. (Actually, it's very hard for me to remember `MSCK` and I have no clue what it means.)
After the new "Scalable Partition Handling", table repair becomes much more important for making the data in a newly created, partitioned data source table visible.
Thus, this PR adds it to the Catalog interface. After this PR, users can repair a table with:
```Scala
spark.catalog.recoverPartitions("testTable")
```
### How was this patch tested?
Modified the existing test cases.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#16356 from gatorsmile/repairTable.
## What changes were proposed in this pull request?
It was pretty flaky before 10 days ago.
https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.execution.streaming.state.StateStoreSuite&test_name=maintenance
Since no code changes went into this code path that would make it less flaky, I'm just increasing the timeouts so that load-related flakiness shouldn't be a problem. As you may see from the testing, I haven't been able to reproduce it.
## How was this patch tested?
2000 retries 5 times
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#16314 from brkyvz/maint-flaky.
## What changes were proposed in this pull request?
Adds basic TaskContext information to PySpark.
## How was this patch tested?
New unit tests to `tests.py` & existing unit tests.
Author: Holden Karau <holden@us.ibm.com>
Closes#16211 from holdenk/SPARK-18576-pyspark-taskcontext.