## What changes were proposed in this pull request?
As of the current 2.1 branch, INSERT OVERWRITE with dynamic partitions against a Datasource table overwrites the entire table instead of only the partitions matching the static keys, which is what Hive does. It also doesn't respect custom partition locations.
This PR adds support for all these operations to Datasource tables managed by the Hive metastore. It is implemented as follows:
- During planning time, the full set of partitions affected by an INSERT or OVERWRITE command is read from the Hive metastore.
- The planner identifies any partitions with custom locations and includes this in the write task metadata.
- FileFormatWriter tasks refer to this custom locations map when determining where to write for dynamic partition output.
- When the write job finishes, the set of written partitions is compared against the initial set of matched partitions, and the Hive metastore is updated to reflect the newly added / removed partitions.
It was necessary to introduce a method for staging files with absolute output paths to `FileCommitProtocol`. These files are not handled by the Hadoop output committer but are moved to their final locations when the job commits.
The overwrite behavior of legacy Datasource tables is also changed: no longer will the entire table be overwritten if a partial partition spec is present.
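The custom-location lookup described above can be pictured with a small sketch. This is not the code in `FileFormatWriter`; `PartitionPathSketch`, `outputDir`, and the paths are hypothetical names chosen for illustration, and partition specs are modelled as plain maps.

```scala
object PartitionPathSketch {
  // A partition spec maps partition column names to values, e.g. Map("day" -> "2016-11-01").
  type PartitionSpec = Map[String, String]

  // Hypothetical helper: choose the directory a dynamic-partition write should target.
  // A partition with a custom location goes to that absolute path; any other partition
  // goes to the default col=value layout under the table root.
  def outputDir(
      tableRoot: String,
      partitionColumns: Seq[String],
      spec: PartitionSpec,
      customLocations: Map[PartitionSpec, String]): String = {
    customLocations.getOrElse(spec, {
      val relative = partitionColumns.map(c => s"$c=${spec(c)}").mkString("/")
      s"$tableRoot/$relative"
    })
  }
}
```

In this sketch only the first branch yields a path outside the table root, which is the case that would need the absolute-path staging added to `FileCommitProtocol`.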
cc cloud-fan yhuai
## How was this patch tested?
Unit tests, existing tests.
Author: Eric Liang <ekl@databricks.com>
Author: Wenchen Fan <wenchen@databricks.com>
Closes #15814 from ericl/sc-5027.
## What changes were proposed in this pull request?
We should call `setConf` if the `OutputFormat` is `Configurable`. This must be done before we create the `OutputCommitter` and `RecordWriter`.
This is a follow-up to #15769; see the discussion [here](https://github.com/apache/spark/pull/15769/files#r87064229).
## How was this patch tested?
Added a test for this case in `PairRDDFunctionsSuite`.
Author: jiangxingbo <jiangxb1987@gmail.com>
Closes #15823 from jiangxb1987/config-format.
## What changes were proposed in this pull request?
Application links generated on the history server UI no longer contain the configured spark.ui.proxyBase (a regression from 1.6). To address this, uiRoot is now available globally to all JavaScript files in the Web UI, and the mustache template (historypage-template.html) now includes uiRoot when rendering links to the applications.
The existing test was not sufficient to verify the scenario where ajax call is used to populate the application listing template, so added a new selenium test case to cover this scenario.
## How was this patch tested?
Existing tests and a new unit test.
No visual changes to the UI.
Author: Vinayak <vijoshi5@in.ibm.com>
Closes #15742 from vijoshi/SPARK-16808_master.
## What changes were proposed in this pull request?
`StandaloneSchedulerBackend.dead` is called in an RPC thread, so it should not call `SparkContext.stop` in the same thread. `SparkContext.stop` blocks until all RPC threads exit, so calling it from inside an RPC thread causes a deadlock.
This PR adds a thread-local flag inside RPC threads. `SparkContext.stop` uses it to decide whether to launch a new thread to stop the SparkContext.
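A minimal sketch of the thread-local idea, using only the JDK. `StopSketch`, `runAsRpcThread`, and the thread name are hypothetical; the real flag lives inside Spark's RPC layer and is not reproduced here.

```scala
object StopSketch {
  // Hypothetical flag: true only on threads that dispatch RPC messages.
  private val inRpcThread = new ThreadLocal[Boolean] {
    override def initialValue(): Boolean = false
  }

  // Runs `body` with the flag set, the way an RPC dispatcher thread would.
  def runAsRpcThread[T](body: => T): T = {
    inRpcThread.set(true)
    try body finally inRpcThread.set(false)
  }

  // Returns the thread that performs the stop. On an RPC thread the work is handed
  // to a fresh thread, so the caller never blocks waiting for its own pool to exit.
  def stop(doStop: () => Unit): Thread = {
    if (inRpcThread.get()) {
      val t = new Thread(new Runnable { def run(): Unit = doStop() }, "stop-spark-context")
      t.setDaemon(true)
      t.start()
      t
    } else {
      doStop()
      Thread.currentThread()
    }
  }
}
```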
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #15775 from zsxwing/SPARK-18280.
## What changes were proposed in this pull request?
This PR ports the RDD API to use the commit protocol. The changes are:
1. Add a new internal helper class, `SparkNewHadoopWriter`, that saves an RDD using a Hadoop OutputFormat. It is similar to `SparkHadoopWriter` but uses the commit protocol, and it supports the newer `mapreduce` API instead of the old `mapred` API supported by `SparkHadoopWriter`.
2. Rewrite `PairRDDFunctions.saveAsNewAPIHadoopDataset` so that it uses the commit protocol.
## How was this patch tested?
Existing test cases.
Author: jiangxingbo <jiangxb1987@gmail.com>
Closes #15769 from jiangxb1987/rdd-commit.
## What changes were proposed in this pull request?
This pull request contains the fix for the critical bug SPARK-16575. It corrects the partition calculation in BinaryFileRDD: an RDD created with sc.binaryFiles always consisted of just two partitions.
## How was this patch tested?
The original issue (getNumPartitions on a binary files RDD always returning two partitions) was first reproduced and then retested with the changes applied. The unit tests were also run and passed.
This contribution is my original work and I license the work to the project under the project's open source license.
srowen hvanhovell rxin vanzin skyluc kmader zsxwing datafarmer Please have a look.
Author: fidato <fidato.july13@gmail.com>
Closes #15327 from fidato13/SPARK-16575.
## What changes were proposed in this pull request?
When profiling heap dumps from the HistoryServer and live Spark web UIs, I found a large amount of memory being wasted on duplicated objects and strings. This patch's changes remove most of this duplication, resulting in over 40% memory savings for some benchmarks.
- **Task metrics** (6441f0624dfcda9c7193a64bfb416a145b5aabdf): previously, every `TaskUIData` object would have its own instances of `InputMetricsUIData`, `OutputMetricsUIData`, `ShuffleReadMetrics`, and `ShuffleWriteMetrics`, but for many tasks these metrics are irrelevant because they're all zero. This patch changes how we construct these metrics in order to re-use a single immutable "empty" value for the cases where these metrics are empty.
- **TaskInfo.accumulables** (ade86db901127bf13c0e0bdc3f09c933a093bb76): Previously, every `TaskInfo` object had its own empty `ListBuffer` for holding updates from named accumulators. Tasks which didn't use named accumulators still paid for the cost of allocating and storing this empty buffer. To avoid this overhead, I changed the `val` with a mutable buffer into a `var` which holds an immutable Scala list, allowing tasks which do not have named accumulator updates to share the same singleton `Nil` object.
- **String.intern() in JSONProtocol** (7e05630e9a78c455db8c8c499f0590c864624e05): in the HistoryServer, executor hostnames and ids are deserialized from JSON, leading to massive duplication of these string objects. By calling `String.intern()` on the deserialized values we can remove all of this duplication. Since Spark now requires Java 7+ we don't have to worry about string interning exhausting the permgen (see http://java-performance.info/string-intern-in-java-6-7-8/).
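Two of the deduplication ideas above, sharing one immutable "empty" value and interning deserialized strings, can be sketched without any Spark classes. `DedupSketch`, `InputMetrics`, and `hostFromJson` are hypothetical stand-ins, not the types touched by this patch.

```scala
object DedupSketch {
  // Hypothetical stand-in for a per-task metrics record.
  final case class InputMetrics(bytesRead: Long, recordsRead: Long)

  // One shared instance for the common all-zero case.
  val EmptyInput: InputMetrics = InputMetrics(0L, 0L)

  // Return the shared instance instead of allocating a new all-zero record.
  def inputMetrics(bytesRead: Long, recordsRead: Long): InputMetrics =
    if (bytesRead == 0L && recordsRead == 0L) EmptyInput
    else InputMetrics(bytesRead, recordsRead)

  // Interning maps equal strings to one canonical instance, so a hostname that is
  // deserialized thousands of times is stored once.
  def hostFromJson(raw: String): String = raw.intern()
}
```

Reference equality (`eq`) is what shows the saving: two all-zero records, or two equal hostnames, end up as the same object.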
## How was this patch tested?
I ran
```
sc.parallelize(1 to 100000, 100000).count()
```
in `spark-shell` with event logging enabled, then loaded that event log in the HistoryServer, performed a full GC, and took a heap dump. According to YourKit, the changes in this patch reduced memory consumption by roughly 28 megabytes (or 770k Java objects):
![image](https://cloud.githubusercontent.com/assets/50748/19953276/4f3a28aa-a129-11e6-93df-d7fa91396f66.png)
Here's a table illustrating the drop in objects due to deduplication (the drop is <100k for some objects because some events were dropped from the listener bus; this is a separate, existing bug that I'll address separately after CPU-profiling):
![image](https://cloud.githubusercontent.com/assets/50748/19953290/6a271290-a129-11e6-93ad-b825f1448886.png)
Author: Josh Rosen <joshrosen@databricks.com>
Closes #15743 from JoshRosen/spark-ui-memory-usage.
## What changes were proposed in this pull request?
Close `FileStreams`, `ZipFiles`, etc. to release resources after use. Leaving them open causes an IOException when deleting temp files.
## How was this patch tested?
Existing tests
Author: U-FAREAST\tl <tl@microsoft.com>
Author: hyukjinkwon <gurwls223@gmail.com>
Author: Tao LI <tl@microsoft.com>
Closes #15618 from HyukjinKwon/SPARK-14914-1.
## What changes were proposed in this pull request?
Enabled SparkR with Mesos client mode and cluster mode. Just a few changes were required to get this working on Mesos: (1) removed the SparkR on Mesos error checks and (2) do not require "--class" to be specified for R apps. The logic to check spark.mesos.executor.home was already in there.
sun-rui
## How was this patch tested?
1. SparkSubmitSuite
2. On local mesos cluster (on laptop): ran SparkR shell, spark-submit client mode, and spark-submit cluster mode, with the "examples/src/main/R/dataframe.R" example application.
3. On multi-node mesos cluster: ran SparkR shell, spark-submit client mode, and spark-submit cluster mode, with the "examples/src/main/R/dataframe.R" example application. I tested with the following --conf values set: spark.mesos.executor.docker.image and spark.mesos.executor.home
This contribution is my original work and I license the work to the project under the project's open source license.
Author: Susan X. Huynh <xhuynh@mesosphere.com>
Closes #15700 from susanxhuynh/susan-r-branch.
## What changes were proposed in this pull request?
Add comments.
## How was this patch tested?
Build passed.
Author: Weiqing Yang <yangweiqing001@gmail.com>
Closes #15776 from weiqingy/SPARK-17710.
## What changes were proposed in this pull request?
This patch significantly improves the performance of event log replay in the HistoryServer via two simple changes:
- **Don't use `extractOpt`**: it turns out that `json4s`'s `extractOpt` method uses exceptions for control flow, causing huge performance bottlenecks due to the overhead of initializing exceptions. To avoid this overhead, we can simply use our own `Utils.jsonOption` method. This patch replaces all uses of `extractOpt` with `Utils.jsonOption` and adds a style checker rule to ban the use of the slow `extractOpt` method.
- **Don't call `Utils.getFormattedClassName` for every event**: the old code called `Utils.getFormattedClassName` dozens of times per replayed event in order to match up class names in events with SparkListener event names. By simply storing the results of these calls in constants rather than recomputing them, we're able to eliminate a huge performance hotspot by removing thousands of expensive `Class.getSimpleName` calls.
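The second bullet amounts to computing each class name once and matching against the stored constant. A rough sketch, with hypothetical event classes (`JobStart`, `JobEnd`) standing in for the real SparkListener events:

```scala
object EventNamesSketch {
  // Hypothetical event types; the real code matches SparkListener events.
  final case class JobStart(jobId: Int)
  final case class JobEnd(jobId: Int)

  // Computed once when the object is initialised, not once per replayed event.
  val JobStartName: String = classOf[JobStart].getSimpleName
  val JobEndName: String = classOf[JobEnd].getSimpleName

  // Capitalised vals are stable identifiers, so they can be used directly as patterns.
  def kind(eventName: String): String = eventName match {
    case JobStartName => "start"
    case JobEndName => "end"
    case _ => "other"
  }
}
```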
## How was this patch tested?
Tested by profiling the replay of a long event log using YourKit. For an event log containing 1000+ jobs, each of which had thousands of tasks, the changes in this patch cut the replay time in half:
![image](https://cloud.githubusercontent.com/assets/50748/19980953/31154622-a1bd-11e6-9be4-21fbb9b3f9a7.png)
Prior to this patch's changes, the two slowest methods in log replay were internal exceptions thrown by `Json4S` and calls to `Class.getSimpleName()`:
![image](https://cloud.githubusercontent.com/assets/50748/19981052/87416cce-a1bd-11e6-9f25-06a7cd391822.png)
After this patch, these hotspots are completely eliminated.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #15756 from JoshRosen/speed-up-jsonprotocol.
## What changes were proposed in this pull request?
This change runs the fastest comparison test first. With it, we observed a 1% throughput improvement on PageRank (HiBench large profile).
Profiling with tprof showed that before the change, AppendOnlyMap.changeValue (where the optimisation occurs) accounted for 8053 profiling ticks, or 0.72% of overall application time.
After the change, the method accounted for only 2786 ticks, or 0.25% of overall time.
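The description does not say which comparison was moved, so the following only illustrates the general principle of putting a cheap test in front of an expensive one; it is not the `AppendOnlyMap` code. `KeyMatchSketch` and `CountingKey` are hypothetical.

```scala
object KeyMatchSketch {
  // A key that counts how often its (potentially expensive) equals() is invoked.
  final class CountingKey(val id: Int) {
    var equalsCalls = 0
    override def equals(other: Any): Boolean = {
      equalsCalls += 1
      other match {
        case k: CountingKey => k.id == id
        case _ => false
      }
    }
    override def hashCode(): Int = id
  }

  // Reference equality is a pointer comparison; equals() only runs when it fails.
  def sameKey(a: AnyRef, b: AnyRef): Boolean = a.eq(b) || a.equals(b)
}
```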
## How was this patch tested?
Existing unit tests and for performance we used HiBench large, profiling with tprof and IBM Healthcenter.
Author: Adam Roberts <aroberts@uk.ibm.com>
Closes #15714 from a-roberts/patch-9.
## What changes were proposed in this pull request?
Document that Java 7, Python 2.6, Scala 2.10, Hadoop < 2.6 are deprecated in Spark 2.1.0. This does not actually implement any of the change in SPARK-18138, just peppers the documentation with notices about it.
## How was this patch tested?
Doc build
Author: Sean Owen <sowen@cloudera.com>
Closes #15733 from srowen/SPARK-18138.
## What changes were proposed in this pull request?
This patch moves the new commit protocol API from sql/core to core module, so we can use it in the future in the RDD API.
As part of this patch, I also moved the specification of the random uuid for the write path out of the commit protocol, and instead pass in a job id.
## How was this patch tested?
N/A
Author: Reynold Xin <rxin@databricks.com>
Closes #15731 from rxin/SPARK-18219.
## What changes were proposed in this pull request?
[SPARK-18200](https://issues.apache.org/jira/browse/SPARK-18200) reports that Apache Spark 2.x raises `java.lang.IllegalArgumentException: requirement failed: Invalid initial capacity` while running `triangleCount`. The root cause is that `VertexSet`, a type alias of `OpenHashSet`, does not allow zero as an initial size. This PR loosens the restriction to allow zero.
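A sketch of what a loosened capacity check could look like, assuming the set rounds its capacity up to a power of two. `CapacitySketch.tableSize` is a hypothetical helper, not the `OpenHashSet` source.

```scala
object CapacitySketch {
  // Smallest power of two >= initialCapacity, with 0 mapped to 1 so that an
  // empty set still gets a valid backing table.
  def tableSize(initialCapacity: Int): Int = {
    require(initialCapacity >= 0, "Invalid initial capacity")
    require(initialCapacity <= (1 << 30), "Initial capacity too large")
    if (initialCapacity == 0) {
      1
    } else {
      val highBit = Integer.highestOneBit(initialCapacity)
      if (highBit == initialCapacity) initialCapacity else highBit << 1
    }
  }
}
```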
## How was this patch tested?
Pass the Jenkins test with a new test case in `OpenHashSetSuite`.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #15741 from dongjoon-hyun/SPARK-18200.
## What changes were proposed in this pull request?
spark.files is still passed to the driver in YARN mode, so SparkContext still handles it, which causes the error described in the JIRA.
## How was this patch tested?
Tested manually in a 5-node cluster. Because this issue only happens in a multi-node cluster, I didn't write a test for it.
Author: Jeff Zhang <zjffdu@apache.org>
Closes #15669 from zjffdu/SPARK-18160.
## What changes were proposed in this pull request?
When a user appends a column using a "nondeterministic" function to a DataFrame, e.g., `rand`, `randn`, and `monotonically_increasing_id`, the expected semantics are the following:
- The value in each row should remain unchanged, as if we materialize the column immediately, regardless of later DataFrame operations.
However, since we use `TaskContext.getPartitionId` to get the partition index from the current thread, the values from nondeterministic columns might change if we call `union` or `coalesce` afterwards. `TaskContext.getPartitionId` returns the partition index of the current Spark task, which might not be the corresponding partition index of the DataFrame where we defined the column.
See the unit tests below or JIRA for examples.
This PR uses the partition index from `RDD.mapPartitionsWithIndex` instead of `TaskContext` and fixes the partition initialization logic in whole-stage codegen, normal codegen, and codegen fallback. `initializeStatesForPartition(partitionIndex: Int)` was added to `Projection`, `Nondeterministic`, and `Predicate` (codegen) and initialized right after object creation in `mapPartitionsWithIndex`. `newPredicate` now returns a `Predicate` instance rather than a function for proper initialization.
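The effect can be imitated without Spark. In this sketch a "DataFrame" is just a vector of partitions, and a `rand`-like column is seeded with `seed + partitionIndex` (an assumption made for illustration). `PartitionIndexSketch` and its methods are hypothetical names.

```scala
import scala.util.Random

object PartitionIndexSketch {
  // A "DataFrame" here is a list of partitions, each a list of row ids.
  type Partitions = Vector[Vector[Int]]

  // Values of a rand(seed)-like column for one partition, seeded by a partition index.
  def randColumn(seed: Long, partitionIndex: Int, rows: Vector[Int]): Vector[(Int, Double)] = {
    val rng = new Random(seed + partitionIndex)
    rows.map(r => r -> rng.nextDouble())
  }

  // Intended behaviour: the index is that of the partition where the column was defined.
  def byDefiningPartition(seed: Long, parts: Partitions): Vector[(Int, Double)] =
    parts.zipWithIndex.flatMap { case (rows, idx) => randColumn(seed, idx, rows) }

  // Analogue of the bug: after coalescing everything into one task, every row is
  // evaluated with the running task's index (0), so later rows get different values.
  def byRunningTask(seed: Long, parts: Partitions): Vector[(Int, Double)] =
    randColumn(seed, 0, parts.flatten)
}
```

Rows from the first partition agree in both variants; rows from the second partition do not, which mirrors the changed values seen after `coalesce`.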
## How was this patch tested?
Unit tests. (Actually I'm not very confident that this PR fixed all issues without introducing new ones ...)
cc: rxin davies
Author: Xiangrui Meng <meng@databricks.com>
Closes #15567 from mengxr/SPARK-14393.
## What changes were proposed in this pull request?
Fix the locale to `Locale.US` for all usages of `DateFormat` and `NumberFormat`.
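A short illustration of why the locale matters, using only `java.text.NumberFormat`; `LocaleSketch` is a hypothetical wrapper. The same number renders differently depending on the locale, so output that is parsed or compared elsewhere should not depend on the JVM default.

```scala
import java.text.NumberFormat
import java.util.Locale

object LocaleSketch {
  // Always formats with US conventions, regardless of the JVM default locale.
  def formatUs(value: Double): String =
    NumberFormat.getInstance(Locale.US).format(value)

  // Formats with an explicitly chosen locale, for comparison.
  def formatIn(locale: Locale, value: Double): String =
    NumberFormat.getInstance(locale).format(value)
}
```

For example, `LocaleSketch.formatUs(1234.5)` gives `1,234.5`, while `LocaleSketch.formatIn(Locale.GERMANY, 1234.5)` gives `1.234,5`.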
## How was this patch tested?
Existing tests.
Author: Sean Owen <sowen@cloudera.com>
Closes #15610 from srowen/SPARK-18076.
## What changes were proposed in this pull request?
Removing `appUIAddress` attribute since it is no longer in use.
## How was this patch tested?
Local build
Author: Jacek Laskowski <jacek@japila.pl>
Closes #15603 from jaceklaskowski/sparkui-fixes.
## What changes were proposed in this pull request?
This adds information to the web UI thread dump page about the JVM locks
held by threads and the locks that threads are blocked waiting to
acquire. This should help find cases where lock contention is causing
Spark applications to run slowly.
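The JDK already exposes this kind of lock information through `ThreadMXBean`. The sketch below shows one way to collect it; `ThreadLockSketch` and `LockSummary` are hypothetical names and this is not the code added to the web UI.

```scala
import java.lang.management.ManagementFactory

object ThreadLockSketch {
  // Hypothetical summary of one thread's locking state.
  final case class LockSummary(
      threadName: String,
      holding: Seq[String],      // monitors and synchronizers held by the thread
      blockedOn: Option[String], // lock the thread is waiting for, if any
      blockedBy: Option[Long])   // id of the thread that owns that lock, if known

  def snapshot(): Seq[LockSummary] = {
    // Both flags request lock details: locked monitors and locked synchronizers.
    val infos = ManagementFactory.getThreadMXBean.dumpAllThreads(true, true)
    infos.toSeq.map { info =>
      val monitors = info.getLockedMonitors.map { m =>
        s"Monitor(${m.getClassName}@${m.getIdentityHashCode})"
      }
      val syncs = info.getLockedSynchronizers.map { l =>
        s"Lock(${l.getClassName}@${l.getIdentityHashCode})"
      }
      LockSummary(
        info.getThreadName,
        (monitors ++ syncs).toSeq,
        Option(info.getLockName),                 // null when the thread is not blocked
        Some(info.getLockOwnerId).filter(_ >= 0)) // -1 means no owner
    }
  }
}
```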
## How was this patch tested?
Tested by applying this patch and viewing the change in the web UI.
![thread-lock-info](https://cloud.githubusercontent.com/assets/87915/18493057/6e5da870-79c3-11e6-8c20-f54c18a37544.png)
Additions:
- A "Thread Locking" column with the locks held by the thread or that are blocking the thread
- Links from a blocked thread to the thread holding the lock
- Stack frames show where threads are inside `synchronized` blocks, "holding Monitor(...)"
Author: Ryan Blue <blue@apache.org>
Closes #15088 from rdblue/SPARK-17532-add-thread-lock-info.
The `ReplayListenerBus.read()` method is used when implementing a custom `ApplicationHistoryProvider`. The current interface only exposes a `read()` method which takes an `InputStream` and performs stream-to-lines conversion itself, but it would also be useful to expose an overloaded method which accepts an iterator of strings, thereby enabling events to be provided from non-`InputStream` sources.
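The shape of such an overload can be sketched as follows. `ReplaySketch.replay` is a hypothetical stand-in for the real method: the iterator-based variant holds the logic, and the stream-based variant only converts the stream to lines and delegates.

```scala
import java.io.InputStream
import scala.io.Source

object ReplaySketch {
  // Core entry point: consumes already-split lines from any source.
  // Returns the number of non-blank lines handed to `onEvent`.
  def replay(lines: Iterator[String], onEvent: String => Unit): Int = {
    var count = 0
    lines.filter(_.trim.nonEmpty).foreach { line =>
      onEvent(line)
      count += 1
    }
    count
  }

  // Convenience overload: performs the stream-to-lines conversion, then delegates.
  def replay(in: InputStream, onEvent: String => Unit): Int =
    replay(Source.fromInputStream(in, "UTF-8").getLines(), onEvent)
}
```

With this split, a provider that keeps events in a database or message queue can feed the iterator variant directly.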
Author: Josh Rosen <joshrosen@databricks.com>
Closes #15698 from JoshRosen/replay-listener-bus-interface.
## What changes were proposed in this pull request?
Because of the refactoring work in Structured Streaming, the event logs generated by Structured Streaming in Spark 2.0.0 and 2.0.1 cannot be parsed.
This PR just ignores these logs in ReplayListenerBus because nothing uses them.
## How was this patch tested?
- Generated events logs using Spark 2.0.0 and 2.0.1, and saved them as `structured-streaming-query-event-logs-2.0.0.txt` and `structured-streaming-query-event-logs-2.0.1.txt`
- The newly added test makes sure ReplayListenerBus skips these bad JSON events.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #15663 from zsxwing/fix-event-log.
## What changes were proposed in this pull request?
This patch makes the RBackend connection timeout user-configurable.
## How was this patch tested?
N/A
Author: Hossein <hossein@databricks.com>
Closes #15471 from falaki/SPARK-17919.
## What changes were proposed in this pull request?
To reduce the number of components in SQL named *Catalog, rename *FileCatalog to *FileIndex. A FileIndex is responsible for returning the list of partitions / files to scan given a filtering expression.
```
TableFileCatalog => CatalogFileIndex
FileCatalog => FileIndex
ListingFileCatalog => InMemoryFileIndex
MetadataLogFileCatalog => MetadataLogFileIndex
PrunedTableFileCatalog => PrunedInMemoryFileIndex
```
cc yhuai marmbrus
## How was this patch tested?
N/A
Author: Eric Liang <ekl@databricks.com>
Author: Eric Liang <ekhliang@gmail.com>
Closes #15634 from ericl/rename-file-provider.
## What changes were proposed in this pull request?
`Utils.getIteratorZipWithIndex` was added to deal with more than 2147483647 records in one partition.
However, `getIteratorZipWithIndex` accepts a `startIndex` < 0, which leads to negative indices.
This PR just adds a defensive check on `startIndex` to make sure it is >= 0.
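A sketch of a Long-indexed zip with the defensive check, written from the description above rather than copied from `Utils`; `ZipSketch.zipWithLongIndex` is a hypothetical name.

```scala
object ZipSketch {
  // Pairs each element with a Long index, so the index keeps counting past Int.MaxValue.
  def zipWithLongIndex[T](iterator: Iterator[T], startIndex: Long): Iterator[(T, Long)] = {
    // Defensive check: a negative start would yield negative indices.
    require(startIndex >= 0, "startIndex should be >= 0.")
    new Iterator[(T, Long)] {
      private var index: Long = startIndex - 1L
      def hasNext: Boolean = iterator.hasNext
      def next(): (T, Long) = {
        index += 1L
        (iterator.next(), index)
      }
    }
  }
}
```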
## How was this patch tested?
Add a new unit test.
Author: Miao Wang <miaowang@Miaos-MacBook-Pro.local>
Closes #15639 from wangmiao1981/zip.
## What changes were proposed in this pull request?
Calling `Await.result` will allow other tasks to be run on the same thread when using ForkJoinPool. However, SQL uses a `ThreadLocal` execution id to trace Spark jobs launched by a query, which doesn't work perfectly in ForkJoinPool.
This PR just uses `Awaitable.result` instead to prevent ForkJoinPool from running other tasks in the current waiting thread.
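For reference, `Await.result` wraps the wait in `blocking { ... }`, which is the hook that lets a ForkJoinPool schedule other work on the waiting thread. A sketch of calling `Awaitable.result` directly is below. Passing a `null` `CanAwait` permit is a workaround assumed here for illustration, since the permit has no public constructor; `AwaitSketch` is a hypothetical name.

```scala
import scala.concurrent.{Awaitable, CanAwait}
import scala.concurrent.duration.Duration

object AwaitSketch {
  // Waits on the awaitable without going through Await.result, and therefore
  // without the blocking { ... } wrapper that notifies the pool.
  def awaitResult[T](awaitable: Awaitable[T], atMost: Duration): T = {
    // Assumption: the permit is only an evidence parameter and is not dereferenced.
    val permit = null.asInstanceOf[CanAwait]
    awaitable.result(atMost)(permit)
  }
}
```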
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #15520 from zsxwing/SPARK-13747.
## What changes were proposed in this pull request?
[SPARK-16757](https://issues.apache.org/jira/browse/SPARK-16757) sets the Hadoop `CallerContext` when calling Hadoop/HDFS APIs to make Spark applications more diagnosable in Hadoop/HDFS logs. However, the `org.apache.hadoop.ipc.CallerContext` class was only added in [hadoop 2.8](https://issues.apache.org/jira/browse/HDFS-9184), which is not officially released yet. So each time `utils.CallerContext.setCurrentContext()` is called (e.g. [when a task is created](https://github.com/apache/spark/blob/b678e46/core/src/main/scala/org/apache/spark/scheduler/Task.scala#L95-L96)), a "java.lang.ClassNotFoundException: org.apache.hadoop.ipc.CallerContext" error is logged, which pollutes the Spark logs when there are lots of tasks.
This patch improves this behaviour by only logging the `ClassNotFoundException` once.
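One simple way to get "log once" behaviour is to cache the class lookup in a `lazy val`. The sketch below shows that pattern in isolation; `ClassProbe` and its logging callback are hypothetical, and the patch itself may be structured differently.

```scala
object ProbeOnceSketch {
  // Hypothetical probe: checks whether an optional class is on the classpath.
  final class ClassProbe(className: String, log: String => Unit) {
    // lazy val: the lookup, and the log line on failure, happen at most once.
    lazy val available: Boolean =
      try {
        Class.forName(className)
        true
      } catch {
        case _: ClassNotFoundException =>
          log(s"Class $className not found; will not try again")
          false
      }
  }
}
```

Every later call site reads the cached `available` flag instead of triggering another failed lookup.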
## How was this patch tested?
Existing tests.
Author: Shuai Lin <linshuai2012@gmail.com>
Closes #15377 from lins05/spark-17802-improve-callercontext-logging.
## What changes were proposed in this pull request?
Currently users can kill stages via the web ui but not jobs directly (jobs are killed if one of their stages is). I've added the ability to kill jobs via the web ui. This code change is based on #4823 by lianhuiwang and updated to work with the latest code matching how stages are currently killed. In general I've copied the kill stage code warning and note comments and all. I also updated applicable tests and documentation.
## How was this patch tested?
Manually tested and dev/run-tests
![screen shot 2016-10-11 at 4 49 43 pm](https://cloud.githubusercontent.com/assets/13952758/19292857/12f1b7c0-8fd4-11e6-8982-210249f7b697.png)
Author: Alex Bozarth <ajbozart@us.ibm.com>
Author: Lianhui Wang <lianhuiwang09@gmail.com>
Closes #15441 from ajbozarth/spark4411.
The Spark history server log needs to be fixed to show the HTTPS URL when SSL is enabled.
Author: chie8842 <chie@chie-no-Mac-mini.local>
Closes #15611 from hayashidac/SPARK-16988.
## What changes were proposed in this pull request?
Allow ReplayListenerBus to skip deserialising and replaying certain events using an inexpensive check of the event log entry. Use this to ensure that when event log replay is triggered for building the application list, ReplayListenerBus skips over all but the few events needed for our immediate purpose. Refer to [SPARK-18010] for the motivation behind this change.
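The idea of an inexpensive pre-parse check can be sketched as a predicate over the raw log line. `ReplayFilterSketch`, `onlyEvents`, and the event names used with it are hypothetical, and the sketch assumes each line starts with its `"Event"` field.

```scala
object ReplayFilterSketch {
  // A cheap test on the raw JSON line, applied before any parsing.
  type LineFilter = String => Boolean

  // Keeps only lines whose leading "Event" field names one of the wanted events.
  def onlyEvents(names: Set[String]): LineFilter =
    line => names.exists(n => line.startsWith("{\"Event\":\"" + n + "\""))

  // Parses only the lines accepted by the filter; returns how many were parsed.
  def replay(lines: Iterator[String], keep: LineFilter)(parse: String => Unit): Int = {
    var parsed = 0
    lines.foreach { line =>
      if (keep(line)) {
        parse(line)
        parsed += 1
      }
    }
    parsed
  }
}
```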
## How was this patch tested?
Tested with existing HistoryServer and ReplayListener unit test suites. All tests pass.
Author: Vinayak <vijoshi5@in.ibm.com>
Closes #15556 from vijoshi/SAAS-467_master.
`TaskSetManager` should have a unique name to avoid adding duplicates to the parent `Pool` via `SchedulableBuilder`. This problem surfaced in the following discussion: [[PR: Avoid adding duplicate schedulables]](https://github.com/apache/spark/pull/15326)
**Proposal** :
There is a one-to-one relationship between `stageAttemptId` and `TaskSetManager`, so `taskSet.Id`, which covers both `stageId` and `stageAttemptId`, should be used to make the `TaskSetManager` name unique instead of just `stageId`.
**Current TaskSetManager Name** :
`var name = "TaskSet_" + taskSet.stageId.toString`
**Sample**: TaskSet_0
**Proposed TaskSetManager Name** :
`val name = "TaskSet_" + taskSet.Id ` `// taskSet.Id = (stageId + "." + stageAttemptId)`
**Sample** : TaskSet_0.0
Added new Unit Test.
Author: erenavsarogullari <erenavsarogullari@gmail.com>
Closes #15463 from erenavsarogullari/SPARK-17894.
## What changes were proposed in this pull request?
Add missing tests for `truePositiveRate` and `weightedTruePositiveRate` in `MulticlassMetricsSuite`
## How was this patch tested?
added testing
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #15585 from zhengruifeng/mc_missing_test.
## What changes were proposed in this pull request?
Add a `require` check in `CoalescedRDD` to make sure the `partitionCoalescer` passed in is serializable,
and update the documentation for the `RDD.coalesce` API.
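A sketch of such a check, with a hypothetical `Coalescer` trait standing in for the real `PartitionCoalescer` and an optional argument to model "no custom coalescer supplied":

```scala
object CoalescerCheckSketch {
  // Hypothetical stand-in for a user-supplied partition coalescer.
  trait Coalescer { def targetPartitions(current: Int): Int }

  // Serializable implementation: acceptable, since it can be shipped with the RDD.
  final class HalvingCoalescer extends Coalescer with Serializable {
    def targetPartitions(current: Int): Int = math.max(1, current / 2)
  }

  // Not serializable: should be rejected up front rather than fail later at job time.
  final class LocalOnlyCoalescer extends Coalescer {
    def targetPartitions(current: Int): Int = 1
  }

  def checked(coalescer: Option[Coalescer]): Option[Coalescer] = {
    require(
      coalescer.forall(_.isInstanceOf[java.io.Serializable]),
      "The partition coalescer passed in must be serializable.")
    coalescer
  }
}
```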
## How was this patch tested?
Manual (test code in JIRA [SPARK-18051]).
Author: WeichenXu <WeichenXu123@outlook.com>
Closes #15587 from WeichenXu123/fix_coalescer_bug.
## What changes were proposed in this pull request?
In SPARK-16980, we removed the full in-memory cache of table partitions in favor of loading only needed partitions from the metastore. This greatly improves the initial latency of queries that only read a small fraction of table partitions.
However, since the metastore does not store file statistics, we need to discover those from remote storage. With the loss of the in-memory file status cache this has to happen on each query, increasing the latency of repeated queries over the same partitions.
The proposal is to add back a per-table cache of partition contents, i.e. Map[Path, Array[FileStatus]]. This cache would be retained per-table, and can be invalidated through refreshTable() and refreshByPath(). Unlike the prior cache, it can be incrementally updated as new partitions are read.
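A toy version of an incrementally filled, invalidatable cache of partition contents. Paths and file statuses are modelled as plain strings, and `FileStatusCacheSketch.Cache` with its `list` function is hypothetical. It mirrors the shape of the proposal (fill on first read, invalidate by table or by path), not the actual implementation.

```scala
import scala.collection.mutable

object FileStatusCacheSketch {
  // `list` stands in for the remote-storage listing call for one partition directory.
  final class Cache(list: String => Seq[String]) {
    private val byPartition = mutable.Map.empty[String, Seq[String]]
    var listings = 0 // how many times remote storage was actually listed

    // Filled incrementally: only partitions that are read get listed and cached.
    def files(partitionPath: String): Seq[String] = synchronized {
      byPartition.getOrElseUpdate(partitionPath, {
        listings += 1
        list(partitionPath)
      })
    }

    // Drops cached entries under the given path prefix.
    def refreshByPath(prefix: String): Unit = synchronized {
      byPartition --= byPartition.keys.filter(_.startsWith(prefix)).toList
    }

    // Drops everything cached for the table.
    def refreshTable(): Unit = synchronized { byPartition.clear() }
  }
}
```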
## How was this patch tested?
Existing tests and new tests in `HiveTablePerfStatsSuite`.
cc mallman
Author: Eric Liang <ekl@databricks.com>
Author: Michael Allman <michael@videoamp.com>
Author: Eric Liang <ekhliang@gmail.com>
Closes #15539 from ericl/meta-cache.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-17929
`CoarseGrainedSchedulerBackend.reset` currently acquires the lock:
```
protected def reset(): Unit = synchronized {
  numPendingExecutors = 0
  executorsPendingToRemove.clear()
  // Remove all the lingering executors that should be removed but not yet. The reason might be
  // because (1) disconnected event is not yet received; (2) executors die silently.
  executorDataMap.toMap.foreach { case (eid, _) =>
    driverEndpoint.askWithRetry[Boolean](
      RemoveExecutor(eid, SlaveLost("Stale executor after cluster manager re-registered.")))
  }
}
```
but `removeExecutor` also needs the lock `CoarseGrainedSchedulerBackend.this.synchronized`, which causes a deadlock:
```
private def removeExecutor(executorId: String, reason: ExecutorLossReason): Unit = {
  logDebug(s"Asked to remove executor $executorId with reason $reason")
  executorDataMap.get(executorId) match {
    case Some(executorInfo) =>
      // This must be synchronized because variables mutated
      // in this block are read when requesting executors
      val killed = CoarseGrainedSchedulerBackend.this.synchronized {
        addressToExecutorId -= executorInfo.executorAddress
        executorDataMap -= executorId
        executorsPendingLossReason -= executorId
        executorsPendingToRemove.remove(executorId).getOrElse(false)
      }
      ...
```
## How was this patch tested?
Manual test.
Author: w00228970 <wangfei1@huawei.com>
Closes #15481 from scwf/spark-17929.
## What changes were proposed in this pull request?
NA date values are serialized as "NA" and NA time values are serialized as NaN from R. In the backend we did not have proper logic to deal with them. As a result we got an IllegalArgumentException for Date and wrong value for time. This PR adds support for deserializing NA as Date and Time.
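The two NA cases can be handled roughly as below. This sketch works on already-decoded values rather than the backend's input stream, uses `Option` instead of `null`, and assumes the time value is a count of seconds; `NaSerdeSketch` and its method names are hypothetical.

```scala
import java.sql.{Date, Timestamp}

object NaSerdeSketch {
  // R sends NA dates as the string "NA"; treat that as a missing value.
  def readDate(raw: String): Option[Date] =
    if (raw == "NA") None else Some(Date.valueOf(raw))

  // R sends NA times as NaN; any other value is read as (fractional) seconds.
  def readTime(seconds: Double): Option[Timestamp] =
    if (seconds.isNaN) {
      None
    } else {
      val whole = math.floor(seconds).toLong
      val ts = new Timestamp(whole * 1000L)
      ts.setNanos(((seconds - whole) * 1e9).toInt)
      Some(ts)
    }
}
```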
## How was this patch tested?
* [x] TODO
Author: Hossein <hossein@databricks.com>
Closes #15421 from falaki/SPARK-17811.
## What changes were proposed in this pull request?
1) Upgrade the Py4J version on the Java side
2) Update the py4j src zip file we bundle with Spark
## How was this patch tested?
Existing doctests & unit tests pass
Author: Jagadeesan <as2@us.ibm.com>
Closes #15514 from jagadeesanas2/SPARK-17960.
## What changes were proposed in this pull request?
- Fix bug of RDD `zipWithIndex` generating wrong result when one partition contains more than 2147483647 records.
- Fix bug of RDD `zipWithUniqueId` generating wrong result when one partition contains more than 2147483647 records.
## How was this patch tested?
Test added.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes #15550 from WeichenXu123/fix_rdd_zipWithIndex_overflow.
## What changes were proposed in this pull request?
I've added a method to `ApplicationHistoryProvider` that returns the html paragraph to display when there are no applications. This allows providers other than `FsHistoryProvider` to determine what is printed. The current hard-coded text is now moved into `FsHistoryProvider`, since it assumed that provider was the one in use.
I chose to make the function return html rather than text because the current text block had inline html in it and it allows a new implementation of `ApplicationHistoryProvider` more versatility. I did not see any security issues with this since injecting html here requires implementing `ApplicationHistoryProvider` and can't be done outside of code.
## How was this patch tested?
Manual testing and dev/run-tests
No visible changes to the UI
Author: Alex Bozarth <ajbozart@us.ibm.com>
Closes #15490 from ajbozarth/spark10541.
## What changes were proposed in this pull request?
Fix hadoop2.2 compilation error.
## How was this patch tested?
Existing tests.
cc tdas zsxwing
Author: Yu Peng <loneknightpy@gmail.com>
Closes #15537 from loneknightpy/fix-17711.
## What changes were proposed in this pull request?
The following code is called when the DirectTaskResult instance is deserialized:
```scala
def value(): T = {
  if (valueObjectDeserialized) {
    valueObject
  } else {
    // Each deserialization creates a new instance of SerializerInstance, which is very time-consuming
    val resultSer = SparkEnv.get.serializer.newInstance()
    valueObject = resultSer.deserialize(valueBytes)
    valueObjectDeserialized = true
    valueObject
  }
}
```
When a stage has a lot of tasks, reusing the SerializerInstance can improve scheduling performance by a factor of three.
The test data is TPC-DS 2T (Parquet) and the SQL statement is as follows (query 2):
```sql
select i_item_id,
avg(ss_quantity) agg1,
avg(ss_list_price) agg2,
avg(ss_coupon_amt) agg3,
avg(ss_sales_price) agg4
from store_sales, customer_demographics, date_dim, item, promotion
where ss_sold_date_sk = d_date_sk and
ss_item_sk = i_item_sk and
ss_cdemo_sk = cd_demo_sk and
ss_promo_sk = p_promo_sk and
cd_gender = 'M' and
cd_marital_status = 'M' and
cd_education_status = '4 yr Degree' and
(p_channel_email = 'N' or p_channel_event = 'N') and
d_year = 2001
group by i_item_id
order by i_item_id
limit 100;
```
`spark-defaults.conf` file:
```
spark.master yarn-client
spark.executor.instances 20
spark.driver.memory 16g
spark.executor.memory 30g
spark.executor.cores 5
spark.default.parallelism 100
spark.sql.shuffle.partitions 100000
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.maxResultSize 0
spark.rpc.netty.dispatcher.numThreads 8
spark.executor.extraJavaOptions -XX:+UseG1GC -XX:+UseStringDeduplication -XX:G1HeapRegionSize=16M -XX:MetaspaceSize=256M
spark.cleaner.referenceTracking.blocking true
spark.cleaner.referenceTracking.blocking.shuffle true
```
Performance test results are as follows:

| [SPARK-17930](https://github.com/witgo/spark/tree/SPARK-17930) | ed14633414 |
| ------------ | ------------- |
| 54.5 s | 231.7 s |
## How was this patch tested?
Existing tests.
Author: Guoqiang Li <witgo@qq.com>
Closes #15512 from witgo/SPARK-17930.
## What changes were proposed in this pull request?
This PR adds support for executor log compression.
## How was this patch tested?
Unit tests
cc: yhuai tdas mengxr
Author: Yu Peng <loneknightpy@gmail.com>
Closes #15285 from loneknightpy/compress-executor-log.
This work has largely been done by lw-lin in his PR #15497. This is a slight refactoring of it.
## What changes were proposed in this pull request?
There were two sources of flakiness in StreamingQueryListener test.
- When testing with manual clock, consecutive attempts to advance the clock can occur without the stream execution thread being unblocked and doing some work between the two attempts. Hence the following can happen with the current ManualClock.
```
+-----------------------------------+--------------------------------+
| StreamExecution thread | testing thread |
+-----------------------------------+--------------------------------+
| ManualClock.waitTillTime(100) { | |
| _isWaiting = true | |
| wait(10) | |
| still in wait(10) | if (_isWaiting) advance(100) |
| still in wait(10) | if (_isWaiting) advance(200) | <- this should be disallowed !
| still in wait(10) | if (_isWaiting) advance(300) | <- this should be disallowed !
| wake up from wait(10) | |
| current time is 600 | |
| _isWaiting = false | |
| } | |
+-----------------------------------+--------------------------------+
```
- The second source of flakiness is that data added to the memory stream may be processed in any trigger, not just the first trigger.
My fix is to make the manual clock wait for the other stream execution thread to start waiting for the clock at the right wait start time. That is, `advance(200)` (see above) will wait for stream execution thread to complete the wait that started at time 0, and start a new wait at time 200 (i.e. time stamp after the previous `advance(100)`).
In addition, since this is a feature that is solely used by StreamExecution, I removed all the non-generic code from ManualClock and put them in StreamManualClock inside StreamTest.
## How was this patch tested?
Ran the existing unit test many times in Jenkins.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Liwei Lin <lwlin7@gmail.com>
Closes #15519 from tdas/metrics-flaky-test-fix.
## What changes were proposed in this pull request?
Currently we use BufferedInputStream to read the shuffle file, which copies the file content from the OS buffer cache to the user buffer. This adds latency when reading the spill files. We changed the code to use a Java NIO direct buffer to read the spill files, and for certain pipelines that spill a significant amount of data we see up to a 7% speedup for the entire pipeline.
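Reading a file through a `FileChannel` into a direct `ByteBuffer` looks roughly like this. `DirectReadSketch.readAll` is a hypothetical helper that returns the whole file for easy checking; the patch's actual stream class is not reproduced here.

```scala
import java.io.ByteArrayOutputStream
import java.nio.ByteBuffer
import java.nio.channels.FileChannel
import java.nio.file.{Path, StandardOpenOption}

object DirectReadSketch {
  def readAll(path: Path, bufferSize: Int = 8192): Array[Byte] = {
    val channel = FileChannel.open(path, StandardOpenOption.READ)
    // Direct buffers live outside the Java heap, so the channel can fill them
    // without an intermediate heap buffer.
    val buffer = ByteBuffer.allocateDirect(bufferSize)
    val out = new ByteArrayOutputStream()
    try {
      while (channel.read(buffer) != -1) {
        buffer.flip() // switch from filling to draining
        val chunk = new Array[Byte](buffer.remaining())
        buffer.get(chunk)
        out.write(chunk, 0, chunk.length)
        buffer.clear() // ready for the next read
      }
      out.toByteArray
    } finally {
      channel.close()
    }
  }
}
```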
## How was this patch tested?
Tested by running the job in the cluster and observed up to 7% speedup.
Author: Sital Kedia <skedia@fb.com>
Closes #15408 from sitalkedia/skedia/nio_spill_read.
This reverts commit ed14633414.
The merged patch had obvious quality and documentation issues. The idea is useful, and we should work towards improving its quality and merging it in again.
## What changes were proposed in this pull request?
Restructure the code and implement two new task assigners.
PackedAssigner: tries to allocate tasks to the executors with the fewest available cores, so that Spark can release reserved executors when dynamic allocation is enabled.
BalancedAssigner: tries to allocate tasks to the executors with the most available cores, in order to balance the workload across all executors.
By default, the original round-robin assigner is used.
We tested a pipeline, and the new PackedAssigner saved around 45% of the reserved CPU and memory with dynamic allocation enabled.
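The difference between the two policies comes down to the order in which executor offers are considered. A sketch with a hypothetical `Offer` type follows; the real assigners work on Spark's scheduler data structures.

```scala
object AssignerSketch {
  // Hypothetical view of an executor's spare capacity.
  final case class Offer(executorId: String, freeCores: Int)

  // Packed: fill the executors with the fewest free cores first, so lightly used
  // executors stay idle and can be released by dynamic allocation.
  def packedOrder(offers: Seq[Offer]): Seq[Offer] = offers.sortBy(_.freeCores)

  // Balanced: prefer the executors with the most free cores, spreading the load.
  def balancedOrder(offers: Seq[Offer]): Seq[Offer] = offers.sortBy(o => -o.freeCores)
}
```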
## How was this patch tested?
Unit tests in TaskSchedulerImplSuite and manual tests in a production pipeline.
Author: Zhan Zhang <zhanzhang@fb.com>
Closes #15218 from zhzhan/packed-scheduler.