This problem lies in `BypassMergeSortShuffleWriter`: even an empty partition generates a temp shuffle file of several bytes. This change creates the file only when the partition is not empty.
The problem only exists here; there is no such issue in `HashShuffleWriter`.
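A minimal sketch of the idea (helper names are hypothetical, not the actual `BypassMergeSortShuffleWriter` code): defer creation of a partition's temp file until the first record for that partition arrives, so empty partitions produce no file at all.
```
import java.io.File

// Hypothetical sketch: create each partition's temp shuffle file lazily, on first write.
class LazyPartitionFiles(numPartitions: Int, newTempFile: Int => File) {
  private val files = new Array[File](numPartitions)

  def fileFor(partitionId: Int): File = {
    if (files(partitionId) == null) {
      files(partitionId) = newTempFile(partitionId) // created only when data exists
    }
    files(partitionId)
  }

  // Empty partitions never get a file, so there is nothing extra to merge or delete.
  def nonEmptyPartitions: Seq[Int] = files.indices.filter(i => files(i) != null)
}
```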
Please help to review, thanks a lot.
Author: jerryshao <sshao@hortonworks.com>
Closes #10376 from jerryshao/SPARK-12400.
I hit the exception below. The `UnsafeKVExternalSorter` does pass `null` as the consumer when creating an `UnsafeInMemorySorter`. Normally the NPE doesn't occur because `inMemSorter` is set to null later and the `free()` method is not called. It happens when another exception, such as an OOM, is thrown before `inMemSorter` is set to null. Anyway, we can add the null check to avoid it.
```
ERROR spark.TaskContextImpl: Error in TaskCompletionListener
java.lang.NullPointerException
at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.free(UnsafeInMemorySorter.java:110)
at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.cleanupResources(UnsafeExternalSorter.java:288)
at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter$1.onTaskCompletion(UnsafeExternalSorter.java:141)
at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:79)
at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:77)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:77)
at org.apache.spark.scheduler.Task.run(Task.scala:91)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
```
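A simplified sketch of the defensive fix (class and method names below are illustrative, not the real `UnsafeInMemorySorter` API): guard the cleanup path against a null consumer so a failure earlier in the task does not turn into an NPE inside the TaskCompletionListener.
```
// Hypothetical sketch: only release memory through a real consumer.
trait MemoryConsumerLike { def freeArray(a: Array[Long]): Unit }

class InMemorySorterLike(consumer: MemoryConsumerLike, private var array: Array[Long]) {
  def free(): Unit = {
    if (consumer != null && array != null) {
      consumer.freeArray(array) // skip silently when there is nothing to free
    }
    array = null
  }
}
```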
Author: Carson Wang <carson.wang@intel.com>
Closes #10637 from carsonwang/FixNPE.
Fix the style violation (space before , and :).
This PR is a follow-up to #10643
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes #10719 from sarutak/SPARK-12692-followup-core.
- [x] Upgrade Py4J to 0.9.1
- [x] SPARK-12657: Revert SPARK-12617
- [x] SPARK-12658: Revert SPARK-12511
- Still keep the change that reads the checkpoint only once. This is a manual change and worth a careful look. bfd4b5c040
- [x] Verify there are no more leaks after reverting our workarounds
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #10692 from zsxwing/py4j-0.9.1.
[SPARK-12582][Test] IndexShuffleBlockResolverSuite fails in Windows
* IndexShuffleBlockResolverSuite fails on Windows because a file is not closed.
* Move IndexShuffleBlockResolverSuite.scala from "test/java" to "test/scala".
https://issues.apache.org/jira/browse/SPARK-12582
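A small sketch of the Windows-friendly pattern involved (hypothetical helper, not the test's actual code): close the stream in a finally block so the handle is released before the file is deleted.
```
import java.io.{File, FileOutputStream}

// On Windows, an open file handle blocks deletion, so always close in finally.
def writeBytes(file: File, bytes: Array[Byte]): Unit = {
  val out = new FileOutputStream(file)
  try out.write(bytes)
  finally out.close()
}
```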
Author: Yucai Yu <yucai.yu@intel.com>
Closes #10526 from yucai/master.
Currently, the parameters of the RDD function aggregate are not explained well, especially the parameter "zeroValue".
It's helpful to let junior Scala users know that "zeroValue" participates in both the "seqOp" and "combOp" phases.
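An illustrative example (not part of the patch, assuming an existing SparkContext `sc`) showing that `zeroValue` takes part in both phases:
```
// zeroValue seeds seqOp inside every partition and also seeds the final combOp merge.
val rdd = sc.parallelize(1 to 6, numSlices = 3)
val (sum, count) = rdd.aggregate((0, 0))(
  (acc, x) => (acc._1 + x, acc._2 + 1),   // seqOp: within each partition
  (a, b)   => (a._1 + b._1, a._2 + b._2)  // combOp: merging partition results
)
// sum == 21, count == 6; a non-neutral zeroValue would be folded in once per
// partition (seqOp) plus once more when the partition results are combined (combOp).
```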
Author: Tommy YU <tummyyu@163.com>
Closes #10587 from Wenpei/rdd_aggregate_doc.
This patch deduplicates some test code in BlockManagerSuite. I'm splitting this change off from a larger PR in order to make things easier to review.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #10667 from JoshRosen/block-mgr-tests-cleanup.
Replace Guava `Optional` with (an API clone of) Java 8 `java.util.Optional` (edit: and a clone of Guava `Optional`)
See also https://github.com/apache/spark/pull/10512
Author: Sean Owen <sowen@cloudera.com>
Closes #10513 from srowen/SPARK-4819.
…s on secure Hadoop
https://issues.apache.org/jira/browse/SPARK-12654
So the bug here is that WholeTextFileRDD.getPartitions has:
val conf = getConf
In getConf, if cloneConf=true, it creates a new Hadoop Configuration. Then it uses that to create a new newJobContext.
The newJobContext will copy credentials around, but credentials are only present in a JobConf, not in a Hadoop Configuration. So when the Hadoop configuration is cloned, it changes from a JobConf to a Configuration and drops the credentials that were there. NewHadoopRDD just uses the conf passed in for getPartitions (not getConf), which is why it works.
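A hedged illustration of the underlying issue (not the patch itself): the key/value settings survive the clone, but only a `JobConf` carries credentials.
```
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapred.JobConf

// A JobConf holds both settings and credentials (delegation tokens on secure Hadoop);
// copying it into a plain Configuration keeps the settings but loses the credentials.
val jobConf = new JobConf()
val cloned  = new Configuration(jobConf) // copy constructor: properties only
// Anything built from `cloned` on a secure cluster no longer has the tokens.
```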
Author: Thomas Graves <tgraves@staydecay.corp.gq1.yahoo.com>
Closes #10651 from tgravescs/SPARK-12654.
Changed the Logging FileAppender to use join in `awaitTermination` to ensure that the thread has properly finished before returning.
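A rough sketch of the pattern (class shape is hypothetical, not the real `FileAppender`):
```
// Wait on the writing thread itself, so awaitTermination only returns once the
// thread has actually finished flushing and exited.
class AppenderLike(doAppend: () => Unit) {
  private val writingThread = new Thread(new Runnable {
    override def run(): Unit = doAppend()
  })
  writingThread.setDaemon(true)
  writingThread.start()

  def awaitTermination(): Unit = writingThread.join()
}
```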
Author: Bryan Cutler <cutlerb@gmail.com>
Closes #10654 from BryanCutler/fileAppender-join-thread-SPARK-12701.
Fix most build warnings: mostly deprecated API usages. I'll annotate some of the changes below. CC rxin who is leading the charge to remove the deprecated APIs.
Author: Sean Owen <sowen@cloudera.com>
Closes #10570 from srowen/SPARK-12618.
The default serializer in Kryo is FieldSerializer and it ignores transient fields and never calls `writeObject` or `readObject`. So we should register OpenHashMapBasedStateMap using `DefaultSerializer` to make it work with Kryo.
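A hedged sketch of the general pattern (the class below is illustrative, not `OpenHashMapBasedStateMap` itself): tell Kryo to fall back to Java serialization for a class that depends on `writeObject`/`readObject` and transient fields.
```
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.serializers.JavaSerializer

class StateMapLikeExample extends Serializable {
  @transient private var cache: Map[String, Int] = Map.empty
  private def writeObject(out: java.io.ObjectOutputStream): Unit = out.defaultWriteObject()
  private def readObject(in: java.io.ObjectInputStream): Unit = {
    in.defaultReadObject()
    cache = Map.empty // rebuild the transient state that serialization skipped
  }
}

// Registering with Kryo's JavaSerializer honors the writeObject/readObject hooks,
// unlike the default FieldSerializer, which ignores them and transient markers.
val kryo = new Kryo()
kryo.register(classOf[StateMapLikeExample], new JavaSerializer())
```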
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #10609 from zsxwing/SPARK-12591.
Per rxin, let's use the casting for countByKey and countByValue as well. Let's see if this passes.
Author: Sean Owen <sowen@cloudera.com>
Closes #10641 from srowen/SPARK-12604.2.
There is a bug in the calculation of ```maxSplitSize```. The ```totalLen``` should be divided by ```minPartitions``` and not by ```files.size```.
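A small sketch of the corrected computation (simplified, names hypothetical):
```
// The split size should be driven by the requested number of partitions,
// not by how many input files there happen to be.
def maxSplitSize(totalLen: Long, minPartitions: Int): Long =
  math.ceil(totalLen.toDouble / math.max(minPartitions, 1)).toLong
```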
Author: Darek Blasiak <darek.blasiak@640labs.com>
Closes #10546 from datafarmer/setminpartitionsbug.
…mprovements
Please review and merge at your convenience. Thanks!
Author: Jacek Laskowski <jacek@japila.pl>
Closes #10595 from jaceklaskowski/streaming-minor-fixes.
This PR manages the memory used by window functions (buffered rows) and also enables external spilling.
After this PR, we can run window functions on a partition with hundreds of millions of rows with only 1 GB of memory.
Author: Davies Liu <davies@databricks.com>
Closes #10605 from davies/unsafe_window.
MapPartitionsRDD was keeping a reference to `prev` after a call to
`clearDependencies`, which could lead to a memory leak.
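A minimal sketch of the fix idea (not the actual Spark class): keep the parent as a mutable field and null it out in `clearDependencies()` so it can be garbage collected.
```
// Holding on to `prev` after clearDependencies() keeps the whole parent lineage alive.
class ChildOperatorLike(@transient private var prev: AnyRef) {
  def clearDependencies(): Unit = {
    prev = null // drop the reference so the parent can be reclaimed
  }
}
```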
Author: Guillaume Poulin <poulin.guillaume@gmail.com>
Closes #10623 from gpoulin/map_partition_deps.
This PR removes `spark.cleaner.ttl` and the associated TTL-based metadata cleaning code.
Now that we have the `ContextCleaner` and a timer to trigger periodic GCs, I don't think that `spark.cleaner.ttl` is necessary anymore. The TTL-based cleaning isn't enabled by default, isn't included in our end-to-end tests, and has been a source of user confusion when it is misconfigured. If the TTL is set too low, data which is still being used may be evicted / deleted, leading to hard-to-diagnose bugs.
For all of these reasons, I think that we should remove this functionality in Spark 2.0. Additional benefits of doing this include marginally reduced memory usage, since we no longer need to store timestamps in hashmaps, and a handful fewer threads.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #10534 from JoshRosen/remove-ttl-based-cleaning.
[SPARK-12640][SQL] Add simple benchmarking utility class and add Parquet scan benchmarks.
We've run benchmarks ad hoc to measure the scanner performance. We will continue to invest in this
and it makes sense to get these benchmarks into code. This adds a simple benchmarking utility to do
this.
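A minimal, hypothetical sketch of what such a utility can look like (not the class added by the patch): run each named case a few times and report wall-clock timings.
```
object MiniBenchmark {
  // Run `body` `iters` times and print the best and average time in milliseconds.
  def run(name: String, iters: Int = 5)(body: => Unit): Unit = {
    val timesMs = (1 to iters).map { _ =>
      val start = System.nanoTime()
      body
      (System.nanoTime() - start) / 1e6
    }
    println(f"$name%-30s best: ${timesMs.min}%.1f ms  avg: ${timesMs.sum / iters}%.1f ms")
  }
}

// Example usage:
// MiniBenchmark.run("parquet scan") { /* read the file and consume the rows */ }
```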
Author: Nong Li <nong@databricks.com>
Author: Nong <nongli@gmail.com>
Closes #10589 from nongli/spark-12640.
Change Java countByKey, countApproxDistinctByKey return types to use Java Long, not Scala; update similar methods for consistency on java.long.Long.valueOf with no API change
Author: Sean Owen <sowen@cloudera.com>
Closes #10554 from srowen/SPARK-12604.
The whole of Vector.scala, VectorSuite.scala and GraphKryoRegistrator.scala is no longer used, so it's time to remove them in Spark 2.0.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes #10613 from sarutak/SPARK-12665.
Cartesian product uses UnsafeExternalSorter without a comparator to do spilling; it will throw an NPE if spilling happens.
This bug was also hit by #10605
cc JoshRosen
Author: Davies Liu <davies@databricks.com>
Closes #10606 from davies/fix_spilling.
I looked at each case individually and it looks like they can all be removed. The only one I had to think twice about was toArray (I even thought about un-deprecating it, until I realized it was a problem in Java to have toArray return java.util.List).
Author: Reynold Xin <rxin@databricks.com>
Closes #10569 from rxin/SPARK-12615.
Currently we don't support Hadoop 0.23, but there is still some code related to it, so let's clean it up.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes #10590 from sarutak/SPARK-12641.
This patch updates the ExecutorRunner's terminate path to use the new Java 8 API
to terminate processes more forcefully if possible. If the executor is unhealthy,
it would previously ignore the destroy() call. Presumably, the new Java API was
added to handle cases like this.
We could update the termination path in the future to use OS specific commands
for older java versions.
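A rough sketch of the escalation idea (hypothetical helper, using the Java 8 `Process` API mentioned above):
```
import java.util.concurrent.TimeUnit

// Ask the process to exit, then escalate to destroyForcibly() if it ignores the request.
def terminate(process: Process, graceMillis: Long = 10000L): Int = {
  process.destroy()
  if (!process.waitFor(graceMillis, TimeUnit.MILLISECONDS)) {
    process.destroyForcibly() // Java 8 API: forceful termination
  }
  process.waitFor() // returns the exit code once the process is gone
}
```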
Author: Nong Li <nong@databricks.com>
Closes #10438 from nongli/spark-12486-executors.
### Remove AkkaRpcEnv
Keep `SparkEnv.actorSystem` because Streaming still uses it. Will remove it and AkkaUtils after refactoring Streaming actorStream API.
### Remove systemName
There are 2 places using `systemName`:
* `RpcEnvConfig.name`. Actually, although it's used as `systemName` in `AkkaRpcEnv`, `NettyRpcEnv` uses it as the service name to output the log `Successfully started service *** on port ***`. Since the service name in log is useful, I keep `RpcEnvConfig.name`.
* `def setupEndpointRef(systemName: String, address: RpcAddress, endpointName: String)`. Each `ActorSystem` has a `systemName`. Akka requires `systemName` in its URI and will refuse a connection if `systemName` is not matched. However, `NettyRpcEnv` doesn't use it. So we can remove `systemName` from `setupEndpointRef` since we are removing `AkkaRpcEnv`.
### Remove RpcEnv.uriOf
`uriOf` exists because Akka uses different URI formats for connections with and without authentication, e.g., `akka.ssl.tcp...` and `akka.tcp://...`. But `NettyRpcEnv` uses the same format. So it's not necessary after removing `AkkaRpcEnv`.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #10459 from zsxwing/remove-akka-rpc-env.
It was research code and has been deprecated since 1.0.0. No one really uses it since they can just use event logging.
Author: Reynold Xin <rxin@databricks.com>
Closes #10530 from rxin/SPARK-12561.
We switched to TorrentBroadcast in Spark 1.1, and HttpBroadcast has been undocumented since then. It's time to remove it in Spark 2.0.
Author: Reynold Xin <rxin@databricks.com>
Closes #10531 from rxin/SPARK-12588.
I got an exception when accessing the below REST API with an unknown application Id.
`http://<server-url>:18080/api/v1/applications/xxx/jobs`
Instead of an exception, I expect an error message "no such app: xxx", similar to the error message returned when I access `/api/v1/applications/xxx`.
```
org.spark-project.guava.util.concurrent.UncheckedExecutionException: java.util.NoSuchElementException: no app with key xxx
at org.spark-project.guava.cache.LocalCache$Segment.get(LocalCache.java:2263)
at org.spark-project.guava.cache.LocalCache.get(LocalCache.java:4000)
at org.spark-project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
at org.spark-project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
at org.apache.spark.deploy.history.HistoryServer.getSparkUI(HistoryServer.scala:116)
at org.apache.spark.status.api.v1.UIRoot$class.withSparkUI(ApiRootResource.scala:226)
at org.apache.spark.deploy.history.HistoryServer.withSparkUI(HistoryServer.scala:46)
at org.apache.spark.status.api.v1.ApiRootResource.getJobs(ApiRootResource.scala:66)
```
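A minimal, self-contained sketch of the desired behavior (names are hypothetical, not the HistoryServer API):
```
// Turn a missing application id into a clear error message instead of an exception.
def lookupApp(loadedApps: Map[String, String], appId: String): Either[String, String] =
  loadedApps.get(appId).toRight(s"no such app: $appId")

// lookupApp(Map("app-1" -> "ui"), "xxx")  ==> Left("no such app: xxx")
```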
Author: Carson Wang <carson.wang@intel.com>
Closes #10352 from carsonwang/unknownAppFix.
Updated the Worker unit IllegalStateException message to indicate that no values less than 1MB are allowed, instead of 0, to help solve this.
Requesting review
Author: Neelesh Srinivas Salian <nsalian@cloudera.com>
Closes #10483 from nssalian/SPARK-12263.
The web UI's paginated table uses Javascript to implement certain navigation controls, such as table sorting and the "go to page" form. This is unnecessary and should be simplified to use plain HTML form controls and links.
/cc zsxwing, who wrote this original code, and yhuai.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #10441 from JoshRosen/simplify-paginated-table-sorting.
Include the following changes:
1. Close `java.sql.Statement`
2. Fix incorrect `asInstanceOf`.
3. Remove unnecessary `synchronized` and `ReentrantLock`.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #10440 from zsxwing/findbugs.
Since we only need to implement `def skipBytes(n: Int)`,
code in #10213 could be simplified.
davies scwf
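A hedged sketch of how `skipBytes` can be implemented on top of a plain `InputStream` (illustrative only; the stream's `skip` may skip fewer bytes than requested, so it loops):
```
import java.io.InputStream

// Skip up to n bytes, returning how many were actually skipped.
def skipBytes(in: InputStream, n: Int): Int = {
  var remaining = n.toLong
  while (remaining > 0) {
    val skipped = in.skip(remaining)
    if (skipped <= 0) return (n - remaining).toInt
    remaining -= skipped
  }
  n
}
```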
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes #10253 from adrian-wang/kryo.
The feature was first added at commit: 7b877b2705 but was later removed (probably by mistake) at commit: fc8b58195a.
This change sets the default path of RDDs created via sc.textFile(...) to the path argument.
Here is the symptom:
* Using spark-1.5.2-bin-hadoop2.6:
scala> sc.textFile("/home/root/.bashrc").name
res5: String = null
scala> sc.binaryFiles("/home/root/.bashrc").name
res6: String = /home/root/.bashrc
* while using Spark 1.3.1:
scala> sc.textFile("/home/root/.bashrc").name
res0: String = /home/root/.bashrc
scala> sc.binaryFiles("/home/root/.bashrc").name
res1: String = /home/root/.bashrc
Author: Yaron Weinsberg <wyaron@gmail.com>
Author: yaron <yaron@il.ibm.com>
Closes #10456 from wyaron/master.
Instead of just cancelling the registrationRetryTimer to keep the driver from retrying the connection to the master, change the call to `schedule`.
There is no need to register with the master repeatedly.
Author: echo2mei <534384876@qq.com>
Closes #10447 from echoTomei/master.
In SparkContext method `setCheckpointDir`, a warning is issued when spark master is not local and the passed directory for the checkpoint dir appears to be local.
In practice, when relying on the HDFS configuration file and using a relative path for the checkpoint directory (an incomplete URI without the HDFS scheme, ...), this warning should not be issued and can be confusing.
In fact, in this case, the checkpoint directory is successfully created, and the checkpointing mechanism works as expected.
This PR uses the `FileSystem` instance created with the given directory, and checks whether it is local or not.
(The rationale is that since this same `FileSystem` instance is used to create the checkpoint dir anyway and can therefore be reliably used to determine if it is local or not).
The warning is only issued if the directory is not local, on top of the existing conditions.
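A hedged sketch of the kind of check involved (simplified; the actual PR works on the `FileSystem` instance it already created):
```
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{LocalFileSystem, Path}

// Resolve the directory against the FileSystem it actually maps to, and only treat it
// as "local" if that FileSystem is the local one.
def checkpointDirIsLocal(dir: String, hadoopConf: Configuration): Boolean = {
  val fs = new Path(dir).getFileSystem(hadoopConf)
  fs.isInstanceOf[LocalFileSystem]
}
```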
Author: pierre-borckmans <pierre.borckmans@realimpactanalytics.com>
Closes #10392 from pierre-borckmans/SPARK-12440_CheckpointDir_Warning_NonLocal.
Restore the original value of os.arch property after each test
Since some tests force the os.arch property to a specific value, we need to restore the original value afterwards.
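A small, hypothetical helper illustrating the save-and-restore pattern:
```
// Snapshot os.arch before a test overrides it, and restore it afterwards so later
// tests see the original value.
def withOsArch[T](value: String)(body: => T): T = {
  val original = System.getProperty("os.arch")
  System.setProperty("os.arch", value)
  try body
  finally {
    if (original == null) System.clearProperty("os.arch")
    else System.setProperty("os.arch", original)
  }
}
```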
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes #10289 from kiszk/SPARK-12311.
Fix Tachyon deprecations; pull Tachyon dependency into `TachyonBlockManager` only
CC calvinjia as I probably need a double-check that the usage of the new API is correct.
Author: Sean Owen <sowen@cloudera.com>
Closes #10449 from srowen/SPARK-12500.
According to the benchmark [1], LZ4-java could be 80% (or 30%) faster than Snappy.
After changing the compressor to LZ4, I saw 20% improvement on end-to-end time for a TPCDS query (Q4).
[1] https://github.com/ning/jvm-compressor-benchmark/wiki
cc rxin
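For reference, the codec for Spark's internal data compression is selected through the `spark.io.compression.codec` setting; a minimal usage sketch:
```
import org.apache.spark.SparkConf

// Explicitly request LZ4 for internal data compression (shuffle outputs, RDD blocks, ...).
val conf = new SparkConf().set("spark.io.compression.codec", "lz4")
```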
Author: Davies Liu <davies@databricks.com>
Closes #10342 from davies/lz4.
```
[info] ReplayListenerSuite:
[info] - Simple replay (58 milliseconds)
java.lang.NullPointerException
at org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:982)
at org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:980)
```
https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-Master-SBT/4316/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/consoleFull
This was introduced in #10284. It's harmless because the NPE is caused by a race that occurs mainly in `local-cluster` tests (but doesn't actually fail the tests).
Tested locally to verify that the NPE is gone.
Author: Andrew Or <andrew@databricks.com>
Closes #10417 from andrewor14/fix-harmless-npe.