Author: Shiti <ssaxena.ece@gmail.com>
Closes#2810 from Shiti/master and squashes the following commits:
051d82f [Shiti] setting the default value of uri scheme to "file" where matching "file" or None yields the same result
This is another implementation about #1256
cc andrewor14 vanzin
Author: GuoQiang Li <witgo@qq.com>
Closes#2379 from witgo/SPARK-2098-new and squashes the following commits:
4ef1cbd [GuoQiang Li] review commit
49ef70e [GuoQiang Li] Refactor getDefaultPropertiesFile
c45d20c [GuoQiang Li] All Spark processes should support spark-defaults.conf, config file
Author: shitis <ssaxena.ece@gmail.com>
Closes#2795 from Shiti/master and squashes the following commits:
46897d7 [shitis] Using Option Wrapper to convert String with value null to None
Validate the memory is greater than zero when set from the SPARK_WORKER_MEMORY environment variable or command line without a g or m label. Added unit tests. If memory is 0 an IllegalStateException is thrown. Updated unit tests to mock environment variables by subclassing SparkConf (tip provided by Josh Rosen). Updated WorkerArguments to use SparkConf.getenv instead of System.getenv for reading the SPARK_WORKER_MEMORY environment variable.
Author: Bill Bejeck <bbejeck@gmail.com>
Closes#2309 from bbejeck/spark-memory-worker and squashes the following commits:
51cf915 [Bill Bejeck] SPARK-3178 - Validate the memory is greater than zero when set from the SPARK_WORKER_MEMORY environment variable or command line without a g or m label. Added unit tests. If memory is 0 an IllegalStateException is thrown. Updated unit tests to mock environment variables by subclassing SparkConf (tip provided by Josh Rosen). Updated WorkerArguments to use SparkConf.getenv instead of System.getenv for reading the SPARK_WORKER_MEMORY environment variable.
The goal of this patch is to fix the swapped arguments in standalone mode, which was caused by 79e45c9323 (diff-79391110e9f26657e415aa169a004998R153).
More details can be found in the JIRA: [SPARK-3921](https://issues.apache.org/jira/browse/SPARK-3921)
Tested in Standalone mode, but not in Mesos.
Author: Aaron Davidson <aaron@databricks.com>
Closes#2779 from aarondav/fix-standalone and squashes the following commits:
725227a [Aaron Davidson] Fix ExecutorRunnerTest
9d703fe [Aaron Davidson] [SPARK-3921] Fix CoarseGrainedExecutorBackend's arguments for Standalone mode
In the comment (Line 1083), it says: "Otherwise, interpolate the number of partitions we need to try, but overestimate it by 50%."
`(1.5 * num * partsScanned / buf.size).toInt` is the guess of "num of total partitions needed". In every iteration, we should consider the increment `(1.5 * num * partsScanned / buf.size).toInt - partsScanned`
Existing implementation 'exponentially' grows `partsScanned ` ( roughly: `x_{n+1} >= (1.5 + 1) x_n`)
This could be a performance problem. (unless this is the intended behavior)
Author: yingjieMiao <yingjie@42go.com>
Closes#2648 from yingjieMiao/rdd_take and squashes the following commits:
d758218 [yingjieMiao] scala style fix
a8e74bb [yingjieMiao] python style fix
4b6e777 [yingjieMiao] infix operator style fix
4391d3b [yingjieMiao] typo fix.
692f4e6 [yingjieMiao] cap numPartsToTry
c4483dc [yingjieMiao] style fix
1d2c410 [yingjieMiao] also change in rdd.py and AsyncRDD
d31ff7e [yingjieMiao] handle the edge case after 1 iteration
a2aa36b [yingjieMiao] RDD take method: overestimate too much
Author: GuoQiang Li <witgo@qq.com>
Closes#2763 from witgo/SPARK-3905 and squashes the following commits:
17d7990 [GuoQiang Li] The keys for sorting the columns of Executor page ,Stage page Storage page are incorrect
val path = ... //path to seq file with BytesWritable as type of both key and value
val file = sc.sequenceFile[Array[Byte],Array[Byte]](path)
file.take(1)(0)._1
This prints incorrect content of byte array. Actual content starts with correct one and some "random" bytes and zeros are appended. BytesWritable has two methods:
getBytes() - return content of all internal array which is often longer then actual value stored. It usually contains the rest of previous longer values
copyBytes() - return just begining of internal array determined by internal length property
It looks like in implicit conversion between BytesWritable and Array[byte] getBytes is used instead of correct copyBytes.
dbtsai
Author: Jakub Dubovský <james64@inMail.sk>
Author: Dubovsky Jakub <dubovsky@avast.com>
Closes#2712 from james64/3121-bugfix and squashes the following commits:
f85d24c [Jakub Dubovský] Test name changed, comments added
1b20d51 [Jakub Dubovský] Import placed correctly
406e26c [Jakub Dubovský] Scala style fixed
f92ffa6 [Dubovsky Jakub] performance tuning
480f9cd [Dubovsky Jakub] Bug 3121 fixed
When reporting that a remote error occurred, the ConnectionManager should also log the stacktrace of the remote exception. This PR accomplishes this by sending the remote exception's stacktrace as the payload in the "negative ACK / error message."
Author: Josh Rosen <joshrosen@apache.org>
Closes#2741 from JoshRosen/propagate-cm-exceptions-to-sender and squashes the following commits:
b5366cc [Josh Rosen] Explicitly encode error messages using UTF-8.
cef18b3 [Josh Rosen] [SPARK-3887] Send stracktrace in ConnectionManager error messages.
...riden alternatives, can have default argument.
Author: Prashant Sharma <prashant.s@imaginea.com>
Closes#2750 from ScrapCodes/SPARK-2924/default-args-removed and squashes the following commits:
d9785c3 [Prashant Sharma] [SPARK-2924] Required by scala 2.11, only one function/ctor amongst overriden alternatives, can have default argument.
In general, individual shuffle blocks are frequently small, so mmapping them often creates a lot of waste. It may not be bad to mmap the larger ones, but it is pretty inconvenient to get configuration into ManagedBuffer, and besides it is unlikely to help all that much.
Author: Aaron Davidson <aaron@databricks.com>
Closes#2742 from aarondav/mmap and squashes the following commits:
a152065 [Aaron Davidson] Add other pathway back
52b6cd2 [Aaron Davidson] [SPARK-3889] Attempt to avoid SIGBUS by not mmapping files in ConnectionManager
This is a second rev of the Akka upgrade (earlier merged, but reverted). I made a slight modification which is that I also upgrade Hive to deal with a compatibility issue related to the protocol buffers library.
Author: Anand Avati <avati@redhat.com>
Author: Patrick Wendell <pwendell@gmail.com>
Closes#2752 from pwendell/akka-upgrade and squashes the following commits:
4c7ca3f [Patrick Wendell] Upgrading to new hive->protobuf version
57a2315 [Anand Avati] SPARK-1812: streaming - remove tests which depend on akka.actor.IO
2a551d3 [Anand Avati] SPARK-1812: core - upgrade to akka 2.3.4
I noticed a few issues with how temp directories are created and deleted:
*Minor*
* Guava's `Files.createTempDir()` plus `File.deleteOnExit()` is used in many tests to make a temp dir, but `Utils.createTempDir()` seems to be the standard Spark mechanism
* Call to `File.deleteOnExit()` could be pushed into `Utils.createTempDir()` as well, along with this replacement
* _I messed up the message in an exception in `Utils` in SPARK-3794; fixed here_
*Bit Less Minor*
* `Utils.deleteRecursively()` fails immediately if any `IOException` occurs, instead of trying to delete any remaining files and subdirectories. I've observed this leave temp dirs around. I suggest changing it to continue in the face of an exception and throw one of the possibly several exceptions that occur at the end.
* `Utils.createTempDir()` will add a JVM shutdown hook every time the method is called. Even if the subdir is the parent of another parent dir, since this check is inside the hook. However `Utils` manages a set of all dirs to delete on shutdown already, called `shutdownDeletePaths`. A single hook can be registered to delete all of these on exit. This is how Tachyon temp paths are cleaned up in `TachyonBlockManager`.
I noticed a few other things that might be changed but wanted to ask first:
* Shouldn't the set of dirs to delete be `File`, not just `String` paths?
* `Utils` manages the set of `TachyonFile` that have been registered for deletion, but the shutdown hook is managed in `TachyonBlockManager`. Should this logic not live together, and not in `Utils`? it's more specific to Tachyon, and looks a slight bit odd to import in such a generic place.
Author: Sean Owen <sowen@cloudera.com>
Closes#2670 from srowen/SPARK-3811 and squashes the following commits:
071ae60 [Sean Owen] Update per @vanzin's review
da0146d [Sean Owen] Make Utils.deleteRecursively try to delete all paths even when an exception occurs; use one shutdown hook instead of one per method call to delete temp dirs
3a0faa4 [Sean Owen] Standardize on Utils.createTempDir instead of Files.createTempDir
This pull request addresses a few issues related to PySpark's IPython support:
- Fix the remaining uses of the '-u' flag, which IPython doesn't support (see SPARK-3772).
- Change PYSPARK_PYTHON_OPTS to PYSPARK_DRIVER_PYTHON_OPTS, so that the old name is reserved in case we ever want to allow the worker Python options to be customized (this variable was introduced in #2554 and hasn't landed in a release yet, so this doesn't break any compatibility).
- Introduce a PYSPARK_DRIVER_PYTHON option that allows the driver to use `ipython` while the workers use a different Python version.
- Attempt to use Python 2.7 by default if PYSPARK_PYTHON is not specified.
- Retain the old semantics for IPYTHON=1 and IPYTHON_OPTS (to avoid breaking existing example programs).
There are more details in a block comment in `bin/pyspark`.
Author: Josh Rosen <joshrosen@apache.org>
Closes#2651 from JoshRosen/SPARK-3772 and squashes the following commits:
7b8eb86 [Josh Rosen] More changes to PySpark python executable configuration:
c4f5778 [Josh Rosen] [SPARK-3772] Allow ipython to be used by Pyspark workers; IPython fixes:
...re logs to avoid Executors swallowing errors
This PR made the following changes:
* Register a callback to `Connection` so that the error will be propagated properly.
* Add more logs so that the errors won't be swallowed by Executors.
* Use trySuccess/tryFailure because `Promise` doesn't allow to call success/failure more than once.
Author: zsxwing <zsxwing@gmail.com>
Closes#2593 from zsxwing/SPARK-3741 and squashes the following commits:
1d5aed5 [zsxwing] Fix naming
0b8a61c [zsxwing] Merge branch 'master' into SPARK-3741
764aec5 [zsxwing] [SPARK-3741] Make ConnectionManager propagate errors properly and add more logs to avoid Executors swallowing errors
Truncate appName in WebUI if it is too long.
Author: Xiangrui Meng <meng@databricks.com>
Closes#2707 from mengxr/truncate-app-name and squashes the following commits:
87834ce [Xiangrui Meng] move scala import below java
c7111dc [Xiangrui Meng] truncate appName in WebUI if it is too long
Upgrade to akka 2.3.4
Author: Anand Avati <avati@redhat.com>
Closes#1685 from avati/SPARK-1812-akka-2.3 and squashes the following commits:
57a2315 [Anand Avati] SPARK-1812: streaming - remove tests which depend on akka.actor.IO
2a551d3 [Anand Avati] SPARK-1812: core - upgrade to akka 2.3.4
There is a Spark logo on the header of HistoryPage.
We can have too many HistoryPages if we run 20+ applications. So I think, it's useful if the logo is as a link to the HistoryPage's page number 1.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#2690 from sarutak/SPARK-3829 and squashes the following commits:
908c109 [Kousuke Saruta] Removed extra space.
00bfbd7 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3829
dd87480 [Kousuke Saruta] Made header Spark log image as a link to History Server's top page.
Now the Stage page only displays "Executor"(host) for tasks. However, there may be more than one Executors running in the same host. Currently, when some task is hung, I only know the host of the faulty executor. Therefore I have to check all executors in the host.
Adding "Executor ID" in the Tasks table. would be helpful to locate the faulty executor. Here is the new page:
![add_executor_id_for_tasks](https://cloud.githubusercontent.com/assets/1000778/4505774/acb9648c-4afa-11e4-8826-8768a0a60cc9.png)
Author: zsxwing <zsxwing@gmail.com>
Closes#2642 from zsxwing/SPARK-3777 and squashes the following commits:
37945af [zsxwing] Put Executor ID and Host into one cell
4bbe2c7 [zsxwing] [SPARK-3777] Display "Executor ID" for Tasks in Stage page
Before:
```
14/10/06 16:45:42 WARN CacheManager: Not enough space to cache partition rdd_0_2
in memory! Free memory is 481861527 bytes.
```
After:
```
14/10/07 11:08:24 WARN MemoryStore: Not enough space to cache rdd_2_0 in memory!
(computed 68.8 MB so far)
14/10/07 11:08:24 INFO MemoryStore: Memory use = 1088.0 B (blocks) + 445.1 MB
(scratch space shared across 8 thread(s)) = 445.1 MB. Storage limit = 459.5 MB.
```
Author: Andrew Or <andrewor14@gmail.com>
Closes#2688 from andrewor14/cache-log-message and squashes the following commits:
28e33d6 [Andrew Or] Shy away from "unrolling"
5638c49 [Andrew Or] Grammar
39a0c28 [Andrew Or] Log more detail when unrolling a block fails
The parent.getOrCompute() of PythonRDD is executed in a separated thread, it should release the memory reserved for shuffle and unrolling finally.
Author: Davies Liu <davies.liu@gmail.com>
Closes#2668 from davies/leak and squashes the following commits:
ae98be2 [Davies Liu] fix memory leak in PythonRDD
SparkEnv is cached in ThreadLocal object, so after stop and create a new SparkContext, old SparkEnv is still used by some threads, it will trigger many problems, for example, pyspark will have problem after restart SparkContext, because py4j use thread pool for RPC.
This patch will clear all the references after stop a SparkEnv.
cc mateiz tdas pwendell
Author: Davies Liu <davies.liu@gmail.com>
Closes#2624 from davies/env and squashes the following commits:
a69f30c [Davies Liu] deprecate getThreadLocal
ba77ca4 [Davies Liu] remove getThreadLocal(), update docs
ee62bb7 [Davies Liu] cleanup ThreadLocal of SparnENV
4d0ea8b [Davies Liu] clear reference of SparkEnv after stop
With Spark SQL we generate very long RDD names. These names are not properly rendered in the web UI.
This PR fixes the rendering issue.
[SPARK-3827] #comment Linking PR with JIRA
Author: Hossein <hossein@databricks.com>
Closes#2687 from falaki/sparkTableUI and squashes the following commits:
fd06409 [Hossein] Limit width of cell when RDD name is too long
AccumulableParam gave its generic parameters as 'R, T', whereas SparkContext labeled them 'T, R'.
Trivial, but really confusing.
I resolved this in favor of AccumulableParam, because it seemed to have some logic for its names. I also extended this minimal, but at least present, justification into the SparkContext comments.
Author: Nathan Kronenfeld <nkronenfeld@oculusinfo.com>
Closes#2637 from nkronenfeld/accumulators and squashes the following commits:
98d6b74 [Nathan Kronenfeld] Rectify gereneric parameter names between SparkContext and AccumulableParam
Remove references to Commons IO FileUtils and replace with pure Java version, which doesn't need to traverse the whole directory tree first.
I think this method could be refined further if it would be alright to rename it and its args and break it down into two methods. I'm starting with a simple recursive rendition.
Author: Sean Owen <sowen@cloudera.com>
Closes#2662 from srowen/SPARK-3794 and squashes the following commits:
4cd172f [Sean Owen] Remove references to Commons IO FileUtils and replace with pure Java version, which doesn't need to traverse the whole directory tree first
JIRA: https://issues.apache.org/jira/browse/SPARK-1656
Author: zsxwing <zsxwing@gmail.com>
Closes#577 from zsxwing/SPARK-1656 and squashes the following commits:
c431095 [zsxwing] Add a comment and fix the code style
2de96e5 [zsxwing] Make sure file will be deleted if exception happens
28b90dc [zsxwing] Update to follow the code style
4521d6e [zsxwing] Merge branch 'master' into SPARK-1656
afc3383 [zsxwing] Update to follow the code style
071fdd1 [zsxwing] SPARK-1656: Fix potential resource leaks
The MesosSchedulerBackend did not previously implement `killTask`,
resulting in an exception.
Author: Brenden Matthews <brenden@diddyinc.com>
Closes#2453 from brndnmtthws/implement-killtask and squashes the following commits:
23ddcdc [Brenden Matthews] [SPARK-3597][Mesos] Implement `killTask`.
First contribution to the project, so apologize for any significant errors.
This PR addresses [SPARK-1860]. The application directories are now cleaned up in a more conservative manner.
Previously, app-* directories were cleaned up if the directory's timestamp was older than a given time. However, the timestamp on a directory does not reflect the modification times of the files in that directory. Therefore, app-* directories were wiped out even if the files inside them were created recently and possibly being used by Executor tasks.
The solution is to change the cleanup logic to inspect all files within the app-* directory and only eliminate the app-* directory if all files in the directory are stale.
Author: mcheah <mcheah@palantir.com>
Closes#2609 from mccheah/worker-better-app-dir-cleanup and squashes the following commits:
87b5d03 [mcheah] [SPARK-1860] Using more string interpolation. Better error logging.
802473e [mcheah] [SPARK-1860] Cleaning up the logs generated when cleaning directories.
e0a1f2e [mcheah] [SPARK-1860] Fixing broken unit test.
77a9de0 [mcheah] [SPARK-1860] More conservative app directory cleanup.
This PR is another solution for #2250
I'm using codahale base MetricsSystem of Spark with JMX or Graphite, and I saw following 2 problems.
(1) When applications which have same spark.app.name run on cluster at the same time, some metrics names are mixed. For instance, if 2+ application is running on the cluster at the same time, each application emits the same named metric like "SparkPi.DAGScheduler.stage.failedStages" and Graphite cannot distinguish the metrics is for which application.
(2) When 2+ executors run on the same machine, JVM metrics of each executors are mixed. For instance, 2+ executors running on the same node can emit the same named metric "jvm.memory" and Graphite cannot distinguish the metrics is from which application.
And there is an similar issue. The directory for event logs is named using application name.
Application name is defined by user and the name can includes illegal character for path names.
Further more, the directory name consists of application name and System.currentTimeMillis even though each application has unique Application ID so if we run jobs which have same name, it's difficult to identify which directory is for which application.
Closes#2250Closes#1067
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#2432 from sarutak/metrics-structure-improvement2 and squashes the following commits:
3288b2b [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
39169e4 [Kousuke Saruta] Fixed style
6570494 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
817e4f0 [Kousuke Saruta] Simplified MetricsSystem#buildRegistryName
67fa5eb [Kousuke Saruta] Unified MetricsSystem#registerSources and registerSinks in start
10be654 [Kousuke Saruta] Fixed style.
990c078 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
f0c7fba [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
59cc2cd [Kousuke Saruta] Modified SparkContextSchedulerCreationSuite
f9b6fb3 [Kousuke Saruta] Modified style.
2cf8a0f [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
389090d [Kousuke Saruta] Replaced taskScheduler.applicationId() with getApplicationId in SparkContext#postApplicationStart
ff45c89 [Kousuke Saruta] Added some test cases to MetricsSystemSuite
69c46a6 [Kousuke Saruta] Added warning logging logic to MetricsSystem#buildRegistryName
5cca0d2 [Kousuke Saruta] Added Javadoc comment to SparkContext#getApplicationId
16a9f01 [Kousuke Saruta] Added data types to be returned to some methods
6434b06 [Kousuke Saruta] Reverted changes related to ApplicationId
0413b90 [Kousuke Saruta] Deleted ApplicationId.java and ApplicationIdSuite.java
a42300c [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
0fc1b09 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
42bea55 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
248935d [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
f6af132 [Kousuke Saruta] Modified SchedulerBackend and TaskScheduler to return System.currentTimeMillis as an unique Application Id
1b8b53e [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
97cb85c [Kousuke Saruta] Modified confliction of MimExcludes
2cdd009 [Kousuke Saruta] Modified defailt implementation of applicationId
9aadb0b [Kousuke Saruta] Modified NetworkReceiverSuite to ensure "executor.start()" is finished in test "network receiver life cycle"
3011efc [Kousuke Saruta] Added ApplicationIdSuite.scala
d009c55 [Kousuke Saruta] Modified ApplicationId#equals to compare appIds
dfc83fd [Kousuke Saruta] Modified ApplicationId to implement Serializable
9ff4851 [Kousuke Saruta] Modified MimaExcludes.scala to ignore createTaskScheduler method in SparkContext
4567ffc [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
6a91b14 [Kousuke Saruta] Modified SparkContextSchedulerCreationSuite, ExecutorRunnerTest and EventLoggingListenerSuite
0325caf [Kousuke Saruta] Added ApplicationId.scala
0a2fc14 [Kousuke Saruta] Modified style
eabda80 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
0f890e6 [Kousuke Saruta] Modified SparkDeploySchedulerBackend and Master to pass baseLogDir instead f eventLogDir
bcf25bf [Kousuke Saruta] Modified directory name for EventLogs
28d4d93 [Kousuke Saruta] Modified SparkContext and EventLoggingListener so that the directory for EventLogs is named same for Application ID
203634e [Kousuke Saruta] Modified comment in SchedulerBackend#applicationId and TaskScheduler#applicationId
424fea4 [Kousuke Saruta] Modified the subclasses of TaskScheduler and SchedulerBackend so that they can return non-optional Unique Application ID
b311806 [Kousuke Saruta] Swapped last 2 arguments passed to CoarseGrainedExecutorBackend
8a2b6ec [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
086ee25 [Kousuke Saruta] Merge branch 'metrics-structure-improvement2' of github.com:sarutak/spark into metrics-structure-improvement2
e705386 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
36d2f7a [Kousuke Saruta] Added warning message for the situation we cannot get application id for the prefix for the name of metrics
eea6e19 [Kousuke Saruta] Modified CoarseGrainedMesosSchedulerBackend and MesosSchedulerBackend so that we can get Application ID
c229fbe [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
e719c39 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
4a93c7f [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
4776f9e [Kousuke Saruta] Modified MetricsSystemSuite.scala
efcb6e1 [Kousuke Saruta] Modified to add application id to metrics name
2ec848a [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement
3ea7896 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement
ead8966 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement
08e627e [Kousuke Saruta] Revert "tmp"
7b67f5a [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement
45bd33d [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement
93e263a [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement
848819c [Kousuke Saruta] Merge branch 'metrics-structure-improvement' of github.com:sarutak/spark into metrics-structure-improvement
912a637 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement
e4a4593 [Kousuke Saruta] tmp
3e098d8 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement
4603a39 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement
fa7175b [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement
15f88a3 [Kousuke Saruta] Modified MetricsSystem#buildRegistryName because conf.get does not return null when correspondin entry is absent
6f7dcd4 [Kousuke Saruta] Modified constructor of DAGSchedulerSource and BlockManagerSource because the instance of SparkContext is no longer used
6fc5560 [Kousuke Saruta] Modified sourceName of ExecutorSource, DAGSchedulerSource and BlockManagerSource
4e057c9 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement
85ffc02 [Kousuke Saruta] Revert "Modified sourceName of ExecutorSource, DAGSchedulerSource and BlockManagerSource"
868e326 [Kousuke Saruta] Modified MetricsSystem to set registry name with unique application-id and driver/executor-id
71609f5 [Kousuke Saruta] Modified sourceName of ExecutorSource, DAGSchedulerSource and BlockManagerSource
55debab [Kousuke Saruta] Modified SparkContext and Executor to set spark.executor.id to identifiers
4180993 [Kousuke Saruta] Modified SparkContext to retain spark.unique.app.name property in SparkConf
The existing code only considered one of the RMs when running in
Yarn HA mode, so it was possible to get errors if the active RM
was not registered in the filter.
The change makes use of a new API added to Yarn that returns all
proxy addresses, and falls back to the old behavior if the API
is not present. While there, I also made a change to look for the
scheme (http or https) being used by Yarn when building the proxy
URIs.
Since, in the case of multiple RMs, Yarn uses commas as a separator,
it was not possible anymore to use spark.filter.params to propagate
this information (which used commas to delimit different config params).
Instead, I added a new param (spark.filter.jsonParams) which expects
a JSON string containing a map with the config data. I chose not to
add it to the documentation at this point since I don't believe users
will use it directly.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#2469 from vanzin/SPARK-3606 and squashes the following commits:
aeb458a [Marcelo Vanzin] Undelete needed import.
65e400d [Marcelo Vanzin] Remove unused import.
d121883 [Marcelo Vanzin] Use separate config for each param instead of json.
04bc156 [Marcelo Vanzin] Review feedback.
4d4d6b9 [Marcelo Vanzin] [SPARK-3606] [yarn] Correctly configure AmIpFilter for Yarn HA.
Author: Brenden Matthews <brenden@diddyinc.com>
Closes#2401 from brndnmtthws/master and squashes the following commits:
4abaa5d [Brenden Matthews] [SPARK-3535][Mesos] Fix resource handling.
Update of PR #997.
With this PR, setting SPARK_CONF_DIR overrides SPARK_HOME/conf (not only spark-defaults.conf and spark-env).
Author: EugenCepoi <cepoi.eugen@gmail.com>
Closes#2481 from EugenCepoi/SPARK-2058 and squashes the following commits:
0bb32c2 [EugenCepoi] use orElse orNull and fixing trailing percent in compute-classpath.cmd
77f35d7 [EugenCepoi] SPARK-2058: Overriding SPARK_HOME/conf with SPARK_CONF_DIR
SparkSubmitDriverBootstrapper.scala now returns the exit code of the driver process, instead of always returning 0.
Author: Eric Eijkelenboom <ee@userreport.com>
Closes#2628 from ericeijkelenboom/master and squashes the following commits:
cc4a571 [Eric Eijkelenboom] Return the exit code of the driver process
pwendell, ```tryPort``` is not compatible with old code in last PR, this is to fix it.
And after discuss with srowen renamed the title to "avoid trying privileged port when request a non-privileged port". Plz refer to the discuss for detail.
Author: scwf <wangfei1@huawei.com>
Closes#2623 from scwf/1-1024 and squashes the following commits:
10a4437 [scwf] add comment
de3fd17 [scwf] do not try privileged port when request a non-privileged port
42cb0fa [scwf] make tryPort compatible with old code
cb8cc76 [scwf] do not use port 1 - 1024
If you turn authentication on and you are using a lot of executors. There is a chance that all the of the threads in the handleMessageExecutor could be waiting to send a message because they are blocked waiting on authentication to happen. This can cause a temporary deadlock until the connection times out.
To fix it, I got rid of the wait/notify and use a single outbox but only send security messages from it until authentication has completed.
Author: Thomas Graves <tgraves@apache.org>
Closes#2484 from tgravescs/cm_threads_auth and squashes the following commits:
a0a961d [Thomas Graves] give it a type
b6bc80b [Thomas Graves] Rework comments
d6d4175 [Thomas Graves] update from comments
081b765 [Thomas Graves] cleanup
4d7f8f5 [Thomas Graves] Change to not use wait/notify while waiting for authentication
If a block manager (say, A) wants to replicate a block and the node chosen for replication (say, B) is dead, then the attempt to send the block to B fails. However, this continues to fail indefinitely. Even if the driver learns about the demise of the B, A continues to try replicating to B and failing miserably.
The reason behind this bug is that A initially fetches a list of peers from the driver (when B was active), but never updates it after B is dead. This affects Spark Streaming as its receiver uses block replication.
The solution in this patch adds the following.
- Changed BlockManagerMaster to return all the peers of a block manager, rather than the requested number. It also filters out driver BlockManager.
- Refactored BlockManager's replication code to handle peer caching correctly.
+ The peer for replication is randomly selected. This is different from past behavior where for a node A, a node B was deterministically chosen for the lifetime of the application.
+ If replication fails to one node, the peers are refetched.
+ The peer cached has a TTL of 1 second to enable discovery of new peers and using them for replication.
- Refactored use of \<driver\> in BlockManager into a new method `BlockManagerId.isDriver`
- Added replication unit tests (replication was not tested till now, duh!)
This should not make a difference in performance of Spark workloads where replication is not used.
@andrewor14 @JoshRosen
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#2366 from tdas/replication-fix and squashes the following commits:
9690f57 [Tathagata Das] Moved replication tests to a new BlockManagerReplicationSuite.
0661773 [Tathagata Das] Minor changes based on PR comments.
a55a65c [Tathagata Das] Added a unit test to test replication behavior.
012afa3 [Tathagata Das] Bug fix
89f91a0 [Tathagata Das] Minor change.
68e2c72 [Tathagata Das] Made replication peer selection logic more efficient.
08afaa9 [Tathagata Das] Made peer selection for replication deterministic to block id
3821ab9 [Tathagata Das] Fixes based on PR comments.
08e5646 [Tathagata Das] More minor changes.
d402506 [Tathagata Das] Fixed imports.
4a20531 [Tathagata Das] Filtered driver block manager from peer list, and also consolidated the use of <driver> in BlockManager.
7598f91 [Tathagata Das] Minor changes.
03de02d [Tathagata Das] Change replication logic to correctly refetch peers from master on failure and on new worker addition.
d081bf6 [Tathagata Das] Fixed bug in get peers and unit tests to test get-peers and replication under executor churn.
9f0ac9f [Tathagata Das] Modified replication tests to fail on replication bug.
af0c1da [Tathagata Das] Added replication unit tests to BlockManagerSuite
### Problem
The section "Using the shell" in Spark Programming Guide (https://spark.apache.org/docs/latest/programming-guide.html#using-the-shell) says that we can run pyspark REPL through IPython.
But a folloing command does not run IPython but a default Python executable.
```
$ IPYTHON=1 ./bin/pyspark
Python 2.7.8 (default, Jul 2 2014, 10:14:46)
...
```
the spark/bin/pyspark script on the commit b235e01363 decides which executable and options it use folloing way.
1. if PYSPARK_PYTHON unset
* → defaulting to "python"
2. if IPYTHON_OPTS set
* → set IPYTHON "1"
3. some python scripts passed to ./bin/pyspak → run it with ./bin/spark-submit
* out of this issues scope
4. if IPYTHON set as "1"
* → execute $PYSPARK_PYTHON (default: ipython) with arguments $IPYTHON_OPTS
* otherwise execute $PYSPARK_PYTHON
Therefore, when PYSPARK_PYTHON is unset, python is executed though IPYTHON is "1".
In other word, when PYSPARK_PYTHON is unset, IPYTHON_OPS and IPYTHON has no effect on decide which command to use.
PYSPARK_PYTHON | IPYTHON_OPTS | IPYTHON | resulting command | expected command
---- | ---- | ----- | ----- | -----
(unset → defaults to python) | (unset) | (unset) | python | (same)
(unset → defaults to python) | (unset) | 1 | python | ipython
(unset → defaults to python) | an_option | (unset → set to 1) | python an_option | ipython an_option
(unset → defaults to python) | an_option | 1 | python an_option | ipython an_option
ipython | (unset) | (unset) | ipython | (same)
ipython | (unset) | 1 | ipython | (same)
ipython | an_option | (unset → set to 1) | ipython an_option | (same)
ipython | an_option | 1 | ipython an_option | (same)
### Suggestion
The pyspark script should determine firstly whether a user wants to run IPython or other executables.
1. if IPYTHON_OPTS set
* set IPYTHON "1"
2. if IPYTHON has a value "1"
* PYSPARK_PYTHON defaults to "ipython" if not set
3. PYSPARK_PYTHON defaults to "python" if not set
See the pull request for more detailed modification.
Author: cocoatomo <cocoatomo77@gmail.com>
Closes#2554 from cocoatomo/issues/cannot-run-ipython-without-options and squashes the following commits:
d2a9b06 [cocoatomo] [SPARK-3706][PySpark] Use PYTHONUNBUFFERED environment variable instead of -u option
264114c [cocoatomo] [SPARK-3706][PySpark] Remove the sentence about deprecated environment variables
42e02d5 [cocoatomo] [SPARK-3706][PySpark] Replace environment variables used to customize execution of PySpark REPL
10d56fb [cocoatomo] [SPARK-3706][PySpark] Cannot run IPython REPL with IPYTHON set to "1" and PYSPARK_PYTHON unset
This change reorders the replicas returned by
HadoopRDD#getPreferredLocations so that replicas cached by HDFS are at
the start of the list. This requires Hadoop 2.5 or higher; previous
versions of Hadoop do not expose the information needed to determine
whether a replica is cached.
Author: Colin Patrick Mccabe <cmccabe@cloudera.com>
Closes#1486 from cmccabe/SPARK-1767 and squashes the following commits:
338d4f8 [Colin Patrick Mccabe] SPARK-1767: Prefer HDFS-cached replicas when scheduling data-local tasks
FutureAction is the only type exposed through the async APIs, so
for job IDs to be useful they need to be exposed there. The complication
is that some async jobs run more than one job (e.g. takeAsync),
so the exposed ID has to actually be a list of IDs that can actually
change over time. So the interface doesn't look very nice, but...
Change is actually small, I just added a basic test to make sure
it works.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#2337 from vanzin/SPARK-3446 and squashes the following commits:
e166a68 [Marcelo Vanzin] Fix comment.
1fed2bc [Marcelo Vanzin] [SPARK-3446] Expose underlying job ids in FutureAction.
https://issues.apache.org/jira/browse/SPARK-3658
And keep the `CLASS_NOT_FOUND_EXIT_STATUS` and exit message in `SparkSubmit.scala`.
Author: WangTaoTheTonic <barneystinson@aliyun.com>
Author: WangTao <barneystinson@aliyun.com>
Closes#2509 from WangTaoTheTonic/thriftserver and squashes the following commits:
5dcaab2 [WangTaoTheTonic] issue about coupling
8ad9f95 [WangTaoTheTonic] generalization
598e21e [WangTao] take thrift server as a daemon
Non-root user use port 1- 1024 to start jetty server will get the exception " java.net.SocketException: Permission denied", so not use these ports
Author: scwf <wangfei1@huawei.com>
Closes#2610 from scwf/1-1024 and squashes the following commits:
cb8cc76 [scwf] do not use port 1 - 1024
1. broadcast is triggle unexpected
2. fd is leaked in JVM (also leak in parallelize())
3. broadcast is not unpersisted in JVM after RDD is not be used any more.
cc JoshRosen , sorry for these stupid bugs.
Author: Davies Liu <davies.liu@gmail.com>
Closes#2603 from davies/fix_broadcast and squashes the following commits:
080a743 [Davies Liu] fix bugs in broadcast large closure of RDD
Author: Reynold Xin <rxin@apache.org>
Closes#2599 from rxin/SPARK-3747 and squashes the following commits:
a74c04d [Reynold Xin] Added a line of comment explaining NonFatal
0e8d44c [Reynold Xin] [SPARK-3747] TaskResultGetter could incorrectly abort a stage if it cannot get result for a specific task
Author: Reynold Xin <rxin@apache.org>
Closes#2602 from rxin/warning and squashes the following commits:
130186b [Reynold Xin] Remove compiler warning from TaskContext change.
As suggested by mateiz , and because it came up on the mailing list again last week, this attempts to document that ordering of elements is not guaranteed across RDD evaluations in groupBy, zip, and partition-wise RDD methods. Suggestions welcome about the wording, or other methods that need a note.
Author: Sean Owen <sowen@cloudera.com>
Closes#2508 from srowen/SPARK-3356 and squashes the following commits:
b7c96fd [Sean Owen] Undo change to programming guide
ad4aeec [Sean Owen] Don't mention ordering in partition-wise methods, reword description of ordering for zip methods per review, and add similar note to programming guide, which mentions groupByKey (but not zip methods)
fce943b [Sean Owen] Note that ordering of elements is not guaranteed across RDD evaluations in groupBy, zip, and partition-wise RDD methods
When using spark-submit in `cluster` mode to submit a job to a Spark Standalone
cluster, if the JAVA_HOME environment variable was set on the submitting
machine then DriverRunner would attempt to use the submitter's JAVA_HOME to
launch the driver process (instead of the worker's JAVA_HOME), causing the
driver to fail unless the submitter and worker had the same Java location.
This commit fixes this by reading JAVA_HOME from sys.env instead of
command.environment.
Author: Josh Rosen <joshrosen@apache.org>
Closes#2586 from JoshRosen/SPARK-3734 and squashes the following commits:
e9513d9 [Josh Rosen] [SPARK-3734] DriverRunner should not read SPARK_HOME from submitter's environment.