Commit graph

3896 commits

Author SHA1 Message Date
Josh Rosen bf589fc717 [SPARK-3616] Add basic Selenium tests to WebUISuite
This patch adds Selenium tests for Spark's web UI.  To avoid adding extra
dependencies to the test environment, the tests use Selenium's HtmlUnitDriver,
which is pure-Java, instead of, say, ChromeDriver.

I added new tests to try to reproduce a few UI bugs reported on JIRA, namely
SPARK-3021, SPARK-2105, and SPARK-2527.  I wasn't able to reproduce these bugs;
I suspect that the older ones might have been fixed by other patches.

In order to use HtmlUnitDriver, I added an explicit dependency on the
org.apache.httpcomponents version of httpclient in order to prevent jets3t's
older version from taking precedence on the classpath.

I also upgraded ScalaTest to 2.2.1.

Author: Josh Rosen <joshrosen@apache.org>
Author: Josh Rosen <joshrosen@databricks.com>

Closes #2474 from JoshRosen/webui-selenium-tests and squashes the following commits:

fcc9e83 [Josh Rosen] scalautils -> scalactic package rename
510e54a [Josh Rosen] [SPARK-3616] Add basic Selenium tests to WebUISuite.
2014-10-26 11:29:27 -07:00
Sean Owen df7974b8e5 SPARK-3359 [DOCS] sbt/sbt unidoc doesn't work with Java 8
This follows https://github.com/apache/spark/pull/2893 , but does not completely fix SPARK-3359 either. This fixes minor scaladoc/javadoc issues that Javadoc 8 will treat as errors.

Author: Sean Owen <sowen@cloudera.com>

Closes #2909 from srowen/SPARK-3359 and squashes the following commits:

f62c347 [Sean Owen] Fix some javadoc issues that javadoc 8 considers errors. This is not all of the errors turned up when javadoc 8 runs on output of genjavadoc.
2014-10-25 23:18:02 -07:00
Andrew Or c683444008 [SPARK-4071] Unroll fails silently if BlockManager is small
In tests, we may want to have BlockManagers of size < 1MB (spark.storage.unrollMemoryThreshold). However, these BlockManagers are useless because we can't unroll anything in them ever. At the very least we need to log a warning.

tdas

Author: Andrew Or <andrew@databricks.com>

Closes #2917 from andrewor14/unroll-safely-logging and squashes the following commits:

38947e3 [Andrew Or] Warn against starting a block manager that's too small
fd621b4 [Andrew Or] Warn against failure to reserve initial memory threshold
2014-10-25 20:07:44 -07:00
Josh Rosen 9530316887 [SPARK-2321] Stable pull-based progress / status API
This pull request is a first step towards the implementation of a stable, pull-based progress / status API for Spark (see [SPARK-2321](https://issues.apache.org/jira/browse/SPARK-2321)).  For now, I'd like to discuss the basic implementation, API names, and overall interface design.  Once we arrive at a good design, I'll go back and add additional methods to expose more information via these API.

#### Design goals:

- Pull-based API
- Usable from Java / Scala / Python (eventually, likely with a wrapper)
- Can be extended to expose more information without introducing binary incompatibilities.
- Returns immutable objects.
- Don't leak any implementation details, preserving our freedom to change the implementation.

#### Implementation:

- Add public methods (`getJobInfo`, `getStageInfo`) to SparkContext to allow status / progress information to be retrieved.
- Add public interfaces (`SparkJobInfo`, `SparkStageInfo`) for our API return values.  These interfaces consist entirely of Java-style getter methods.  The interfaces are currently implemented in Java.  I decided to explicitly separate the interface from its implementation (`SparkJobInfoImpl`, `SparkStageInfoImpl`) in order to prevent users from constructing these responses themselves.
-Allow an existing JobProgressListener to be used when constructing a live SparkUI.  This allows us to re-use this listeners in the implementation of this status API.  There are a few reasons why this listener re-use makes sense:
   - The status API and web UI are guaranteed to show consistent information.
   - These listeners are already well-tested.
   - The same garbage-collection / information retention configurations can apply to both this API and the web UI.
- Extend JobProgressListener to maintain `jobId -> Job` and `stageId -> Stage` mappings.

The progress API methods are implemented in a separate trait that's mixed into SparkContext.  This helps to avoid SparkContext.scala from becoming larger and more difficult to read.

Author: Josh Rosen <joshrosen@databricks.com>
Author: Josh Rosen <joshrosen@apache.org>

Closes #2696 from JoshRosen/progress-reporting-api and squashes the following commits:

e6aa78d [Josh Rosen] Add tests.
b585c16 [Josh Rosen] Accept SparkListenerBus instead of more specific subclasses.
c96402d [Josh Rosen] Address review comments.
2707f98 [Josh Rosen] Expose current stage attempt id
c28ba76 [Josh Rosen] Update demo code:
646ff1d [Josh Rosen] Document spark.ui.retainedJobs.
7f47d6d [Josh Rosen] Clean up SparkUI constructors, per Andrew's feedback.
b77b3d8 [Josh Rosen] Merge remote-tracking branch 'origin/master' into progress-reporting-api
787444c [Josh Rosen] Move status API methods into trait that can be mixed into SparkContext.
f9a9a00 [Josh Rosen] More review comments:
3dc79af [Josh Rosen] Remove creation of unused listeners in SparkContext.
249ca16 [Josh Rosen] Address several review comments:
da5648e [Josh Rosen] Add example of basic progress reporting in Java.
7319ffd [Josh Rosen] Add getJobIdsForGroup() and num*Tasks() methods.
cc568e5 [Josh Rosen] Add note explaining that interfaces should not be implemented outside of Spark.
6e840d4 [Josh Rosen] Remove getter-style names and "consistent snapshot" semantics:
08cbec9 [Josh Rosen] Begin to sketch the interfaces for a stable, public status API.
ac2d13a [Josh Rosen] Add jobId->stage, stageId->stage mappings in JobProgressListener
24de263 [Josh Rosen] Create UI listeners in SparkContext instead of in Tabs:
2014-10-25 00:06:57 -07:00
Josh Rosen 6c98c29ae0 [SPARK-4080] Only throw IOException from [write|read][Object|External]
If classes implementing Serializable or Externalizable interfaces throw
exceptions other than IOException or ClassNotFoundException from their
(de)serialization methods, then this results in an unhelpful
"IOException: unexpected exception type" rather than the actual exception that
produced the (de)serialization error.

This patch fixes this by adding a utility method that re-wraps any uncaught
exceptions in IOException (unless they are already instances of IOException).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #2932 from JoshRosen/SPARK-4080 and squashes the following commits:

cd3a9be [Josh Rosen] [SPARK-4080] Only throw IOException from [write|read][Object|External].
2014-10-24 15:06:15 -07:00
Nan Zhu f80dcf2aee [SPARK-4067] refactor ExecutorUncaughtExceptionHandler
https://issues.apache.org/jira/browse/SPARK-4067

currently , we call Utils.tryOrExit everywhere
AppClient
Executor
TaskSchedulerImpl
It makes the name of ExecutorUncaughtExceptionHandler unfit to the real case....

Author: Nan Zhu <nanzhu@Nans-MacBook-Pro.local>
Author: Nan Zhu <nanzhu@nans-mbp.home>

Closes #2913 from CodingCat/SPARK-4067 and squashes the following commits:

035ee3d [Nan Zhu] make RAT happy
e62e416 [Nan Zhu] add some general Exit code
a10b63f [Nan Zhu] refactor
2014-10-24 13:47:06 -07:00
Andrew Or b563987e8d [SPARK-4013] Do not create multiple actor systems on each executor
In the existing code, each coarse-grained executor has two concurrently running actor systems. This causes many more error messages to be logged than necessary when the executor is lost or killed because we receive a disassociation event for each of these actor systems.

This is blocking #2840.

Author: Andrew Or <andrewor14@gmail.com>

Closes #2863 from andrewor14/executor-actor-system and squashes the following commits:

44ce2e0 [Andrew Or] Avoid starting two actor systems on each executor
2014-10-24 13:32:23 -07:00
Kousuke Saruta 098f83c7cc [SPARK-4075] [Deploy] Jar url validation is not enough for Jar file
In deploy.ClientArguments.isValidJarUrl, the url is checked as follows.

    def isValidJarUrl(s: String): Boolean = s.matches("(.+):(.+)jar")

So, it allows like 'hdfs:file.jar' (no authority).

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2925 from sarutak/uri-syntax-check-improvement and squashes the following commits:

cf06173 [Kousuke Saruta] Improved URI syntax checking
2014-10-24 13:08:59 -07:00
Li Zhihui 7aacb7bfad [SPARK-2713] Executors of same application in same host should only download files & jars once
If Spark lunched multiple executors in one host for one application, every executor would download it dependent files and jars (if not using local: url) independently. It maybe result in huge latency. In my case, it result in 20 seconds latency to download dependent jars(size about 17M) when I lunched 32 executors in every host(total 4 hosts).

This patch will cache downloaded files and jars for executors to reduce network throughput and download latency. In my case, the latency was reduced from 20 seconds to less than 1 second.

Author: Li Zhihui <zhihui.li@intel.com>
Author: li-zhihui <zhihui.li@intel.com>

Closes #1616 from li-zhihui/cachefiles and squashes the following commits:

36940df [Li Zhihui] Close cache for local mode
935fed6 [Li Zhihui] Clean code.
f9330d4 [Li Zhihui] Clean code again
7050d46 [Li Zhihui] Clean code
074a422 [Li Zhihui] Fix: deal with spark.files.overwrite
03ed3a8 [li-zhihui] rename cache file name as XXXXXXXXX_cache
2766055 [li-zhihui] Use url.hashCode + timestamp as cachedFileName
76a7b66 [Li Zhihui] Clean code & use applcation work directory as cache directory
3510eb0 [Li Zhihui] Keep fetchFile private
2ffd742 [Li Zhihui] add comment for FileLock
e0ebd48 [Li Zhihui] Try and finally lock.release
7fb7c0b [Li Zhihui] Release lock before copy files
6b997bf [Li Zhihui] Executors of same application in same host should only download files & jars once
2014-10-24 13:01:36 -07:00
Davies Liu e595c8d08a [SPARK-3993] [PySpark] fix bug while reuse worker after take()
After take(), maybe there are some garbage left in the socket, then next task assigned to this worker will hang because of corrupted data.

We should make sure the socket is clean before reuse it, write END_OF_STREAM at the end, and check it after read out all result from python.

Author: Davies Liu <davies.liu@gmail.com>
Author: Davies Liu <davies@databricks.com>

Closes #2838 from davies/fix_reuse and squashes the following commits:

8872914 [Davies Liu] fix tests
660875b [Davies Liu] fix bug while reuse worker after take()
2014-10-23 17:20:00 -07:00
Josh Rosen 83b7a1c650 [SPARK-4019] [SPARK-3740] Fix MapStatus compression bug that could lead to empty results or Snappy errors
This commit fixes a bug in MapStatus that could cause jobs to wrongly return
empty results if those jobs contained stages with more than 2000 partitions
where most of those partitions were empty.

For jobs with > 2000 partitions, MapStatus uses HighlyCompressedMapStatus,
which only stores the average size of blocks.  If the average block size is
zero, then this will cause all blocks to be reported as empty, causing
BlockFetcherIterator to mistakenly skip them.

For example, this would return an empty result:

    sc.makeRDD(0 until 10, 1000).repartition(2001).collect()

This can also lead to deserialization errors (e.g. Snappy decoding errors)
for jobs with > 2000 partitions where the average block size is non-zero but
there is at least one empty block.  In this case, the BlockFetcher attempts to
fetch empty blocks and fails when trying to deserialize them.

The root problem here is that MapStatus has a (previously undocumented)
correctness property that was violated by HighlyCompressedMapStatus:

    If a block is non-empty, then getSizeForBlock must be non-zero.

I fixed this by modifying HighlyCompressedMapStatus to store the average size
of _non-empty_ blocks and to use a compressed bitmap to track which blocks are
empty.

I also removed a test which was broken as originally written: it attempted
to check that HighlyCompressedMapStatus's size estimation error was < 10%,
but this was broken because HighlyCompressedMapStatus is only used for map
statuses with > 2000 partitions, but the test only created 50.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #2866 from JoshRosen/spark-4019 and squashes the following commits:

fc8b490 [Josh Rosen] Roll back hashset change, which didn't improve performance.
5faa0a4 [Josh Rosen] Incorporate review feedback
c8b8cae [Josh Rosen] Two performance fixes:
3b892dd [Josh Rosen] Address Reynold's review comments
ba2e71c [Josh Rosen] Add missing newline
609407d [Josh Rosen] Use Roaring Bitmap to track non-empty blocks.
c23897a [Josh Rosen] Use sets when comparing collect() results
91276a3 [Josh Rosen] [SPARK-4019] Fix MapStatus compression bug that could lead to empty results.
2014-10-23 16:39:32 -07:00
Tal Sliwowicz 6b48522527 [SPARK-4006] In long running contexts, we encountered the situation of double registe...
...r without a remove in between. The cause for that is unknown, and assumed a temp network issue.

However, since the second register is with a BlockManagerId on a different port, blockManagerInfo.contains() returns false, while blockManagerIdByExecutor returns Some. This inconsistency is caught in a conditional statement that does System.exit(1), which is a huge robustness issue for us.

The fix - simply remove the old id from both maps during register when this happens. We are mimicking the behavior of expireDeadHosts(), by doing local cleanup of the maps before trying to add new ones.

Also - added some logging for register and unregister.

This is just like https://github.com/apache/spark/pull/2854 except it's on master

Author: Tal Sliwowicz <tal.s@taboola.com>

Closes #2886 from tsliwowicz/master-block-mgr-removal and squashes the following commits:

094d508 [Tal Sliwowicz] some more white space change undone
41a2217 [Tal Sliwowicz] some more whitspaces change undone
7bcfc3d [Tal Sliwowicz] whitspaces fix
df9d98f [Tal Sliwowicz] Code review comments fixed
f48bce9 [Tal Sliwowicz] In long running contexts, we encountered the situation of double register without a remove in between. The cause for that is unknown, and assumed a temp network issue.
2014-10-23 10:53:53 -07:00
Josh Rosen 813effc701 [SPARK-3426] Fix sort-based shuffle error when spark.shuffle.compress and spark.shuffle.spill.compress settings are different
This PR fixes SPARK-3426, an issue where sort-based shuffle crashes if the
`spark.shuffle.spill.compress` and `spark.shuffle.compress` settings have
different values.

The problem is that sort-based shuffle's read and write paths use different
settings for determining whether to apply compression.  ExternalSorter writes
runs to files using `TempBlockId` ids, which causes
`spark.shuffle.spill.compress` to be used for enabling compression, but these
spilled files end up being shuffled over the network and read as shuffle files
using `ShuffleBlockId` by BlockStoreShuffleFetcher, which causes
`spark.shuffle.compress` to be used for enabling decompression.  As a result,
this leads to errors when these settings disagree.

Based on the discussions in #2247 and #2178, it sounds like we don't want to
remove the `spark.shuffle.spill.compress` setting.  Therefore, I've tried to
come up with a fix where `spark.shuffle.spill.compress` is used to compress
data that's read and written locally and `spark.shuffle.compress` is used to
compress any data that will be fetched / read as shuffle blocks.

To do this, I split `TempBlockId` into two new id types, `TempLocalBlockId` and
`TempShuffleBlockId`, which map to `spark.shuffle.spill.compress` and
`spark.shuffle.compress`, respectively.  ExternalAppendOnlyMap also used temp
blocks for spilling data.  It looks like ExternalSorter was designed to be
a generic sorter but its configuration already happens to be tied to sort-based
shuffle, so I think it's fine if we use `spark.shuffle.compress` to compress
its spills; we can move the compression configuration to the constructor in
a later commit if we find that ExternalSorter is being used in other contexts
where we want different configuration options to control compression.  To
summarize:

**Before:**

|       | ExternalAppendOnlyMap        | ExternalSorter               |
|-------|------------------------------|------------------------------|
| Read  | spark.shuffle.spill.compress | spark.shuffle.compress       |
| Write | spark.shuffle.spill.compress | spark.shuffle.spill.compress |

**After:**

|       | ExternalAppendOnlyMap        | ExternalSorter         |
|-------|------------------------------|------------------------|
| Read  | spark.shuffle.spill.compress | spark.shuffle.compress |
| Write | spark.shuffle.spill.compress | spark.shuffle.compress |

Thanks to andrewor14 for debugging this with me!

Author: Josh Rosen <joshrosen@databricks.com>

Closes #2890 from JoshRosen/SPARK-3426 and squashes the following commits:

1921cf6 [Josh Rosen] Minor edit for clarity.
c8dd8f2 [Josh Rosen] Add comment explaining use of createTempShuffleBlock().
2c687b9 [Josh Rosen] Fix SPARK-3426.
91e7e40 [Josh Rosen] Combine tests into single test of all combinations
76ca65e [Josh Rosen] Add regression test for SPARK-3426.
2014-10-22 14:49:58 -07:00
CrazyJvm f05e09b4c9 use isRunningLocally rather than runningLocally
runningLocally is deprecated now

Author: CrazyJvm <crazyjvm@gmail.com>

Closes #2879 from CrazyJvm/runningLocally and squashes the following commits:

bec0b3e [CrazyJvm] use isRunningLocally rather than runningLocally
2014-10-22 00:51:33 -07:00
Sandy Ryza 6bb56faea8 SPARK-1813. Add a utility to SparkConf that makes using Kryo really easy
Author: Sandy Ryza <sandy@cloudera.com>

Closes #789 from sryza/sandy-spark-1813 and squashes the following commits:

48b05e9 [Sandy Ryza] Simplify
b824932 [Sandy Ryza] Allow both spark.kryo.classesToRegister and spark.kryo.registrator at the same time
6a15bb7 [Sandy Ryza] Small fix
a2278c0 [Sandy Ryza] Respond to review comments
6ef592e [Sandy Ryza] SPARK-1813. Add a utility to SparkConf that makes using Kryo really easy
2014-10-21 21:53:09 -07:00
Aaron Davidson 5fdaf52a9d [SPARK-3994] Use standard Aggregator code path for countByKey and countByValue
See [JIRA](https://issues.apache.org/jira/browse/SPARK-3994) for more information. Also adds
a note which warns against using these methods.

Author: Aaron Davidson <aaron@databricks.com>

Closes #2839 from aarondav/countByKey and squashes the following commits:

d6fdb2a [Aaron Davidson] Respond to comments
e1f06d3 [Aaron Davidson] [SPARK-3994] Use standard Aggregator code path for countByKey and countByValue
2014-10-21 13:15:29 -07:00
Andrew Or 61ca7742d2 [SPARK-4020] Do not rely on timeouts to remove failed block managers
If an executor fails without being scheduled to run any tasks, then `DAGScheduler` won't notify `BlockManagerMasterActor` that the associated block manager should be removed. Instead, the associated block manager will be expired only after a few rounds of heartbeat timeouts. In terms of removal treatment, there should really be no distinction between executors that have been scheduled tasks and those that have not.

The fix, then, is to add all known executors to `TaskSchedulerImpl`'s `activeExecutorIds` whether or not it has been scheduled a task. In fact, the existing comment above `activeExecutorIds` is
```
// Which executor IDs we have executors on
val activeExecutorIds = new HashSet[String]
```
not  "Which executors have been scheduled tasks thus far."

Author: Andrew Or <andrewor14@gmail.com>

Closes #2865 from andrewor14/active-executors and squashes the following commits:

ff3172b [Andrew Or] Add all known executors to `activeExecutorIds`
2014-10-21 11:22:25 -07:00
zsxwing c262cd5ddc [SPARK-4035] Fix a wrong format specifier
Just found a typo. Should not use "%f" for Long.

Author: zsxwing <zsxwing@gmail.com>

Closes #2875 from zsxwing/SPARK-4035 and squashes the following commits:

ce347e2 [zsxwing] Fix a wrong format specifier
2014-10-21 10:37:13 -07:00
Josh Rosen 5a8f64f336 [SPARK-3958] TorrentBroadcast cleanup / debugging improvements.
This PR makes several changes to TorrentBroadcast in order to make
it easier to reason about, which should help when debugging SPARK-3958.
The key changes:

- Remove all state from the global TorrentBroadcast object.  This state
  consisted mainly of configuration options, like the block size and
  compression codec, and was read by the blockify / unblockify methods.
  Unfortunately, the use of `lazy val` for `BLOCK_SIZE` meant that the block
  size was always determined by the first SparkConf that TorrentBroadast was
  initialized with; as a result, unit tests could not properly test
  TorrentBroadcast with different block sizes.

  Instead, blockifyObject and unBlockifyObject now accept compression codecs
  and blockSizes as arguments.  These arguments are supplied at the call sites
  inside of TorrentBroadcast instances.  Each TorrentBroadcast instance
  determines these values from SparkEnv's SparkConf.  I was careful to ensure
  that we do not accidentally serialize CompressionCodec or SparkConf objects
  as part of the TorrentBroadcast object.

- Remove special-case handling of local-mode in TorrentBroadcast.  I don't
  think that broadcast implementations should know about whether we're running
  in local mode.  If we want to optimize the performance of broadcast in local
  mode, then we should detect this at a higher level and use a dummy
  LocalBroadcastFactory implementation instead.

  Removing this code fixes a subtle error condition: in the old local mode
  code, a failure to find the broadcast in the local BlockManager would lead
  to an attempt to deblockify zero blocks, which could lead to confusing
  deserialization or decompression errors when we attempted to decompress
  an empty byte array.  This should never have happened, though: a failure to
  find the block in local mode is evidence of some other error.  The changes
  here will make it easier to debug those errors if they ever happen.

- Add a check that throws an exception when attempting to deblockify an
  empty array.

- Use ScalaCheck to add a test to check that TorrentBroadcast's
  blockifyObject and unBlockifyObject methods are inverses.

- Misc. cleanup and logging improvements.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #2844 from JoshRosen/torrentbroadcast-bugfix and squashes the following commits:

1e8268d [Josh Rosen] Address Reynold's review comments
2a9fdfd [Josh Rosen] Address Reynold's review comments.
c3b08f9 [Josh Rosen] Update TorrentBroadcast tests to reflect removal of special local-mode optimizations.
5c22782 [Josh Rosen] Store broadcast variable's value in the driver.
33fc754 [Josh Rosen] Change blockify/unblockifyObject to accept serializer as argument.
618a872 [Josh Rosen] [SPARK-3958] TorrentBroadcast cleanup / debugging improvements.
2014-10-21 00:49:11 -07:00
mcheah 4afe9a4852 [SPARK-3736] Workers reconnect when disassociated from the master.
Before, if the master node is killed and restarted, the worker nodes
would not attempt to reconnect to the Master. Therefore, when the Master
node was restarted, the worker nodes needed to be restarted as well.

Now, when the Master node is disconnected, the worker nodes will
continuously ping the master node in attempts to reconnect to it. Once
the master node restarts, it will detect one of the registration
requests from its former workers. The result is that the cluster
re-enters a healthy state.

In addition, when the master does not receive a heartbeat from the
worker, the worker was removed; however, when the worker sent a
heartbeat to the master, the master used to ignore the heartbeat. Now,
a master that receives a heartbeat from a worker that had been
disconnected will request the worker to re-attempt the registration
process, at which point the worker will send a RegisterWorker request
and be re-connected accordingly.

Re-connection attempts per worker are submitted every N seconds, where N
is configured by the property spark.worker.reconnect.interval - this has
a default of 60 seconds right now.

Author: mcheah <mcheah@palantir.com>

Closes #2828 from mccheah/reconnect-dead-workers and squashes the following commits:

83f8bc9 [mcheah] [SPARK-3736] More informative log message, and fixing some indentation.
fe0e02f [mcheah] [SPARK-3736] Moving reconnection logic to registerWithMaster().
94ddeca [mcheah] [SPARK-3736] Changing a log warning to a log info.
a698e35 [mcheah] [SPARK-3736] Addressing PR comment to make some defs private.
b9a3077 [mcheah] [SPARK-3736] Addressing PR comments related to reconnection.
2ad5ed5 [mcheah] [SPARK-3736] Cancel attempts to reconnect if the master changes.
b5b34af [mcheah] [SPARK-3736] Workers reconnect when disassociated from the master.
2014-10-20 11:35:18 -07:00
GuoQiang Li 51afde9d8b [SPARK-4010][Web UI]Spark UI returns 500 in yarn-client mode
The problem caused by #1966
CC YanTangZhai andrewor14

Author: GuoQiang Li <witgo@qq.com>

Closes #2858 from witgo/SPARK-4010 and squashes the following commits:

9866fbf [GuoQiang Li] Spark UI returns 500 in yarn-client mode
2014-10-20 11:03:53 -07:00
jerryshao c7aeecd08f [SPARK-3948][Shuffle]Fix stream corruption bug in sort-based shuffle
Kernel 2.6.32 bug will lead to unexpected behavior of transferTo in copyStream, and this will corrupt the shuffle output file in sort-based shuffle, which will somehow introduce PARSING_ERROR(2), deserialization error or offset out of range. Here fix this by adding append flag, also add some position checking code. Details can be seen in [SPARK-3948](https://issues.apache.org/jira/browse/SPARK-3948).

Author: jerryshao <saisai.shao@intel.com>

Closes #2824 from jerryshao/SPARK-3948 and squashes the following commits:

be0533a [jerryshao] Address the comments
a82b184 [jerryshao] add configuration to control the NIO way of copying stream
e17ada2 [jerryshao] Fix kernel 2.6.32 bug led unexpected behavior of transferTo
2014-10-20 10:20:21 -07:00
Josh Rosen d1966f3a8b [SPARK-3902] [SPARK-3590] Stabilize AsynRDDActions and add Java API
This PR adds a Java API for AsyncRDDActions and promotes the API from `Experimental` to stable.

Author: Josh Rosen <joshrosen@apache.org>
Author: Josh Rosen <joshrosen@databricks.com>

Closes #2760 from JoshRosen/async-rdd-actions-in-java and squashes the following commits:

0d45fbc [Josh Rosen] Whitespace fix.
ad3ae53 [Josh Rosen] Merge remote-tracking branch 'origin/master' into async-rdd-actions-in-java
c0153a5 [Josh Rosen] Remove unused variable.
e8e2867 [Josh Rosen] Updates based on Marcelo's review feedback
7a1417f [Josh Rosen] Removed unnecessary java.util import.
6f8f6ac [Josh Rosen] Fix import ordering.
ff28e49 [Josh Rosen] Add MiMa excludes and fix a scalastyle error.
346e46e [Josh Rosen] [SPARK-3902] Stabilize AsyncRDDActions; add Java API.
2014-10-19 20:02:31 -07:00
Josh Rosen 7e63bb49c5 [SPARK-2546] Clone JobConf for each task (branch-1.0 / 1.1 backport)
This patch attempts to fix SPARK-2546 in `branch-1.0` and `branch-1.1`.  The underlying problem is that thread-safety issues in Hadoop Configuration objects may cause Spark tasks to get stuck in infinite loops.  The approach taken here is to clone a new copy of the JobConf for each task rather than sharing a single copy between tasks.  Note that there are still Configuration thread-safety issues that may affect the driver, but these seem much less likely to occur in practice and will be more complex to fix (see discussion on the SPARK-2546 ticket).

This cloning is guarded by a new configuration option (`spark.hadoop.cloneConf`) and is disabled by default in order to avoid unexpected performance regressions for workloads that are unaffected by the Configuration thread-safety issues.

Author: Josh Rosen <joshrosen@apache.org>

Closes #2684 from JoshRosen/jobconf-fix-backport and squashes the following commits:

f14f259 [Josh Rosen] Add configuration option to control cloning of Hadoop JobConf.
b562451 [Josh Rosen] Remove unused jobConfCacheKey field.
dd25697 [Josh Rosen] [SPARK-2546] [1.0 / 1.1 backport] Clone JobConf for each task.

(cherry picked from commit 2cd40db2b3)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>

Conflicts:
	core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala
2014-10-19 00:35:05 -07:00
Sean Owen f406a83918 SPARK-3926 [CORE] Result of JavaRDD.collectAsMap() is not Serializable
Make JavaPairRDD.collectAsMap result Serializable since Java Maps generally are

Author: Sean Owen <sowen@cloudera.com>

Closes #2805 from srowen/SPARK-3926 and squashes the following commits:

ecb78ee [Sean Owen] Fix conflict between java.io.Serializable and use of Scala's Serializable
f4717f9 [Sean Owen] Oops, fix compile problem
ae1b36f [Sean Owen] Expand to cover Maps returned from other Java API methods as well
51c26c2 [Sean Owen] Make JavaPairRDD.collectAsMap result Serializable since Java Maps generally are
2014-10-18 12:38:18 -07:00
likun c351862064 [SPARK-3935][Core] log the number of records that has been written
There is a unused variable(count) in saveAsHadoopDataset in PairRDDFunctions.scala. The initial idea of this variable seems to count the number of records, so I am adding a log statement to log the number of records that has been written to the writer.

Author: likun <jacky.likun@huawei.com>
Author: jackylk <jacky.likun@huawei.com>

Closes #2791 from jackylk/SPARK-3935 and squashes the following commits:

a874047 [jackylk] removing the unused variable in PairRddFunctions.scala
3bf43c7 [likun] log the number of records has been written
2014-10-17 10:33:45 -07:00
Shivaram Venkataraman e678b9f02a [SPARK-3973] Print call site information for broadcasts
Its hard to debug which broadcast variables refer to what in a big codebase. Printing call site information helps in debugging.

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #2829 from shivaram/spark-broadcast-print and squashes the following commits:

cd6dbdf [Shivaram Venkataraman] Print call site information for broadcasts
2014-10-16 19:43:33 -07:00
yantangzhai dedace83f3 [SPARK-3067] JobProgressPage could not show Fair Scheduler Pools section sometimes
JobProgressPage could not show Fair Scheduler Pools section sometimes.
SparkContext starts webui and then postEnvironmentUpdate. Sometimes JobProgressPage is accessed between webui starting and postEnvironmentUpdate, then the lazy val isFairScheduler will be false. The Fair Scheduler Pools section will not display any more.

Author: yantangzhai <tyz0303@163.com>
Author: YanTangZhai <hakeemzhai@tencent.com>

Closes #1966 from YanTangZhai/SPARK-3067 and squashes the following commits:

d4323f8 [yantangzhai] update [SPARK-3067] JobProgressPage could not show Fair Scheduler Pools section sometimes
8a00106 [YanTangZhai] Merge pull request #6 from apache/master
b6391cc [yantangzhai] revert [SPARK-3067] JobProgressPage could not show Fair Scheduler Pools section sometimes
d2226cd [yantangzhai] [SPARK-3067] JobProgressPage could not show Fair Scheduler Pools section sometimes
cbcba66 [YanTangZhai] Merge pull request #3 from apache/master
aac7f7b [yantangzhai] [SPARK-3067] JobProgressPage could not show Fair Scheduler Pools section sometimes
cdef539 [YanTangZhai] Merge pull request #1 from apache/master
2014-10-16 19:27:19 -07:00
zsxwing 56fd34af52 [SPARK-3741] Add afterExecute for handleConnectExecutor
Sorry. I found that I forgot to add `afterExecute` for `handleConnectExecutor` in #2593.

Author: zsxwing <zsxwing@gmail.com>

Closes #2794 from zsxwing/SPARK-3741 and squashes the following commits:

a0bc4dd [zsxwing] Add afterExecute for handleConnectExecutor
2014-10-16 19:22:02 -07:00
Zhang, Liye 642b246beb [SPARK-3941][CORE] _remainingmem should not increase twice when updateBlockInfo
In BlockManagermasterActor, _remainingMem would increase memSize for twice when updateBlockInfo if new storageLevel is invalid and old storageLevel is "useMemory". Also, _remainingMem should increase with original memory size instead of new memSize.

Author: Zhang, Liye <liye.zhang@intel.com>

Closes #2792 from liyezhang556520/spark-3941-remainMem and squashes the following commits:

3d487cc [Zhang, Liye] make the code concise
0380a32 [Zhang, Liye] [SPARK-3941][CORE] _remainingmem should not increase twice when updateBlockInfo
2014-10-16 19:07:37 -07:00
Aaron Davidson 7f7b50ed9d [SPARK-3923] Increase Akka heartbeat pause above heartbeat interval
Something about the 2.3.4 upgrade seems to have made the issue manifest where all the services disconnect from each other after exactly 1000 seconds (which is the heartbeat interval). [This post](https://groups.google.com/forum/#!topic/akka-user/X3xzpTCbEFs) suggests that heartbeat pause should be greater than heartbeat interval, and increasing the pause from 600s to 6000s seems to have rectified the issue. My current cluster has now exceeded 1400s of uptime without failure!

I do not know why this fixed it, because the threshold we have set for the failure detector is the exponent of a timeout, and 300 is extremely large. Perhaps the default failure detector changed in 2.3.4 and now ignores threshold.

Author: Aaron Davidson <aaron@databricks.com>

Closes #2784 from aarondav/fix-timeout and squashes the following commits:

bd1151a [Aaron Davidson] Increase pause, don't decrease interval
9cb0372 [Aaron Davidson] [SPARK-3923] Decrease Akka heartbeat interval below heartbeat pause
2014-10-16 18:58:18 -07:00
Prashant Sharma 2fe0ba9561 SPARK-3874: Provide stable TaskContext API
This is a small number of clean-up changes on top of #2782. Closes #2782.

Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Patrick Wendell <pwendell@gmail.com>

Closes #2803 from pwendell/pr-2782 and squashes the following commits:

56d5b7a [Patrick Wendell] Minor clean-up
44089ec [Patrick Wendell] Clean-up the TaskContext API.
ed551ce [Prashant Sharma] Fixed a typo
df261d0 [Prashant Sharma] Josh's suggestion
facf3b1 [Prashant Sharma] Fixed the mima issue.
7ecc2fe [Prashant Sharma] CR, Moved implementations to TaskContextImpl
bbd9e05 [Prashant Sharma] adding missed out files to git.
ef633f5 [Prashant Sharma] SPARK-3874, Provide stable TaskContext API
2014-10-16 21:38:45 -04:00
Davies Liu 091d32c52e [SPARK-3971] [MLLib] [PySpark] hotfix: Customized pickler should work in cluster mode
Customized pickler should be registered before unpickling, but in executor, there is no way to register the picklers before run the tasks.

So, we need to register the picklers in the tasks itself, duplicate the javaToPython() and pythonToJava() in MLlib, call SerDe.initialize() before pickling or unpickling.

Author: Davies Liu <davies.liu@gmail.com>

Closes #2830 from davies/fix_pickle and squashes the following commits:

0c85fb9 [Davies Liu] revert the privacy change
6b94e15 [Davies Liu] use JavaConverters instead of JavaConversions
0f02050 [Davies Liu] hotfix: Customized pickler does not work in cluster
2014-10-16 14:56:50 -07:00
Shiti 4c589cac44 [SPARK-3944][Core] Code re-factored as suggested
Author: Shiti <ssaxena.ece@gmail.com>

Closes #2810 from Shiti/master and squashes the following commits:

051d82f [Shiti] setting the default value of uri scheme to "file"  where matching "file" or None yields the same result
2014-10-16 10:52:06 -07:00
GuoQiang Li 293a0b5dbb [SPARK-2098] All Spark processes should support spark-defaults.conf, config file
This is another implementation about #1256
cc andrewor14 vanzin

Author: GuoQiang Li <witgo@qq.com>

Closes #2379 from witgo/SPARK-2098-new and squashes the following commits:

4ef1cbd [GuoQiang Li] review commit
49ef70e [GuoQiang Li] Refactor getDefaultPropertiesFile
c45d20c [GuoQiang Li] All Spark processes should support spark-defaults.conf, config file
2014-10-14 22:16:38 -07:00
shitis 24b818b971 [SPARK-3944][Core] Using Option[String] where value of String can be null
Author: shitis <ssaxena.ece@gmail.com>

Closes #2795 from Shiti/master and squashes the following commits:

46897d7 [shitis] Using Option Wrapper to convert String with value null to None
2014-10-14 14:16:45 -07:00
Bill Bejeck 9b6de6fbc0 SPARK-3178 setting SPARK_WORKER_MEMORY to a value without a label (m or g) sets the worker memory limit to zero
Validate the memory is greater than zero when set from the SPARK_WORKER_MEMORY environment variable or command line without a g or m label.  Added unit tests. If memory is 0 an IllegalStateException is thrown. Updated unit tests to mock environment variables by subclassing SparkConf (tip provided by Josh Rosen).   Updated WorkerArguments to use SparkConf.getenv instead of System.getenv for reading the SPARK_WORKER_MEMORY environment variable.

Author: Bill Bejeck <bbejeck@gmail.com>

Closes #2309 from bbejeck/spark-memory-worker and squashes the following commits:

51cf915 [Bill Bejeck] SPARK-3178 - Validate the memory is greater than zero when set from the SPARK_WORKER_MEMORY environment variable or command line without a g or m label.  Added unit tests. If memory is 0 an IllegalStateException is thrown. Updated unit tests to mock environment variables by subclassing SparkConf (tip provided by Josh Rosen).   Updated WorkerArguments to use SparkConf.getenv instead of System.getenv for reading the SPARK_WORKER_MEMORY environment variable.
2014-10-14 12:12:38 -07:00
Aaron Davidson 186b497c94 [SPARK-3921] Fix CoarseGrainedExecutorBackend's arguments for Standalone mode
The goal of this patch is to fix the swapped arguments in standalone mode, which was caused by  79e45c9323 (diff-79391110e9f26657e415aa169a004998R153).

More details can be found in the JIRA: [SPARK-3921](https://issues.apache.org/jira/browse/SPARK-3921)

Tested in Standalone mode, but not in Mesos.

Author: Aaron Davidson <aaron@databricks.com>

Closes #2779 from aarondav/fix-standalone and squashes the following commits:

725227a [Aaron Davidson] Fix ExecutorRunnerTest
9d703fe [Aaron Davidson] [SPARK-3921] Fix CoarseGrainedExecutorBackend's arguments for Standalone mode
2014-10-13 23:31:37 -07:00
yingjieMiao 49bbdcb660 [Spark] RDD take() method: overestimate too much
In the comment (Line 1083), it says: "Otherwise, interpolate the number of partitions we need to try, but overestimate it by 50%."

`(1.5 * num * partsScanned / buf.size).toInt` is the guess of "num of total partitions needed". In every iteration, we should consider the increment `(1.5 * num * partsScanned / buf.size).toInt - partsScanned`
Existing implementation 'exponentially' grows `partsScanned ` ( roughly: `x_{n+1} >= (1.5 + 1) x_n`)

This could be a performance problem. (unless this is the intended behavior)

Author: yingjieMiao <yingjie@42go.com>

Closes #2648 from yingjieMiao/rdd_take and squashes the following commits:

d758218 [yingjieMiao] scala style fix
a8e74bb [yingjieMiao] python style fix
4b6e777 [yingjieMiao] infix operator style fix
4391d3b [yingjieMiao] typo fix.
692f4e6 [yingjieMiao] cap numPartsToTry
c4483dc [yingjieMiao] style fix
1d2c410 [yingjieMiao] also change in rdd.py and AsyncRDD
d31ff7e [yingjieMiao] handle the edge case after 1 iteration
a2aa36b [yingjieMiao] RDD take method: overestimate too much
2014-10-13 13:11:55 -07:00
GuoQiang Li b4a7fa7a66 [SPARK-3905][Web UI]The keys for sorting the columns of Executor page ,Stage page Storage page are incorrect
Author: GuoQiang Li <witgo@qq.com>

Closes #2763 from witgo/SPARK-3905 and squashes the following commits:

17d7990 [GuoQiang Li] The keys for sorting the columns of Executor page ,Stage page Storage page are incorrect
2014-10-12 22:48:54 -07:00
Jakub Dubovský fc616d51a5 [SPARK-3121] Wrong implementation of implicit bytesWritableConverter
val path = ... //path to seq file with BytesWritable as type of both key and value
val file = sc.sequenceFile[Array[Byte],Array[Byte]](path)
file.take(1)(0)._1

This prints incorrect content of byte array. Actual content starts with correct one and some "random" bytes and zeros are appended. BytesWritable has two methods:

getBytes() - return content of all internal array which is often longer then actual value stored. It usually contains the rest of previous longer values

copyBytes() - return just begining of internal array determined by internal length property

It looks like in implicit conversion between BytesWritable and Array[byte] getBytes is used instead of correct copyBytes.

dbtsai

Author: Jakub Dubovský <james64@inMail.sk>
Author: Dubovsky Jakub <dubovsky@avast.com>

Closes #2712 from james64/3121-bugfix and squashes the following commits:

f85d24c [Jakub Dubovský] Test name changed, comments added
1b20d51 [Jakub Dubovský] Import placed correctly
406e26c [Jakub Dubovský] Scala style fixed
f92ffa6 [Dubovsky Jakub] performance tuning
480f9cd [Dubovsky Jakub] Bug 3121 fixed
2014-10-12 22:03:26 -07:00
Josh Rosen 18bd67c24b [SPARK-3887] Send stracktrace in ConnectionManager error replies
When reporting that a remote error occurred, the ConnectionManager should also log the stacktrace of the remote exception. This PR accomplishes this by sending the remote exception's stacktrace as the payload in the "negative ACK / error message."

Author: Josh Rosen <joshrosen@apache.org>

Closes #2741 from JoshRosen/propagate-cm-exceptions-to-sender and squashes the following commits:

b5366cc [Josh Rosen] Explicitly encode error messages using UTF-8.
cef18b3 [Josh Rosen] [SPARK-3887] Send stracktrace in ConnectionManager error messages.
2014-10-12 13:08:42 -07:00
giwa 69c67abaa9 [SPARK-2377] Python API for Streaming
This patch brings Python API for Streaming.

This patch is based on work from @giwa

Author: giwa <ugw.gi.world@gmail.com>
Author: Ken Takagiwa <ken@Kens-MacBook-Pro.local>
Author: Davies Liu <davies.liu@gmail.com>
Author: Ken Takagiwa <ken@kens-mbp.gateway.sonic.net>
Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Ken <ugw.gi.world@gmail.com>
Author: Ken Takagiwa <ugw.gi.world@gmail.com>
Author: Matthew Farrellee <matt@redhat.com>

Closes #2538 from davies/streaming and squashes the following commits:

64561e4 [Davies Liu] fix tests
331ecce [Davies Liu] fix example
3e2492b [Davies Liu] change updateStateByKey() to easy API
182be73 [Davies Liu] Merge branch 'master' of github.com:apache/spark into streaming
02d0575 [Davies Liu] add wrapper for foreachRDD()
bebeb4a [Davies Liu] address all comments
6db00da [Davies Liu] Merge branch 'master' of github.com:apache/spark into streaming
8380064 [Davies Liu] Merge branch 'master' of github.com:apache/spark into streaming
52c535b [Davies Liu] remove fix for sum()
e108ec1 [Davies Liu]  address comments
37fe06f [Davies Liu] use random port for callback server
d05871e [Davies Liu] remove reuse of PythonRDD
be5e5ff [Davies Liu] merge branch of env, make tests stable.
8071541 [Davies Liu] Merge branch 'env' into streaming
c7bbbce [Davies Liu] fix sphinx docs
6bb9d91 [Davies Liu] Merge branch 'master' of github.com:apache/spark into streaming
4d0ea8b [Davies Liu] clear reference of SparkEnv after stop
54bd92b [Davies Liu] improve tests
c2b31cb [Davies Liu] Merge branch 'master' of github.com:apache/spark into streaming
7a88f9f [Davies Liu] rollback RDD.setContext(), use textFileStream() to test checkpointing
bd8a4c2 [Davies Liu] fix scala style
7797c70 [Davies Liu] refactor
ff88bec [Davies Liu] rename RDDFunction to TransformFunction
d328aca [Davies Liu] fix serializer in queueStream
6f0da2f [Davies Liu] recover from checkpoint
fa7261b [Davies Liu] refactor
a13ff34 [Davies Liu] address comments
8466916 [Davies Liu] support checkpoint
9a16bd1 [Davies Liu] change number of partitions during tests
b98d63f [Davies Liu] change private[spark] to private[python]
eed6e2a [Davies Liu] rollback not needed changes
e00136b [Davies Liu] address comments
069a94c [Davies Liu] fix the number of partitions during window()
338580a [Davies Liu] change _first(), _take(), _collect() as private API
19797f9 [Davies Liu] clean up
6ebceca [Davies Liu] add more tests
c40c52d [Davies Liu] change first(), take(n) to has the same behavior as RDD
98ac6c2 [Davies Liu] support ssc.transform()
b983f0f [Davies Liu] address comments
847f9b9 [Davies Liu] add more docs, add first(), take()
e059ca2 [Davies Liu] move check of window into Python
fce0ef5 [Davies Liu] rafactor of foreachRDD()
7001b51 [Davies Liu] refactor of queueStream()
26ea396 [Davies Liu] refactor
74df565 [Davies Liu] fix print and docs
b32774c [Davies Liu] move java_import into streaming
604323f [Davies Liu] enable streaming tests
c499ba0 [Davies Liu] remove Time and Duration
3f0fb4b [Davies Liu] refactor fix tests
c28f520 [Davies Liu] support updateStateByKey
d357b70 [Davies Liu] support windowed dstream
bd13026 [Davies Liu] fix examples
eec401e [Davies Liu] refactor, combine TransformedRDD, fix reuse PythonRDD, fix union
9a57685 [Davies Liu] fix python style
bd27874 [Davies Liu] fix scala style
7339be0 [Davies Liu] delete tests
7f53086 [Davies Liu] support transform(), refactor and cleanup
df098fc [Davies Liu] Merge branch 'master' into giwa
550dfd9 [giwa] WIP fixing 1.1 merge
5cdb6fa [giwa] changed for SCCallSiteSync
e685853 [giwa] meged with rebased 1.1 branch
2d32a74 [giwa] added some StreamingContextTestSuite
4a59e1e [giwa] WIP:added more test for StreamingContext
8ffdbf1 [giwa] added atexit to handle callback server
d5f5fcb [giwa] added comment for StreamingContext.sparkContext
63c881a [giwa] added StreamingContext.sparkContext
d39f102 [giwa] added StreamingContext.remember
d542743 [giwa] clean up code
2fdf0de [Matthew Farrellee] Fix scalastyle errors
c0a06bc [giwa] delete not implemented functions
f385976 [giwa] delete inproper comments
b0f2015 [giwa] added comment in dstream._test_output
bebb3f3 [giwa] remove the last brank line
fbed8da [giwa] revert pom.xml
8ed93af [giwa] fixed explanaiton
066ba90 [giwa] revert pom.xml
fa4af88 [giwa] remove duplicated import
6ae3caa [giwa] revert pom.xml
7dc7391 [giwa] fixed typo
62dc7a3 [giwa] clean up exmples
f04882c [giwa] clen up examples
b171ec3 [giwa] fixed pep8 violation
f198d14 [giwa] clean up code
3166d31 [giwa] clean up
c00e091 [giwa] change test case not to use awaitTermination
e80647e [giwa] adopted the latest compression way of python command
58e41ff [giwa] merge with master
455e5af [giwa] removed wasted print in DStream
af336b7 [giwa] add comments
ddd4ee1 [giwa] added TODO coments
99ce042 [giwa] added saveAsTextFiles and saveAsPickledFiles
2a06cdb [giwa] remove waste duplicated code
c5ecfc1 [giwa] basic function test cases are passed
8dcda84 [giwa] all tests are passed if numSlice is 2 and the numver of each input is over 4
795b2cd [giwa] broke something
1e126bf [giwa] WIP: solved partitioned and None is not recognized
f67cf57 [giwa] added mapValues and flatMapVaules WIP for glom and mapPartitions test
953deb0 [giwa] edited the comment to add more precise description
af610d3 [giwa] removed unnesessary changes
c1d546e [giwa] fixed PEP-008 violation
99410be [giwa] delete waste file
b3b0362 [giwa] added basic operation test cases
9cde7c9 [giwa] WIP added test case
bd3ba53 [giwa] WIP
5c04a5f [giwa] WIP: added PythonTestInputStream
019ef38 [giwa] WIP
1934726 [giwa] update comment
376e3ac [giwa] WIP
932372a [giwa] clean up dstream.py
0b09cff [giwa] added stop in StreamingContext
92e333e [giwa] implemented reduce and count function in Dstream
1b83354 [giwa] Removed the waste line
88f7506 [Ken Takagiwa] Kill py4j callback server properly
54b5358 [Ken Takagiwa] tried to restart callback server
4f07163 [Tathagata Das] Implemented DStream.foreachRDD in the Python API using Py4J callback server.
fe02547 [Ken Takagiwa] remove waste file
2ad7bd3 [Ken Takagiwa] clean up codes
6197a11 [Ken Takagiwa] clean up code
eb4bf48 [Ken Takagiwa] fix map function
98c2a00 [Ken Takagiwa] added count operation but this implementation need double check
58591d2 [Ken Takagiwa] reduceByKey is working
0df7111 [Ken Takagiwa] delete old file
f485b1d [Ken Takagiwa] fied input of socketTextDStream
dd6de81 [Ken Takagiwa] initial commit for socketTextStream
247fd74 [Ken Takagiwa] modified the code base on comment in https://github.com/tdas/spark/pull/10
4bcb318 [Ken Takagiwa] implementing transform function in Python
38adf95 [Ken Takagiwa] added reducedByKey not working yet
66fcfff [Ken Takagiwa] modify dstream.py to fix indent error
41886c2 [Ken Takagiwa] comment PythonDStream.PairwiseDStream
0b99bec [Ken] initial commit for pySparkStreaming
c214199 [giwa] added testcase for combineByKey
5625bdc [giwa] added gorupByKey testcase
10ab87b [giwa] added sparkContext as input parameter in StreamingContext
10b5b04 [giwa] removed wasted print in DStream
e54f986 [giwa] add comments
16aa64f [giwa] added TODO coments
74535d4 [giwa] added saveAsTextFiles and saveAsPickledFiles
f76c182 [giwa] remove waste duplicated code
18c8723 [giwa] modified streaming test case to add coment
13fb44c [giwa] basic function test cases are passed
3000b2b [giwa] all tests are passed if numSlice is 2 and the numver of each input is over 4
ff14070 [giwa] broke something
bcdec33 [giwa] WIP: solved partitioned and None is not recognized
270a9e1 [giwa] added mapValues and flatMapVaules WIP for glom and mapPartitions test
bb10956 [giwa] edited the comment to add more precise description
253a863 [giwa] removed unnesessary changes
3d37822 [giwa] fixed PEP-008 violation
f21cab3 [giwa] delete waste file
878bad7 [giwa] added basic operation test cases
ce2acd2 [giwa] WIP added test case
9ad6855 [giwa] WIP
1df77f5 [giwa] WIP: added PythonTestInputStream
1523b66 [giwa] WIP
8a0fbbc [giwa] update comment
fe648e3 [giwa] WIP
29c2bc5 [giwa] initial commit for testcase
4d40d63 [giwa] clean up dstream.py
c462bb3 [giwa] added stop in StreamingContext
d2c01ba [giwa] clean up examples
3c45cd2 [giwa] implemented reduce and count function in Dstream
b349649 [giwa] Removed the waste line
3b498e1 [Ken Takagiwa] Kill py4j callback server properly
84a9668 [Ken Takagiwa] tried to restart callback server
9ab8952 [Tathagata Das] Added extra line.
05e991b [Tathagata Das] Added missing file
b1d2a30 [Tathagata Das] Implemented DStream.foreachRDD in the Python API using Py4J callback server.
678e854 [Ken Takagiwa] remove waste file
0a8bbbb [Ken Takagiwa] clean up codes
bab31c1 [Ken Takagiwa] clean up code
72b9738 [Ken Takagiwa] fix map function
d3ee86a [Ken Takagiwa] added count operation but this implementation need double check
15feea9 [Ken Takagiwa] edit python sparkstreaming example
6f98e50 [Ken Takagiwa] reduceByKey is working
c455c8d [Ken Takagiwa] added reducedByKey not working yet
dc6995d [Ken Takagiwa] delete old file
b31446a [Ken Takagiwa] fixed typo of network_workdcount.py
ccfd214 [Ken Takagiwa] added doctest for pyspark.streaming.duration
0d1b954 [Ken Takagiwa] fied input of socketTextDStream
f746109 [Ken Takagiwa] initial commit for socketTextStream
bb7ccf3 [Ken Takagiwa] remove unused import in python
224fc5e [Ken Takagiwa] add empty line
d2099d8 [Ken Takagiwa] sorted the import following Spark coding convention
5bac7ec [Ken Takagiwa] revert streaming/pom.xml
e1df940 [Ken Takagiwa] revert pom.xml
494cae5 [Ken Takagiwa] remove not implemented DStream functions in python
17a74c6 [Ken Takagiwa] modified the code base on comment in https://github.com/tdas/spark/pull/10
1a0f065 [Ken Takagiwa] implementing transform function in Python
d7b4d6f [Ken Takagiwa] added reducedByKey not working yet
87438e2 [Ken Takagiwa] modify dstream.py to fix indent error
b406252 [Ken Takagiwa] comment PythonDStream.PairwiseDStream
454981d [Ken] initial commit for pySparkStreaming
150b94c [giwa] added some StreamingContextTestSuite
f7bc8f9 [giwa] WIP:added more test for StreamingContext
ee50c5a [giwa] added atexit to handle callback server
fdc9125 [giwa] added comment for StreamingContext.sparkContext
f5bfb70 [giwa] added StreamingContext.sparkContext
da09768 [giwa] added StreamingContext.remember
d68b568 [giwa] clean up code
4afa390 [giwa] clean up code
1fd6bc7 [Ken Takagiwa] Merge pull request #2 from mattf/giwa-master
d9d59fe [Matthew Farrellee] Fix scalastyle errors
67473a9 [giwa] delete not implemented functions
c97377c [giwa] delete inproper comments
2ea769e [giwa] added comment in dstream._test_output
3b27bd4 [giwa] remove the last brank line
acfcaeb [giwa] revert pom.xml
93f7637 [giwa] fixed explanaiton
50fd6f9 [giwa] revert pom.xml
4f82c89 [giwa] remove duplicated import
9d1de23 [giwa] revert pom.xml
7339df2 [giwa] fixed typo
9c85e48 [giwa] clean up exmples
24f95db [giwa] clen up examples
0d30109 [giwa] fixed pep8 violation
b7dab85 [giwa] improve test case
583e66d [giwa] move tests for streaming inside streaming directory
1d84142 [giwa] remove unimplement test
f0ea311 [giwa] clean up code
171edeb [giwa] clean up
4dedd2d [giwa] change test case not to use awaitTermination
268a6a5 [giwa] Changed awaitTermination not to call awaitTermincation in Scala. Just use time.sleep instread
09a28bf [giwa] improve testcases
58150f5 [giwa] Changed the test case to focus the test operation
199e37f [giwa] adopted the latest compression way of python command
185fdbf [giwa] merge with master
f1798c4 [giwa] merge with master
e70f706 [giwa] added testcase for combineByKey
e162822 [giwa] added gorupByKey testcase
97742fe [giwa] added sparkContext as input parameter in StreamingContext
14d4c0e [giwa] removed wasted print in DStream
6d8190a [giwa] add comments
4aa99e4 [giwa] added TODO coments
e9fab72 [giwa] added saveAsTextFiles and saveAsPickledFiles
94f2b65 [giwa] remove waste duplicated code
580fbc2 [giwa] modified streaming test case to add coment
99e4bb3 [giwa] basic function test cases are passed
7051a84 [giwa] all tests are passed if numSlice is 2 and the numver of each input is over 4
35933e1 [giwa] broke something
9767712 [giwa] WIP: solved partitioned and None is not recognized
4f2d7e6 [giwa] added mapValues and flatMapVaules WIP for glom and mapPartitions test
33c0f94d [giwa] edited the comment to add more precise description
774f18d [giwa] removed unnesessary changes
3a671cc [giwa] remove export PYSPARK_PYTHON in spark submit
8efa266 [giwa] fixed PEP-008 violation
fa75d71 [giwa] delete waste file
7f96294 [giwa] added basic operation test cases
3dda31a [giwa] WIP added test case
1f68b78 [giwa] WIP
c05922c [giwa] WIP: added PythonTestInputStream
1fd12ae [giwa] WIP
c880a33 [giwa] update comment
5d22c92 [giwa] WIP
ea4b06b [giwa] initial commit for testcase
5a9b525 [giwa] clean up dstream.py
79c5809 [giwa] added stop in StreamingContext
189dcea [giwa] clean up examples
b8d7d24 [giwa] implemented reduce and count function in Dstream
b6468e6 [giwa] Removed the waste line
b47b5fd [Ken Takagiwa] Kill py4j callback server properly
19ddcdd [Ken Takagiwa] tried to restart callback server
c9fc124 [Tathagata Das] Added extra line.
4caae3f [Tathagata Das] Added missing file
4eff053 [Tathagata Das] Implemented DStream.foreachRDD in the Python API using Py4J callback server.
5e822d4 [Ken Takagiwa] remove waste file
aeaf8a5 [Ken Takagiwa] clean up codes
9fa249b [Ken Takagiwa] clean up code
05459c6 [Ken Takagiwa] fix map function
a9f4ecb [Ken Takagiwa] added count operation but this implementation need double check
d1ee6ca [Ken Takagiwa] edit python sparkstreaming example
0b8b7d0 [Ken Takagiwa] reduceByKey is working
d25d5cf [Ken Takagiwa] added reducedByKey not working yet
7f7c5d1 [Ken Takagiwa] delete old file
967dc26 [Ken Takagiwa] fixed typo of network_workdcount.py
57fb740 [Ken Takagiwa] added doctest for pyspark.streaming.duration
4b69fb1 [Ken Takagiwa] fied input of socketTextDStream
02f618a [Ken Takagiwa] initial commit for socketTextStream
4ce4058 [Ken Takagiwa] remove unused import in python
856d98e [Ken Takagiwa] add empty line
490e338 [Ken Takagiwa] sorted the import following Spark coding convention
5594bd4 [Ken Takagiwa] revert pom.xml
2adca84 [Ken Takagiwa] remove not implemented DStream functions in python
e551e13 [Ken Takagiwa] add coment for hack why PYSPARK_PYTHON is needed in spark-submit
3758175 [Ken Takagiwa] add coment for hack why PYSPARK_PYTHON is needed in spark-submit
c5518b4 [Ken Takagiwa] modified the code base on comment in https://github.com/tdas/spark/pull/10
dcf243f [Ken Takagiwa] implementing transform function in Python
9af03f4 [Ken Takagiwa] added reducedByKey not working yet
6e0d9c7 [Ken Takagiwa] modify dstream.py to fix indent error
e497b9b [Ken Takagiwa] comment PythonDStream.PairwiseDStream
5c3a683 [Ken] initial commit for pySparkStreaming
665bfdb [giwa] added testcase for combineByKey
a3d2379 [giwa] added gorupByKey testcase
636090a [giwa] added sparkContext as input parameter in StreamingContext
e7ebb08 [giwa] removed wasted print in DStream
d8b593b [giwa] add comments
ea9c873 [giwa] added TODO coments
89ae38a [giwa] added saveAsTextFiles and saveAsPickledFiles
e3033fc [giwa] remove waste duplicated code
a14c7e1 [giwa] modified streaming test case to add coment
536def4 [giwa] basic function test cases are passed
2112638 [giwa] all tests are passed if numSlice is 2 and the numver of each input is over 4
080541a [giwa] broke something
0704b86 [giwa] WIP: solved partitioned and None is not recognized
90a6484 [giwa] added mapValues and flatMapVaules WIP for glom and mapPartitions test
a65f302 [giwa] edited the comment to add more precise description
bdde697 [giwa] removed unnesessary changes
e8c7bfc [giwa] remove export PYSPARK_PYTHON in spark submit
3334169 [giwa] fixed PEP-008 violation
db0a303 [giwa] delete waste file
2cfd3a0 [giwa] added basic operation test cases
90ae568 [giwa] WIP added test case
a120d07 [giwa] WIP
f671cdb [giwa] WIP: added PythonTestInputStream
56fae45 [giwa] WIP
e35e101 [giwa] Merge branch 'master' into testcase
ba5112d [giwa] update comment
28aa56d [giwa] WIP
fb08559 [giwa] initial commit for testcase
a613b85 [giwa] clean up dstream.py
c40c0ef [giwa] added stop in StreamingContext
31e4260 [giwa] clean up examples
d2127d6 [giwa] implemented reduce and count function in Dstream
48f7746 [giwa] Removed the waste line
0f83eaa [Ken Takagiwa] delete py4j 0.8.1
1679808 [Ken Takagiwa] Kill py4j callback server properly
f96cd4e [Ken Takagiwa] tried to restart callback server
fe86198 [Ken Takagiwa] add py4j 0.8.2.1 but server is not launched
1064fe0 [Ken Takagiwa] Merge branch 'master' of https://github.com/giwa/spark
28c6620 [Ken Takagiwa] Implemented DStream.foreachRDD in the Python API using Py4J callback server
85b0fe1 [Ken Takagiwa] Merge pull request #1 from tdas/python-foreach
54e2e8c [Tathagata Das] Added extra line.
e185338 [Tathagata Das] Added missing file
a778d4b [Tathagata Das] Implemented DStream.foreachRDD in the Python API using Py4J callback server.
cc2092b [Ken Takagiwa] remove waste file
d042ac6 [Ken Takagiwa] clean up codes
84a021f [Ken Takagiwa] clean up code
bd20e17 [Ken Takagiwa] fix map function
d01a125 [Ken Takagiwa] added count operation but this implementation need double check
7d05109 [Ken Takagiwa] merge with remote branch
ae464e0 [Ken Takagiwa] edit python sparkstreaming example
04af046 [Ken Takagiwa] reduceByKey is working
3b6d7b0 [Ken Takagiwa] implementing transform function in Python
571d52d [Ken Takagiwa] added reducedByKey not working yet
5720979 [Ken Takagiwa] delete old file
e604fcb [Ken Takagiwa] fixed typo of network_workdcount.py
4b7c08b [Ken Takagiwa] Merge branch 'master' of https://github.com/giwa/spark
ce7d426 [Ken Takagiwa] added doctest for pyspark.streaming.duration
a8c9fd5 [Ken Takagiwa] fixed for socketTextStream
a61fa9e [Ken Takagiwa] fied input of socketTextDStream
1e84f41 [Ken Takagiwa] initial commit for socketTextStream
6d012f7 [Ken Takagiwa] remove unused import in python
25d30d5 [Ken Takagiwa] add empty line
6e0a64a [Ken Takagiwa] sorted the import following Spark coding convention
fa4a7fc [Ken Takagiwa] revert streaming/pom.xml
8f8202b [Ken Takagiwa] revert streaming pom.xml
c9d79dd [Ken Takagiwa] revert pom.xml
57e3e52 [Ken Takagiwa] remove not implemented DStream functions in python
0a516f5 [Ken Takagiwa] add coment for hack why PYSPARK_PYTHON is needed in spark-submit
a7a0b5c [Ken Takagiwa] add coment for hack why PYSPARK_PYTHON is needed in spark-submit
72bfc66 [Ken Takagiwa] modified the code base on comment in https://github.com/tdas/spark/pull/10
69e9cd3 [Ken Takagiwa] implementing transform function in Python
94a0787 [Ken Takagiwa] added reducedByKey not working yet
88068cf [Ken Takagiwa] modify dstream.py to fix indent error
1367be5 [Ken Takagiwa] comment PythonDStream.PairwiseDStream
eb2b3ba [Ken] Merge remote-tracking branch 'upstream/master'
d8e51f9 [Ken] initial commit for pySparkStreaming
2014-10-12 02:46:56 -07:00
Prashant Sharma 0e8203f4fb [SPARK-2924] Required by scala 2.11, only one fun/ctor amongst overriden alternatives, can have default argument(s).
...riden alternatives, can have default argument.

Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #2750 from ScrapCodes/SPARK-2924/default-args-removed and squashes the following commits:

d9785c3 [Prashant Sharma] [SPARK-2924] Required by scala 2.11, only one function/ctor amongst overriden alternatives, can have default argument.
2014-10-10 18:39:55 -07:00
Aaron Davidson 90f73fcc47 [SPARK-3889] Attempt to avoid SIGBUS by not mmapping files in ConnectionManager
In general, individual shuffle blocks are frequently small, so mmapping them often creates a lot of waste. It may not be bad to mmap the larger ones, but it is pretty inconvenient to get configuration into ManagedBuffer, and besides it is unlikely to help all that much.

Author: Aaron Davidson <aaron@databricks.com>

Closes #2742 from aarondav/mmap and squashes the following commits:

a152065 [Aaron Davidson] Add other pathway back
52b6cd2 [Aaron Davidson] [SPARK-3889] Attempt to avoid SIGBUS by not mmapping files in ConnectionManager
2014-10-10 01:44:36 -07:00
Anand Avati 411cf29fff [SPARK-2805] Upgrade Akka to 2.3.4
This is a second rev of the Akka upgrade (earlier merged, but reverted). I made a slight modification which is that I also upgrade Hive to deal with a compatibility issue related to the protocol buffers library.

Author: Anand Avati <avati@redhat.com>
Author: Patrick Wendell <pwendell@gmail.com>

Closes #2752 from pwendell/akka-upgrade and squashes the following commits:

4c7ca3f [Patrick Wendell] Upgrading to new hive->protobuf version
57a2315 [Anand Avati] SPARK-1812: streaming - remove tests which depend on akka.actor.IO
2a551d3 [Anand Avati] SPARK-1812: core - upgrade to akka 2.3.4
2014-10-10 00:46:56 -07:00
Sean Owen 363baacade SPARK-3811 [CORE] More robust / standard Utils.deleteRecursively, Utils.createTempDir
I noticed a few issues with how temp directories are created and deleted:

*Minor*

* Guava's `Files.createTempDir()` plus `File.deleteOnExit()` is used in many tests to make a temp dir, but `Utils.createTempDir()` seems to be the standard Spark mechanism
* Call to `File.deleteOnExit()` could be pushed into `Utils.createTempDir()` as well, along with this replacement
* _I messed up the message in an exception in `Utils` in SPARK-3794; fixed here_

*Bit Less Minor*

* `Utils.deleteRecursively()` fails immediately if any `IOException` occurs, instead of trying to delete any remaining files and subdirectories. I've observed this leave temp dirs around. I suggest changing it to continue in the face of an exception and throw one of the possibly several exceptions that occur at the end.
* `Utils.createTempDir()` will add a JVM shutdown hook every time the method is called. Even if the subdir is the parent of another parent dir, since this check is inside the hook. However `Utils` manages a set of all dirs to delete on shutdown already, called `shutdownDeletePaths`. A single hook can be registered to delete all of these on exit. This is how Tachyon temp paths are cleaned up in `TachyonBlockManager`.

I noticed a few other things that might be changed but wanted to ask first:

* Shouldn't the set of dirs to delete be `File`, not just `String` paths?
* `Utils` manages the set of `TachyonFile` that have been registered for deletion, but the shutdown hook is managed in `TachyonBlockManager`. Should this logic not live together, and not in `Utils`? it's more specific to Tachyon, and looks a slight bit odd to import in such a generic place.

Author: Sean Owen <sowen@cloudera.com>

Closes #2670 from srowen/SPARK-3811 and squashes the following commits:

071ae60 [Sean Owen] Update per @vanzin's review
da0146d [Sean Owen] Make Utils.deleteRecursively try to delete all paths even when an exception occurs; use one shutdown hook instead of one per method call to delete temp dirs
3a0faa4 [Sean Owen] Standardize on Utils.createTempDir instead of Files.createTempDir
2014-10-09 18:21:59 -07:00
Josh Rosen 4e9b551a0b [SPARK-3772] Allow ipython to be used by Pyspark workers; IPython support improvements:
This pull request addresses a few issues related to PySpark's IPython support:

- Fix the remaining uses of the '-u' flag, which IPython doesn't support (see SPARK-3772).
- Change PYSPARK_PYTHON_OPTS to PYSPARK_DRIVER_PYTHON_OPTS, so that the old name is reserved in case we ever want to allow the worker Python options to be customized (this variable was introduced in #2554 and hasn't landed in a release yet, so this doesn't break any compatibility).
- Introduce a PYSPARK_DRIVER_PYTHON option that allows the driver to use `ipython` while the workers use a different Python version.
- Attempt to use Python 2.7 by default if PYSPARK_PYTHON is not specified.
- Retain the old semantics for IPYTHON=1 and IPYTHON_OPTS (to avoid breaking existing example programs).

There are more details in a block comment in `bin/pyspark`.

Author: Josh Rosen <joshrosen@apache.org>

Closes #2651 from JoshRosen/SPARK-3772 and squashes the following commits:

7b8eb86 [Josh Rosen] More changes to PySpark python executable configuration:
c4f5778 [Josh Rosen] [SPARK-3772] Allow ipython to be used by Pyspark workers; IPython fixes:
2014-10-09 16:08:07 -07:00
Patrick Wendell 1faa1135a3 Revert "[SPARK-2805] Upgrade to akka 2.3.4"
This reverts commit b9df8af62e.
2014-10-09 14:50:36 -07:00
zsxwing 73bf3f2e0c [SPARK-3741] Make ConnectionManager propagate errors properly and add mo...
...re logs to avoid Executors swallowing errors

This PR made the following changes:
* Register a callback to `Connection` so that the error will be propagated properly.
* Add more logs so that the errors won't be swallowed by Executors.
* Use trySuccess/tryFailure because `Promise` doesn't allow to call success/failure more than once.

Author: zsxwing <zsxwing@gmail.com>

Closes #2593 from zsxwing/SPARK-3741 and squashes the following commits:

1d5aed5 [zsxwing] Fix naming
0b8a61c [zsxwing] Merge branch 'master' into SPARK-3741
764aec5 [zsxwing] [SPARK-3741] Make ConnectionManager propagate errors properly and add more logs to avoid Executors swallowing errors
2014-10-09 11:27:21 -07:00
Xiangrui Meng 86b392942d [SPARK-3844][UI] Truncate appName in WebUI if it is too long
Truncate appName in WebUI if it is too long.

Author: Xiangrui Meng <meng@databricks.com>

Closes #2707 from mengxr/truncate-app-name and squashes the following commits:

87834ce [Xiangrui Meng] move scala import below java
c7111dc [Xiangrui Meng] truncate appName in WebUI if it is too long
2014-10-09 00:00:24 -07:00
Anand Avati b9df8af62e [SPARK-2805] Upgrade to akka 2.3.4
Upgrade to akka 2.3.4

Author: Anand Avati <avati@redhat.com>

Closes #1685 from avati/SPARK-1812-akka-2.3 and squashes the following commits:

57a2315 [Anand Avati] SPARK-1812: streaming - remove tests which depend on akka.actor.IO
2a551d3 [Anand Avati] SPARK-1812: core - upgrade to akka 2.3.4
2014-10-08 23:45:21 -07:00
Kousuke Saruta b69c9fb6fb [SPARK-3829] Make Spark logo image on the header of HistoryPage as a link to HistoryPage's page #1
There is a Spark logo on the header of HistoryPage.
We can have too many HistoryPages if we run 20+ applications. So I think, it's useful if the logo is as a link to the HistoryPage's page number 1.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2690 from sarutak/SPARK-3829 and squashes the following commits:

908c109 [Kousuke Saruta] Removed extra space.
00bfbd7 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3829
dd87480 [Kousuke Saruta] Made header Spark log image as a link to History Server's top page.
2014-10-07 16:54:49 -07:00
zsxwing 446063eca9 [SPARK-3777] Display "Executor ID" for Tasks in Stage page
Now the Stage page only displays "Executor"(host) for tasks. However, there may be more than one Executors running in the same host. Currently, when some task is hung, I only know the host of the faulty executor. Therefore I have to check all executors in the host.

Adding "Executor ID" in the Tasks table. would be helpful to locate the faulty executor. Here is the new page:

![add_executor_id_for_tasks](https://cloud.githubusercontent.com/assets/1000778/4505774/acb9648c-4afa-11e4-8826-8768a0a60cc9.png)

Author: zsxwing <zsxwing@gmail.com>

Closes #2642 from zsxwing/SPARK-3777 and squashes the following commits:

37945af [zsxwing] Put Executor ID and Host into one cell
4bbe2c7 [zsxwing] [SPARK-3777] Display "Executor ID" for Tasks in Stage page
2014-10-07 16:00:22 -07:00
Andrew Or 553737c6e6 [SPARK-3825] Log more detail when unrolling a block fails
Before:
```
14/10/06 16:45:42 WARN CacheManager: Not enough space to cache partition rdd_0_2
in memory! Free memory is 481861527 bytes.
```
After:
```
14/10/07 11:08:24 WARN MemoryStore: Not enough space to cache rdd_2_0 in memory!
(computed 68.8 MB so far)
14/10/07 11:08:24 INFO MemoryStore: Memory use = 1088.0 B (blocks) + 445.1 MB
(scratch space shared across 8 thread(s)) = 445.1 MB. Storage limit = 459.5 MB.
```

Author: Andrew Or <andrewor14@gmail.com>

Closes #2688 from andrewor14/cache-log-message and squashes the following commits:

28e33d6 [Andrew Or] Shy away from "unrolling"
5638c49 [Andrew Or] Grammar
39a0c28 [Andrew Or] Log more detail when unrolling a block fails
2014-10-07 12:52:10 -07:00
Davies Liu bc87cc410f [SPARK-3731] [PySpark] fix memory leak in PythonRDD
The parent.getOrCompute() of PythonRDD is executed in a separated thread, it should release the memory reserved for shuffle and unrolling finally.

Author: Davies Liu <davies.liu@gmail.com>

Closes #2668 from davies/leak and squashes the following commits:

ae98be2 [Davies Liu] fix memory leak in PythonRDD
2014-10-07 12:20:12 -07:00
Davies Liu 655032965f [SPARK-3762] clear reference of SparkEnv after stop
SparkEnv is cached in ThreadLocal object, so after stop and create a new SparkContext, old SparkEnv is still used by some threads, it will trigger many problems, for example, pyspark will have problem after restart SparkContext, because py4j use thread pool for RPC.

This patch will clear all the references after stop a SparkEnv.

cc mateiz tdas pwendell

Author: Davies Liu <davies.liu@gmail.com>

Closes #2624 from davies/env and squashes the following commits:

a69f30c [Davies Liu] deprecate getThreadLocal
ba77ca4 [Davies Liu] remove getThreadLocal(), update docs
ee62bb7 [Davies Liu] cleanup ThreadLocal of SparnENV
4d0ea8b [Davies Liu] clear reference of SparkEnv after stop
2014-10-07 12:06:12 -07:00
Hossein d65fd554b4 [SPARK-3827] Very long RDD names are not rendered properly in web UI
With Spark SQL we generate very long RDD names. These names are not properly rendered in the web UI.

This PR fixes the rendering issue.

[SPARK-3827] #comment Linking PR with JIRA

Author: Hossein <hossein@databricks.com>

Closes #2687 from falaki/sparkTableUI and squashes the following commits:

fd06409 [Hossein] Limit width of cell when RDD name is too long
2014-10-07 11:46:26 -07:00
Nathan Kronenfeld fd7b155396 Rectify gereneric parameter names between SparkContext and AccumulablePa...
AccumulableParam gave its generic parameters as 'R, T', whereas SparkContext labeled them 'T, R'.

Trivial, but really confusing.

I resolved this in favor of AccumulableParam, because it seemed to have some logic for its names.  I also extended this minimal, but at least present, justification into the SparkContext comments.

Author: Nathan Kronenfeld <nkronenfeld@oculusinfo.com>

Closes #2637 from nkronenfeld/accumulators and squashes the following commits:

98d6b74 [Nathan Kronenfeld] Rectify gereneric parameter names between SparkContext and AccumulableParam
2014-10-05 21:03:48 -07:00
Sean Owen 8d22dbb5ec SPARK-3794 [CORE] Building spark core fails due to inadvertent dependency on Commons IO
Remove references to Commons IO FileUtils and replace with pure Java version, which doesn't need to traverse the whole directory tree first.

I think this method could be refined further if it would be alright to rename it and its args and break it down into two methods. I'm starting with a simple recursive rendition.

Author: Sean Owen <sowen@cloudera.com>

Closes #2662 from srowen/SPARK-3794 and squashes the following commits:

4cd172f [Sean Owen] Remove references to Commons IO FileUtils and replace with pure Java version, which doesn't need to traverse the whole directory tree first
2014-10-05 18:44:12 -07:00
zsxwing a7c73130f1 SPARK-1656: Fix potential resource leaks
JIRA: https://issues.apache.org/jira/browse/SPARK-1656

Author: zsxwing <zsxwing@gmail.com>

Closes #577 from zsxwing/SPARK-1656 and squashes the following commits:

c431095 [zsxwing] Add a comment and fix the code style
2de96e5 [zsxwing] Make sure file will be deleted if exception happens
28b90dc [zsxwing] Update to follow the code style
4521d6e [zsxwing] Merge branch 'master' into SPARK-1656
afc3383 [zsxwing] Update to follow the code style
071fdd1 [zsxwing] SPARK-1656: Fix potential resource leaks
2014-10-05 09:56:23 -07:00
Brenden Matthews 32fad4233f [SPARK-3597][Mesos] Implement killTask.
The MesosSchedulerBackend did not previously implement `killTask`,
resulting in an exception.

Author: Brenden Matthews <brenden@diddyinc.com>

Closes #2453 from brndnmtthws/implement-killtask and squashes the following commits:

23ddcdc [Brenden Matthews] [SPARK-3597][Mesos] Implement `killTask`.
2014-10-05 09:49:24 -07:00
mcheah cf1d32e3e1 [SPARK-1860] More conservative app directory cleanup.
First contribution to the project, so apologize for any significant errors.

This PR addresses [SPARK-1860]. The application directories are now cleaned up in a more conservative manner.

Previously, app-* directories were cleaned up if the directory's timestamp was older than a given time. However, the timestamp on a directory does not reflect the modification times of the files in that directory. Therefore, app-* directories were wiped out even if the files inside them were created recently and possibly being used by Executor tasks.

The solution is to change the cleanup logic to inspect all files within the app-* directory and only eliminate the app-* directory if all files in the directory are stale.

Author: mcheah <mcheah@palantir.com>

Closes #2609 from mccheah/worker-better-app-dir-cleanup and squashes the following commits:

87b5d03 [mcheah] [SPARK-1860] Using more string interpolation. Better error logging.
802473e [mcheah] [SPARK-1860] Cleaning up the logs generated when cleaning directories.
e0a1f2e [mcheah] [SPARK-1860] Fixing broken unit test.
77a9de0 [mcheah] [SPARK-1860] More conservative app directory cleanup.
2014-10-03 14:22:52 -07:00
Kousuke Saruta 79e45c9323 [SPARK-3377] [SPARK-3610] Metrics can be accidentally aggregated / History server log name should not be based on user input
This PR is another solution for #2250

I'm using codahale base MetricsSystem of Spark with JMX or Graphite, and I saw following 2 problems.

(1) When applications which have same spark.app.name run on cluster at the same time, some metrics names are mixed. For instance, if 2+ application is running on the cluster at the same time, each application emits the same named metric like "SparkPi.DAGScheduler.stage.failedStages" and Graphite cannot distinguish the metrics is for which application.

(2) When 2+ executors run on the same machine, JVM metrics of each executors are mixed. For instance, 2+ executors running on the same node can emit the same named metric "jvm.memory" and Graphite cannot distinguish the metrics is from which application.

And there is an similar issue. The directory for event logs is named using application name.
Application name is defined by user and the name can includes illegal character for path names.
Further more, the directory name consists of application name and System.currentTimeMillis even though each application has unique Application ID so if we run jobs which have same name, it's difficult to identify which directory is for which application.

Closes #2250
Closes #1067

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2432 from sarutak/metrics-structure-improvement2 and squashes the following commits:

3288b2b [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
39169e4 [Kousuke Saruta] Fixed style
6570494 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
817e4f0 [Kousuke Saruta] Simplified MetricsSystem#buildRegistryName
67fa5eb [Kousuke Saruta] Unified MetricsSystem#registerSources and registerSinks in start
10be654 [Kousuke Saruta] Fixed style.
990c078 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
f0c7fba [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
59cc2cd [Kousuke Saruta] Modified SparkContextSchedulerCreationSuite
f9b6fb3 [Kousuke Saruta] Modified style.
2cf8a0f [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
389090d [Kousuke Saruta] Replaced taskScheduler.applicationId() with getApplicationId in SparkContext#postApplicationStart
ff45c89 [Kousuke Saruta] Added some test cases to MetricsSystemSuite
69c46a6 [Kousuke Saruta] Added warning logging logic to MetricsSystem#buildRegistryName
5cca0d2 [Kousuke Saruta] Added Javadoc comment to SparkContext#getApplicationId
16a9f01 [Kousuke Saruta] Added data types to be returned to some methods
6434b06 [Kousuke Saruta] Reverted changes related to ApplicationId
0413b90 [Kousuke Saruta] Deleted ApplicationId.java and ApplicationIdSuite.java
a42300c [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
0fc1b09 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
42bea55 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
248935d [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
f6af132 [Kousuke Saruta] Modified SchedulerBackend and TaskScheduler to return System.currentTimeMillis as an unique Application Id
1b8b53e [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
97cb85c [Kousuke Saruta] Modified confliction of MimExcludes
2cdd009 [Kousuke Saruta] Modified defailt implementation of applicationId
9aadb0b [Kousuke Saruta] Modified NetworkReceiverSuite to ensure "executor.start()" is finished in test "network receiver life cycle"
3011efc [Kousuke Saruta] Added ApplicationIdSuite.scala
d009c55 [Kousuke Saruta] Modified ApplicationId#equals to compare appIds
dfc83fd [Kousuke Saruta] Modified ApplicationId to implement Serializable
9ff4851 [Kousuke Saruta] Modified MimaExcludes.scala to ignore createTaskScheduler method in SparkContext
4567ffc [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
6a91b14 [Kousuke Saruta] Modified SparkContextSchedulerCreationSuite, ExecutorRunnerTest and EventLoggingListenerSuite
0325caf [Kousuke Saruta] Added ApplicationId.scala
0a2fc14 [Kousuke Saruta] Modified style
eabda80 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
0f890e6 [Kousuke Saruta] Modified SparkDeploySchedulerBackend and Master to pass baseLogDir instead f eventLogDir
bcf25bf [Kousuke Saruta] Modified directory name for EventLogs
28d4d93 [Kousuke Saruta] Modified SparkContext and EventLoggingListener so that the directory for EventLogs is named same for Application ID
203634e [Kousuke Saruta] Modified comment in SchedulerBackend#applicationId and TaskScheduler#applicationId
424fea4 [Kousuke Saruta] Modified  the subclasses of TaskScheduler and SchedulerBackend so that they can return non-optional Unique Application ID
b311806 [Kousuke Saruta] Swapped last 2 arguments passed to CoarseGrainedExecutorBackend
8a2b6ec [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
086ee25 [Kousuke Saruta] Merge branch 'metrics-structure-improvement2' of github.com:sarutak/spark into metrics-structure-improvement2
e705386 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
36d2f7a [Kousuke Saruta] Added warning message for the situation we cannot get application id for the prefix for the name of metrics
eea6e19 [Kousuke Saruta] Modified CoarseGrainedMesosSchedulerBackend and MesosSchedulerBackend so that we can get Application ID
c229fbe [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
e719c39 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
4a93c7f [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
4776f9e [Kousuke Saruta] Modified MetricsSystemSuite.scala
efcb6e1 [Kousuke Saruta] Modified to add application id to metrics name
2ec848a [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement
3ea7896 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement
ead8966 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement
08e627e [Kousuke Saruta] Revert "tmp"
7b67f5a [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement
45bd33d [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement
93e263a [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement
848819c [Kousuke Saruta] Merge branch 'metrics-structure-improvement' of github.com:sarutak/spark into metrics-structure-improvement
912a637 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement
e4a4593 [Kousuke Saruta] tmp
3e098d8 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement
4603a39 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement
fa7175b [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement
15f88a3 [Kousuke Saruta] Modified MetricsSystem#buildRegistryName because conf.get does not return null when correspondin entry is absent
6f7dcd4 [Kousuke Saruta] Modified constructor of DAGSchedulerSource and BlockManagerSource because the instance of SparkContext is no longer used
6fc5560 [Kousuke Saruta] Modified sourceName of ExecutorSource, DAGSchedulerSource and BlockManagerSource
4e057c9 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement
85ffc02 [Kousuke Saruta] Revert "Modified sourceName of ExecutorSource, DAGSchedulerSource and BlockManagerSource"
868e326 [Kousuke Saruta] Modified MetricsSystem to set registry name with unique application-id and driver/executor-id
71609f5 [Kousuke Saruta] Modified sourceName of ExecutorSource, DAGSchedulerSource and BlockManagerSource
55debab [Kousuke Saruta] Modified SparkContext and Executor to set spark.executor.id to identifiers
4180993 [Kousuke Saruta] Modified SparkContext to retain spark.unique.app.name property in SparkConf
2014-10-03 13:48:56 -07:00
Marcelo Vanzin 30abef1547 [SPARK-3606] [yarn] Correctly configure AmIpFilter for Yarn HA.
The existing code only considered one of the RMs when running in
Yarn HA mode, so it was possible to get errors if the active RM
was not registered in the filter.

The change makes use of a new API added to Yarn that returns all
proxy addresses, and falls back to the old behavior if the API
is not present. While there, I also made a change to look for the
scheme (http or https) being used by Yarn when building the proxy
URIs.

Since, in the case of multiple RMs, Yarn uses commas as a separator,
it was not possible anymore to use spark.filter.params to propagate
this information (which used commas to delimit different config params).
Instead, I added a new param (spark.filter.jsonParams) which expects
a JSON string containing a map with the config data. I chose not to
add it to the documentation at this point since I don't believe users
will use it directly.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #2469 from vanzin/SPARK-3606 and squashes the following commits:

aeb458a [Marcelo Vanzin] Undelete needed import.
65e400d [Marcelo Vanzin] Remove unused import.
d121883 [Marcelo Vanzin] Use separate config for each param instead of json.
04bc156 [Marcelo Vanzin] Review feedback.
4d4d6b9 [Marcelo Vanzin] [SPARK-3606] [yarn] Correctly configure AmIpFilter for Yarn HA.
2014-10-03 13:18:35 -07:00
Brenden Matthews a8c52d5343 [SPARK-3535][Mesos] Fix resource handling.
Author: Brenden Matthews <brenden@diddyinc.com>

Closes #2401 from brndnmtthws/master and squashes the following commits:

4abaa5d [Brenden Matthews] [SPARK-3535][Mesos] Fix resource handling.
2014-10-03 12:58:04 -07:00
EugenCepoi f0811f928e SPARK-2058: Overriding SPARK_HOME/conf with SPARK_CONF_DIR
Update of PR #997.

With this PR, setting SPARK_CONF_DIR overrides SPARK_HOME/conf (not only spark-defaults.conf and spark-env).

Author: EugenCepoi <cepoi.eugen@gmail.com>

Closes #2481 from EugenCepoi/SPARK-2058 and squashes the following commits:

0bb32c2 [EugenCepoi] use orElse orNull and fixing trailing percent in compute-classpath.cmd
77f35d7 [EugenCepoi] SPARK-2058: Overriding SPARK_HOME/conf with SPARK_CONF_DIR
2014-10-03 10:03:15 -07:00
Eric Eijkelenboom 42d5077fd3 [DEPLOY] SPARK-3759: Return the exit code of the driver process
SparkSubmitDriverBootstrapper.scala now returns the exit code of the driver process, instead of always returning 0.

Author: Eric Eijkelenboom <ee@userreport.com>

Closes #2628 from ericeijkelenboom/master and squashes the following commits:

cc4a571 [Eric Eijkelenboom] Return the exit code of the driver process
2014-10-02 18:04:38 -07:00
scwf 8081ce8bd1 [SPARK-3755][Core] avoid trying privileged port when request a non-privileged port
pwendell, ```tryPort``` is not compatible with old code in last PR, this is to fix it.
And after discuss with srowen renamed the title to "avoid trying privileged port when request a non-privileged port". Plz refer to the discuss for detail.

Author: scwf <wangfei1@huawei.com>

Closes #2623 from scwf/1-1024 and squashes the following commits:

10a4437 [scwf] add comment
de3fd17 [scwf] do not try privileged port when request a non-privileged port
42cb0fa [scwf] make tryPort compatible with old code
cb8cc76 [scwf] do not use port 1 - 1024
2014-10-02 17:47:56 -07:00
Thomas Graves 127e97bee1 [SPARK-3632] ConnectionManager can run out of receive threads with authentication on
If you turn authentication on and you are using a lot of executors. There is a chance that all the of the threads in the handleMessageExecutor could be waiting to send a message because they are blocked waiting on authentication to happen. This can cause a temporary deadlock until the connection times out.

To fix it, I got rid of the wait/notify and use a single outbox but only send security messages from it until authentication has completed.

Author: Thomas Graves <tgraves@apache.org>

Closes #2484 from tgravescs/cm_threads_auth and squashes the following commits:

a0a961d [Thomas Graves] give it a type
b6bc80b [Thomas Graves] Rework comments
d6d4175 [Thomas Graves] update from comments
081b765 [Thomas Graves] cleanup
4d7f8f5 [Thomas Graves] Change to not use wait/notify while waiting for authentication
2014-10-02 13:52:54 -07:00
Tathagata Das 5db78e6b87 [SPARK-3495] Block replication fails continuously when the replication target node is dead AND [SPARK-3496] Block replication by mistake chooses driver as target
If a block manager (say, A) wants to replicate a block and the node chosen for replication (say, B) is dead, then the attempt to send the block to B fails. However, this continues to fail indefinitely. Even if the driver learns about the demise of the B, A continues to try replicating to B and failing miserably.

The reason behind this bug is that A initially fetches a list of peers from the driver (when B was active), but never updates it after B is dead. This affects Spark Streaming as its receiver uses block replication.

The solution in this patch adds the following.
- Changed BlockManagerMaster to return all the peers of a block manager, rather than the requested number. It also filters out driver BlockManager.
- Refactored BlockManager's replication code to handle peer caching correctly.
    + The peer for replication is randomly selected. This is different from past behavior where for a node A, a node B was deterministically chosen for the lifetime of the application.
    + If replication fails to one node, the peers are refetched.
    + The peer cached has a TTL of 1 second to enable discovery of new peers and using them for replication.
- Refactored use of \<driver\> in BlockManager into a new method `BlockManagerId.isDriver`
- Added replication unit tests (replication was not tested till now, duh!)

This should not make a difference in performance of Spark workloads where replication is not used.

@andrewor14 @JoshRosen

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #2366 from tdas/replication-fix and squashes the following commits:

9690f57 [Tathagata Das] Moved replication tests to a new BlockManagerReplicationSuite.
0661773 [Tathagata Das] Minor changes based on PR comments.
a55a65c [Tathagata Das] Added a unit test to test replication behavior.
012afa3 [Tathagata Das] Bug fix
89f91a0 [Tathagata Das] Minor change.
68e2c72 [Tathagata Das] Made replication peer selection logic more efficient.
08afaa9 [Tathagata Das] Made peer selection for replication deterministic to block id
3821ab9 [Tathagata Das] Fixes based on PR comments.
08e5646 [Tathagata Das] More minor changes.
d402506 [Tathagata Das] Fixed imports.
4a20531 [Tathagata Das] Filtered driver block manager from peer list, and also consolidated the use of <driver> in BlockManager.
7598f91 [Tathagata Das] Minor changes.
03de02d [Tathagata Das] Change replication logic to correctly refetch peers from master on failure and on new worker addition.
d081bf6 [Tathagata Das] Fixed bug in get peers and unit tests to test get-peers and replication under executor churn.
9f0ac9f [Tathagata Das] Modified replication tests to fail on replication bug.
af0c1da [Tathagata Das] Added replication unit tests to BlockManagerSuite
2014-10-02 13:49:47 -07:00
cocoatomo 5b4a5b1acd [SPARK-3706][PySpark] Cannot run IPython REPL with IPYTHON set to "1" and PYSPARK_PYTHON unset
### Problem

The section "Using the shell" in Spark Programming Guide (https://spark.apache.org/docs/latest/programming-guide.html#using-the-shell) says that we can run pyspark REPL through IPython.
But a folloing command does not run IPython but a default Python executable.

```
$ IPYTHON=1 ./bin/pyspark
Python 2.7.8 (default, Jul  2 2014, 10:14:46)
...
```

the spark/bin/pyspark script on the commit b235e01363 decides which executable and options it use folloing way.

1. if PYSPARK_PYTHON unset
   * → defaulting to "python"
2. if IPYTHON_OPTS set
   * → set IPYTHON "1"
3. some python scripts passed to ./bin/pyspak → run it with ./bin/spark-submit
   * out of this issues scope
4. if IPYTHON set as "1"
   * → execute $PYSPARK_PYTHON (default: ipython) with arguments $IPYTHON_OPTS
   * otherwise execute $PYSPARK_PYTHON

Therefore, when PYSPARK_PYTHON is unset, python is executed though IPYTHON is "1".
In other word, when PYSPARK_PYTHON is unset, IPYTHON_OPS and IPYTHON has no effect on decide which command to use.

PYSPARK_PYTHON | IPYTHON_OPTS | IPYTHON | resulting command | expected command
---- | ---- | ----- | ----- | -----
(unset → defaults to python) | (unset) | (unset) | python | (same)
(unset → defaults to python) | (unset) | 1 | python | ipython
(unset → defaults to python) | an_option | (unset → set to 1) | python an_option | ipython an_option
(unset → defaults to python) | an_option | 1 | python an_option | ipython an_option
ipython | (unset) | (unset) | ipython | (same)
ipython | (unset) | 1 | ipython | (same)
ipython | an_option | (unset → set to 1) | ipython an_option | (same)
ipython | an_option | 1 | ipython an_option | (same)

### Suggestion

The pyspark script should determine firstly whether a user wants to run IPython or other executables.

1. if IPYTHON_OPTS set
   * set IPYTHON "1"
2.  if IPYTHON has a value "1"
   * PYSPARK_PYTHON defaults to "ipython" if not set
3. PYSPARK_PYTHON defaults to "python" if not set

See the pull request for more detailed modification.

Author: cocoatomo <cocoatomo77@gmail.com>

Closes #2554 from cocoatomo/issues/cannot-run-ipython-without-options and squashes the following commits:

d2a9b06 [cocoatomo] [SPARK-3706][PySpark] Use PYTHONUNBUFFERED environment variable instead of -u option
264114c [cocoatomo] [SPARK-3706][PySpark] Remove the sentence about deprecated environment variables
42e02d5 [cocoatomo] [SPARK-3706][PySpark] Replace environment variables used to customize execution of PySpark REPL
10d56fb [cocoatomo] [SPARK-3706][PySpark] Cannot run IPython REPL with IPYTHON set to "1" and PYSPARK_PYTHON unset
2014-10-02 11:13:19 -07:00
Colin Patrick Mccabe 6e27cb630d SPARK-1767: Prefer HDFS-cached replicas when scheduling data-local tasks
This change reorders the replicas returned by
HadoopRDD#getPreferredLocations so that replicas cached by HDFS are at
the start of the list.  This requires Hadoop 2.5 or higher; previous
versions of Hadoop do not expose the information needed to determine
whether a replica is cached.

Author: Colin Patrick Mccabe <cmccabe@cloudera.com>

Closes #1486 from cmccabe/SPARK-1767 and squashes the following commits:

338d4f8 [Colin Patrick Mccabe] SPARK-1767: Prefer HDFS-cached replicas when scheduling data-local tasks
2014-10-02 00:29:31 -07:00
Marcelo Vanzin 29c3513203 [SPARK-3446] Expose underlying job ids in FutureAction.
FutureAction is the only type exposed through the async APIs, so
for job IDs to be useful they need to be exposed there. The complication
is that some async jobs run more than one job (e.g. takeAsync),
so the exposed ID has to actually be a list of IDs that can actually
change over time. So the interface doesn't look very nice, but...

Change is actually small, I just added a basic test to make sure
it works.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #2337 from vanzin/SPARK-3446 and squashes the following commits:

e166a68 [Marcelo Vanzin] Fix comment.
1fed2bc [Marcelo Vanzin] [SPARK-3446] Expose underlying job ids in FutureAction.
2014-10-01 19:24:22 -07:00
Patrick Wendell 4e79970d32 Revert "[SPARK-3755][Core] Do not bind port 1 - 1024 to server in spark"
This reverts commit 6390aae4ea.
2014-10-01 16:03:00 -07:00
WangTaoTheTonic d61f2c15bb [SPARK-3658][SQL] Start thrift server as a daemon
https://issues.apache.org/jira/browse/SPARK-3658

And keep the `CLASS_NOT_FOUND_EXIT_STATUS` and exit message in `SparkSubmit.scala`.

Author: WangTaoTheTonic <barneystinson@aliyun.com>
Author: WangTao <barneystinson@aliyun.com>

Closes #2509 from WangTaoTheTonic/thriftserver and squashes the following commits:

5dcaab2 [WangTaoTheTonic] issue about coupling
8ad9f95 [WangTaoTheTonic] generalization
598e21e [WangTao] take thrift server as a daemon
2014-10-01 15:15:24 -07:00
scwf 2fedb5dddc [SPARK-3756] [Core]check exception is caused by an address-port collision properly
Jetty server use MultiException to handle exceptions when start server
refer https://github.com/eclipse/jetty.project/blob/jetty-8.1.14.v20131031/jetty-server/src/main/java/org/eclipse/jetty/server/Server.java

So in ```isBindCollision``` add the logical to cover MultiException

Author: scwf <wangfei1@huawei.com>

Closes #2611 from scwf/fix-isBindCollision and squashes the following commits:

984cb12 [scwf] optimize the fix
3a6c849 [scwf] fix bug in isBindCollision
2014-10-01 11:51:30 -07:00
scwf 6390aae4ea [SPARK-3755][Core] Do not bind port 1 - 1024 to server in spark
Non-root user use port 1- 1024 to start jetty server will get the exception " java.net.SocketException: Permission denied", so not use these ports

Author: scwf <wangfei1@huawei.com>

Closes #2610 from scwf/1-1024 and squashes the following commits:

cb8cc76 [scwf] do not use port 1 - 1024
2014-10-01 11:30:29 -07:00
Davies Liu abf588f47a [SPARK-3749] [PySpark] fix bugs in broadcast large closure of RDD
1. broadcast is triggle unexpected
2. fd is leaked in JVM (also leak in parallelize())
3. broadcast is not unpersisted in JVM after RDD is not be used any more.

cc JoshRosen , sorry for these stupid bugs.

Author: Davies Liu <davies.liu@gmail.com>

Closes #2603 from davies/fix_broadcast and squashes the following commits:

080a743 [Davies Liu] fix bugs in broadcast large closure of RDD
2014-10-01 11:21:34 -07:00
Masayoshi TSUZUKI 0bfd3afb00 [SPARK-3757] mvn clean doesn't delete some files
Added directory to be deleted into maven-clean-plugin in pom.xml.

Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>

Closes #2613 from tsudukim/feature/SPARK-3757 and squashes the following commits:

8804bfc [Masayoshi TSUZUKI] Modified indent.
67c7171 [Masayoshi TSUZUKI] [SPARK-3757] mvn clean doesn't delete some files
2014-10-01 08:55:04 -07:00
Reynold Xin 3888ee2f38 [SPARK-3748] Log thread name in unit test logs
Thread names are useful for correlating failures.

Author: Reynold Xin <rxin@apache.org>

Closes #2600 from rxin/log4j and squashes the following commits:

83ffe88 [Reynold Xin] [SPARK-3748] Log thread name in unit test logs
2014-10-01 01:03:49 -07:00
Reynold Xin eb43043f41 [SPARK-3747] TaskResultGetter could incorrectly abort a stage if it cannot get result for a specific task
Author: Reynold Xin <rxin@apache.org>

Closes #2599 from rxin/SPARK-3747 and squashes the following commits:

a74c04d [Reynold Xin] Added a line of comment explaining NonFatal
0e8d44c [Reynold Xin] [SPARK-3747] TaskResultGetter could incorrectly abort a stage if it cannot get result for a specific task
2014-10-01 00:29:14 -07:00
Reynold Xin 6c696d7da6 Remove compiler warning from TaskContext change.
Author: Reynold Xin <rxin@apache.org>

Closes #2602 from rxin/warning and squashes the following commits:

130186b [Reynold Xin] Remove compiler warning from TaskContext change.
2014-09-30 15:55:04 -07:00
Sean Owen ab6dd80ba0 [SPARK-3356] [DOCS] Document when RDD elements' ordering within partitions is nondeterministic
As suggested by mateiz , and because it came up on the mailing list again last week, this attempts to document that ordering of elements is not guaranteed across RDD evaluations in groupBy, zip, and partition-wise RDD methods. Suggestions welcome about the wording, or other methods that need a note.

Author: Sean Owen <sowen@cloudera.com>

Closes #2508 from srowen/SPARK-3356 and squashes the following commits:

b7c96fd [Sean Owen] Undo change to programming guide
ad4aeec [Sean Owen] Don't mention ordering in partition-wise methods, reword description of ordering for zip methods per review, and add similar note to programming guide, which mentions groupByKey (but not zip methods)
fce943b [Sean Owen] Note that ordering of elements is not guaranteed across RDD evaluations in groupBy, zip, and partition-wise RDD methods
2014-09-30 11:15:38 -07:00
Josh Rosen b167a8c7e7 [SPARK-3734] DriverRunner should not read SPARK_HOME from submitter's environment
When using spark-submit in `cluster` mode to submit a job to a Spark Standalone
cluster, if the JAVA_HOME environment variable was set on the submitting
machine then DriverRunner would attempt to use the submitter's JAVA_HOME to
launch the driver process (instead of the worker's JAVA_HOME), causing the
driver to fail unless the submitter and worker had the same Java location.

This commit fixes this by reading JAVA_HOME from sys.env instead of
command.environment.

Author: Josh Rosen <joshrosen@apache.org>

Closes #2586 from JoshRosen/SPARK-3734 and squashes the following commits:

e9513d9 [Josh Rosen] [SPARK-3734] DriverRunner should not read SPARK_HOME from submitter's environment.
2014-09-29 23:36:10 -07:00
Reynold Xin de700d3177 [SPARK-3709] Executors don't always report broadcast block removal properly back to the driver
The problem was that the 2nd argument in RemoveBroadcast is not tellMaster! It is "removeFromDriver". Basically when removeFromDriver is not true, we don't report broadcast block removal back to the driver, and then other executors mistakenly think that the executor would still have the block, and try to fetch from it.

cc @tdas

Author: Reynold Xin <rxin@apache.org>

Closes #2588 from rxin/debug and squashes the following commits:

6dab2e3 [Reynold Xin] Don't log random messages.
f430686 [Reynold Xin] Always report broadcast removal back to master.
2a13f70 [Reynold Xin] iii
2014-09-29 23:17:53 -07:00
Reynold Xin 6b79bfb425 [SPARK-3613] Record only average block size in MapStatus for large stages
This changes the way we send MapStatus from executors back to driver for large stages (>2000 tasks). For large stages, we no longer send one byte per block. Instead, we just send the average block size.

This makes large jobs (tens of thousands of tasks) much more reliable since the driver no longer sends huge amount of data.

Author: Reynold Xin <rxin@apache.org>

Closes #2470 from rxin/mapstatus and squashes the following commits:

822ff54 [Reynold Xin] Code review feedback.
3b86f56 [Reynold Xin] Added MimaExclude.
f89d182 [Reynold Xin] Fixed a bug in MapStatus
6a0401c [Reynold Xin] [SPARK-3613] Record only average block size in MapStatus for large stages.
2014-09-29 22:56:22 -07:00
Reynold Xin 210404a561 Minor cleanup of code.
Author: Reynold Xin <rxin@apache.org>

Closes #2581 from rxin/minor-cleanup and squashes the following commits:

736a91b [Reynold Xin] Minor cleanup of code.
2014-09-29 21:53:21 -07:00
Reynold Xin e43c72fe04 Add more debug message for ManagedBuffer
This is to help debug the error reported at http://apache-spark-user-list.1001560.n3.nabble.com/SQL-queries-fail-in-1-2-0-SNAPSHOT-td15327.html

Author: Reynold Xin <rxin@apache.org>

Closes #2580 from rxin/buffer-debug and squashes the following commits:

5814292 [Reynold Xin] Logging close() in case close() fails.
323dfec [Reynold Xin] Add more debug message.
2014-09-29 12:38:24 -07:00
jerryshao dab1b0ae29 [SPARK-3032][Shuffle] Fix key comparison integer overflow introduced sorting exception
Previous key comparison in `ExternalSorter` will get wrong sorting result or exception when key comparison overflows, details can be seen in [SPARK-3032](https://issues.apache.org/jira/browse/SPARK-3032). Here fix this and add a unit test to prove it.

Author: jerryshao <saisai.shao@intel.com>

Closes #2514 from jerryshao/SPARK-3032 and squashes the following commits:

6f3c302 [jerryshao] Improve the unit test according to comments
01911e6 [jerryshao] Change the test to show the contract violate exception
83acb38 [jerryshao] Minor changes according to comments
fa2a08f [jerryshao] Fix key comparison integer overflow introduced sorting exception
2014-09-29 11:25:32 -07:00
Zhang, Liye 657bdff41a [CORE] Bugfix: LogErr format in DAGScheduler.scala
Author: Zhang, Liye <liye.zhang@intel.com>

Closes #2572 from liyezhang556520/DAGLogErr and squashes the following commits:

5be2491 [Zhang, Liye] Bugfix: LogErr format in DAGScheduler.scala
2014-09-29 01:13:15 -07:00
Reynold Xin f350cd3070 [SPARK-3543] TaskContext remaining cleanup work.
Author: Reynold Xin <rxin@apache.org>

Closes #2560 from rxin/TaskContext and squashes the following commits:

9eff95a [Reynold Xin] [SPARK-3543] remaining cleanup work.
2014-09-28 20:32:54 -07:00
Jim Lim 25164a89dd SPARK-2761 refactor #maybeSpill into Spillable
Moved `#maybeSpill` in ExternalSorter and EAOM into `Spillable`.

Author: Jim Lim <jim@quixey.com>

Closes #2416 from jimjh/SPARK-2761 and squashes the following commits:

cf8be9a [Jim Lim] SPARK-2761 fix documentation, reorder code
f94d522 [Jim Lim] SPARK-2761 refactor Spillable to simplify sig
e75a24e [Jim Lim] SPARK-2761 use protected over protected[this]
7270e0d [Jim Lim] SPARK-2761 refactor #maybeSpill into Spillable
2014-09-28 19:04:24 -07:00
Reynold Xin 8e874185ed Revert "[SPARK-1021] Defer the data-driven computation of partition bounds in so..."
This reverts commit 2d972fd84a.

The commit was hanging correlationoptimizer14.
2014-09-28 18:33:11 -07:00
Reynold Xin 66e1c40c67 Minor fix for the previous commit. 2014-09-27 22:18:02 -07:00
Dale 9966d1a8aa SPARK-CORE [SPARK-3651] Group common CoarseGrainedSchedulerBackend variables together
from [SPARK-3651]
In CoarseGrainedSchedulerBackend, we have:

    private val executorActor = new HashMap[String, ActorRef]
    private val executorAddress = new HashMap[String, Address]
    private val executorHost = new HashMap[String, String]
    private val freeCores = new HashMap[String, Int]
    private val totalCores = new HashMap[String, Int]

We only ever put / remove stuff from these maps together. It would simplify the code if we consolidate these all into one map as we have done in JobProgressListener in https://issues.apache.org/jira/browse/SPARK-2299.

Author: Dale <tigerquoll@outlook.com>

Closes #2533 from tigerquoll/SPARK-3651 and squashes the following commits:

d1be0a9 [Dale] [SPARK-3651]  implemented suggested changes. Changed a reference from executorInfo to executorData to be consistent with other usages
6890663 [Dale] [SPARK-3651]  implemented suggested changes
7d671cf [Dale] [SPARK-3651]  Grouped variables under a ExecutorDataObject, and reference them via a map entry as they are all retrieved under the same key
2014-09-27 22:08:10 -07:00
Reynold Xin 5b922bb458 [SPARK-3543] Clean up Java TaskContext implementation.
This addresses some minor issues in https://github.com/apache/spark/pull/2425

Author: Reynold Xin <rxin@apache.org>

Closes #2557 from rxin/TaskContext and squashes the following commits:

a51e5f6 [Reynold Xin] [SPARK-3543] Clean up Java TaskContext implementation.
2014-09-27 14:46:00 -07:00
Reynold Xin 436a7730b6 Minor cleanup to tighten visibility and remove compilation warning.
Author: Reynold Xin <rxin@apache.org>

Closes #2555 from rxin/cleanup and squashes the following commits:

6add199 [Reynold Xin] Minor cleanup to tighten visibility and remove compilation warning.
2014-09-27 00:57:26 -07:00
Erik Erlandson 2d972fd84a [SPARK-1021] Defer the data-driven computation of partition bounds in so...
...rtByKey() until evaluation.

Author: Erik Erlandson <eerlands@redhat.com>

Closes #1689 from erikerlandson/spark-1021-pr and squashes the following commits:

50b6da6 [Erik Erlandson] use standard getIteratorSize in countAsync
4e334a9 [Erik Erlandson] exception mystery fixed by fixing bug in ComplexFutureAction
b88b5d4 [Erik Erlandson] tweak async actions to use ComplexFutureAction[T] so they handle RangePartitioner sampling job properly
b2b20e8 [Erik Erlandson] Fix bug in exception passing with ComplexFutureAction[T]
ca8913e [Erik Erlandson] RangePartition sampling job -> FutureAction
7143f97 [Erik Erlandson] [SPARK-1021] modify range bounds variable to be thread safe
ac67195 [Erik Erlandson] [SPARK-1021] Defer the data-driven computation of partition bounds in sortByKey() until evaluation.
2014-09-26 23:15:10 -07:00
Prashant Sharma 5e34855cf0 [SPARK-3543] Write TaskContext in Java and expose it through a static accessor.
Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Shashank Sharma <shashank21j@gmail.com>

Closes #2425 from ScrapCodes/SPARK-3543/withTaskContext and squashes the following commits:

8ae414c [Shashank Sharma] CR
ee8bd00 [Prashant Sharma] Added internal API in docs comments.
ddb8cbe [Prashant Sharma] Moved setting the thread local to where TaskContext is instantiated.
a7d5e23 [Prashant Sharma] Added doc comments.
edf945e [Prashant Sharma] Code review git add -A
f716fd1 [Prashant Sharma] introduced thread local for getting the task context.
333c7d6 [Prashant Sharma] Translated Task context from scala to java.
2014-09-26 21:29:54 -07:00