Commit graph

5053 commits

Author SHA1 Message Date
Jeff Zhang ffed00493a [SPARK-11125] [SQL] Uninformative exception when running spark-sql without building with -Phive-thriftserver and SPARK_PREPEND_CLASSES is set

This is the exception after this patch. Please help review.
```
java.lang.NoClassDefFoundError: org/apache/hadoop/hive/cli/CliDriver
	at java.lang.ClassLoader.defineClass1(Native Method)
	at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
	at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
	at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:412)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:270)
	at org.apache.spark.util.Utils$.classForName(Utils.scala:173)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:647)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.cli.CliDriver
	at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
	... 21 more
Failed to load hive class.
You need to build Spark with -Phive and -Phive-thriftserver.
```

Author: Jeff Zhang <zjffdu@apache.org>

Closes #9134 from zjffdu/SPARK-11125.
2015-10-23 22:56:55 -07:00
Sun Rui 2462dbcce8 [SPARK-10971][SPARKR] RRunner should allow setting path to Rscript.
Add a new spark conf option "spark.sparkr.r.driver.command" to specify the executable for an R script in client modes.

The existing spark conf option "spark.sparkr.r.command" is used to specify the executable for an R script in cluster modes for both driver and workers. See also [launch R worker script](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/RRDD.scala#L395).

BTW, [environment variable "SPARKR_DRIVER_R"](https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L275) is used to locate R shell on the local host.

For your information, PySpark has two environment variables serving a similar purpose:
- PYSPARK_PYTHON: Python binary executable to use for PySpark in both driver and workers (default is `python`).
- PYSPARK_DRIVER_PYTHON: Python binary executable to use for PySpark in driver only (default is PYSPARK_PYTHON).

PySpark uses the code [here](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala#L41) to determine the python executable for a python script.

Author: Sun Rui <rui.sun@intel.com>

Closes #9179 from sun-rui/SPARK-10971.
2015-10-23 21:38:04 -07:00
Jacek Laskowski b1c1597e3c Fix a (very tiny) typo
Author: Jacek Laskowski <jacek.laskowski@deepsense.io>

Closes #9230 from jaceklaskowski/utils-seconds-typo.
2015-10-22 22:42:15 -07:00
Marcelo Vanzin fa6a4fbf08 [SPARK-11134][CORE] Increase LauncherBackendSuite timeout.
This test can take a little while to finish on slow / loaded machines.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #9235 from vanzin/SPARK-11134.
2015-10-22 22:41:21 -07:00
zsxwing a88c66ca87 [SPARK-11098][CORE] Add Outbox to cache the sending messages to resolve the message disorder issue
The current NettyRpc has a message order issue because it uses a thread pool to send messages. E.g., running the following two lines in the same thread,

```
ref.send("A")
ref.send("B")
```

The remote endpoint may see "B" before "A" because "A" and "B" are sent in parallel.
To resolve this issue, this PR adds an outbox for each connection. If a message is sent while we are still connecting to the remote node, it is cached in the outbox, and the cached messages are sent one by one once the connection is established.
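
The outbox idea can be sketched as follows. This is a minimal illustration in Java, not the code from this PR: the class `OutboxSketch`, its methods, and the in-memory `wire` list standing in for the remote endpoint are all hypothetical names introduced here.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch of a per-connection outbox; not Spark's implementation.
class OutboxSketch {
    private final Queue<String> pending = new ArrayDeque<>();
    // Stands in for the remote endpoint so the delivery order can be inspected.
    private final List<String> wire = new ArrayList<>();
    private boolean connected = false;

    // Every message is enqueued first and then drained in FIFO order,
    // so a later message can never overtake an earlier one.
    synchronized void send(String message) {
        pending.add(message);
        drain();
    }

    // Called once the connection to the remote node is established.
    synchronized void connectionEstablished() {
        connected = true;
        drain();
    }

    // Flushes cached messages one by one, but only when connected.
    private void drain() {
        if (!connected) {
            return;
        }
        String next;
        while ((next = pending.poll()) != null) {
            wire.add(next);
        }
    }

    synchronized List<String> delivered() {
        return new ArrayList<>(wire);
    }

    public static void main(String[] args) {
        OutboxSketch outbox = new OutboxSketch();
        outbox.send("A");
        outbox.send("B");
        outbox.connectionEstablished();
        outbox.send("C");
        System.out.println(outbox.delivered());
    }
}
```

Because sending and draining share one lock and one queue, messages sent from the same thread reach the wire in the order they were sent, whether or not the connection was ready at send time.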

Author: zsxwing <zsxwing@gmail.com>

Closes #9197 from zsxwing/rpc-outbox.
2015-10-22 21:01:01 -07:00
Andrew Or 34e71c6d89 [SPARK-11251] Fix page size calculation in local mode
```
// My machine only has 8 cores
$ bin/spark-shell --master local[32]
scala> val df = sc.parallelize(Seq((1, 1), (2, 2))).toDF("a", "b")
scala> df.as("x").join(df.as("y"), $"x.a" === $"y.a").count()

Caused by: java.io.IOException: Unable to acquire 2097152 bytes of memory
	at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:351)
```

Author: Andrew Or <andrew@databricks.com>

Closes #9209 from andrewor14/fix-local-page-size.
2015-10-22 15:58:08 -07:00
Kay Ousterhout 3535b91ddc [SPARK-11163] Remove unnecessary addPendingTask calls.
This commit removes unnecessary calls to addPendingTask in
TaskSetManager.executorLost. These calls are unnecessary: for
tasks that are still pending and haven't been launched, they're
still in all of the correct pending lists, so calling addPendingTask
has no effect. For tasks that are currently running (which may still be
in the pending lists, depending on how they were scheduled), we call
addPendingTask in handleFailedTask, so the calls at the beginning
of executorLost are redundant.

I think these calls are left over from when we re-computed the locality
levels in addPendingTask; now that we call recomputeLocality separately,
I don't think these are necessary.

Now that those calls are removed, the readding parameter in addPendingTask
is no longer necessary, so this commit also removes that parameter.

markhamstra can you take a look at this?

cc vanzin

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #9154 from kayousterhout/SPARK-11163.
2015-10-22 11:39:06 -07:00
zsxwing 7bb6d31cff [SPARK-11232][CORE] Use 'offer' instead of 'put' to make sure calling send won't be interrupted
The current `NettyRpcEndpointRef.send` can be interrupted because it uses `LinkedBlockingQueue.put`, which may hang the application.

Imagine the following execution order:

  | thread 1: TaskRunner.kill | thread 2: TaskRunner.run
------------- | ------------- | -------------
1 | killed = true |
2 |  | if (killed) {
3 |  | throw new TaskKilledException
4 |  | case _: TaskKilledException \| _: InterruptedException if task.killed =>
5 | task.kill(interruptThread): interruptThread is true |
6 | | execBackend.statusUpdate(taskId, TaskState.KILLED, ser.serialize(TaskKilled))
7 | | localEndpoint.send(StatusUpdate(taskId, state, serializedData)): in LocalBackend

Then `localEndpoint.send(StatusUpdate(taskId, state, serializedData))` will throw `InterruptedException`. This will prevent the executor from updating the task status and hang the application.

A failure caused by the above issue can be seen here: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44062/consoleFull

Since `receivers` is an unbounded `LinkedBlockingQueue`, we can just use `LinkedBlockingQueue.offer` to resolve this issue.
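
The difference between the two calls can be reproduced with plain `java.util.concurrent`. The sketch below is only an illustration (the class `OfferVsPut` and its method names are hypothetical): it sets the interrupt flag on the current thread, as `task.kill(interruptThread)` would, and then tries each call on an unbounded queue.

```java
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical demo class; only LinkedBlockingQueue is a real API here.
class OfferVsPut {
    // Returns true if put() throws InterruptedException when the calling
    // thread already has its interrupt flag set.
    static boolean putIsInterrupted(LinkedBlockingQueue<String> queue) {
        Thread.currentThread().interrupt();
        try {
            queue.put("status-update");
            return false;
        } catch (InterruptedException e) {
            return true; // the message was never enqueued
        } finally {
            Thread.interrupted(); // clear the flag so later code is unaffected
        }
    }

    // offer() without a timeout does not respond to interruption; on an
    // unbounded queue it always enqueues and returns true.
    static boolean offerSucceeds(LinkedBlockingQueue<String> queue) {
        Thread.currentThread().interrupt();
        try {
            return queue.offer("status-update");
        } finally {
            Thread.interrupted();
        }
    }

    public static void main(String[] args) {
        LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>();
        System.out.println("put interrupted: " + putIsInterrupted(queue));
        System.out.println("offer succeeded: " + offerSucceeds(queue));
    }
}
```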

Author: zsxwing <zsxwing@gmail.com>

Closes #9198 from zsxwing/dont-interrupt-send.
2015-10-22 11:31:47 -07:00
Josh Rosen f6d06adf05 [SPARK-10708] Consolidate sort shuffle implementations
There's a lot of duplication between SortShuffleManager and UnsafeShuffleManager. Given that these now provide the same set of functionality, now that UnsafeShuffleManager supports large records, I think that we should replace SortShuffleManager's serialized shuffle implementation with UnsafeShuffleManager's and should merge the two managers together.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #8829 from JoshRosen/consolidate-sort-shuffle-implementations.
2015-10-22 09:46:30 -07:00
zhichao.li c03b6d1158 [SPARK-11121][CORE] Correct the TaskLocation type
Correct the logic to return an `HDFSCacheTaskLocation` instance when the input `str` is an in-memory location.

Author: zhichao.li <zhichao.li@intel.com>

Closes #9096 from zhichao-li/uselessBranch.
2015-10-22 03:59:26 -07:00
Reynold Xin 555b2086a1 Minor cleanup of ShuffleMapStage.outputLocs code.
I was looking at this code and found the documentation to be insufficient. I added more documentation, and refactored some relevant code paths slightly to improve encapsulation. There is more that I want to do, but I want to get these changes in before doing more work.

My goal is to reduce exposing internal fields directly in ShuffleMapStage to improve encapsulation. After this change, DAGScheduler no longer directly writes outputLocs. There are still 3 places that read outputLocs directly, but we can change those later.

Author: Reynold Xin <rxin@databricks.com>

Closes #9175 from rxin/stage-cleanup.
2015-10-21 15:33:13 -07:00
Holden Karau e18b571c33 [SPARK-10447][SPARK-3842][PYSPARK] upgrade pyspark to py4j0.9
Upgrade to Py4j0.9

Author: Holden Karau <holden@pigscanfly.ca>
Author: Holden Karau <holden@us.ibm.com>

Closes #8615 from holdenk/SPARK-10447-upgrade-pyspark-to-py4j0.9.
2015-10-20 10:52:49 -07:00
Jakob Odersky 60851bc7bf [SPARK-11110][BUILD] Remove transient annotation for parameters.
`transient` annotations on class parameters (not case class parameters or vals) cause compilation errors with Scala 2.11.
I understand that transient *parameters* make no sense; however, I don't quite understand why the 2.10 compiler accepted them.

Note: in case it is preferred to keep the annotations in case someone would in the future want to redefine them as vals, it would also be possible to just add `val` after the annotation, e.g. `class Foo(transient x: Int)` becomes `class Foo(transient private val x: Int)`.

I chose to remove the annotation as it also reduces needless clutter; however, please feel free to tell me if you prefer the second option and I'll update the PR.

Author: Jakob Odersky <jodersky@gmail.com>

Closes #9126 from jodersky/sbt-scala-2.11.
2015-10-20 08:54:34 +01:00
Jean-Baptiste Onofré 8f74aa6397 [SPARK-10876] Display total uptime for completed applications
Author: Jean-Baptiste Onofré <jbonofre@apache.org>

Closes #9059 from jbonofre/SPARK-10876.
2015-10-20 08:45:39 +01:00
Ryan Williams 16906ef23a [SPARK-11120] Allow sane default number of executor failures when dynamically allocating in YARN
I also added some information to container-failure error msgs about what host they failed on, which would have helped me identify the problem that led me to this JIRA and PR sooner.

Author: Ryan Williams <ryan.blake.williams@gmail.com>

Closes #9147 from ryan-williams/dyn-exec-failures.
2015-10-19 16:34:15 -07:00
Chris Bannister fc26f32cf1 [SPARK-9708][MESOS] Spark should create local temporary directories in Mesos sandbox when launched with Mesos
This is my own original work and I license this to the project under the project's open source license

Author: Chris Bannister <chris.bannister@swiftkey.com>
Author: Chris Bannister <chris.bannister@swiftkey.net>

Closes #8358 from Zariel/mesos-local-dir.
2015-10-19 16:24:40 -07:00
Liang-Chi Hsieh a1413b3662 [SPARK-11051][CORE] Do not allow local checkpointing after the RDD is materialized and checkpointed
JIRA: https://issues.apache.org/jira/browse/SPARK-11051

When an `RDD` is materialized and checkpointed, its partitions and dependencies are cleared. If we allow local checkpointing on it and assign `LocalRDDCheckpointData` to its `checkpointData`, then the next time the RDD is materialized an error will be thrown.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #9072 from viirya/no-localcheckpoint-after-checkpoint.
2015-10-19 16:16:31 -07:00
Marcelo Vanzin 7ab0ce6501 [SPARK-11131][CORE] Fix race in worker registration protocol.
Because the registration RPC was not really an RPC, but a bunch of
disconnected messages, it was possible for other messages to be
sent before the reply to the registration arrived, and that would
confuse the Worker. Especially in local-cluster mode, the worker was
susceptible to receiving an executor request before it received a
message from the master saying registration succeeded.

On top of the above, the change also fixes a ClassCastException when
the registration fails, which also affects the executor registration
protocol. Because the `ask` is issued with a specific return type,
if the error message (of a different type) was returned instead, the
code would just die with an exception. This is fixed by having a common
base trait for these reply messages.
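
The common-base-type approach can be sketched like this. The example is in Java rather than Scala, and every type name in it (`RegistrationReply`, `RegistrationSucceeded`, `RegistrationFailed`, `ReplyHandling`) is hypothetical, chosen only to illustrate the pattern described above.

```java
// Hypothetical message types; the names are not taken from Spark.
interface RegistrationReply {}

final class RegistrationSucceeded implements RegistrationReply {
    final String masterUrl;

    RegistrationSucceeded(String masterUrl) {
        this.masterUrl = masterUrl;
    }
}

final class RegistrationFailed implements RegistrationReply {
    final String reason;

    RegistrationFailed(String reason) {
        this.reason = reason;
    }
}

class ReplyHandling {
    // The caller asks for the base type, so either reply can be returned
    // without a cast failing; the concrete case is selected afterwards.
    static String describe(RegistrationReply reply) {
        if (reply instanceof RegistrationSucceeded) {
            return "registered with " + ((RegistrationSucceeded) reply).masterUrl;
        } else if (reply instanceof RegistrationFailed) {
            return "registration failed: " + ((RegistrationFailed) reply).reason;
        }
        throw new IllegalArgumentException("unexpected reply: " + reply);
    }

    public static void main(String[] args) {
        System.out.println(describe(new RegistrationSucceeded("spark://host:7077")));
        System.out.println(describe(new RegistrationFailed("duplicate worker id")));
    }
}
```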

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #9138 from vanzin/SPARK-11131.
2015-10-19 16:14:50 -07:00
Jacek Laskowski bd64c2d550 [SPARK-10921][YARN] Completely remove the use of SparkContext.preferredNodeLocationData

Author: Jacek Laskowski <jacek.laskowski@deepsense.io>

Closes #8976 from jaceklaskowski/SPARK-10921.
2015-10-19 09:59:18 +01:00
ph 022a8f6a1f [SPARK-11129] [MESOS] Link Spark WebUI from Mesos WebUI
Mesos has a feature for linking to frameworks running on top of Mesos
from the Mesos WebUI. This commit enables Spark to make use of this
feature so one can directly visit the running Spark WebUIs from the
Mesos WebUI.

Author: ph <ph@plista.com>

Closes #9135 from philipphoffmann/SPARK-11129.
2015-10-17 15:37:51 -07:00
Reynold Xin 2549374206 [SPARK-11165] Logging trait should be private - not DeveloperApi.
Its classdoc actually says: "NOTE: DO NOT USE this class outside of Spark. It is intended as an internal utility."

Author: Reynold Xin <rxin@databricks.com>

Closes #9155 from rxin/private-logging-trait.
2015-10-17 12:41:42 -07:00
gweidner ac09a3a465 [SPARK-11109] [CORE] Move FsHistoryProvider off deprecated AccessControlException
Switched from deprecated org.apache.hadoop.fs.permission.AccessControlException to org.apache.hadoop.security.AccessControlException.

Author: gweidner <gweidner@us.ibm.com>

Closes #9144 from gweidner/SPARK-11109.
2015-10-16 14:02:12 -07:00
navis.ryu b9c5e5d4ac [SPARK-11124] JsonParser/Generator should be closed for resource recycle
Some JSON parsers are not closed, for example the parser in JacksonParser#parseJson.

Author: navis.ryu <navis@apache.org>

Closes #9130 from navis/SPARK-11124.
2015-10-16 11:19:37 -07:00
Andrew Or 3b364ff0a4 [SPARK-11078] Ensure spilling tests actually spill
#9084 uncovered that many tests that test spilling don't actually spill. This is a follow-up patch to fix that to ensure our unit tests actually catch potential bugs in spilling. The size of this patch is inflated by the refactoring of `ExternalSorterSuite`, which had a lot of duplicate code and logic.

Author: Andrew Or <andrew@databricks.com>

Closes #9124 from andrewor14/spilling-tests.
2015-10-15 14:50:01 -07:00
KaiXinXiaoLei 2d000124b7 [SPARK-10515] When killing executor, the pending replacement executors should not be lost
If the heartbeat receiver kills executors (and new ones are not registered to replace them), the idle timeout for the old executors will be lost (which then changes the total number of executors requested by the driver), so new ones will not be requested to replace them.
For example, executorsPendingToRemove=Set(1), and executor 2 hits its idle timeout before a new executor is requested to replace executor 1. The driver then kills executor 2 and sends RequestExecutors to the AM. But executorsPendingToRemove=Set(1,2), so the AM doesn't allocate an executor to replace 1.

see: https://github.com/apache/spark/pull/8668

Author: KaiXinXiaoLei <huleilei1@huawei.com>
Author: huleilei <huleilei1@huawei.com>

Closes #8945 from KaiXinXiaoLei/pendingexecutor.
2015-10-15 14:48:01 -07:00
Carson Wang d45a0d3ca2 [SPARK-11047] Internal accumulators miss the internal flag when replaying events in the history server
Internal accumulators don't write the internal flag to the event log, so on the history server Web UI no accumulator is treated as internal. This causes an incorrect peak execution memory and an unwanted accumulator table to be displayed on the stage page.
To fix it, I add the "internal" property of AccumulableInfo when writing the event log.

Author: Carson Wang <carson.wang@intel.com>

Closes #9061 from carsonwang/accumulableBug.
2015-10-15 10:36:54 -07:00
shellberg 523adc24a6 [SPARK-11066] Update DAGScheduler's "misbehaved ResultHandler"
Restrict the job to a single task so that the exception asserted as the cause of the job failure is the deliberately thrown DAGSchedulerSuiteDummyException, not an UnsupportedOperationException from a second or subsequent task that can propagate because of a race condition during execution.

Author: shellberg <sah@zepler.org>

Closes #9076 from shellberg/shellberg-DAGSchedulerSuite-misbehavedResultHandlerTest-patch-1.
2015-10-15 18:07:10 +01:00
Adam Lewandowski 0f62c2282b [SPARK-11093] [CORE] ChildFirstURLClassLoader#getResources should return all found resources, not just those in the child classloader
Author: Adam Lewandowski <alewandowski@ipcoop.com>

Closes #9106 from alewando/childFirstFix.
2015-10-15 09:45:54 -07:00
Reynold Xin cf2e0ae720 [SPARK-11096] Post-hoc review Netty based RPC implementation - round 2
A few more changes:

1. Renamed IDVerifier -> RpcEndpointVerifier
2. Renamed NettyRpcAddress -> RpcEndpointAddress
3. Simplified NettyRpcHandler a bit by removing the connection count tracking. This is OK because I now force spark.shuffle.io.numConnectionsPerPeer to 1
4. Reduced spark.rpc.connect.threads to 64. It would be great to eventually remove this extra thread pool.
5. Minor cleanup & documentation.

Author: Reynold Xin <rxin@databricks.com>

Closes #9112 from rxin/SPARK-11096.
2015-10-14 12:41:02 -07:00
Tom Graves 135a2ce5b0 [SPARK-10619] Can't sort columns on Executor Page
This should also be picked into Spark 1.5.2.

https://issues.apache.org/jira/browse/SPARK-10619

looks like this was broken by commit: fb1d06fc24 (diff-b8adb646ef90f616c34eb5c98d1ebd16)
It looks like some things were changed to use UIUtils.listingTable but the executor page wasn't converted, so when that commit removed sortable from UIUtils.TABLE_CLASS_NOT_STRIPED it broke this page.

Simply add the sortable tag back in and it fixes both active UI and the history server UI.

Author: Tom Graves <tgraves@yahoo-inc.com>

Closes #9101 from tgravescs/SPARK-10619.
2015-10-14 10:12:25 -07:00
Andrew Or b3ffac5178 [SPARK-10983] Unified memory manager
This patch unifies the memory management of the storage and execution regions such that either side can borrow memory from each other. When memory pressure arises, storage will be evicted in favor of execution. To avoid regressions in cases where storage is crucial, we dynamically allocate a fraction of space for storage that execution cannot evict. Several configurations are introduced:

- **spark.memory.fraction (default 0.75)**: fraction of the heap space used for execution and storage. The lower this is, the more frequently spills and cached data eviction occur. The purpose of this config is to set aside memory for internal metadata, user data structures, and imprecise size estimation in the case of sparse, unusually large records.

- **spark.memory.storageFraction (default 0.5)**: size of the storage region within the space set aside by `spark.memory.fraction`. Cached data may only be evicted if total storage exceeds this region.

- **spark.memory.useLegacyMode (default false)**: whether to use the memory management that existed in Spark 1.5 and before. This is mainly for backward compatibility.

For a detailed description of the design, see [SPARK-10000](https://issues.apache.org/jira/browse/SPARK-10000). This patch builds on top of the `MemoryManager` interface introduced in #9000.
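
As a worked example of how the two fractions compose, the sketch below takes the descriptions above literally: the unified region is the heap size times `spark.memory.fraction`, and the protected storage region is that result times `spark.memory.storageFraction`. The 4 GiB heap and the class and method names are assumptions made for illustration; any additional reserved memory the implementation may subtract is ignored.

```java
// Hypothetical helper; only the two default fractions come from the text above.
class MemoryRegions {
    // Space shared by execution and storage: heap * spark.memory.fraction.
    static long unifiedBytes(long heapBytes, double memoryFraction) {
        return (long) (heapBytes * memoryFraction);
    }

    // Storage that execution cannot evict:
    // unified region * spark.memory.storageFraction.
    static long protectedStorageBytes(long heapBytes, double memoryFraction, double storageFraction) {
        return (long) (unifiedBytes(heapBytes, memoryFraction) * storageFraction);
    }

    public static void main(String[] args) {
        long heap = 4L * 1024 * 1024 * 1024; // assumed 4 GiB heap
        System.out.println("unified region:    " + unifiedBytes(heap, 0.75));
        System.out.println("protected storage: " + protectedStorageBytes(heap, 0.75, 0.5));
    }
}
```

With these assumed inputs, 3 GiB is shared between execution and storage, and cached data can only be evicted by execution once storage holds more than 1.5 GiB.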

Author: Andrew Or <andrew@databricks.com>

Closes #9084 from andrewor14/unified-memory-manager.
2015-10-13 13:49:59 -07:00
Sun Rui 5e3868ba13 [SPARK-10051] [SPARKR] Support collecting data of StructType in DataFrame
Two points in this PR:

1.    The original thought was that a named R list would be assumed to be a struct in SerDe. But this is problematic because some R functions implicitly generate named lists that are not intended to be a struct when transferred by SerDe. So SerDe clients have to explicitly mark a named list as a struct by changing its class from "list" to "struct".

2.    SerDe is in the Spark Core module, and data of StructType is represented as GenericRow, which is defined in the Spark SQL module. SerDe can't import GenericRow because, in the Maven build, the Spark SQL module depends on the Spark Core module. So this PR adds a registration hook in SerDe to allow SQLUtils in the Spark SQL module to register its functions for serialization and deserialization of StructType.

Author: Sun Rui <rui.sun@intel.com>

Closes #8794 from sun-rui/SPARK-10051.
2015-10-13 10:02:21 -07:00
Reynold Xin 1797055dbf [SPARK-11079] Post-hoc review Netty-based RPC - round 1
I'm going through the implementation right now for post-hoc review. Adding more comments and renaming things as I go through them.

I also want to write higher level documentation about how the whole thing works -- but those will come in other pull requests.

Author: Reynold Xin <rxin@databricks.com>

Closes #9091 from rxin/rpc-review.
2015-10-13 09:51:20 -07:00
Tom Graves 63c340a710 [SPARK-10858] YARN: archives/jar/files rename with # doesn't work unl
https://issues.apache.org/jira/browse/SPARK-10858

The issue here is that in resolveURI we default to calling new File(path).getAbsoluteFile().toURI(). But if the path passed in already has a # in it, then File(path) will treat it as part of the actual file path and not as a fragment, so it changes # to %23. Then when we try to parse that later in Client as a URI, it doesn't recognize that there is a fragment.

To fix this, we check whether there is a fragment, still create the File as before, and then add the fragment back on.
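
The behaviour and the shape of the fix can be reproduced with the JDK alone. This is a simplified sketch, not the patched `resolveURI`: the class `FragmentAwareResolve` is hypothetical, and it assumes the first `#` in the path always starts the fragment.

```java
import java.io.File;
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical sketch; only File and URI are real APIs here.
class FragmentAwareResolve {
    // The original behaviour: '#' is treated as part of the file name
    // and is percent-encoded as %23 in the resulting URI.
    static URI naive(String path) {
        return new File(path).getAbsoluteFile().toURI();
    }

    // Split off the fragment, build the file URI as before,
    // then attach the fragment again.
    static URI resolve(String path) {
        int hash = path.indexOf('#');
        if (hash < 0) {
            return naive(path);
        }
        URI fileUri = naive(path.substring(0, hash));
        String fragment = path.substring(hash + 1);
        try {
            return new URI(fileUri.getScheme(), fileUri.getHost(), fileUri.getPath(), fragment);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(naive("a.jar#alias"));   // contains %23, no fragment
        System.out.println(resolve("a.jar#alias")); // ends with a.jar#alias
    }
}
```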

Author: Tom Graves <tgraves@yahoo-inc.com>

Closes #9035 from tgravescs/SPARK-10858.
2015-10-09 14:06:25 -07:00
Marcelo Vanzin 015f7ef503 [SPARK-8673] [LAUNCHER] API and infrastructure for communicating with child apps.
This change adds an API that encapsulates information about an app
launched using the library. It also creates a socket-based communication
layer for apps that are launched as child processes; the launching
application listens for connections from launched apps, and once
communication is established, the channel can be used to send updates
to the launching app, or to send commands to the child app.

The change also includes hooks for local, standalone/client and yarn
masters.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #7052 from vanzin/SPARK-8673.
2015-10-09 15:28:09 -05:00
Andrew Or 67fbecbf32 [SPARK-10956] Common MemoryManager interface for storage and execution
This patch introduces a `MemoryManager` that is the central arbiter of how much memory to grant to storage and execution. This patch is primarily concerned only with refactoring while preserving the existing behavior as much as possible.

This is the first step away from the existing rigid separation of storage and execution memory, which has several major drawbacks discussed on the [issue](https://issues.apache.org/jira/browse/SPARK-10956). It is the precursor of a series of patches that will attempt to address those drawbacks.

Author: Andrew Or <andrew@databricks.com>
Author: Josh Rosen <joshrosen@databricks.com>
Author: andrewor14 <andrew@databricks.com>

Closes #9000 from andrewor14/memory-manager.
2015-10-08 21:44:59 -07:00
Marcelo Vanzin 56a9692fc0 [SPARK-10987] [YARN] Workaround for missing netty rpc disconnection event.
In YARN client mode, when the AM connects to the driver, it may be the case
that the driver never needs to send a message back to the AM (i.e., no
dynamic allocation or preemption). This triggers an issue in the netty rpc
backend where no disconnection event is sent to endpoints, and the AM never
exits after the driver shuts down.

The real fix is too complicated, so this is a quick hack to unblock YARN
client mode until we can work on the real fix. It forces the driver to
send a message to the AM when the AM registers, thus establishing that
connection and enabling the disconnection event when the driver goes
away.

Also, a minor side issue: when the executor is shutting down, it needs
to send an "ack" back to the driver when using the netty rpc backend; but
that "ack" wasn't being sent because the handler was shutting down the rpc
env before returning. So added a change to delay the shutdown a little bit,
allowing the ack to be sent back.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #9021 from vanzin/SPARK-10987.
2015-10-08 09:47:58 -07:00
Marcelo Vanzin 6ca27f8550 [SPARK-10964] [YARN] Correctly register the AM with the driver.
The `self` method returns null when called from the constructor;
instead, registration should happen in the `onStart` method, at
which point the `self` reference has already been initialized.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #9005 from vanzin/SPARK-10964.
2015-10-07 11:38:47 -07:00
Marcelo Vanzin 744f03e700 [SPARK-10916] [YARN] Set perm gen size when launching containers on YARN.
This makes YARN containers behave like all other processes launched by
Spark, which launch with a default perm gen size of 256m unless
overridden by the user (or not needed by the vm).

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #8970 from vanzin/SPARK-10916.
2015-10-06 10:17:12 -07:00
Reynold Xin ae6570ec2b Remove TODO in ShuffleMemoryManager.
2015-10-03 18:08:25 -07:00
zsxwing 107320c9bb [SPARK-6028] [CORE] Remerge #6457: new RPC implementation and also pick #8905
This PR just reverted 02144d6745 to remerge #6457 and also included the commits in #8905.

Author: zsxwing <zsxwing@gmail.com>

Closes #8944 from zsxwing/SPARK-6028.
2015-10-03 01:04:35 -07:00
Joshi f85aa06464 [SPARK-10317] [CORE] Compatibility between history server script and functionality
The history server has its argument parsing class in HistoryServerArguments. However, this doesn't get involved in the start-history-server.sh codepath, where the $0 arg is assigned to spark.history.fs.logDirectory and all other arguments are discarded (e.g. --property-file).
This stops the other options from being usable from this script.

Author: Joshi <rekhajoshm@gmail.com>
Author: Rekha Joshi <rekhajoshm@gmail.com>

Closes #8758 from rekhajoshm/SPARK-10317.
2015-10-02 15:26:11 -07:00
Takeshi YAMAMURO 2272962eb0 [SPARK-9867] [SQL] Move utilities for binary data into ByteArray
The utilities for binary data, such as Substring#substringBinarySQL and BinaryPrefixComparator#computePrefix, are put together in ByteArray for readability.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #8122 from maropu/CleanUpForBinaryType.
2015-10-01 21:33:27 -04:00
zsxwing 9b3e7768a2 [SPARK-10058] [CORE] [TESTS] Fix the flaky tests in HeartbeatReceiverSuite
Fixed the test failure here: https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-1.5-SBT/116/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/testReport/junit/org.apache.spark/HeartbeatReceiverSuite/normal_heartbeat/

This failure is because `HeartbeatReceiverSuite.heartbeatReceiver` may receive `SparkListenerExecutorAdded("driver")` sent from [LocalBackend](8fb3a65cbb/core/src/main/scala/org/apache/spark/scheduler/local/LocalBackend.scala (L121)).

There are other race conditions in `HeartbeatReceiverSuite` because `HeartbeatReceiver.onExecutorAdded` and `HeartbeatReceiver.onExecutorRemoved` are asynchronous. This PR also fixed them.

Author: zsxwing <zsxwing@gmail.com>

Closes #8946 from zsxwing/SPARK-10058.
2015-10-01 07:09:31 -07:00
Sun Rui c7b29ae641 [SPARK-10851] [SPARKR] Exception not failing R applications (in yarn cluster mode)
The YARN backend doesn't like when user code calls System.exit, since it cannot know the exit status and thus cannot set an appropriate final status for the application.

This PR removes the use of System.exit to exit the RRunner. Instead, when the R process running a SparkR script returns an exit code other than 0, it throws SparkUserAppException, which is caught by ApplicationMaster so that ApplicationMaster knows the application failed. For other failures, it throws SparkException.

Author: Sun Rui <rui.sun@intel.com>

Closes #8938 from sun-rui/SPARK-10851.
2015-09-30 11:03:08 -07:00
zsxwing dba95ea032 [SPARK-10825] [CORE] [TESTS] Fix race conditions in StandaloneDynamicAllocationSuite
Fix the following issues in StandaloneDynamicAllocationSuite:

1. It should not assume master and workers start in order
2. It should not assume master and workers get ready at once
3. It should not assume the application is already registered with master after creating SparkContext
4. It should not access Master.app and idToApp which are not thread safe

The changes include:
* Use `eventually` to wait until master and workers are ready to fix 1 and 2
* Use `eventually` to wait until the application is registered with master to fix 3
* Use `askWithRetry[MasterStateResponse](RequestMasterState)` to get the application info to fix 4
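
For readers unfamiliar with the `eventually` pattern, the idea is to poll a condition until it holds or a timeout expires, instead of assuming the state is ready immediately. The Java sketch below is a hypothetical analog written for illustration (the class `Eventually` and its parameters are assumptions), not the ScalaTest utility the suite uses.

```java
import java.util.function.BooleanSupplier;

// Hypothetical polling helper illustrating the "eventually" pattern.
class Eventually {
    // Re-checks the condition every intervalMillis until it returns true,
    // and fails with an AssertionError once timeoutMillis has elapsed.
    static void eventually(long timeoutMillis, long intervalMillis, BooleanSupplier condition) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                throw new AssertionError("condition not met within " + timeoutMillis + " ms");
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new AssertionError("interrupted while waiting", e);
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Stand-in for "master and workers are ready": true after about 50 ms.
        eventually(1000, 10, () -> System.currentTimeMillis() - start >= 50);
        System.out.println("condition satisfied");
    }
}
```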

Author: zsxwing <zsxwing@gmail.com>

Closes #8914 from zsxwing/fix-StandaloneDynamicAllocationSuite.
2015-09-29 11:53:28 -07:00
Sean Owen bf4199e261 [SPARK-10833] [BUILD] Inline, organize BSD/MIT licenses in LICENSE
In the course of https://issues.apache.org/jira/browse/LEGAL-226 it came to light that the guidance at http://www.apache.org/dev/licensing-howto.html#permissive-deps means that permissively-licensed dependencies have a different interpretation than we (er, I) had been operating under. "pointer ... to the license within the source tree" specifically means a copy of the license within Spark's distribution, whereas at the moment, Spark's LICENSE has a pointer to the project's license in the other project's source tree.

The remedy is simply to inline all such license references (i.e. BSD/MIT licenses) or include their text in "licenses" subdirectory and point to that.

Along the way, we can also treat other BSD/MIT licenses, whose text has been inlined into LICENSE, in the same way.

The LICENSE file can continue to provide a helpful list of BSD/MIT licensed projects and a pointer to their sites. This would be over and above including license text in the distro, which is the essential thing.

Author: Sean Owen <sowen@cloudera.com>

Closes #8919 from srowen/SPARK-10833.
2015-09-28 22:56:43 -04:00
Holden Karau d8d50ed388 [SPARK-10812] [YARN] Spark hadoop util support switching to yarn
While this is likely not a huge issue for real production systems, it can be an issue for test systems that set up a Spark Context, tear it down, and then stand up a Spark Context with a different master (e.g. some local mode and some yarn mode tests). Discovered during work on spark-testing-base on Spark 1.4.1, but it seems the logic that triggers it is present in master (see the SparkHadoopUtil object). A valid workaround for users encountering this issue is to fork a different JVM; however, this can be heavyweight.

```
[info] SampleMiniClusterTest:
[info] Exception encountered when attempting to run a suite with class name: com.holdenkarau.spark.testing.SampleMiniClusterTest *** ABORTED ***
[info] java.lang.ClassCastException: org.apache.spark.deploy.SparkHadoopUtil cannot be cast to org.apache.spark.deploy.yarn.YarnSparkHadoopUtil
[info] at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.get(YarnSparkHadoopUtil.scala:163)
[info] at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:257)
[info] at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:561)
[info] at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:115)
[info] at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:57)
[info] at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141)
[info] at org.apache.spark.SparkContext.<init>(SparkContext.scala:497)
[info] at com.holdenkarau.spark.testing.SharedMiniCluster$class.setup(SharedMiniCluster.scala:186)
[info] at com.holdenkarau.spark.testing.SampleMiniClusterTest.setup(SampleMiniClusterTest.scala:26)
[info] at com.holdenkarau.spark.testing.SharedMiniCluster$class.beforeAll(SharedMiniCluster.scala:103)
```
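The failure mode in the trace above can be sketched without Spark. This is a minimal illustration, assuming a helper that is chosen once from a mode flag and then cached; `CachedHelperSketch`, `HadoopUtil`, `YarnHadoopUtil` and `yarnMode` are hypothetical stand-ins, not Spark's actual classes or the way this patch fixes it.

```scala
// Hypothetical stand-ins for SparkHadoopUtil / YarnSparkHadoopUtil; not Spark's actual code.
object CachedHelperSketch {
  class HadoopUtil
  class YarnHadoopUtil extends HadoopUtil

  // Mode flag that a test harness might flip between contexts.
  @volatile var yarnMode: Boolean = false

  // Chosen once, on first access, and then cached for the life of the JVM.
  lazy val cached: HadoopUtil =
    if (yarnMode) new YarnHadoopUtil else new HadoopUtil

  // Re-evaluates the mode on every call, so switching modes is safe.
  def current: HadoopUtil =
    if (yarnMode) new YarnHadoopUtil else new HadoopUtil

  // Mirrors the kind of downcast that fails in the stack trace.
  def asYarn(util: HadoopUtil): YarnHadoopUtil = util.asInstanceOf[YarnHadoopUtil]
}
```

If `cached` is first touched in local mode, a later yarn-mode caller that downcasts it gets a ClassCastException, while `current` returns the right subtype.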

Author: Holden Karau <holden@pigscanfly.ca>

Closes #8911 from holdenk/SPARK-10812-spark-hadoop-util-support-switching-to-yarn.
2015-09-28 06:33:45 -07:00
Matei Zaharia 21fd12cb17 [SPARK-9852] Let reduce tasks fetch multiple map output partitions
This makes two changes:

- Allow reduce tasks to fetch multiple map output partitions -- this is a pretty small change to HashShuffleFetcher
- Move shuffle locality computation out of DAGScheduler and into ShuffledRDD / MapOutputTracker; this was needed because the code in DAGScheduler wouldn't work for RDDs that fetch multiple map output partitions from each reduce task

I also added an AdaptiveSchedulingSuite that creates RDDs depending on multiple map output partitions.

Author: Matei Zaharia <matei@databricks.com>

Closes #8844 from mateiz/spark-9852.
2015-09-24 23:39:04 -04:00
Josh Rosen 8023242e77 [SPARK-10761] Refactor DiskBlockObjectWriter to not require BlockId
The DiskBlockObjectWriter constructor took a BlockId parameter but never used it. As part of some general cleanup in these interfaces, this patch refactors its constructor to eliminate this parameter.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #8871 from JoshRosen/disk-block-object-writer-blockid-cleanup.
2015-09-24 14:18:33 -07:00
Xiangrui Meng 02144d6745 Revert "[SPARK-6028][Core]A new RPC implemetation based on the network module"
This reverts commit 084e4e1262.
2015-09-24 08:25:44 -07:00
Andrew Or 83f6f54d12 [SPARK-10474] [SQL] Aggregation fails to allocate memory for pointer array (round 2)
This patch reverts most of the changes in a previous fix #8827.

The real cause of the issue is that in `TungstenAggregate`'s prepare method we only reserve 1 page, but later when we switch to sort-based aggregation we try to acquire 1 page AND a pointer array. The longer-term fix should be to reserve also the pointer array, but for now ***we will simply not track the pointer array***. (Note that elsewhere we already don't track the pointer array, e.g. [here](a18208047f/sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeKVExternalSorter.java (L88)))

Note: This patch reuses the unit test added in #8827 so it doesn't show up in the diff.

Author: Andrew Or <andrew@databricks.com>

Closes #8888 from andrewor14/dont-track-pointer-array.
2015-09-23 19:34:31 -07:00
zsxwing 084e4e1262 [SPARK-6028][Core]A new RPC implemetation based on the network module
Design doc: https://docs.google.com/document/d/1CF5G6rGVQMKSyV_QKo4D2M-x6rxz5x1Ew7aK3Uq6u8c/edit?usp=sharing

Author: zsxwing <zsxwing@gmail.com>

Closes #6457 from zsxwing/new-rpc.
2015-09-23 18:59:49 -07:00
Reynold Xin 9952217749 [SPARK-10731] [SQL] Delegate to Scala's DataFrame.take implementation in Python DataFrame.
Python DataFrame.head/take now requires scanning all the partitions. This pull request changes them to delegate the actual implementation to Scala DataFrame (by calling DataFrame.take).

This is more of a hack for fixing this issue in 1.5.1. A more proper fix is to change executeCollect and executeTake to return InternalRow rather than Row, and thus eliminate the extra round-trip conversion.

Author: Reynold Xin <rxin@databricks.com>

Closes #8876 from rxin/SPARK-10731.
2015-09-23 16:43:21 -07:00
tedyu 27bfa9ab3a [SPARK-10721] Log warning when file deletion fails
Author: tedyu <yuzhihong@gmail.com>

Closes #8843 from tedyu/master.
2015-09-23 10:01:28 +01:00
Tathagata Das 5548a25475 [SPARK-10652] [SPARK-10742] [STREAMING] Set meaningful job descriptions for all streaming jobs
Here is the screenshot after adding the job descriptions to threads that run receivers and the scheduler thread running the batch jobs.

## All jobs page
* Added job descriptions with links to relevant batch details page
![image](https://cloud.githubusercontent.com/assets/663212/9924165/cda4a372-5cb1-11e5-91ca-d43a32c699e9.png)

## All stages page
* Added stage descriptions with links to relevant batch details page
![image](https://cloud.githubusercontent.com/assets/663212/9923814/2cce266a-5cae-11e5-8a3f-dad84d06c50e.png)

## Streaming batch details page
* Added the +details link
![image](https://cloud.githubusercontent.com/assets/663212/9921977/24014a32-5c98-11e5-958e-457b6c38065b.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #8791 from tdas/SPARK-10652.
2015-09-22 22:44:09 -07:00
Andrew Or 61d4c07f4b [SPARK-10640] History server fails to parse TaskCommitDenied
... simply because the code is missing!

Author: Andrew Or <andrew@databricks.com>

Closes #8828 from andrewor14/task-end-reason-json.
2015-09-22 16:35:43 -07:00
Reynold Xin a96ba40f7e [SPARK-10714] [SPARK-8632] [SPARK-10685] [SQL] Refactor Python UDF handling
This patch refactors Python UDF handling:

1. Extract the per-partition Python UDF calling logic from PythonRDD into a PythonRunner. PythonRunner itself expects iterator as input/output, and thus has no dependency on RDD. This way, we can use PythonRunner directly in a mapPartitions call, or in the future in an environment without RDDs.
2. Use PythonRunner in Spark SQL's BatchPythonEvaluation.
3. Updated BatchPythonEvaluation to only use its input once, rather than twice. This should fix Python UDF performance regression in Spark 1.5.

There are a number of small cleanups I wanted to do when I looked at the code, but I kept most of those out so the diff looks small.

This basically implements the approach in https://github.com/apache/spark/pull/8833, but with some code moving around so the correctness doesn't depend on the inner workings of Spark serialization and task execution.

Author: Reynold Xin <rxin@databricks.com>

Closes #8835 from rxin/python-iter-refactor.
2015-09-22 14:11:46 -07:00
Josh Rosen 1ca5e2e0b8 [SPARK-10704] Rename HashShuffleReader to BlockStoreShuffleReader
The current shuffle code has an interface named ShuffleReader with only one implementation, HashShuffleReader. This naming is confusing, since the same read path code is used for both sort- and hash-based shuffle. This patch addresses this by renaming HashShuffleReader to BlockStoreShuffleReader.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #8825 from JoshRosen/shuffle-reader-cleanup.
2015-09-22 11:50:22 -07:00
xutingjun 2ea0f2e11b [SPARK-9585] Delete the input format caching because some input format are non thread safe
If we cache the InputFormat, all tasks on the same executor will share it.
Some InputFormats are thread safe, but some are not, such as HiveHBaseTableInputFormat. If tasks share a non-thread-safe InputFormat, unexpected errors may occur.
To avoid this, I think we should delete the input format caching.
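A rough sketch of the difference between the cached and uncached paths, assuming a format object that keeps mutable state. `StatefulFormat`, `cachedFormat` and `freshFormat` are hypothetical names used only for illustration; they are not Hadoop or Spark APIs.

```scala
object InputFormatSketch {
  // Hypothetical stand-in for a non-thread-safe InputFormat: it keeps mutable state.
  class StatefulFormat { var currentTable: String = "" }

  private val cache =
    new java.util.concurrent.ConcurrentHashMap[String, StatefulFormat]()

  // Cached: every task asking for the same key shares one instance.
  def cachedFormat(key: String): StatefulFormat =
    cache.computeIfAbsent(key, new java.util.function.Function[String, StatefulFormat] {
      override def apply(k: String): StatefulFormat = new StatefulFormat
    })

  // Uncached: each task gets its own instance, so no state is shared.
  def freshFormat(): StatefulFormat = new StatefulFormat
}
```

With the cache, two tasks on the same executor would mutate the same `currentTable`; without it, each call pays for a new instance but shares nothing.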

Author: xutingjun <xutingjun@huawei.com>
Author: meiyoula <1039320815@qq.com>
Author: Xutingjun <xutingjun@huawei.com>

Closes #7918 from XuTingjun/cached_inputFormat.
2015-09-22 11:01:32 -07:00
Rekha Joshi 7278f792a7 [SPARK-10718] [BUILD] Update License on conf files and corresponding excludes file update
Update License on conf files and corresponding excludes file update

Author: Rekha Joshi <rekhajoshm@gmail.com>
Author: Joshi <rekhajoshm@gmail.com>

Closes #8842 from rekhajoshm/SPARK-10718.
2015-09-22 11:03:21 +01:00
Andrew Or fd61b00487 [Minor] style fix for previous commit f24316e
2015-09-22 00:05:30 -07:00
Madhusudanan Kandasamy f24316e6d9 [SPARK-10458] [SPARK CORE] Added isStopped() method in SparkContext
Added isStopped() method in SparkContext

Author: Madhusudanan Kandasamy <madhusudanan@in.ibm.com>

Closes #8749 from kmadhugit/SPARK-10458.
2015-09-22 00:03:48 -07:00
Hossein c986e933a9 [SPARK-10711] [SPARKR] Do not assume spark.submit.deployMode is always set
In ```RUtils.sparkRPackagePath()``` we
1. Call ``` sys.props("spark.submit.deployMode")``` which returns null if ```spark.submit.deployMode``` is not set
2. Call ``` sparkConf.get("spark.submit.deployMode")``` which throws ```NoSuchElementException``` if ```spark.submit.deployMode``` is not set. This patch simply passes a default value ("cluster") for ```spark.submit.deployMode```.
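The two lookup behaviours can be sketched with a tiny map-backed configuration. `MiniConf` is a hypothetical stand-in written for this example (it only mimics the throw-on-missing and default-value lookups described above), and `deployMode` is an illustrative helper name.

```scala
object DeployModeSketch {
  // Hypothetical minimal stand-in for a configuration object backed by a map.
  class MiniConf(settings: Map[String, String]) {
    // Throws NoSuchElementException when the key is missing.
    def get(key: String): String =
      settings.getOrElse(key, throw new NoSuchElementException(key))

    // Falls back to the supplied default when the key is missing.
    def get(key: String, default: String): String =
      settings.getOrElse(key, default)
  }

  // Reads the deploy mode with a default, as this patch describes.
  def deployMode(conf: MiniConf): String =
    conf.get("spark.submit.deployMode", "cluster")
}
```

Passing the default at the call site means an unset key no longer surfaces as an exception.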

cc rxin

Author: Hossein <hossein@databricks.com>

Closes #8832 from falaki/SPARK-10711.
2015-09-21 21:09:59 -07:00
Tathagata Das 72869883f1 [SPARK-10649] [STREAMING] Prevent inheriting job group and irrelevant job description in streaming jobs
The job group and job description information is passed through thread-local properties and gets inherited by child threads. In the case of Spark Streaming, the streaming jobs inherit these properties from the thread that called streamingContext.start(). This may not make sense.

1. Job group: This is mainly used for cancelling a group of jobs together. It does not make sense to cancel streaming jobs like this, as the effect will be unpredictable. It's not a valid use case anyway; to cancel a streaming context, call streamingContext.stop().

2. Job description: This is used to pass on nice text descriptions for jobs to show up in the UI. The job description of the thread that calls streamingContext.start() is not useful for all the streaming jobs, as it does not make sense for all of the streaming jobs to have the same description, and the description may or may not be related to streaming.

The solution in this PR is meant for the Spark master branch, where local properties are inherited by cloning the properties. The job group and job description in the thread that starts the streaming scheduler are explicitly removed, so that all the subsequent child threads do not inherit them. Also, the starting is done in a new child thread, so that setting the job group and description for streaming does not change those properties in the thread that called streamingContext.start().

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #8781 from tdas/SPARK-10649.
2015-09-21 16:47:52 -07:00
hushan[胡珊] b78c65b03a [SPARK-5259] [CORE] don't submit stage until its dependencies map outputs are registered
Track pending tasks by partition ID instead of Task objects.

Before this change, failure & retry could result in a case where a stage got submitted before the map outputs from its dependencies were registered. This was due to an error in the condition for registering map outputs.

Author: hushan[胡珊] <hushan@xiaomi.com>
Author: Imran Rashid <irashid@cloudera.com>

Closes #7699 from squito/SPARK-5259.
2015-09-21 14:26:15 -05:00
zsxwing ebbf85f07b [SPARK-7989] [SPARK-10651] [CORE] [TESTS] Increase timeout to fix flaky tests
I noticed only one block manager registered with master in an unsuccessful build (https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/3534/)
```
15/09/16 13:02:30.981 pool-1-thread-1-ScalaTest-running-BroadcastSuite INFO SparkContext: Running Spark version 1.6.0-SNAPSHOT
...
15/09/16 13:02:38.133 sparkDriver-akka.actor.default-dispatcher-19 INFO BlockManagerMasterEndpoint: Registering block manager localhost:48196 with 530.3 MB RAM, BlockManagerId(0, localhost, 48196)
```
In addition, the first block manager needed 7+ seconds to start. But the test expected 2 block managers so it failed.

However, there was no exception in this log file. So I checked a successful build (https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/3536/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/) and it needed 4-5 seconds to set up the local cluster:
```
15/09/16 18:11:27.738 sparkWorker1-akka.actor.default-dispatcher-5 INFO Worker: Running Spark version 1.6.0-SNAPSHOT
...
15/09/16 18:11:30.838 sparkDriver-akka.actor.default-dispatcher-20 INFO BlockManagerMasterEndpoint: Registering block manager localhost:54202 with 530.3 MB RAM, BlockManagerId(1, localhost, 54202)
15/09/16 18:11:32.112 sparkDriver-akka.actor.default-dispatcher-20 INFO BlockManagerMasterEndpoint: Registering block manager localhost:32955 with 530.3 MB RAM, BlockManagerId(0, localhost, 32955)
```
In this build, the first block manager needed only 3+ seconds to start.

Comparing these two builds, I guess it's possible that the local cluster in `BroadcastSuite` cannot be ready in 10 seconds if the Jenkins worker is busy. So I just increased the timeout to 60 seconds to see if this can fix the issue.

Author: zsxwing <zsxwing@gmail.com>

Closes #8813 from zsxwing/fix-BroadcastSuite.
2015-09-21 11:39:04 -07:00
Josh Rosen 2117eea71e [SPARK-10710] Remove ability to disable spilling in core and SQL
It does not make much sense to set `spark.shuffle.spill` or `spark.sql.planner.externalSort` to false: I believe that these configurations were initially added as "escape hatches" to guard against bugs in the external operators, but these operators are now mature and well-tested. In addition, these configurations are not handled in a consistent way anymore: SQL's Tungsten codepath ignores these configurations and will continue to use spilling operators. Similarly, Spark Core's `tungsten-sort` shuffle manager does not respect `spark.shuffle.spill=false`.

This pull request removes these configurations, adds warnings at the appropriate places, and deletes a large amount of code which was only used in code paths that did not support spilling.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #8831 from JoshRosen/remove-ability-to-disable-spilling.
2015-09-19 21:40:21 -07:00
Andrew Or 7ff8d68cc1 [SPARK-10474] [SQL] Aggregation fails to allocate memory for pointer array
When `TungstenAggregation` hits memory pressure, it switches from hash-based to sort-based aggregation in-place. However, in the process we try to allocate the pointer array for writing to the new `UnsafeExternalSorter` *before* actually freeing the memory from the hash map. This led to the following exception:
```
 java.io.IOException: Could not acquire 65536 bytes of memory
        at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
        at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
        at org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:126)
        at org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
        at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
```
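The ordering problem described in this message can be illustrated with a toy fixed-size pool. This is only a sketch: `Pool`, `allocateThenFree` and `freeThenAllocate` are hypothetical names, and real Spark memory management is considerably more involved.

```scala
object MemoryOrderSketch {
  // Hypothetical fixed-size pool used only for this illustration.
  class Pool(capacity: Long) {
    private var used = 0L

    // Grants the request only if it fits in the remaining capacity.
    def acquire(bytes: Long): Boolean =
      if (used + bytes <= capacity) { used += bytes; true } else false

    def release(bytes: Long): Unit = { used = math.max(0L, used - bytes) }
  }

  // Asks for the pointer array while the hash map still holds its memory.
  def allocateThenFree(pool: Pool, mapBytes: Long, arrayBytes: Long): Boolean = {
    val granted = pool.acquire(arrayBytes)
    pool.release(mapBytes)
    granted
  }

  // Frees the hash map first, then asks for the pointer array.
  def freeThenAllocate(pool: Pool, mapBytes: Long, arrayBytes: Long): Boolean = {
    pool.release(mapBytes)
    pool.acquire(arrayBytes)
  }
}
```

When the hash map already fills the pool, the first ordering is refused even though the memory is about to be returned, while the second succeeds.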

Author: Andrew Or <andrew@databricks.com>

Closes #8827 from andrewor14/allocate-pointer-array.
2015-09-18 23:58:25 -07:00
Mingyu Kim 8074208fa4 [SPARK-10611] Clone Configuration for each task for NewHadoopRDD
This patch attempts to fix the Hadoop Configuration thread safety issue for NewHadoopRDD in the same way SPARK-2546 fixed the issue for HadoopRDD.

Author: Mingyu Kim <mkim@palantir.com>

Closes #8763 from mingyukim/mkim/SPARK-10611.
2015-09-18 15:40:58 -07:00
Reynold Xin 348d7c9a93 [SPARK-9808] Remove hash shuffle file consolidation.
Author: Reynold Xin <rxin@databricks.com>

Closes #8812 from rxin/SPARK-9808-1.
2015-09-18 13:48:41 -07:00
linweizhong 93c7650ab6 [SPARK-9522] [SQL] SparkSubmit process can not exit if kill application when HiveThriftServer was starting
When we start HiveThriftServer, we will start SparkContext first, then start HiveServer2, if we kill application while HiveServer2 is starting then SparkContext will stop successfully, but SparkSubmit process can not exit.

Author: linweizhong <linweizhong@huawei.com>

Closes #7853 from Sephiroth-Lin/SPARK-9522.
2015-09-17 22:25:24 -07:00
Jeff Zhang 36d8b278d8 [SPARK-10531] [CORE] AppId is set as AppName in status rest api
Verify it manually.

Author: Jeff Zhang <zjffdu@apache.org>

Closes #8688 from zjffdu/SPARK-10531.
2015-09-17 10:25:18 -07:00
Josiah Samuel 81b4db374d [SPARK-10172] [CORE] disable sort in HistoryServer webUI
This pull request is to address the JIRA SPARK-10172 (History Server web UI gets messed up when sorting on any column).
The content of the table gets messed up due to the rowspan attribute of the table data(cell) during sorting.
The current table sort library used in SparkUI (sorttable.js) doesn't support/handle cells(td) with rowspans.
The fix will disable the table sort in the web UI, when there are jobs listed with multiple attempts.

Author: Josiah Samuel <josiah_sams@in.ibm.com>

Closes #8506 from josiahsams/SPARK-10172.
2015-09-17 10:18:21 -07:00
Cheng Lian 69c9830d28 [MINOR] [CORE] Fixes minor variable name typo
Author: Cheng Lian <lian@databricks.com>

Closes #8784 from liancheng/typo-fix.
2015-09-17 00:48:57 -07:00
Sun Rui 896edb51ab [SPARK-10050] [SPARKR] Support collecting data of MapType in DataFrame.
1. Support collecting data of MapType from DataFrame.
2. Support data of MapType in createDataFrame.

Author: Sun Rui <rui.sun@intel.com>

Closes #8711 from sun-rui/SPARK-10050.
2015-09-16 13:20:39 -07:00
Sean Owen 5dbaf3d391 [SPARK-10589] [WEBUI] Add defense against external site framing
Set `X-Frame-Options: SAMEORIGIN` to protect against frame-related vulnerability

Author: Sean Owen <sowen@cloudera.com>

Closes #8745 from srowen/SPARK-10589.
2015-09-16 19:19:23 +01:00
Josh Rosen 38700ea40c [SPARK-10381] Fix mixup of taskAttemptNumber & attemptId in OutputCommitCoordinator
When speculative execution is enabled, consider a scenario where the authorized committer of a particular output partition fails during the OutputCommitter.commitTask() call. In this case, the OutputCommitCoordinator is supposed to release that committer's exclusive lock on committing once that task fails. However, due to a unit mismatch (we used task attempt number in one place and task attempt id in another) the lock will not be released, causing Spark to go into an infinite retry loop.

This bug was masked by the fact that the OutputCommitCoordinator does not have enough end-to-end tests (the current tests use many mocks). Other factors contributing to this bug are the fact that we have many similarly-named identifiers that have different semantics but the same data types (e.g. attemptNumber and taskAttemptId, with inconsistent variable naming which makes them difficult to distinguish).

This patch adds a regression test and fixes this bug by always using task attempt numbers throughout this code.
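To see why mixing the two identifiers leaves the lock stuck, here is a minimal sketch of a single-partition commit lock keyed by attempt number. `CommitLock` and its methods are hypothetical and written for this example; they are not the OutputCommitCoordinator's real code.

```scala
object CommitLockSketch {
  // Hypothetical single-partition commit lock, keyed by task attempt number.
  class CommitLock {
    private var holder: Option[Int] = None

    // Grants the lock to the first attempt number that asks for it.
    def tryAcquire(attemptNumber: Int): Boolean = holder match {
      case None => holder = Some(attemptNumber); true
      case Some(n) => n == attemptNumber
    }

    // Releases only if the failed attempt is the current holder.
    def attemptFailed(identifier: Int): Unit =
      if (holder.contains(identifier)) holder = None

    def isHeld: Boolean = holder.isDefined
  }
}
```

Because both identifiers are plain integers, passing a task attempt id to `attemptFailed` compiles fine but never matches the stored attempt number, so every retry is denied.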

Author: Josh Rosen <joshrosen@databricks.com>

Closes #8544 from JoshRosen/SPARK-10381.
2015-09-15 17:11:21 -07:00
vinodkc 99ecfa5945 [SPARK-10575] [SPARK CORE] Wrapped RDD.takeSample with Scope
Remove return statements in RDD.takeSample and wrap it withScope

Author: vinodkc <vinod.kc.in@gmail.com>
Author: vinodkc <vinodkc@users.noreply.github.com>
Author: Vinod K C <vinod.kc@huawei.com>

Closes #8730 from vinodkc/fix_takesample_return.
2015-09-15 17:01:39 -07:00
Andrew Or b6e998634e [SPARK-10548] [SPARK-10563] [SQL] Fix concurrent SQL executions
*Note: this is for master branch only.* The fix for branch-1.5 is at #8721.

The query execution ID is currently passed from a thread to its children, which is not the intended behavior. This led to `IllegalArgumentException: spark.sql.execution.id is already set` when running queries in parallel, e.g.:
```
(1 to 100).par.foreach { _ =>
  sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count()
}
```
The cause is `SparkContext`'s local properties are inherited by default. This patch adds a way to exclude keys we don't want to be inherited, and makes SQL go through that code path.
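The idea of excluding keys from inheritance can be sketched with the JDK's `InheritableThreadLocal`, whose `childValue` hook runs when a child thread is created. This is an assumption-laden illustration, not the patch itself: `LocalPropsSketch`, `nonInheritedKeys` and `inheritable` are hypothetical names.

```scala
import java.util.Properties

object LocalPropsSketch {
  // Hypothetical set of keys that child threads should not inherit.
  val nonInheritedKeys: Set[String] = Set("spark.sql.execution.id")

  // Copies the parent's properties into a fresh object, skipping excluded keys.
  def inheritable(parent: Properties): Properties = {
    val child = new Properties()
    val names = parent.stringPropertyNames().iterator()
    while (names.hasNext) {
      val key = names.next()
      if (!nonInheritedKeys.contains(key)) child.setProperty(key, parent.getProperty(key))
    }
    child
  }

  // Each thread sees its own Properties; children start from a filtered copy.
  val localProperties: InheritableThreadLocal[Properties] =
    new InheritableThreadLocal[Properties] {
      override protected def initialValue(): Properties = new Properties()
      override protected def childValue(parent: Properties): Properties = inheritable(parent)
    }
}
```

A thread spawned while the parent has `spark.sql.execution.id` set would start without that key but keep the other properties, and later changes in the parent do not leak into the child's copy.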

Author: Andrew Or <andrew@databricks.com>

Closes #8710 from andrewor14/concurrent-sql-executions.
2015-09-15 16:45:47 -07:00
Reynold Xin 09b7e7c198 Update version to 1.6.0-SNAPSHOT.
Author: Reynold Xin <rxin@databricks.com>

Closes #8350 from rxin/1.6.
2015-09-15 00:54:20 -07:00
Matei Zaharia 1a0955250b [SPARK-9851] Support submitting map stages individually in DAGScheduler
This patch adds support for submitting map stages in a DAG individually so that we can make downstream decisions after seeing statistics about their output, as part of SPARK-9850. I also added more comments to many of the key classes in DAGScheduler. By itself, the patch is not super useful except maybe to switch between a shuffle and broadcast join, but with the other subtasks of SPARK-9850 we'll be able to do more interesting decisions.

The main entry point is SparkContext.submitMapStage, which lets you run a map stage and see stats about the map output sizes. Other stats could also be collected through accumulators. See AdaptiveSchedulingSuite for a short example.

Author: Matei Zaharia <matei@databricks.com>

Closes #8180 from mateiz/spark-9851.
2015-09-14 21:47:40 -04:00
Andrew Or 7b6c856367 [SPARK-10564] ThreadingSuite: assertion failures in threads don't fail the test (round 2)
This is a follow-up patch to #8723. I missed one case there.

Author: Andrew Or <andrew@databricks.com>

Closes #8727 from andrewor14/fix-threading-suite.
2015-09-14 15:09:43 -07:00
Forest Fang fd1e8cddf2 [SPARK-10543] [CORE] Peak Execution Memory Quantile should be Per-task Basis
Read `PEAK_EXECUTION_MEMORY` using `update` to get per task partial value instead of cumulative value.

I tested with this workload:

```scala
val size = 1000
val repetitions = 10
val data = sc.parallelize(1 to size, 5).map(x => (util.Random.nextInt(size / repetitions),util.Random.nextDouble)).toDF("key", "value")
val res = data.toDF.groupBy("key").agg(sum("value")).count
```

Before:
![image](https://cloud.githubusercontent.com/assets/4317392/9828197/07dd6874-58b8-11e5-9bd9-6ba927c38b26.png)

After:
![image](https://cloud.githubusercontent.com/assets/4317392/9828151/a5ddff30-58b7-11e5-8d31-eda5dc4eae79.png)

Tasks view:
![image](https://cloud.githubusercontent.com/assets/4317392/9828199/17dc2b84-58b8-11e5-92a8-be89ce4d29d1.png)

cc andrewor14 I'd appreciate it if you could give feedback on this, since I think you introduced the display of this metric.

Author: Forest Fang <forest.fang@outlook.com>

Closes #8726 from saurfang/stagepage.
2015-09-14 15:07:13 -07:00
Sean Owen 4e2242bb41 [SPARK-10576] [BUILD] Move .java files out of src/main/scala
Move .java files in `src/main/scala` to `src/main/java` root, except for `package-info.java` (to stay next to package.scala)

Author: Sean Owen <sowen@cloudera.com>

Closes #8736 from srowen/SPARK-10576.
2015-09-14 15:03:51 -07:00
Wenchen Fan 32407bfd2b [SPARK-9899] [SQL] log warning for direct output committer with speculation enabled
This is a follow-up of https://github.com/apache/spark/pull/8317.

When speculation is enabled, there may be multiple tasks writing to the same path. Generally it's OK as we will write to a temporary directory first and only one task can commit the temporary directory to the target path.

However, when we use a direct output committer, tasks will write data to the target path directly without a temporary directory. This causes problems like corrupted data. Please see [PR comment](https://github.com/apache/spark/pull/8191#issuecomment-131598385) for more details.

Unfortunately, we don't have a simple flag to tell if an output committer will write to a temporary directory or not, so for safety, we have to disable any customized output committer when `speculation` is true.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #8687 from cloud-fan/direct-committer.
2015-09-14 11:51:39 -07:00
Josh Rosen b3a7480ab0 [SPARK-10330] Add Scalastyle rule to require use of SparkHadoopUtil JobContext methods
This is a followup to #8499 which adds a Scalastyle rule to mandate the use of SparkHadoopUtil's JobContext accessor methods and fixes the existing violations.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #8521 from JoshRosen/SPARK-10330-part2.
2015-09-12 16:23:55 -07:00
Sean Owen 22730ad54d [SPARK-10547] [TEST] Streamline / improve style of Java API tests
Fix a few Java API test style issues: unused generic types, exceptions, wrong assert argument order

Author: Sean Owen <sowen@cloudera.com>

Closes #8706 from srowen/SPARK-10547.
2015-09-12 10:40:10 +01:00
Nithin Asokan 8285e3b0d3 [SPARK-10554] [CORE] Fix NPE with ShutdownHook
https://issues.apache.org/jira/browse/SPARK-10554

Fixes NPE when ShutdownHook tries to cleanup temporary folders

Author: Nithin Asokan <Nithin.Asokan@Cerner.com>

Closes #8720 from nasokan/SPARK-10554.
2015-09-12 09:50:49 +01:00
Daniel Imfeld 6d8367807c [SPARK-10566] [CORE] SnappyCompressionCodec init exception handling masks important error information
When throwing an IllegalArgumentException in SnappyCompressionCodec.init, chain the existing exception. This allows potentially important debugging info to be passed to the user.
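Exception chaining of this kind looks roughly like the following. The sketch assumes a hypothetical `loadNative` step that fails; the names and messages are illustrative, not the codec's actual code.

```scala
object ChainingSketch {
  // Hypothetical low-level step that fails during initialization.
  private def loadNative(): Unit =
    throw new IllegalStateException("native library missing")

  // Wraps the failure but passes the original along as the cause.
  def init(): Unit =
    try loadNative()
    catch {
      case e: Exception =>
        throw new IllegalArgumentException("codec init failed", e)
    }
}
```

The caller still sees an IllegalArgumentException, but `getCause` now leads back to the underlying error instead of discarding it.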

Manual testing shows the exception chained properly, and the test suite still looks fine as well.

This contribution is my original work and I license the work to the project under the project's open source license.

Author: Daniel Imfeld <daniel@danielimfeld.com>

Closes #8725 from dimfeld/dimfeld-patch-1.
2015-09-12 09:19:59 +01:00
Andrew Or d74c6a143c [SPARK-10564] ThreadingSuite: assertion failures in threads don't fail the test
This commit ensures if an assertion fails within a thread, it will ultimately fail the test. Otherwise we end up potentially masking real bugs by not propagating assertion failures properly.
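One common way to get this behaviour, shown here as a sketch rather than as what ThreadingSuite actually does, is to capture anything thrown inside the thread and rethrow it on the calling thread after `join()`. `runInThread` and `failingAssertionPropagates` are hypothetical helper names.

```scala
object ThreadAssertSketch {
  // Runs `body` on a new thread and rethrows, on the caller's thread, anything it threw.
  def runInThread(body: () => Unit): Unit = {
    var failure: Option[Throwable] = None
    val worker = new Thread(new Runnable {
      override def run(): Unit =
        try body() catch { case e: Throwable => failure = Some(e) }
    })
    worker.start()
    worker.join() // join() makes the write to `failure` visible here
    failure.foreach(e => throw e)
  }

  // An assertion that fails inside the thread now reaches the caller.
  def failingAssertionPropagates(): Boolean =
    try { runInThread(() => assert(1 + 1 == 3)); false }
    catch { case _: AssertionError => true }
}
```

Without the capture-and-rethrow step, the AssertionError would only terminate the worker thread and the test body would carry on as if nothing had failed.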

Author: Andrew Or <andrew@databricks.com>

Closes #8723 from andrewor14/fix-threading-suite.
2015-09-11 15:02:59 -07:00
tedyu b231ab8938 [SPARK-10546] Check partitionId's range in ExternalSorter#spill()
See this thread for background:
http://search-hadoop.com/m/q3RTt0rWvIkHAE81

We should check the range of the partition Id and provide a meaningful message through the exception.

Alternatively, we can use abs() and modulo to force the partition Id into the legitimate range. However, the expectation is that users should correct the logic error in their code.
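Both options can be sketched in a few lines. `checked` and `forced` are hypothetical helper names, and the error message is illustrative. Note that `forced` uses a non-negative modulo rather than a plain abs(), since `math.abs(Int.MinValue)` is still negative.

```scala
object PartitionIdSketch {
  // Fails fast with a descriptive message when the id is out of range.
  def checked(partitionId: Int, numPartitions: Int): Int = {
    require(partitionId >= 0 && partitionId < numPartitions,
      s"partition Id: $partitionId should be in the range [0, $numPartitions)")
    partitionId
  }

  // The alternative: force any Int into [0, numPartitions).
  def forced(rawId: Int, numPartitions: Int): Int = {
    val mod = rawId % numPartitions
    if (mod < 0) mod + numPartitions else mod
  }
}
```

The checked variant surfaces a buggy partitioner immediately, whereas the forced variant would silently send records to some valid but arbitrary partition.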

Author: tedyu <yuzhihong@gmail.com>

Closes #8703 from tedyu/master.
2015-09-11 21:45:45 +01:00
Matt Massie 0eabea8a05 [SPARK-9043] Serialize key, value and combiner classes in ShuffleDependency
ShuffleManager implementations are currently not given type information for
the key, value and combiner classes. Serialization of shuffle objects relies
on objects being JavaSerializable, with methods defined for reading/writing
the object or, alternatively, serialization via Kryo which uses reflection.

Serialization systems like Avro, Thrift and Protobuf generate classes with
zero argument constructors and explicit schema information
(e.g. IndexedRecords in Avro have get, put and getSchema methods).

By serializing the key, value and combiner class names in ShuffleDependency,
shuffle implementations will have access to schema information when
registerShuffle() is called.

Author: Matt Massie <massie@cs.berkeley.edu>

Closes #7403 from massie/shuffle-classtags.
2015-09-10 17:24:33 -07:00
Sun Rui 45e3be5c13 [SPARK-10049] [SPARKR] Support collecting data of ArraryType in DataFrame.
This PR:
1. Enhances reflection in RBackend by automatically matching a Java array to a Scala Seq when finding methods. Util functions like seq() and listToSeq() on the R side can be removed, as they will conflict with the SerDe logic that transfers a Scala Seq to the R side.

2. Enhances the SerDe to support transferring a Scala Seq to the R side. Data of ArrayType in a DataFrame is observed to be of Scala Seq type after collection.

3. Supports ArrayType in createDataFrame().

Author: Sun Rui <rui.sun@intel.com>

Closes #8458 from sun-rui/SPARK-10049.
2015-09-10 12:21:13 -07:00
Akash Mishra a5ef2d0600 [SPARK-10514] [MESOS] waiting for min no of total cores acquired by Spark by implementing the sufficientResourcesRegistered method
The spark.scheduler.minRegisteredResourcesRatio configuration parameter works in YARN mode but not in Mesos coarse-grained mode.

If the parameter is not specified, the default value of 0 is set for spark.scheduler.minRegisteredResourcesRatio in the base class, and this method will always return true.

There are no existing tests for YARN mode either, hence no test was added for this change.
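The check behind sufficientResourcesRegistered can be sketched as a comparison of acquired cores against a fraction of the maximum. The parameter names below are assumptions made for this example, not necessarily those used in the Mesos backend.

```scala
object MinResourcesSketch {
  // True once the acquired cores reach the configured fraction of the maximum.
  // Parameter names are illustrative assumptions.
  def sufficientResourcesRegistered(
      totalCoresAcquired: Int,
      maxCores: Int,
      minRegisteredRatio: Double): Boolean =
    totalCoresAcquired >= maxCores * minRegisteredRatio
}
```

With a ratio of 0 the threshold is 0 cores, so the check passes immediately, which matches the "always return true" behaviour described for the default.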

Author: Akash Mishra <akash.mishra20@gmail.com>

Closes #8672 from SleepyThread/master.
2015-09-10 12:04:02 -07:00
Iulian Dragos f0562e8cdb [SPARK-6350] [MESOS] Fine-grained mode scheduler respects mesosExecutor.cores
This is a regression introduced in #4960, this commit fixes it and adds a test.

tnachen andrewor14 please review, this should be an easy one.

Author: Iulian Dragos <jaguarul@gmail.com>

Closes #8653 from dragos/issue/mesos/fine-grained-maxExecutorCores.
2015-09-10 12:00:21 -07:00
mcheah af3bc59d1f [SPARK-8167] Make tasks that fail from YARN preemption not fail job
The architecture is that, in YARN mode, if the driver detects that an executor has disconnected, it asks the ApplicationMaster why the executor died. If the ApplicationMaster is aware that the executor died because of preemption, all tasks associated with that executor are not marked as failed. The executor is still removed from the driver's list of available executors, however.

There are a few open questions:
1. Should standalone mode have a similar "get executor loss reason" as well? I localized this change as much as possible to affect only YARN, but there could be a valid case to differentiate executor losses in standalone mode as well.
2. I make a pretty strong assumption in YarnAllocator that getExecutorLossReason(executorId) will only be called once per executor id; I do this so that I can remove the metadata from the in-memory map to avoid object accumulation. It's not clear if I'm being overly zealous to save space, however.

cc vanzin specifically for review because it collided with some earlier YARN scheduling work.
cc JoshRosen because it's similar to output commit coordination we did in the past
cc andrewor14 for our discussion on how to get executor exit codes and loss reasons

Author: mcheah <mcheah@palantir.com>

Closes #8007 from mccheah/feature/preemption-handling.
2015-09-10 11:58:54 -07:00
Cheng Hao e048111376 [SPARK-10466] [SQL] UnsafeRow SerDe exception with data spill
Data Spill with UnsafeRow causes assert failure.

```
java.lang.AssertionError: assertion failed
	at scala.Predef$.assert(Predef.scala:165)
	at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75)
	at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180)
	at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688)
	at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687)
	at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683)
	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	at org.apache.spark.scheduler.Task.run(Task.scala:88)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
```

To reproduce that with code (thanks andrewor14):
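```scala
// Launch the REPL from a shell first (this command is not Scala). The tiny
// shuffle memory fraction forces the sorter to spill:
//   bin/spark-shell --master local \
//     --conf spark.shuffle.memoryFraction=0.005 \
//     --conf spark.shuffle.sort.bypassMergeThreshold=0

// Then, inside spark-shell:
sc.parallelize(1 to 2 * 1000 * 1000, 10)
  .map { i => (i, i) }.toDF("a", "b").groupBy("b").avg().count()
```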

Author: Cheng Hao <hao.cheng@intel.com>

Closes #8635 from chenghao-intel/unsafe_spill.
2015-09-10 11:48:43 -07:00
Liang-Chi Hsieh 45de518742 [SPARK-9730] [SQL] Add Full Outer Join support for SortMergeJoin
This PR is based on #8383 , thanks to viirya

JIRA: https://issues.apache.org/jira/browse/SPARK-9730

This patch adds the Full Outer Join support for SortMergeJoin. A new class SortMergeFullJoinScanner is added to scan rows from left and right iterators. FullOuterIterator is simply a wrapper of type RowIterator to consume joined rows from SortMergeFullJoinScanner.
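For readers unfamiliar with the technique, here is a minimal sketch of a full outer merge join over two key-sorted inputs. It is not the code from this patch: `FullOuterMergeSketch`, `Row`, and `join` are hypothetical names, and rows are simplified to `(key, value)` pairs.

```scala
import scala.collection.mutable.ArrayBuffer

// Illustrative only: full outer merge join over two inputs sorted by key.
// All names here are hypothetical and do not come from Spark.
object FullOuterMergeSketch {
  type Row = (Int, String)

  def join(left: Seq[Row], right: Seq[Row]): List[(Option[String], Option[String])] = {
    val l = left.toVector
    val r = right.toVector
    val out = ArrayBuffer.empty[(Option[String], Option[String])]
    var i = 0
    var j = 0
    while (i < l.length || j < r.length) {
      if (j >= r.length || (i < l.length && l(i)._1 < r(j)._1)) {
        // Left row has no match on the right: pad the right side with None.
        out += ((Some(l(i)._2), None))
        i += 1
      } else if (i >= l.length || r(j)._1 < l(i)._1) {
        // Right row has no match on the left: pad the left side with None.
        out += ((None, Some(r(j)._2)))
        j += 1
      } else {
        // Equal keys: emit the cross product of both runs sharing this key.
        val key = l(i)._1
        val iEnd = l.indexWhere(_._1 != key, i) match { case -1 => l.length; case n => n }
        val jEnd = r.indexWhere(_._1 != key, j) match { case -1 => r.length; case n => n }
        for (a <- i until iEnd; b <- j until jEnd) {
          out += ((Some(l(a)._2), Some(r(b)._2)))
        }
        i = iEnd
        j = jEnd
      }
    }
    out.toList
  }
}
```

Unlike an inner merge join, neither side can be dropped when it runs ahead of the other, which is why unmatched rows from both inputs are emitted with a `None` on the opposite side.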

Closes #8383

Author: Liang-Chi Hsieh <viirya@appier.com>
Author: Davies Liu <davies@databricks.com>

Closes #8579 from davies/smj_fullouter.
2015-09-09 16:02:27 -07:00
Luc Bourlier c1bc4f439f [SPARK-10227] fatal warnings with sbt on Scala 2.11
The bulk of the changes are on the `transient` annotation on class parameters. Often the compiler doesn't generate a field for these parameters, so the transient annotation would be unnecessary.
But if the class parameters are used in methods, then fields are created. So it is safer to keep the annotations.

The remainder are some potential bugs, and deprecated syntax.

Author: Luc Bourlier <luc.bourlier@typesafe.com>

Closes #8433 from skyluc/issue/sbt-2.11.
2015-09-09 09:57:58 +01:00
Reynold Xin 5ffe752b59 [SPARK-9767] Remove ConnectionManager.
We introduced the Netty network module for shuffle in Spark 1.2, and have had it turned on by default for 3 releases. The old ConnectionManager is difficult to maintain. If we merge the patch now, by the time it is released, ConnectionManager will have been off by default for 1 year. It's time to remove it.

Author: Reynold Xin <rxin@databricks.com>

Closes #8161 from rxin/SPARK-9767.
2015-09-07 10:42:30 -10:00
robbins 2e1c17553d [SPARK-10454] [SPARK CORE] wait for empty event queue
Author: robbins <robbins@uk.ibm.com>

Closes #8605 from robbinspg/DAGSchedulerSuite-fix.
2015-09-04 15:23:29 -07:00
Timothy Chen b087d23e28 [SPARK-9669] [MESOS] Support PySpark on Mesos cluster mode.
Support running pyspark with cluster mode on Mesos!
This doesn't upload any scripts, so running against a remote Mesos cluster requires the user to specify the script from an available URI.

Author: Timothy Chen <tnachen@gmail.com>

Closes #8349 from tnachen/mesos_python.
2015-09-04 15:21:31 -07:00
Andrew Or cf42138643 [SPARK-10003] Improve readability of DAGScheduler
Note: this is not intended to be in Spark 1.5!

This patch rewrites some code in the `DAGScheduler` to make it more readable. In particular
- there were blocks of code that are unnecessary and removed for simplicity
- there were abstractions that are unnecessary and made the code hard to navigate
- other minor changes

Author: Andrew Or <andrew@databricks.com>

Closes #8217 from andrewor14/dag-scheduler-readability and squashes the following commits:

57abca3 [Andrew Or] Move comment back into if case
574fb1e [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-scheduler-readability
64a9ed2 [Andrew Or] Remove unnecessary code + minor code rewrites
2015-09-03 17:55:10 -07:00
Andrew Or 08b0750974 [SPARK-10435] Spark submit should fail fast for Mesos cluster mode with R
It's not supported yet so we should error with a clear message.

Author: Andrew Or <andrew@databricks.com>

Closes #8590 from andrewor14/mesos-cluster-r-guard.
2015-09-03 13:57:20 -07:00
jeanlyn db4c130f9e [SPARK-9591] [CORE] Job may fail for exception during getting remote block
[SPARK-9591](https://issues.apache.org/jira/browse/SPARK-9591)
When fetching a broadcast variable, we can fetch the block from several locations. Currently, however, connecting to a lost BlockManager (for example, one removed by the driver after being idle long enough under dynamic resource allocation) causes the task to fail, and in the worst case causes the job to fail.

Author: jeanlyn <jeanlyn92@gmail.com>

Closes #7927 from jeanlyn/catch_exception.
2015-09-03 13:56:11 -07:00
Vinod K C 11ef32c5a1 [SPARK-10430] [CORE] Added hashCode methods in AccumulableInfo and RDDOperationScope
Author: Vinod K C <vinod.kc@huawei.com>

Closes #8581 from vinodkc/fix_RDDOperationScope_Hashcode.
2015-09-03 13:55:02 -07:00
Pat Shields e62f4a46f4 [SPARK-9672] [MESOS] Don’t include SPARK_ENV_LOADED when passing env vars
This contribution is my original work and I license the work to the project under the project's open source license.

Author: Pat Shields <yeoldefortran@gmail.com>

Closes #7979 from pashields/env-loading-on-driver.
2015-09-03 13:53:18 -07:00
robbins d911c682f0 [SPARK-10431] [CORE] Fix intermittent test failure. Wait for event queue to be clear
Author: robbins <robbins@uk.ibm.com>

Closes #8582 from robbinspg/InputOutputMetricsSuite.
2015-09-03 13:47:25 -07:00
CHOIJAEHONG af0e3125cb [SPARK-8951] [SPARKR] support Unicode characters in collect()
Spark gives an error message and does not show the output when a field of the result DataFrame contains CJK characters.
I changed SerDe.scala so that Spark supports Unicode characters when it writes a string to R.

Author: CHOIJAEHONG <redrock07@naver.com>

Closes #7494 from CHOIJAEHONG1/SPARK-8951.
2015-09-03 13:38:26 -07:00
Davies Liu 62b4690d6b [SPARK-10379] preserve first page in UnsafeShuffleExternalSorter
Author: Davies Liu <davies@databricks.com>

Closes #8543 from davies/preserve_page.
2015-09-02 22:15:54 -07:00
Imran Rashid 3ddb9b3233 [SPARK-10247] [CORE] improve readability of a test case in DAGSchedulerSuite
This is pretty minor, just trying to improve the readability of `DAGSchedulerSuite`, I figure every bit helps.  Before whenever I read this test, I never knew what "should work" and "should be ignored" really meant -- this adds some asserts & updates comments to make it more clear.  Also some reformatting per a suggestion from markhamstra on https://github.com/apache/spark/pull/7699

Author: Imran Rashid <irashid@cloudera.com>

Closes #8434 from squito/SPARK-10247.
2015-09-02 22:14:50 -07:00
Evan Racah f6c447f875 Removed code duplication in ShuffleBlockFetcherIterator
Added fetchUpToMaxBytes() to prevent having to update both code blocks when a change is made.

Author: Evan Racah <ejracah@gmail.com>

Closes #8514 from eracah/master.
2015-09-02 22:13:37 -07:00
navis.ryu 0985d2c30e [SPARK-8707] RDD#toDebugString fails if any cached RDD has invalid partitions
Added numPartitions(evaluate: Boolean) to RDD. With "evaluate=true" the method is the same as "partitions.length". With "evaluate=false", it checks checked-out or already evaluated partitions in the RDD to get the number of partitions. If neither case applies, it returns -1. RDDInfo.partitionNum calls numPartitions only when it's accessed.

Author: navis.ryu <navis@apache.org>

Closes #7127 from navis/SPARK-8707.
2015-09-02 22:11:51 -07:00
Ilya Ganelin 4bd85d06e0 [SPARK-5945] Spark should not retry a stage infinitely on a FetchFailedException
The ```Stage``` class now tracks whether there were a sufficient number of consecutive failures of that stage to trigger an abort.

To avoid an infinite loop of stage retries, we abort the job completely after 4 consecutive stage failures for one stage. We still allow more than 4 consecutive stage failures if there is an intervening successful attempt for the stage, so that in very long-lived applications, where a stage may get reused many times, we don't abort the job after failures that have been recovered from successfully.
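The counting rule described above can be sketched as follows. This is an illustration under assumptions, not the patch itself: `StageRetrySketch`, `StageAttempts`, and both method names are hypothetical; only the threshold of 4 comes from the description.

```scala
// Illustrative only: tracks consecutive failures of one stage.
// All names are hypothetical; the threshold of 4 is taken from the commit message.
object StageRetrySketch {
  val MaxConsecutiveFailures = 4

  final class StageAttempts {
    private var consecutiveFailures = 0

    /** Records a failed attempt; returns true when the job should be aborted. */
    def recordFailure(): Boolean = {
      consecutiveFailures += 1
      consecutiveFailures >= MaxConsecutiveFailures
    }

    /** An intervening successful attempt resets the count. */
    def recordSuccess(): Unit = {
      consecutiveFailures = 0
    }
  }
}
```

Resetting on success is what lets a long-lived application reuse a stage many times without accumulated, already-recovered failures triggering an abort.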

I've added test cases to exercise the most obvious scenarios.

Author: Ilya Ganelin <ilya.ganelin@capitalone.com>

Closes #5636 from ilganeli/SPARK-5945.
2015-09-02 22:08:24 -07:00
Marcelo Vanzin 2da3a9e98e [SPARK-10004] [SHUFFLE] Perform auth checks when clients read shuffle data.
To correctly isolate applications, when requests to read shuffle data
arrive at the shuffle service, proper authorization checks need to
be performed. This change makes sure that only the application that
created the shuffle data can read from it.

Such checks are only enabled when "spark.authenticate" is enabled,
otherwise there's no secure way to make sure that the client is really
who it says it is.
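A minimal sketch of the kind of check described, assuming the service remembers which application registered each stream of shuffle data. `ShuffleAuthSketch`, `StreamRegistry`, and its methods are hypothetical names, not the shuffle service's API.

```scala
import scala.collection.mutable

// Illustrative only: ownership check for shuffle streams.
// Names are hypothetical; `authEnabled` stands in for "spark.authenticate".
object ShuffleAuthSketch {
  final class StreamRegistry(authEnabled: Boolean) {
    private val owners = mutable.Map.empty[Long, String]

    def register(streamId: Long, appId: String): Unit = {
      owners(streamId) = appId
    }

    /** Throws SecurityException if the client is not the owner of the stream. */
    def checkAuthorization(clientAppId: String, streamId: Long): Unit = {
      // Without authentication the client identity cannot be trusted, so no check is made.
      if (authEnabled && !owners.get(streamId).contains(clientAppId)) {
        throw new SecurityException(s"Client $clientAppId may not read stream $streamId")
      }
    }
  }
}
```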

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #8218 from vanzin/SPARK-10004.
2015-09-02 12:53:24 -07:00
Chuan Shao c3b881a7d7 [SPARK-7336] [HISTORYSERVER] Fix bug that applications status incorrect on JobHistory UI.
Author: ArcherShao <shaochuan@huawei.com>

Closes #5886 from ArcherShao/SPARK-7336.
2015-09-02 11:02:57 -07:00
zhuol ec01280533 [SPARK-4223] [CORE] Support * in acls.
SPARK-4223.

Currently we support setting view and modify acls but you have to specify a list of users. It would be nice to support * meaning all users have access.

Manual tests to verify that "*" works for any user in:
a. Spark UI: view and kill stage. Done.
b. Spark history server. Done.
c. YARN application killing. Done.
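The wildcard rule can be sketched like this, assuming ACLs are given as a comma-separated list of user names. `AclSketch`, `parseAcls`, and `isAllowed` are hypothetical names, not Spark's security API.

```scala
// Illustrative only: ACL membership with a "*" wildcard.
// All names are hypothetical.
object AclSketch {
  val Wildcard = "*"

  /** Splits a comma-separated ACL string into a set of trimmed, non-empty entries. */
  def parseAcls(raw: String): Set[String] =
    raw.split(',').map(_.trim).filter(_.nonEmpty).toSet

  /** A user is allowed if listed explicitly or if the list contains the wildcard. */
  def isAllowed(user: String, acls: Set[String]): Boolean =
    acls.contains(Wildcard) || acls.contains(user)
}
```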

Author: zhuol <zhuol@yahoo-inc.com>

Closes #8398 from zhuoliu/4223.
2015-09-01 11:14:59 -10:00
Davies Liu 540bdee931 [SPARK-10341] [SQL] fix memory starving in unsafe SMJ
In SMJ, the first ExternalSorter could consume all the memory before spilling, then the second can not even acquire the first page.

Until we have a better memory allocator, SMJ should call prepare() before calling compute() on any of its children.

cc rxin JoshRosen

Author: Davies Liu <davies@databricks.com>

Closes #8511 from davies/smj_memory.
2015-08-31 15:55:22 -07:00
EugenCepoi 72f6dbf7b0 [SPARK-8730] Fixes - Deser objects containing a primitive class attribute
Author: EugenCepoi <cepoi.eugen@gmail.com>

Closes #7122 from EugenCepoi/master.
2015-08-31 13:24:35 -05:00
ihainan 1bfd934782 [SPARK-10184] [CORE] Optimization for bounds determination in RangePartitioner
JIRA Issue: https://issues.apache.org/jira/browse/SPARK-10184

Change `cumWeight > target` to `cumWeight >= target` in `RangePartitioner.determineBounds` method to make the output partitions more balanced.
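To see why the comparison matters, here is a simplified sketch of a weighted bound selection with the comparison made switchable. It is not the actual `RangePartitioner` code: `BoundsSketch` and the `inclusive` flag are hypothetical, and keys are plain `Int`s.

```scala
import scala.collection.mutable.ArrayBuffer

// Illustrative only: picks partitions - 1 upper bounds from weighted, sortable keys.
// `inclusive = true` models `cumWeight >= target`; false models `cumWeight > target`.
object BoundsSketch {
  def determineBounds(candidates: Seq[(Int, Double)], partitions: Int, inclusive: Boolean): List[Int] = {
    val ordered = candidates.toVector.sortBy(_._1)
    val step = ordered.map(_._2).sum / partitions
    var cumWeight = 0.0
    var target = step
    val bounds = ArrayBuffer.empty[Int]
    var i = 0
    while (i < ordered.length && bounds.length < partitions - 1) {
      val (key, weight) = ordered(i)
      cumWeight += weight
      val reached = if (inclusive) cumWeight >= target else cumWeight > target
      // Skip duplicate keys so that bounds stay strictly increasing.
      if (reached && (bounds.isEmpty || key > bounds.last)) {
        bounds += key
        target += step
      }
      i += 1
    }
    bounds.toList
  }
}
```

With six equally weighted keys 1 to 6 and three partitions, the inclusive comparison yields bounds 2 and 4, while the strict comparison yields 3 and 5. If keys less than or equal to a bound fall into that bound's partition, the first choice splits the keys 2/2/2 and the second 3/2/1.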

Author: ihainan <ihainan72@gmail.com>

Closes #8397 from ihainan/opt_for_rangepartitioner.
2015-08-30 08:26:14 +01:00
Marcelo Vanzin c53c902fa9 [SPARK-9284] [TESTS] Allow all tests to run without an assembly.
This change aims at speeding up the dev cycle a little bit, by making
sure that all tests behave the same w.r.t. where the code to be tested
is loaded from. Namely, that means that tests don't rely on the assembly
anymore, rather loading all needed classes from the build directories.

The main change is to make sure all build directories (classes and test-classes)
are added to the classpath of child processes when running tests.

YarnClusterSuite required some custom code since the executors are run
differently (i.e. not through the launcher library, like standalone and
Mesos do).

I also found a couple of tests that could leak a SparkContext on failure,
and added code to handle those.

With this patch, it's possible to run the following command from a clean
source directory and have all tests pass:

  mvn -Pyarn -Phadoop-2.4 -Phive-thriftserver install

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #7629 from vanzin/SPARK-9284.
2015-08-28 12:33:40 -07:00
Sean Owen cc39803062 [SPARK-10295] [CORE] Dynamic allocation in Mesos does not release when RDDs are cached
Remove obsolete warning about dynamic allocation not working with cached RDDs

See discussion in https://issues.apache.org/jira/browse/SPARK-10295

Author: Sean Owen <sowen@cloudera.com>

Closes #8489 from srowen/SPARK-10295.
2015-08-28 09:32:23 +01:00
Ram Sriharsha de0278286c [SPARK-10251] [CORE] some common types are not registered for Kryo Serializat…
…ion by default

Author: Ram Sriharsha <rsriharsha@hw11853.local>

Closes #8465 from harsha2010/SPARK-10251.
2015-08-26 23:12:55 -07:00
Sun Rui 71a138cd0e [SPARK-10048] [SPARKR] Support arbitrary nested Java array in serde.
This PR:
1. supports transferring arbitrary nested array from JVM to R side in SerDe;
2. based on 1, the collect() implementation is improved. Now it can support collecting data of complex types
   from a DataFrame.

Author: Sun Rui <rui.sun@intel.com>

Closes #8276 from sun-rui/SPARK-10048.
2015-08-25 13:14:10 -07:00
Sean Owen 69c9c17716 [SPARK-9613] [CORE] Ban use of JavaConversions and migrate all existing uses to JavaConverters
Replace `JavaConversions` implicits with `JavaConverters`

Most occurrences I've seen so far are necessary conversions; a few have been avoidable. None are in critical code as far as I see, yet.

Author: Sean Owen <sowen@cloudera.com>

Closes #8033 from srowen/SPARK-9613.
2015-08-25 12:33:13 +01:00
ehnalis 7f1e507bf7 Fixed a typo in DAGScheduler.
Author: ehnalis <zoltan.zvara@gmail.com>

Closes #8308 from ehnalis/master.
2015-08-25 12:30:06 +01:00
Zhang, Liye 5c14890159 [DOC] add missing parameters in SparkContext.scala for scala doc
Author: Zhang, Liye <liye.zhang@intel.com>

Closes #8412 from liyezhang556520/minorDoc.
2015-08-25 11:48:55 +01:00
Andrew Or 662bb96676 [SPARK-10144] [UI] Actually show peak execution memory by default
The peak execution memory metric was introduced in SPARK-8735. That was before Tungsten was enabled by default, so it assumed that `spark.sql.unsafe.enabled` must be explicitly set to true. The result is that the memory is not displayed by default.

Author: Andrew Or <andrew@databricks.com>

Closes #8345 from andrewor14/show-memory-default.
2015-08-24 14:10:50 -07:00
Imran Rashid 708036c1de [SPARK-9439] [YARN] External shuffle service robust to NM restarts using leveldb
https://issues.apache.org/jira/browse/SPARK-9439

In general, Yarn apps should be robust to NodeManager restarts.  However, if you run spark with the external shuffle service on, after a NM restart all shuffles fail, b/c the shuffle service has lost some state with info on each executor.  (Note the shuffle data is perfectly fine on disk across a NM restart, the problem is we've lost the small bit of state that lets us *find* those files.)

The solution proposed here is that the external shuffle service can write out its state to leveldb (backed by a local file) every time an executor is added.  When running with yarn, that file is in the NM's local dir.  Whenever the service is started, it looks for that file, and if it exists, it reads the file and re-registers all executors there.

Nothing is changed in non-yarn modes with this patch.  The service is not given a place to save the state to, so it operates the same as before.  This should make it easy to update other cluster managers as well, by just supplying the right file & the equivalent of yarn's `initializeApplication` -- I'm not familiar enough with those modes to know how to do that.

Author: Imran Rashid <irashid@cloudera.com>

Closes #7943 from squito/leveldb_external_shuffle_service_NM_restart and squashes the following commits:

0d285d3 [Imran Rashid] review feedback
70951d6 [Imran Rashid] Merge branch 'master' into leveldb_external_shuffle_service_NM_restart
5c71c8c [Imran Rashid] save executor to db before registering; style
2499c8c [Imran Rashid] explicit dependency on jackson-annotations
795d28f [Imran Rashid] review feedback
81f80e2 [Imran Rashid] Merge branch 'master' into leveldb_external_shuffle_service_NM_restart
594d520 [Imran Rashid] use json to serialize application executor info
1a7980b [Imran Rashid] version
8267d2a [Imran Rashid] style
e9f99e8 [Imran Rashid] cleanup the handling of bad dbs a little
9378ba3 [Imran Rashid] fail gracefully on corrupt leveldb files
acedb62 [Imran Rashid] switch to writing out one record per executor
79922b7 [Imran Rashid] rely on yarn to call stopApplication; assorted cleanup
12b6a35 [Imran Rashid] save registered executors when apps are removed; add tests
c878fbe [Imran Rashid] better explanation of shuffle service port handling
694934c [Imran Rashid] only open leveldb connection once per service
d596410 [Imran Rashid] store executor data in leveldb
59800b7 [Imran Rashid] Files.move in case renaming is unsupported
32fe5ae [Imran Rashid] Merge branch 'master' into external_shuffle_service_NM_restart
d7450f0 [Imran Rashid] style
f729e2b [Imran Rashid] debugging
4492835 [Imran Rashid] lol, dont use a PrintWriter b/c of scalastyle checks
0a39b98 [Imran Rashid] Merge branch 'master' into external_shuffle_service_NM_restart
55f49fc [Imran Rashid] make sure the service doesnt die if the registered executor file is corrupt; add tests
245db19 [Imran Rashid] style
62586a6 [Imran Rashid] just serialize the whole executors map
bdbbf0d [Imran Rashid] comments, remove some unnecessary changes
857331a [Imran Rashid] better tests & comments
bb9d1e6 [Imran Rashid] formatting
bdc4b32 [Imran Rashid] rename
86e0cb9 [Imran Rashid] for tests, shuffle service finds an open port
23994ff [Imran Rashid] style
7504de8 [Imran Rashid] style
a36729c [Imran Rashid] cleanup
efb6195 [Imran Rashid] proper unit test, and no longer leak if apps stop during NM restart
dd93dc0 [Imran Rashid] test for shuffle service w/ NM restarts
d596969 [Imran Rashid] cleanup imports
0e9d69b [Imran Rashid] better names
9eae119 [Imran Rashid] cleanup lots of duplication
1136f44 [Imran Rashid] test needs to have an actual shuffle
0b588bd [Imran Rashid] more fixes ...
ad122ef [Imran Rashid] more fixes
5e5a7c3 [Imran Rashid] fix build
c69f46b [Imran Rashid] maybe working version, needs tests & cleanup ...
bb3ba49 [Imran Rashid] minor cleanup
36127d3 [Imran Rashid] wip
b9d2ced [Imran Rashid] incomplete setup for external shuffle service tests
2015-08-21 08:41:36 -05:00
Xiangrui Meng cdd9a2bb10 [SPARK-10140] [DOC] add target fields to @Since
so constructor parameters and public fields can be annotated. rxin MechCoder

Author: Xiangrui Meng <meng@databricks.com>

Closes #8344 from mengxr/SPARK-10140.2.
2015-08-20 20:01:13 -07:00
Alex Shkurenko 39e91fe2fd [SPARK-9982] [SPARKR] SparkR DataFrame fail to return data of Decimal type
Author: Alex Shkurenko <ashkurenko@enova.com>

Closes #8239 from ashkurenko/master.
2015-08-20 10:16:38 -07:00
Timothy Chen 73431d8afb [SPARK-10124] [MESOS] Fix removing queued driver in mesos cluster mode.
Currently Spark applications can be queued to the Mesos cluster dispatcher, but when multiple jobs are in the queue we don't handle removing jobs from the buffer correctly while iterating, which causes a null pointer exception.

This patch copies the buffer before iterating over it, so exceptions aren't thrown when the jobs are removed.
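The copy-before-iterate pattern looks roughly like this. `QueuedDriversSketch` and `launchQueued` are hypothetical names, and drivers are reduced to strings for illustration.

```scala
import scala.collection.mutable.ArrayBuffer

// Illustrative only: remove elements from a mutable buffer while walking a snapshot of it.
// All names are hypothetical.
object QueuedDriversSketch {
  /** Removes every queued driver that `launched` accepts and returns the removed ones. */
  def launchQueued(queue: ArrayBuffer[String], launched: String => Boolean): List[String] = {
    val removed = ArrayBuffer.empty[String]
    // Iterate over an immutable copy so that mutating `queue` cannot disturb the traversal.
    for (driver <- queue.toList) {
      if (launched(driver)) {
        queue -= driver
        removed += driver
      }
    }
    removed.toList
  }
}
```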

Author: Timothy Chen <tnachen@gmail.com>

Closes #8322 from tnachen/fix_cluster_mode.
2015-08-19 19:43:26 -07:00
Marcelo Vanzin e0dd1309ac [SPARK-10119] [CORE] Fix isDynamicAllocationEnabled when config is explicitly disabled.
Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #8316 from vanzin/SPARK-10119.
2015-08-19 14:33:32 -07:00
Joshi f3391ff2b8 [SPARK-8889] [CORE] Fix for OOM for graph creation
Fix for OOM for graph creation

Author: Joshi <rekhajoshm@gmail.com>
Author: Rekha Joshi <rekhajoshm@gmail.com>

Closes #7602 from rekhajoshm/SPARK-8889.
2015-08-19 21:23:02 +01:00
Yu ISHIKAWA 2fcb9cb955 [SPARK-9856] [SPARKR] Add expression functions into SparkR whose params are complicated
I added lots of Column functions into SparkR. I also added `rand(seed: Int)` and `randn(seed: Int)` in Scala, since we need such APIs for the R integer type.

### JIRA
[[SPARK-9856] Add expression functions into SparkR whose params are complicated - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-9856)

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8264 from yu-iskw/SPARK-9856-3.
2015-08-19 10:41:14 -07:00
Han JU 3d16a54500 [SPARK-8949] Print warnings when using preferred locations feature
Add warnings according to SPARK-8949 in `SparkContext`
  - warnings in scaladoc
  - log warnings when preferred locations feature is used through `SparkContext`'s constructor

However, I didn't find any documentation reference for this feature. Please direct me if you know of any reference to this feature.

Author: Han JU <ju.han.felix@gmail.com>

Closes #7874 from darkjh/SPARK-8949.
2015-08-19 13:04:16 +01:00
Tathagata Das bc9a0e0323 [SPARK-9967] [SPARK-10099] [STREAMING] Renamed conf spark.streaming.backpressure.{enable-->enabled} and fixed deprecated annotations
Small changes
- Renamed conf spark.streaming.backpressure.{enable --> enabled}
- Change Java Deprecated annotations to Scala deprecated annotation with more information.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #8299 from tdas/SPARK-9967.
2015-08-18 23:37:57 -07:00
Josh Rosen 010b03ed52 [SPARK-9952] Fix N^2 loop when DAGScheduler.getPreferredLocsInternal accesses cacheLocs
In Scala, `Seq.fill` always seems to return a List. Accessing a list by index is an O(N) operation. Thus, the following code will be really slow (~10 seconds on my machine):

```scala
val numItems = 100000
val s = Seq.fill(numItems)(1)
for (i <- 0 until numItems) s(i)
```

It turns out that we had a loop like this in DAGScheduler code, although it's a little tricky to spot. In `getPreferredLocsInternal`, there's a call to `getCacheLocs(rdd)(partition)`.  The `getCacheLocs` call returns a Seq. If this Seq is a List and the RDD contains many partitions, then indexing into this list will cost O(partitions). Thus, when we loop over our tasks to compute their individual preferred locations we implicitly perform an N^2 loop, reducing scheduling throughput.

This patch fixes this by replacing `Seq` with `Array`.
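For comparison with the slow loop above, a sketch of the same positional access pattern over an `Array`, where each lookup is constant time. `IndexedAccessSketch` and `sumByIndex` are hypothetical names used only for illustration.

```scala
// Illustrative only: index-based traversal over an Array.
// Each xs(i) is O(1), so the whole loop is O(N) rather than O(N^2).
object IndexedAccessSketch {
  def sumByIndex(xs: Array[Int]): Long = {
    var total = 0L
    var i = 0
    while (i < xs.length) {
      total += xs(i)
      i += 1
    }
    total
  }
}
```

Calling `IndexedAccessSketch.sumByIndex(Array.fill(100000)(1))` performs the same number of lookups as the `Seq.fill` loop, but no lookup has to walk a linked list.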

Author: Josh Rosen <joshrosen@databricks.com>

Closes #8178 from JoshRosen/dagscheduler-perf.
2015-08-18 22:30:13 -07:00
Marcelo Vanzin c1840a862e [SPARK-7736] [CORE] Fix a race introduced in PythonRunner.
The fix for SPARK-7736 introduced a race where a port value of "-1"
could be passed down to the pyspark process, causing it to fail to
connect back to the JVM. This change adds code to fix that race.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #8258 from vanzin/SPARK-7736.
2015-08-18 11:36:36 -07:00
CodingCat c34e9ff0ea [MINOR] fix the comments in IndexShuffleBlockResolver
This might be a typo introduced at the start, or a leftover from some renaming.

the name of the method accessing the index file is called `getBlockData` now (not `getBlockLocation` as indicated in the comments)

Author: CodingCat <zhunansjtu@gmail.com>

Closes #8238 from CodingCat/minor_1.
2015-08-18 10:31:11 +01:00
Marcelo Vanzin f68d024096 [SPARK-7736] [CORE] [YARN] Make pyspark fail YARN app on failure.
The YARN backend doesn't like when user code calls `System.exit`,
since it cannot know the exit status and thus cannot set an
appropriate final status for the application.

So, for pyspark, avoid that call and instead throw an exception with
the exit code. SparkSubmit handles that exception and exits with
the given exit code, while YARN uses the exit code as the failure
code for the Spark app.
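The idea of carrying the exit code in an exception instead of calling `System.exit` can be sketched as below. The names `ExitCodeSketch`, `UserAppExit`, `runPythonApp`, and `exitCodeFor` are hypothetical and are not the classes touched by this change.

```scala
// Illustrative only: propagate a child process exit code as an exception.
// All names are hypothetical.
object ExitCodeSketch {
  final case class UserAppExit(exitCode: Int)
    extends RuntimeException(s"User application exited with $exitCode")

  /** Stands in for the code that waits on the Python process. */
  def runPythonApp(processExitCode: Int): Unit = {
    // Throw instead of exiting, so the caller can decide the final status.
    if (processExitCode != 0) throw UserAppExit(processExitCode)
  }

  /** Stands in for the launcher: maps the exception back to an exit code. */
  def exitCodeFor(app: () => Unit): Int =
    try { app(); 0 } catch { case UserAppExit(code) => code }
}
```

The point of the design is that the caller, not the user code path, decides what to do with the code: a launcher can exit with it, while a cluster backend can record it as the application's failure code.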

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #7751 from vanzin/SPARK-9416.
2015-08-17 10:34:22 -07:00
Rohit Agarwal ed092a06c2 [SPARK-9924] [WEB UI] Don't schedule checkForLogs while some of them are already running.
Author: Rohit Agarwal <rohita@qubole.com>

Closes #8153 from mindprince/SPARK-9924.
2015-08-17 10:31:57 -07:00
Matei Zaharia cf016075a0 [SPARK-10008] Ensure shuffle locality doesn't take precedence over narrow deps
The shuffle locality patch made the DAGScheduler aware of shuffle data,
but for RDDs that have both narrow and shuffle dependencies, it can
cause them to place tasks based on the shuffle dependency instead of the
narrow one. This case is common in iterative join-based algorithms like
PageRank and ALS, where one RDD is hash-partitioned and one isn't.

Author: Matei Zaharia <matei@databricks.com>

Closes #8220 from mateiz/shuffle-loc-fix.
2015-08-16 00:34:58 -07:00
Herman van Hovell a85fb6c07f [SPARK-9980] [BUILD] Fix SBT publishLocal error due to invalid characters in doc
Tiny modification to a few comments to make ```sbt publishLocal``` work again.

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #8209 from hvanhovell/SPARK-9980.
2015-08-15 10:46:04 +01:00
Davies Liu 37586e5449 [HOTFIX] fix duplicated braces
Author: Davies Liu <davies@databricks.com>

Closes #8219 from davies/fix_typo.
2015-08-14 20:56:55 -07:00
Reynold Xin e5fd60415f [SPARK-9934] Deprecate NIO ConnectionManager.
Deprecate NIO ConnectionManager in Spark 1.5.0, before removing it in Spark 1.6.0.

Author: Reynold Xin <rxin@databricks.com>

Closes #8162 from rxin/SPARK-9934.
2015-08-14 20:55:32 -07:00
jerryshao 9407baa2a7 [SPARK-9877] [CORE] Fix StandaloneRestServer NPE when submitting application
A detailed exception log can be seen in [SPARK-9877](https://issues.apache.org/jira/browse/SPARK-9877). The problem is that when `StandaloneRestServer` is created, `self` (`masterEndpoint`) is null. This fix creates `StandaloneRestServer` only once `self` is available.

Author: jerryshao <sshao@hortonworks.com>

Closes #8127 from jerryshao/SPARK-9877.
2015-08-14 13:44:38 -07:00
Andrew Or 6518ef6303 [SPARK-9948] Fix flaky AccumulatorSuite - internal accumulators
In these tests, we use a custom listener and we assert on fields in the stage / task completion events. However, these events are posted in a separate thread so they're not guaranteed to be posted in time. This commit fixes this flakiness through a job end registration callback.

Author: Andrew Or <andrew@databricks.com>

Closes #8176 from andrewor14/fix-accumulator-suite.
2015-08-14 13:42:53 -07:00
Carson Wang 33bae585d4 [SPARK-9809] Task crashes because the internal accumulators are not properly initialized
When a stage failed and another stage was resubmitted with only part of partitions to compute, all the tasks failed with error message: java.util.NoSuchElementException: key not found: peakExecutionMemory.
This is because the internal accumulators are not properly initialized for this stage, while other code assumes the internal accumulators always exist.

Author: Carson Wang <carson.wang@intel.com>

Closes #8090 from carsonwang/SPARK-9809.
2015-08-14 13:38:25 -07:00
Neelesh Srinivas Salian 57c2d08800 [SPARK-9923] [CORE] ShuffleMapStage.numAvailableOutputs should be an Int instead of Long
Modified type of ShuffleMapStage.numAvailableOutputs from Long to Int

Author: Neelesh Srinivas Salian <nsalian@cloudera.com>

Closes #8183 from nssalian/SPARK-9923.
2015-08-14 20:03:50 +01:00
Davies Liu bd35385d53 [SPARK-9945] [SQL] pageSize should be calculated from executor.memory
Currently, pageSize of TungstenSort is calculated from driver.memory; it should use executor.memory instead.

Also, in the worst case, the safeFactor could be 4 (because of rounding), so increase it to 16.

cc rxin

Author: Davies Liu <davies@databricks.com>

Closes #8175 from davies/page_size.
2015-08-13 21:12:59 -07:00
Andrew Or 8815ba2f67 [SPARK-9649] Fix MasterSuite, third time's a charm
This particular test did not load the default configurations so
it continued to start the REST server, which causes port bind
exceptions.
2015-08-13 11:31:10 -07:00
Davies Liu a8ab2634c1 [SPARK-9832] [SQL] add a thread-safe lookup for BytesToBytesMap
This patch adds a thread-safe lookup for BytesToBytesMap, and uses it in the broadcasted HashedRelation.

Author: Davies Liu <davies@databricks.com>

Closes #8151 from davies/safeLookup.
2015-08-12 21:26:00 -07:00
Josh Rosen 7b13ed27c1 [SPARK-9870] Disable driver UI and Master REST server in SparkSubmitSuite
I think that we should pass additional configuration flags to disable the driver UI and Master REST server in SparkSubmitSuite and HiveSparkSubmitSuite. This might cut down on port-contention-related flakiness in Jenkins.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #8124 from JoshRosen/disable-ui-in-sparksubmitsuite.
2015-08-12 18:52:11 -07:00
Rohit Agarwal 0d1d146c22 [SPARK-9724] [WEB UI] Avoid unnecessary redirects in the Spark Web UI.
Author: Rohit Agarwal <rohita@qubole.com>

Closes #8014 from mindprince/SPARK-9724 and squashes the following commits:

a7af5ff [Rohit Agarwal] [SPARK-9724] [WEB UI] Inline attachPrefix and attachPrefixForRedirect. Fix logic of attachPrefix
8a977cd [Rohit Agarwal] [SPARK-9724] [WEB UI] Address review comments: Remove unneeded code, update scaladoc.
b257844 [Rohit Agarwal] [SPARK-9724] [WEB UI] Avoid unnecessary redirects in the Spark Web UI.
2015-08-12 17:48:43 -07:00
Michel Lemay ab7e721cfe [SPARK-9826] [CORE] Fix cannot use custom classes in log4j.properties
Refactor Utils class and create ShutdownHookManager.

NOTE: Wasn't able to run /dev/run-tests on windows machine.
Manual tests were conducted locally using custom log4j.properties file with Redis appender and logstash formatter (bundled in the fat-jar submitted to spark)

ex:
log4j.rootCategory=WARN,console,redis
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

log4j.logger.org.eclipse.jetty=WARN
log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.spark.graphx.Pregel=INFO

log4j.appender.redis=com.ryantenney.log4j.FailoverRedisAppender
log4j.appender.redis.endpoints=hostname:port
log4j.appender.redis.key=mykey
log4j.appender.redis.alwaysBatch=false
log4j.appender.redis.layout=net.logstash.log4j.JSONEventLayoutV1

Author: michellemay <mlemay@gmail.com>

Closes #8109 from michellemay/SPARK-9826.
2015-08-12 16:41:35 -07:00
Niranjan Padmanabhan 738f353988 [SPARK-9092] Fixed incompatibility when both num-executors and dynamic...
… allocation are set. Now, dynamic allocation is set to false when num-executors is explicitly specified as an argument. Consequently, executorAllocationManager is not initialized in the SparkContext.

Author: Niranjan Padmanabhan <niranjan.padmanabhan@cloudera.com>

Closes #7657 from neurons/SPARK-9092.
2015-08-12 16:10:21 -07:00
Xiangrui Meng 6f60298b1d [SPARK-8967] [DOC] add Since annotation
Add `Since` as a Scala annotation. The benefit is that we can use it without having explicit JavaDoc. This is useful for inherited methods. The limitation is that it doesn't show up in the generated Java API documentation. This might be fixed by modifying genjavadoc. I think we could leave it as a TODO.

This is how the generated Scala doc looks:

`since` JavaDoc tag:

![screen shot 2015-08-11 at 10 00 37 pm](https://cloud.githubusercontent.com/assets/829644/9230761/fa72865c-40d8-11e5-807e-0f3c815c5acd.png)

`Since` annotation:

![screen shot 2015-08-11 at 10 00 28 pm](https://cloud.githubusercontent.com/assets/829644/9230764/0041d7f4-40d9-11e5-8124-c3f3e5d5b31f.png)

rxin

Author: Xiangrui Meng <meng@databricks.com>

Closes #8131 from mengxr/SPARK-8967.
2015-08-12 14:28:23 -07:00
Andrew Or e0110792ef [SPARK-9747] [SQL] Avoid starving an unsafe operator in aggregation
This is the sister patch to #8011, but for aggregation.

In a nutshell: create the `TungstenAggregationIterator` before computing the parent partition. Internally this creates a `BytesToBytesMap` which acquires a page in the constructor as of this patch. This ensures that the aggregation operator is not starved since we reserve at least 1 page in advance.

rxin yhuai

Author: Andrew Or <andrew@databricks.com>

Closes #8038 from andrewor14/unsafe-starve-memory-agg.
2015-08-12 10:08:35 -07:00
Andrew Or be5d191207 [SPARK-9795] Dynamic allocation: avoid double counting when killing same executor twice
This is based on KaiXinXiaoLei's changes in #7716.

The issue is that when someone calls `sc.killExecutor("1")` on the same executor twice quickly, then the executor target will be adjusted downwards by 2 instead of 1 even though we're only actually killing one executor. In certain cases where we don't adjust the target back upwards quickly, we'll end up with jobs hanging.

This is a common danger because there are many places where this is called:
- `HeartbeatReceiver` kills an executor that has not been sending heartbeats
- `ExecutorAllocationManager` kills an executor that has been idle
- The user code might call this, which may interfere with the previous callers

While it's not clear whether this fixes SPARK-9745, fixing this potential race condition seems like a strict improvement. I've added a regression test to illustrate the issue.
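
One way to picture the double-counting guard is a set of executors that are already pending removal. This is a sketch under that assumption; `KillTracker` is a hypothetical class, not the code in `ExecutorAllocationManager`.

```scala
import scala.collection.mutable

// Hypothetical tracker: the target drops once per executor, no matter how
// many kill requests arrive for the same id.
class KillTracker(initialTarget: Int) {
  private val pendingKill = mutable.Set.empty[String]
  private var target = initialTarget

  def currentTarget: Int = synchronized { target }

  // Returns true only for the first kill request of a given executor.
  def killExecutor(id: String): Boolean = synchronized {
    if (pendingKill.add(id)) { target -= 1; true } else false
  }
}
```

With this guard, two quick calls for executor "1" lower the target by 1 instead of 2.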

Author: Andrew Or <andrew@databricks.com>

Closes #8078 from andrewor14/da-double-kill.
2015-08-12 09:24:50 -07:00
Tom White 2e680668f7 [SPARK-8625] [CORE] Propagate user exceptions in tasks back to driver
This allows clients to retrieve the original exception from the
cause field of the SparkException that is thrown by the driver.
If the original exception is not in fact Serializable then it will
not be returned, but the message and stacktrace will be. (All Java
Throwables implement the Serializable interface, but this is no
guarantee that a particular implementation can actually be
serialized.)
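
The caveat in parentheses can be illustrated with a small sketch. `ExceptionTransport` and `BadException` are hypothetical names; the real patch may decide this differently.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import scala.util.Try

object ExceptionTransport {
  // True when Java serialization of the throwable actually succeeds.
  def canSerialize(t: Throwable): Boolean = Try {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try out.writeObject(t) finally out.close()
  }.isSuccess

  // Hypothetical payload: the exception itself when it serializes,
  // otherwise only its message and stack trace.
  def toTransport(t: Throwable): Either[(String, Array[StackTraceElement]), Throwable] =
    if (canSerialize(t)) Right(t)
    else Left((String.valueOf(t.getMessage), t.getStackTrace))
}

// Throwable implements Serializable, yet this instance cannot be written
// because it holds a field whose type (Thread) is not Serializable.
class BadException(msg: String) extends Exception(msg) {
  val handle: Thread = new Thread()
}
```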

Author: Tom White <tom@cloudera.com>

Closes #7014 from tomwhite/propagate-user-exceptions.
2015-08-12 10:07:11 -05:00
Timothy Chen 5c99d8bf98 [SPARK-8798] [MESOS] Allow additional uris to be fetched with mesos
Some users like to download additional files into their sandbox that they can refer to from their Spark program, or even later mount these files to another directory.

Author: Timothy Chen <tnachen@gmail.com>

Closes #7195 from tnachen/mesos_files.
2015-08-11 23:26:33 -07:00
Carson Wang bab8923285 [SPARK-9426] [WEBUI] Job page DAG visualization is not shown
To reproduce the issue, go to the stage page and click DAG Visualization once, then go to the job page to show the job DAG visualization. You will only see the first stage of the job.
Root cause: the JavaScript uses local storage to remember your selection. Once you click the stage DAG visualization, the script sets `expand-dag-viz-arrow-stage` to true in local storage. When you go to the job page, the JS checks `expand-dag-viz-arrow-stage` in local storage first and tries to show the stage DAG visualization on the job page.
To fix this, I set an id on the DAG span to distinguish the job page from the stage page. In the JS code, we check the id and local storage together to make sure we show the correct DAG visualization.

Author: Carson Wang <carson.wang@intel.com>

Closes #8104 from carsonwang/SPARK-9426.
2015-08-11 23:25:02 -07:00
zsxwing 4e3f4b934f [SPARK-9829] [WEBUI] Display the update value for peak execution memory
The peak execution memory is not correct because it shows the sum of finished tasks' values when a task finishes.

This PR fixes it by using the update value rather than the accumulator value.

Author: zsxwing <zsxwing@gmail.com>

Closes #8121 from zsxwing/SPARK-9829.
2015-08-11 23:23:17 -07:00
Rohit Agarwal a807fcbe50 [SPARK-9806] [WEB UI] Don't share ReplayListenerBus between multiple applications
Author: Rohit Agarwal <rohita@qubole.com>

Closes #8088 from mindprince/SPARK-9806.
2015-08-11 23:20:39 -07:00
xutingjun b85f9a242a [SPARK-8366] maxNumExecutorsNeeded should properly handle failed tasks
Author: xutingjun <xutingjun@huawei.com>
Author: meiyoula <1039320815@qq.com>

Closes #6817 from XuTingjun/SPARK-8366.
2015-08-11 23:19:35 -07:00
zsxwing f16bc68dfb [SPARK-9824] [CORE] Fix the issue that InternalAccumulator leaks WeakReference
`InternalAccumulator.create` doesn't call `registerAccumulatorForCleanup` to register itself with ContextCleaner, so `WeakReference`s for these accumulators in `Accumulators.originals` won't be removed.

This PR added `registerAccumulatorForCleanup` for internal accumulators to avoid the memory leak.

Author: zsxwing <zsxwing@gmail.com>

Closes #8108 from zsxwing/internal-accumulators-leak.
2015-08-11 14:06:23 -07:00
Jeff Zhang bce72797f3 Fix comment error
API is updated but its doc comment is not updated.

Author: Jeff Zhang <zjffdu@apache.org>

Closes #8097 from zjffdu/dev.
2015-08-11 10:42:17 -07:00
Reynold Xin d378396f86 [SPARK-9815] Rename PlatformDependent.UNSAFE -> Platform.
PlatformDependent.UNSAFE is way too verbose.

Author: Reynold Xin <rxin@databricks.com>

Closes #8094 from rxin/SPARK-9815 and squashes the following commits:

229b603 [Reynold Xin] [SPARK-9815] Rename PlatformDependent.UNSAFE -> Platform.
2015-08-11 08:41:06 -07:00
Marcelo Vanzin 0f3366a4c7 [SPARK-9710] [TEST] Fix RPackageUtilsSuite when R is not available.
RUtils.isRInstalled throws an exception if R is not installed,
instead of returning false. Fix that.
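
A sketch of the "return false instead of throwing" pattern, assuming the check launches an external command. `ToolCheck.isInstalled` is a hypothetical helper, not the `RUtils` code.

```scala
import scala.util.Try

object ToolCheck {
  // True only if the command starts and exits with code 0. Launching a
  // missing executable throws an IOException, which is mapped to false.
  def isInstalled(command: String*): Boolean = Try {
    val proc = new ProcessBuilder(command: _*).redirectErrorStream(true).start()
    val out = proc.getInputStream
    while (out.read() != -1) {}   // drain output so the child cannot block
    proc.waitFor() == 0
  }.getOrElse(false)
}
```

A caller could then write `ToolCheck.isInstalled("R", "--version")` and skip R-dependent tests when it returns false.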

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #8008 from vanzin/SPARK-9710 and squashes the following commits:

df72d8c [Marcelo Vanzin] [SPARK-9710] [test] Fix RPackageUtilsSuite when R is not available.
2015-08-10 10:10:40 -07:00
Shivaram Venkataraman 46025616b4 [CORE] [SPARK-9760] Use Option instead of Some for Ivy repos
This was introduced in #7599

cc rxin brkyvz
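
For context, the difference between the two constructors shows up when the wrapped value is null. The object and method names below are hypothetical.

```scala
object IvyRepos {
  // Some(null) is a defined Option that holds null.
  def wrongWrap(repos: String): Option[String] = Some(repos)

  // Option(null) collapses to None, so callers can pattern match safely.
  def safeWrap(repos: String): Option[String] = Option(repos)
}
```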

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #8055 from shivaram/spark-packages-repo-fix and squashes the following commits:

890f306 [Shivaram Venkataraman] Remove test case
51d69ee [Shivaram Venkataraman] Add test case for --packages without --repository
c02e0b4 [Shivaram Venkataraman] Use Option instead of Some for Ivy repos
2015-08-09 14:30:30 -07:00
Reynold Xin e9c36938ba [SPARK-9752][SQL] Support UnsafeRow in Sample operator.
In order for this to work, I had to disable gap sampling.

Author: Reynold Xin <rxin@databricks.com>

Closes #8040 from rxin/SPARK-9752 and squashes the following commits:

f9e248c [Reynold Xin] Fix the test case for real this time.
adbccb3 [Reynold Xin] Fixed test case.
589fb23 [Reynold Xin] Merge branch 'SPARK-9752' of github.com:rxin/spark into SPARK-9752
55ccddc [Reynold Xin] Fixed core test.
78fa895 [Reynold Xin] [SPARK-9752][SQL] Support UnsafeRow in Sample operator.
c9e7112 [Reynold Xin] [SPARK-9752][SQL] Support UnsafeRow in Sample operator.
2015-08-09 10:58:36 -07:00
Carson Wang ef062c1599 [SPARK-9731] Standalone scheduling incorrect cores if spark.executor.cores is not set
The issue only happens if `spark.executor.cores` is not set and executor memory is set to a high value.
For example, if we have a worker with 4G and 10 cores and we set `spark.executor.memory` to 3G, then only 1 core is assigned to the executor. The correct number should be 10 cores.
I've added a unit test to illustrate the issue.
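
The expected outcome of the example can be written as a small sketch. It assumes that, when `spark.executor.cores` is unset, a worker runs one executor that takes all of its free cores; `CoreAssignment` is a hypothetical helper, not the Master's scheduling code.

```scala
object CoreAssignment {
  // Cores given to each executor launched on one worker (empty if none fits).
  def assign(workerCores: Int, workerMemMb: Int, execMemMb: Int,
             execCores: Option[Int]): Seq[Int] =
    execCores match {
      case None =>
        // Unset: a single executor takes every core, if its memory fits.
        if (workerMemMb >= execMemMb && workerCores > 0) Seq(workerCores) else Seq.empty
      case Some(c) =>
        // Set: as many executors as both cores and memory allow.
        Seq.fill(math.min(workerCores / c, workerMemMb / execMemMb))(c)
    }
}
```

For the worker above (4G, 10 cores) and 3G executors, `assign(10, 4096, 3072, None)` gives one executor with 10 cores.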

Author: Carson Wang <carson.wang@intel.com>

Closes #8017 from carsonwang/SPARK-9731 and squashes the following commits:

d09ec48 [Carson Wang] Fix code style
86b651f [Carson Wang] Simplify the code
943cc4c [Carson Wang] fix scheduling correct cores to executors
2015-08-07 23:36:26 -07:00
Andrew Or 881548ab20 [SPARK-9674] Re-enable ignored test in SQLQuerySuite
The original code that this test covers was removed in 9270bd06fd. The test was ignored shortly before that, so we never caught it. This patch re-enables the test and adds the code necessary to make it pass.

JoshRosen yhuai

Author: Andrew Or <andrew@databricks.com>

Closes #8015 from andrewor14/SPARK-9674 and squashes the following commits:

225eac2 [Andrew Or] Merge branch 'master' of github.com:apache/spark into SPARK-9674
8c24209 [Andrew Or] Fix NPE
e541d64 [Andrew Or] Track aggregation memory for both sort and hash
0be3a42 [Andrew Or] Fix test
2015-08-07 14:20:13 -07:00
zsxwing ebfd91c542 [SPARK-9467][SQL]Add SQLMetric to specialize accumulators to avoid boxing
This PR adds SQLMetric/SQLMetricParam/SQLMetricValue to specialize accumulators to avoid boxing. All SQL metrics should use these classes rather than `Accumulator`.

Author: zsxwing <zsxwing@gmail.com>

Closes #7996 from zsxwing/sql-accu and squashes the following commits:

14a5f0a [zsxwing] Address comments
367ca23 [zsxwing] Use localValue directly to avoid changing Accumulable
42f50c3 [zsxwing] Add SQLMetric to specialize accumulators to avoid boxing
2015-08-07 00:09:58 -07:00
Davies Liu 15bd6f338d [SPARK-9453] [SQL] support records larger than page size in UnsafeShuffleExternalSorter
This patch follows exactly #7891 (except testing)

Author: Davies Liu <davies@databricks.com>

Closes #8005 from davies/larger_record and squashes the following commits:

f9c4aff [Davies Liu] address comments
9de5c72 [Davies Liu] support records larger than page size in UnsafeShuffleExternalSorter
2015-08-06 23:40:38 -07:00
Reynold Xin 4309262ec9 [SPARK-9700] Pick default page size more intelligently.
Previously, we used 64MB as the default page size, which was way too big for a lot of Spark applications (especially on a single node).

This patch changes it so that the default page size, if unset by the user, is determined by the number of cores available and the total execution memory available.
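
A sketch of how such a default could be derived. The 1 MB lower bound, the safety factor of 16 and the rounding to a power of two are assumptions for illustration; only the 64 MB upper value comes from the text above.

```scala
object PageSize {
  private val MinPage = 1L << 20     // 1 MB, assumed lower bound
  private val MaxPage = 64L << 20    // 64 MB, the previous fixed default
  private val SafetyFactor = 16      // assumed headroom divisor

  private def nextPowerOf2(n: Long): Long = {
    val high = java.lang.Long.highestOneBit(n)
    if (high == n) n else high << 1
  }

  // Share the execution memory across cores, leave headroom, then clamp.
  def defaultPageSize(executionMemory: Long, cores: Int): Long = {
    val perCore = executionMemory / cores / SafetyFactor
    math.min(MaxPage, math.max(MinPage, nextPowerOf2(perCore)))
  }
}
```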

Author: Reynold Xin <rxin@databricks.com>

Closes #8012 from rxin/pagesize and squashes the following commits:

16f4756 [Reynold Xin] Fixed failing test.
5afd570 [Reynold Xin] private...
0d5fb98 [Reynold Xin] Update default value.
674a6cd [Reynold Xin] Address review feedback.
dc00e05 [Reynold Xin] Merge with master.
73ebdb6 [Reynold Xin] [SPARK-9700] Pick default page size more intelligently.
2015-08-06 23:18:29 -07:00
zsxwing 672f467668 [SPARK-8057][Core]Call TaskAttemptContext.getTaskAttemptID using Reflection
Someone may use the Spark core jar in the maven repo with hadoop 1. SPARK-2075 has already resolved the compatibility issue to support it. But `SparkHadoopMapRedUtil.commitTask` broke it recently.

This PR uses Reflection to call `TaskAttemptContext.getTaskAttemptID` to fix the compatibility issue.
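
The general technique looks like the sketch below. `ReflectiveCall.invokeNoArg` is a hypothetical helper; the method is resolved by name at runtime, so the call site does not depend on whether the target type was compiled as a class or an interface.

```scala
object ReflectiveCall {
  // Looks up a public no-argument method by name and invokes it.
  def invokeNoArg(target: AnyRef, methodName: String): AnyRef = {
    val method = target.getClass.getMethod(methodName)
    method.invoke(target)
  }
}
```

In the spirit of this PR, a call would pass the task attempt context and the method name `"getTaskAttemptID"`.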

Author: zsxwing <zsxwing@gmail.com>

Closes #6599 from zsxwing/SPARK-8057 and squashes the following commits:

f7a343c [zsxwing] Remove the redundant import
6b7f1af [zsxwing] Call TaskAttemptContext.getTaskAttemptID using Reflection
2015-08-06 21:42:42 -07:00
Andrew Or 014a9f9d8c [SPARK-9709] [SQL] Avoid starving unsafe operators that use sort
The issue is that a task may run multiple sorts, and the sorts run by the child operator (i.e. parent RDD) may acquire all available memory such that other sorts in the same task do not have enough to proceed. This manifests itself in an `IOException("Unable to acquire X bytes of memory")` thrown by `UnsafeExternalSorter`.

The solution is to reserve a page in each sorter in the chain before computing the child operator's (parent RDD's) partitions. This requires us to use a new special RDD that does some preparation before computing the parent's partitions.
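
The effect of reserving a page up front can be sketched with a toy memory pool. `PagePool` and `Sorter` are hypothetical stand-ins, not the shuffle memory manager or `UnsafeExternalSorter`.

```scala
// Hypothetical fixed pool of pages shared by the operators of one task.
class PagePool(total: Int) {
  private var free = total
  def acquire(n: Int): Boolean = synchronized {
    if (n <= free) { free -= n; true } else false
  }
  def available: Int = synchronized { free }
}

class Sorter(pool: PagePool) {
  private var pages = 0
  // Called before the child operator runs, so at least one page is held.
  def prepare(): Unit = if (pages == 0 && pool.acquire(1)) pages = 1
  def canSort: Boolean = pages > 0
}
```

If `prepare()` runs first, a child that later grabs all remaining pages can no longer starve the sorter.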

Author: Andrew Or <andrew@databricks.com>

Closes #8011 from andrewor14/unsafe-starve-memory and squashes the following commits:

35b69a4 [Andrew Or] Simplify test
0b07782 [Andrew Or] Minor: update comments
5d5afdf [Andrew Or] Merge branch 'master' of github.com:apache/spark into unsafe-starve-memory
254032e [Andrew Or] Add tests
234acbd [Andrew Or] Reserve a page in sorter when preparing each partition
b889e08 [Andrew Or] MapPartitionsWithPreparationRDD
2015-08-06 19:04:57 -07:00
Reynold Xin b87825310a [SPARK-9692] Remove SqlNewHadoopRDD's generated Tuple2 and InterruptibleIterator.
A small performance optimization – we don't need to generate a Tuple2 and then immediately discard the key. We also don't need an extra wrapper from InterruptibleIterator.

Author: Reynold Xin <rxin@databricks.com>

Closes #8000 from rxin/SPARK-9692 and squashes the following commits:

1d4d0b3 [Reynold Xin] [SPARK-9692] Remove SqlNewHadoopRDD's generated Tuple2 and InterruptibleIterator.
2015-08-06 18:25:38 -07:00
Marcelo Vanzin e234ea1b49 [SPARK-9645] [YARN] [CORE] Allow shuffle service to read shuffle files.
Spark should not mess with the permissions of directories created
by the cluster manager. Here, by setting the block manager dir
permissions to 700, the shuffle service (running as the YARN user)
wouldn't be able to serve shuffle files created by applications.

Also, the code to protect the local app dir was missing in standalone's
Worker; that has been now added. Since all processes run as the same
user in standalone, `chmod 700` should not cause problems.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #7966 from vanzin/SPARK-9645 and squashes the following commits:

6e07b31 [Marcelo Vanzin] Protect the app dir in standalone mode.
384ba6a [Marcelo Vanzin] [SPARK-9645] [yarn] [core] Allow shuffle service to read shuffle files.
2015-08-06 15:30:27 -07:00
Tathagata Das 0a078303d0 [SPARK-9556] [SPARK-9619] [SPARK-9624] [STREAMING] Make BlockGenerator more robust and make all BlockGenerators subscribe to rate limit updates
In some receivers, instead of using the default `BlockGenerator` in `ReceiverSupervisorImpl`, custom generators with their own listeners are used for reliability (see [`ReliableKafkaReceiver`](https://github.com/apache/spark/blob/master/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/ReliableKafkaReceiver.scala#L99) and [updated `KinesisReceiver`](https://github.com/apache/spark/pull/7825/files)). These custom generators do not receive rate updates. This PR modifies the code to allow custom `BlockGenerator`s to be created through the `ReceiverSupervisorImpl` so that they can be tracked and rate updates can be applied to them.

In the process, I simplified and de-flaked some rate-controller-related tests. In particular:
- Renamed `Receiver.executor` to `Receiver.supervisor` (to match `ReceiverSupervisor`)
- Made `RateControllerSuite` faster (by increasing batch interval) and less flaky
- Changed a few internal APIs to return the current rate of block generators as Long instead of Option\[Long\] (this was inconsistent in places).
- Updated existing `ReceiverTrackerSuite` to test that custom block generators get rate updates as well.
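
The core idea, creating every generator through the supervisor so it can be tracked, can be sketched as follows. `Generator` and `Supervisor` are hypothetical stand-ins for `BlockGenerator` and `ReceiverSupervisorImpl`.

```scala
import scala.collection.mutable.ArrayBuffer

class Generator { @volatile var rateLimit: Long = Long.MaxValue }

class Supervisor {
  private val generators = ArrayBuffer.empty[Generator]

  // Every generator, default or custom, is created here and registered.
  def createGenerator(): Generator = synchronized {
    val g = new Generator
    generators += g
    g
  }

  // A rate update reaches all registered generators.
  def updateRate(newRate: Long): Unit = synchronized {
    generators.foreach(_.rateLimit = newRate)
  }
}
```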

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #7913 from tdas/SPARK-9556 and squashes the following commits:

41d4461 [Tathagata Das] fix scala style
eb9fd59 [Tathagata Das] Updated kinesis receiver
d24994d [Tathagata Das] Updated BlockGeneratorSuite to use manual clock in BlockGenerator
d70608b [Tathagata Das] Updated BlockGenerator with states and proper synchronization
f6bd47e [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-9556
31da173 [Tathagata Das] Fix bug
12116df [Tathagata Das] Add BlockGeneratorSuite
74bd069 [Tathagata Das] Fix style
989bb5c [Tathagata Das] Made BlockGenerator fail if used after stop, and added better unit tests for it
3ff618c [Tathagata Das] Fix test
b40eff8 [Tathagata Das] slight refactoring
f0df0f1 [Tathagata Das] Scala style fixes
51759cb [Tathagata Das] Refactored rate controller tests and added the ability to update rate of any custom block generator
2015-08-06 14:35:30 -07:00
Liang-Chi Hsieh 21fdfd7d6f [SPARK-9548][SQL] Add a destructive iterator for BytesToBytesMap
This pull request adds a destructive iterator to BytesToBytesMap. When used, the iterator frees pages as it traverses them. This is part of the effort to avoid starvation when more than one operator can exhaust memory.
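
A toy version of a destructive iterator, assuming each inner sequence stands in for one memory page. `PagedStore` is hypothetical and far simpler than `BytesToBytesMap`.

```scala
import scala.collection.mutable

class PagedStore[A](pages: Seq[Seq[A]]) {
  private val live = mutable.Queue(pages: _*)
  def pagesHeld: Int = live.size

  // Yields every record once and releases a page as soon as it has been
  // fully read. The store holds no pages once the iterator is exhausted.
  def destructiveIterator(): Iterator[A] = new Iterator[A] {
    private var current: Iterator[A] = Iterator.empty
    private var holding = false   // true while `current` reads live.head

    def hasNext: Boolean = {
      while (!current.hasNext && (holding || live.nonEmpty)) {
        if (holding) { live.dequeue(); holding = false }   // free finished page
        if (live.nonEmpty) { current = live.head.iterator; holding = true }
      }
      current.hasNext
    }

    def next(): A = {
      if (!hasNext) throw new NoSuchElementException("no more records")
      current.next()
    }
  }
}
```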

This is based on #7924, but fixes a bug there (Don't use destructive iterator in UnsafeKVExternalSorter).

Closes #7924.

Author: Liang-Chi Hsieh <viirya@appier.com>
Author: Reynold Xin <rxin@databricks.com>

Closes #8003 from rxin/map-destructive-iterator and squashes the following commits:

6b618c3 [Reynold Xin] Don't use destructive iterator in UnsafeKVExternalSorter.
a7bd8ec [Reynold Xin] Merge remote-tracking branch 'viirya/destructive_iter' into map-destructive-iterator
7652083 [Liang-Chi Hsieh] For comments: add destructiveIterator(), modify unit test, remove code block.
4a3e9de [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into destructive_iter
581e9e3 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into destructive_iter
f0ff783 [Liang-Chi Hsieh] No need to free last page.
9e9d2a3 [Liang-Chi Hsieh] Add a destructive iterator for BytesToBytesMap.
2015-08-06 14:33:29 -07:00
Yin Huai 4581badbc8 [SPARK-9611] [SQL] Fixes a few corner cases when we spill a UnsafeFixedWidthAggregationMap
This PR has the following three small fixes.

1. UnsafeKVExternalSorter does not use 0 as the initialSize to create an UnsafeInMemorySorter if its BytesToBytesMap is empty.
2. We will not spill an InMemorySorter if it is empty.
3. We will not add a SpillReader to a SpillMerger if this SpillReader is empty.

JIRA: https://issues.apache.org/jira/browse/SPARK-9611

Author: Yin Huai <yhuai@databricks.com>

Closes #7948 from yhuai/unsafeEmptyMap and squashes the following commits:

9727abe [Yin Huai] Address Josh's comments.
34b6f76 [Yin Huai] 1. UnsafeKVExternalSorter does not use 0 as the initialSize to create an UnsafeInMemorySorter if its BytesToBytesMap is empty. 2. Do not spill a InMemorySorter if it is empty. 3. Do not add spill to SpillMerger if this spill is empty.
2015-08-05 19:19:09 -07:00
Marcelo Vanzin 4399b7b090 [SPARK-9651] Fix UnsafeExternalSorterSuite.
First, it's probably a bad idea to call generated Scala methods
from Java. In this case, the method being called wasn't
"Utils.createTempDir()" itself, but the generated method that
returns the first default argument of the real createTempDir
method, which is just the location of java.io.tmpdir. As a
result, all tests in the class were using the same temp dir,
and thus affecting each other.
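
The generated method in question can be observed with a small sketch. `TempDirs` is a hypothetical object that mimics a method with a default argument; it is not Spark's `Utils`.

```scala
object TempDirs {
  // The compiler also emits a synthetic accessor for the default value,
  // named createTempDir$default$1, which Java code can call by mistake.
  def createTempDir(root: String = System.getProperty("java.io.tmpdir")): String =
    java.nio.file.Files
      .createTempDirectory(java.nio.file.Paths.get(root), "spark")
      .toString

  // Method names that are visible to Java callers and to reflection.
  def visibleMethods: Set[String] =
    TempDirs.getClass.getMethods.map(_.getName).toSet
}
```

Calling the accessor returns only the root directory, while `createTempDir()` creates a fresh directory under it on every call.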

Second, spillingOccursInResponseToMemoryPressure was not writing
enough records to actually cause a spill.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #7970 from vanzin/SPARK-9651 and squashes the following commits:

74d357f [Marcelo Vanzin] Clean up temp dir on test tear down.
a64f36a [Marcelo Vanzin] [SPARK-9651] Fix UnsafeExternalSorterSuite.
2015-08-05 17:58:36 -07:00
Andrew Or 5f0fb6466f [SPARK-9649] Fix flaky test MasterSuite - randomize ports
```
Error Message

Failed to bind to: /127.0.0.1:7093: Service 'sparkMaster' failed after 16 retries!
Stacktrace

      java.net.BindException: Failed to bind to: /127.0.0.1:7093: Service 'sparkMaster' failed after 16 retries!
      at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
      at akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:393)
      at akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:389)
      at scala.util.Success$$anonfun$map$1.apply(Try.scala:206)
      at scala.util.Try$.apply(Try.scala:161)
```

Author: Andrew Or <andrew@databricks.com>

Closes #7968 from andrewor14/fix-master-flaky-test and squashes the following commits:

fcc42ef [Andrew Or] Randomize port
2015-08-05 14:12:22 -07:00
Josh Rosen 26b06f1c46 [HOTFIX] Add static import to fix build break from #7676.
2015-08-05 02:40:50 -07:00
zsxwing 1b0317f64c [SPARK-8861][SPARK-8862][SQL] Add basic instrumentation to each SparkPlan operator and add a new SQL tab
This PR includes the following changes:

### SPARK-8861: Add basic instrumentation to each SparkPlan operator

A SparkPlan can override `def accumulators: Map[String, Accumulator[_]]` to expose its metrics that can be displayed in UI. The UI will use them to track the updates and show them in the web page in real-time.

### SparkSQLExecution and SQLSparkListener

`SparkSQLExecution.withNewExecutionId` will set `spark.sql.execution.id` to the local properties so that we can use it to track all jobs that belong to the same query.

SQLSparkListener is a listener that tracks all accumulator updates of all tasks for a query. It receives them from heartbeats so that the UI can query them in real-time.

When running a query, `SQLSparkListener.onExecutionStart` will be called. When a query is finished,  `SQLSparkListener.onExecutionEnd` will be called. And the Spark jobs with the same execution id will be tracked and stored with this query.

`SQLSparkListener` has to store all accumulator updates for tasks separately. When a task fails and starts to retry, we need to drop the old accumulator updates. Because we can not revert our changes to an accumulator, we have to maintain these accumulator updates by ourselves so as to drop accumulator updates for a failed task.
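
Keeping updates per task, so that a failed attempt can be dropped, might look like this sketch. `QueryMetrics` and its method names are hypothetical, not the listener's real API.

```scala
import scala.collection.mutable

class QueryMetrics {
  // taskId -> (accumulatorId -> latest value reported by that task)
  private val byTask = mutable.Map.empty[Long, Map[Long, Long]]

  def onHeartbeat(taskId: Long, updates: Map[Long, Long]): Unit = synchronized {
    byTask(taskId) = updates
  }

  // A failed attempt is dropped wholesale; its values never reach the total.
  def onTaskFailed(taskId: Long): Unit = synchronized {
    byTask.remove(taskId)
    ()
  }

  def total(accumulatorId: Long): Long = synchronized {
    byTask.values.map(_.getOrElse(accumulatorId, 0L)).sum
  }
}
```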

### SPARK-8862: A new SQL tab
Includes two pages:
#### A page for all DataFrame/SQL queries
It will show the running, completed and failed queries in 3 tables. It also displays the jobs and their links for a query in each row.
#### A detail page for a DataFrame/SQL query
In this page, it also shows the SparkPlan metrics in real-time. Run a long-running query, such as
```
val testData = sc.parallelize((1 to 1000000).map(i => (i, i.toString))).toDF()
testData.select($"_1").filter($"_1" < 1000).foreach(_ => Thread.sleep(60))
```
and you will see the metrics keep updating in real-time.

Author: zsxwing <zsxwing@gmail.com>

Closes #7774 from zsxwing/sql-ui and squashes the following commits:

5a2bc99 [zsxwing] Remove UISeleniumSuite and its dependency
57d4cd2 [zsxwing] Use VisibleForTesting annotation
cc1c736 [zsxwing] Add SparkPlan.trackNumOfRowsEnabled to make subclasses easy to track the number of rows; fix the issue that the "save" action cannot collect metrics
3771ab0 [zsxwing] Register SQL metrics accumulators
3a101c0 [zsxwing] Change prepareCalled's type to AtomicBoolean for thread-safety
b8d5605 [zsxwing] Make prepare idempotent; call children's prepare in SparkPlan.prepare; change doPrepare to def
4ed11a1 [zsxwing] var -> val
332639c [zsxwing] Ignore UISeleniumSuite and SQLListenerSuite."no memory leak" because of SPARK-9580
bb52359 [zsxwing] Address other comments in SQLListener
c4d0f5d [zsxwing] Move newPredicate out of the iterator loop
957473c [zsxwing] Move STATIC_RESOURCE_DIR to object SQLTab
7ab4816 [zsxwing] Make SparkPlan accumulator API private[sql]
dae195e [zsxwing] Fix the code style and comments
3a66207 [zsxwing] Ignore irrelevant accumulators
b8484a1 [zsxwing] Merge branch 'master' into sql-ui
9406592 [zsxwing] Implement the SparkPlan viz
4ebce68 [zsxwing] Add SparkPlan.prepare to support BroadcastHashJoin to run background work in parallel
ca1811f [zsxwing] Merge branch 'master' into sql-ui
fef6fc6 [zsxwing] Fix a corner case
25f335c [zsxwing] Fix the code style
6eae828 [zsxwing] SQLSparkListener -> SQLListener; SparkSQLExecutionUIData -> SQLExecutionUIData; SparkSQLExecution -> SQLExecution
822af75 [zsxwing] Add SQLSparkListenerSuite and fix the issue about onExecutionEnd and onJobEnd
6be626f [zsxwing] Add UISeleniumSuite to test UI
d02a24d [zsxwing] Make ExecutionPage private
23abf73 [zsxwing] [SPARK-8862][SPARK-8862][SQL] Add basic instrumentation to each SparkPlan operator and add a new SQL tab
2015-08-05 01:51:22 -07:00
Takeshi YAMAMURO 6d8a6e4161 [SPARK-9360] [SQL] Support BinaryType in PrefixComparators for UnsafeExternalSort
The current implementation of UnsafeExternalSort uses NoOpPrefixComparator for binary-typed data.
So, we need to add BinaryPrefixComparator in PrefixComparators.
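
A sketch of what a prefix comparator for binary data can do, assuming an 8-byte big-endian prefix compared as an unsigned value. `BinaryPrefix` is hypothetical; the actual `BinaryPrefixComparator` may compute its prefix differently.

```scala
object BinaryPrefix {
  // Packs the first 8 bytes into a Long, most significant byte first,
  // padding shorter arrays with zeros.
  def prefix(bytes: Array[Byte]): Long = {
    var p = 0L
    var i = 0
    while (i < 8) {
      val b = if (i < bytes.length) bytes(i) & 0xFFL else 0L
      p = (p << 8) | b
      i += 1
    }
    p
  }

  // Unsigned comparison keeps the result consistent with byte-wise order.
  def compare(a: Array[Byte], b: Array[Byte]): Int =
    java.lang.Long.compareUnsigned(prefix(a), prefix(b))
}
```

A result of 0 only means the prefixes match; the sorter still has to compare the full records in that case.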

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #7676 from maropu/BinaryTypePrefixComparator and squashes the following commits:

fe6f31b [Takeshi YAMAMURO] Apply comments
d943c04 [Takeshi YAMAMURO] Add a codegen'd entry for BinaryType in SortPrefix
ecf3ac5 [Takeshi YAMAMURO] Support BinaryType in PrefixComparator
2015-08-05 00:56:35 -07:00
Burak Yavuz c9a4c36d05 [SPARK-8313] R Spark packages support
shivaram cafreeman Could you please help me in testing this out? Exposing and running `rPackageBuilder` from inside the shell works, but for some reason, I can't get it to work during Spark Submit. It just starts relaunching Spark Submit.

For testing, you may use the R branch with [sbt-spark-package](https://github.com/databricks/sbt-spark-package). You can call spPackage, and then pass the jar using `--jars`.

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #7139 from brkyvz/r-submit and squashes the following commits:

0de384f [Burak Yavuz] remove unused imports 2
d253708 [Burak Yavuz] removed unused imports
6603d0d [Burak Yavuz] addressed comments
4258ffe [Burak Yavuz] merged master
ddfcc06 [Burak Yavuz] added zipping test
3a1be7d [Burak Yavuz] don't zip
77995df [Burak Yavuz] fix URI
ac45527 [Burak Yavuz] added zipping of all libs
e6bf7b0 [Burak Yavuz] add println ignores
1bc5554 [Burak Yavuz] add assumes for tests
9778e03 [Burak Yavuz] addressed comments
b42b300 [Burak Yavuz] merged master
ffd134e [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into r-submit
d867756 [Burak Yavuz] add apache header
eff5ba1 [Burak Yavuz] ready for review
8838edb [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into r-submit
e5b5a06 [Burak Yavuz] added doc
bb751ce [Burak Yavuz] fix null bug
0226768 [Burak Yavuz] fixed issues
8810beb [Burak Yavuz] R packages support
2015-08-04 18:20:12 -07:00
CodingCat 9d668b7368 [SPARK-9602] remove "Akka/Actor" words from comments
https://issues.apache.org/jira/browse/SPARK-9602

Although we have hidden Akka behind the RPC interface, I found that Akka/Actor-related comments are still scattered everywhere. To make it consistent, we shall remove "actor"/"akka" words from the comments...

Author: CodingCat <zhunansjtu@gmail.com>

Closes #7936 from CodingCat/SPARK-9602 and squashes the following commits:

e8296a3 [CodingCat] remove actor words from comments
2015-08-04 14:54:11 -07:00
Josh Rosen ab8ee1a3b9 [SPARK-9452] [SQL] Support records larger than page size in UnsafeExternalSorter
This patch extends UnsafeExternalSorter to support records larger than the page size. The basic strategy is the same as in #7762: store large records in their own overflow pages.
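
The overflow-page strategy can be sketched with a toy allocator. `Page` and `PageAllocator` are hypothetical and model pages only by capacity and bytes used.

```scala
import scala.collection.mutable.ArrayBuffer

final class Page(val capacity: Int) { var used: Int = 0 }

class PageAllocator(pageSize: Int) {
  val pages = ArrayBuffer.empty[Page]

  private def newPage(capacity: Int): Page = {
    val p = new Page(capacity)
    pages += p
    p
  }

  // Returns the page that received the record.
  def insert(recordSize: Int): Page = {
    val page =
      if (recordSize > pageSize) newPage(recordSize)   // dedicated overflow page
      else {
        // Reuse the last regular page when the record still fits in it.
        val reusable = pages.lastOption.filter { p =>
          p.capacity == pageSize && p.capacity - p.used >= recordSize
        }
        reusable.getOrElse(newPage(pageSize))
      }
    page.used += recordSize
    page
  }
}
```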

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7891 from JoshRosen/large-records-in-sql-sorter and squashes the following commits:

967580b [Josh Rosen] Merge remote-tracking branch 'origin/master' into large-records-in-sql-sorter
948c344 [Josh Rosen] Add large records tests for KV sorter.
3c17288 [Josh Rosen] Combine memory and disk cleanup into general cleanupResources() method
380f217 [Josh Rosen] Merge remote-tracking branch 'origin/master' into large-records-in-sql-sorter
27eafa0 [Josh Rosen] Fix page size in PackedRecordPointerSuite
a49baef [Josh Rosen] Address initial round of review comments
3edb931 [Josh Rosen] Remove accidentally-committed debug statements.
2b164e2 [Josh Rosen] Support large records in UnsafeExternalSorter.
2015-08-04 14:42:11 -07:00
Carson Wang cb7fa0aa93 [SPARK-2016] [WEBUI] RDD partition table pagination for the RDD Page
Add pagination for the RDD page to avoid unresponsive UI when the number of the RDD partitions is large.
Before:
![rddpagebefore](https://cloud.githubusercontent.com/assets/9278199/8951533/3d9add54-3601-11e5-99d0-5653b473c49b.png)
After:
![rddpageafter](https://cloud.githubusercontent.com/assets/9278199/8951536/439d66e0-3601-11e5-9cee-1b380fe6620d.png)
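
The slicing behind such a paged table can be sketched as below. `Paging` is a hypothetical helper with 1-based page numbers, not the web UI code.

```scala
object Paging {
  // Rows for a 1-based page number; out-of-range pages yield no rows.
  def page[A](rows: IndexedSeq[A], pageNumber: Int, pageSize: Int): IndexedSeq[A] = {
    require(pageNumber >= 1 && pageSize >= 1, "page number and size must be positive")
    val from = (pageNumber - 1).toLong * pageSize
    if (from >= rows.length) IndexedSeq.empty
    else rows.slice(from.toInt, math.min(rows.length.toLong, from + pageSize).toInt)
  }

  def totalPages(rowCount: Int, pageSize: Int): Int =
    (rowCount + pageSize - 1) / pageSize
}
```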

Author: Carson Wang <carson.wang@intel.com>

Closes #7692 from carsonwang/SPARK-2016 and squashes the following commits:

03c7168 [Carson Wang] Fix style issues
612c18c [Carson Wang] RDD partition table pagination for the RDD Page
2015-08-04 22:12:30 +09:00
Sean Owen 76d74090d6 [SPARK-9534] [BUILD] Enable javac lint for scalac parity; fix a lot of build warnings, 1.5.0 edition
Enable most javac lint warnings; fix a lot of build warnings. In a few cases, touch up surrounding code in the process.

I'll explain several of the changes inline in comments.

Author: Sean Owen <sowen@cloudera.com>

Closes #7862 from srowen/SPARK-9534 and squashes the following commits:

ea51618 [Sean Owen] Enable most javac lint warnings; fix a lot of build warnings. In a few cases, touch up surrounding code in the process.
2015-08-04 12:02:26 +01:00
Reynold Xin 5eb89f67e3 [SPARK-9577][SQL] Surface concrete iterator types in various sort classes.
We often return abstract iterator types in various sort-related classes (e.g. UnsafeKVExternalSorter). It is actually better to return a more concrete type, so the callsite uses that type and JIT can inline the iterator calls.

Author: Reynold Xin <rxin@databricks.com>

Closes #7911 from rxin/surface-concrete-type and squashes the following commits:

0422add [Reynold Xin] [SPARK-9577][SQL] Surface concrete iterator types in various sort classes.
2015-08-03 18:47:02 -07:00
CodingCat 3b0e44490a [SPARK-8416] highlight and topping the executor threads in thread dumping page
https://issues.apache.org/jira/browse/SPARK-8416

To facilitate debugging, I made this patch with three changes:

* render the executor-thread and non executor-thread entries with different background colors

* put the executor threads on the top of the list

* sort the threads alphabetically
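
The ordering described by the last two bullets can be sketched as a single sort key. It assumes executor threads are recognized by the name fragment "executor task launch", and `ThreadDumpOrder` is a hypothetical helper.

```scala
object ThreadDumpOrder {
  // Assumption: executor task threads carry this fragment in their name.
  private def isExecutorThread(name: String): Boolean =
    name.toLowerCase.contains("executor task launch")

  // Executor threads first, then alphabetical order ignoring case.
  def sorted(names: Seq[String]): Seq[String] =
    names.sortBy(n => (!isExecutorThread(n), n.toLowerCase))
}
```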

Author: CodingCat <zhunansjtu@gmail.com>

Closes #7808 from CodingCat/SPARK-8416 and squashes the following commits:

34fc708 [CodingCat] fix className
d7b79dd [CodingCat] lowercase threadName
d032882 [CodingCat] sort alphabetically and change the css class name
f0513b1 [CodingCat] change the color & group threads by name
2da6e06 [CodingCat] small fix
3fc9f36 [CodingCat] define classes in webui.css
8ee125e [CodingCat] highlight and put on top the executor threads in thread dumping page
2015-08-03 18:20:40 -07:00
Burak Yavuz 1633d0a261 [SPARK-9263] Added flags to exclude dependencies when using --packages
While the functionality is there to exclude packages, there are no flags that allow users to exclude dependencies, in case of dependency conflicts. We should provide users with a flag to add dependency exclusions in case the packages are not resolved properly (or not available due to licensing).

The flag I added was --packages-exclude, but I'm open to renaming it. I also added property flags in case people would like to use a conf file to provide dependencies, which is useful when there is a long list of dependencies or exclusions.

cc andrewor14 vanzin pwendell

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #7599 from brkyvz/packages-exclusions and squashes the following commits:

636f410 [Burak Yavuz] addressed nits
6e54ede [Burak Yavuz] is this the culprit
b5e508e [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into packages-exclusions
154f5db [Burak Yavuz] addressed initial comments
1536d7a [Burak Yavuz] Added flags to exclude packages using --packages-exclude
2015-08-03 17:42:03 -07:00
Andrew Or 702aa9d7fb [SPARK-8735] [SQL] Expose memory usage for shuffles, joins and aggregations
This patch exposes the memory used by internal data structures on the SparkUI. This tracks memory used by all spilling operations and SQL operators backed by Tungsten, e.g. `BroadcastHashJoin`, `ExternalSort`, `GeneratedAggregate` etc. The metric exposed is "peak execution memory", which broadly refers to the peak in-memory size of each of these data structures.

A separate patch will extend this by linking the new information to the SQL operators themselves.

<img width="950" alt="screen shot 2015-07-29 at 7 43 17 pm" src="https://cloud.githubusercontent.com/assets/2133137/8974776/b90fc980-362a-11e5-9e2b-842da75b1641.png">
<img width="802" alt="screen shot 2015-07-29 at 7 43 05 pm" src="https://cloud.githubusercontent.com/assets/2133137/8974777/baa76492-362a-11e5-9b77-e364a6a6b64e.png">

Author: Andrew Or <andrew@databricks.com>

Closes #7770 from andrewor14/expose-memory-metrics and squashes the following commits:

9abecb9 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics
f5b0d68 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics
d7df332 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics
8eefbc5 [Andrew Or] Fix non-failing tests
9de2a12 [Andrew Or] Fix tests due to another logical merge conflict
876bfa4 [Andrew Or] Fix failing test after logical merge conflict
361a359 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics
40b4802 [Andrew Or] Fix style?
d0fef87 [Andrew Or] Fix tests?
b3b92f6 [Andrew Or] Address comments
0625d73 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics
c00a197 [Andrew Or] Fix potential NPEs
10da1cd [Andrew Or] Fix compile
17f4c2d [Andrew Or] Fix compile?
a87b4d0 [Andrew Or] Fix compile?
d70874d [Andrew Or] Fix test compile + address comments
2840b7d [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics
6aa2f7a [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics
b889a68 [Andrew Or] Minor changes: comments, spacing, style
663a303 [Andrew Or] UnsafeShuffleWriter: update peak memory before close
d090a94 [Andrew Or] Fix style
2480d84 [Andrew Or] Expand test coverage
5f1235b [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics
1ecf678 [Andrew Or] Minor changes: comments, style, unused imports
0b6926c [Andrew Or] Oops
111a05e [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics
a7a39a5 [Andrew Or] Strengthen presence check for accumulator
a919eb7 [Andrew Or] Add tests for unsafe shuffle writer
23c845d [Andrew Or] Add tests for SQL operators
a757550 [Andrew Or] Address comments
b5c51c1 [Andrew Or] Re-enable test in JavaAPISuite
5107691 [Andrew Or] Add tests for internal accumulators
59231e4 [Andrew Or] Fix tests
9528d09 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics
5b5e6f3 [Andrew Or] Add peak execution memory to summary table + tooltip
92b4b6b [Andrew Or] Display peak execution memory on the UI
eee5437 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics
d9b9015 [Andrew Or] Track execution memory in unsafe shuffles
770ee54 [Andrew Or] Track execution memory in broadcast joins
9c605a4 [Andrew Or] Track execution memory in GeneratedAggregate
9e824f2 [Andrew Or] Add back execution memory tracking for *ExternalSort
4ef4cb1 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics
e6c3e2f [Andrew Or] Move internal accumulators creation to Stage
a417592 [Andrew Or] Expose memory metrics in UnsafeExternalSorter
3c4f042 [Andrew Or] Track memory usage in ExternalAppendOnlyMap / ExternalSorter
bd7ab3f [Andrew Or] Add internal accumulators to TaskContext
2015-08-03 14:22:07 -07:00
Andrew Or b41a32718d [SPARK-1855] Local checkpointing
Certain use cases of Spark involve RDDs with long lineages that must be truncated periodically (e.g. GraphX). The existing way of doing it is through `rdd.checkpoint()`, which is expensive because it writes to HDFS. This patch provides an alternative to truncate lineages cheaply *without providing the same level of fault tolerance*.

**Local checkpointing** writes checkpointed data to the local file system through the block manager. It is much faster than replicating to reliable storage and provides the same semantics as long as executors do not fail. It is accessible through a new operator `rdd.localCheckpoint()` and leaves the old one unchanged. Users may even decide to combine the two and call the reliable one less frequently.
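
As a rough illustration of what truncating a lineage means, here is a toy model in plain Python. It is not Spark's implementation: `ToyRDD`, `local_checkpoint`, and `lineage_depth` are hypothetical names, and the "checkpoint" simply keeps the materialized data in memory instead of going through the block manager.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: a compute function plus a parent link."""

    def __init__(self, compute, parent=None):
        self._compute = compute
        self.parent = parent
        self._cached = None  # set once the data has been checkpointed

    def map(self, f):
        # A child dataset remembers its parent, which is what grows the lineage.
        return ToyRDD(lambda data: [f(x) for x in data], parent=self)

    def collect(self):
        if self._cached is not None:
            return list(self._cached)
        source = self.parent.collect() if self.parent is not None else None
        return self._compute(source)

    def local_checkpoint(self):
        # Materialize the data once, then drop the parent link.
        # If the cached copy were lost, there would be nothing to recompute from.
        self._cached = self.collect()
        self.parent = None
        return self

    def lineage_depth(self):
        depth, node = 0, self
        while node.parent is not None:
            depth += 1
            node = node.parent
        return depth


root = ToyRDD(lambda _: [1, 2, 3])
derived = root.map(lambda x: x + 1).map(lambda x: x * 2)
depth_before = derived.lineage_depth()
derived.local_checkpoint()
depth_after = derived.lineage_depth()
```

The trade-off described above shows up directly: after the checkpoint the lineage is gone, so the result is cheap to reuse but cannot be rebuilt if the cached copy disappears.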

The bulk of this patch involves refactoring the checkpointing interface to accept custom implementations of checkpointing. [Design doc](https://issues.apache.org/jira/secure/attachment/12741708/SPARK-7292-design.pdf).

Author: Andrew Or <andrew@databricks.com>

Closes #7279 from andrewor14/local-checkpoint and squashes the following commits:

729600f [Andrew Or] Oops, fix tests
34bc059 [Andrew Or] Avoid computing all partitions in local checkpoint
e43bbb6 [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint
3be5aea [Andrew Or] Address comments
bf846a6 [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint
ab003a3 [Andrew Or] Fix compile
c2e111b [Andrew Or] Address comments
33f167a [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint
e908a42 [Andrew Or] Fix tests
f5be0f3 [Andrew Or] Use MEMORY_AND_DISK as the default local checkpoint level
a92657d [Andrew Or] Update a few comments
e58e3e3 [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint
4eb6eb1 [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint
1bbe154 [Andrew Or] Simplify LocalCheckpointRDD
48a9996 [Andrew Or] Avoid traversing dependency tree + rewrite tests
62aba3f [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint
db70dc2 [Andrew Or] Express local checkpointing through caching the original RDD
87d43c6 [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint
c449b38 [Andrew Or] Fix style
4a182f3 [Andrew Or] Add fine-grained tests for local checkpointing
53b363b [Andrew Or] Rename a few more awkwardly named methods (minor)
e4cf071 [Andrew Or] Simplify LocalCheckpointRDD + docs + clean ups
4880deb [Andrew Or] Fix style
d096c67 [Andrew Or] Fix mima
172cb66 [Andrew Or] Fix mima?
e53d964 [Andrew Or] Fix style
56831c5 [Andrew Or] Add a few warnings and clear exception messages
2e59646 [Andrew Or] Add local checkpoint clean up tests
4dbbab1 [Andrew Or] Refactor CheckpointSuite to test local checkpointing
4514dc9 [Andrew Or] Clean local checkpoint files through RDD cleanups
0477eec [Andrew Or] Rename a few methods with awkward names (minor)
2e902e5 [Andrew Or] First implementation of local checkpointing
8447454 [Andrew Or] Fix tests
4ac1896 [Andrew Or] Refactor checkpoint interface for modularity
2015-08-03 10:58:37 -07:00
Timothy Chen 95dccc6335 [SPARK-8873] [MESOS] Clean up shuffle files if external shuffle service is used
This patch builds directly on #7820, which is largely written by tnachen. The only addition is one commit for cleaning up the code. There should be no functional differences between this and #7820.

Author: Timothy Chen <tnachen@gmail.com>
Author: Andrew Or <andrew@databricks.com>

Closes #7881 from andrewor14/tim-cleanup-mesos-shuffle and squashes the following commits:

8894f7d [Andrew Or] Clean up code
2a5fa10 [Andrew Or] Merge branch 'mesos_shuffle_clean' of github.com:tnachen/spark into tim-cleanup-mesos-shuffle
fadff89 [Timothy Chen] Address comments.
e4d0f1d [Timothy Chen] Clean up external shuffle data on driver exit with Mesos.
2015-08-03 01:55:58 -07:00
Reynold Xin 2e981b7bfa [SPARK-9531] [SQL] UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter
This pull request adds a destructAndCreateExternalSorter method to UnsafeFixedWidthAggregationMap. The new method does the following:

1. Creates a new external sorter UnsafeKVExternalSorter
2. Adds all the data into an in-memory sorter, sorts them
3. Spills the sorted in-memory data to disk

This method can be used to fall back to sort-based aggregation when under memory pressure.
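
The three steps can be sketched in plain Python as follows. This is only a toy model under stated assumptions: a `dict` stands in for the aggregation map, `pickle` for the record format, and `destruct_and_spill` and `read_spill` are hypothetical names, not the methods added by this pull request.

```python
import os
import pickle
import tempfile


def destruct_and_spill(agg_map):
    """Sort the map's entries by key, empty the map, and write the sorted run to disk."""
    records = sorted(agg_map.items())  # in-memory sort by grouping key
    agg_map.clear()                    # the map gives up its contents
    fd, path = tempfile.mkstemp(suffix=".spill")
    with os.fdopen(fd, "wb") as f:
        for record in records:
            pickle.dump(record, f)
    return path


def read_spill(path):
    """Stream the records of a spill file back in sorted order."""
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return


agg = {"b": 2, "a": 1, "c": 3}
spill_path = destruct_and_spill(agg)
spilled = list(read_spill(spill_path))
os.remove(spill_path)
```

A sort-based aggregation could then merge several such sorted runs and combine adjacent records that share a key.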

The pull request also includes accounting fixes from JoshRosen.

TODOs (that can be done in follow-up PRs)
- [x] Address Josh's feedback from #7849
- [x] More documentation and test cases
- [x] Make sure we are doing memory accounting correctly with test cases (e.g. did we release the memory in BytesToBytesMap twice?)
- [ ] Look harder at possible memory leaks and exception handling
- [ ] Randomized tester for the KV sorter as well as the aggregation map

Author: Reynold Xin <rxin@databricks.com>
Author: Josh Rosen <joshrosen@databricks.com>

Closes #7860 from rxin/kvsorter and squashes the following commits:

986a58c [Reynold Xin] Bug fix.
599317c [Reynold Xin] Style fix and slightly more compact code.
fe7bd4e [Reynold Xin] Bug fixes.
fd71bef [Reynold Xin] Merge remote-tracking branch 'josh/large-records-in-sql-sorter' into kvsorter-with-josh-fix
3efae38 [Reynold Xin] More fixes and documentation.
45f1b09 [Josh Rosen] Ensure that spill files are cleaned up
f6a9bd3 [Reynold Xin] Josh feedback.
9be8139 [Reynold Xin] Remove testSpillFrequency.
7cbe759 [Reynold Xin] [SPARK-9531][SQL] UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter.
ae4a8af [Josh Rosen] Detect leaked unsafe memory in UnsafeExternalSorterSuite.
52f9b06 [Josh Rosen] Detect ShuffleMemoryManager leaks in UnsafeExternalSorter.
2015-08-02 12:32:14 -07:00
Reynold Xin 3d1535d488 [SPARK-9520] [SQL] Support in-place sort in UnsafeFixedWidthAggregationMap
This pull request adds a sortedIterator method to UnsafeFixedWidthAggregationMap that sorts its data in-place by the grouping key.

This is needed so we can fall back to external sorting for aggregation.

Author: Reynold Xin <rxin@databricks.com>

Closes #7849 from rxin/bytes2bytes-sorting and squashes the following commits:

75018c6 [Reynold Xin] Updated documentation.
81a8694 [Reynold Xin] [SPARK-9520][SQL] Support in-place sort in UnsafeFixedWidthAggregationMap.
2015-08-01 13:20:26 -07:00
Andrew Or 6688ba6e68 [SPARK-4751] Dynamic allocation in standalone mode
Dynamic allocation is a feature that allows a Spark application to scale the number of executors up and down dynamically based on the workload. Support was first introduced for YARN in 1.2, and then recently extended to Mesos coarse-grained mode. Today, it is finally supported in standalone mode as well!

I tested this locally and it works as expected. This is WIP because unit tests are coming.

Author: Andrew Or <andrew@databricks.com>

Closes #7532 from andrewor14/standalone-da and squashes the following commits:

b3c1736 [Andrew Or] Merge branch 'master' of github.com:apache/spark into standalone-da
879e928 [Andrew Or] Add end-to-end tests for standalone dynamic allocation
accc8f6 [Andrew Or] Address comments
ee686a8 [Andrew Or] Merge branch 'master' of github.com:apache/spark into standalone-da
c0a2c02 [Andrew Or] Fix build after merge conflict
24149eb [Andrew Or] Merge branch 'master' of github.com:apache/spark into standalone-da
2e762d6 [Andrew Or] Merge branch 'master' of github.com:apache/spark into standalone-da
6832bd7 [Andrew Or] Add tests for scheduling with executor limit
a82e907 [Andrew Or] Fix comments
0a8be79 [Andrew Or] Simplify logic by removing the worker blacklist
b7742af [Andrew Or] Merge branch 'master' of github.com:apache/spark into standalone-da
2eb5f3f [Andrew Or] Merge branch 'master' of github.com:apache/spark into standalone-da
1334e9a [Andrew Or] Fix MiMa
32abe44 [Andrew Or] Fix style
58cb06f [Andrew Or] Privatize worker blacklist for cleanliness
42ac215 [Andrew Or] Clean up comments and rewrite code for readability
49702d1 [Andrew Or] Clean up shuffle files after application exits
80047aa [Andrew Or] First working implementation
2015-08-01 11:57:14 -07:00
Reynold Xin d90f2cf7a2 [SPARK-9517][SQL] BytesToBytesMap should encode data the same way as UnsafeExternalSorter
BytesToBytesMap currently encodes key/value data in the following format:
```
8B key length, key data, 8B value length, value data
```

UnsafeExternalSorter, on the other hand, encodes data this way:
```
4B record length, data
```

As a result, we cannot pass records encoded by BytesToBytesMap directly into UnsafeExternalSorter for sorting. However, if we rearrange data slightly, we can then pass the key/value records directly into UnsafeExternalSorter:
```
4B key+value length, 4B key length, key data, value data
```
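
A minimal Python sketch of the rearranged layout, for illustration only. Two details are assumptions, not taken from the patch: the first field is taken to count everything after it (the 4-byte key length plus key and value bytes), and the integers are packed little-endian. `encode_record` and `decode_record` are hypothetical helper names.

```python
import struct


def encode_record(key: bytes, value: bytes) -> bytes:
    # Assumed: the leading length covers the key-length field, the key, and the value,
    # so a reader that only understands "4B record length, data" can skip the record.
    record_len = 4 + len(key) + len(value)
    return struct.pack("<ii", record_len, len(key)) + key + value


def decode_record(buf: bytes, offset: int = 0):
    """Return (key, value, offset_of_next_record)."""
    record_len, key_len = struct.unpack_from("<ii", buf, offset)
    key_start = offset + 8
    value_len = record_len - 4 - key_len
    key = buf[key_start:key_start + key_len]
    value = buf[key_start + key_len:key_start + key_len + value_len]
    return key, value, offset + 4 + record_len


encoded = encode_record(b"ab", b"xyz")
decoded = decode_record(encoded)
```

Because the value length is implied by the two leading fields, the separate 8-byte value length of the old format is no longer needed.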

Author: Reynold Xin <rxin@databricks.com>

Closes #7845 from rxin/kvsort-rebase and squashes the following commits:

5716b59 [Reynold Xin] Fixed test.
2e62ccb [Reynold Xin] Updated BytesToBytesMap's data encoding to put the key first.
a51b641 [Reynold Xin] Added a KV sorter interface.
2015-07-31 23:55:16 -07:00
Josh Rosen 8cb415a4b9 [SPARK-9451] [SQL] Support entries larger than default page size in BytesToBytesMap & integrate with ShuffleMemoryManager
This patch adds support for entries larger than the default page size in BytesToBytesMap.  These large rows are handled by allocating special overflow pages to hold individual entries.

In addition, this patch integrates BytesToBytesMap with the ShuffleMemoryManager:

- Move BytesToBytesMap from `unsafe` to `core` so that it can import `ShuffleMemoryManager`.
- Before allocating new data pages, ask the ShuffleMemoryManager to reserve the memory:
  - `putNewKey()` now returns a boolean to indicate whether the insert succeeded or failed due to a lack of memory.  The caller can use this value to respond to the memory pressure (e.g. by spilling).
- `UnsafeFixedWidthAggregationMap.getAggregationBuffer()` now returns `null` to signal failure due to a lack of memory.
- Updated all uses of these classes to handle these error conditions.
- Added new tests for allocating large records and for allocations which fail due to memory pressure.
- Extended the `afterAll()` test teardown methods to detect ShuffleMemoryManager leaks.
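
The calling pattern this enables (try the insert, spill on failure, then retry) might look roughly like the sketch below. It is a toy model: `BoundedMap`, its byte budget, and `insert_with_spill` are hypothetical and stand in for the real map and memory manager.

```python
class BoundedMap:
    """Toy map with a fixed byte budget; put_new_key reports failure instead of raising."""

    def __init__(self, budget_bytes: int):
        self.budget = budget_bytes
        self.used = 0
        self.data = {}

    def put_new_key(self, key: bytes, value: bytes) -> bool:
        needed = len(key) + len(value)
        if self.used + needed > self.budget:
            return False  # not enough memory: the caller decides what to do
        self.data[key] = value
        self.used += needed
        return True

    def reset(self):
        self.data.clear()
        self.used = 0


def insert_with_spill(m: BoundedMap, records, spills: list):
    for key, value in records:
        if not m.put_new_key(key, value):
            spills.append(dict(m.data))  # stand-in for writing the contents to disk
            m.reset()
            if not m.put_new_key(key, value):
                raise MemoryError("record does not fit even in an empty map")


m = BoundedMap(budget_bytes=4)
spills = []
insert_with_spill(m, [(b"a", b"1"), (b"b", b"2"), (b"c", b"3")], spills)
```

Returning a flag rather than throwing keeps the out-of-memory case on the normal control path, which is what lets callers respond by spilling.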

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7762 from JoshRosen/large-rows and squashes the following commits:

ae7bc56 [Josh Rosen] Fix compilation
82fc657 [Josh Rosen] Merge remote-tracking branch 'origin/master' into large-rows
34ab943 [Josh Rosen] Remove semi
31a525a [Josh Rosen] Integrate BytesToBytesMap with ShuffleMemoryManager.
626b33c [Josh Rosen] Move code to sql/core and spark/core packages so that ShuffleMemoryManager can be integrated
ec4484c [Josh Rosen] Move BytesToBytesMap from unsafe package to core.
642ed69 [Josh Rosen] Rename size to numElements
bea1152 [Josh Rosen] Add basic test.
2cd3570 [Josh Rosen] Remove accidental duplicated code
07ff9ef [Josh Rosen] Basic support for large rows in BytesToBytesMap.
2015-07-31 19:19:27 -07:00
Sameer Abhyankar 060c79aab5 [SPARK-9056] [STREAMING] Rename configuration spark.streaming.minRememberDuration to spark.streaming.fileStream.minRememberDuration
Rename configuration `spark.streaming.minRememberDuration` to `spark.streaming.fileStream.minRememberDuration`

Author: Sameer Abhyankar <sabhyankar@sabhyankar-MBP.local>
Author: Sameer Abhyankar <sabhyankar@sabhyankar-MBP.Samavihome>

Closes #7740 from sabhyankar/spark_branch_9056 and squashes the following commits:

d5b2f1f [Sameer Abhyankar] Correct deprecated version to 1.5
1268133 [Sameer Abhyankar] Add {} and indentation
ddf9844 [Sameer Abhyankar] Change 4 space indentation to 2 space indentation
1819b5f [Sameer Abhyankar] Use spark.streaming.fileStream.minRememberDuration property in lieu of spark.streaming.minRememberDuration
2015-07-31 13:08:55 -07:00
CodingCat c0686668ae [SPARK-9202] capping maximum number of executor&driver information kept in Worker
https://issues.apache.org/jira/browse/SPARK-9202

Author: CodingCat <zhunansjtu@gmail.com>

Closes #7714 from CodingCat/SPARK-9202 and squashes the following commits:

23977fb [CodingCat] add comments about why we don't synchronize finishedExecutors & finishedDrivers
dc9772d [CodingCat] addressing the comments
e125241 [CodingCat] stylistic fix
80bfe52 [CodingCat] fix JsonProtocolSuite
d7d9485 [CodingCat] styistic fix and respect insert ordering
031755f [CodingCat] add license info & stylistic fix
c3b5361 [CodingCat] test cases and docs
c557b3a [CodingCat] applications are fine
9cac751 [CodingCat] application is fine...
ad87ed7 [CodingCat] trimFinishedExecutorsAndDrivers
2015-07-31 20:27:00 +01:00
tedyu 27ae851ce1 [SPARK-9446] Clear Active SparkContext in stop() method
In the thread 'stopped SparkContext remaining active' on the mailing list, Andres observed the following in the driver log:
```
15/07/29 15:17:09 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has disassociated: <address removed>
15/07/29 15:17:09 INFO YarnClientSchedulerBackend: Shutting down all executors
Exception in thread "Yarn application state monitor" org.apache.spark.SparkException: Error asking standalone scheduler to shut down executors
        at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:261)
        at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:266)
        at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158)
        at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416)
        at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411)
        at org.apache.spark.SparkContext.stop(SparkContext.scala:1644)
        at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:139)
Caused by: java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1325)
        at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
        at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
        at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
        at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
        at scala.concurrent.Await$.result(package.scala:190)
15/07/29 15:17:09 INFO YarnClientSchedulerBackend: Asking each executor to shut down
        at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
        at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
        at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:257)
        ... 6 more
```
The effect of the above exception is that a stopped SparkContext is returned to the user, since SparkContext.clearActiveContext() is not called.

Author: tedyu <yuzhihong@gmail.com>

Closes #7756 from tedyu/master and squashes the following commits:

7339ff2 [tedyu] Move null assignment out of tryLogNonFatalError block
6e02cd9 [tedyu] Use Utils.tryLogNonFatalError to guard resource release
f5fb519 [tedyu] Clear Active SparkContext in stop() method using finally
2015-07-31 18:16:55 +01:00
zsxwing 04a49edfdb [SPARK-9497] [SPARK-9509] [CORE] Use ask instead of askWithRetry
`RpcEndpointRef.askWithRetry` throws `SparkException` rather than `TimeoutException`. Use ask to replace it because we don't need to retry here.

Author: zsxwing <zsxwing@gmail.com>

Closes #7824 from zsxwing/SPARK-9497 and squashes the following commits:

7bfc2b4 [zsxwing] Use ask instead of askWithRetry
2015-07-31 09:34:16 -07:00
Reynold Xin e7a0976e99 [SPARK-9458][SPARK-9469][SQL] Code generate prefix computation in sorting & moves unsafe conversion out of TungstenSort.
Author: Reynold Xin <rxin@databricks.com>

Closes #7803 from rxin/SPARK-9458 and squashes the following commits:

5b032dc [Reynold Xin] Fix string.
b670dbb [Reynold Xin] [SPARK-9458][SPARK-9469][SQL] Code generate prefix computation in sorting & moves unsafe conversion out of TungstenSort.
2015-07-30 17:17:27 -07:00
Hossein 157840d1b1 [SPARK-8742] [SPARKR] Improve SparkR error messages for DataFrame API
This patch improves SparkR error message reporting, especially with the DataFrame API. When there is a user error (e.g., a malformed SQL query), the message of the cause is sent back through the RPC, and the R client reads it and returns it to the user.

cc shivaram

Author: Hossein <hossein@databricks.com>

Closes #7742 from falaki/SPARK-8742 and squashes the following commits:

4f643c9 [Hossein] Not logging exceptions in RBackendHandler
4a8005c [Hossein] Returning stack track of causing exception from RBackendHandler
5cf17f0 [Hossein] Adding unit test for error messages from SQLContext
2af75d5 [Hossein] Reading error message in case of failure and stoping with that message
f479c99 [Hossein] Wrting exception cause message in JVM
2015-07-30 16:16:17 -07:00
Imran Rashid 06b6a074fb [SPARK-9437] [CORE] avoid overflow in SizeEstimator
https://issues.apache.org/jira/browse/SPARK-9437

Author: Imran Rashid <irashid@cloudera.com>

Closes #7750 from squito/SPARK-9437_size_estimator_overflow and squashes the following commits:

29493f1 [Imran Rashid] prevent another potential overflow
bc1cb82 [Imran Rashid] avoid overflow
2015-07-30 10:46:26 -07:00
Josh Rosen 520ec0ff9d [SPARK-8850] [SQL] Enable Unsafe mode by default
This pull request enables Unsafe mode by default in Spark SQL. In order to do this, we had to fix a number of small issues:

**List of fixed blockers**:

- [x] Make some default buffer sizes configurable so that HiveCompatibilitySuite can run properly (#7741).
- [x] Memory leak on grouped aggregation of empty input (fixed by #7560)
- [x] Update planner to also check whether codegen is enabled before planning unsafe operators.
- [x] Investigate failing HiveThriftBinaryServerSuite test.  This turns out to be caused by a ClassCastException that occurs when Exchange tries to apply an interpreted RowOrdering to an UnsafeRow when range partitioning an RDD.  This could be fixed by #7408, but a shorter-term fix is to just skip the Unsafe exchange path when RangePartitioner is used.
- [x] Memory leak exceptions masking exceptions that actually caused tasks to fail (will be fixed by #7603).
- [x]  ~~https://issues.apache.org/jira/browse/SPARK-9162, to implement code generation for ScalaUDF.  This is necessary for `UDFSuite` to pass.  For now, I've just ignored this test in order to try to find other problems while we wait for a fix.~~ This is no longer necessary as of #7682.
- [x] Memory leaks from Limit after UnsafeExternalSort cause the memory leak detector to fail tests. This is a huge problem in the HiveCompatibilitySuite (fixed by f4ac642a4e5b2a7931c5e04e086bb10e263b1db6).
- [x] Tests in `AggregationQuerySuite` are failing due to NaN-handling issues in UnsafeRow, which were fixed in #7736.
- [x] `org.apache.spark.sql.ColumnExpressionSuite.rand` needs to be updated so that the planner check also matches `TungstenProject`.
- [x] After having lowered the buffer sizes to 4MB so that most of HiveCompatibilitySuite runs:
  - [x] Wrong answer in `join_1to1` (fixed by #7680)
  - [x] Wrong answer in `join_nulls` (fixed by #7680)
  - [x] Managed memory OOM / leak in `lateral_view`
  - [x] Seems to hang indefinitely in `partcols1`.  This might be a deadlock in script transformation or a bug in error-handling code? The hang was fixed by #7710.
  - [x] Error while freeing memory in `partcols1`: will be fixed by #7734.
- [x] After fixing the `partcols1` hang, it appears that a number of later tests have issues as well.
- [x] Fix thread-safety bug in codegen fallback expression evaluation (#7759).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7564 from JoshRosen/unsafe-by-default and squashes the following commits:

83c0c56 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-by-default
f4cc859 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-by-default
963f567 [Josh Rosen] Reduce buffer size for R tests
d6986de [Josh Rosen] Lower page size in PySpark tests
013b9da [Josh Rosen] Also match TungstenProject in checkNumProjects
5d0b2d3 [Josh Rosen] Add task completion callback to avoid leak in limit after sort
ea250da [Josh Rosen] Disable unsafe Exchange path when RangePartitioning is used
715517b [Josh Rosen] Enable Unsafe by default
2015-07-30 10:45:32 -07:00
Mridul Muralidharan e53534655d [SPARK-8297] [YARN] Scheduler backend is not notified in case node fails in YARN
This change adds code to notify the scheduler backend when a container dies in YARN.

Author: Mridul Muralidharan <mridulm@yahoo-inc.com>
Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #7431 from vanzin/SPARK-8297 and squashes the following commits:

471e4a0 [Marcelo Vanzin] Fix unit test after merge.
d4adf4e [Marcelo Vanzin] Merge branch 'master' into SPARK-8297
3b262e8 [Marcelo Vanzin] Merge branch 'master' into SPARK-8297
537da6f [Marcelo Vanzin] Make an expected log less scary.
04dc112 [Marcelo Vanzin] Use driver <-> AM communication to send "remove executor" request.
8855b97 [Marcelo Vanzin] Merge remote-tracking branch 'mridul/fix_yarn_scheduler_bug' into SPARK-8297
687790f [Mridul Muralidharan] Merge branch 'fix_yarn_scheduler_bug' of github.com:mridulm/spark into fix_yarn_scheduler_bug
e1b0067 [Mridul Muralidharan] Fix failing testcase, fix merge issue from our 1.3 -> master
9218fcc [Mridul Muralidharan] Fix failing testcase
362d64a [Mridul Muralidharan] Merge branch 'fix_yarn_scheduler_bug' of github.com:mridulm/spark into fix_yarn_scheduler_bug
62ad0cc [Mridul Muralidharan] Merge branch 'fix_yarn_scheduler_bug' of github.com:mridulm/spark into fix_yarn_scheduler_bug
bbf8811 [Mridul Muralidharan] Merge branch 'fix_yarn_scheduler_bug' of github.com:mridulm/spark into fix_yarn_scheduler_bug
9ee1307 [Mridul Muralidharan] Fix SPARK-8297
a3a0f01 [Mridul Muralidharan] Fix SPARK-8297
2015-07-30 10:37:53 -07:00
François Garillot 7bbf02f0bd [SPARK-9267] [CORE] Retire stringify(Partial)?Value from Accumulators
cc srowen

Author: François Garillot <francois@garillot.net>

Closes #7678 from huitseeker/master and squashes the following commits:

5e99f57 [François Garillot] [SPARK-9267][Core] Retire stringify(Partial)?Value from Accumulators
2015-07-30 18:14:08 +01:00
Reynold Xin 4a8bb9d00d Revert "[SPARK-9458] Avoid object allocation in prefix generation."
This reverts commit 9514d874f0.
2015-07-30 01:04:24 -07:00
Joseph Batchik 1221849f91 [SPARK-8005][SQL] Input file name
Users can now get the file name of the partition being read in. A thread local variable is in `SQLNewHadoopRDD` and is set when the partition is computed. `SQLNewHadoopRDD` is moved to core so that the catalyst package can reach it.

This supports:

`df.select(inputFileName())`

and

`sqlContext.sql("select input_file_name() from table")`

Author: Joseph Batchik <josephbatchik@gmail.com>

Closes #7743 from JDrit/input_file_name and squashes the following commits:

abb8609 [Joseph Batchik] fixed failing test and changed the default value to be an empty string
d2f323d [Joseph Batchik] updates per review
102061f [Joseph Batchik] updates per review
75313f5 [Joseph Batchik] small fixes
c7f7b5a [Joseph Batchik] addeding input file name to Spark SQL
2015-07-29 23:35:55 -07:00
Reynold Xin 07fd7d3647 [SPARK-9460] Avoid byte array allocation in StringPrefixComparator.
As of today, StringPrefixComparator converts the long values back to byte arrays in order to compare them. This patch optimizes this to compare the longs directly, rather than turning the longs into byte arrays and comparing them byte by byte (unsigned).

This only works on little-endian architectures right now.
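
The idea of comparing packed prefixes as unsigned integers can be illustrated in Python. This is a sketch of the general technique, not the code in this patch: `prefix` and `compare_prefix` are hypothetical names, and the bytes are packed big-endian here purely so that integer order matches byte order without any byte swapping.

```python
def prefix(s: bytes) -> int:
    """Pack the first 8 bytes (zero-padded) into one unsigned 64-bit integer."""
    return int.from_bytes(s[:8].ljust(8, b"\x00"), "big")


def compare_prefix(a: int, b: int) -> int:
    """Three-way comparison of two packed prefixes; no byte arrays are rebuilt."""
    return (a > b) - (a < b)


def compare_bytewise(x: bytes, y: bytes) -> int:
    """Reference: unsigned byte-by-byte comparison of the same padded prefixes."""
    px, py = x[:8].ljust(8, b"\x00"), y[:8].ljust(8, b"\x00")
    return (px > py) - (px < py)


samples = [b"", b"a", b"abc", b"abd", b"\xff", b"abcdefgh", b"abcdefgz"]
agree = all(
    compare_prefix(prefix(x), prefix(y)) == compare_bytewise(x, y)
    for x in samples
    for y in samples
)
```

The unsigned interpretation matters: a byte such as `0xff` must sort after ASCII letters, which a signed comparison of the packed value would get wrong.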

Author: Reynold Xin <rxin@databricks.com>

Closes #7765 from rxin/SPARK-9460 and squashes the following commits:

e4908cc [Reynold Xin] Stricter randomized tests.
4c8d094 [Reynold Xin] [SPARK-9460] Avoid byte array allocation in StringPrefixComparator.
2015-07-29 21:18:43 -07:00
Reynold Xin 9514d874f0 [SPARK-9458] Avoid object allocation in prefix generation.
In our existing sort prefix generation code, we use expression's eval method to generate the prefix, which results in object allocation for every prefix. We can use the specialized getters available on InternalRow directly to avoid the object allocation.

I also removed the FLOAT prefix, opting for converting float directly to double.

Author: Reynold Xin <rxin@databricks.com>

Closes #7763 from rxin/sort-prefix and squashes the following commits:

5dc2f06 [Reynold Xin] [SPARK-9458] Avoid object allocation in prefix generation.
2015-07-29 20:46:03 -07:00
Josh Rosen 1b0099fc62 [SPARK-9411] [SQL] Make Tungsten page sizes configurable
We need to make page sizes configurable so we can reduce them in unit tests and increase them in real production workloads.  These sizes are now controlled by a new configuration, `spark.buffer.pageSize`.  The new default is 64 megabytes.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7741 from JoshRosen/SPARK-9411 and squashes the following commits:

a43c4db [Josh Rosen] Fix pow
2c0eefc [Josh Rosen] Fix MAXIMUM_PAGE_SIZE_BYTES comment + value
bccfb51 [Josh Rosen] Lower page size to 4MB in TestHive
ba54d4b [Josh Rosen] Make UnsafeExternalSorter's page size configurable
0045aa2 [Josh Rosen] Make UnsafeShuffle's page size configurable
bc734f0 [Josh Rosen] Rename configuration
e614858 [Josh Rosen] Makes BytesToBytesMap page size configurable
2015-07-29 16:00:30 -07:00
Joseph Batchik 069a4c414d [SPARK-746] [CORE] Added Avro Serialization to Kryo
Added a custom Kryo serializer for generic Avro records to reduce the network IO
involved during a shuffle. This compresses the schema and allows for users to
register their schemas ahead of time to further reduce traffic.

Currently Kryo tries to use its default serializer for generic Records, which will include
a lot of unneeded data in each record.

Author: Joseph Batchik <joseph.batchik@cloudera.com>
Author: Joseph Batchik <josephbatchik@gmail.com>

Closes #7004 from JDrit/Avro_serialization and squashes the following commits:

8158d51 [Joseph Batchik] updated per feedback
c0cf329 [Joseph Batchik] implemented @squito suggestion for SparkEnv
dd71efe [Joseph Batchik] fixed bug with serializing
1183a48 [Joseph Batchik] updated codec settings
fa9298b [Joseph Batchik] forgot a couple of fixes
c5fe794 [Joseph Batchik] implemented @squito suggestion
0f5471a [Joseph Batchik] implemented @squito suggestion to use a codec that is already in spark
6d1925c [Joseph Batchik] fixed to changes suggested by @squito
d421bf5 [Joseph Batchik] updated pom to removed versions
ab46d10 [Joseph Batchik] Changed Avro dependency to be similar to parent
f4ae251 [Joseph Batchik] fixed serialization error in that SparkConf cannot be serialized
2b545cc [Joseph Batchik] started working on fixes for pr
97fba62 [Joseph Batchik] Added a custom Kryo serializer for generic Avro records to reduce the network IO involved during a shuffle. This compresses the schema and allows for users to register their schemas ahead of time to further reduce traffic.
2015-07-29 14:02:32 -05:00
Josh Rosen ea49705bd4 [SPARK-9419] ShuffleMemoryManager and MemoryStore should track memory on a per-task, not per-thread, basis
Spark's ShuffleMemoryManager and MemoryStore track memory on a per-thread basis, which causes problems in the handful of cases where we have tasks that use multiple threads. In PythonRDD, RRDD, ScriptTransformation, and PipedRDD we consume the input iterator in a separate thread in order to write it to an external process.  As a result, these RDD's input iterators are consumed in a different thread than the thread that created them, which can cause problems in our memory allocation tracking. For example, if allocations are performed in one thread but deallocations are performed in a separate thread then memory may be leaked or we may get errors complaining that more memory was allocated than was freed.

I think that the right way to fix this is to change our accounting to be performed on a per-task instead of per-thread basis.  Note that the current per-thread tracking has caused problems in the past; SPARK-3731 (#2668) fixes a memory leak in PythonRDD that was caused by this issue (that fix is no longer necessary as of this patch).
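
A small Python sketch of per-task accounting, assuming a hypothetical `TaskMemoryTracker` keyed by a task attempt ID (the class and method names are illustrative, not Spark's API). Memory acquired on one thread can be released on another without unbalancing the books, because the key is the task rather than the thread.

```python
import threading
from collections import defaultdict


class TaskMemoryTracker:
    """Tracks bytes in use per task attempt ID, independent of the calling thread."""

    def __init__(self):
        self._lock = threading.Lock()
        self._used = defaultdict(int)

    def acquire(self, task_id: int, num_bytes: int) -> None:
        with self._lock:
            self._used[task_id] += num_bytes

    def release(self, task_id: int, num_bytes: int) -> None:
        with self._lock:
            if num_bytes > self._used[task_id]:
                raise ValueError("released more memory than was acquired")
            self._used[task_id] -= num_bytes

    def used(self, task_id: int) -> int:
        with self._lock:
            return self._used[task_id]


tracker = TaskMemoryTracker()
tracker.acquire(task_id=7, num_bytes=100)  # allocated on the task's main thread

# Freed on a separate writer thread, as happens when an input iterator
# is consumed by a thread feeding an external process.
writer = threading.Thread(target=tracker.release, args=(7, 100))
writer.start()
writer.join()
remaining = tracker.used(7)
```

With per-thread keys, the same sequence would leave 100 bytes charged to the main thread and an unmatched release on the writer thread.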

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7734 from JoshRosen/memory-tracking-fixes and squashes the following commits:

b4b1702 [Josh Rosen] Propagate TaskContext to writer threads.
57c9b4e [Josh Rosen] Merge remote-tracking branch 'origin/master' into memory-tracking-fixes
ed25d3b [Josh Rosen] Address minor PR review comments
44f6497 [Josh Rosen] Fix long line.
7b0f04b [Josh Rosen] Fix ShuffleMemoryManagerSuite
f57f3f2 [Josh Rosen] More thread -> task changes
fa78ee8 [Josh Rosen] Move Executor's cleanup into Task so that TaskContext is defined when cleanup is performed
5e2f01e [Josh Rosen] Fix capitalization
1b0083b [Josh Rosen] Roll back fix in PySpark, which is no longer necessary
2e1e0f8 [Josh Rosen] Use TaskAttemptIds to track shuffle memory
c9e8e54 [Josh Rosen] Use TaskAttemptIds to track unroll memory
2015-07-28 21:53:28 -07:00
jerryshao ab62595661 [SPARK-4352] [YARN] [WIP] Incorporate locality preferences in dynamic allocation requests
Currently, container requests in YARN mode carry no locality preference, which hurts performance when data has to be fetched remotely. This patch proposes adding locality preferences to container requests in YARN dynamic allocation mode.

Ping sryza, please help to review, thanks a lot.

Author: jerryshao <saisai.shao@intel.com>

Closes #6394 from jerryshao/SPARK-4352 and squashes the following commits:

d45fecb [jerryshao] Add documents
6c3fe5c [jerryshao] Fix bug
8db6c0e [jerryshao] Further address the comments
2e2b2cb [jerryshao] Fix rebase compiling problem
ce5f096 [jerryshao] Fix style issue
7f7df95 [jerryshao] Fix rebase issue
9ca9e07 [jerryshao] Code refactor according to comments
d3e4236 [jerryshao] Further address the comments
5e7a593 [jerryshao] Fix bug introduced code rebase
9ca7783 [jerryshao] Style changes
08317f9 [jerryshao] code and comment refines
65b2423 [jerryshao] Further address the comments
a27c587 [jerryshao] address the comment
27faabc [jerryshao] redundant code remove
9ce06a1 [jerryshao] refactor the code
f5ba27b [jerryshao] Style fix
2c6cc8a [jerryshao] Fix bug and add unit tests
0757335 [jerryshao] Consider the distribution of existed containers to recalculate the new container requests
0ad66ff [jerryshao] Fix compile bugs
1c20381 [jerryshao] Minor fix
5ef2dc8 [jerryshao] Add docs and improve the code
3359814 [jerryshao] Fix rebase and test bugs
0398539 [jerryshao] reinitialize the new implementation
67596d6 [jerryshao] Still fix the code
654e1d2 [jerryshao] Fix some bugs
45b1c89 [jerryshao] Further polish the algorithm
dea0152 [jerryshao] Enable node locality information in YarnAllocator
74bbcc6 [jerryshao] Support node locality for dynamic allocation initial commit
2015-07-27 15:46:35 -07:00
Ryan Williams c0b7df68f8 [SPARK-9366] use task's stageAttemptId in TaskEnd event
Author: Ryan Williams <ryan.blake.williams@gmail.com>

Closes #7681 from ryan-williams/task-stage-attempt and squashes the following commits:

d6d5f0f [Ryan Williams] use task's stageAttemptId in TaskEnd event
2015-07-27 12:54:08 -05:00
Josh Rosen ecad9d4346 [SPARK-9364] Fix array out of bounds and use-after-free bugs in UnsafeExternalSorter
This patch fixes two bugs in UnsafeExternalSorter and UnsafeExternalRowSorter:

- UnsafeExternalSorter does not properly update freeSpaceInCurrentPage, which can cause it to write past the end of memory pages and trigger segfaults.
- UnsafeExternalRowSorter has a use-after-free bug when returning the last row from an iterator.
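
The first bug is a bookkeeping error, which can be illustrated with a small Python sketch. The `Page` class below is hypothetical and is not the UnsafeExternalSorter code; it only shows that the free-space counter has to shrink on every write, otherwise the capacity check stops protecting the end of the page.

```python
class Page:
    """Fixed-size page with explicit free-space bookkeeping (hypothetical sketch)."""

    def __init__(self, size):
        self.buffer = bytearray(size)
        self.cursor = 0
        self.free_space = size

    def append(self, record: bytes) -> bool:
        """Copy record into the page; return False if it does not fit."""
        if len(record) > self.free_space:
            return False  # caller is expected to allocate a new page
        end = self.cursor + len(record)
        self.buffer[self.cursor:end] = record
        self.cursor = end
        # Without this decrement the check above would keep passing
        # and later writes would run past the end of the page.
        self.free_space -= len(record)
        return True
```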

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7680 from JoshRosen/SPARK-9364 and squashes the following commits:

590f311 [Josh Rosen] null out row
f4cf91d [Josh Rosen] Fix use-after-free bug in UnsafeExternalRowSorter.
8abcf82 [Josh Rosen] Properly decrement freeSpaceInCurrentPage in UnsafeExternalSorter
2015-07-27 09:34:49 -07:00
Kay Ousterhout 6b2baec04f [SPARK-9326] Close lock file used for file downloads.
A lock file is used to ensure multiple executors running on the
same machine don't download the same file concurrently. Spark never
closes these lock files (releasing the lock does not close the
underlying file); this commit fixes that.
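
The distinction between releasing a lock and closing the file can be sketched in Python on a POSIX system. This is only an analogy to the JVM file-lock code the commit touches; `with_download_lock` and its parameters are hypothetical names.

```python
import fcntl


def with_download_lock(lock_path, action):
    """Run action(lock_file) under an exclusive lock, then unlock AND close."""
    lock_file = open(lock_path, "w")
    try:
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            return action(lock_file)
        finally:
            # Unlocking releases the lock but leaves the descriptor open.
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    finally:
        # Closing is a separate step; skipping it leaks one descriptor per download.
        lock_file.close()
```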

cc vanzin (looks like you've been involved in various other fixes surrounding these lock files)

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #7650 from kayousterhout/SPARK-9326 and squashes the following commits:

0401bd1 [Kay Ousterhout] Close lock file used for file downloads.
2015-07-26 13:35:16 -07:00
Andrew Or 1cf19760d6 [SPARK-9352] [SPARK-9353] Add tests for standalone scheduling code
This also fixes a small issue in the standalone Master that was uncovered by the new tests. For more detail, read the description of SPARK-9353.

Author: Andrew Or <andrew@databricks.com>

Closes #7668 from andrewor14/standalone-scheduling-tests and squashes the following commits:

d852faf [Andrew Or] Add tests + fix scheduling with memory limits
2015-07-26 13:03:13 -07:00
Nishkam Ravi 41a7cdf85d [SPARK-8881] [SPARK-9260] Fix algorithm for scheduling executors on workers
The current scheduling algorithm allocates one core at a time and, in doing so, ends up ignoring spark.executor.cores. As a result, when spark.cores.max / spark.executor.cores (i.e., num_executors) < num_workers, executors are not launched and the app hangs. This PR fixes and refactors the scheduling algorithm.
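
The failure mode can be reproduced with a simplified Python model. Both functions below are hypothetical sketches of the two allocation strategies, not the code in Master.scala, and they assume `executor_cores` is positive and that cores are spread across workers round-robin.

```python
def one_core_at_a_time(worker_free_cores, cores_max, executor_cores):
    """Simplified old behaviour: hand out single cores round-robin, then count
    how many full executors each worker can actually launch."""
    assigned = [0] * len(worker_free_cores)
    remaining = min(cores_max, sum(worker_free_cores))
    pos = 0
    while remaining > 0:
        if assigned[pos] < worker_free_cores[pos]:
            assigned[pos] += 1
            remaining -= 1
        pos = (pos + 1) % len(worker_free_cores)
    return [a // executor_cores for a in assigned]


def executor_sized_chunks(worker_free_cores, cores_max, executor_cores):
    """Simplified fixed behaviour: hand out executor_cores at a time."""
    assigned = [0] * len(worker_free_cores)
    remaining = cores_max
    progress = True
    while remaining >= executor_cores and progress:
        progress = False
        for i, free in enumerate(worker_free_cores):
            if remaining >= executor_cores and free - assigned[i] >= executor_cores:
                assigned[i] += executor_cores
                remaining -= executor_cores
                progress = True
    return [a // executor_cores for a in assigned]
```

With four workers of 16 free cores each, `cores_max=48` and `executor_cores=16`, the first function gives every worker 12 cores and returns `[0, 0, 0, 0]`, so no executor can start. The second returns `[1, 1, 1, 0]`.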

andrewor14

Author: Nishkam Ravi <nravi@cloudera.com>
Author: nishkamravi2 <nishkamravi@gmail.com>

Closes #7274 from nishkamravi2/master_scheduler and squashes the following commits:

b998097 [nishkamravi2] Update Master.scala
da0f491 [Nishkam Ravi] Update Master.scala
79084e8 [Nishkam Ravi] Update Master.scala
1daf25f [Nishkam Ravi] Update Master.scala
f279cdf [Nishkam Ravi] Update Master.scala
adec84b [Nishkam Ravi] Update Master.scala
a06da76 [nishkamravi2] Update Master.scala
40c8f9f [nishkamravi2] Update Master.scala (to trigger retest)
c11c689 [nishkamravi2] Update EventLoggingListenerSuite.scala
5d6a19c [nishkamravi2] Update Master.scala (for the purpose of issuing a retest)
2d6371c [Nishkam Ravi] Update Master.scala
66362d5 [nishkamravi2] Update Master.scala
ee7cf0e [Nishkam Ravi] Improved scheduling algorithm for executors
2015-07-25 22:56:25 -07:00
Liang-Chi Hsieh 64135cbb33 [SPARK-9067] [SQL] Close reader in NewHadoopRDD early if there is no more data
JIRA: https://issues.apache.org/jira/browse/SPARK-9067

According to the JIRA ticket, calling `reader.close()` only after the task has finished can exhaust memory and hit the open-file limit, because these resources stay occupied even when they are no longer needed.

This PR simply closes the reader early when we know there is no more data to read.
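
The idea can be sketched in Python as an iterator that closes its reader as soon as it runs out of data. `EarlyClosingIterator` is a hypothetical name; the reader is assumed to be any object with `readline()` and `close()`, and nothing here is the NewHadoopRDD code itself.

```python
class EarlyClosingIterator:
    """Iterate over lines and close the reader as soon as it is exhausted."""

    def __init__(self, reader):
        self._reader = reader
        self._closed = False

    def close(self):
        # Safe to call again later, e.g. from an end-of-task hook.
        if not self._closed:
            self._closed = True
            self._reader.close()

    def __iter__(self):
        return self

    def __next__(self):
        if self._closed:
            raise StopIteration
        line = self._reader.readline()
        if line == "":
            self.close()  # release the resource now, not at task end
            raise StopIteration
        return line
```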

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #7424 from viirya/close_reader and squashes the following commits:

3ff64e5 [Liang-Chi Hsieh] For comments.
3d20267 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into close_reader
e152182 [Liang-Chi Hsieh] For comments.
5116cbe [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into close_reader
3ceb755 [Liang-Chi Hsieh] For comments.
e34d98e [Liang-Chi Hsieh] For comments.
50ed729 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into close_reader
216912f [Liang-Chi Hsieh] Fix it.
f429016 [Liang-Chi Hsieh] Release reader if we don't need it.
a305621 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into close_reader
67569da [Liang-Chi Hsieh] Close reader early if there is no more data.
2015-07-24 12:36:44 -07:00
Marcelo Vanzin 8399ba1487 [SPARK-9261] [STREAMING] Avoid calling APIs that expose shaded classes.
Doing this may cause weird errors when tests are run on maven, depending
on the flags used. Instead, expose the needed functionality through methods
that do not expose shaded classes.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #7601 from vanzin/SPARK-9261 and squashes the following commits:

4f64a16 [Marcelo Vanzin] [SPARK-9261] [streaming] Avoid calling APIs that expose shaded classes.
2015-07-24 11:53:16 -07:00
Reynold Xin c8d71a4183 [SPARK-9305] Rename org.apache.spark.Row to Item.
It's a thing used in test cases, but named Row. Pretty annoying because every time I search for Row, it shows up before the Spark SQL Row, which is what a developer wants most of the time.

Author: Reynold Xin <rxin@databricks.com>

Closes #7638 from rxin/remove-row and squashes the following commits:

aeda52d [Reynold Xin] [SPARK-9305] Rename org.apache.spark.Row to Item.
2015-07-24 09:38:13 -07:00
François Garillot 6cd28cc21e [SPARK-9236] [CORE] Make defaultPartitioner not reuse a parent RDD's partitioner if it has 0 partitions
See also comments on https://issues.apache.org/jira/browse/SPARK-9236

Author: François Garillot <francois@garillot.net>

Closes #7616 from huitseeker/issue/SPARK-9236 and squashes the following commits:

217f902 [François Garillot] [SPARK-9236] Make defaultPartitioner not reuse a parent RDD's partitioner if it has 0 partitions
2015-07-24 15:41:13 +01:00
Yijie Shen d2666a3c70 [SPARK-9183] confusing error message when looking up missing function in Spark SQL
JIRA: https://issues.apache.org/jira/browse/SPARK-9183

cc rxin

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #7613 from yjshen/npe_udf and squashes the following commits:

44f58f2 [Yijie Shen] add jira ticket number
903c963 [Yijie Shen] add explanation comments
f44dd3c [Yijie Shen] Change two hive class LogLevel to avoid annoying messages
2015-07-23 10:31:12 -07:00
Cheng Hao 19aeab57c1 [Build][Minor] Fix building error & performance
1. Building the latest code with sbt fails with an error like:
[error] /home/hcheng/git/catalyst/core/src/main/scala/org/apache/spark/ui/jobs/AllJobsPage.scala:78: match may not be exhaustive.
[error] It would fail on the following input: UNKNOWN
[error]       val classNameByStatus = status match {
[error]

2. Potential performance issue when implicitly converting an Array[Any] to Seq[Any]

Author: Cheng Hao <hao.cheng@intel.com>

Closes #7611 from chenghao-intel/toseq and squashes the following commits:

cab75c5 [Cheng Hao] remove the toArray
24df682 [Cheng Hao] fix building error & performance
2015-07-23 10:28:20 -07:00
Josh Rosen ac3ae0f2be [SPARK-9266] Prevent "managed memory leak detected" exception from masking original exception
When a task fails with an exception and also fails to properly clean up its managed memory, the `spark.unsafe.exceptionOnMemoryLeak` memory leak detection mechanism's exceptions will mask the original exception that caused the task to fail. We should throw the memory leak exception only if no other exception occurred.
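
A minimal Python sketch of the intended control flow. `run_with_leak_check`, `MemoryLeakError`, and the `leaked_bytes_after` callback are hypothetical names, not Spark's implementation.

```python
class MemoryLeakError(Exception):
    """Raised when a task finishes without releasing all of its managed memory."""


def run_with_leak_check(body, leaked_bytes_after):
    """Run body(); report a leak only if body() itself did not fail."""
    task_failed = False
    try:
        return body()
    except BaseException:
        task_failed = True
        raise  # keep the original exception visible
    finally:
        leaked = leaked_bytes_after()
        if leaked > 0 and not task_failed:
            raise MemoryLeakError(f"managed memory leak detected: {leaked} bytes")
```

If the body raises, the caller sees that exception even when memory was also leaked. The leak error surfaces only for tasks that otherwise succeeded.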

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7603 from JoshRosen/SPARK-9266 and squashes the following commits:

c268cb5 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-9266
c1f0167 [Josh Rosen] Fix the error masking problem
448eae8 [Josh Rosen] Add regression test
2015-07-23 00:43:26 -07:00
Perinkulam I. Ganesh b983d493b4 [SPARK-8695] [CORE] [MLLIB] TreeAggregation shouldn't be triggered when it doesn't save wall-clock time.
Author: Perinkulam I. Ganesh <gip@us.ibm.com>

Closes #7397 from piganesh/SPARK-8695 and squashes the following commits:

041620c [Perinkulam I. Ganesh] [SPARK-8695][CORE][MLlib] TreeAggregation shouldn't be triggered when it doesn't save wall-clock time.
9ad067c [Perinkulam I. Ganesh] [SPARK-8695] [core] [WIP] TreeAggregation shouldn't be triggered for 5 partitions
a6fed07 [Perinkulam I. Ganesh] [SPARK-8695] [core] [WIP] TreeAggregation shouldn't be triggered for 5 partitions
2015-07-23 07:47:42 +01:00
Josh Rosen b217230f2a [SPARK-9144] Remove DAGScheduler.runLocallyWithinThread and spark.localExecution.enabled
Spark has an option called spark.localExecution.enabled; according to the docs:

> Enables Spark to run certain jobs, such as first() or take() on the driver, without sending tasks to the cluster. This can make certain jobs execute very quickly, but may require shipping a whole partition of data to the driver.

This feature ends up adding quite a bit of complexity to DAGScheduler, especially in the runLocallyWithinThread method, but as far as I know nobody uses this feature (I searched the mailing list and haven't seen any recent mentions of the configuration nor stacktraces including the runLocally method). As a step towards scheduler complexity reduction, I propose that we remove this feature and all code related to it for Spark 1.5.

This pull request simply brings #7484 up to date.

Author: Josh Rosen <joshrosen@databricks.com>
Author: Reynold Xin <rxin@databricks.com>

Closes #7585 from rxin/remove-local-exec and squashes the following commits:

84bd10e [Reynold Xin] Python fix.
1d9739a [Reynold Xin] Merge pull request #7484 from JoshRosen/remove-localexecution
eec39fa [Josh Rosen] Remove allowLocal(); deprecate user-facing uses of it.
b0835dc [Josh Rosen] Remove local execution code in DAGScheduler
8975d96 [Josh Rosen] Remove local execution tests.
ffa8c9b [Josh Rosen] Remove documentation for configuration
2015-07-22 21:04:04 -07:00
Reynold Xin d71a13f475 [SPARK-9262][build] Treat Scala compiler warnings as errors
Over the past few weeks I've seen a few cases where the compiler emitted warnings that were caused by legitimate bugs. This patch upgrades warnings to errors, except deprecation warnings.

Note that ideally we should be able to mark deprecation warnings as errors as well. However, due to the lack of ability to suppress individual warning messages in the Scala compiler, we cannot do that (since we do need to access deprecated APIs in Hadoop).

Most of the work was done by ericl.

Author: Reynold Xin <rxin@databricks.com>
Author: Eric Liang <ekl@databricks.com>

Closes #7598 from rxin/warnings and squashes the following commits:

beb311b [Reynold Xin] Fixed tests.
542c031 [Reynold Xin] Fixed one more warning.
87c354a [Reynold Xin] Fixed all non-deprecation warnings.
78660ac [Eric Liang] first effort to fix warnings
2015-07-22 21:02:19 -07:00
Matei Zaharia fe26584a1f [SPARK-9244] Increase some memory defaults
There are a few memory limits that people hit often and that we could
make higher, especially now that memory sizes have grown.

- spark.akka.frameSize: This defaults to 10 (MB) but is often hit for map
  output statuses in large shuffles. This memory is not fully allocated
  up-front, so we can just make this larger and still not affect jobs
  that never send a status that large. We increase it to 128.

- spark.executor.memory: Defaults to 512m, which is really small. We
  increase it to 1g.

Author: Matei Zaharia <matei@databricks.com>

Closes #7586 from mateiz/configs and squashes the following commits:

ce0038a [Matei Zaharia] [SPARK-9244] Increase some memory defaults
2015-07-22 15:28:09 -07:00
zsxwing d45355ee22 [SPARK-5423] [CORE] Register a TaskCompletionListener to make sure release all resources
Make `DiskMapIterator.cleanup` idempotent and register a TaskCompletionListener to make sure `cleanup` is called.
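
A Python sketch of the pattern. `TaskContext` and `SpillIterator` below are toy stand-ins with hypothetical method names, not Spark's classes. The point is that cleanup may be triggered both by exhausting the iterator and by the completion listener, so it has to tolerate being called twice.

```python
class TaskContext:
    """Toy task context that runs registered callbacks when the task completes."""

    def __init__(self):
        self._listeners = []

    def add_completion_listener(self, fn):
        self._listeners.append(fn)

    def mark_completed(self):
        for fn in self._listeners:
            fn()


class SpillIterator:
    """Iterator over spilled rows whose cleanup runs exactly once."""

    def __init__(self, context, rows):
        self._rows = iter(rows)
        self._cleaned = False
        self.cleanup_calls = 0  # counts real resource releases
        context.add_completion_listener(self.cleanup)

    def cleanup(self):
        if self._cleaned:
            return  # idempotent: later calls are no-ops
        self._cleaned = True
        self.cleanup_calls += 1

    def __iter__(self):
        return self

    def __next__(self):
        try:
            return next(self._rows)
        except StopIteration:
            self.cleanup()
            raise
```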

Author: zsxwing <zsxwing@gmail.com>

Closes #7529 from zsxwing/SPARK-5423 and squashes the following commits:

3e3c413 [zsxwing] Remove TODO
9556c78 [zsxwing] Fix NullPointerException for tests
3d574d9 [zsxwing] Register a TaskCompletionListener to make sure release all resources
2015-07-21 09:55:42 -07:00
zsxwing 4f7f1ee378 [SPARK-4598] [WEBUI] Task table pagination for the Stage page
This PR adds pagination for the task table to solve the scalability issue of the stage page. Here is the initial screenshot:
<img width="1347" alt="pagination" src="https://cloud.githubusercontent.com/assets/1000778/8679669/9e63863c-2a8e-11e5-94e4-994febcd6717.png">
The task table shows only 100 tasks at a time. A page navigation bar sits above the table; users can click it or type a page number to jump to another page. The table can be sorted by clicking the headers. However, unlike the previous implementation, sorting is now done on the server, so clicking a column header to sort refreshes the web page.
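
Server-side sorting followed by paging can be sketched in a few lines of Python. `task_page` and the dictionary-shaped task rows are assumptions made for illustration; they are not the Stage page code.

```python
def task_page(tasks, sort_key, descending, page, page_size=100):
    """Sort all tasks on the server, then return one page plus the page count."""
    if page_size <= 0:
        raise ValueError("page size must be positive")
    total_pages = max(1, -(-len(tasks) // page_size))  # ceiling division
    if not 1 <= page <= total_pages:
        raise ValueError(f"page must be between 1 and {total_pages}")
    ordered = sorted(tasks, key=lambda t: t[sort_key], reverse=descending)
    start = (page - 1) * page_size
    return ordered[start:start + page_size], total_pages
```

Because the whole table is sorted before slicing, every page reflects the global order, which is why a sort request needs a round trip to the server.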

Author: zsxwing <zsxwing@gmail.com>

Closes #7399 from zsxwing/task-table-pagination and squashes the following commits:

144f513 [zsxwing] Display the page navigation when the page number is out of range
a3eee22 [zsxwing] Add extra space for the error message
54c5b84 [zsxwing] Reset page to 1 if the user changes the page size
c2f7f39 [zsxwing] Add a text field to let users fill the page size
bad52eb [zsxwing] Display user-friendly error messages
410586b [zsxwing] Scroll down to the tasks table if the url contains any sort column
a0746d1 [zsxwing] Use expand-dag-viz-arrow-job and expand-dag-viz-arrow-stage instead of expand-dag-viz-arrow-true and expand-dag-viz-arrow-false
b123f67 [zsxwing] Use localStorage to remember the user's actions and replay them when loading the page
894a342 [zsxwing] Show the link cursor when hovering for headers and page links and other minor fix
4d4fecf [zsxwing] Address Carson's comments
d9285f0 [zsxwing] Add comments and fix the style
74285fa [zsxwing] Merge branch 'master' into task-table-pagination
db6c859 [zsxwing] Task table pagination for the Stage page
2015-07-21 09:54:39 -07:00
Jacek Lewandowski 31954910d6 [SPARK-7171] Added a method to retrieve metrics sources in TaskContext
Author: Jacek Lewandowski <lewandowski.jacek@gmail.com>

Closes #5805 from jacek-lewandowski/SPARK-7171 and squashes the following commits:

ed20bda [Jacek Lewandowski] SPARK-7171: Added a method to retrieve metrics sources in TaskContext
2015-07-21 09:53:33 -07:00
Liang-Chi Hsieh 9a4fd875b3 [SPARK-9128] [CORE] Get outerclasses and objects with only one method calling in ClosureCleaner
JIRA: https://issues.apache.org/jira/browse/SPARK-9128

Currently, `ClosureCleaner` retrieves the outer classes and outer objects using two different methods. However, the two methods share the same logic, so we can get both the outer classes and the objects with a single method call.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #7459 from viirya/remove_extra_closurecleaner and squashes the following commits:

7c9858d [Liang-Chi Hsieh] For comments.
a096941 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into remove_extra_closurecleaner
2ec5ce1 [Liang-Chi Hsieh] Remove unnecessary methods.
4df5a51 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into remove_extra_closurecleaner
dc110d1 [Liang-Chi Hsieh] Add method to get outerclasses and objects at the same time.
2015-07-21 09:52:27 -07:00
Ben f67da43c39 [SPARK-9036] [CORE] SparkListenerExecutorMetricsUpdate messages not included in JsonProtocol
This PR implements a JSON serializer and deserializer in JsonProtocol to handle the (de)serialization of SparkListenerExecutorMetricsUpdate events. It also includes a unit test in JsonProtocolSuite. This was implemented to satisfy the improvement request in JIRA issue SPARK-9036.

Author: Ben <benjaminpiering@gmail.com>

Closes #7555 from NamelessAnalyst/master and squashes the following commits:

fb4e3cc [Ben] Update JSON Protocol and tests
aa69517 [Ben] Update JSON Protocol and tests --Corrected Stage Attempt to Stage Attempt ID
33e5774 [Ben] Update JSON Protocol Tests
3f237e7 [Ben] Update JSON Protocol Tests
84ca798 [Ben] Update JSON Protocol Tests
cde57a0 [Ben] Update JSON Protocol Tests
8049600 [Ben] Update JSON Protocol Tests
c5bc061 [Ben] Update JSON Protocol Tests
6f25785 [Ben] Merge remote-tracking branch 'origin/master'
df2a609 [Ben] Update JSON Protocol
dcda80b [Ben] Update JSON Protocol
2015-07-21 09:51:13 -07:00
Grace 6592a6058e [SPARK-9193] Avoid assigning tasks to "lost" executor(s)
Currently, when executors are killed by dynamic allocation, tasks are sometimes mis-assigned to the lost executors. Such mis-assignment causes task failures, or even job failure if the error repeats 4 times.

The root cause is that ***killExecutors*** does not remove the executors being killed right away. It relies on the ***OnDisassociated*** event to refresh the active working list later. The delay depends on your cluster status (from several milliseconds to sub-minute). When new tasks are scheduled during that period, they are assigned to executors that are still "active" but being killed, and then fail with "executor lost". The better way is to exclude the executors being killed in makeOffers(), so that tasks are no longer allocated to executors that are about to be lost.
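
A Python sketch of the fix as described. The `SchedulerBackend` class and its snake_case methods are hypothetical stand-ins for the scheduler backend, chosen only to show where the pending-kill set is consulted.

```python
class SchedulerBackend:
    """Toy backend that tracks free cores and executors that are being killed."""

    def __init__(self, free_cores_by_executor):
        self.free_cores = dict(free_cores_by_executor)
        self.pending_kill = set()

    def kill_executors(self, executor_ids):
        # Record the kill request immediately instead of waiting for the
        # disassociation event to arrive.
        self.pending_kill.update(executor_ids)

    def on_disassociated(self, executor_id):
        self.free_cores.pop(executor_id, None)
        self.pending_kill.discard(executor_id)

    def make_offers(self):
        """Offer resources only from executors that are not being killed."""
        return [
            (eid, cores)
            for eid, cores in sorted(self.free_cores.items())
            if eid not in self.pending_kill and cores > 0
        ]
```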

Author: Grace <jie.huang@intel.com>

Closes #7528 from GraceH/AssignToLostExecutor and squashes the following commits:

ecc1da6 [Grace] scala style fix
6e2ed96 [Grace] Re-word makeOffers by more readable lines
b5546ce [Grace] Add comments about the fix
30a9ad0 [Grace] Avoid assigning tasks to lost executors
2015-07-21 11:35:49 -05:00
Kay Ousterhout 6364735bcc [SPARK-8875] Remove BlockStoreShuffleFetcher class
The shuffle code has gotten increasingly difficult to read as it has evolved, and many classes
have evolved significantly since they were originally created. The BlockStoreShuffleFetcher class
now serves little purpose other than to make the code more difficult to read; this commit moves its
functionality into the ShuffleBlockFetcherIterator class.

cc massie JoshRosen (Josh, this PR also removes the Try you pointed out as being confusing / not necessarily useful in a previous comment).  Matt, would be helpful to know whether this will interfere in any negative ways with your new shuffle PR (I took a look and it seems like this should still cleanly integrate with your parquet work, but want to double check).

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #7268 from kayousterhout/SPARK-8875 and squashes the following commits:

2b24a97 [Kay Ousterhout] Fixed DAGSchedulerSuite compile error
98a1831 [Kay Ousterhout] Merge remote-tracking branch 'upstream/master' into SPARK-8875
90f0e89 [Kay Ousterhout] Fixed broken test
14bfcbb [Kay Ousterhout] Last style fix
bc69d2b [Kay Ousterhout] Style improvements based on Josh's code review
ad3c8d1 [Kay Ousterhout] Better documentation for MapOutputTracker methods
0bc0e59 [Kay Ousterhout] [SPARK-8875] Remove BlockStoreShuffleFetcher class
2015-07-21 01:12:51 -07:00
Josh Rosen c032b0bf92 [SPARK-8797] [SPARK-9146] [SPARK-9145] [SPARK-9147] Support NaN ordering and equality comparisons in Spark SQL
This patch addresses an issue where queries that sorted float or double columns containing NaN values could fail with "Comparison method violates its general contract!" errors from TimSort.  The root of this problem is that `NaN > anything`, `NaN == anything`, and `NaN < anything` all return `false`.

Per the design specified in SPARK-9079, we have decided that `NaN = NaN` should return true and that NaN should appear last when sorting in ascending order (i.e. it is larger than any other numeric value).

In addition to implementing these semantics, this patch also adds canonicalization of NaN values in UnsafeRow, which is necessary in order to be able to do binary equality comparisons on equal NaNs that might have different bit representations (see SPARK-9147).
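
The three rules (NaN equals NaN, NaN sorts last in ascending order, and all NaNs share one bit pattern) can be sketched in Python. The function names and the choice of `0x7FF8000000000000` as the canonical pattern are assumptions for this sketch, not a description of the UnsafeRow code.

```python
import math
import struct

# Assumed canonical quiet-NaN bit pattern for 64-bit doubles.
CANONICAL_NAN_BITS = 0x7FF8000000000000


def sort_key(x):
    """Total-order key: every NaN sorts after every other value, even +inf."""
    return (math.isnan(x), 0.0 if math.isnan(x) else x)


def sql_equal(a, b):
    """Equality under which NaN = NaN is true."""
    return (math.isnan(a) and math.isnan(b)) or a == b


def canonical_bits(x):
    """Bit pattern of x, with every NaN mapped to a single representation."""
    if math.isnan(x):
        return CANONICAL_NAN_BITS
    return struct.unpack(">Q", struct.pack(">d", x))[0]
```

Sorting with `sorted(values, key=sort_key)` never hits the inconsistent comparisons that trip TimSort, because the key is an ordinary tuple with a well-defined order.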

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7194 from JoshRosen/nan and squashes the following commits:

983d4fc [Josh Rosen] Merge remote-tracking branch 'origin/master' into nan
88bd73c [Josh Rosen] Fix Row.equals()
a702e2e [Josh Rosen] normalization -> canonicalization
a7267cf [Josh Rosen] Normalize NaNs in UnsafeRow
fe629ae [Josh Rosen] Merge remote-tracking branch 'origin/master' into nan
fbb2a29 [Josh Rosen] Fix NaN comparisons in BinaryComparison expressions
c1fd4fe [Josh Rosen] Fold NaN test into existing test framework
b31eb19 [Josh Rosen] Uncomment failing tests
7fe67af [Josh Rosen] Support NaN == NaN (SPARK-9145)
58bad2c [Josh Rosen] Revert "Compare rows' string representations to work around NaN incomparability."
fc6b4d2 [Josh Rosen] Update CodeGenerator
3998ef2 [Josh Rosen] Remove unused code
a2ba2e7 [Josh Rosen] Fix prefix comparision for NaNs
a30d371 [Josh Rosen] Compare rows' string representations to work around NaN incomparability.
6f03f85 [Josh Rosen] Fix bug in Double / Float ordering
42a1ad5 [Josh Rosen] Stop filtering NaNs in UnsafeExternalSortSuite
bfca524 [Josh Rosen] Change ordering so that NaN is maximum value.
8d7be61 [Josh Rosen] Update randomized test to use ScalaTest's assume()
b20837b [Josh Rosen] Add failing test for new NaN comparision ordering
5b88b2b [Josh Rosen] Fix compilation of CodeGenerationSuite
d907b5b [Josh Rosen] Merge remote-tracking branch 'origin/master' into nan
630ebc5 [Josh Rosen] Specify an ordering for NaN values.
9bf195a [Josh Rosen] Re-enable NaNs in CodeGenerationSuite to produce more regression tests
13fc06a [Josh Rosen] Add regression test for NaN sorting issue
f9efbb5 [Josh Rosen] Fix ORDER BY NULL
e7dc4fb [Josh Rosen] Add very generic test for ordering
7d5c13e [Josh Rosen] Add regression test for SPARK-8782 (ORDER BY NULL)
b55875a [Josh Rosen] Generate doubles and floats over entire possible range.
5acdd5c [Josh Rosen] Infinity and NaN are interesting.
ab76cbd [Josh Rosen] Move code to Catalyst package.
d2b4a4a [Josh Rosen] Add random data generator test utilities to Spark SQL.
2015-07-20 22:38:05 -07:00
Carson Wang 66bb8003b9 [SPARK-9187] [WEBUI] Timeline view may show negative value for running tasks
For running tasks, the executorRunTime metric is 0, which causes a negative executorComputingTime in the timeline. It also causes an incorrect SchedulerDelay time.
![timelinenegativevalue](https://cloud.githubusercontent.com/assets/9278199/8770953/f4362378-2eec-11e5-81e6-a06a07c04794.png)

Author: Carson Wang <carson.wang@intel.com>

Closes #7526 from carsonwang/timeline-negValue and squashes the following commits:

7b17db2 [Carson Wang] Fix negative value in timeline view
2015-07-20 18:08:59 -07:00
Cheng Lian a1064df0ee [SPARK-8125] [SQL] Accelerates Parquet schema merging and partition discovery
This PR tries to accelerate Parquet schema discovery and `HadoopFsRelation` partition discovery.  The acceleration is done by the following means:

- Turning off schema merging by default

  Schema merging is not the most common case, but requires reading footers of all Parquet part-files and can be very slow.

- Avoiding `FileSystem.globStatus()` call when possible

  `FileSystem.globStatus()` may issue multiple synchronous RPC calls, and can be very slow (esp. on S3).  This PR adds `SparkHadoopUtil.globPathIfNecessary()`, which only issues RPC calls when the path contains glob-pattern specific character(s) (`{}[]*?\`).

  This is especially useful when converting a metastore Parquet table with lots of partitions, since Spark SQL adds all partition directories as the input paths, and currently we do a `globStatus` call on each input path sequentially.

- Listing leaf files in parallel when the number of input paths exceeds a threshold

  Listing leaf files is required by partition discovery.  Currently it is done on the driver side, and can be slow when there are lots of (nested) directories, since each `FileSystem.listStatus()` call issues an RPC.  In this PR, we list leaf files in BFS order, and resort to a Spark job once the number of directories that need to be listed exceeds a threshold.

  The threshold is controlled by `SQLConf` option `spark.sql.sources.parallelPartitionDiscovery.threshold`, which defaults to 32.

- Discovering Parquet schema in parallel

  Currently, schema merging is also done on driver side, and needs to read footers of all part-files.  This PR uses a Spark job to do schema merging.  Together with task side metadata reading in Parquet 1.7.0, we never read any footers on driver side now.
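
The glob-avoidance step from the list above can be sketched in Python. `glob_path_if_necessary` mirrors the name of the Scala helper but is a hypothetical re-creation; the `expand` parameter stands in for the expensive file-system call, and Python's own `glob.glob` default does not understand `{}` alternation, so it is only a placeholder.

```python
import glob

# Characters the commit message lists as glob-pattern specific.
GLOB_CHARS = set("{}[]*?\\")


def glob_path_if_necessary(path, expand=glob.glob):
    """Expand path only when it contains a glob character; otherwise return it as-is."""
    if any(ch in GLOB_CHARS for ch in path):
        return expand(path)  # the potentially slow remote call
    return [path]  # plain path: no file-system round trip needed
```

For a metastore table with many partition directories, every plain input path takes the second branch, so the number of expansion calls drops from one per path to zero.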

Author: Cheng Lian <lian@databricks.com>

Closes #7396 from liancheng/accel-parquet and squashes the following commits:

5598efc [Cheng Lian] Uses ParquetInputFormat[InternalRow] instead of ParquetInputFormat[Row]
ff32cd0 [Cheng Lian] Excludes directories while listing leaf files
3c580f1 [Cheng Lian] Fixes test failure caused by making "mergeSchema" default to "false"
b1646aa [Cheng Lian] Should allow empty input paths
32e5f0d [Cheng Lian] Moves schema merging to executor side
2015-07-20 16:42:43 -07:00