This was omitted in #1260. @aarondav
Author: Reynold Xin <rxin@apache.org>
Closes#1300 from rxin/historyServer and squashes the following commits:
af720a3 [Reynold Xin] Added SignalLogger to HistoryServer.
JIRA: https://issues.apache.org/jira/browse/SPARK-2282
This issue is caused by a buildup of sockets in the TIME_WAIT stage of TCP, which is a stage that lasts for some period of time after the communication closes.
This solution simply allows us to reuse sockets that are in TIME_WAIT, to avoid issues with the buildup of the rapid creation of these sockets.
Author: Aaron Davidson <aaron@databricks.com>
Closes#1220 from aarondav/SPARK-2282 and squashes the following commits:
2e5cab3 [Aaron Davidson] SPARK-2282: Reuse PySpark Accumulator sockets to avoid crashing Spark
**Problem.** The existing code in `ExecutorPage.scala` requires a linear scan through all the blocks to filter out the uncached ones. Every refresh could be expensive if there are many blocks and many executors.
**Solution.** The proper semantics should be the following: `StorageStatusListener` should contain only block statuses that are cached. This means as soon as a block is unpersisted by any mean, its status should be removed. This is reflected in the changes made in `StorageStatusListener.scala`.
Further, the `StorageTab` must stop relying on the `StorageStatusListener` changing a dropped block's status to `StorageLevel.NONE` (which no longer happens). This is reflected in the changes made in `StorageTab.scala` and `StorageUtils.scala`.
----------
If you have been following this chain of PRs like pwendell, you will quickly notice that this reverts the changes in #1249, which reverts the changes in #1080. In other words, we are adding back the changes from #1080, and fixing SPARK-2307 on top of those changes. Please ask questions if you are confused.
Author: Andrew Or <andrewor14@gmail.com>
Closes#1255 from andrewor14/storage-ui-fix-reprise and squashes the following commits:
45416fa [Andrew Or] Merge branch 'master' of github.com:apache/spark into storage-ui-fix-reprise
a82ea25 [Andrew Or] Add tests for StorageStatusListener
8773b01 [Andrew Or] Update comment / minor changes
3afde3f [Andrew Or] Correctly report the number of blocks on SparkUI
Prior to this change, we could throw a NPE if we launch a driver while another one is waiting, because removing from an iterator while iterating over it is not safe.
Author: Aaron Davidson <aaron@databricks.com>
Closes#1289 from aarondav/master-fail and squashes the following commits:
1cf1cf4 [Aaron Davidson] SPARK-2350: Don't NPE while launching drivers
Workaround Hadoop conf ConcurrentModification issue
Author: Raymond Liu <raymond.liu@intel.com>
Closes#1273 from colorant/hadoopRDD and squashes the following commits:
994e98b [Raymond Liu] Address comments
e2cda3d [Raymond Liu] Workaround Hadoop conf ConcurrentModification issue
It did not handle null keys very gracefully before.
Author: Andrew Or <andrewor14@gmail.com>
Closes#1288 from andrewor14/fix-external and squashes the following commits:
312b8d8 [Andrew Or] Abstract key hash code
ed5adf9 [Andrew Or] Fix NPE for ExternalAppendOnlyMap
The spark.local.dir is configured as a list of multiple paths as follows /data1/sparkenv/local,/data2/sparkenv/local. If the disk data2 of the driver node has error, the application will exit since DiskBlockManager exits directly at createLocalDirs. If the disk data2 of the worker node has error, the executor will exit either.
DiskBlockManager should not exit directly at createLocalDirs if one of spark.local.dir has error. Since spark.local.dir has multiple paths, a problem should not affect the overall situation.
I think DiskBlockManager could ignore the bad directory at createLocalDirs.
Author: yantangzhai <tyz0303@163.com>
Closes#1274 from YanTangZhai/SPARK-2324 and squashes the following commits:
609bf48 [yantangzhai] [SPARK-2324] SparkContext should not exit directly when spark.local.dir is a list of multiple paths and one of them has error
df08673 [yantangzhai] [SPARK-2324] SparkContext should not exit directly when spark.local.dir is a list of multiple paths and one of them has error
This functionality was added in an earlier commit but shortly
after was removed due to a bad git merge (totally my fault).
Author: Kay Ousterhout <kayousterhout@gmail.com>
Closes#1149 from kayousterhout/warning_bug and squashes the following commits:
3f1bb00 [Kay Ousterhout] Fixed Json tests
462a664 [Kay Ousterhout] Removed task set name from warning message
e89b2f6 [Kay Ousterhout] Fixed Json tests.
7af424c [Kay Ousterhout] Emit warning when task size exceeds a threshold.
Author: Reynold Xin <rxin@apache.org>
Closes#1260 from rxin/signalhandler1 and squashes the following commits:
8e73552 [Reynold Xin] Uh add Logging back in ApplicationMaster.
0402ba8 [Reynold Xin] Synchronize SignalLogger.register.
dc70705 [Reynold Xin] Added SignalLogger to YARN ApplicationMaster.
79a21b4 [Reynold Xin] Added license header.
0da052c [Reynold Xin] Added the SignalLogger itself.
e587d2e [Reynold Xin] [SPARK-2318] When exiting on a signal, print the signal name first.
This should go into 1.0.1.
Author: Reynold Xin <rxin@apache.org>
Closes#1264 from rxin/SPARK-2322 and squashes the following commits:
c77c07f [Reynold Xin] Added comment to SparkDriverExecutionException and a test case for accumulator.
5d8d920 [Reynold Xin] [SPARK-2322] Exception in resultHandler could crash DAGScheduler and shutdown SparkContext.
I could settle with this being a debug also if we provided an example of how to turn it on in `log4j.properties`
https://issues.apache.org/jira/browse/SPARK-2077
Author: Andrew Ash <andrew@andrewash.com>
Closes#1017 from ash211/SPARK-2077 and squashes the following commits:
580f680 [Andrew Ash] Drop to debug
0266415 [Andrew Ash] SPARK-2077 Log serializer that actually ends up being used
These commits cause `ClosureCleaner.clean` to attempt to serialize the cleaned closure with the default closure serializer and throw a `SparkException` if doing so fails. This behavior is enabled by default but can be disabled at individual callsites of `SparkContext.clean`.
Commit 98e01ae8 fixes some no-op assertions in `GraphSuite` that this work exposed; I'm happy to put that in a separate PR if that would be more appropriate.
Author: William Benton <willb@redhat.com>
Closes#143 from willb/spark-897 and squashes the following commits:
bceab8a [William Benton] Commented DStream corner cases for serializability checking.
64d04d2 [William Benton] FailureSuite now checks both messages and causes.
3b3f74a [William Benton] Stylistic and doc cleanups.
b215dea [William Benton] Fixed spurious failures in ImplicitOrderingSuite
be1ecd6 [William Benton] Don't check serializability of DStream transforms.
abe816b [William Benton] Make proactive serializability checking optional.
5bfff24 [William Benton] Adds proactive closure-serializablilty checking
ed2ccf0 [William Benton] Test cases for SPARK-897.
Details can be see in [SPARK-2104](https://issues.apache.org/jira/browse/SPARK-2104). This work is based on Reynold's work, add some unit tests to validate the issue.
@rxin , would you please take a look at this PR, thanks a lot.
Author: jerryshao <saisai.shao@intel.com>
Closes#1245 from jerryshao/SPARK-2104 and squashes the following commits:
c8ee362 [jerryshao] Make field partitions transient
2b41917 [jerryshao] Minor changes
47d763c [jerryshao] Fix task serializing issue when sort with Java non serializable class
This commit adds a new metric in TaskMetrics to record
the input data size and displays this information in the UI.
An earlier version of this commit also added the read time,
which can be useful for diagnosing straggler problems,
but unfortunately that change introduced a significant performance
regression for jobs that don't do much computation. In order to
track read time, we'll need to do sampling.
The screenshots below show the UI with the new "Input" field,
which I added to the stage summary page, the executor summary page,
and the per-stage page.
![image](https://cloud.githubusercontent.com/assets/1108612/3167930/2627f92a-eb77-11e3-861c-98ea5bb7a1a2.png)
![image](https://cloud.githubusercontent.com/assets/1108612/3167936/475a889c-eb77-11e3-9706-f11c48751f17.png)
![image](https://cloud.githubusercontent.com/assets/1108612/3167948/80ebcf12-eb77-11e3-87ed-349fce6a770c.png)
Author: Kay Ousterhout <kayousterhout@gmail.com>
Closes#962 from kayousterhout/read_metrics and squashes the following commits:
f13b67d [Kay Ousterhout] Correctly format input bytes on executor page
8b70cde [Kay Ousterhout] Added comment about potential inaccuracy of bytesRead
d1016e8 [Kay Ousterhout] Udated SparkListenerSuite test
8461492 [Kay Ousterhout] Miniscule style fix
ae04d99 [Kay Ousterhout] Remove input metrics for parallel collections
719f19d [Kay Ousterhout] Style fixes
bb6ec62 [Kay Ousterhout] Small fixes
869ac7b [Kay Ousterhout] Updated Json tests
44a0301 [Kay Ousterhout] Fixed accidentally added line
4bd0568 [Kay Ousterhout] Added input source, renamed Hdfs to Hadoop.
f27e535 [Kay Ousterhout] Updates based on review comments and to fix rebase
bf41029 [Kay Ousterhout] Updated Json tests to pass
0fc33e0 [Kay Ousterhout] Added explicit backward compatibility test
4e52925 [Kay Ousterhout] Added Json output and associated tests.
365400b [Kay Ousterhout] [SPARK-1683] Track task read metrics.
Author: Reynold Xin <rxin@apache.org>
Closes#1261 from rxin/ui-pre-size and squashes the following commits:
7ab1a69 [Reynold Xin] [SPARK-2320] Reduce exception/code block font size in web ui
Author: Reynold Xin <rxin@apache.org>
Closes#1258 from rxin/mapOutputTracker and squashes the following commits:
a7c95b6 [Reynold Xin] Improve MapOutputTracker error logging.
The existing docs are highly misleading. For standalone mode, for example, it encourages the user to use standalone-cluster mode, which is not officially supported. The safeguards have been added in Spark submit itself to prevent bad documentation from leading users down the wrong path in the future.
This PR is prompted by countless headaches users of Spark have run into on the mailing list.
Author: Andrew Or <andrewor14@gmail.com>
Closes#1200 from andrewor14/submit-docs and squashes the following commits:
5ea2460 [Andrew Or] Rephrase cluster vs client explanation
c827f32 [Andrew Or] Clarify spark submit messages
9f7ed8f [Andrew Or] Clarify client vs cluster deploy mode + add safeguards
The issue here is that the `StorageTab` listens for updates from the `StorageStatusListener`, but when a block is kicked out of the cache, `StorageStatusListener` removes it from its list. Thus, there is no way for the `StorageTab` to know whether a block has been dropped.
This issue was introduced in #1080, which was itself a bug fix. Here we revert that PR and offer a different fix for the original bug (SPARK-2144).
Author: Andrew Or <andrewor14@gmail.com>
Closes#1249 from andrewor14/storage-ui-fix and squashes the following commits:
af019ce [Andrew Or] Fix SPARK-2307
Author: witgo <witgo@qq.com>
Closes#1135 from witgo/SPARK-2181 and squashes the following commits:
39dad90 [witgo] The keys for sorting the columns of Executor page in SparkUI are incorrect
The following code is very likely to throw an exception:
~~~
val rdd = sc.parallelize(0 until 111, 10).sample(false, 0.1)
rdd.zip(rdd).count()
~~~
because the same random number generator is used in compute partitions.
Author: Xiangrui Meng <meng@databricks.com>
Closes#1229 from mengxr/fix-sample and squashes the following commits:
f1ee3d7 [Xiangrui Meng] fix concurrency issues in random sampler
New UI:
![screen shot 2014-06-26 at 1 43 52 pm](https://cloud.githubusercontent.com/assets/323388/3404643/82b9ddc6-fd73-11e3-96f9-f7592a7aee79.png)
Author: Reynold Xin <rxin@apache.org>
Closes#1236 from rxin/ui-task-attempt and squashes the following commits:
3b645dd [Reynold Xin] Expose attemptId in Stage.
c0474b1 [Reynold Xin] Beefed up unit test.
c404bdd [Reynold Xin] Fix ReplayListenerSuite.
f56be4b [Reynold Xin] Fixed JsonProtocolSuite.
e29e0f7 [Reynold Xin] Minor update.
5e4354a [Reynold Xin] [SPARK-2297][UI] Make task attempt and speculation more explicit in UI.
FetchFailedException used to have a Throwable field, but in reality we never propagate any of the throwable/exceptions back to the driver because Executor explicitly looks for FetchFailedException and then sends FetchFailed as the TaskEndReason.
This pull request removes the throwable and adds a MetadataFetchFailedException that extends FetchFailedException (so now MapOutputTracker throws MetadataFetchFailedException instead).
Author: Reynold Xin <rxin@apache.org>
Closes#1227 from rxin/metadataFetchException and squashes the following commits:
5cb1e0a [Reynold Xin] MetadataFetchFailedException extends FetchFailedException.
8861ee2 [Reynold Xin] Throw MetadataFetchFailedException in MapOutputTracker.
Also added inline doc for each TaskEndReason.
Author: Reynold Xin <rxin@apache.org>
Closes#1225 from rxin/SPARK-2286 and squashes the following commits:
6a7959d [Reynold Xin] Fix unit test failure.
cf9d5eb [Reynold Xin] Merge branch 'master' into SPARK-2286
a61fae1 [Reynold Xin] Move to line above ...
38c7391 [Reynold Xin] [SPARK-2286][UI] Report exception/errors for failed tasks that are not ExceptionFailure.
Previously only tasks failed with ExceptionFailure reason was marked as failure.
Author: Reynold Xin <rxin@apache.org>
Closes#1224 from rxin/SPARK-2284 and squashes the following commits:
be79dbd [Reynold Xin] [SPARK-2284][UI] Mark all failed tasks as failures.
This is a fixed up version of #686 (cc @markhamstra @pwendell). The last commit (the only one I authored) reflects the changes I made from Mark's original patch.
Author: Mark Hamstra <markhamstra@gmail.com>
Author: Kay Ousterhout <kayousterhout@gmail.com>
Closes#1219 from kayousterhout/mark-SPARK-1749 and squashes the following commits:
42dfa7e [Kay Ousterhout] Got rid of terrible double-negative name
80b3205 [Kay Ousterhout] Don't notify listeners of job failure if it wasn't successfully cancelled.
d156d33 [Mark Hamstra] Do nothing in no-kill submitTasks
9312baa [Mark Hamstra] code review update
cc353c8 [Mark Hamstra] scalastyle
e61f7f8 [Mark Hamstra] Catch UnsupportedOperationException when DAGScheduler tries to cancel a job on a SchedulerBackend that does not implement killTask
The scheduler for Mesos in fine-grained mode launches tasks on the wrong executors. `MesosSchedulerBackend.resourceOffers(SchedulerDriver, List[Offer])` is assuming that `TaskSchedulerImpl.resourceOffers(Seq[WorkerOffer])` is returning task lists in the same order as the offers it was passed, but in the current implementation `TaskSchedulerImpl.resourceOffers` shuffles the offers to avoid assigning the tasks always to the same executors. The result is that the tasks are launched on the wrong executors. The jobs are sometimes able to complete, but most of the time they fail. It seems that as soon as something goes wrong with a task for some reason Spark is not able to recover since it's mistaken as to where the tasks are actually running. Also, it seems that the more the cluster is under load the more likely the job is to fail because there's a higher probability that Spark is trying to launch a task on a slave that doesn't actually have enough resources, again because it's using the wrong offers.
The solution is to not assume that the order in which the tasks are returned is the same as the offers, and simply launch the tasks on the executor decided by `TaskSchedulerImpl.resourceOffers`. What I am not sure about is that I considered slaveId and executorId to be the same, which is true at least in my setup, but I don't know if that is always true.
I tested this on top of the 1.0.0 release and it seems to work fine on our cluster.
Author: Sebastien Rainville <sebastien@hopper.com>
Closes#1140 from sebastienrainville/fine-grained-mode-fix-master and squashes the following commits:
a98b0e0 [Sebastien Rainville] Use a HashMap to retrieve the offer indices
d6ffe54 [Sebastien Rainville] Launch tasks on the proper executors in mesos fine-grained mode
and thus groupBy/cogroup are broken in Java APIs when Kryo is used).
@pwendell this should be merged into 1.0.1.
Thanks @sorenmacbeth for reporting this & helping out with the fix.
Author: Reynold Xin <rxin@apache.org>
Closes#1206 from rxin/kryo-iterable-2270 and squashes the following commits:
09da0aa [Reynold Xin] Updated the comment.
009bf64 [Reynold Xin] [SPARK-2270] Kryo cannot serialize results returned by asJavaIterable (and thus groupBy/cogroup are broken in Java APIs when Kryo is used).
**SPARK-2258.** Worker UI displays zombie processes if the executor throws an exception before a process is launched. This is because we only inform the Worker of the change if the process is already launched, which in this case it isn't.
**SPARK-2266.** We expose "Some(app-id)" on the log page. This is fairly minor.
Author: Andrew Or <andrewor14@gmail.com>
Closes#1213 from andrewor14/fix-worker-ui and squashes the following commits:
c1223fe [Andrew Or] Fix worker UI bugs
https://issues.apache.org/jira/browse/SPARK-2038
to differentiate with SparkConf object and at the same time keep the source level compatibility
Author: CodingCat <zhunansjtu@gmail.com>
Closes#1137 from CodingCat/SPARK-2038 and squashes the following commits:
11abeba [CodingCat] revise the comments
7ee5712 [CodingCat] to keep the source-compatibility
763975f [CodingCat] style fix
d91288d [CodingCat] rename "conf" parameters in the saveAsHadoop functions
Author: witgo <witgo@qq.com>
Closes#1194 from witgo/SPARK-2248 and squashes the following commits:
6ac950b [witgo] spark.default.parallelism does not apply in local mode
Author: Michael Armbrust <michael@databricks.com>
Closes#1204 from marmbrus/nullPointerToString and squashes the following commits:
35b5fce [Michael Armbrust] Fix possible null pointer in acumulator toString
This is an alternative solution to #1124 . Before launching the executor backend, we first fetch driver's spark properties and use it to overwrite executor's spark properties. This should be better than #1124.
@pwendell Are there spark properties that might be different on the driver and on the executors?
Author: Xiangrui Meng <meng@databricks.com>
Closes#1132 from mengxr/akka-bootstrap and squashes the following commits:
77ff32d [Xiangrui Meng] organize imports
68e1dfb [Xiangrui Meng] use timeout from AkkaUtils; remove props from RegisteredExecutor
46d332d [Xiangrui Meng] fix a test
7947c18 [Xiangrui Meng] increase slack size for akka
4ab696a [Xiangrui Meng] bootstrap to retrieve driver spark conf
The assertJsonStringEquals method was missing an "assert" so
did not actually check that the strings were equal. This commit
adds the missing assert and fixes subsequently revealed problems
with the JsonProtocolSuite.
@andrewor14 I changed some of the test functionality to match what it
looks like you intended based on the expected strings -- let me know if
anything here looks wrong.
Author: Kay Ousterhout <kayousterhout@gmail.com>
Closes#1198 from kayousterhout/json_test_fix and squashes the following commits:
77f858f [Kay Ousterhout] Fix broken Json tests.
Don't check executor/host availability when creating a TaskSetManager. Because the executors may haven't been registered when the TaskSetManager is created, in which case all tasks will be considered "has no preferred locations", and thus losing data locality in later scheduling.
Author: Rui Li <rui.li@intel.com>
Author: lirui-intel <rui.li@intel.com>
Closes#892 from lirui-intel/delaySchedule and squashes the following commits:
8444d7c [Rui Li] fix code style
fafd57f [Rui Li] keep locality constraints within the valid levels
18f9e05 [Rui Li] restrict allowed locality
5b3fb2f [Rui Li] refine UT
99f843e [Rui Li] add unit test and fix bug
fff4123 [Rui Li] fix computing valid locality levels
685ed3d [Rui Li] remove delay shedule for pendingTasksWithNoPrefs
7b0177a [Rui Li] remove redundant code
c7b93b5 [Rui Li] revise patch
3d7da02 [lirui-intel] Update TaskSchedulerImpl.scala
cab4c71 [Rui Li] revised patch
539a578 [Rui Li] fix code style
cf0d6ac [Rui Li] fix code style
3dfae86 [Rui Li] re-compute pending tasks when new host is added
a225ac2 [Rui Li] SPARK-1937: fix issue with task locality
This PR is a sub-task of SPARK-2044 to move the execution of aggregation into shuffle implementations.
I leave `CoGoupedRDD` and `SubtractedRDD` unchanged because they have their implementations of aggregation. I'm not sure is it suitable to change these two RDDs.
Also I do not move sort related code of `OrderedRDDFunctions` into shuffle, this will be solved in another sub-task.
Author: jerryshao <saisai.shao@intel.com>
Closes#1064 from jerryshao/SPARK-2124 and squashes the following commits:
4a05a40 [jerryshao] Modify according to comments
1f7dcc8 [jerryshao] Style changes
50a2fd6 [jerryshao] Fix test suite issue after moving aggregator to Shuffle reader and writer
1a96190 [jerryshao] Code modification related to the ShuffledRDD
308f635 [jerryshao] initial works of move combiner to ShuffleManager's reader and writer
Cleanup on Connection, ConnectionManagerId, and ConnectionManager classes part 2 while I was working at the code there to help IDE:
1. Remove unused imports
2. Remove parentheses in method calls that do not have side affect.
3. Add parentheses in method calls that do have side effect or not simple get to object properties.
4. Change if-else check (via isInstanceOf) for Connection class type with Scala expression for consistency and cleanliness.
5. Remove semicolon
6. Remove extra spaces.
7. Remove redundant return for consistency
Author: Henry Saputra <henry.saputra@gmail.com>
Closes#1157 from hsaputra/cleanup_connection_classes_part2 and squashes the following commits:
4be6906 [Henry Saputra] Fix Spark Scala style for line over 100 chars.
85b24f7 [Henry Saputra] Cleanup on Connection and ConnectionManager classes part 2 while I was working at the code there to help IDE: 1. Remove unused imports 2. Remove parentheses in method calls that do not have side affect. 3. Add parentheses in method calls that do have side effect. 4. Change if-else check (via isInstanceOf) for Connection class type with Scala expression for consitency and cleanliness. 5. Remove semicolon 6. Remove extra spaces.
Two improvements to the history server:
- Separate the HTTP handling from history fetching, so that it's easy to add
new backends later (thinking about SPARK-1537 in the long run)
- Avoid loading all UIs in memory. Do lazy loading instead, keeping a few in
memory for faster access. This allows the app limit to go away, since holding
just the listing in memory shouldn't be too expensive unless the user has millions
of completed apps in the history (at which point I'd expect other issues to arise
aside from history server memory usage, such as FileSystem.listStatus()
starting to become ridiculously expensive).
I also fixed a few minor things along the way which aren't really worth mentioning.
I also removed the app's log path from the UI since that information may not even
exist depending on which backend is used (even though there is only one now).
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#718 from vanzin/hist-server and squashes the following commits:
53620c9 [Marcelo Vanzin] Add mima exclude, fix scaladoc wording.
c21f8d8 [Marcelo Vanzin] Feedback: formatting, docs.
dd8cc4b [Marcelo Vanzin] Standardize on using spark.history.* configuration.
4da3a52 [Marcelo Vanzin] Remove UI from ApplicationHistoryInfo.
2a7f68d [Marcelo Vanzin] Address review feedback.
4e72c77 [Marcelo Vanzin] Remove comment about ordering.
249bcea [Marcelo Vanzin] Remove offset / count from provider interface.
ca5d320 [Marcelo Vanzin] Remove code that deals with unfinished apps.
6e2432f [Marcelo Vanzin] Second round of feedback.
b2c570a [Marcelo Vanzin] Make class package-private.
4406f61 [Marcelo Vanzin] Cosmetic change to listing header.
e852149 [Marcelo Vanzin] Initialize new app array to expected size.
e8026f4 [Marcelo Vanzin] Review feedback.
49d2fd3 [Marcelo Vanzin] Fix a comment.
91e96ca [Marcelo Vanzin] Fix scalastyle issues.
6fbe0d8 [Marcelo Vanzin] Better handle failures when loading app info.
eee2f5a [Marcelo Vanzin] Ensure server.stop() is called when shutting down.
bda2fa1 [Marcelo Vanzin] Rudimentary paging support for the history UI.
b284478 [Marcelo Vanzin] Separate history server from history backend.
Author: witgo <witgo@qq.com>
Closes#1174 from witgo/SPARK-2229 and squashes the following commits:
f85f321 [witgo] FileAppender throw anIllegalArgumentException in JDK6
e1a8da8 [witgo] SizeBasedRollingPolicy throw an java.lang.IllegalArgumentException in JDK6
Commons IO is actually barely used, and is not a declared dependency. This just replaces with equivalents from the JDK and Guava.
Author: Sean Owen <sowen@cloudera.com>
Closes#1173 from srowen/SPARK-1316 and squashes the following commits:
2eb53db [Sean Owen] Reorder Guava import
8fde404 [Sean Owen] Remove use of Commons IO, which is not actually a dependency
- JavaAPISuite was trying to compare a bare path with a URI. Fix by
extracting the path from the URI, since we know it should be a
local path anyway/
- b9be1609 excluded the ASM dependency everywhere, but easymock needs
it (because cglib needs it). So re-add the dependency, with test
scope this time.
The second one above actually uncovered a weird situation: the maven
test target works, even though I can't find the class sbt complains
about in its classpath. sbt complains with:
[error] Uncaught exception when running org.apache.spark.util
.random.RandomSamplerSuite: java.lang.NoClassDefFoundError:
org/objectweb/asm/Type
To avoid more weirdness caused by that, I explicitly added the asm
dependency to both maven and sbt (for tests only), and verified
the classes don't end up in the final assembly.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#917 from vanzin/flaky-tests and squashes the following commits:
d022320 [Marcelo Vanzin] Fix some tests.
The jira for the issue can be found at: https://issues.apache.org/jira/browse/SPARK-2061
Most of spark has used over to consistently using `partitions` instead of `splits`. We should do likewise and add a `partitions` method to JavaRDDLike and have `splits` just call that. We should also go through all cases where other API's (e.g. Python) call `splits` and we should change those to use the newer API.
Author: Anant <anant.asty@gmail.com>
Closes#1062 from anantasty/SPARK-2061 and squashes the following commits:
b83ce6b [Anant] Fixed syntax issue
21f9210 [Anant] Fixed version number in deprecation string
9315b76 [Anant] made related changes to use partitions in python api
8c62dd1 [Anant] Made splits deprecated in JavaRDDLike
Updating the chisquare unit test in XORShiftRandomSuite to use the ChiSquareTest in commons-math3 instead of hardcoding the chisquare statistic for the desired confidence interval.
Author: Doris Xin <doris.s.xin@gmail.com>
Closes#1073 from dorx/math3Unit and squashes the following commits:
da0e891 [Doris Xin] remove math3 from common pom
9954143 [Doris Xin] merge master
c19948f [Doris Xin] Merge branch 'master' into math3Unit
8f84f19 [Doris Xin] [SPARK-1970] unit test in XORShiftRandomSuite
ffea61a [Doris Xin] SPARK-1939: Refactor takeSample method in RDD
1441977 [Doris Xin] SPARK-1939 Refactor takeSample method in RDD to use ScaSRS
Before:
```
14/06/08 23:58:23 WARN AbstractLifeCycle: FAILED SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address already in use
java.net.BindException: Address already in use
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:444)
at sun.nio.ch.Net.bind(Net.java:436)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
at org.eclipse.jetty.server.nio.SelectChannelConnector.open(SelectChannelConnector.java:187)
at org.eclipse.jetty.server.AbstractConnector.doStart(AbstractConnector.java:316)
at org.eclipse.jetty.server.nio.SelectChannelConnector.doStart(SelectChannelConnector.java:265)
at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
at org.eclipse.jetty.server.Server.doStart(Server.java:293)
at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
at org.apache.spark.ui.JettyUtils$$anonfun$1.apply$mcV$sp(JettyUtils.scala:192)
at org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:192)
at org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:192)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.ui.JettyUtils$.connect$1(JettyUtils.scala:191)
at org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:205)
at org.apache.spark.ui.WebUI.bind(WebUI.scala:99)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:223)
at org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:957)
at $line3.$read$$iwC$$iwC.<init>(<console>:8)
at $line3.$read$$iwC.<init>(<console>:14)
at $line3.$read.<init>(<console>:16)
at $line3.$read$.<init>(<console>:20)
at $line3.$read$.<clinit>(<console>)
at $line3.$eval$.<init>(<console>:7)
at $line3.$eval$.<clinit>(<console>)
at $line3.$eval.$print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753)
at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:121)
at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:120)
at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:263)
at org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:120)
at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:56)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:913)
at org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:142)
at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:56)
at org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:104)
at org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:56)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:930)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
14/06/08 23:58:23 WARN AbstractLifeCycle: FAILED org.eclipse.jetty.server.Server@7439e55a: java.net.BindException: Address already in use
java.net.BindException: Address already in use
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:444)
at sun.nio.ch.Net.bind(Net.java:436)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
at org.eclipse.jetty.server.nio.SelectChannelConnector.open(SelectChannelConnector.java:187)
at org.eclipse.jetty.server.AbstractConnector.doStart(AbstractConnector.java:316)
at org.eclipse.jetty.server.nio.SelectChannelConnector.doStart(SelectChannelConnector.java:265)
at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
at org.eclipse.jetty.server.Server.doStart(Server.java:293)
at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
at org.apache.spark.ui.JettyUtils$$anonfun$1.apply$mcV$sp(JettyUtils.scala:192)
at org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:192)
at org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:192)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.ui.JettyUtils$.connect$1(JettyUtils.scala:191)
at org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:205)
at org.apache.spark.ui.WebUI.bind(WebUI.scala:99)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:223)
at org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:957)
at $line3.$read$$iwC$$iwC.<init>(<console>:8)
at $line3.$read$$iwC.<init>(<console>:14)
at $line3.$read.<init>(<console>:16)
at $line3.$read$.<init>(<console>:20)
at $line3.$read$.<clinit>(<console>)
at $line3.$eval$.<init>(<console>:7)
at $line3.$eval$.<clinit>(<console>)
at $line3.$eval.$print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753)
at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:121)
at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:120)
at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:263)
at org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:120)
at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:56)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:913)
at org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:142)
at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:56)
at org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:104)
at org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:56)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:930)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
14/06/08 23:58:23 INFO JettyUtils: Failed to create UI at port, 4040. Trying again.
14/06/08 23:58:23 INFO JettyUtils: Error was: Failure(java.net.BindException: Address already in use)
14/06/08 23:58:23 INFO SparkUI: Started SparkUI at http://aash-mbp.local:4041
````
After:
```
14/06/09 00:04:12 INFO JettyUtils: Failed to create UI at port, 4040. Trying again.
14/06/09 00:04:12 INFO JettyUtils: Error was: Failure(java.net.BindException: Address already in use)
14/06/09 00:04:12 INFO Server: jetty-8.y.z-SNAPSHOT
14/06/09 00:04:12 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4041
14/06/09 00:04:12 INFO SparkUI: Started SparkUI at http://aash-mbp.local:4041
```
Lengthy logging comes from this line of code in Jetty: http://grepcode.com/file/repo1.maven.org/maven2/org.eclipse.jetty.aggregate/jetty-all/9.1.3.v20140225/org/eclipse/jetty/util/component/AbstractLifeCycle.java#210
Author: Andrew Ash <andrew@andrewash.com>
Closes#1019 from ash211/SPARK-1902 and squashes the following commits:
0dd02f7 [Andrew Ash] Leave old org.eclipse.jetty silencing in place
1e2866b [Andrew Ash] Address CR comments
9d85eed [Andrew Ash] SPARK-1902 Silence stacktrace from logs when doing port failover to port n+1
**UPDATE**
I have removed the special handling for `StorageLevel.MEMORY_*_SER` for now, because it introduces a potential performance regression. With the latest changes, this PR should include mainly style (code readability) fixes. The only functionality change is the update in `MemoryStore#putBytes` to actually return updated blocks, though this is a minor bug fix.
Now this is mainly a precursor to another PR (once again).
---------
*Old comment*
The deserialized version of a partition may occupy much more space than the serialized version. Therefore, if a partition is to be cached with `StorageLevel.MEMORY_*_SER`, we don't need to fully unroll it into an `ArrayBuffer`, but instead we can unroll it into a potentially much smaller `ByteBuffer`. This may save us from OOMs in this case.
Author: Andrew Or <andrewor14@gmail.com>
Closes#1083 from andrewor14/unroll-them-partitions and squashes the following commits:
7048aa0 [Andrew Or] Merge branch 'master' of github.com:apache/spark into unroll-them-partitions
3d9a366 [Andrew Or] Minor change for readability
d12b95f [Andrew Or] Remove unused imports (minor)
a4c387b [Andrew Or] Merge branch 'master' of github.com:apache/spark into unroll-them-partitions
cf5f565 [Andrew Or] Remove special handling for MEM_*_SER
0091ec0 [Andrew Or] Address review feedback
44ef282 [Andrew Or] Actually return updated blocks in putBytes
2941c89 [Andrew Or] Clean up BlockStore (minor)
a8f181d [Andrew Or] Add special handling for StorageLevel.MEMORY_*_SER
Adds cogroup for 4 RDDs.
Author: Allan Douglas R. de Oliveira <allandouglas@gmail.com>
Closes#813 from douglaz/more_cogroups and squashes the following commits:
f8d6273 [Allan Douglas R. de Oliveira] Test python groupWith for one more case
0e9009c [Allan Douglas R. de Oliveira] Added scala tests
c3ffcdd [Allan Douglas R. de Oliveira] Added java tests
517a67f [Allan Douglas R. de Oliveira] Added tests for python groupWith
2f402d5 [Allan Douglas R. de Oliveira] Removed TODO
17474f4 [Allan Douglas R. de Oliveira] Use new cogroup function
7877a2a [Allan Douglas R. de Oliveira] Fixed code
ba02414 [Allan Douglas R. de Oliveira] Added varargs cogroup to pyspark
c4a8a51 [Allan Douglas R. de Oliveira] Added java cogroup 4
e94963c [Allan Douglas R. de Oliveira] Fixed spacing
f1ee57b [Allan Douglas R. de Oliveira] Fixed scala style issues
d7196f1 [Allan Douglas R. de Oliveira] Allow the cogroup of 4 RDDs
Author: Patrick Wendell <pwendell@gmail.com>
Closes#1141 from pwendell/hotfix and squashes the following commits:
83e4c79 [Patrick Wendell] HOTFIX: SPARK-2208 local metrics tests can fail on fast machines
int format expected for input memory parameter when spark-submit is invoked in standalone cluster mode. Make it consistent with rest of Spark.
Author: nravi <nravi@c1704.halxg.cloudera.com>
Closes#1095 from nishkamravi2/master and squashes the following commits:
2b630f9 [nravi] Accept memory input as "30g", "512M" instead of an int value, to be consistent with rest of Spark
3bf8fad [nravi] Merge branch 'master' of https://github.com/apache/spark
5423a03 [nravi] Merge branch 'master' of https://github.com/apache/spark
eb663ca [nravi] Merge branch 'master' of https://github.com/apache/spark
df2aeb1 [nravi] Improved fix for ConcurrentModificationIssue (Spark-1097, Hadoop-10456)
6b840f0 [nravi] Undo the fix for SPARK-1758 (the problem is fixed)
5108700 [nravi] Fix in Spark for the Concurrent thread modification issue (SPARK-1097, HADOOP-10456)
681b36f [nravi] Fix for SPARK-1758: failing test org.apache.spark.JavaAPISuite.wholeTextFiles
The value "env" is never used in SparkContext.scala.
Add detailed comment for method setDelaySeconds in MetadataCleaner.scala instead of the unsure one.
Author: WangTao <barneystinson@aliyun.com>
Closes#1105 from WangTaoTheTonic/master and squashes the following commits:
688358e [WangTao] Minor fix
Some IDEs don’t support unicode characters in source code. Check if this breaks binary compatibility.
Author: Doris Xin <doris.s.xin@gmail.com>
Closes#1119 from dorx/unicode and squashes the following commits:
05618c3 [Doris Xin] Remove unicode operator from RDD.scala
@tdas
Author: Mark Hamstra <markhamstra@gmail.com>
Closes#1100 from markhamstra/SPARK-2158 and squashes the following commits:
ae8e069 [Mark Hamstra] Response to TD's review
2f1e201 [Mark Hamstra] Cleanup 'stdout' file within FileAppenderSuite
A follow up on #1103
@andrewor14
Author: Reynold Xin <rxin@apache.org>
Closes#1117 from rxin/SPARK-2162 and squashes the following commits:
a4231de [Reynold Xin] Updated the comment for SPARK-2162.
other wise, it will either read in vain in memory level case, or throw exception in disk level case when it believe the block is there while actually it had been removed.
Author: Raymond Liu <raymond.liu@intel.com>
Closes#1103 from colorant/bm and squashes the following commits:
daac114 [Raymond Liu] Address comments
d1ea287 [Raymond Liu] Double check in doGetLocal to avoid read on removed block.
This PR includes two changes
- **[SPARK-2147]** When an application finishes cleanly (i.e. `sc.stop()` is called), all of its executors used to disappear from the Master UI. This no longer happens.
- **[SPARK-2161]** This adds a "Removed Executors" table to Master UI, so the user can find out why their executors died from the logs, for instance. The equivalent table already existed in the Worker UI, but was hidden because of a bug (the comment `//scalastyle:off` disconnected the `Seq[Node]` that represents the HTML for table).
This should go into 1.0.1 if possible.
Author: Andrew Or <andrewor14@gmail.com>
Closes#1102 from andrewor14/remember-removed-executors and squashes the following commits:
2e2298f [Andrew Or] Add hash code method to ExecutorInfo (minor)
abd72e0 [Andrew Or] Merge branch 'master' of github.com:apache/spark into remember-removed-executors
792f992 [Andrew Or] Add missing equals method in ExecutorInfo
3390b49 [Andrew Or] Add executor state column to WorkerPage
161f8a2 [Andrew Or] Display finished executors table (fix bug)
fbb65b8 [Andrew Or] Removed unused method
c89bb6e [Andrew Or] Add table for removed executors in MasterWebUI
fe47402 [Andrew Or] Show exited executors on the Master UI
to distinguish with SparkConf object
https://issues.apache.org/jira/browse/SPARK-2038
Author: CodingCat <zhunansjtu@gmail.com>
Closes#1087 from CodingCat/SPARK-2038 and squashes the following commits:
763975f [CodingCat] style fix
d91288d [CodingCat] rename "conf" parameters in the saveAsHadoop functions
Removes Python syntax in Scaladoc, corrects result in Scaladoc, and removes irrelevant cache() call in Python doc.
Author: Sandy Ryza <sandy@cloudera.com>
Closes#1086 from sryza/sandy-spark-2146 and squashes the following commits:
185ff18 [Sandy Ryza] Use Seq instead of Array
c996120 [Sandy Ryza] SPARK-2146. Fix takeOrdered doc
This never got merged from the apache/incubator-spark repo (which is now deleted) but there had been several rounds of code review on this PR there.
I think this is ready for merging.
Author: Andrew Ash <andrew@andrewash.com>
This patch had conflicts when merged, resolved by
Committer: Reynold Xin <rxin@apache.org>
Closes#369 from ash211/sortby and squashes the following commits:
d09147a [Andrew Ash] Fix Ordering import
43d0a53 [Andrew Ash] Fix missing .collect()
29a54ed [Andrew Ash] Re-enable test by converting to a closure
5a95348 [Andrew Ash] Add license for RDDSuiteUtils
64ed6e3 [Andrew Ash] Remove leaked diff
d4de69a [Andrew Ash] Remove scar tissue
63638b5 [Andrew Ash] Add Python version of .sortBy()
45e0fde [Andrew Ash] Add Java version of .sortBy()
adf84c5 [Andrew Ash] Re-indent to keep line lengths under 100 chars
9d9b9d8 [Andrew Ash] Use parentheses on .collect() calls
0457b69 [Andrew Ash] Ignore failing test
99f0baf [Andrew Ash] Merge branch 'master' into sortby
222ae97 [Andrew Ash] Try moving Ordering objects out to a different class
3fd0dd3 [Andrew Ash] Add (failing) test for sortByKey with explicit Ordering
b8b5bbc [Andrew Ash] Align remove extra spaces that were used to align ='s in test code
8c53298 [Andrew Ash] Actually use ascending and numPartitions parameters
381eef2 [Andrew Ash] Correct silly typo
7db3e84 [Andrew Ash] Support ascending and numPartitions params in sortBy()
0f685fd [Andrew Ash] Merge remote-tracking branch 'origin/master' into sortby
ca4490d [Andrew Ash] Add .sortBy(f) method on RDD
This is reproducible whenever we drop a block because of memory pressure.
This is because StorageStatusListener actually never removes anything from the block maps of its StorageStatuses. Instead, when a block is dropped, it sets the block's storage level to `StorageLevel.NONE`, when it should just remove it from the map.
This PR includes this simple fix.
Author: Andrew Or <andrewor14@gmail.com>
Closes#1080 from andrewor14/ui-blocks and squashes the following commits:
fcf9f1a [Andrew Or] Remove BlockStatus if it is no longer cached
I'm not sure about the test -- I get a lot of unrelated failures for some reason. I'll try to sort it out. But hopefully the automation will test this for me if I send a pull request :).
I'll attach a demo HTML in [Jira](https://issues.apache.org/jira/browse/SPARK-2035).
Author: Daniel Darabos <darabos.daniel@gmail.com>
Author: Patrick Wendell <pwendell@gmail.com>
Closes#981 from darabos/darabos-call-stack and squashes the following commits:
f7c6bfa [Daniel Darabos] Fix bad merge. I undid 83c226d454 by Doris.
3d0a48d [Daniel Darabos] Merge remote-tracking branch 'upstream/master' into darabos-call-stack
b857849 [Daniel Darabos] Style: Break long line.
ecb5690 [Daniel Darabos] Include the last Spark method in the full stack trace. Otherwise it is not visible if the stage name is overridden.
d00a85b [Patrick Wendell] Make call sites for stages non-optional and well defined
b9eba24 [Daniel Darabos] Make StageInfo.details non-optional. Add JSON serialization code for the new field. Verify JSON backward compatibility.
4312828 [Daniel Darabos] Remove Mima excludes for CallSite. They should be unnecessary now, with SPARK-2070 fixed.
0920750 [Daniel Darabos] Merge remote-tracking branch 'upstream/master' into darabos-call-stack
a4b1faf [Daniel Darabos] Add Mima exclusions for the CallSite changes it has picked up. They are private methods/classes, so we ought to be safe.
932f810 [Daniel Darabos] Use empty CallSite instead of null in DAGSchedulerSuite. Outside of testing, this parameter always originates in SparkContext.scala, and will never be null.
ccd89d1 [Daniel Darabos] Fix long lines.
ac173e4 [Daniel Darabos] Hide "show details" if there are no details to show.
6182da6 [Daniel Darabos] Set a configurable limit on maximum call stack depth. It can be useful in memory-constrained situations with large numbers of stages.
8fe2e34 [Daniel Darabos] Store call stack for stages, display it on the UI.
https://issues.apache.org/jira/browse/SPARK-2039
apply output dir existence checking for all output formats
Author: CodingCat <zhunansjtu@gmail.com>
Closes#1088 from CodingCat/SPARK-2039 and squashes the following commits:
c52747a [CodingCat] apply output dir existence checking for all output formats
StorageLevel in 'storage tab' and 'RDD Storage Info' never changes even if you call rdd.unpersist() and then you give the rdd another different storage level.
Author: CrazyJvm <crazyjvm@gmail.com>
Closes#968 from CrazyJvm/ui-storagelevel and squashes the following commits:
62555fa [CrazyJvm] change RDDInfo constructor param 'storageLevel' to var, so there's need to add another variable _storageLevel。
9f1571e [CrazyJvm] JIRA https://issues.apache.org/jira/browse/SPARK-1999 UI : StorageLevel in storage tab and RDD Storage Info never changes
There seems to be 2 issues.
1. When job is done, driver asks executor to shutdown. However, this clean exit was assigned FAILED executor state by Worker. I introduced EXITED executor state for executors who voluntarily exit (both normal and abnormal exit depending on the exit code).
2. When Master gets notified an executor has exited, it launches another one to replace it, regardless of reason why the executor had exited. When the reason was job has finished, the unnecessary replacement got subsequently killed when App disassociates. This launching and killing of unnecessary executors shows up in the log and is confusing to users. I added check for executor exit status and avoid launching (and subsequent killing) of unnecessary replacements when executors exit cleanly.
One could ask the scheduler to tell Master job is done so that Master wouldn't launch the replacement executor. However, there is a race condition between App telling Master job is done and Worker telling Master an executor had exited. There is no guarantee the former will happen before the later. Instead, I chose to check the exit code when executor exits. If the exit code is 0, I assume executor has been asked to shutdown by driver and Master will not launch replacements.
Due to race condition, it could also happen that (although didn't happen on my local cluster), Master detects App disassociation event before the executor exits by itself. In such cases, the executor will be rightfully killed and labeled as KILLED, while the App state will show FINISHED.
Author: Kan Zhang <kzhang@apache.org>
Closes#306 from kanzhang/SPARK-1118 and squashes the following commits:
cb0cc86 [Kan Zhang] [SPARK-937] adding EXITED executor state and not relaunching cleanly exited executors
... sequences
Author: Kan Zhang <kzhang@apache.org>
Closes#776 from kanzhang/SPARK-1837 and squashes the following commits:
e48f018 [Kan Zhang] [SPARK-1837] code refactoring
67c33b5 [Kan Zhang] minor change
403f9b1 [Kan Zhang] [SPARK-1837] NumericRange should be partitioned in the same way as other sequences
This fix has gone into Hadoop 2.4.1. For developers using < 2.4.1, it would be good to have a workaround in Spark as well.
Fix has been tested for performance as well, no regressions found.
Author: nravi <nravi@c1704.halxg.cloudera.com>
Closes#1000 from nishkamravi2/master and squashes the following commits:
eb663ca [nravi] Merge branch 'master' of https://github.com/apache/spark
df2aeb1 [nravi] Improved fix for ConcurrentModificationIssue (Spark-1097, Hadoop-10456)
6b840f0 [nravi] Undo the fix for SPARK-1758 (the problem is fixed)
5108700 [nravi] Fix in Spark for the Concurrent thread modification issue (SPARK-1097, HADOOP-10456)
681b36f [nravi] Fix for SPARK-1758: failing test org.apache.spark.JavaAPISuite.wholeTextFiles
This is a precursor to a bigger change. I wanted to separate out the relatively insignificant changes so the ultimate PR is not inflated.
(Warning: this PR is full of unimportant nitpicks)
Author: Andrew Or <andrewor14@gmail.com>
Closes#1058 from andrewor14/bm-minor and squashes the following commits:
8e12eaf [Andrew Or] SparkException -> BlockException
c36fd53 [Andrew Or] Make parts of BlockManager more readable
0a5f378 [Andrew Or] Entry -> MemoryEntry
e9762a5 [Andrew Or] Tone down string interpolation (minor reverts)
c4de9ac [Andrew Or] Merge branch 'master' of github.com:apache/spark into bm-minor
b3470f1 [Andrew Or] More string interpolation (minor)
7f9dcab [Andrew Or] Use string interpolation (minor)
94a425b [Andrew Or] Refactor against duplicate code + minor changes
8a6a7dc [Andrew Or] Exception -> SparkException
97c410f [Andrew Or] Deal with MIMA excludes
2480f1d [Andrew Or] Fixes in StorgeLevel.scala
abb0163 [Andrew Or] Style, formatting and naming fixes
Modified the takeSample method in RDD to use the ScaSRS sampling technique to improve performance. Added a private method that computes sampling rate > sample_size/total to ensure sufficient sample size with success rate >= 0.9999. Added a unit test for the private method to validate choice of sampling rate.
Author: Doris Xin <doris.s.xin@gmail.com>
Author: dorx <doris.s.xin@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes#916 from dorx/takeSample and squashes the following commits:
5b061ae [Doris Xin] merge master
444e750 [Doris Xin] edge cases
3de882b [dorx] Merge pull request #2 from mengxr/SPARK-1939
82dde31 [Xiangrui Meng] update pyspark's takeSample
48d954d [Doris Xin] remove unused imports from RDDSuite
fb1452f [Doris Xin] allowing num to be greater than count in all cases
1481b01 [Doris Xin] washing test tubes and making coffee
dc699f3 [Doris Xin] give back imports removed by accident in rdd.py
64e445b [Doris Xin] logwarnning as soon as it enters the while loop
55518ed [Doris Xin] added TODO for logging in rdd.py
eff89e2 [Doris Xin] addressed reviewer comments.
ecab508 [Doris Xin] "fixed checkstyle violation
0a9b3e3 [Doris Xin] "reviewer comment addressed"
f80f270 [Doris Xin] Merge branch 'master' into takeSample
ae3ad04 [Doris Xin] fixed edge cases to prevent overflow
065ebcd [Doris Xin] Merge branch 'master' into takeSample
9bdd36e [Doris Xin] Check sample size and move computeFraction
e3fd6a6 [Doris Xin] Merge branch 'master' into takeSample
7cab53a [Doris Xin] fixed import bug in rdd.py
ffea61a [Doris Xin] SPARK-1939: Refactor takeSample method in RDD
1441977 [Doris Xin] SPARK-1939 Refactor takeSample method in RDD to use ScaSRS
Took me several hours to figure out this behavior. It would be good to highlight it in the documentation.
Author: Ariel Rabkin <asrabkin@cs.princeton.edu>
Closes#1070 from asrabkin/master and squashes the following commits:
29a076e [Ariel Rabkin] doc fix
(This change is actually small, I moved some logic into
compute-classpath that was previously in spark-class).
Assemble deps has existed for a while to allow developers to
run local code with new changes quickly. When I'm developing I
typically use a simpler approach which just prepends the Spark
classes to the classpath before the assembly jar. This is well
defined in the JVM and the Spark classes take precedence over those
in the assembly.
This approach is portable across both builds which is the main reason I'd
like to switch to it. It's also a bit easier to toggle on and off quickly.
The way you use this is the following:
```
$ ./bin/spark-shell # Use spark with the normal assembly
$ export SPARK_PREPEND_CLASSES=true
$ ./bin/spark-shell # Now it's using compiled classes
$ unset SPARK_PREPEND_CLASSES
$ ./bin/spark-shell # Back to normal
```
Author: Patrick Wendell <pwendell@gmail.com>
Closes#877 from pwendell/assemble-deps and squashes the following commits:
8a11345 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into assemble-deps
faa3168 [Patrick Wendell] Adding a warning for compatibility
3f151a7 [Patrick Wendell] Small fix
bbfb73c [Patrick Wendell] Review feedback
328e9f8 [Patrick Wendell] SPARK-1843: Replace assemble-deps with env variable.
Yarn client mode was not setting the app's tracking URL to the
History Server's URL when configured by the user. Now client mode
behaves the same as cluster mode.
In SparkContext.scala, the "user.name" system property had precedence
over the SPARK_USER environment variable. This means that SPARK_USER
was never used, since "user.name" is always set by the JVM. In Yarn
cluster mode, this means the application always reported itself as
being run by user "yarn" (or whatever user was running the Yarn NM).
One could argue that the correct fix would be to use UGI.getCurrentUser()
here, but at least for Yarn that will match what SPARK_USER is set
to.
Author: Marcelo Vanzin <vanzin@cloudera.com>
This patch had conflicts when merged, resolved by
Committer: Thomas Graves <tgraves@apache.org>
Closes#1002 from vanzin/yarn-client-url and squashes the following commits:
4046e04 [Marcelo Vanzin] Set HS link in yarn-alpha also.
4c692d9 [Marcelo Vanzin] Yarn: report HS URL in client mode, correct user in cluster mode.
After deserialization, the transient field creationSiteInfo does not get backfilled with the default value, but the toString method, which is invoked by the serializer, expects the field to always be non-null. An NPE is thrown when toString is called by the serializer when creationSiteInfo is null.
Author: Doris Xin <doris.s.xin@gmail.com>
Closes#1028 from dorx/toStringNPE and squashes the following commits:
f20021e [Doris Xin] unit test for toString after desrialization
6f0a586 [Doris Xin] Merge branch 'master' into toStringNPE
f47fecf [Doris Xin] Merge branch 'master' into toStringNPE
76199c6 [Doris Xin] [SPARK-2088] fix NPE in toString
Simple cleanup on Connection and ConnectionManager to make IDE happy while working of issue:
1. Replace var with var
2. Add parentheses to Queue#dequeu to be consistent with side-effects.
3. Remove return on final line of a method.
Author: Henry Saputra <henry.saputra@gmail.com>
Closes#1060 from hsaputra/cleanup_connection_classes and squashes the following commits:
245fd09 [Henry Saputra] Cleanup on Connection and ConnectionManager to make IDE happy while working of issue: 1. Replace var with var 2. Add parentheses to Queue#dequeu to be consistent with side-effects. 3. Remove return on final line of a method.
Author: Yadong <qiyadong2010@gmail.com>
Closes#1052 from watermen/bug-fix1 and squashes the following commits:
409d09a [Yadong] 'killFuture' is never used
This is a first cut at moving shuffle logic behind a pluggable interface, as described at https://issues.apache.org/jira/browse/SPARK-2044, to let us more easily experiment with new shuffle implementations. It moves the existing shuffle code to a class HashShuffleManager behind a general ShuffleManager interface.
Two things are still missing to make this complete:
* MapOutputTracker needs to be hidden behind the ShuffleManager interface; this will also require adding methods to ShuffleManager that will let the DAGScheduler interact with it as it does with the MapOutputTracker today
* The code to do map-sides and reduce-side combine in ShuffledRDD, PairRDDFunctions, etc needs to be moved into the ShuffleManager's readers and writers
However, some of these may also be done later after we merge the current interface.
Author: Matei Zaharia <matei@databricks.com>
Closes#1009 from mateiz/pluggable-shuffle and squashes the following commits:
7a09862 [Matei Zaharia] review comments
be33d3f [Matei Zaharia] review comments
1513d4e [Matei Zaharia] Add ASF header
ac56831 [Matei Zaharia] Bug fix and better error message
4f681ba [Matei Zaharia] Move write part of ShuffleMapTask to ShuffleManager
f6f011d [Matei Zaharia] Move hash shuffle reader behind ShuffleManager interface
55c7717 [Matei Zaharia] Changed RDD code to use ShuffleReader
75cc044 [Matei Zaharia] Partial work to move hash shuffle in
Author: Prashant Sharma <prashant.s@imaginea.com>
Closes#1047 from ScrapCodes/SPARK-2108/mark-as-dev-api and squashes the following commits:
073ee34 [Prashant Sharma] [SPARK-2108] Mark SparkContext methods that return block information as developer API's
Author: witgo <witgo@qq.com>
Closes#1032 from witgo/ShouldMatchers and squashes the following commits:
7ebf34c [witgo] Resolve scalatest warnings during build
Currently, in the default log4j configuration, all the executor logs get sent to the file <code>[executor-working-dir]/stderr</code>. This does not all log files to be rolled, so old logs cannot be removed.
Using log4j RollingFileAppender allows log4j logs to be rolled, but all the logs get sent to a different set of files, other than the files <code>stdout</code> and <code>stderr</code> . So the logs are not visible in the Spark web UI any more as Spark web UI only reads the files <code>stdout</code> and <code>stderr</code>. Furthermore, it still does not allow the stdout and stderr to be cleared periodically in case a large amount of stuff gets written to them (e.g. by explicit `println` inside map function).
This PR solves this by implementing a simple `RollingFileAppender` within Spark (disabled by default). When enabled (using configuration parameter `spark.executor.rollingLogs.enabled`), the logs can get rolled over either by time interval (set with `spark.executor.rollingLogs.interval`, set to daily by default), or by size of logs (set with `spark.executor.rollingLogs.size`). Finally, old logs can be automatically deleted by specifying how many of the latest log files to keep (set with `spark.executor.rollingLogs.keepLastN`). The web UI has also been modified to show the logs across the rolled-over files.
You can test this locally (without waiting a whole day) by setting configuration `spark.executor.rollingLogs.enabled=true` and `spark.executor.rollingLogs.interval=minutely`. Continuously generate logs by running spark jobs and the generated logs files would look like this (`stderr` and `stdout` are the most current log file that are being written to).
```
stderr
stderr--2014-05-27--14-37
stderr--2014-05-27--14-47
stderr--2014-05-27--15-05
stdout
stdout--2014-05-27--14-47
```
The web ui should show logs across these files.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#895 from tdas/rolling-logs and squashes the following commits:
fd8f87f [Tathagata Das] Minor change.
d326aee [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into rolling-logs
ad956c1 [Tathagata Das] Scala style fix.
1f0a6ec [Tathagata Das] Some more changes based on Patrick's PR comments.
c8bfe4e [Tathagata Das] Refactore FileAppender to a package spark.util.logging and broke up the file into multiple files. Changed configuration parameter names.
4224409 [Tathagata Das] Style fix.
108a9f8 [Tathagata Das] Added better constraint handling for rolling policies.
f7da977 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into rolling-logs
9134495 [Tathagata Das] Simplified rolling logs by removing Daily/Hourly/MinutelyRollingFileAppender, and removing the setting rollingLogs.enabled
312d874 [Tathagata Das] Minor fixes based on PR comments.
8a67d83 [Tathagata Das] Fixed comments.
b36cfd6 [Tathagata Das] Implemented RollingPolicy, TimeBasedRollingPolicy and SizeBasedRollingPolicy, and changed RollingFileAppender accordingly.
b7e8272 [Tathagata Das] Style fix,
374c9a9 [Tathagata Das] Added missing license.
24354ea [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into rolling-logs
6cc09c7 [Tathagata Das] Fixed bugs in rolling logs, and added more debug statements.
adf4910 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into rolling-logs
931f8fb [Tathagata Das] Changed log viewer in Spark web UI to handle rolling log files.
cb4fb6d [Tathagata Das] Added FileAppender and RollingFileAppender to generate rolling executor logs.
So I finally resurrected this PR. It seems the old one against the incubator mirror is no longer available, so I cannot reference it.
This adds initial support for reading Hadoop ```SequenceFile```s, as well as arbitrary Hadoop ```InputFormat```s, in PySpark.
# Overview
The basics are as follows:
1. ```PythonRDD``` object contains the relevant methods, that are in turn invoked by ```SparkContext``` in PySpark
2. The SequenceFile or InputFormat is read on the Scala side and converted from ```Writable``` instances to the relevant Scala classes (in the case of primitives)
3. Pyrolite is used to serialize Java objects. If this fails, the fallback is ```toString```
4. ```PickleSerializer``` on the Python side deserializes.
This works "out the box" for simple ```Writable```s:
* ```Text```
* ```IntWritable```, ```DoubleWritable```, ```FloatWritable```
* ```NullWritable```
* ```BooleanWritable```
* ```BytesWritable```
* ```MapWritable```
It also works for simple, "struct-like" classes. Due to the way Pyrolite works, this requires that the classes satisfy the JavaBeans convenstions (i.e. with fields and a no-arg constructor and getters/setters). (Perhaps in future some sugar for case classes and reflection could be added).
I've tested it out with ```ESInputFormat``` as an example and it works very nicely:
```python
conf = {"es.resource" : "index/type" }
rdd = sc.newAPIHadoopRDD("org.elasticsearch.hadoop.mr.EsInputFormat", "org.apache.hadoop.io.NullWritable", "org.elasticsearch.hadoop.mr.LinkedMapWritable", conf=conf)
rdd.first()
```
I suspect for things like HBase/Cassandra it will be a bit trickier to get it to work out the box.
# Some things still outstanding:
1. ~~Requires ```msgpack-python``` and will fail without it. As originally discussed with Josh, add a ```as_strings``` argument that defaults to ```False```, that can be used if ```msgpack-python``` is not available~~
2. ~~I see from https://github.com/apache/spark/pull/363 that Pyrolite is being used there for SerDe between Scala and Python. @ahirreddy @mateiz what is the plan behind this - is Pyrolite preferred? It seems from a cursory glance that adapting the ```msgpack```-based SerDe here to use Pyrolite wouldn't be too hard~~
3. ~~Support the key and value "wrapper" that would allow a Scala/Java function to be plugged in that would transform whatever the key/value Writable class is into something that can be serialized (e.g. convert some custom Writable to a JavaBean or ```java.util.Map``` that can be easily serialized)~~
4. Support ```saveAsSequenceFile``` and ```saveAsHadoopFile``` etc. This would require SerDe in the reverse direction, that can be handled by Pyrolite. Will work on this as a separate PR
Author: Nick Pentreath <nick.pentreath@gmail.com>
Closes#455 from MLnick/pyspark-inputformats and squashes the following commits:
268df7e [Nick Pentreath] Documentation changes mer @pwendell comments
761269b [Nick Pentreath] Address @pwendell comments, simplify default writable conversions and remove registry.
4c972d8 [Nick Pentreath] Add license headers
d150431 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
cde6af9 [Nick Pentreath] Parameterize converter trait
5ebacfa [Nick Pentreath] Update docs for PySpark input formats
a985492 [Nick Pentreath] Move Converter examples to own package
365d0be [Nick Pentreath] Make classes private[python]. Add docs and @Experimental annotation to Converter interface.
eeb8205 [Nick Pentreath] Fix path relative to SPARK_HOME in tests
1eaa08b [Nick Pentreath] HBase -> Cassandra app name oversight
3f90c3e [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
2c18513 [Nick Pentreath] Add examples for reading HBase and Cassandra InputFormats from Python
b65606f [Nick Pentreath] Add converter interface
5757f6e [Nick Pentreath] Default key/value classes for sequenceFile asre None
085b55f [Nick Pentreath] Move input format tests to tests.py and clean up docs
43eb728 [Nick Pentreath] PySpark InputFormats docs into programming guide
94beedc [Nick Pentreath] Clean up args in PythonRDD. Set key/value converter defaults to None for PySpark context.py methods
1a4a1d6 [Nick Pentreath] Address @mateiz style comments
01e0813 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
15a7d07 [Nick Pentreath] Remove default args for key/value classes. Arg names to camelCase
9fe6bd5 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
84fe8e3 [Nick Pentreath] Python programming guide space formatting
d0f52b6 [Nick Pentreath] Python programming guide
7caa73a [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
93ef995 [Nick Pentreath] Add back context.py changes
9ef1896 [Nick Pentreath] Recover earlier changes lost in previous merge for serializers.py
077ecb2 [Nick Pentreath] Recover earlier changes lost in previous merge for context.py
5af4770 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
35b8e3a [Nick Pentreath] Another fix for test ordering
bef3afb [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
e001b94 [Nick Pentreath] Fix test failures due to ordering
78978d9 [Nick Pentreath] Add doc for SequenceFile and InputFormat support to Python programming guide
64eb051 [Nick Pentreath] Scalastyle fix
e7552fa [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
44f2857 [Nick Pentreath] Remove msgpack dependency and switch serialization to Pyrolite, plus some clean up and refactoring
c0ebfb6 [Nick Pentreath] Change sequencefile test data generator to easily be called from PySpark tests
1d7c17c [Nick Pentreath] Amend tests to auto-generate sequencefile data in temp dir
17a656b [Nick Pentreath] remove binary sequencefile for tests
f60959e [Nick Pentreath] Remove msgpack dependency and serializer from PySpark
450e0a2 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
31a2fff [Nick Pentreath] Scalastyle fixes
fc5099e [Nick Pentreath] Add Apache license headers
4e08983 [Nick Pentreath] Clean up docs for PySpark context methods
b20ec7e [Nick Pentreath] Clean up merge duplicate dependencies
951c117 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
f6aac55 [Nick Pentreath] Bring back msgpack
9d2256e [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
1bbbfb0 [Nick Pentreath] Clean up SparkBuild from merge
a67dfad [Nick Pentreath] Clean up Msgpack serialization and registering
7237263 [Nick Pentreath] Add back msgpack serializer and hadoop file code lost during merging
25da1ca [Nick Pentreath] Add generator for nulls, bools, bytes and maps
65360d5 [Nick Pentreath] Adding test SequenceFiles
0c612e5 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
d72bf18 [Nick Pentreath] msgpack
dd57922 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
e67212a [Nick Pentreath] Add back msgpack dependency
f2d76a0 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
41856a5 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
97ef708 [Nick Pentreath] Remove old writeToStream
2beeedb [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
795a763 [Nick Pentreath] Change name to WriteInputFormatTestDataGenerator. Cleanup some var names. Use SPARK_HOME in path for writing test sequencefile data.
174f520 [Nick Pentreath] Add back graphx settings
703ee65 [Nick Pentreath] Add back msgpack
619c0fa [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
1c8efbc [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
eb40036 [Nick Pentreath] Remove unused comment lines
4d7ef2e [Nick Pentreath] Fix indentation
f1d73e3 [Nick Pentreath] mergeConfs returns a copy rather than mutating one of the input arguments
0f5cd84 [Nick Pentreath] Remove unused pair UTF8 class. Add comments to msgpack deserializer
4294cbb [Nick Pentreath] Add old Hadoop api methods. Clean up and expand comments. Clean up argument names
818a1e6 [Nick Pentreath] Add seqencefile and Hadoop InputFormat support to PythonRDD
4e7c9e3 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
c304cc8 [Nick Pentreath] Adding supporting sequncefiles for tests. Cleaning up
4b0a43f [Nick Pentreath] Refactoring utils into own objects. Cleaning up old commented-out code
d86325f [Nick Pentreath] Initial WIP of PySpark support for SequenceFile and arbitrary Hadoop InputFormat
This test ensures that when there are no
alive executors that satisfy a particular locality level,
the TaskSetManager doesn't ever use that as the maximum
allowed locality level (this optimization ensures that a
job doesn't wait extra time in an attempt to satisfy
a scheduling locality level that is impossible).
@mateiz and @lirui-intel this unit test illustrates an issue
with #892 (it fails with that patch).
Author: Kay Ousterhout <kayousterhout@gmail.com>
Closes#1024 from kayousterhout/scheduler_unit_test and squashes the following commits:
de6a08f [Kay Ousterhout] Added a TaskSetManager unit test.
https://issues.apache.org/jira/browse/SPARK-1944
Author: Andrew Ash <andrew@andrewash.com>
Closes#1020 from ash211/SPARK-1944 and squashes the following commits:
a831c4d [Andrew Ash] SPARK-1944 Document --verbose in spark-shell -h
Author: Andrew Ash <andrew@andrewash.com>
Closes#1016 from ash211/patch-6 and squashes the following commits:
e3865c8 [Andrew Ash] Grammar: read -> reads
Author: Neville Li <neville@spotify.com>
Closes#1006 from nevillelyh/gh/SPARK-2067 and squashes the following commits:
9ee64cf [Neville Li] [SPARK-2067] use relative path for Spark logo in UI
Adding a paragraph clarifying a weird behavior in RangePartitioner.
See also #549.
Author: Reynold Xin <rxin@apache.org>
Closes#1012 from rxin/partitioner-doc and squashes the following commits:
6f0109e [Reynold Xin] SPARK-1628 follow up: Improve RangePartitioner's documentation.
JIRA: https://issues.apache.org/jira/browse/SPARK-1628
Added `hashCode` in HashPartitioner, RangePartitioner, PythonPartitioner and PageRankUtils.CustomPartitioner.
Author: zsxwing <zsxwing@gmail.com>
Closes#549 from zsxwing/SPARK-1628 and squashes the following commits:
2620936 [zsxwing] SPARK-1628: Add missing hashCode methods in Partitioner subclasses
Author: Neville Li <neville@spotify.com>
Closes#992 from nevillelyh/master and squashes the following commits:
3011739 [Neville Li] [SPARK-2056] Set RDD name to input path
The current implementation reads one key with the next hash code as it finishes reading the keys with the current hash code, which may cause it to miss some matches of the next key. This can cause operations like join to give the wrong result when reduce tasks spill to disk and there are hash collisions, as values won't be matched together. This PR fixes it by not reading in that next key, using a peeking iterator instead.
Author: Matei Zaharia <matei@databricks.com>
Closes#986 from mateiz/spark-2043 and squashes the following commits:
0959514 [Matei Zaharia] Added unit test for having many hash collisions
892debb [Matei Zaharia] SPARK-2043: don't read a key with the next hash code in ExternalAppendOnlyMap, instead use a buffered iterator to only read values with the current hash code.
DAGScheduler supports pluggable clock like what TaskSetManager does.
Author: CrazyJvm <crazyjvm@gmail.com>
Closes#976 from CrazyJvm/clock and squashes the following commits:
6779a4c [CrazyJvm] Use pluggable clock in DAGSheduler
https://issues.apache.org/jira/browse/SPARK-1677
For compatibility with older versions of Spark it would be nice to have an option `spark.hadoop.validateOutputSpecs` (default true) for the user to disable the output directory existence checking
Author: CodingCat <zhunansjtu@gmail.com>
Closes#947 from CodingCat/SPARK-1677 and squashes the following commits:
7930f83 [CodingCat] miao
c0c0e03 [CodingCat] bug fix and doc update
5318562 [CodingCat] bug fix
13219b5 [CodingCat] allow user to disable output dir existence checking
In Hadoop trunk (currently Hadoop 3.0.0), the deprecated
FSDataOutputStream#sync() method has been removed. Instead, we should
call FSDataOutputStream#hflush, which does the same thing as the
deprecated method used to do.
Author: Colin McCabe <cmccabe@cloudera.com>
Closes#898 from cmccabe/SPARK-1518 and squashes the following commits:
752b9d7 [Colin McCabe] FileLogger: Fix compile against Hadoop trunk
I'd like to use randomSplit through the Java API, and would like to add a convenience wrapper for this method to JavaRDD. This is fairly trivial. (In fact, is the intent that JavaRDD not wrap every RDD method? and that sometimes users should just use JavaRDD.wrapRDD()?)
Along the way, I added tests for it, and also touched up the Java API test style and behavior. This is maybe the more useful part of this small change.
Author: Sean Owen <sowen@cloudera.com>
Author: Xiangrui Meng <meng@databricks.com>
This patch had conflicts when merged, resolved by
Committer: Xiangrui Meng <meng@databricks.com>
Closes#919 from srowen/SPARK-1973 and squashes the following commits:
148cb7b [Sean Owen] Some final Java test polish, while we are at it
1fc3f3e [Xiangrui Meng] more cleaning on Java 8 tests
9ebc57f [Sean Owen] Use accumulator instead of temp files to test foreach
5efb0be [Sean Owen] Add Java randomSplit, and unit tests (including for sample)
5dcc158 [Sean Owen] Simplified Java 8 test with new language features, and fixed the name of MLB's greatest team
91a1769 [Sean Owen] Touch up minor style issues in existing Java API suite test
RDD.zip() will throw an exception if it finds partition sizes are not the same.
Author: Kan Zhang <kzhang@apache.org>
Closes#944 from kanzhang/SPARK-1817 and squashes the following commits:
c073848 [Kan Zhang] [SPARK-1817] Cosmetic updates
524c670 [Kan Zhang] [SPARK-1817] RDD.zip() should verify partition sizes for each partition
The update to Mesos 0.18 caused some deprecation warnings in the build. The change to the non-deprecated version is straightforward as it emulates what the Mesos driver does with the deprecated method anyway (c5aa1dd221/src/sched/sched.cpp (L1354))
Author: Sean Owen <sowen@cloudera.com>
Closes#920 from srowen/SPARK-1806 and squashes the following commits:
8d76b6a [Sean Owen] Use non-deprecated methods in Mesos 0.18
I also corrected some errors made in the previous HLL count approximate API, including relativeSD wasn't really a measure for error (and we used it to test error bounds in test results).
Author: Reynold Xin <rxin@apache.org>
Closes#897 from rxin/hll and squashes the following commits:
4d83f41 [Reynold Xin] New error bound and non-randomness.
f154ea0 [Reynold Xin] Added a comment on the value bound for testing.
e367527 [Reynold Xin] One more round of code review.
41e649a [Reynold Xin] Update final mima list.
9e320c8 [Reynold Xin] Incorporate code review feedback.
e110d70 [Reynold Xin] Merge branch 'master' into hll
354deb8 [Reynold Xin] Added comment on the Mima exclude rules.
acaa524 [Reynold Xin] Added the right exclude rules in MimaExcludes.
6555bfe [Reynold Xin] Added a default method and re-arranged MimaExcludes.
1db1522 [Reynold Xin] Excluded util.SerializableHyperLogLog from MIMA check.
9221b27 [Reynold Xin] Merge branch 'master' into hll
88cfe77 [Reynold Xin] Updated documentation and restored the old incorrect API to maintain API compatibility.
1294be6 [Reynold Xin] Updated HLL+.
e7786cb [Reynold Xin] Merge branch 'master' into hll
c0ef0c2 [Reynold Xin] SPARK-1941: Update streamlib to 2.7.0 and use HyperLogLogPlus instead of HyperLogLog.
This PR adds support for specifying custom storage levels for the vertices and edges of a graph. This enables GraphX to handle graphs larger than memory size by specifying MEMORY_AND_DISK and then repartitioning the graph to use many small partitions, each of which does fit in memory. Spark will then automatically load partitions from disk as needed.
The user specifies the desired vertex and edge storage levels when building the graph by passing them to the graph constructor. These are then stored in the `targetStorageLevel` attribute of the VertexRDD and EdgeRDD respectively. Whenever GraphX needs to cache a VertexRDD or EdgeRDD (because it plans to use it more than once, for example), it uses the specified target storage level. Also, when the user calls `Graph#cache()`, the vertices and edges are persisted using their target storage levels.
In order to facilitate propagating the target storage levels across VertexRDD and EdgeRDD operations, we remove raw calls to the constructors and instead introduce the `withPartitionsRDD` and `withTargetStorageLevel` methods.
I tested this change by running PageRank and triangle count on a severely memory-constrained cluster (1 executor with 300 MB of memory, and a 1 GB graph). Before this PR, these algorithms used to fail with OutOfMemoryErrors. With this PR, and using the DISK_ONLY storage level, they succeed.
Author: Ankur Dave <ankurdave@gmail.com>
Closes#946 from ankurdave/SPARK-1991 and squashes the following commits:
ce17d95 [Ankur Dave] Move pickStorageLevel to StorageLevel.fromString
ccaf06f [Ankur Dave] Shadow members in withXYZ() methods rather than using underscores
c34abc0 [Ankur Dave] Exclude all of GraphX from compatibility checks vs. 1.0.0
c5ca068 [Ankur Dave] Revert "Exclude all of GraphX from binary compatibility checks"
34bcefb [Ankur Dave] Exclude all of GraphX from binary compatibility checks
6fdd137 [Ankur Dave] [SPARK-1991] Support custom storage levels for vertices and edges
When we need to read a compressed block, we will first create a compress stream instance(LZF or Snappy) and use it to wrap that block.
Let's say a reducer task need to read 1000 local shuffle blocks, it will first prepare to read that 1000 blocks, which means create 1000 compression stream instance to wrap them. But the initialization of compression instance will allocate some memory and when we have many compression instance at the same time, it is a problem.
Actually reducer reads the shuffle blocks one by one, so we can do the compression instance initialization lazily.
Author: Wenchen Fan(Cloud) <cloud0fan@gmail.com>
Closes#860 from cloud-fan/fix-compress and squashes the following commits:
0924a6b [Wenchen Fan(Cloud)] rename 'doWork' into 'getIterator'
07f32c2 [Wenchen Fan(Cloud)] move the LazyProxyIterator to dataDeserialize
d80c426 [Wenchen Fan(Cloud)] remove empty lines in short class
2c8adb2 [Wenchen Fan(Cloud)] add inline comment
8ebff77 [Wenchen Fan(Cloud)] fix compress memory issue during reduce
stop resetting spark.driver.port in unit tests (scala, java and python).
Author: Syed Hashmi <shashmi@cloudera.com>
Author: CodingCat <zhunansjtu@gmail.com>
Closes#943 from syedhashmi/master and squashes the following commits:
885f210 [Syed Hashmi] Removing unnecessary file (created by mergetool)
b8bd4b5 [Syed Hashmi] Merge remote-tracking branch 'upstream/master'
b895e59 [Syed Hashmi] Revert "[SPARK-1784] Add a new partitioner"
57b6587 [Syed Hashmi] Revert "[SPARK-1784] Add a balanced partitioner"
1574769 [Syed Hashmi] [SPARK-1942] Stop clearing spark.driver.port in unit tests
4354836 [Syed Hashmi] Revert "SPARK-1686: keep schedule() calling in the main thread"
fd36542 [Syed Hashmi] [SPARK-1784] Add a balanced partitioner
6668015 [CodingCat] SPARK-1686: keep schedule() calling in the main thread
4ca94cc [Syed Hashmi] [SPARK-1784] Add a new partitioner
This patch simply ports over the Scala implementation of RDD#take(), which reads the first partition at the driver, then decides how many more partitions it needs to read and will possibly start a real job if it's more than 1. (Note that SparkContext#runJob(allowLocal=true) only runs the job locally if there's 1 partition selected and no parent stages.)
Author: Aaron Davidson <aaron@databricks.com>
Closes#922 from aarondav/take and squashes the following commits:
fa06df9 [Aaron Davidson] SPARK-1839: PySpark RDD#take() shouldn't always read from driver
`Properties#load()` doesn't close the InputStream, but it'd be closed after being GC'd anyway...
Also changed file.getName to file, because getName only shows the filename. This will show the full (possibly relative) path, which is less confusing if it's not found.
Author: Aaron Davidson <aaron@databricks.com>
Closes#914 from aarondav/tiny and squashes the following commits:
db9d072 [Aaron Davidson] Super minor: Close inputStream in SparkSubmitArguments
Author: Chen Chao <crazyjvm@gmail.com>
Closes#928 from CrazyJvm/patch-8 and squashes the following commits:
144328b [Chen Chao] correct tiny comment error
https://issues.apache.org/jira/browse/SPARK-1901
Author: Zhen Peng <zhenpeng01@baidu.com>
Closes#854 from zhpengg/bugfix-worker-kills-executor and squashes the following commits:
21d380b [Zhen Peng] add some error messages
506cea6 [Zhen Peng] add some docs for killProcess()
a0b9860 [Zhen Peng] [SPARK-1901] worker should make sure executor has exited before updating executor's info
bugfix worker DriverStateChanged state should match DriverState.FAILED
Author: lianhuiwang <lianhuiwang09@gmail.com>
Closes#864 from lianhuiwang/master and squashes the following commits:
480ce94 [lianhuiwang] address aarondav comments
f2b5970 [lianhuiwang] bugfix worker DriverStateChanged state should match DriverState.FAILED
`var cachedPeers: Seq[BlockManagerId] = null` is used in `def replicate(blockId: BlockId, data: ByteBuffer, level: StorageLevel)` without proper protection.
There are two place will call `replicate(blockId, bytesAfterPut, level)`
* 17f3075bc4/core/src/main/scala/org/apache/spark/storage/BlockManager.scala (L644) runs in `connectionManager.futureExecContext`
* 17f3075bc4/core/src/main/scala/org/apache/spark/storage/BlockManager.scala (L752) `doPut` runs in `connectionManager.handleMessageExecutor`. `org.apache.spark.storage.BlockManagerWorker` calls `blockManager.putBytes` in `connectionManager.handleMessageExecutor`.
As they run in different `Executor`s, this is a race condition which may cause the memory pointed by `cachedPeers` is not correct even if `cachedPeers != null`.
The race condition of `onReceiveCallback` is that it's set in `BlockManagerWorker` but read in a different thread in `ConnectionManager.handleMessageExecutor`.
Author: zsxwing <zsxwing@gmail.com>
Closes#887 from zsxwing/SPARK-1932 and squashes the following commits:
524f69c [zsxwing] SPARK-1932: Fix race conditions in onReceiveCallback and cachedPeers
https://issues.apache.org/jira/browse/SPARK-1933
Author: Reynold Xin <rxin@apache.org>
Closes#888 from rxin/addfile and squashes the following commits:
8c402a3 [Reynold Xin] Updated comment.
ff6c162 [Reynold Xin] SPARK-1933: Throw a more meaningful exception when a directory is passed to addJar/addFile.
DAGScheduler does not handle local task OOM properly, and will wait for the job result forever.
Author: Zhen Peng <zhenpeng01@baidu.com>
Closes#883 from zhpengg/bugfix-dag-scheduler-oom and squashes the following commits:
76f7eda [Zhen Peng] remove redundant memory allocations
aa63161 [Zhen Peng] SPARK-1929 DAGScheduler suspended by local task OOM
Self explanatory.
Author: Patrick Wendell <pwendell@gmail.com>
Closes#878 from pwendell/java-constructor and squashes the following commits:
2cc1605 [Patrick Wendell] HOTFIX: Add no-arg SparkContext constructor in Java
Author: Zhen Peng <zhenpeng01@baidu.com>
Closes#827 from zhpengg/bugfix-executor-id-not-found and squashes the following commits:
cd8bb65 [Zhen Peng] bugfix: check executor id existence when executor exit
If I run the following on a YARN cluster
```
bin/spark-submit sheep.py --master yarn-client
```
it fails because of a mismatch in paths: `spark-submit` thinks that `sheep.py` resides on HDFS, and balks when it can't find the file there. A natural workaround is to add the `file:` prefix to the file:
```
bin/spark-submit file:/path/to/sheep.py --master yarn-client
```
However, this also fails. This time it is because python does not understand URI schemes.
This PR fixes this by automatically resolving all paths passed as command line argument to `spark-submit` properly. This has the added benefit of keeping file and jar paths consistent across different cluster modes. For python, we strip the URI scheme before we actually try to run it.
Much of the code is originally written by @mengxr. Tested on YARN cluster. More tests pending.
Author: Andrew Or <andrewor14@gmail.com>
Closes#853 from andrewor14/submit-paths and squashes the following commits:
0bb097a [Andrew Or] Format path correctly before adding it to PYTHONPATH
323b45c [Andrew Or] Include --py-files on PYTHONPATH for pyspark shell
3c36587 [Andrew Or] Improve error messages (minor)
854aa6a [Andrew Or] Guard against NPE if user gives pathological paths
6638a6b [Andrew Or] Fix spark-shell jar paths after #849 went in
3bb0359 [Andrew Or] Update more comments (minor)
2a1f8a0 [Andrew Or] Update comments (minor)
6af2c77 [Andrew Or] Merge branch 'master' of github.com:apache/spark into submit-paths
a68c4d1 [Andrew Or] Handle Windows python file path correctly
427a250 [Andrew Or] Resolve paths properly for Windows
a591a4a [Andrew Or] Update tests for resolving URIs
6c8621c [Andrew Or] Move resolveURIs to Utils
db8255e [Andrew Or] Merge branch 'master' of github.com:apache/spark into submit-paths
f542dce [Andrew Or] Fix outdated tests
691c4ce [Andrew Or] Ignore special primary resource names
5342ac7 [Andrew Or] Add missing space in error message
02f77f3 [Andrew Or] Resolve command line arguments to spark-submit properly
Due to perhaps zombie processes on Jenkins, it seems that at least 10
Spark ports are in use. It also doesn't matter that the port increases
when used, it could in fact go down -- the only part that matters is
that it selects a different port rather than failing to bind.
Changed test to match this.
Thanks to @andrewor14 for helping diagnose this.
Author: Aaron Davidson <aaron@databricks.com>
Closes#857 from aarondav/tiny and squashes the following commits:
c199ec8 [Aaron Davidson] Fix UISuite unit test that fails under Jenkins contention
Sent secondary jars to distributed cache of all containers and add the cached jars to classpath before executors start. Tested on a YARN cluster (CDH-5.0).
`spark-submit --jars` also works in standalone server and `yarn-client`. Thanks for @andrewor14 for testing!
I removed "Doesn't work for drivers in standalone mode with "cluster" deploy mode." from `spark-submit`'s help message, though we haven't tested mesos yet.
CC: @dbtsai @sryza
Author: Xiangrui Meng <meng@databricks.com>
Closes#848 from mengxr/yarn-classpath and squashes the following commits:
23e7df4 [Xiangrui Meng] rename spark.jar to __spark__.jar and app.jar to __app__.jar to avoid confliction apped $CWD/ and $CWD/* to the classpath remove unused methods
a40f6ed [Xiangrui Meng] standalone -> cluster
65e04ad [Xiangrui Meng] update spark-submit help message and add a comment for yarn-client
11e5354 [Xiangrui Meng] minor changes
3e7e1c4 [Xiangrui Meng] use sparkConf instead of hadoop conf
dc3c825 [Xiangrui Meng] add secondary jars to classpath in yarn
It was in the wrong package
Author: Andrew Or <andrewor14@gmail.com>
Closes#839 from andrewor14/jdbc-suite and squashes the following commits:
f948c5a [Andrew Or] cache -> cache()
b215279 [Andrew Or] Move JdbcRDDSuite to the correct package
Updated version of #821
Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Ghidireac <bogdang@u448a5b0a73d45358d94a.ant.amazon.com>
Closes#835 from tdas/SPARK-1877 and squashes the following commits:
f346f71 [Tathagata Das] Addressed Patrick's comments.
fee0c5d [Ghidireac] SPARK-1877: ClassNotFoundException when loading RDD with serialized objects
scheduler.error() will mask the error if there are active tasks. Being removed is a cataclysmic event for Spark applications, and should probably be treated as such.
Author: Aaron Davidson <aaron@databricks.com>
Closes#832 from aarondav/i-love-u and squashes the following commits:
9f1200f [Aaron Davidson] SPARK-1689: Spark application should die when removed by Master
See https://issues.apache.org/jira/browse/SPARK-1879 -- builds with Hadoop2 and Hive ran out of PermGen space in spark-shell, when those things added up with the Scala compiler.
Note that users can still override it by setting their own Java options with this change. Their options will come later in the command string than the -XX:MaxPermSize=128m.
Author: Matei Zaharia <matei@databricks.com>
Closes#823 from mateiz/spark-1879 and squashes the following commits:
6bc0ee8 [Matei Zaharia] Increase MaxPermSize to 128m since some of our builds have lots of classes
- Look for JARs in the right place
- Launch examples the same way as on Unix
- Load datanucleus JARs if they exist
- Don't attempt to parse local paths as URIs in SparkSubmit, since paths with C:\ are not valid URIs
- Also fixed POM exclusion rules for datanucleus (it wasn't properly excluding it, whereas SBT was)
Author: Matei Zaharia <matei@databricks.com>
Closes#819 from mateiz/win-fixes and squashes the following commits:
d558f96 [Matei Zaharia] Fix comment
228577b [Matei Zaharia] Review comments
d3b71c7 [Matei Zaharia] Properly exclude datanucleus files in Maven assembly
144af84 [Matei Zaharia] Update Windows scripts to match latest binary package layout
Just a small change. I think it's good not to scare people who are using the old options.
Author: Patrick Wendell <pwendell@gmail.com>
Closes#810 from pwendell/warnings and squashes the following commits:
cb8a311 [Patrick Wendell] Make deprecation warning less severe
**Problem.** For `bin/pyspark`, there is currently no other way to specify Spark configuration properties other than through `SPARK_JAVA_OPTS` in `conf/spark-env.sh`. However, this mechanism is supposedly deprecated. Instead, it needs to pick up configurations explicitly specified in `conf/spark-defaults.conf`.
**Solution.** Have `bin/pyspark` invoke `bin/spark-submit`, like all of its counterparts in Scala land (i.e. `bin/spark-shell`, `bin/run-example`). This has the additional benefit of making the invocation of all the user facing Spark scripts consistent.
**Details.** `bin/pyspark` inherently handles two cases: (1) running python applications and (2) running the python shell. For (1), Spark submit already handles running python applications. For cases in which `bin/pyspark` is given a python file, we can simply call pass the file directly to Spark submit and let it handle the rest.
For case (2), `bin/pyspark` starts a python process as before, which launches the JVM as a sub-process. The existing code already provides a code path to do this. All we needed to change is to use `bin/spark-submit` instead of `spark-class` to launch the JVM. This requires modifications to Spark submit to handle the pyspark shell as a special case.
This has been tested locally (OSX and Windows 7), on a standalone cluster, and on a YARN cluster. Running IPython also works as before, except now it takes in Spark submit arguments too.
Author: Andrew Or <andrewor14@gmail.com>
Closes#799 from andrewor14/pyspark-submit and squashes the following commits:
bf37e36 [Andrew Or] Minor changes
01066fa [Andrew Or] bin/pyspark for Windows
c8cb3bf [Andrew Or] Handle perverse app names (with escaped quotes)
1866f85 [Andrew Or] Windows is not cooperating
456d844 [Andrew Or] Guard against shlex hanging if PYSPARK_SUBMIT_ARGS is not set
7eebda8 [Andrew Or] Merge branch 'master' of github.com:apache/spark into pyspark-submit
b7ba0d8 [Andrew Or] Address a few comments (minor)
06eb138 [Andrew Or] Use shlex instead of writing our own parser
05879fa [Andrew Or] Merge branch 'master' of github.com:apache/spark into pyspark-submit
a823661 [Andrew Or] Fix --die-on-broken-pipe not propagated properly
6fba412 [Andrew Or] Deal with quotes + address various comments
fe4c8a7 [Andrew Or] Update --help for bin/pyspark
afe47bf [Andrew Or] Fix spark shell
f04aaa4 [Andrew Or] Merge branch 'master' of github.com:apache/spark into pyspark-submit
a371d26 [Andrew Or] Route bin/pyspark through Spark submit
Author: Michael Armbrust <michael@databricks.com>
Closes#808 from marmbrus/confClasspath and squashes the following commits:
4c31d57 [Michael Armbrust] Look in spark conf instead of system properties when propagating configuration to executors.
This causes an unrecoverable error for applications that are running for longer
than 7 days that have jars added to the SparkContext, as the jars are cleaned up
even though the application is still running.
Author: Aaron Davidson <aaron@databricks.com>
Closes#800 from aarondav/shitty-defaults and squashes the following commits:
a573fbb [Aaron Davidson] SPARK-1860: Do not cleanup application work/ directories by default
Author: Huajian Mao <huajianmao@gmail.com>
Closes#798 from huajianmao/patch-1 and squashes the following commits:
208a454 [Huajian Mao] A typo in Task
1b515af [Huajian Mao] A typo in the message
This is a few changes based on the original patch by @scrapcodes.
Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Patrick Wendell <pwendell@gmail.com>
Closes#785 from pwendell/package-docs and squashes the following commits:
c32b731 [Patrick Wendell] Changes based on Prashant's patch
c0463d3 [Prashant Sharma] added eof new line
ce8bf73 [Prashant Sharma] Added eof new line to all files.
4c35f2e [Prashant Sharma] SPARK-1563 Add package-info.java and package.scala files for all packages that appear in docs
Author: Patrick Wendell <pwendell@gmail.com>
Closes#784 from pwendell/group-by-key and squashes the following commits:
9b4505f [Patrick Wendell] Small fix
6347924 [Patrick Wendell] Documentation: Encourage use of reduceByKey instead of groupByKey.
Running SparkPi example gave this error.
```
Pi is roughly 3.14374
14/05/14 18:16:19 ERROR Utils: Uncaught exception in thread SparkListenerBus
scala.runtime.NonLocalReturnControl$mcV$sp
```
This is due to the catch-all in the SparkListenerBus, which logged control throwable used by scala system
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#783 from tdas/controlexception-fix and squashes the following commits:
a466c8d [Tathagata Das] Ignored control exceptions when logging all exceptions.
Author: andrewor14 <andrewor14@gmail.com>
Closes#780 from andrewor14/submit-typo and squashes the following commits:
e70e057 [andrewor14] propertes -> properties
After having been invited to make the change in 6bee01dd04 (commitcomment-6284165) by @witgo.
Author: Jacek Laskowski <jacek@japila.pl>
Closes#748 from jaceklaskowski/sparkenv-string-interpolation and squashes the following commits:
be6ebac [Jacek Laskowski] String interpolation + some other small changes
This is nicer than relying on new SparkContext(new SparkConf())
Author: Patrick Wendell <pwendell@gmail.com>
Closes#774 from pwendell/spark-context and squashes the following commits:
ef9f12f [Patrick Wendell] SPARK-1833 - Have an empty SparkContext constructor.
As "99 ms" up to 99 ms
As "0.1 s" from 0.1 s up to 0.9 s
https://issues.apache.org/jira/browse/SPARK-1829
Compare the first image to the second here: http://imgur.com/RaLEsSZ,7VTlgfo#0
Author: Andrew Ash <andrew@andrewash.com>
Closes#768 from ash211/spark-1829 and squashes the following commits:
1c15b8e [Andrew Ash] SPARK-1829 Format sub-second durations more appropriately
If the intended behavior was that uncaught exceptions thrown in functions being run by the Akka scheduler would end up being handled by the default uncaught exception handler set in Executor, and if that behavior is, in fact, correct, then this is a way to accomplish that. I'm not certain, though, that we shouldn't be doing something different to handle uncaught exceptions from some of these scheduled functions.
In any event, this PR covers all of the cases I comment on in [SPARK-1620](https://issues.apache.org/jira/browse/SPARK-1620).
Author: Mark Hamstra <markhamstra@gmail.com>
Closes#622 from markhamstra/SPARK-1620 and squashes the following commits:
071d193 [Mark Hamstra] refactored post-SPARK-1772
1a6a35e [Mark Hamstra] another style fix
d30eb94 [Mark Hamstra] scalastyle
3573ecd [Mark Hamstra] Use wrapped try/catch in Utils.tryOrExit
8fc0439 [Mark Hamstra] Make functions run by the Akka scheduler use Executor's UncaughtExceptionHandler
This PR replaces the Schedulable data structures in Pool.scala with thread-safe ones from java. Note that Scala's `with SynchronizedBuffer` trait is soon to be deprecated in 2.11 because it is ["inherently unreliable"](http://www.scala-lang.org/api/2.11.0/index.html#scala.collection.mutable.SynchronizedBuffer). We should slowly drift away from `SynchronizedBuffer` in other places too.
Note that this PR introduces an API-breaking change; `sc.getAllPools` now returns an Array rather than an ArrayBuffer. This is because we want this method to return an immutable copy rather than one may potentially confuse the user if they try to modify the copy, which takes no effect on the original data structure.
Author: Andrew Or <andrewor14@gmail.com>
Closes#762 from andrewor14/pool-npe and squashes the following commits:
383e739 [Andrew Or] JavaConverters -> JavaConversions
3f32981 [Andrew Or] Merge branch 'master' of github.com:apache/spark into pool-npe
769be19 [Andrew Or] Assorted minor changes
2189247 [Andrew Or] Merge branch 'master' of github.com:apache/spark into pool-npe
05ad9e9 [Andrew Or] Fix test - contains is not the same as containsKey
0921ea0 [Andrew Or] var -> val
07d720c [Andrew Or] Synchronize Schedulable data structures
...loper api
Author: Koert Kuipers <koert@tresata.com>
Closes#764 from koertkuipers/feat-rdd-developerapi and squashes the following commits:
8516dd2 [Koert Kuipers] SPARK-1801. expose InterruptibleIterator and TaskKilledException in developer api
Add the implementation for ApproximateCountDistinct to SparkSql. We use the HyperLogLog algorithm implemented in stream-lib, and do the count in two phases: 1) counting the number of distinct elements in each partitions, and 2) merge the HyperLogLog results from different partitions.
A simple serializer and test cases are added as well.
Author: larvaboy <larvaboy@gmail.com>
Closes#737 from larvaboy/master and squashes the following commits:
bd8ef3f [larvaboy] Add support of user-provided standard deviation to ApproxCountDistinct.
9ba8360 [larvaboy] Fix alignment and null handling issues.
95b4067 [larvaboy] Add a test case for count distinct and approximate count distinct.
f57917d [larvaboy] Add the parser for the approximate count.
a2d5d10 [larvaboy] Add ApproximateCountDistinct aggregates and functions.
7ad273a [larvaboy] Add SparkSql serializer for HyperLogLog.
1d9aacf [larvaboy] Fix a minor typo in the toString method of the Count case class.
653542b [larvaboy] Fix a couple of minor typos.
This change adds a new partitioner which allows users
to specify # of keys per partition.
Author: Syed Hashmi <shashmi@cloudera.com>
Closes#721 from syedhashmi/master and squashes the following commits:
4ca94cc [Syed Hashmi] [SPARK-1784] Add a new partitioner
JIRA issue: [SPARK-1527](https://issues.apache.org/jira/browse/SPARK-1527)
getName() only gets the last component of the file path. When deleting test-generated directories,
we should pass the generated directory's absolute path to DiskBlockManager.
Author: Ye Xianjin <advancedxy@gmail.com>
This patch had conflicts when merged, resolved by
Committer: Patrick Wendell <pwendell@gmail.com>
Closes#436 from advancedxy/SPARK-1527 and squashes the following commits:
4678bab [Ye Xianjin] change rootDir*.getname to rootDir*.getAbsolutePath so the temporary directories are deleted when the test is finished.
The solution is to wrap a try / catch / log around the posting of each event to each listener.
Author: Andrew Or <andrewor14@gmail.com>
Closes#759 from andrewor14/listener-die and squashes the following commits:
aee5107 [Andrew Or] Merge branch 'master' of github.com:apache/spark into listener-die
370939f [Andrew Or] Remove two layers of indirection
422d278 [Andrew Or] Explicitly throw an exception instead of 1 / 0
0df0e2a [Andrew Or] Try/catch and log exceptions when posting events
This patch checks top-level closure arguments to `ClosureCleaner.clean` for `return` statements and raises an exception if it finds any. This is mainly a user-friendliness addition, since programs with return statements in closure arguments will currently fail upon RDD actions with a less-than-intuitive error message.
Author: William Benton <willb@redhat.com>
Closes#717 from willb/spark-571 and squashes the following commits:
c41eb7d [William Benton] Another test case for SPARK-571
30c42f4 [William Benton] Stylistic cleanups
559b16b [William Benton] Stylistic cleanups from review
de13b79 [William Benton] Style fixes
295b6a5 [William Benton] Forbid return statements in closure arguments.
b017c47 [William Benton] Added a test for SPARK-571
Author: Sandy Ryza <sandy@cloudera.com>
Closes#753 from sryza/sandy-spark-1815 and squashes the following commits:
957a8ac [Sandy Ryza] SPARK-1815. SparkContext should not be marked DeveloperApi
What they really mean is SPARK_DAEMON_***JAVA***_OPTS
Author: Andrew Or <andrewor14@gmail.com>
Closes#751 from andrewor14/spark-daemon-opts and squashes the following commits:
70c41f9 [Andrew Or] SPARK_DAEMON_OPTS -> SPARK_DAEMON_JAVA_OPTS
Author: Andrew Ash <andrew@andrewash.com>
Closes#743 from ash211/patch-4 and squashes the following commits:
c959f3b [Andrew Ash] Typo: resond -> respond
This seems strictly better, and I think it's justified only the grounds of
clean-up. It might also fix issues with path conversions, but I haven't
yet isolated any instance of that happening.
/cc @srowen @tdas
Author: Patrick Wendell <pwendell@gmail.com>
Closes#749 from pwendell/broadcast-cleanup and squashes the following commits:
d6d54f2 [Patrick Wendell] SPARK-1623: Use File objects instead of string's in HTTPBroadcast
This was changed, but in fact, it's used for things other than tests.
So I've changed it back.
Author: Patrick Wendell <pwendell@gmail.com>
Closes#747 from pwendell/executor-env and squashes the following commits:
36a60a5 [Patrick Wendell] Rename testExecutorEnvs --> executorEnvs.
Three issues related to temp files that tests generate – these should be touched up for hygiene but are not urgent.
Modules have a log4j.properties which directs the unit-test.log output file to a directory like `[module]/target/unit-test.log`. But this ends up creating `[module]/[module]/target/unit-test.log` instead of former.
The `work/` directory is not deleted by "mvn clean", in the parent and in modules. Neither is the `checkpoint/` directory created under the various external modules.
Many tests create a temp directory, which is not usually deleted. This can be largely resolved by calling `deleteOnExit()` at creation and trying to call `Utils.deleteRecursively` consistently to clean up, sometimes in an `@After` method.
_If anyone seconds the motion, I can create a more significant change that introduces a new test trait along the lines of `LocalSparkContext`, which provides management of temp directories for subclasses to take advantage of._
Author: Sean Owen <sowen@cloudera.com>
Closes#732 from srowen/SPARK-1798 and squashes the following commits:
5af578e [Sean Owen] Try to consistently delete test temp dirs and files, and set deleteOnExit() for each
b21b356 [Sean Owen] Remove work/ and checkpoint/ dirs with mvn clean
bdd0f41 [Sean Owen] Remove duplicate module dir in log4j.properties output path for tests
Enabled Mesos (0.18.1) dependency with shaded protobuf
Why is this needed?
Avoids any protobuf version collision between Mesos and any other
dependency in Spark e.g. Hadoop HDFS 2.2+ or 1.0.4.
Ticket: https://issues.apache.org/jira/browse/SPARK-1806
* Should close https://issues.apache.org/jira/browse/SPARK-1433
Author berngp
Author: Bernardo Gomez Palacio <bernardo.gomezpalacio@gmail.com>
Closes#741 from berngp/feature/SPARK-1806 and squashes the following commits:
5d70646 [Bernardo Gomez Palacio] SPARK-1806: Upgrade Mesos dependency to 0.18.1
The main issue this patch fixes is [SPARK-1772](https://issues.apache.org/jira/browse/SPARK-1772), in which Executors may not die when fatal exceptions (e.g., OOM) are thrown. This patch causes Executors to delegate to the ExecutorUncaughtExceptionHandler when a fatal exception is thrown.
This patch also continues the fight in the neverending war against `case t: Throwable =>`, by only catching Exceptions in many places, and adding a wrapper for Threads and Runnables to make sure any uncaught exceptions are at least printed to the logs.
It also turns out that it is unlikely that the IndestructibleActorSystem actually works, given testing ([here](https://gist.github.com/aarondav/ca1f0cdcd50727f89c0d)). The uncaughtExceptionHandler is not called from the places that we expected it would be.
[SPARK-1620](https://issues.apache.org/jira/browse/SPARK-1620) deals with part of this issue, but refactoring our Actor Systems to ensure that exceptions are dealt with properly is a much bigger change, outside the scope of this PR.
Author: Aaron Davidson <aaron@databricks.com>
Closes#715 from aarondav/throwable and squashes the following commits:
f9b9bfe [Aaron Davidson] Remove other redundant 'throw e'
e937a0a [Aaron Davidson] Address Prashant and Matei's comments
1867867 [Aaron Davidson] [RFC] SPARK-1772 Stop catching Throwable, let Executors die
This patch adds better balancing when performing a repartition of an
RDD. Previously the elements in the RDD were hash partitioned, meaning
if the RDD was skewed certain partitions would end up being very large.
This commit adds load balancing of elements across the repartitioned
RDD splits. The load balancing is not perfect: a given output partition
can have up to N more elements than the average if there are N input
partitions. However, some randomization is used to minimize the
probabiliy that this happens.
Author: Patrick Wendell <pwendell@gmail.com>
Closes#727 from pwendell/load-balance and squashes the following commits:
f9da752 [Patrick Wendell] Response to Matei's feedback
acfa46a [Patrick Wendell] SPARK-1770: Load balance elements when repartitioning.
Author: witgo <witgo@qq.com>
Closes#728 from witgo/scala_home and squashes the following commits:
cdfd8be [witgo] Merge branch 'master' of https://github.com/apache/spark into scala_home
fac094a [witgo] remove outdated runtime Information scala home
SparkSubmit ignores `--jars` for YARN client. This is a bug.
This PR also automatically adds the application jar to `spark.jar`. Previously, when running as yarn-client, you must specify the jar additionally through `--files` (because `--jars` didn't work). Now you don't have to explicitly specify it through either.
Tested on a YARN cluster.
Author: Andrew Or <andrewor14@gmail.com>
Closes#710 from andrewor14/yarn-jars and squashes the following commits:
35d1928 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-jars
c27bf6c [Andrew Or] For yarn-cluster and python, do not add primaryResource to spark.jar
c92c5bf [Andrew Or] Minor cleanups
269f9f3 [Andrew Or] Fix format
013d840 [Andrew Or] Fix tests
1407474 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-jars
3bb75e8 [Andrew Or] Allow SparkSubmit --jars to take effect in yarn-client mode