The default SparkConf constructor loads default properties, which can make the test fail.
Author: Xiangrui Meng <meng@databricks.com>
Closes#775 from mengxr/pyspark-conf-fix and squashes the following commits:
83ef6c4 [Xiangrui Meng] do not load defaults when testing SparkConf in pyspark
This is nicer than relying on `new SparkContext(new SparkConf())`.
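For illustration, the new convenience constructor alongside the old pattern (assuming `spark.master` and friends are supplied via system properties, e.g. by spark-submit):

```
import org.apache.spark.{SparkConf, SparkContext}

// New: configuration (master, app name, etc.) is picked up from spark.* system properties.
val sc = new SparkContext()

// Old: the same thing spelled out explicitly.
val scOld = new SparkContext(new SparkConf())
```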
Author: Patrick Wendell <pwendell@gmail.com>
Closes#774 from pwendell/spark-context and squashes the following commits:
ef9f12f [Patrick Wendell] SPARK-1833 - Have an empty SparkContext constructor.
As "99 ms" up to 99 ms
As "0.1 s" from 0.1 s up to 0.9 s
https://issues.apache.org/jira/browse/SPARK-1829
Compare the first image to the second here: http://imgur.com/RaLEsSZ,7VTlgfo#0
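A minimal sketch of that formatting rule (the helper name and the whole-seconds fallback are illustrative, not the actual UI code):

```
def formatDuration(ms: Long): String =
  if (ms < 100) s"$ms ms"                        // "99 ms" for durations up to 99 ms
  else if (ms < 1000) s"${(ms / 100) / 10.0} s"  // "0.1 s" .. "0.9 s" in tenths of a second
  else s"${ms / 1000} s"                         // whole seconds beyond that (sketch only)
```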
Author: Andrew Ash <andrew@andrewash.com>
Closes#768 from ash211/spark-1829 and squashes the following commits:
1c15b8e [Andrew Ash] SPARK-1829 Format sub-second durations more appropriately
If the intended behavior was that uncaught exceptions thrown in functions being run by the Akka scheduler would end up being handled by the default uncaught exception handler set in Executor, and if that behavior is, in fact, correct, then this is a way to accomplish that. I'm not certain, though, that we shouldn't be doing something different to handle uncaught exceptions from some of these scheduled functions.
In any event, this PR covers all of the cases I comment on in [SPARK-1620](https://issues.apache.org/jira/browse/SPARK-1620).
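A minimal sketch of the `Utils.tryOrExit` idea, assuming the intended behavior described above (the handler wiring is illustrative):

```
// Wrap work handed to the Akka scheduler so that an uncaught throwable reaches the
// current thread's uncaught exception handler (set by the Executor) instead of
// being silently swallowed by the scheduler.
def tryOrExit(block: => Unit): Unit =
  try {
    block
  } catch {
    case t: Throwable =>
      Thread.currentThread().getUncaughtExceptionHandler
        .uncaughtException(Thread.currentThread(), t)
  }
```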
Author: Mark Hamstra <markhamstra@gmail.com>
Closes#622 from markhamstra/SPARK-1620 and squashes the following commits:
071d193 [Mark Hamstra] refactored post-SPARK-1772
1a6a35e [Mark Hamstra] another style fix
d30eb94 [Mark Hamstra] scalastyle
3573ecd [Mark Hamstra] Use wrapped try/catch in Utils.tryOrExit
8fc0439 [Mark Hamstra] Make functions run by the Akka scheduler use Executor's UncaughtExceptionHandler
See https://issues.apache.org/jira/browse/SPARK-1828 for more information.
This is being submitted to Jenkins for testing. The dependency won't fully propagate in Maven central for a few more hours.
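As a hedged illustration of what the forked artifact means for downstream builds (these coordinates are an assumption, not taken from the PR), an sbt-style dependency might look like:

```
// Assumed coordinates for the forked, dependency-free hive-exec artifact.
libraryDependencies += "org.spark-project.hive" % "hive-exec" % "0.12.0"
```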
Author: Patrick Wendell <pwendell@gmail.com>
Closes#767 from pwendell/hive-shaded and squashes the following commits:
ea10ac5 [Patrick Wendell] SPARK-1828: Created forked version of hive-exec that doesn't bundle other dependencies
Place more emphasis on using precompiled binary versions of Spark and Mesos
instead of encouraging the reader to compile from source.
Author: Andrew Ash <andrew@andrewash.com>
Closes#756 from ash211/spark-1818 and squashes the following commits:
7ef3b33 [Andrew Ash] Brief explanation of the interactions between Spark and Mesos
e7dea8e [Andrew Ash] Add troubleshooting and debugging section
956362d [Andrew Ash] Don't need to pass spark.executor.uri into the spark shell
de3353b [Andrew Ash] Wrap to 100char
7ebf6ef [Andrew Ash] Polish on the section on Mesos Master URLs
3dcc2c1 [Andrew Ash] Use --tgz parameter of make-distribution
41b68ed [Andrew Ash] Period at end of sentence; formatting on :5050
8bf2c53 [Andrew Ash] Update site.MESOS_VERSIOn to match /pom.xml
74f2040 [Andrew Ash] SPARK-1818 Freshen Mesos documentation
LICENSE and NOTICE policy is explained here:
http://www.apache.org/dev/licensing-howto.html
http://www.apache.org/legal/3party.html
This leads to the following changes.
First, this change enables two extensions to maven-shade-plugin in assembly/ that will try to include and merge all NOTICE and LICENSE files. This can't hurt.
This generates a consolidated NOTICE file that I manually added to NOTICE.
Next, a list of all dependencies and their licenses was generated:
`mvn ... license:aggregate-add-third-party`
to create: `target/generated-sources/license/THIRD-PARTY.txt`
Each dependency is listed with one or more licenses; where there was more than one, I chose the most compatible license.
For dependencies with an "unknown" license, I manually evaluated their licenses. Many are actually Apache projects or components of projects already covered. The only non-trivial one was Colt, which has its own (compatible) license.
I ignored Apache-licensed and public domain dependencies as these require no further action (beyond NOTICE above).
BSD and MIT licenses (permissive Category A licenses) are evidently supposed to be mentioned in LICENSE, so I added a section with the relevant output from the THIRD-PARTY.txt file.
Everything else (Category B licenses) is evidently supposed to be mentioned in NOTICE; I did the same there.
LICENSE contained some license statements for source code that is redistributed. I left this as I think that is the right place to put it.
Author: Sean Owen <sowen@cloudera.com>
Closes#770 from srowen/SPARK-1827 and squashes the following commits:
a764504 [Sean Owen] Add LICENSE and NOTICE info for all transitive dependencies as of 1.0
Pretty self-explanatory
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#722 from tdas/example-fix and squashes the following commits:
7839979 [Tathagata Das] Minor changes.
0673441 [Tathagata Das] Fixed java docs of java streaming example
e687123 [Tathagata Das] Fixed scala style errors.
9b8d112 [Tathagata Das] Fixed streaming examples docs to use run-example instead of spark-submit.
This PR replaces the Schedulable data structures in Pool.scala with thread-safe ones from java. Note that Scala's `with SynchronizedBuffer` trait is soon to be deprecated in 2.11 because it is ["inherently unreliable"](http://www.scala-lang.org/api/2.11.0/index.html#scala.collection.mutable.SynchronizedBuffer). We should slowly drift away from `SynchronizedBuffer` in other places too.
Note that this PR introduces an API-breaking change; `sc.getAllPools` now returns an Array rather than an ArrayBuffer. This is because we want this method to return an immutable copy rather than one that may confuse users if they try to modify it, since modifying the copy has no effect on the original data structure.
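A minimal sketch of the pattern, not the actual Pool.scala code; `Pool` here is a stand-in for the real Schedulable types:

```
import java.util.concurrent.{ConcurrentHashMap, ConcurrentLinkedQueue}

class Pool(val name: String)   // stand-in class

// Mutable scheduler state lives in thread-safe java.util.concurrent collections...
val schedulableNameToPool = new ConcurrentHashMap[String, Pool]()
val schedulableQueue      = new ConcurrentLinkedQueue[Pool]()

// ...and callers get an immutable snapshot, so mutating the result cannot appear
// to (but fail to) modify the scheduler's own data structures.
def getAllPools: Array[Pool] = schedulableNameToPool.values.toArray(new Array[Pool](0))
```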
Author: Andrew Or <andrewor14@gmail.com>
Closes#762 from andrewor14/pool-npe and squashes the following commits:
383e739 [Andrew Or] JavaConverters -> JavaConversions
3f32981 [Andrew Or] Merge branch 'master' of github.com:apache/spark into pool-npe
769be19 [Andrew Or] Assorted minor changes
2189247 [Andrew Or] Merge branch 'master' of github.com:apache/spark into pool-npe
05ad9e9 [Andrew Or] Fix test - contains is not the same as containsKey
0921ea0 [Andrew Or] var -> val
07d720c [Andrew Or] Synchronize Schedulable data structures
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#763 from vanzin/netty-dep-hell and squashes the following commits:
dfb6ce2 [Marcelo Vanzin] Fix dep exclusion: avro-ipc, not avro, depends on netty.
...loper api
Author: Koert Kuipers <koert@tresata.com>
Closes#764 from koertkuipers/feat-rdd-developerapi and squashes the following commits:
8516dd2 [Koert Kuipers] SPARK-1801. expose InterruptibleIterator and TaskKilledException in developer api
Add the implementation for ApproximateCountDistinct to SparkSql. We use the HyperLogLog algorithm implemented in stream-lib, and do the count in two phases: 1) count the number of distinct elements in each partition, and 2) merge the HyperLogLog results from the different partitions.
A simple serializer and test cases are added as well.
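A minimal sketch of the two-phase idea, applied directly to an RDD with stream-lib's HyperLogLog (an illustration of the approach, not the SparkSql code path; assumes stream-lib is on the classpath and its sketches are Java-serializable):

```
import com.clearspring.analytics.stream.cardinality.HyperLogLog
import org.apache.spark.rdd.RDD

def approxCountDistinct(data: RDD[String], rsd: Double = 0.05): Long = {
  val merged = data.mapPartitions { iter =>
    val hll = new HyperLogLog(rsd)     // phase 1: one sketch per partition
    iter.foreach(hll.offer)
    Iterator(hll)
  }.reduce { (a, b) =>                 // phase 2: merge the per-partition sketches
    a.addAll(b)
    a
  }
  merged.cardinality()
}
```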
Author: larvaboy <larvaboy@gmail.com>
Closes#737 from larvaboy/master and squashes the following commits:
bd8ef3f [larvaboy] Add support of user-provided standard deviation to ApproxCountDistinct.
9ba8360 [larvaboy] Fix alignment and null handling issues.
95b4067 [larvaboy] Add a test case for count distinct and approximate count distinct.
f57917d [larvaboy] Add the parser for the approximate count.
a2d5d10 [larvaboy] Add ApproximateCountDistinct aggregates and functions.
7ad273a [larvaboy] Add SparkSql serializer for HyperLogLog.
1d9aacf [larvaboy] Fix a minor typo in the toString method of the Count case class.
653542b [larvaboy] Fix a couple of minor typos.
This change adds a new partitioner which allows users
to specify the number of keys per partition.
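A hedged sketch of the concept (the class name and key handling are illustrative, not the API added by this PR):

```
import org.apache.spark.Partitioner

// Places a fixed number of consecutive integer keys into each partition.
class KeysPerPartitioner(totalKeys: Int, keysPerPartition: Int) extends Partitioner {
  require(totalKeys > 0 && keysPerPartition > 0)

  override val numPartitions: Int =
    (totalKeys + keysPerPartition - 1) / keysPerPartition      // ceiling division

  override def getPartition(key: Any): Int = key match {
    case i: Int if i >= 0 && i < totalKeys => i / keysPerPartition
    case other =>                                              // fallback for other keys
      ((other.hashCode % numPartitions) + numPartitions) % numPartitions
  }
}
```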
Author: Syed Hashmi <shashmi@cloudera.com>
Closes#721 from syedhashmi/master and squashes the following commits:
4ca94cc [Syed Hashmi] [SPARK-1784] Add a new partitioner
Author: Michael Armbrust <michael@databricks.com>
Closes#761 from marmbrus/existingContext and squashes the following commits:
4651051 [Michael Armbrust] Make it possible to create Java/Python SQLContexts from an existing Scala SQLContext.
JIRA issue: [SPARK-1527](https://issues.apache.org/jira/browse/SPARK-1527)
getName() only gets the last component of the file path. When deleting test-generated directories,
we should pass the generated directory's absolute path to DiskBlockManager.
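For illustration (plain `java.io.File` behaviour; the path is hypothetical):

```
import java.io.File

val dir = new File("/tmp/spark-local-test/blockmgr-0")
dir.getName          // "blockmgr-0" -- only the last path component
dir.getAbsolutePath  // "/tmp/spark-local-test/blockmgr-0" -- what deletion actually needs
```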
Author: Ye Xianjin <advancedxy@gmail.com>
This patch had conflicts when merged, resolved by
Committer: Patrick Wendell <pwendell@gmail.com>
Closes#436 from advancedxy/SPARK-1527 and squashes the following commits:
4678bab [Ye Xianjin] change rootDir*.getname to rootDir*.getAbsolutePath so the temporary directories are deleted when the test is finished.
The solution is to wrap a try / catch / log around the posting of each event to each listener.
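A minimal sketch of that pattern (the listener trait and logger are stand-ins, not the SparkListenerBus code):

```
import org.slf4j.LoggerFactory

trait Listener { def onEvent(event: Any): Unit }   // stand-in for SparkListener

val log = LoggerFactory.getLogger("ListenerBusSketch")

def postToAll(listeners: Seq[Listener], event: Any): Unit =
  listeners.foreach { listener =>
    try listener.onEvent(event) catch {
      case e: Exception =>
        // Log and keep going: one misbehaving listener must not kill the bus.
        log.error(s"Listener ${listener.getClass.getName} threw an exception", e)
    }
  }
```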
Author: Andrew Or <andrewor14@gmail.com>
Closes#759 from andrewor14/listener-die and squashes the following commits:
aee5107 [Andrew Or] Merge branch 'master' of github.com:apache/spark into listener-die
370939f [Andrew Or] Remove two layers of indirection
422d278 [Andrew Or] Explicitly throw an exception instead of 1 / 0
0df0e2a [Andrew Or] Try/catch and log exceptions when posting events
Summary:
https://issues.apache.org/jira/browse/SPARK-1791
Simple fix, and backward compatible, since
- anyone who set the threshold was getting completely wrong answers.
- anyone who did not set the threshold had the default 0.0 value for the threshold anyway.
Test Plan:
Unit test added that is verified to fail under the old implementation,
and pass under the new implementation.
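A hedged sketch of the behaviour the fix restores: when a threshold is configured, the raw margin is compared against it (illustrative, not the MLlib source):

```
// margin is the raw SVM score (w . x + b); with a threshold set, predictions are
// binary, otherwise the raw margin is returned.
def predictPoint(margin: Double, threshold: Option[Double]): Double =
  threshold match {
    case Some(t) => if (margin > t) 1.0 else 0.0
    case None    => margin
  }
```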
Author: Andrew Tulloch <andrew@tullo.ch>
Closes#725 from ajtulloch/SPARK-1791-SVM and squashes the following commits:
770f55d [Andrew Tulloch] SPARK-1791 - SVM implementation does not use threshold parameter
This patch checks top-level closure arguments to `ClosureCleaner.clean` for `return` statements and raises an exception if it finds any. This is mainly a user-friendliness addition, since programs with return statements in closure arguments will currently fail upon RDD actions with a less-than-intuitive error message.
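For illustration, the kind of user code this check now rejects up front (hypothetical; the `return` is a non-local return out of the closure and previously failed only when the job ran):

```
import org.apache.spark.rdd.RDD

def firstOverThreshold(rdd: RDD[Int]): Int = {
  rdd.foreach { x =>
    if (x > 100) return x   // `return` escapes the closure passed to foreach
  }
  -1
}
```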
Author: William Benton <willb@redhat.com>
Closes#717 from willb/spark-571 and squashes the following commits:
c41eb7d [William Benton] Another test case for SPARK-571
30c42f4 [William Benton] Stylistic cleanups
559b16b [William Benton] Stylistic cleanups from review
de13b79 [William Benton] Style fixes
295b6a5 [William Benton] Forbid return statements in closure arguments.
b017c47 [William Benton] Added a test for SPARK-571
Author: Sandy Ryza <sandy@cloudera.com>
Closes#753 from sryza/sandy-spark-1815 and squashes the following commits:
957a8ac [Sandy Ryza] SPARK-1815. SparkContext should not be marked DeveloperApi
YARN
- SparkPi was updated to not take in master as an argument; we should update the docs to reflect that.
- The default YARN build guide should be in maven, not sbt.
- This PR also adds a paragraph on steps to debug a YARN application.
Standalone
- Emphasize spark-submit more. Right now it's one small paragraph preceding the legacy way of launching through `org.apache.spark.deploy.Client`.
- The way the old docs describe setting configurations / environment variables is outdated; this needs to reflect the recent changes to Spark configuration.
In general, this PR also adds a little more documentation on the new spark-shell, spark-submit, spark-defaults.conf etc here and there.
Author: Andrew Or <andrewor14@gmail.com>
Closes#701 from andrewor14/yarn-docs and squashes the following commits:
e2c2312 [Andrew Or] Merge in changes in #752 (SPARK-1814)
25cfe7b [Andrew Or] Merge in the warning from SPARK-1753
a8c39c5 [Andrew Or] Minor changes
336bbd9 [Andrew Or] Tabs -> spaces
4d9d8f7 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-docs
041017a [Andrew Or] Abstract Spark submit documentation to cluster-overview.html
3cc0649 [Andrew Or] Detail how to set configurations + remove legacy instructions
5b7140a [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-docs
85a51fc [Andrew Or] Update run-example, spark-shell, configuration etc.
c10e8c7 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-docs
381fe32 [Andrew Or] Update docs for standalone mode
757c184 [Andrew Or] Add a note about the requirements for the debugging trick
f8ca990 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-docs
924f04c [Andrew Or] Revert addition of --deploy-mode
d5fe17b [Andrew Or] Update the YARN docs
What they really mean is SPARK_DAEMON_***JAVA***_OPTS
Author: Andrew Or <andrewor14@gmail.com>
Closes#751 from andrewor14/spark-daemon-opts and squashes the following commits:
70c41f9 [Andrew Or] SPARK_DAEMON_OPTS -> SPARK_DAEMON_JAVA_OPTS
https://issues.apache.org/jira/browse/SPARK-1757
The first test succeeds, but the second test fails with exception:
```
[info] - save and load case class RDD with Nones as parquet *** FAILED *** (14 milliseconds)
[info] java.lang.RuntimeException: Unsupported datatype StructType(List())
[info] at scala.sys.package$.error(package.scala:27)
[info] at org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType(ParquetRelation.scala:201)
[info] at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$1.apply(ParquetRelation.scala:235)
[info] at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$1.apply(ParquetRelation.scala:235)
[info] at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[info] at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[info] at scala.collection.immutable.List.foreach(List.scala:318)
[info] at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
[info] at scala.collection.AbstractTraversable.map(Traversable.scala:105)
[info] at org.apache.spark.sql.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetRelation.scala:234)
[info] at org.apache.spark.sql.parquet.ParquetTypesConverter$.writeMetaData(ParquetRelation.scala:267)
[info] at org.apache.spark.sql.parquet.ParquetRelation$.createEmpty(ParquetRelation.scala:143)
[info] at org.apache.spark.sql.parquet.ParquetRelation$.create(ParquetRelation.scala:122)
[info] at org.apache.spark.sql.execution.SparkStrategies$ParquetOperations$.apply(SparkStrategies.scala:139)
[info] at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
[info] at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
[info] at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
[info] at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
[info] at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:264)
[info] at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:264)
[info] at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:265)
[info] at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:265)
[info] at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:268)
[info] at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:268)
[info] at org.apache.spark.sql.SchemaRDDLike$class.saveAsParquetFile(SchemaRDDLike.scala:66)
[info] at org.apache.spark.sql.SchemaRDD.saveAsParquetFile(SchemaRDD.scala:98)
```
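For reference, a hypothetical reproduction of the failing shape: a case class with `Option[_]` fields written out via `saveAsParquetFile`:

```
case class OptionalData(intField: Option[Int], stringField: Option[String])

// With the SQLContext implicits imported, this previously failed as shown above:
// sc.parallelize(Seq(OptionalData(Some(1), None))).saveAsParquetFile("optional-data.parquet")
```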
Author: Andrew Ash <andrew@andrewash.com>
Author: Michael Armbrust <michael@databricks.com>
Closes#690 from ash211/rdd-parquet-save and squashes the following commits:
747a0b9 [Andrew Ash] Merge pull request #1 from marmbrus/pr/690
54bd00e [Michael Armbrust] Need to put Option first since Option <: Seq.
8f3f281 [Andrew Ash] SPARK-1757 Add failing test for saving SparkSQL Schemas with Option[?] fields as parquet
As I mentioned in SPARK-1765, the word 'JXM' appears in monitoring.md.
I think it's a typo for 'JMX'.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#698 from sarutak/SPARK-1765 and squashes the following commits:
bae9843 [Kousuke Saruta] modified a typoe in monitoring.md
Documentation for L-BFGS, and an example of training binary L2 logistic regression using L-BFGS.
Author: DB Tsai <dbtsai@alpinenow.com>
Closes#702 from dbtsai/dbtsai-lbfgs-doc and squashes the following commits:
0712215 [DB Tsai] Update
38fdfa1 [DB Tsai] Removed extra empty line
5745b64 [DB Tsai] Update again
e9e418e [DB Tsai] Update
7381521 [DB Tsai] L-BFGS Documentation
Author: Andrew Ash <andrew@andrewash.com>
Closes#743 from ash211/patch-4 and squashes the following commits:
c959f3b [Andrew Ash] Typo: resond -> respond
I need this to be public for the implementation of SharkServer2. However, I think this functionality is generally useful and should be pretty stable.
Author: Michael Armbrust <michael@databricks.com>
Closes#750 from marmbrus/metastoreTypes and squashes the following commits:
f51b62e [Michael Armbrust] Make Hive Metastore conversion functions publicly visible.
Tested on Windows 7.
Author: Andrew Or <andrewor14@gmail.com>
Closes#745 from andrewor14/windows-submit and squashes the following commits:
c0b58fb [Andrew Or] Allow spaces in parameters
162e54d [Andrew Or] Merge branch 'master' of github.com:apache/spark into windows-submit
91597ce [Andrew Or] Make spark-shell.cmd use spark-submit.cmd
af6fd29 [Andrew Or] Add spark submit for Windows
Following on a few more items from SPARK-1802 --
The first commit touches up a few similar problems remaining with the YARN profile. I think this is worth cherry-picking.
The second commit is more of the same for hadoop-client, although the fix is a little more complex. It may or may not be worth bothering with.
Author: Sean Owen <sowen@cloudera.com>
Closes#746 from srowen/SPARK-1802.2 and squashes the following commits:
52aeb41 [Sean Owen] Add more commons-logging, servlet excludes to avoid conflicts in assembly when building for YARN
This seems strictly better, and I think it's justified on clean-up grounds alone. It might also fix issues with path conversions, but I haven't yet isolated any instance of that happening.
/cc @srowen @tdas
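An assumed illustration of the shape of the change (not the actual HttpBroadcast code): building paths with `java.io.File` rather than string concatenation:

```
import java.io.File

val broadcastDir = new File(System.getProperty("java.io.tmpdir"), "broadcast")  // illustrative location
val blockFile    = new File(broadcastDir, "broadcast_0")
// ...versus the string-based equivalent it replaces:
val blockPath    = System.getProperty("java.io.tmpdir") + "/broadcast/broadcast_0"
```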
Author: Patrick Wendell <pwendell@gmail.com>
Closes#749 from pwendell/broadcast-cleanup and squashes the following commits:
d6d54f2 [Patrick Wendell] SPARK-1623: Use File objects instead of string's in HTTPBroadcast
This was changed, but in fact, it's used for things other than tests.
So I've changed it back.
Author: Patrick Wendell <pwendell@gmail.com>
Closes#747 from pwendell/executor-env and squashes the following commits:
36a60a5 [Patrick Wendell] Rename testExecutorEnvs --> executorEnvs.
This initial commit resolves the conflicts in the Hive profiles as noted in https://issues.apache.org/jira/browse/SPARK-1802 .
Most of the fix was to note that Hive drags in Avro, and so if the hive module depends on Spark's version of the `avro-*` dependencies, it will pull in our exclusions as needed too. But I found we need to copy some exclusions between the two Avro dependencies to get this right. And then had to squash some commons-logging intrusions.
This turned up another annoying find, that `hive-exec` is basically an "assembly" artifact that _also_ packages all of its transitive dependencies. This means the final assembly shows lots of collisions between itself and its dependencies, and even other project dependencies. I have a TODO to examine whether that is going to be a deal-breaker or not.
In the meantime I'm going to tack on a second commit to this PR that will also fix some similar, last collisions in the YARN profile.
Author: Sean Owen <sowen@cloudera.com>
Closes#744 from srowen/SPARK-1802 and squashes the following commits:
a856604 [Sean Owen] Resolve JAR version conflicts specific to Hive profile
Three issues related to temp files that tests generate – these should be touched up for hygiene but are not urgent.
Modules have a log4j.properties which directs the unit-test.log output file to a directory like `[module]/target/unit-test.log`. But this ends up creating `[module]/[module]/target/unit-test.log` instead of the former.
The `work/` directory is not deleted by "mvn clean", either in the parent or in modules. Neither is the `checkpoint/` directory that is created under the various external modules.
Many tests create a temp directory, which is not usually deleted. This can be largely resolved by calling `deleteOnExit()` at creation and trying to call `Utils.deleteRecursively` consistently to clean up, sometimes in an `@After` method.
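A minimal sketch of the cleanup pattern described above (plain JUnit and Guava here; the real change relies on `Utils.deleteRecursively`):

```
import java.io.File
import com.google.common.io.Files
import org.junit.After

class TempDirCleanupExample {
  private val tempDir: File = Files.createTempDir()
  tempDir.deleteOnExit()                              // fallback cleanup at JVM exit

  @After
  def cleanUp(): Unit = deleteRecursively(tempDir)    // per-test cleanup

  private def deleteRecursively(f: File): Unit = {
    if (f.isDirectory) f.listFiles().foreach(deleteRecursively)
    f.delete()
  }
}
```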
_If anyone seconds the motion, I can create a more significant change that introduces a new test trait along the lines of `LocalSparkContext`, which provides management of temp directories for subclasses to take advantage of._
Author: Sean Owen <sowen@cloudera.com>
Closes#732 from srowen/SPARK-1798 and squashes the following commits:
5af578e [Sean Owen] Try to consistently delete test temp dirs and files, and set deleteOnExit() for each
b21b356 [Sean Owen] Remove work/ and checkpoint/ dirs with mvn clean
bdd0f41 [Sean Owen] Remove duplicate module dir in log4j.properties output path for tests
Addressing an issue in MimaBuild.scala.
Author: Ankur Dave <ankurdave@gmail.com>
Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>
Closes#742 from jegonzal/edge_partition_serialization and squashes the following commits:
8ba6e0d [Ankur Dave] Add concatenation operators to MimaBuild.scala
cb2ed3a [Joseph E. Gonzalez] addressing missing exclusion in MimaBuild.scala
5d27824 [Ankur Dave] Disable reference tracking to fix serialization test
c0a9ae5 [Ankur Dave] Add failing test for EdgePartition Kryo serialization
a4a3faa [Joseph E. Gonzalez] Making EdgePartition serializable.
Enabled Mesos (0.18.1) dependency with shaded protobuf
Why is this needed?
Avoids any protobuf version collision between Mesos and any other
dependency in Spark e.g. Hadoop HDFS 2.2+ or 1.0.4.
Ticket: https://issues.apache.org/jira/browse/SPARK-1806
* Should close https://issues.apache.org/jira/browse/SPARK-1433
Author berngp
Author: Bernardo Gomez Palacio <bernardo.gomezpalacio@gmail.com>
Closes#741 from berngp/feature/SPARK-1806 and squashes the following commits:
5d70646 [Bernardo Gomez Palacio] SPARK-1806: Upgrade Mesos dependency to 0.18.1
The main issue this patch fixes is [SPARK-1772](https://issues.apache.org/jira/browse/SPARK-1772), in which Executors may not die when fatal exceptions (e.g., OOM) are thrown. This patch causes Executors to delegate to the ExecutorUncaughtExceptionHandler when a fatal exception is thrown.
This patch also continues the fight in the neverending war against `case t: Throwable =>`, by only catching Exceptions in many places, and adding a wrapper for Threads and Runnables to make sure any uncaught exceptions are at least printed to the logs.
It also turns out that it is unlikely that the IndestructibleActorSystem actually works, given testing ([here](https://gist.github.com/aarondav/ca1f0cdcd50727f89c0d)). The uncaughtExceptionHandler is not called from the places that we expected it would be.
[SPARK-1620](https://issues.apache.org/jira/browse/SPARK-1620) deals with part of this issue, but refactoring our Actor Systems to ensure that exceptions are dealt with properly is a much bigger change, outside the scope of this PR.
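A minimal sketch of the delegation described above (the handler lookup is illustrative; the real code delegates to ExecutorUncaughtExceptionHandler):

```
import scala.util.control.NonFatal

def runTask(task: => Unit): Unit =
  try {
    task
  } catch {
    case NonFatal(e) =>
      // Ordinary task failure: report it and let the executor keep running.
      println(s"Task failed: $e")
    case t: Throwable =>
      // Fatal error (e.g. OutOfMemoryError): hand it to the uncaught exception
      // handler, which is expected to bring the executor down.
      Thread.currentThread().getUncaughtExceptionHandler
        .uncaughtException(Thread.currentThread(), t)
  }
```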
Author: Aaron Davidson <aaron@databricks.com>
Closes#715 from aarondav/throwable and squashes the following commits:
f9b9bfe [Aaron Davidson] Remove other redundant 'throw e'
e937a0a [Aaron Davidson] Address Prashant and Matei's comments
1867867 [Aaron Davidson] [RFC] SPARK-1772 Stop catching Throwable, let Executors die
This appears to address the issue with edge partition serialization. The solution appears to be just registering the `PrimitiveKeyOpenHashMap`. However I noticed that we appear to have forked that code in GraphX but retained the same name (which is confusing). I also renamed our local copy to `GraphXPrimitiveKeyOpenHashMap`. We should consider dropping that and using the one in Spark if possible.
Author: Ankur Dave <ankurdave@gmail.com>
Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>
Closes#724 from jegonzal/edge_partition_serialization and squashes the following commits:
b0a525a [Ankur Dave] Disable reference tracking to fix serialization test
bb7f548 [Ankur Dave] Add failing test for EdgePartition Kryo serialization
67dac22 [Joseph E. Gonzalez] Making EdgePartition serializable.
There was a minor bug in which negative partition ids could be generated when constructing a 2D partitioning of a graph. This could lead to an inefficient 2D partition for large vertex id values.
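A hedged illustration of how the negative ids can arise (the mixing-prime arithmetic is a sketch, not the exact PartitionStrategy code):

```
val mixingPrime: Long = 1125899906842597L   // large prime used to spread vertex ids

// Overflow in the Long multiplication can make the remainder negative, and a
// negative value must never be used directly as a partition index.
def badCol(vid: Long, ceilSqrtNumParts: Int): Int =
  ((vid * mixingPrime) % ceilSqrtNumParts).toInt

// Safer shape: force the result into [0, ceilSqrtNumParts).
def goodCol(vid: Long, ceilSqrtNumParts: Int): Int = {
  val m = (vid * mixingPrime) % ceilSqrtNumParts
  (if (m < 0) m + ceilSqrtNumParts else m).toInt
}
```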
Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>
Closes#709 from jegonzal/fix_2d_partitioning and squashes the following commits:
937c562 [Joseph E. Gonzalez] fixing bug in 2d partitioning algorithm where negative partition ids could be generated.
The previous check didn't account for the fact that the default
deploy mode is "client" unless otherwise specified. Also, this
sets the more narrowly defined SPARK_DRIVER_MEMORY instead of setting
SPARK_MEM.
Author: Patrick Wendell <pwendell@gmail.com>
Closes#730 from pwendell/spark-submit and squashes the following commits:
430b98f [Patrick Wendell] Feedback from Aaron
e788edf [Patrick Wendell] Changes based on Aaron's feedback
f508146 [Patrick Wendell] SPARK-1652: Set driver memory correctly in spark-submit.
This patch adds better balancing when performing a repartition of an
RDD. Previously the elements in the RDD were hash partitioned, meaning
if the RDD was skewed certain partitions would end up being very large.
This commit adds load balancing of elements across the repartitioned
RDD splits. The load balancing is not perfect: a given output partition
can have up to N more elements than the average if there are N input
partitions. However, some randomization is used to minimize the
probability that this happens.
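A hedged sketch of the balancing idea (simplified from the shuffle-based repartition code):

```
import scala.reflect.ClassTag
import scala.util.Random
import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._   // pair-RDD implicits
import org.apache.spark.rdd.RDD

def balancedRepartition[T: ClassTag](rdd: RDD[T], numPartitions: Int): RDD[T] =
  rdd.mapPartitions { items =>
    // Each input partition deals its elements round-robin over the output
    // partitions, starting at a random offset so skewed inputs spread out.
    var position = Random.nextInt(numPartitions)
    items.map { item =>
      position += 1
      (position % numPartitions, item)
    }
  }.partitionBy(new HashPartitioner(numPartitions)).values
```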
Author: Patrick Wendell <pwendell@gmail.com>
Closes#727 from pwendell/load-balance and squashes the following commits:
f9da752 [Patrick Wendell] Response to Matei's feedback
acfa46a [Patrick Wendell] SPARK-1770: Load balance elements when repartitioning.
Author: witgo <witgo@qq.com>
Closes#728 from witgo/scala_home and squashes the following commits:
cdfd8be [witgo] Merge branch 'master' of https://github.com/apache/spark into scala_home
fac094a [witgo] remove outdated runtime Information scala home
More info at: https://github.com/sbt/sbt/issues/1010
Author: Prashant Sharma <prashant.s@imaginea.com>
Closes#525 from ScrapCodes/sbt-inc-opt and squashes the following commits:
ba8fa42 [Prashant Sharma] Enabled incremental build that comes with sbt 0.13.2
SparkSubmit ignores `--jars` for YARN client. This is a bug.
This PR also automatically adds the application jar to `spark.jar`. Previously, when running as yarn-client, you had to specify the jar additionally through `--files` (because `--jars` didn't work). Now you don't have to explicitly specify it through either.
Tested on a YARN cluster.
Author: Andrew Or <andrewor14@gmail.com>
Closes#710 from andrewor14/yarn-jars and squashes the following commits:
35d1928 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-jars
c27bf6c [Andrew Or] For yarn-cluster and python, do not add primaryResource to spark.jar
c92c5bf [Andrew Or] Minor cleanups
269f9f3 [Andrew Or] Fix format
013d840 [Andrew Or] Fix tests
1407474 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-jars
3bb75e8 [Andrew Or] Allow SparkSubmit --jars to take effect in yarn-client mode
TL;DR: there is a bit of JAR hell trouble with Netty; it can be mostly resolved, and resolving it fixes a test failure.
I hit the error described at http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-startup-time-out-td1753.html while running FlumeStreamingSuite, and have for a short while (is it just me?)
velvia notes:
"I have found a workaround. If you add akka 2.2.4 to your dependencies, then everything works, probably because akka 2.2.4 brings in newer version of Jetty."
There are at least 3 versions of Netty in play in the build:
- the new Flume 1.4.0 dependency brings in io.netty:netty:3.4.0.Final, and that is the immediate problem
- the custom version of akka 2.2.3 depends on io.netty:netty:3.6.6.
- but, Spark Core directly uses io.netty:netty-all:4.0.17.Final
The POMs try to exclude other versions of netty, but are excluding org.jboss.netty:netty, when in fact older versions of io.netty:netty (not netty-all) are also an issue.
The org.jboss.netty:netty excludes are largely unnecessary. I replaced many of them with io.netty:netty exclusions until everything agreed on io.netty:netty-all:4.0.17.Final.
But this didn't work, since Akka 2.2.3 doesn't work with Netty 4.x. Down-grading to 3.6.6.Final across the board made some Spark code not compile.
If the build *keeps* io.netty:netty:3.6.6.Final as well, everything seems to work. Part of the reason seems to be that Netty 3.x used the old `org.jboss.netty` packages. This is less than ideal, but is no worse than the current situation.
So this PR resolves the issue and improves the JAR hell, even if it leaves the existing theoretical Netty 3-vs-4 conflict:
- Remove org.jboss.netty excludes where possible, for clarity; they're not needed except with Hadoop artifacts
- Add io.netty:netty excludes where needed -- except, let akka keep its io.netty:netty
- Change a bit of test code that actually depended on Netty 3.x, to use 4.x equivalent
- Update SBT build accordingly
A better change would be to update Akka far enough such that it agrees on Netty 4.x, but I don't know if that's feasible.
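As a hedged, sbt-style illustration of the exclusion pattern (the Flume artifact is just an example; exact coordinates per the real build files):

```
// Keep everything on io.netty:netty-all by excluding the old io.netty:netty
// (and org.jboss.netty:netty where Hadoop artifacts are involved).
libraryDependencies += ("org.apache.flume" % "flume-ng-sdk" % "1.4.0")
  .exclude("io.netty", "netty")
  .exclude("org.jboss.netty", "netty")
```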
Author: Sean Owen <sowen@cloudera.com>
Closes#723 from srowen/SPARK-1789 and squashes the following commits:
43661b7 [Sean Owen] Update and add Netty excludes to prevent some JAR conflicts that cause test issues