1. Renamed StopAfter to Limit to be more consistent with naming in other relational databases.
2. Renamed TopK to TakeOrdered to be more consistent with the Spark RDD API.
3. Avoid breaking lineage in Limit.
4. Added a number of `override` modifiers to execution/basicOperators.scala.
@marmbrus @liancheng
Author: Reynold Xin <rxin@apache.org>
Author: Michael Armbrust <michael@databricks.com>
Closes #233 from rxin/limit and squashes the following commits:
13eb12a [Reynold Xin] Merge pull request #1 from marmbrus/limit
92b9727 [Michael Armbrust] More hacks to make Maps serialize with Kryo.
4fc8b4e [Reynold Xin] Merge branch 'master' of github.com:apache/spark into limit
87b7d37 [Reynold Xin] Use the proper serializer in limit.
9b79246 [Reynold Xin] Updated doc for Limit.
47d3327 [Reynold Xin] Copy tuples in Limit before shuffle.
231af3a [Reynold Xin] Limit/TakeOrdered: 1. Renamed StopAfter to Limit to be more consistent with naming in other relational databases. 2. Renamed TopK to TakeOrdered to be more consistent with the Spark RDD API. 3. Avoid breaking lineage in Limit. 4. Added a number of `override` modifiers to execution/basicOperators.scala.
JIRA issue: [SPARK-1373](https://issues.apache.org/jira/browse/SPARK-1373)
(Although tagged as WIP, this PR is structurally complete. The only things left unimplemented are 3 more compression algorithms: `BooleanBitSet`, `IntDelta` and `LongDelta`, which are trivial to add later in this or another separate PR.)
This PR contains compression support for Spark SQL in-memory columnar storage. Main interfaces include:
* `CompressionScheme`
Each `CompressionScheme` represents a concrete compression algorithm and consists of an `Encoder` for compression and a `Decoder` for decompression. Algorithms implemented include:
* `RunLengthEncoding`
* `DictionaryEncoding`
Algorithms to be implemented include:
* `BooleanBitSet`
* `IntDelta`
* `LongDelta`
* `CompressibleColumnBuilder`
A stackable `ColumnBuilder` trait used to build byte buffers for compressible columns. The best `CompressionScheme`, i.e. the one yielding the lowest compression ratio, is chosen for each column according to statistical information gathered while elements are appended to the `ColumnBuilder`. However, if no `CompressionScheme` achieves a compression ratio below 80%, the column is left uncompressed to save CPU time.
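A minimal sketch of the selection rule just described, with illustrative names (not Spark SQL's actual `Encoder` API):
```scala
// Hypothetical Encoder shape; only the estimated ratio matters here.
trait Encoder {
  def compressedSize: Int
  def uncompressedSize: Int
  def compressionRatio: Double = compressedSize.toDouble / uncompressedSize
}

// Pick the scheme with the lowest ratio; None means the column is stored
// uncompressed because no scheme beats the 80% threshold.
def chooseScheme(candidates: Seq[Encoder]): Option[Encoder] = {
  val viable = candidates.filter(_.compressionRatio < 0.8)
  if (viable.isEmpty) None else Some(viable.minBy(_.compressionRatio))
}
```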
Memory layout of the final byte buffer is shown below:
```
.--------------------------- Column type ID (4 bytes)
| .----------------------- Null count N (4 bytes)
| | .------------------- Null positions (4 x N bytes, empty if null count is zero)
| | | .------------- Compression scheme ID (4 bytes)
| | | | .--------- Compressed non-null elements
V V V V V
+---+---+-----+---+---------+
| | | ... | | ... ... |
+---+---+-----+---+---------+
\-----------/ \-----------/
header body
```
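As a concrete illustration of this layout, here is a hedged sketch (not the actual builder code) that writes such a header with `java.nio.ByteBuffer`:
```scala
import java.nio.ByteBuffer

// Writes the header fields in the order shown in the diagram above; the
// compressed non-null elements (the body) are appended afterwards.
def writeHeader(columnTypeId: Int, nullPositions: Seq[Int],
                schemeId: Int, bodySize: Int): ByteBuffer = {
  val buffer = ByteBuffer.allocate(12 + 4 * nullPositions.size + bodySize)
  buffer.putInt(columnTypeId)                  // Column type ID (4 bytes)
  buffer.putInt(nullPositions.size)            // Null count N (4 bytes)
  nullPositions.foreach(p => buffer.putInt(p)) // Null positions (4 x N bytes)
  buffer.putInt(schemeId)                      // Compression scheme ID (4 bytes)
  buffer                                       // body is written next
}
```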
* `CompressibleColumnAccessor`
A stackable `ColumnAccessor` trait used to iterate over a (possibly) compressed data column.
* `ColumnStats`
Used to collect statistical information while loading data into in-memory columnar table. Optimizations like partition pruning rely on this information.
Strictly speaking, the `ColumnStats` related code is not part of the compression support. It's included in this PR to validate the row-based API design (which is used to avoid boxing/unboxing costs wherever possible).
A major refactoring change since PR #205 is:
* Refactored all getter/setter methods for primitive types in various places into `ColumnType` classes to remove duplicated code.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes #285 from liancheng/memColumnarCompression and squashes the following commits:
ed71bbd [Cheng Lian] Addressed all PR comments by @marmbrus
d3a4fa9 [Cheng Lian] Removed Ordering[T] in ColumnStats for better performance
5034453 [Cheng Lian] Bug fix, more tests, and more refactoring
c298b76 [Cheng Lian] Test suites refactored
2780d6a [Cheng Lian] [WIP] in-memory columnar compression support
211331c [Cheng Lian] WIP: in-memory columnar compression support
85cc59b [Cheng Lian] Refactored ColumnAccessors & ColumnBuilders to remove duplicate code
This avoids a silent data corruption issue (https://spark-project.atlassian.net/browse/SPARK-1188) and has no performance impact in my measurements. It also simplifies the code. As far as I can tell, the object re-use was nothing but premature optimization.
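To make the hazard concrete, a minimal standalone illustration (not the GraphX code itself) of how a re-used mutable object silently corrupts anything that retains iterator elements:
```scala
// Each next() call mutates and returns the same array, so any caller that
// keeps references (toList, collect, ...) ends up with n aliases of the
// final value -- the silent corruption described in SPARK-1188.
class ReusingIterator(n: Int) extends Iterator[Array[Int]] {
  private val reused = new Array[Int](1)
  private var i = 0
  def hasNext: Boolean = i < n
  def next(): Array[Int] = { reused(0) = i; i += 1; reused }
}

val kept = new ReusingIterator(3).toList.map(_(0))
println(kept) // List(2, 2, 2), not the expected List(0, 1, 2)
```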
I did actual benchmarks for all the included changes, and there is no performance difference. I am not sure where to put the benchmarks. Does Spark not have a benchmark suite?
This is an example benchmark I did:
```scala
test("benchmark") {
  val builder = new EdgePartitionBuilder[Int]
  for (i <- 1 to 10000000) {
    builder.add(i.toLong, i.toLong, i)
  }
  val p = builder.toEdgePartition
  p.map(_.attr + 1).iterator.toList
}
```
It ran for 10 seconds both before and after this change.
Author: Daniel Darabos <darabos.daniel@gmail.com>
Closes #276 from darabos/spark-1188 and squashes the following commits:
574302b [Daniel Darabos] Restore "manual" copying in EdgePartition.map(Iterator). Add comment to discourage novices like myself from trying to simplify the code.
4117a64 [Daniel Darabos] Revert EdgePartitionSuite.
4955697 [Daniel Darabos] Create a copy of the Edge objects in EdgeRDD.compute(). This avoids exposing the object re-use, while still enables the more efficient behavior for internal code.
4ec77f8 [Daniel Darabos] Add comments about object re-use to the affected functions.
2da5e87 [Daniel Darabos] Restore object re-use in EdgePartition.
0182f2b [Daniel Darabos] Do not re-use objects in the EdgePartition/EdgeTriplet iterators. This avoids a silent data corruption issue (SPARK-1188) and has no performance impact in my measurements. It also simplifies the code.
c55f52f [Daniel Darabos] Tests that reproduce the problems from SPARK-1188.
`BlockId.scala` offers a way to reconstruct a BlockId from a string through regex matching. `util/JsonProtocol.scala` duplicates this functionality by explicitly matching on the BlockId type.
With this PR, the de/serialization of BlockIds will go through the first (older) code path.
(Most of the line changes in this PR involve changing `==` to `===` in `JsonProtocolSuite.scala`)
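A hedged sketch of the unified path, assuming the `BlockId` API of that era (`name` for the string form, `BlockId.apply` for regex-based reconstruction):
```scala
import org.apache.spark.storage.{BlockId, RDDBlockId}

val original: BlockId = RDDBlockId(rddId = 1, splitIndex = 2)
val serialized: String = original.name      // e.g. "rdd_1_2", suitable for JSON
val restored: BlockId = BlockId(serialized) // reconstructed via regex matching
assert(restored == original)
```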
Author: Andrew Or <andrewor14@gmail.com>
Closes #289 from andrewor14/blockid-json and squashes the following commits:
409d226 [Andrew Or] Simplify JSON de/serialization for BlockId
This data structure was misused and, as a result, was later given an incorrect name.
It seems to have gotten into this tangled state when @henrydavidge indexed into it with the stage ID instead of the job ID, and @andrewor14 later renamed the data structure to reflect that misunderstanding.
This patch renames it and removes the incorrect indexing. The incorrect indexing meant that the code added by @henrydavidge to warn when a task size is too large (added in commit 57579934f0) was not always executed; this commit fixes that.
Author: Kay Ousterhout <kayousterhout@gmail.com>
Closes #301 from kayousterhout/fixCancellation and squashes the following commits:
bd3d3a4 [Kay Ousterhout] Renamed stageIdToActiveJob to jobIdToActiveJob.
@rxin mentioned this might cause issues on Windows machines.
Author: Michael Armbrust <michael@databricks.com>
Closes #297 from marmbrus/noStars and squashes the following commits:
263122a [Michael Armbrust] Remove * from test case golden filename.
Joint work with @hirakendu, @etrain, @atalwalkar and @harsha2010.
Key features:
+ Supports binary classification and regression
+ Supports gini, entropy and variance for information gain calculation (see the impurity sketch after the optimization list below)
+ Supports both continuous and categorical features
The algorithm has gone through several development iterations over the last few months leading to a highly optimized implementation. Optimizations include:
1. Level-wise training to reduce passes over the entire dataset.
2. Bin-wise split calculation to reduce computation overhead.
3. Aggregation over partitions before combining to reduce communication overhead.
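As a small illustration of the impurity measures named above (a sketch, not MLlib's actual code), computed from per-class counts at a node:
```scala
// Gini impurity: 1 - sum of squared class probabilities.
def gini(classCounts: Seq[Double]): Double = {
  val total = classCounts.sum
  1.0 - classCounts.map { c => val p = c / total; p * p }.sum
}

// Entropy: -sum(p * log2(p)), skipping empty classes.
def entropy(classCounts: Seq[Double]): Double = {
  val total = classCounts.sum
  -classCounts.filter(_ > 0).map { c =>
    val p = c / total
    p * math.log(p) / math.log(2)
  }.sum
}

// Information gain of a split is the parent's impurity minus the
// count-weighted impurity of the left and right children.
```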
Author: Manish Amde <manish9ue@gmail.com>
Author: manishamde <manish9ue@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes #79 from manishamde/tree and squashes the following commits:
1e8c704 [Manish Amde] remove numBins field in the Strategy class
7d54b4f [manishamde] Merge pull request #4 from mengxr/dtree
f536ae9 [Xiangrui Meng] another pass on code style
e1dd86f [Manish Amde] implementing code style suggestions
62dc723 [Manish Amde] updating javadoc and converting helper methods to package private to allow unit testing
201702f [Manish Amde] making some more methods private
f963ef5 [Manish Amde] making methods private
c487e6a [manishamde] Merge pull request #1 from mengxr/dtree
24500c5 [Xiangrui Meng] minor style updates
4576b64 [Manish Amde] documentation and for to while loop conversion
ff363a7 [Manish Amde] binary search for bins and while loop for categorical feature bins
632818f [Manish Amde] removing threshold for classification predict method
2116360 [Manish Amde] removing dummy bin calculation for categorical variables
6068356 [Manish Amde] ensuring num bins is always greater than max number of categories
62c2562 [Manish Amde] fixing comment indentation
ad1fc21 [Manish Amde] incorporated mengxr's code style suggestions
d1ef4f6 [Manish Amde] more documentation
794ff4d [Manish Amde] minor improvements to docs and style
eb8fcbe [Manish Amde] minor code style updates
cd2c2b4 [Manish Amde] fixing code style based on feedback
63e786b [Manish Amde] added multiple train methods for Java compatibility
d3023b3 [Manish Amde] adding more docs for nested methods
84f85d6 [Manish Amde] code documentation
9372779 [Manish Amde] code style: max line length <= 100
dd0c0d7 [Manish Amde] minor: some docs
0dd7659 [manishamde] basic doc
5841c28 [Manish Amde] unit tests for categorical features
f067d68 [Manish Amde] minor cleanup
c0e522b [Manish Amde] updated predict and split threshold logic
b09dc98 [Manish Amde] minor refactoring
6b7de78 [Manish Amde] minor refactoring and tests
d504eb1 [Manish Amde] more tests for categorical features
dbb7ac1 [Manish Amde] categorical feature support
6df35b9 [Manish Amde] regression predict logic
53108ed [Manish Amde] fixing index for highest bin
e23c2e5 [Manish Amde] added regression support
c8f6d60 [Manish Amde] adding enum for feature type
b0e3e76 [Manish Amde] adding enum for feature type
154aa77 [Manish Amde] enums for configurations
733d6dd [Manish Amde] fixed tests
02c595c [Manish Amde] added command line parsing
98ec8d5 [Manish Amde] tree building and prediction logic
b0eb866 [Manish Amde] added logic to handle leaf nodes
80e8c66 [Manish Amde] working version of multi-level split calculation
4798aae [Manish Amde] added gain stats class
dad0afc [Manish Amde] decision stump functionality working
03f534c [Manish Amde] some more tests
0012a77 [Manish Amde] basic stump working
8bca1e2 [Manish Amde] additional code for creating intermediate RDD
92cedce [Manish Amde] basic building blocks for intermediate RDD calculation. untested.
cd53eae [Manish Amde] skeletal framework
see comments on Pull Request https://github.com/apache/spark/pull/38
(I couldn't figure out how to modify an existing pull request, so I'm hoping I can withdraw that one and replace it with this one.)
Author: Diana Carroll <dcarroll@cloudera.com>
Closes #227 from dianacarroll/spark-1134 and squashes the following commits:
ffe47f2 [Diana Carroll] [spark-1134] remove ipythonopts from ipython command
b673bf7 [Diana Carroll] Merge branch 'master' of github.com:apache/spark
0309cf9 [Diana Carroll] SPARK-1134 bug with ipython prevents non-interactive use with spark; only call ipython if no command line arguments were supplied
Just a Scala version increment
Author: Mark Hamstra <markhamstra@gmail.com>
Closes #259 from markhamstra/scala-2.10.4 and squashes the following commits:
fbec547 [Mark Hamstra] [SPARK-1342] Bumped Scala version to 2.10.4
This doesn't yet support different databases in Hive (though you can probably workaround this by calling `USE <dbname>`). However, given the time constraints for 1.0 I think its probably worth including this now and extending the functionality in the next release.
Author: Michael Armbrust <michael@databricks.com>
Closes #282 from marmbrus/cacheTables and squashes the following commits:
83785db [Michael Armbrust] Support for caching and uncaching tables in a SQLContext.
If a previously persisted RDD is re-used, its information disappears from the Storage page.
This is because the tasks associated with re-using the RDD do not report the RDD's blocks as updated (which is correct). On stage submit, however, we overwrite any existing information for that RDD with a fresh entry, whether or not information for it already exists.
Author: Andrew Or <andrewor14@gmail.com>
Closes #281 from andrewor14/ui-storage-fix and squashes the following commits:
408585a [Andrew Or] Fix storage UI bug
The previous version was 7.6.8v20121106. The only difference between Jetty 7 and Jetty 8 is that the former uses Servlet API 2.5, while the latter uses Servlet API 3.0.
Author: Andrew Or <andrewor14@gmail.com>
Closes #280 from andrewor14/jetty-upgrade and squashes the following commits:
dd57104 [Andrew Or] Merge github.com:apache/spark into jetty-upgrade
e75fa85 [Andrew Or] Upgrade Jetty to 8.1.14v20131031
Author: Sandy Ryza <sandy@cloudera.com>
Closes #279 from sryza/sandy-spark-1376 and squashes the following commits:
d8aebfa [Sandy Ryza] SPARK-1376. In the yarn-cluster submitter, rename "args" option to "arg"
This test needs to be fixed. It currently depends on Thread.sleep() having exact-timing semantics, which is not a valid assumption.
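A hedged sketch of the usual remedy for such tests: assert only a lower bound on elapsed time, since `Thread.sleep()` may oversleep but should not undersleep (names here are illustrative, not the actual test):
```scala
// Verify a rate-limited write takes at least numBytes / bytesPerSec
// seconds, with slack, instead of asserting an exact duration.
def assertRateLimited(numBytes: Long, bytesPerSec: Long)(write: => Unit): Unit = {
  val start = System.nanoTime()
  write
  val elapsedSecs = (System.nanoTime() - start) / 1e9
  val minExpected = numBytes.toDouble / bytesPerSec
  assert(elapsedSecs >= minExpected * 0.95,
    s"wrote $numBytes bytes in $elapsedSecs s; rate limit not enforced")
}
```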
Author: Patrick Wendell <pwendell@gmail.com>
Closes #277 from pwendell/rate-limited-stream and squashes the following commits:
6c0ff81 [Patrick Wendell] SPARK-1365: Fix RateLimitedOutputStream test
Before, we were materializing everything in memory. This also uses the projection interface, so it will be easier to plug in code generation (it's ported from that branch).
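A toy sketch of the streaming behavior (illustrative, not Spark SQL's operator): only the build side is materialized into a hash table, while the other relation is streamed through it lazily:
```scala
// Materialize only `build` into a multimap; `probe` is never materialized.
def hashJoin[K, L, R](build: Iterator[(K, L)],
                      probe: Iterator[(K, R)]): Iterator[(K, (L, R))] = {
  val table: Map[K, Seq[L]] = build.toSeq.groupBy(_._1).mapValues(_.map(_._2))
  probe.flatMap { case (k, r) =>
    table.getOrElse(k, Seq.empty).iterator.map(l => (k, (l, r)))
  }
}
```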
@rxin @liancheng
Author: Michael Armbrust <michael@databricks.com>
Closes #250 from marmbrus/hashJoin and squashes the following commits:
1ad873e [Michael Armbrust] Change hasNext logic back to the correct version.
8e6f2a2 [Michael Armbrust] Review comments.
1e9fb63 [Michael Armbrust] style
bc0cb84 [Michael Armbrust] Rewrite join implementation to allow streaming of one relation.
1. Better error messages when required arguments are missing.
2. Support for unit testing cases where presented arguments are invalid.
3. Bug fix: Only use environment variables when they are set (otherwise they will cause an NPE).
4. A verbose mode to aid debugging.
5. Visibility of several variables is set to private.
6. Deprecation warning for existing scripts.
Author: Patrick Wendell <pwendell@gmail.com>
Closes #271 from pwendell/spark-submit and squashes the following commits:
9146def [Patrick Wendell] SPARK-1352: Improve robustness of spark-submit script
Author: Prashant Sharma <prashant.s@imaginea.com>
Closes #261 from ScrapCodes/comment-style-check2 and squashes the following commits:
6cde61e [Prashant Sharma] comment style space before ending */ check.
Fix attributes being left unresolved when a query uses a table name as a qualifier in a SQLContext with SimpleCatalog; for details, see [SPARK-1354](https://issues.apache.org/jira/browse/SPARK-1354?jql=project%20%3D%20SPARK).
Author: jerryshao <saisai.shao@intel.com>
Closes #272 from jerryshao/qualifier-fix and squashes the following commits:
7950170 [jerryshao] Add tableName as a qualifier for SimpleCatalog
Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Prashant Sharma <scrapcodes@gmail.com>
Closes #262 from ScrapCodes/SPARK-1336/ReduceVerbosity and squashes the following commits:
87dfa54 [Prashant Sharma] Further reduction in noise and made pyspark tests to fail fast.
811170f [Prashant Sharma] Reducing the output of run-tests script.
@AndreSchumacher, please take a look.
https://spark-project.atlassian.net/browse/SPARK-1354
Author: Michael Armbrust <michael@databricks.com>
Closes #269 from marmbrus/parquetJoin and squashes the following commits:
4081e77 [Michael Armbrust] Create new instances of Parquet relation when multiple copies are in a single plan.
Author: Michael Armbrust <michael@databricks.com>
Closes #142 from marmbrus/kryoErrors and squashes the following commits:
9c72d1f [Michael Armbrust] Make the test more future proof.
78f5a42 [Michael Armbrust] Don't swallow all kryo errors, only those that indicate we are out of data.
Enrich the Spark Shell functionality to support the following options.
```
Usage: spark-shell [OPTIONS]
OPTIONS:
-h --help : Print this help information.
-c --cores : The maximum number of cores to be used by the Spark Shell.
-em --executor-memory : The memory used by each executor of the Spark Shell, the number
is followed by m for megabytes or g for gigabytes, e.g. "1g".
-dm --driver-memory : The memory used by the Spark Shell, the number is followed
by m for megabytes or g for gigabytes, e.g. "1g".
-m --master : A full string that describes the Spark Master, defaults to "local"
e.g. "spark://localhost:7077".
--log-conf : Enables logging of the supplied SparkConf as INFO at start of the
Spark Context.
e.g.
spark-shell -m spark://localhost:7077 -c 4 -dm 512m -em 2g
```
**Note**: this commit reflects the changes applied to _master_ based on [5d98cfc1].
[ticket: SPARK-1186] : Enrich the Spark Shell to support additional arguments.
https://spark-project.atlassian.net/browse/SPARK-1186
Author: bernardo.gomezpalcio@gmail.com
Author: Bernardo Gomez Palacio <bernardo.gomezpalacio@gmail.com>
Closes #116 from berngp/feature/enrich-spark-shell and squashes the following commits:
c5f455f [Bernardo Gomez Palacio] [SPARK-1186] : Enrich the Spark Shell to support additional arguments.
This PR includes:
1) Unify the unit test for expression evaluation
2) Add implementations of `RLike` and `Like`
Author: Cheng Hao <hao.cheng@intel.com>
Closes #224 from chenghao-intel/string_expression and squashes the following commits:
84f72e9 [Cheng Hao] fix bug in RLike/Like & Simplify the unit test
aeeb1d7 [Cheng Hao] Simplify the implementation/unit test of RLike/Like
319edb7 [Cheng Hao] change to spark code style
91cfd33 [Cheng Hao] add implementation for rlike/like
2c8929e [Cheng Hao] Update the unit test for expression evaluation
This is a starting version of the spark-app script for running compiled binaries against Spark. It still needs tests and some polish. The only testing I've done so far has been using it to launch jobs in yarn-standalone mode against a pseudo-distributed cluster.
This leaves out the changes required for launching Python scripts. I think it might be best to save those for another JIRA/PR (while keeping to the design so that they won't require backwards-incompatible changes).
Author: Sandy Ryza <sandy@cloudera.com>
Closes #86 from sryza/sandy-spark-1126 and squashes the following commits:
d428d85 [Sandy Ryza] Commenting, doc, and import fixes from Patrick's comments
e7315c6 [Sandy Ryza] Fix failing tests
34de899 [Sandy Ryza] Change --more-jars to --jars and fix docs
299ddca [Sandy Ryza] Fix scalastyle
a94c627 [Sandy Ryza] Add newline at end of SparkSubmit
04bc4e2 [Sandy Ryza] SPARK-1126. spark-submit script
SPARK-1345: adding missing dependency on avro for hadoop 0.23 to the new sql pom files.
Author: Thomas Graves <tgraves@apache.org>
Closes #263 from tgravescs/SPARK-1345 and squashes the following commits:
b43a2a0 [Thomas Graves] SPARK-1345 adding missing dependency on avro for hadoop 0.23 to the new sql pom files
Author: Nick Lanham <nick@afternight.org>
Closes #264 from nicklan/make-distribution-fixes and squashes the following commits:
172b981 [Nick Lanham] fix path for jar, make sed actually work on OSX
Author: Prashant Sharma <prashant.s@imaginea.com>
Closes #124 from ScrapCodes/SPARK-1096/scalastyle-comment-check and squashes the following commits:
214135a [Prashant Sharma] Review feedback.
5eba88c [Prashant Sharma] Fixed style checks for ///+ comments.
e54b2f8 [Prashant Sharma] improved message, work around.
83e7144 [Prashant Sharma] removed dependency on scalastyle in plugin, since the scalastyle sbt plugin already depends on the right version. In case we update the plugin, we will have to adjust our spark-style project to depend on the right scalastyle version.
810a1d6 [Prashant Sharma] SPARK-1096, a space after comment style checker.
ba33193 [Prashant Sharma] scala style as a project
I don't have access to an OSX machine, so if someone could test this that would be great.
Author: Nick Lanham <nick@afternight.org>
Closes #258 from nicklan/osx-sed-fix and squashes the following commits:
a6f158f [Nick Lanham] Also make mktemp work on OSX
558fd6e [Nick Lanham] Make sed do -i '' on OSX
Prevent ContextClassLoader of Actor from becoming ClassLoader of Executor.
The constructor of `org.apache.spark.executor.Executor` should not set the context class loader of the current thread, which is the backend Actor's thread.
Run the following code in the local-mode REPL.
```
scala> case class Foo(i: Int)
scala> val ret = sc.parallelize((1 to 100).map(Foo), 10).collect
```
This causes errors as follows:
```
ERROR actor.OneForOneStrategy: [L$line5.$read$$iwC$$iwC$$iwC$$iwC$Foo;
java.lang.ArrayStoreException: [L$line5.$read$$iwC$$iwC$$iwC$$iwC$Foo;
at scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:88)
at org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:870)
at org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:870)
at org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:56)
at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:859)
at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:616)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
```
This is because the class loaders used to deserialize the resulting `Foo` instances might differ from the backend Actor's, and the Actor's class loader should be the same as the Driver's.
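The standard mechanism at play, as a hedged sketch: deserialization resolves classes through a class loader, and a custom `ObjectInputStream` can make the intended loader explicit:
```scala
import java.io.{InputStream, ObjectInputStream, ObjectStreamClass}

// Resolves deserialized classes against an explicit loader (e.g. the REPL
// class loader) rather than whatever the current thread happens to carry.
class LoaderAwareObjectInputStream(in: InputStream, loader: ClassLoader)
    extends ObjectInputStream(in) {
  override def resolveClass(desc: ObjectStreamClass): Class[_] =
    Class.forName(desc.getName, false, loader)
}
```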
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes #15 from ueshin/wip/wrongcontextclassloader and squashes the following commits:
d79e8c0 [Takuya UESHIN] Change a parent class loader of ExecutorURLClassLoader.
c6c09b6 [Takuya UESHIN] Add a test to collect objects of class defined in repl.
43e0feb [Takuya UESHIN] Prevent ContextClassLoader of Actor from becoming ClassLoader of Executor.
Symmetric difference (xor) in particular is useful for computing some distance metrics (e.g. Hamming). Unit tests added.
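For example, the two new operations make Hamming distance fall out directly, shown here with `java.util.BitSet` as a stand-in for Spark's own BitSet:
```scala
import java.util.BitSet

// Hamming distance = number of positions where the two bit sets differ,
// i.e. the cardinality of their symmetric difference (xor).
def hamming(a: BitSet, b: BitSet): Int = {
  val diff = a.clone().asInstanceOf[BitSet]
  diff.xor(b)
  diff.cardinality()
}
```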
Author: Petko Nikolov <nikolov@soundcloud.com>
Closes #172 from petko-nikolov/bitset-imprv and squashes the following commits:
451f28b [Petko Nikolov] fixed style mistakes
5beba18 [Petko Nikolov] rm outer loop in andNot test
0e61035 [Petko Nikolov] conform to spark style; rm redundant asserts; more unit tests added; use arraycopy instead of loop
d53cdb9 [Petko Nikolov] rm incidentally added space
4e1df43 [Petko Nikolov] adding xor and and-not to BitSet; unit tests added
I am observing build failures when the Maven build reaches tests in the new SQL components (I'm on Java 7 / OSX 10.9). The failure is the usual complaint from Scala that it's out of PermGen space, or that the JIT is out of code cache space.
I see that various build scripts increase these both for SBT. This change simply adds these settings to scalatest's arguments. Works for me and seems a bit more consistent.
(I also snuck in cures for new build warnings from new scaladoc. Felt too trivial for a new PR, although it's separate. Just something I also saw while examining the build output.)
Author: Sean Owen <sowen@cloudera.com>
Closes #253 from srowen/SPARK-1335 and squashes the following commits:
c0f2d31 [Sean Owen] Appease scalastyle with a newline at the end of the file
a02679c [Sean Owen] Fix scaladoc errors due to missing links, which are generating build warnings, from some recent doc changes. We apparently can't generate links outside the module.
b2c6a09 [Sean Owen] Add perm gen, code cache settings to scalatest, mirroring SBT settings elsewhere, which allows tests to complete in at least one environment where they are failing. (Also removed a duplicate -Xms setting elsewhere.)
Remove the extra echo, which prevents spark-class from working. Note that I did not update the comment above it, which is also wrong, because I'm not sure what it should do.
Should Hive only be included if explicitly built with `sbt hive/assembly`, or should `sbt assembly` build it?
Author: Thomas Graves <tgraves@apache.org>
Closes #241 from tgravescs/SPARK-1330 and squashes the following commits:
b10d708 [Thomas Graves] SPARK-1330 removed extra echo from comput_classpath.sh
This PR amortizes the cost of downloading all the jars and compiling core across more test cases. In one anecdotal run this change takes the cumulative time down from ~80 minutes to ~40 minutes.
Author: Michael Armbrust <michael@databricks.com>
Closes #255 from marmbrus/travis and squashes the following commits:
506b22d [Michael Armbrust] Cut down the granularity of travis tests so we can amortize the cost of compilation.
GLM needs to check addIntercept when parsing the intercept and weights. The current implementation always uses the first weight as the intercept. Added a test for training without adding an intercept.
JIRA: https://spark-project.atlassian.net/browse/SPARK-1327
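A minimal sketch of the distinction the fix enforces (illustrative, not GLM's actual code): the first weight is the intercept only when `addIntercept` is set:
```scala
// With addIntercept, training prepends a constant 1.0 feature, so the fitted
// vector is (intercept, w1, ..., wn); otherwise every entry is a feature
// weight and the intercept is 0.
def splitWeights(fitted: Array[Double],
                 addIntercept: Boolean): (Double, Array[Double]) =
  if (addIntercept) (fitted.head, fitted.tail) else (0.0, fitted)
```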
Author: Xiangrui Meng <meng@databricks.com>
Closes #236 from mengxr/glm and squashes the following commits:
bcac1ac [Xiangrui Meng] add two tests to ensure {Lasso, Ridge}.setIntercept will throw an exceptions
a104072 [Xiangrui Meng] remove protected to be compatible with 0.9
0e57aa4 [Xiangrui Meng] update Lasso and RidgeRegression to parse the weights correctly from GLM mark createModel protected mark predictPoint protected
d7f629f [Xiangrui Meng] fix a bug in GLM when intercept is not used
This is just a slight variation on https://github.com/apache/spark/pull/234 and alternative suggestion for SPARK-1325. `scala-actors` is not necessary. `SparkBuild.scala` should be updated to reflect the direct dependency on `scala-reflect` and `scala-compiler`. And the `repl` build, which has the same dependencies, should also be consistent between Maven / SBT.
Author: Sean Owen <sowen@cloudera.com>
Author: witgo <witgo@qq.com>
Closes #240 from srowen/SPARK-1325 and squashes the following commits:
25bd7db [Sean Owen] Add necessary dependencies scala-reflect and scala-compiler to tools. Update repl dependencies, which are similar, to be consistent between Maven / SBT in this regard too.
Excluded those that are self-evident and the cases discussed on the mailing list.
Author: NirmalReddy <nirmal_reddy2000@yahoo.com>
Author: NirmalReddy <nirmal.reddy@imaginea.com>
Closes #168 from NirmalReddy/Spark-1095 and squashes the following commits:
ac54b29 [NirmalReddy] import misplaced
8c5ff3e [NirmalReddy] Changed syntax of unit returning methods
02d0778 [NirmalReddy] fixed explicit types in all the other packages
1c17773 [NirmalReddy] fixed explicit types in core package
/cc @aarondav and @andrewor14
Author: Patrick Wendell <pwendell@gmail.com>
Closes #231 from pwendell/ui-binding and squashes the following commits:
e8025f8 [Patrick Wendell] SPARK-1324: SparkUI Should Not Bind to SPARK_PUBLIC_DNS
Author: Michael Armbrust <michael@databricks.com>
Closes #243 from marmbrus/mapSer and squashes the following commits:
54045f7 [Michael Armbrust] Add a custom serializer for maps since they do not have a no-arg constructor.
Add golden answer for aforementioned test.
Also, fix golden test generation from sbt/sbt by setting the classpath correctly.
Author: Michael Armbrust <michael@databricks.com>
Closes #244 from marmbrus/partTest and squashes the following commits:
37a33c9 [Michael Armbrust] Un-ignore a test that is now passing, add golden answer for aforementioned test. Fix golden test generation from sbt/sbt.
According to discussions in comments of PR #208, this PR unifies package definition format in Spark SQL.
Some broken links in ScalaDoc and typos detected along the way are also fixed.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes #225 from liancheng/packageDefinition and squashes the following commits:
75c47b3 [Cheng Lian] Fixed file line length
4f87968 [Cheng Lian] Unified package definition format in Spark SQL
Author: Prashant Sharma <prashant.s@imaginea.com>
Closes #235 from ScrapCodes/SPARK-1322/top-rev-sort and squashes the following commits:
f316266 [Prashant Sharma] Minor change in comment.
58e58c6 [Prashant Sharma] SPARK-1322, top in pyspark should sort result in descending order.
Also updated the documentation for top and takeOrdered.
On my simple test of sorting 100 million (Int, Int) tuples using Spark, Guava's top k implementation (in Ordering) is much faster than the BoundedPriorityQueue implementation for roughly sorted input (10 - 20X faster), and still faster for purely random input (2 - 5X).
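The Guava call in question, sketched under the assumption that Guava is on the classpath; `greatestOf` keeps only the k largest elements in a single streaming pass:
```scala
import com.google.common.collect.{Ordering => GuavaOrdering}
import scala.collection.JavaConverters._

val xs = Seq[Integer](5, 1, 9, 3, 7).iterator.asJava
// Top 3 elements in descending order, without sorting the whole input.
val top3 = GuavaOrdering.natural[Integer]().greatestOf(xs, 3).asScala
println(top3) // Buffer(9, 7, 5)
```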
Author: Reynold Xin <rxin@apache.org>
Closes #229 from rxin/takeOrdered and squashes the following commits:
0d11844 [Reynold Xin] Use Guava's top k implementation rather than our BoundedPriorityQueue based implementation. Also updated the documentation for top and takeOrdered.
This is not intended to replace Jenkins immediately, and Jenkins will remain the CI of reference for merging pull requests in the near term. Long term, it is possible that Travis will give us better integration with github, so we are investigating its use.
Author: Michael Armbrust <michael@databricks.com>
Closes #230 from marmbrus/travis and squashes the following commits:
93f9a32 [Michael Armbrust] Add Apache license to .travis.yml
d7c0e78 [Michael Armbrust] Initial experimentation with Travis CI configuration
This is an update on https://github.com/apache/spark/pull/180, which changes the solution from blacklisting "Option.scala" to avoiding the Option code path while generating the call site.
Also includes a unit test to prevent this issue in the future, and some minor refactoring.
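In spirit, the new approach looks like this hedged sketch: scan the stack and report the first frame outside Spark/Scala internals, rather than special-casing file names such as Option.scala:
```scala
// Returns the first stack frame that belongs to user code, skipping Spark,
// Scala and JDK internals; the package prefixes here are illustrative.
def firstUserFrame(): Option[StackTraceElement] =
  Thread.currentThread.getStackTrace.find { frame =>
    val cls = frame.getClassName
    !cls.startsWith("org.apache.spark.") &&
      !cls.startsWith("scala.") &&
      !cls.startsWith("java.")
  }
```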
Thanks @witgo for reporting this issue and working on the initial solution!
Author: witgo <witgo@qq.com>
Author: Aaron Davidson <aaron@databricks.com>
Closes #222 from aarondav/180 and squashes the following commits:
f74aad1 [Aaron Davidson] Avoid Option while generating call site & add unit tests
d2b4980 [witgo] Modify the position of the filter
1bc22d7 [witgo] Fix Stage.name return "apply at Option.scala:120"
Move CPUS_PER_TASK to TaskSchedulerImpl as the value is a constant and use it in both Mesos and CoarseGrained scheduler backends.
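A hedged sketch of the accounting rule this enforces (names illustrative): a task launches only when the executor has at least `CPUS_PER_TASK` free cores, and launching reserves all of them:
```scala
val CPUS_PER_TASK = 2 // configured via spark.task.cpus

// Only launch if enough cores remain, then deduct CPUS_PER_TASK (not 1)
// so the free-core count stays accurate for multi-CPU tasks.
def tryLaunchTask(availableCpus: Array[Int], exec: Int): Boolean =
  if (availableCpus(exec) >= CPUS_PER_TASK) {
    availableCpus(exec) -= CPUS_PER_TASK
    true
  } else false
```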
Thanks @kayousterhout for the design discussion
Author: Shivaram Venkataraman <shivaram@eecs.berkeley.edu>
Closes #219 from shivaram/multi-cpus and squashes the following commits:
5c7d685 [Shivaram Venkataraman] Don't pass availableCpus to TaskSetManager
260e4d5 [Shivaram Venkataraman] Add a check for non-zero CPUs in TaskSetManager
73fcf6f [Shivaram Venkataraman] Add documentation for spark.task.cpus
647bc45 [Shivaram Venkataraman] Fix scheduler to account for tasks using > 1 CPUs. Move CPUS_PER_TASK to TaskSchedulerImpl as the value is a constant and use it in both Mesos and CoarseGrained scheduler backends.
(This follows from a side point on SPARK-1133, in discussion of the PR: https://github.com/apache/spark/pull/164 )
Commons IO is barely used in the project, and can easily be replaced with equivalent calls to Guava or the existing Spark `Utils.scala` class.
Removing a dependency feels good, and this one in particular can get a little problematic since Hadoop uses it too.
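One hedged before/after example of the kind of substitution involved (actual call sites may differ):
```scala
import java.io.File
import com.google.common.base.Charsets
import com.google.common.io.Files

// Before (Commons IO): FileUtils.readFileToString(file)
// After (Guava), with the charset made explicit:
val contents: String = Files.toString(new File("conf/log4j.properties"), Charsets.UTF_8)
```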
Author: Sean Owen <sowen@cloudera.com>
Closes #226 from srowen/SPARK-1316 and squashes the following commits:
21efef3 [Sean Owen] Remove use of Commons IO
Author: Michael Armbrust <michael@databricks.com>
Closes #220 from marmbrus/moreTests and squashes the following commits:
223ec35 [Michael Armbrust] Blacklist machine specific test
9c966cc [Michael Armbrust] add more hive compatibility tests to whitelist
Various Spark scripts load spark-env.sh, and it can end up being sourced more than once. This causes variables that are appended to (SPARK_CLASSPATH, SPARK_REPL_OPTS) to grow on each load, and it makes the precedence order for options specified in spark-env.sh less clear.
One use-case for the latter is that we want to set options from the command-line of spark-shell, but these options will be overridden by subsequent loading of spark-env.sh. If we were to load the spark-env.sh first and then set our command-line options, we could guarantee correct precedence order.
Note that we use SPARK_CONF_DIR if available to support the sbin/ scripts, which always set this variable from sbin/spark-config.sh. Otherwise, we default to the ../conf/ as usual.
Author: Aaron Davidson <aaron@databricks.com>
Closes #184 from aarondav/idem and squashes the following commits:
e291f91 [Aaron Davidson] Use "private" variables in load-spark-env.sh
8da8360 [Aaron Davidson] Add .sh extension to load-spark-env.sh
93a2471 [Aaron Davidson] SPARK-1286: Make usage of spark-env.sh idempotent
This removes duplicated logic, dead code and casting when planning parquet table scans and hive table scans.
Other changes:
- Fix tests now that we are doing a better job of column pruning (i.e., since pruning predicates are applied before we even start scanning tuples, columns required by these predicates do not need to be included in the output of the scan unless they are also included in the final output of this logical plan fragment).
- Add a rule to simplify trivial filters (a toy sketch follows below). This was required to keep `WHERE false` from getting pushed into table scans, since `HiveTableScan` (reasonably) refuses to apply partition pruning predicates to non-partitioned tables.
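A toy sketch of that rule over an illustrative plan ADT (not Catalyst's actual classes):
```scala
sealed trait Expr
case class Literal(value: Boolean) extends Expr
case class Predicate(name: String) extends Expr // any non-trivial condition

sealed trait Plan
case class Scan(table: String) extends Plan
case class Filter(condition: Expr, child: Plan) extends Plan
case object EmptyRelation extends Plan

// WHERE false collapses the subtree, so it never reaches a table scan;
// WHERE true is a no-op and is removed.
def simplifyFilters(plan: Plan): Plan = plan match {
  case Filter(Literal(false), _)    => EmptyRelation
  case Filter(Literal(true), child) => simplifyFilters(child)
  case Filter(cond, child)          => Filter(cond, simplifyFilters(child))
  case other                        => other
}
```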
Author: Michael Armbrust <michael@databricks.com>
Closes #213 from marmbrus/strategyCleanup and squashes the following commits:
48ce403 [Michael Armbrust] Move one more bit of parquet stuff into the core SQLContext.
834ce08 [Michael Armbrust] Address comments.
0f2c6f5 [Michael Armbrust] Unify the logic for column pruning, projection, and filtering of table scans for both Hive and Parquet relations. Fix tests now that we are doing a better job of column pruning.