Commit graph

566 commits

Author SHA1 Message Date
Andrew Or 61a5cced04 [SPARK-3797] Run external shuffle service in Yarn NM
This creates a new module `network/yarn` that depends on `network/shuffle` recently created in #3001. This PR introduces a custom Yarn auxiliary service that runs the external shuffle service. As of the changes here this shuffle service is required for using dynamic allocation with Spark.

This is still WIP mainly because it doesn't handle security yet. I have tested this on a stable Yarn cluster.

Author: Andrew Or <andrew@databricks.com>

Closes #3082 from andrewor14/yarn-shuffle-service and squashes the following commits:

ef3ddae [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service
0ee67a2 [Andrew Or] Minor wording suggestions
1c66046 [Andrew Or] Remove unused provided dependencies
0eb6233 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service
6489db5 [Andrew Or] Try catch at the right places
7b71d8f [Andrew Or] Add detailed java docs + reword a few comments
d1124e4 [Andrew Or] Add security to shuffle service (INCOMPLETE)
5f8a96f [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service
9b6e058 [Andrew Or] Address various feedback
f48b20c [Andrew Or] Fix tests again
f39daa6 [Andrew Or] Do not make network-yarn an assembly module
761f58a [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service
15a5b37 [Andrew Or] Fix build for Hadoop 1.x
baff916 [Andrew Or] Fix tests
5bf9b7e [Andrew Or] Address a few minor comments
5b419b8 [Andrew Or] Add missing license header
804e7ff [Andrew Or] Include the Yarn shuffle service jar in the distribution
cd076a4 [Andrew Or] Require external shuffle service for dynamic allocation
ea764e0 [Andrew Or] Connect to Yarn shuffle service only if it's enabled
1bf5109 [Andrew Or] Use the shuffle service port specified through hadoop config
b4b1f0c [Andrew Or] 4 tabs -> 2 tabs
43dcb96 [Andrew Or] First cut integration of shuffle service with Yarn aux service
b54a0c4 [Andrew Or] Initial skeleton for Yarn shuffle service
2014-11-05 15:42:05 -08:00
Aaron Davidson f55218aeb1 [SPARK-3796] Create external service which can serve shuffle files
This patch introduces the tooling necessary to construct an external shuffle service which is independent of Spark executors, and then use this service inside Spark. An example (just for the sake of this PR) of the service creation can be found in Worker, and the service itself is used by plugging in the StandaloneShuffleClient as Spark's ShuffleClient (setup in BlockManager).

This PR continues the work from #2753, which extracted out the transport layer of Spark's block transfer into an independent package within Spark. A new package was created which contains the Spark business logic necessary to retrieve the actual shuffle data, which is completely independent of the transport layer introduced in the previous patch. Similar to the transport layer, this package must not depend on Spark as we anticipate plugging this service as a lightweight process within, say, the YARN NodeManager, and do not wish to include Spark's dependencies (including Scala itself).

There are several outstanding tasks which must be complete before this PR can be merged:
- [x] Complete unit testing of network/shuffle package.
- [x] Performance and correctness testing on a real cluster.
- [x] Remove example service instantiation from Worker.scala.

There are even more shortcomings of this PR which should be addressed in followup patches:
- Don't use Java serializer for RPC layer! It is not cross-version compatible.
- Handle shuffle file cleanup for dead executors once the application terminates or the ContextCleaner triggers.
- Documentation of the feature in the Spark docs.
- Improve behavior if the shuffle service itself goes down (right now we don't blacklist it, and new executors cannot spawn on that machine).
- SSL and SASL integration
- Nice to have: Handle shuffle file consolidation (this would requires changes to Spark's implementation).

Author: Aaron Davidson <aaron@databricks.com>

Closes #3001 from aarondav/shuffle-service and squashes the following commits:

4d1f8c1 [Aaron Davidson] Remove changes to Worker
705748f [Aaron Davidson] Rename Standalone* to External*
fd3928b [Aaron Davidson] Do not unregister executor outputs unduly
9883918 [Aaron Davidson] Make suggested build changes
3d62679 [Aaron Davidson] Add Spark integration test
7fe51d5 [Aaron Davidson] Fix SBT integration
56caa50 [Aaron Davidson] Address comments
c8d1ac3 [Aaron Davidson] Add unit tests
2f70c0c [Aaron Davidson] Fix unit tests
5483e96 [Aaron Davidson] Fix unit tests
46a70bf [Aaron Davidson] Whoops, bracket
5ea4df6 [Aaron Davidson] [SPARK-3796] Create external service which can serve shuffle files
2014-11-01 14:37:45 -07:00
Patrick Wendell 0734d09320 HOTFIX: Clean up build in network module.
This is currently breaking the package build for some people (including me).

This patch does some general clean-up which also fixes the current issue.
- Uses consistent artifact naming
- Adds sbt support for this module
- Changes tests to use scalatest (fixes the original issue[1])

One thing to note, it turns out that scalatest when invoked in the
Maven build doesn't succesfully detect JUnit Java tests. This is
a long standing issue, I noticed it applies to all of our current
test suites as well. I've created SPARK-4159 to fix this.

[1] The original issue is that we need to allocate extra memory
for the tests, happens by default in our scalatest configuration.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #3025 from pwendell/hotfix and squashes the following commits:

faa9053 [Patrick Wendell] HOTFIX: Clean up build in network module.
2014-10-30 20:15:36 -07:00
Andrew Or 26d31d15fd Revert "SPARK-1209 [CORE] SparkHadoop{MapRed,MapReduce}Util should not use package org.apache.hadoop"
This reverts commit 68cb69daf3.
2014-10-30 17:56:10 -07:00
Sean Owen 68cb69daf3 SPARK-1209 [CORE] SparkHadoop{MapRed,MapReduce}Util should not use package org.apache.hadoop
(This is just a look at what completely moving the classes would look like. I know Patrick flagged that as maybe not OK, although, it's private?)

Author: Sean Owen <sowen@cloudera.com>

Closes #2814 from srowen/SPARK-1209 and squashes the following commits:

ead1115 [Sean Owen] Disable MIMA warnings resulting from moving the class -- this was also part of the PairRDDFunctions type hierarchy though?
2d42c1d [Sean Owen] Move SparkHadoopMapRedUtil / SparkHadoopMapReduceUtil from org.apache.hadoop to org.apache.spark
2014-10-30 15:54:53 -07:00
Andrew Or 1df05a40eb [SPARK-3822] Executor scaling mechanism for Yarn
This is part of a broader effort to enable dynamic scaling of executors ([SPARK-3174](https://issues.apache.org/jira/browse/SPARK-3174)). This is intended to work alongside SPARK-3795 (#2746), SPARK-3796 and SPARK-3797, but is functionally independently of these other issues.

The logic is built on top of PraveenSeluka's changes at #2798. This is different from the changes there in a few major ways: (1) the mechanism is implemented within the existing scheduler backend framework rather than in new `Actor` classes. This also introduces a parent abstract class `YarnSchedulerBackend` to encapsulate common logic to communicate with the Yarn `ApplicationMaster`. (2) The interface of requesting executors exposed to the `SparkContext` is the same, but the communication between the scheduler backend and the AM uses total number executors desired instead of an incremental number. This is discussed in #2746 and explained in the comments in the code.

I have tested this significantly on a stable Yarn cluster.

------------
A remaining task for this issue is to tone down the error messages emitted when an executor is removed.
Currently, `SparkContext` and its components react as if the executor has failed, resulting in many scary error messages and eventual timeouts. While it's not strictly necessary to fix this as of the first-cut implementation of this mechanism, it would be good to add logic to distinguish this case. I prefer to address this in a separate PR. I have filed a separate JIRA for this task at SPARK-4134.

Author: Andrew Or <andrew@databricks.com>
Author: Andrew Or <andrewor14@gmail.com>

Closes #2840 from andrewor14/yarn-scaling-mechanism and squashes the following commits:

485863e [Andrew Or] Minor log message changes
4920be8 [Andrew Or] Clarify that public API is only for Yarn mode for now
1c57804 [Andrew Or] Reword a few comments + other review comments
6321140 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-scaling-mechanism
02836c0 [Andrew Or] Limit scope of synchronization
4e2ed7f [Andrew Or] Fix bug: keep track of removed executors properly
73ade46 [Andrew Or] Wording changes (minor)
2a7a6da [Andrew Or] Add `sc.killExecutor` as a shorthand (minor)
665f229 [Andrew Or] Mima excludes
79aa2df [Andrew Or] Simplify the request interface by asking for a total
04f625b [Andrew Or] Fix race condition that causes over-allocation of executors
f4783f8 [Andrew Or] Change the semantics of requesting executors
005a124 [Andrew Or] Fix tests
4628b16 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-scaling-mechanism
db4a679 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-scaling-mechanism
572f5c5 [Andrew Or] Unused import (minor)
f30261c [Andrew Or] Kill multiple executors rather than one at a time
de260d9 [Andrew Or] Simplify by skipping useless null check
9c52542 [Andrew Or] Simplify by skipping the TaskSchedulerImpl
97dd1a8 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-scaling-mechanism
d987b3e [Andrew Or] Move addWebUIFilters to Yarn scheduler backend
7b76d0a [Andrew Or] Expose mechanism in SparkContext as developer API
47466cd [Andrew Or] Refactor common Yarn scheduler backend logic
c4dfaac [Andrew Or] Avoid thrashing when removing executors
53e8145 [Andrew Or] Start yarn actor early to listen for AM registration message
bbee669 [Andrew Or] Add mechanism in yarn client mode
2014-10-29 14:01:00 -07:00
Reynold Xin dff015533d [SPARK-3453] Netty-based BlockTransferService, extracted from Spark core
This PR encapsulates #2330, which is itself a continuation of #2240. The first goal of this PR is to provide an alternate, simpler implementation of the ConnectionManager which is based on Netty.

In addition to this goal, however, we want to resolve [SPARK-3796](https://issues.apache.org/jira/browse/SPARK-3796), which calls for a standalone shuffle service which can be integrated into the YARN NodeManager, Standalone Worker, or on its own. This PR makes the first step in this direction by ensuring that the actual Netty service is as small as possible and extracted from Spark core. Given this, we should be able to construct this standalone jar which can be included in other JVMs without incurring significant dependency or runtime issues. The actual work to ensure that such a standalone shuffle service would work in Spark will be left for a future PR, however.

In order to minimize dependencies and allow for the service to be long-running (possibly much longer-running than Spark, and possibly having to support multiple version of Spark simultaneously), the entire service has been ported to Java, where we have full control over the binary compatibility of the components and do not depend on the Scala runtime or version.

These issues: have been addressed by folding in #2330:

SPARK-3453: Refactor Netty module to use BlockTransferService interface
SPARK-3018: Release all buffers upon task completion/failure
SPARK-3002: Create a connection pool and reuse clients across different threads
SPARK-3017: Integration tests and unit tests for connection failures
SPARK-3049: Make sure client doesn't block when server/connection has error(s)
SPARK-3502: SO_RCVBUF and SO_SNDBUF should be bootstrap childOption, not option
SPARK-3503: Disable thread local cache in PooledByteBufAllocator

TODO before mergeable:
- [x] Implement uploadBlock()
- [x] Unit tests for RPC side of code
- [x] Performance testing (see comments [here](https://github.com/apache/spark/pull/2753#issuecomment-59475022))
- [x] Turn OFF by default (currently on for unit testing)

Author: Reynold Xin <rxin@apache.org>
Author: Aaron Davidson <aaron@databricks.com>
Author: cocoatomo <cocoatomo77@gmail.com>
Author: Patrick Wendell <pwendell@gmail.com>
Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Davies Liu <davies.liu@gmail.com>
Author: Anand Avati <avati@redhat.com>

Closes #2753 from aarondav/netty and squashes the following commits:

cadfd28 [Aaron Davidson] Turn netty off by default
d7be11b [Aaron Davidson] Turn netty on by default
4a204b8 [Aaron Davidson] Fail block fetches if client connection fails
2b0d1c0 [Aaron Davidson] 100ch
0c5bca2 [Aaron Davidson] Merge branch 'master' of https://github.com/apache/spark into netty
14e37f7 [Aaron Davidson] Address Reynold's comments
8dfcceb [Aaron Davidson] Merge branch 'master' of https://github.com/apache/spark into netty
322dfc1 [Aaron Davidson] Address Reynold's comments, including major rename
e5675a4 [Aaron Davidson] Fail outstanding RPCs as well
ccd4959 [Aaron Davidson] Don't throw exception if client immediately fails
9da0bc1 [Aaron Davidson] Add RPC unit tests
d236dfd [Aaron Davidson] Remove no-op serializer :)
7b7a26c [Aaron Davidson] Fix Nio compile issue
dd420fd [Aaron Davidson] Merge branch 'master' of https://github.com/apache/spark into netty-test
939f276 [Aaron Davidson] Attempt to make comm. bidirectional
aa58f67 [cocoatomo] [SPARK-3909][PySpark][Doc] A corrupted format in Sphinx documents and building warnings
8dc1ded [cocoatomo] [SPARK-3867][PySpark] ./python/run-tests failed when it run with Python 2.6 and unittest2 is not installed
5b5dbe6 [Prashant Sharma] [SPARK-2924] Required by scala 2.11, only one fun/ctor amongst overriden alternatives, can have default argument(s).
2c5d9dc [Patrick Wendell] HOTFIX: Fix build issue with Akka 2.3.4 upgrade.
020691e [Davies Liu] [SPARK-3886] [PySpark] use AutoBatchedSerializer by default
ae4083a [Anand Avati] [SPARK-2805] Upgrade Akka to 2.3.4
29c6dcf [Aaron Davidson] [SPARK-3453] Netty-based BlockTransferService, extracted from Spark core
f7e7568 [Reynold Xin] Fixed spark.shuffle.io.receiveBuffer setting.
5d98ce3 [Reynold Xin] Flip buffer.
f6c220d [Reynold Xin] Merge with latest master.
407e59a [Reynold Xin] Fix style violation.
a0518c7 [Reynold Xin] Implemented block uploads.
4b18db2 [Reynold Xin] Copy the buffer in fetchBlockSync.
bec4ea2 [Reynold Xin] Removed OIO and added num threads settings.
1bdd7ee [Reynold Xin] Fixed tests.
d68f328 [Reynold Xin] Logging close() in case close() fails.
f63fb4c [Reynold Xin] Add more debug message.
6afc435 [Reynold Xin] Added logging.
c066309 [Reynold Xin] Implement java.io.Closeable interface.
519d64d [Reynold Xin] Mark private package visibility and MimaExcludes.
f0a16e9 [Reynold Xin] Fixed test hanging.
14323a5 [Reynold Xin] Removed BlockManager.getLocalShuffleFromDisk.
b2f3281 [Reynold Xin] Added connection pooling.
d23ed7b [Reynold Xin] Incorporated feedback from Norman: - use same pool for boss and worker - remove ioratio - disable caching of byte buf allocator - childoption sendbuf/receivebuf - fire exception through pipeline
9e0cb87 [Reynold Xin] Fixed BlockClientHandlerSuite
5cd33d7 [Reynold Xin] Fixed style violation.
cb589ec [Reynold Xin] Added more test cases covering cleanup when fault happens in ShuffleBlockFetcherIteratorSuite
1be4e8e [Reynold Xin] Shorten NioManagedBuffer and NettyManagedBuffer class names.
108c9ed [Reynold Xin] Forgot to add TestSerializer to the commit list.
b5c8d1f [Reynold Xin] Fixed ShuffleBlockFetcherIteratorSuite.
064747b [Reynold Xin] Reference count buffers and clean them up properly.
2b44cf1 [Reynold Xin] Added more documentation.
1760d32 [Reynold Xin] Use Epoll.isAvailable in BlockServer as well.
165eab1 [Reynold Xin] [SPARK-3453] Refactor Netty module to use BlockTransferService.
2014-10-29 11:27:07 -07:00
Xiangrui Meng 84e5da87e3 [SPARK-4084] Reuse sort key in Sorter
Sorter uses generic-typed key for sorting. When data is large, it creates lots of key objects, which is not efficient. We should reuse the key in Sorter for memory efficiency. This change is part of the petabyte sort implementation from rxin .

The `Sorter` class was written in Java and marked package private. So it is only available to `org.apache.spark.util.collection`. I renamed it to `TimSort` and add a simple wrapper of it, still called `Sorter`, in Scala, which is `private[spark]`.

The benchmark code is updated, which now resets the array before each run. Here is the result on sorting primitive Int arrays of size 25 million using Sorter:

~~~
[info] - Sorter benchmark for key-value pairs !!! IGNORED !!!
Java Arrays.sort() on non-primitive int array: Took 13237 ms
Java Arrays.sort() on non-primitive int array: Took 13320 ms
Java Arrays.sort() on non-primitive int array: Took 15718 ms
Java Arrays.sort() on non-primitive int array: Took 13283 ms
Java Arrays.sort() on non-primitive int array: Took 13267 ms
Java Arrays.sort() on non-primitive int array: Took 15122 ms
Java Arrays.sort() on non-primitive int array: Took 15495 ms
Java Arrays.sort() on non-primitive int array: Took 14877 ms
Java Arrays.sort() on non-primitive int array: Took 16429 ms
Java Arrays.sort() on non-primitive int array: Took 14250 ms
Java Arrays.sort() on non-primitive int array: (13878 ms first try, 14499 ms average)
Java Arrays.sort() on primitive int array: Took 2683 ms
Java Arrays.sort() on primitive int array: Took 2683 ms
Java Arrays.sort() on primitive int array: Took 2701 ms
Java Arrays.sort() on primitive int array: Took 2746 ms
Java Arrays.sort() on primitive int array: Took 2685 ms
Java Arrays.sort() on primitive int array: Took 2735 ms
Java Arrays.sort() on primitive int array: Took 2669 ms
Java Arrays.sort() on primitive int array: Took 2693 ms
Java Arrays.sort() on primitive int array: Took 2680 ms
Java Arrays.sort() on primitive int array: Took 2642 ms
Java Arrays.sort() on primitive int array: (2948 ms first try, 2691 ms average)
Sorter without key reuse on primitive int array: Took 10732 ms
Sorter without key reuse on primitive int array: Took 12482 ms
Sorter without key reuse on primitive int array: Took 10718 ms
Sorter without key reuse on primitive int array: Took 12650 ms
Sorter without key reuse on primitive int array: Took 10747 ms
Sorter without key reuse on primitive int array: Took 10783 ms
Sorter without key reuse on primitive int array: Took 12721 ms
Sorter without key reuse on primitive int array: Took 10604 ms
Sorter without key reuse on primitive int array: Took 10622 ms
Sorter without key reuse on primitive int array: Took 11843 ms
Sorter without key reuse on primitive int array: (11089 ms first try, 11390 ms average)
Sorter with key reuse on primitive int array: Took 5141 ms
Sorter with key reuse on primitive int array: Took 5298 ms
Sorter with key reuse on primitive int array: Took 5066 ms
Sorter with key reuse on primitive int array: Took 5164 ms
Sorter with key reuse on primitive int array: Took 5203 ms
Sorter with key reuse on primitive int array: Took 5274 ms
Sorter with key reuse on primitive int array: Took 5186 ms
Sorter with key reuse on primitive int array: Took 5159 ms
Sorter with key reuse on primitive int array: Took 5164 ms
Sorter with key reuse on primitive int array: Took 5078 ms
Sorter with key reuse on primitive int array: (5311 ms first try, 5173 ms average)
~~~

So with key reuse, it is faster and less likely to trigger GC.

Author: Xiangrui Meng <meng@databricks.com>
Author: Reynold Xin <rxin@apache.org>

Closes #2937 from mengxr/SPARK-4084 and squashes the following commits:

d73c3d0 [Xiangrui Meng] address comments
0b7b682 [Xiangrui Meng] fix mima
a72f53c [Xiangrui Meng] update timeIt
38ba50c [Xiangrui Meng] update timeIt
720f731 [Xiangrui Meng] add doc about JIT specialization
78f2879 [Xiangrui Meng] update tests
7de2efd [Xiangrui Meng] update the Sorter benchmark code to be correct
8626356 [Xiangrui Meng] add prepare to timeIt and update testsin SorterSuite
5f0d530 [Xiangrui Meng] update method modifiers of SortDataFormat
6ffbe66 [Xiangrui Meng] rename Sorter to TimSort and add a Scala wrapper that is private[spark]
b00db4d [Xiangrui Meng] doc and tests
cf94e8a [Xiangrui Meng] renaming
464ddce [Reynold Xin] cherry-pick rxin's commit
2014-10-28 15:14:41 -07:00
GuoQiang Li 89e8a5d8ba [SPARK-3997][Build]scalastyle should output the error location
Author: GuoQiang Li <witgo@qq.com>

Closes #2846 from witgo/SPARK-3997 and squashes the following commits:

d6a57f8 [GuoQiang Li] scalastyle should output the error location
2014-10-26 16:24:50 -07:00
Michael Armbrust 3a845d3c04 [SQL] Update Hive test harness for Hive 12 and 13
As part of the upgrade I also copy the newest version of the query tests, and whitelist a bunch of new ones that are now passing.

Author: Michael Armbrust <michael@databricks.com>

Closes #2936 from marmbrus/fix13tests and squashes the following commits:

d9cbdab [Michael Armbrust] Remove user specific tests
65801cd [Michael Armbrust] style and rat
8f6b09a [Michael Armbrust] Update test harness to work with both Hive 12 and 13.
f044843 [Michael Armbrust] Update Hive query tests and golden files to 0.13
2014-10-24 18:36:35 -07:00
Holden Karau 293672c499 specify unidocGenjavadocVersion of 0.8
Fixes an issue with being too strict generating javadoc causing errors.

Author: Holden Karau <holden@pigscanfly.ca>

Closes #2893 from holdenk/SPARK-3359-sbtunidoc-java8 and squashes the following commits:

9379a70 [Holden Karau] specify unidocGenjavadocVersion of 0.8
2014-10-23 13:46:55 -07:00
Prashant Sharma d6a3025392 [BUILD] Fixed resolver for scalastyle plugin and upgrade sbt version.
Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #2877 from ScrapCodes/scalastyle-fix and squashes the following commits:

a17b9fe [Prashant Sharma] [BUILD] Fixed resolver for scalastyle plugin.
2014-10-22 22:13:11 -07:00
Josh Rosen d1966f3a8b [SPARK-3902] [SPARK-3590] Stabilize AsynRDDActions and add Java API
This PR adds a Java API for AsyncRDDActions and promotes the API from `Experimental` to stable.

Author: Josh Rosen <joshrosen@apache.org>
Author: Josh Rosen <joshrosen@databricks.com>

Closes #2760 from JoshRosen/async-rdd-actions-in-java and squashes the following commits:

0d45fbc [Josh Rosen] Whitespace fix.
ad3ae53 [Josh Rosen] Merge remote-tracking branch 'origin/master' into async-rdd-actions-in-java
c0153a5 [Josh Rosen] Remove unused variable.
e8e2867 [Josh Rosen] Updates based on Marcelo's review feedback
7a1417f [Josh Rosen] Removed unnecessary java.util import.
6f8f6ac [Josh Rosen] Fix import ordering.
ff28e49 [Josh Rosen] Add MiMa excludes and fix a scalastyle error.
346e46e [Josh Rosen] [SPARK-3902] Stabilize AsyncRDDActions; add Java API.
2014-10-19 20:02:31 -07:00
Prashant Sharma 2fe0ba9561 SPARK-3874: Provide stable TaskContext API
This is a small number of clean-up changes on top of #2782. Closes #2782.

Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Patrick Wendell <pwendell@gmail.com>

Closes #2803 from pwendell/pr-2782 and squashes the following commits:

56d5b7a [Patrick Wendell] Minor clean-up
44089ec [Patrick Wendell] Clean-up the TaskContext API.
ed551ce [Prashant Sharma] Fixed a typo
df261d0 [Prashant Sharma] Josh's suggestion
facf3b1 [Prashant Sharma] Fixed the mima issue.
7ecc2fe [Prashant Sharma] CR, Moved implementations to TaskContextImpl
bbd9e05 [Prashant Sharma] adding missed out files to git.
ef633f5 [Prashant Sharma] SPARK-3874, Provide stable TaskContext API
2014-10-16 21:38:45 -04:00
prudhvi 044583a241 [Core] Upgrading ScalaStyle version to 0.5 and removing SparkSpaceAfterCommentStartChecker.
Author: prudhvi <prudhvi953@gmail.com>

Closes #2799 from prudhvije/ScalaStyle/space-after-comment-start and squashes the following commits:

fc263a1 [prudhvi] [Core] Using scalastyle to check the space after comment start
2014-10-16 02:05:44 -04:00
Colin Patrick Mccabe 6e27cb630d SPARK-1767: Prefer HDFS-cached replicas when scheduling data-local tasks
This change reorders the replicas returned by
HadoopRDD#getPreferredLocations so that replicas cached by HDFS are at
the start of the list.  This requires Hadoop 2.5 or higher; previous
versions of Hadoop do not expose the information needed to determine
whether a replica is cached.

Author: Colin Patrick Mccabe <cmccabe@cloudera.com>

Closes #1486 from cmccabe/SPARK-1767 and squashes the following commits:

338d4f8 [Colin Patrick Mccabe] SPARK-1767: Prefer HDFS-cached replicas when scheduling data-local tasks
2014-10-02 00:29:31 -07:00
Reynold Xin 6b79bfb425 [SPARK-3613] Record only average block size in MapStatus for large stages
This changes the way we send MapStatus from executors back to driver for large stages (>2000 tasks). For large stages, we no longer send one byte per block. Instead, we just send the average block size.

This makes large jobs (tens of thousands of tasks) much more reliable since the driver no longer sends huge amount of data.

Author: Reynold Xin <rxin@apache.org>

Closes #2470 from rxin/mapstatus and squashes the following commits:

822ff54 [Reynold Xin] Code review feedback.
3b86f56 [Reynold Xin] Added MimaExclude.
f89d182 [Reynold Xin] Fixed a bug in MapStatus
6a0401c [Reynold Xin] [SPARK-3613] Record only average block size in MapStatus for large stages.
2014-09-29 22:56:22 -07:00
Reza Zadeh 587a0cd7ed [MLlib] [SPARK-2885] DIMSUM: All-pairs similarity
# All-pairs similarity via DIMSUM
Compute all pairs of similar vectors using brute force approach, and also DIMSUM sampling approach.

Laying down some notation: we are looking for all pairs of similar columns in an m x n RowMatrix whose entries are denoted a_ij, with the i’th row denoted r_i and the j’th column denoted c_j. There is an oversampling parameter labeled ɣ that should be set to 4 log(n)/s to get provably correct results (with high probability), where s is the similarity threshold.

The algorithm is stated with a Map and Reduce, with proofs of correctness and efficiency in published papers [1] [2]. The reducer is simply the summation reducer. The mapper is more interesting, and is also the heart of the scheme. As an exercise, you should try to see why in expectation, the map-reduce below outputs cosine similarities.

![dimsumv2](https://cloud.githubusercontent.com/assets/3220351/3807272/d1d9514e-1c62-11e4-9f12-3cfdb1d78b3a.png)

[1] Bosagh-Zadeh, Reza and Carlsson, Gunnar (2013), Dimension Independent Matrix Square using MapReduce, arXiv:1304.1467 http://arxiv.org/abs/1304.1467

[2] Bosagh-Zadeh, Reza and Goel, Ashish (2012), Dimension Independent Similarity Computation, arXiv:1206.2082 http://arxiv.org/abs/1206.2082

# Testing

Tests for all invocations included.

Added L1 and L2 norm computation to MultivariateStatisticalSummary since it was needed. Added tests for both of them.

Author: Reza Zadeh <rizlar@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #1778 from rezazadeh/dimsumv2 and squashes the following commits:

404c64c [Reza Zadeh] Merge remote-tracking branch 'upstream/master' into dimsumv2
4eb71c6 [Reza Zadeh] Add excludes for normL1 and normL2
ee8bd65 [Reza Zadeh] Merge remote-tracking branch 'upstream/master' into dimsumv2
976ddd4 [Reza Zadeh] Broadcast colMags. Avoid div by zero.
3467cff [Reza Zadeh] Merge remote-tracking branch 'upstream/master' into dimsumv2
aea0247 [Reza Zadeh] Allow large thresholds to promote sparsity
9fe17c0 [Xiangrui Meng] organize imports
2196ba5 [Xiangrui Meng] Merge branch 'rezazadeh-dimsumv2' into dimsumv2
254ca08 [Reza Zadeh] Merge remote-tracking branch 'upstream/master' into dimsumv2
f2947e4 [Xiangrui Meng] some optimization
3c4cf41 [Xiangrui Meng] Merge branch 'master' into rezazadeh-dimsumv2
0e4eda4 [Reza Zadeh] Use partition index for RNG
251bb9c [Reza Zadeh] Documentation
25e9d0d [Reza Zadeh] Line length for style
fb296f6 [Reza Zadeh] renamed to normL1 and normL2
3764983 [Reza Zadeh] Documentation
e9c6791 [Reza Zadeh] New interface and documentation
613f261 [Reza Zadeh] Column magnitude summary
75a0b51 [Reza Zadeh] Use Ints instead of Longs in the shuffle
0f12ade [Reza Zadeh] Style changes
eb1dc20 [Reza Zadeh] Use Double.PositiveInfinity instead of Double.Max
f56a882 [Reza Zadeh] Remove changes to MultivariateOnlineSummarizer
dbc55ba [Reza Zadeh] Make colMagnitudes a method in RowMatrix
41e8ece [Reza Zadeh] style changes
139c8e1 [Reza Zadeh] Syntax changes
029aa9c [Reza Zadeh] javadoc and new test
75edb25 [Reza Zadeh] All tests passing!
05e59b8 [Reza Zadeh] Add test
502ce52 [Reza Zadeh] new interface
654c4fb [Reza Zadeh] default methods
3726ca9 [Reza Zadeh] Remove MatrixAlgebra
6bebabb [Reza Zadeh] remove changes to MatrixSuite
5b8cd7d [Reza Zadeh] Initial files
2014-09-29 11:15:09 -07:00
William Benton 6918012d0f SPARK-3699: SQL and Hive console tasks now clean up appropriately
The sbt tasks sql/console and hive/console will now `stop()`
the `SparkContext` upon exit.  Previously, they left an ugly stack
trace when quitting.

Author: William Benton <willb@redhat.com>

Closes #2547 from willb/consoleCleanup and squashes the following commits:

d5e431f [William Benton] SQL and Hive console tasks now clean up.
2014-09-28 01:01:27 -07:00
Burak e76ef5cb8e [SPARK-3418] Sparse Matrix support (CCS) and additional native BLAS operations added
Local `SparseMatrix` support added in Compressed Column Storage (CCS) format in addition to Level-2 and Level-3 BLAS operations such as dgemv and dgemm respectively.

BLAS doesn't support  sparse matrix operations, therefore support for `SparseMatrix`-`DenseMatrix` multiplication and `SparseMatrix`-`DenseVector` implementations have been added. I will post performance comparisons in the comments momentarily.

Author: Burak <brkyvz@gmail.com>

Closes #2294 from brkyvz/SPARK-3418 and squashes the following commits:

88814ed [Burak] Hopefully fixed MiMa this time
47e49d5 [Burak] really fixed MiMa issue
f0bae57 [Burak] [SPARK-3418] Fixed MiMa compatibility issues (excluded from check)
4b7dbec [Burak] 9/17 comments addressed
7af2f83 [Burak] sealed traits Vector and Matrix
d3a8a16 [Burak] [SPARK-3418] Squashed missing alpha bug.
421045f [Burak] [SPARK-3418] New code review comments addressed
f35a161 [Burak] [SPARK-3418] Code review comments addressed and multiplication further optimized
2508577 [Burak] [SPARK-3418] Fixed one more style issue
d16e8a0 [Burak] [SPARK-3418] Fixed style issues and added documentation for methods
204a3f7 [Burak] [SPARK-3418] Fixed failing Matrix unit test
6025297 [Burak] [SPARK-3418] Fixed Scala-style errors
dc7be71 [Burak] [SPARK-3418][MLlib] Matrix unit tests expanded with indexing and updating
d2d5851 [Burak] [SPARK-3418][MLlib] Sparse Matrix support and additional native BLAS operations added
2014-09-18 22:18:51 -07:00
WangTaoTheTonic 3f169bfe3c [SPARK-3565]Fix configuration item not consistent with document
https://issues.apache.org/jira/browse/SPARK-3565

"spark.ports.maxRetries" should be "spark.port.maxRetries". Make the configuration keys in document and code consistent.

Author: WangTaoTheTonic <barneystinson@aliyun.com>

Closes #2427 from WangTaoTheTonic/fixPortRetries and squashes the following commits:

c178813 [WangTaoTheTonic] Use blank lines trigger Jenkins
646f3fe [WangTaoTheTonic] also in SparkBuild.scala
3700dba [WangTaoTheTonic] Fix configuration item not consistent with document
2014-09-17 21:59:23 -07:00
Prashant Sharma 7b8008f5a4 [SPARK-2182] Scalastyle rule blocking non ascii characters.
...erators.

Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #2358 from ScrapCodes/scalastyle-unicode and squashes the following commits:

12a20f2 [Prashant Sharma] [SPARK-2182] Scalastyle rule blocking (non keyboard typeable) unicode operators.
2014-09-16 09:21:03 -07:00
Prashant Sharma ecf0c02935 [SPARK-3433][BUILD] Fix for Mima false-positives with @DeveloperAPI and @Experimental annotations.
Actually false positive reported was due to mima generator not picking up the new jars in presence of old jars(theoretically this should not have happened.). So as a workaround, ran them both separately and just append them together.

Author: Prashant Sharma <prashant@apache.org>
Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #2285 from ScrapCodes/mima-fix and squashes the following commits:

093c76f [Prashant Sharma] Update mima
59012a8 [Prashant Sharma] Update mima
35b6c71 [Prashant Sharma] SPARK-3433 Fix for Mima false-positives with @DeveloperAPI and @Experimental annotations.
2014-09-15 21:14:00 -07:00
Andrew Or 6324eb7b5b [Spark-3490] Disable SparkUI for tests
We currently open many ephemeral ports during the tests, and as a result we occasionally can't bind to new ones. This has caused the `DriverSuite` and the `SparkSubmitSuite` to fail intermittently.

By disabling the `SparkUI` when it's not needed, we already cut down on the number of ports opened significantly, on the order of the number of `SparkContexts` ever created. We must keep it enabled for a few tests for the UI itself, however.

Author: Andrew Or <andrewor14@gmail.com>

Closes #2363 from andrewor14/disable-ui-for-tests and squashes the following commits:

332a7d5 [Andrew Or] No need to set spark.ui.port to 0 anymore
30c93a2 [Andrew Or] Simplify streaming UISuite
a431b84 [Andrew Or] Fix streaming test failures
8f5ae53 [Andrew Or] Fix no new line at the end
29c9b5b [Andrew Or] Disable SparkUI for tests
2014-09-11 17:18:46 -07:00
Prashant Sharma 0d1cc4ae42 [HOTFIX] A left over version change. It should make mima happy.
Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #2317 from ScrapCodes/hotfix and squashes the following commits:

b6472d4 [Prashant Sharma] [HOTFIX] for hotfixes, a left over version change.
2014-09-08 14:32:53 +05:30
Josh Rosen 4ba2673569 [HOTFIX] Fix broken Mima tests on the master branch
By merging #2268, which bumped the Spark version to 1.2.0-SNAPSHOT, I inadvertently broke the Mima binary compatibility tests.  The issue is that we were comparing 1.2.0-SNAPSHOT against Spark 1.0.0 without using any Mima excludes.  The right long-term fix for this is probably to publish nightly snapshots on Maven central and change the master branch to test binary compatibility against the current release candidate branch's snapshots until that release is finalized.

As a short-term fix until 1.1.0 is published on Maven central, I've configured the build to test the master branch for binary compatibility against the 1.1.0-RC4 jars.  I'll loop back and remove the Apache staging repo as soon as 1.1.0 final is available.

Author: Josh Rosen <joshrosen@apache.org>

Closes #2315 from JoshRosen/mima-fix and squashes the following commits:

776bc2c [Josh Rosen] Add two excludes to workaround Mima annotation issues.
ec90e21 [Josh Rosen] Add deploy and graphx to 1.2 MiMa excludes.
57569be [Josh Rosen] Fix MiMa tests in master branch; test against 1.1.0 RC.
2014-09-07 20:39:53 -07:00
Marcelo Vanzin f2b5b619a9 [SPARK-3388] Expose aplication ID in ApplicationStart event, use it in history server.
This change exposes the application ID generated by the Spark Master, Mesos or Yarn
via the SparkListenerApplicationStart event. It then uses that information to expose the
application via its ID in the history server, instead of using the internal directory name
generated by the event logger as an application id. This allows someone who knows
the application ID to easily figure out the URL for the application's entry in the HS, aside
from looking better.

In Yarn mode, this is used to generate a direct link from the RM application list to the
Spark history server entry (thus providing a fix for SPARK-2150).

Note this sort of assumes that the different managers will generate app ids that are
sufficiently different from each other that clashes will not occur.

Author: Marcelo Vanzin <vanzin@cloudera.com>

This patch had conflicts when merged, resolved by
Committer: Andrew Or <andrewor14@gmail.com>

Closes #1218 from vanzin/yarn-hs-link-2 and squashes the following commits:

2d19f3c [Marcelo Vanzin] Review feedback.
6706d3a [Marcelo Vanzin] Implement applicationId() in base classes.
56fe42e [Marcelo Vanzin] Fix cluster mode history address, plus a cleanup.
44112a8 [Marcelo Vanzin] Merge branch 'master' into yarn-hs-link-2
8278316 [Marcelo Vanzin] Merge branch 'master' into yarn-hs-link-2
a86bbcf [Marcelo Vanzin] Merge branch 'master' into yarn-hs-link-2
a0056e6 [Marcelo Vanzin] Unbreak test.
4b10cfd [Marcelo Vanzin] Merge branch 'master' into yarn-hs-link-2
cb0cab2 [Marcelo Vanzin] Merge branch 'master' into yarn-hs-link-2
25f2826 [Marcelo Vanzin] Add MIMA excludes.
f0ba90f [Marcelo Vanzin] Use BufferedIterator.
c90a08d [Marcelo Vanzin] Remove unused code.
3f8ec66 [Marcelo Vanzin] Review feedback.
21aa71b [Marcelo Vanzin] Fix JSON test.
b022bae [Marcelo Vanzin] Undo SparkContext cleanup.
c6d7478 [Marcelo Vanzin] Merge branch 'master' into yarn-hs-link-2
4e3483f [Marcelo Vanzin] Fix test.
57517b8 [Marcelo Vanzin] Review feedback. Mostly, more consistent use of Scala's Option.
311e49d [Marcelo Vanzin] Merge branch 'master' into yarn-hs-link-2
d35d86f [Marcelo Vanzin] Fix yarn backend after rebase.
36dc362 [Marcelo Vanzin] Don't use Iterator::takeWhile().
0afd696 [Marcelo Vanzin] Wait until master responds before returning from start().
abc4697 [Marcelo Vanzin] Make FsHistoryProvider keep a map of applications by id.
26b266e [Marcelo Vanzin] Use Mesos framework ID as Spark application ID.
b3f3664 [Marcelo Vanzin] [yarn] Make the RM link point to the app direcly in the HS.
2fb7de4 [Marcelo Vanzin] Expose the application ID in the ApplicationStart event.
ed10348 [Marcelo Vanzin] Expose application id to spark context.
2014-09-03 14:57:38 -07:00
Tathagata Das e9bb12bea9 [SPARK-1981][Streaming][Hotfix] Fixed docs related to kinesis
- Include kinesis in the unidocs
- Hide non-public classes from docs

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #2239 from tdas/kinesis-doc-fix and squashes the following commits:

156e20c [Tathagata Das] More fixes, based on PR comments.
e9a6c01 [Tathagata Das] Fixed docs related to kinesis
2014-09-02 19:02:48 -07:00
lirui fbf2678c16 SPARK-2636: Expose job ID in JobWaiter API
This PR adds the async actions to the Java API. User can call these async actions to get the FutureAction and use JobWaiter (for SimpleFutureAction) to retrieve job Id.

Author: lirui <rui.li@intel.com>

Closes #2176 from lirui-intel/SPARK-2636 and squashes the following commits:

ccaafb7 [lirui] SPARK-2636: fix java doc
5536d55 [lirui] SPARK-2636: mark the async API as experimental
e2e01d5 [lirui] SPARK-2636: add mima exclude
0ca320d [lirui] SPARK-2636: fix method name & javadoc
3fa39f7 [lirui] SPARK-2636: refine the patch
af4f5d9 [lirui] SPARK-2636: remove unused imports
843276c [lirui] SPARK-2636: only keep foreachAsync in the java API
fbf5744 [lirui] SPARK-2636: add more async actions for java api
1b25abc [lirui] SPARK-2636: expose some fields in JobWaiter
d09f732 [lirui] SPARK-2636: fix build
eb1ee79 [lirui] SPARK-2636: change some parameters in SimpleFutureAction to member field
6e2b87b [lirui] SPARK-2636: add java API for async actions
2014-09-01 23:28:19 -07:00
Raymond Liu acea92806c [SPARK-2288] Hide ShuffleBlockManager behind ShuffleManager
By Hiding the shuffleblockmanager behind Shufflemanager, we decouple the shuffle data's block mapping management work from Diskblockmananger. This give a more clear interface and more easy for other shuffle manager to implement their own block management logic. the jira ticket have more details.

Author: Raymond Liu <raymond.liu@intel.com>

Closes #1241 from colorant/shuffle and squashes the following commits:

0e01ae3 [Raymond Liu] Move ShuffleBlockmanager behind shuffleManager
2014-08-29 23:05:18 -07:00
Marcelo Vanzin c9f743957f [SPARK-2848] Shade Guava in uber-jars.
For further discussion, please check the JIRA entry.

This change moves Guava classes to a different package so that they don't conflict with the user-provided Guava (or the Hadoop-provided one). Since one class (Optional) was exposed through Spark's public API, that class was forked from Guava at the current dependency version (14.0.1) so that it can be kept going forward (until the API is cleaned).

Note this change has a few implications:
- *all* classes in the final jars will reference the relocated classes. If Hadoop classes are included (i.e. "-Phadoop-provided" is not activated), those will also reference the Guava 14 classes (instead of the Guava 11 classes from the Hadoop classpath).
- if the Guava version in Spark is ever changed, the new Guava will still reference the forked Optional class; this may or may not be a problem, but in the long term it's better to think about removing Optional from the public API.

For the end user, there are two visible implications:

- Guava is not provided as a transitive dependency anymore (since it's "provided" in Spark)
- At runtime, unless they provide their own, they'll either have no Guava or Hadoop's version of Guava (11), depending on how they set up their classpath.

Note that this patch does not change the sbt deliverables; those will still contain guava in its original package, and provide guava as a compile-time dependency. This assumes that maven is the canonical build, and sbt-built artifacts are not (officially) published.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #1813 from vanzin/SPARK-2848 and squashes the following commits:

9bdffb0 [Marcelo Vanzin] Undo sbt build changes.
819b445 [Marcelo Vanzin] Review feedback.
05e0a3d [Marcelo Vanzin] Merge branch 'master' into SPARK-2848
fef4370 [Marcelo Vanzin] Unfork Optional.java.
d3ea8e1 [Marcelo Vanzin] Exclude asm classes from final jar.
637189b [Marcelo Vanzin] Add hacky filter to prefer Spark's copy of Optional.
2fec990 [Marcelo Vanzin] Shade Guava in the sbt build.
616998e [Marcelo Vanzin] Shade Guava in the maven build, fork Guava's Optional.java.
2014-08-20 16:23:10 -07:00
Michael Armbrust 3abd0c1cda [SPARK-2406][SQL] Initial support for using ParquetTableScan to read HiveMetaStore tables.
This PR adds an experimental flag `spark.sql.hive.convertMetastoreParquet` that when true causes the planner to detects tables that use Hive's Parquet SerDe and instead plans them using Spark SQL's native `ParquetTableScan`.

Author: Michael Armbrust <michael@databricks.com>
Author: Yin Huai <huai@cse.ohio-state.edu>

Closes #1819 from marmbrus/parquetMetastore and squashes the following commits:

1620079 [Michael Armbrust] Revert "remove hive parquet bundle"
cc30430 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into parquetMetastore
4f3d54f [Michael Armbrust] fix style
41ebc5f [Michael Armbrust] remove hive parquet bundle
a43e0da [Michael Armbrust] Merge remote-tracking branch 'origin/master' into parquetMetastore
4c4dc19 [Michael Armbrust] Fix bug with tree splicing.
ebb267e [Michael Armbrust] include parquet hive to tests pass (Remove this later).
c0d9b72 [Michael Armbrust] Avoid creating a HadoopRDD per partition.  Add dirty hacks to retrieve partition values from the InputSplit.
8cdc93c [Michael Armbrust] Merge pull request #8 from yhuai/parquetMetastore
a0baec7 [Yin Huai] Partitioning columns can be resolved.
1161338 [Michael Armbrust] Add a test to make sure conversion is actually happening
212d5cd [Michael Armbrust] Initial support for using ParquetTableScan to read HiveMetaStore tables.
2014-08-18 13:17:10 -07:00
Xiangrui Meng 7e70708a99 [SPARK-3048][MLLIB] add LabeledPoint.parse and remove loadStreamingLabeledPoints
Move `parse()` from `LabeledPointParser` to `LabeledPoint` and make it public. This breaks binary compatibility only when a user uses synthesized methods like `tupled` and `curried`, which is rare.

`LabeledPoint.parse` is more consistent with `Vectors.parse`, which is why `LabeledPointParser` is not preferred.

freeman-lab tdas

Author: Xiangrui Meng <meng@databricks.com>

Closes #1952 from mengxr/labelparser and squashes the following commits:

c818fb2 [Xiangrui Meng] merge master
ce20e6f [Xiangrui Meng] update mima excludes
b386b8d [Xiangrui Meng] fix tests
2436b3d [Xiangrui Meng] add parse() to LabeledPoint
2014-08-16 15:13:34 -07:00
Reynold Xin a83c7723bf [SPARK-3045] Make Serializer interface Java friendly
Author: Reynold Xin <rxin@apache.org>

Closes #1948 from rxin/kryo and squashes the following commits:

a3a80d8 [Reynold Xin] [SPARK-3046] use executor's class loader as the default serializer classloader
3d13277 [Reynold Xin] Reverted that in TestJavaSerializerImpl too.
196f3dc [Reynold Xin] Ok one more commit to revert the classloader change.
c49b50c [Reynold Xin] Removed JavaSerializer change.
afbf37d [Reynold Xin] Moved the test case also.
a2e693e [Reynold Xin] Removed the Kryo bug fix from this pull request.
c81bd6c [Reynold Xin] Use defaultClassLoader when executing user specified custom registrator.
68f261e [Reynold Xin] Added license check excludes.
0c28179 [Reynold Xin] [SPARK-3045] Make Serializer interface Java friendly [SPARK-3046] Set executor's class loader as the default serializer class loader
2014-08-15 23:12:34 -07:00
Anand Avati 7589c39d39 [SPARK-2924] remove default args to overloaded methods
Not supported in Scala 2.11. Split them into separate methods instead.

Author: Anand Avati <avati@redhat.com>

Closes #1704 from avati/SPARK-1812-default-args and squashes the following commits:

3e3924a [Anand Avati] SPARK-1812: Add Mima excludes for the broken ABI
901dfc7 [Anand Avati] SPARK-1812: core - Fix overloaded methods with default arguments
07f00af [Anand Avati] SPARK-1812: streaming - Fix overloaded methods with default arguments
2014-08-15 08:53:52 -07:00
Xiangrui Meng 9038d94e1e [SPARK-2923][MLLIB] Implement some basic BLAS routines
Having some basic BLAS operations implemented in MLlib can help simplify the current implementation and improve some performance.

Tested on my local machine:

~~~
bin/spark-submit --class org.apache.spark.examples.mllib.BinaryClassification \
examples/target/scala-*/spark-examples-*.jar --algorithm LR --regType L2 \
--regParam 1.0 --numIterations 1000 ~/share/data/rcv1.binary/rcv1_train.binary
~~~

1. before: ~1m
2. after: ~30s

CC: jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #1849 from mengxr/ml-blas and squashes the following commits:

ba583a2 [Xiangrui Meng] exclude Vector.copy
a4d7d2f [Xiangrui Meng] Merge branch 'master' into ml-blas
6edeab9 [Xiangrui Meng] address comments
940bdeb [Xiangrui Meng] rename MLlibBLAS to BLAS
c2a38bc [Xiangrui Meng] enhance dot tests
4cfaac4 [Xiangrui Meng] add apache header
48d01d2 [Xiangrui Meng] add tests for zeros and copy
3b882b1 [Xiangrui Meng] use blas.scal in gradient
735eb23 [Xiangrui Meng] remove d from BLAS routines
d2d7d3c [Xiangrui Meng] update gradient and lbfgs
7f78186 [Xiangrui Meng] add zeros to Vectors; add dscal and dcopy to BLAS
14e6645 [Xiangrui Meng] add ddot
cbb8273 [Xiangrui Meng] add daxpy test
07db0bb [Xiangrui Meng] Merge branch 'master' into ml-blas
e8c326d [Xiangrui Meng] axpy
2014-08-11 22:33:45 -07:00
Xiangrui Meng 74d6f62264 [SPARK-1997][MLLIB] update breeze to 0.9
0.9 dependences (this version doesn't depend on scalalogging and I excluded commons-math3 from its transitive dependencies):
~~~
+-org.scalanlp:breeze_2.10:0.9 [S]
  +-com.github.fommil.netlib:core:1.1.2
  +-com.github.rwl:jtransforms:2.4.0
  +-net.sf.opencsv:opencsv:2.3
  +-net.sourceforge.f2j:arpack_combined_all:0.1
  +-org.scalanlp:breeze-macros_2.10:0.3.1 [S]
  | +-org.scalamacros:quasiquotes_2.10:2.0.0 [S]
  |
  +-org.slf4j:slf4j-api:1.7.5
  +-org.spire-math:spire_2.10:0.7.4 [S]
    +-org.scalamacros:quasiquotes_2.10:2.0.0 [S]
    |
    +-org.spire-math:spire-macros_2.10:0.7.4 [S]
      +-org.scalamacros:quasiquotes_2.10:2.0.0 [S]
~~~

Closes #1749

CC: witgo avati

Author: Xiangrui Meng <meng@databricks.com>

Closes #1857 from mengxr/breeze-0.9 and squashes the following commits:

7fc16b6 [Xiangrui Meng] don't know why but exclude a private method for mima
dcc502e [Xiangrui Meng] update breeze to 0.9
2014-08-08 15:07:31 -07:00
Prashant Sharma 32096c2aed SPARK-2899 Doc generation is back to working in new SBT Build.
The reason for this bug was introduciton of OldDeps project. It had to be excluded to prevent unidocs from trying to put it on "docs compile" classpath.

Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #1830 from ScrapCodes/doc-fix and squashes the following commits:

e5d52e6 [Prashant Sharma] SPARK-2899 Doc generation is back to working in new SBT Build.
2014-08-07 16:24:22 -07:00
Gregory Owen 4e008334ee SPARK-2882: Spark build now checks local maven cache for dependencies
Fixes [SPARK-2882](https://issues.apache.org/jira/browse/SPARK-2882)

Author: Gregory Owen <greowen@gmail.com>

Closes #1818 from GregOwen/spark-2882 and squashes the following commits:

294446d [Gregory Owen] SPARK-2882: Spark build now checks local maven cache for dependencies
2014-08-06 16:52:00 -07:00
Andrew Or 09f7e4587b [SPARK-2157] Enable tight firewall rules for Spark
The goal of this PR is to allow users of Spark to write tight firewall rules for their clusters. This is currently not possible because Spark uses random ports in many places, notably the communication between executors and drivers. The changes in this PR are based on top of ash211's changes in #1107.

The list covered here may or may not be the complete set of port needed for Spark to operate perfectly. However, as of the latest commit there are no known sources of random ports (except in tests). I have not documented a few of the more obscure configs.

My spark-env.sh looks like this:
```
export SPARK_MASTER_PORT=6060
export SPARK_WORKER_PORT=7070
export SPARK_MASTER_WEBUI_PORT=9090
export SPARK_WORKER_WEBUI_PORT=9091
```
and my spark-defaults.conf looks like this:
```
spark.master spark://andrews-mbp:6060
spark.driver.port 5001
spark.fileserver.port 5011
spark.broadcast.port 5021
spark.replClassServer.port 5031
spark.blockManager.port 5041
spark.executor.port 5051
```

Author: Andrew Or <andrewor14@gmail.com>
Author: Andrew Ash <andrew@andrewash.com>

Closes #1777 from andrewor14/configure-ports and squashes the following commits:

621267b [Andrew Or] Merge branch 'master' of github.com:apache/spark into configure-ports
8a6b820 [Andrew Or] Use a random UI port during tests
7da0493 [Andrew Or] Fix tests
523c30e [Andrew Or] Add test for isBindCollision
b97b02a [Andrew Or] Minor fixes
c22ad00 [Andrew Or] Merge branch 'master' of github.com:apache/spark into configure-ports
93d359f [Andrew Or] Executors connect to wrong port when collision occurs
d502e5f [Andrew Or] Handle port collisions when creating Akka systems
a2dd05c [Andrew Or] Patrick's comment nit
86461e2 [Andrew Or] Remove spark.executor.env.port and spark.standalone.client.port
1d2d5c6 [Andrew Or] Fix ports for standalone cluster mode
cb3be88 [Andrew Or] Various doc fixes (broken link, format etc.)
e837cde [Andrew Or] Remove outdated TODOs
bfbab28 [Andrew Or] Merge branch 'master' of github.com:apache/spark into configure-ports
de1b207 [Andrew Or] Update docs to reflect new ports
b565079 [Andrew Or] Add spark.ports.maxRetries
2551eb2 [Andrew Or] Remove spark.worker.watcher.port
151327a [Andrew Or] Merge branch 'master' of github.com:apache/spark into configure-ports
9868358 [Andrew Or] Add a few miscellaneous ports
6016e77 [Andrew Or] Add spark.executor.port
8d836e6 [Andrew Or] Also document SPARK_{MASTER/WORKER}_WEBUI_PORT
4d9e6f3 [Andrew Or] Fix super subtle bug
3f8e51b [Andrew Or] Correct erroneous docs...
e111d08 [Andrew Or] Add names for UI services
470f38c [Andrew Or] Special case non-"Address already in use" exceptions
1d7e408 [Andrew Or] Treat 0 ports specially + return correct ConnectionManager port
ba32280 [Andrew Or] Minor fixes
6b550b0 [Andrew Or] Assorted fixes
73fbe89 [Andrew Or] Move start service logic to Utils
ec676f4 [Andrew Or] Merge branch 'SPARK-2157' of github.com:ash211/spark into configure-ports
038a579 [Andrew Ash] Trust the server start function to report the port the service started on
7c5bdc4 [Andrew Ash] Fix style issue
0347aef [Andrew Ash] Unify port fallback logic to a single place
24a4c32 [Andrew Ash] Remove type on val to match surrounding style
9e4ad96 [Andrew Ash] Reformat for style checker
5d84e0e [Andrew Ash] Document new port configuration options
066dc7a [Andrew Ash] Fix up HttpServer port increments
cad16da [Andrew Ash] Add fallover increment logic for HttpServer
c5a0568 [Andrew Ash] Fix ConnectionManager to retry with increment
b80d2fd [Andrew Ash] Make Spark's block manager port configurable
17c79bb [Andrew Ash] Add a configuration option for spark-shell's class server
f34115d [Andrew Ash] SPARK-1176 Add port configuration for HttpBroadcast
49ee29b [Andrew Ash] SPARK-1174 Add port configuration for HttpFileServer
1c0981a [Andrew Ash] Make port in HttpServer configurable
2014-08-06 00:07:40 -07:00
Chris Fregly 91f9504e60 [SPARK-1981] Add AWS Kinesis streaming support
Author: Chris Fregly <chris@fregly.com>

Closes #1434 from cfregly/master and squashes the following commits:

4774581 [Chris Fregly] updated docs, renamed retry to retryRandom to be more clear, removed retries around store() method
0393795 [Chris Fregly] moved Kinesis examples out of examples/ and back into extras/kinesis-asl
691a6be [Chris Fregly] fixed tests and formatting, fixed a bug with JavaKinesisWordCount during union of streams
0e1c67b [Chris Fregly] Merge remote-tracking branch 'upstream/master'
74e5c7c [Chris Fregly] updated per TD's feedback.  simplified examples, updated docs
e33cbeb [Chris Fregly] Merge remote-tracking branch 'upstream/master'
bf614e9 [Chris Fregly] per matei's feedback:  moved the kinesis examples into the examples/ dir
d17ca6d [Chris Fregly] per TD's feedback:  updated docs, simplified the KinesisUtils api
912640c [Chris Fregly] changed the foundKinesis class to be a publically-avail class
db3eefd [Chris Fregly] Merge remote-tracking branch 'upstream/master'
21de67f [Chris Fregly] Merge remote-tracking branch 'upstream/master'
6c39561 [Chris Fregly] parameterized the versions of the aws java sdk and kinesis client
338997e [Chris Fregly] improve build docs for kinesis
828f8ae [Chris Fregly] more cleanup
e7c8978 [Chris Fregly] Merge remote-tracking branch 'upstream/master'
cd68c0d [Chris Fregly] fixed typos and backward compatibility
d18e680 [Chris Fregly] Merge remote-tracking branch 'upstream/master'
b3b0ff1 [Chris Fregly] [SPARK-1981] Add AWS Kinesis streaming support
2014-08-02 13:35:35 -07:00
Andrew Or 148af6082c [SPARK-2454] Do not ship spark home to Workers
When standalone Workers launch executors, they inherit the Spark home set by the driver. This means if the worker machines do not share the same directory structure as the driver node, the Workers will attempt to run scripts (e.g. bin/compute-classpath.sh) that do not exist locally and fail. This is a common scenario if the driver is launched from outside of the cluster.

The solution is to simply not pass the driver's Spark home to the Workers. This PR further makes an attempt to avoid overloading the usages of `spark.home`, which is now only used for setting executor Spark home on Mesos and in python.

This is based on top of #1392 and originally reported by YanTangZhai. Tested on standalone cluster.

Author: Andrew Or <andrewor14@gmail.com>

Closes #1734 from andrewor14/spark-home-reprise and squashes the following commits:

f71f391 [Andrew Or] Revert changes in python
1c2532c [Andrew Or] Merge branch 'master' of github.com:apache/spark into spark-home-reprise
188fc5d [Andrew Or] Avoid using spark.home where possible
09272b7 [Andrew Or] Always use Worker's working directory as spark home
2014-08-02 00:45:38 -07:00
Patrick Wendell dab37966b0 Revert "[SPARK-1470][SPARK-1842] Use the scala-logging wrapper instead of the directly sfl4j api"
This reverts commit adc8303294.
2014-08-01 23:55:30 -07:00
GuoQiang Li adc8303294 [SPARK-1470][SPARK-1842] Use the scala-logging wrapper instead of the directly sfl4j api
Author: GuoQiang Li <witgo@qq.com>

Closes #1369 from witgo/SPARK-1470_new and squashes the following commits:

66a1641 [GuoQiang Li] IncompatibleResultTypeProblem
73a89ba [GuoQiang Li] Use the scala-logging wrapper instead of the directly sfl4j api.
2014-08-01 23:55:11 -07:00
jerryshao a32f0fb73a [SPARK-2103][Streaming] Change to ClassTag for KafkaInputDStream and fix reflection issue
This PR updates previous Manifest for KafkaInputDStream's Decoder to ClassTag, also fix the problem addressed in [SPARK-2103](https://issues.apache.org/jira/browse/SPARK-2103).

Previous Java interface cannot actually get the type of Decoder, so when using this Manifest to reconstruct the decode object will meet reflection exception.

Also for other two Java interfaces, ClassTag[String] is useless because calling Scala API will get the right implicit ClassTag.

Current Kafka unit test cannot actually verify the interface. I've tested these interfaces in my local and distribute settings.

Author: jerryshao <saisai.shao@intel.com>

Closes #1508 from jerryshao/SPARK-2103 and squashes the following commits:

e90c37b [jerryshao] Add Mima excludes
7529810 [jerryshao] Change Manifest to ClassTag for KafkaInputDStream's Decoder and fix Decoder construct issue when using Java API
2014-08-01 04:32:46 -07:00
Matei Zaharia e966284409 SPARK-2045 Sort-based shuffle
This adds a new ShuffleManager based on sorting, as described in https://issues.apache.org/jira/browse/SPARK-2045. The bulk of the code is in an ExternalSorter class that is similar to ExternalAppendOnlyMap, but sorts key-value pairs by partition ID and can be used to create a single sorted file with a map task's output. (Longer-term I think this can take on the remaining functionality in ExternalAppendOnlyMap and replace it so we don't have code duplication.)

The main TODOs still left are:
- [x] enabling ExternalSorter to merge across spilled files
  - [x] with an Ordering
  - [x] without an Ordering, using the keys' hash codes
- [x] adding more tests (e.g. a version of our shuffle suite that runs on this)
- [x] rebasing on top of the size-tracking refactoring in #1165 when that is merged
- [x] disabling spilling if spark.shuffle.spill is set to false

Despite this though, this seems to work pretty well (running successfully in cases where the hash shuffle would OOM, such as 1000 reduce tasks on executors with only 1G memory), and it seems to be comparable in speed or faster than hash-based shuffle (it will create much fewer files for the OS to keep track of). So I'm posting it to get some early feedback.

After these TODOs are done, I'd also like to enable ExternalSorter to sort data within each partition by a key as well, which will allow us to use it to implement external spilling in reduce tasks in `sortByKey`.

Author: Matei Zaharia <matei@databricks.com>

Closes #1499 from mateiz/sort-based-shuffle and squashes the following commits:

bd841f9 [Matei Zaharia] Various review comments
d1c137fd [Matei Zaharia] Various review comments
a611159 [Matei Zaharia] Compile fixes due to rebase
62c56c8 [Matei Zaharia] Fix ShuffledRDD sometimes not returning Tuple2s.
f617432 [Matei Zaharia] Fix a failing test (seems to be due to change in SizeTracker logic)
9464d5f [Matei Zaharia] Simplify code and fix conflicts after latest rebase
0174149 [Matei Zaharia] Add cleanup behavior and cleanup tests for sort-based shuffle
eb4ee0d [Matei Zaharia] Remove customizable element type in ShuffledRDD
fa2e8db [Matei Zaharia] Allow nextBatchStream to be called after we're done looking at all streams
a34b352 [Matei Zaharia] Fix tracking of indices within a partition in SpillReader, and add test
03e1006 [Matei Zaharia] Add a SortShuffleSuite that runs ShuffleSuite with sort-based shuffle
3c7ff1f [Matei Zaharia] Obey the spark.shuffle.spill setting in ExternalSorter
ad65fbd [Matei Zaharia] Rebase on top of Aaron's Sorter change, and use Sorter in our buffer
44d2a93 [Matei Zaharia] Use estimateSize instead of atGrowThreshold to test collection sizes
5686f71 [Matei Zaharia] Optimize merging phase for in-memory only data:
5461cbb [Matei Zaharia] Review comments and more tests (e.g. tests with 1 element per partition)
e9ad356 [Matei Zaharia] Update ContextCleanerSuite to make sure shuffle cleanup tests use hash shuffle (since they were written for it)
c72362a [Matei Zaharia] Added bug fix and test for when iterators are empty
de1fb40 [Matei Zaharia] Make trait SizeTrackingCollection private[spark]
4988d16 [Matei Zaharia] tweak
c1b7572 [Matei Zaharia] Small optimization
ba7db7f [Matei Zaharia] Handle null keys in hash-based comparator, and add tests for collisions
ef4e397 [Matei Zaharia] Support for partial aggregation even without an Ordering
4b7a5ce [Matei Zaharia] More tests, and ability to sort data if a total ordering is given
e1f84be [Matei Zaharia] Fix disk block manager test
5a40a1c [Matei Zaharia] More tests
614f1b4 [Matei Zaharia] Add spill metrics to map tasks
cc52caf [Matei Zaharia] Add more error handling and tests for error cases
bbf359d [Matei Zaharia] More work
3a56341 [Matei Zaharia] More partial work towards sort-based shuffle
7a0895d [Matei Zaharia] Some more partial work towards sort-based shuffle
b615476 [Matei Zaharia] Scaffolding for sort-based shuffle
2014-07-30 18:07:59 -07:00
Sean Owen e9b275b769 SPARK-2341 [MLLIB] loadLibSVMFile doesn't handle regression datasets
Per discussion at https://issues.apache.org/jira/browse/SPARK-2341 , this is a look at deprecating the multiclass parameter. Thoughts welcome of course.

Author: Sean Owen <srowen@gmail.com>

Closes #1663 from srowen/SPARK-2341 and squashes the following commits:

8a3abd7 [Sean Owen] Suppress MIMA error for removed package private classes
18a8c8e [Sean Owen] Updates from review
83d0092 [Sean Owen] Deprecated methods with multiclass, and instead always parse target as a double (ie. multiclass = true)
2014-07-30 17:34:32 -07:00
Michael Armbrust 2248891a43 [SQL] Fix compiling of catalyst docs.
Author: Michael Armbrust <michael@databricks.com>

Closes #1653 from marmbrus/fixDocs and squashes the following commits:

0aa1feb [Michael Armbrust] Fix compiling of catalyst docs.
2014-07-30 13:11:09 -07:00
Yin Huai 7003c163db [SPARK-2179][SQL] Public API for DataTypes and Schema
The current PR contains the following changes:
* Expose `DataType`s in the sql package (internal details are private to sql).
* Users can create Rows.
* Introduce `applySchema` to create a `SchemaRDD` by applying a `schema: StructType` to an `RDD[Row]`.
* Add a function `simpleString` to every `DataType`. Also, the schema represented by a `StructType` can be visualized by `printSchema`.
* `ScalaReflection.typeOfObject` provides a way to infer the Catalyst data type based on an object. Also, we can compose `typeOfObject` with some custom logics to form a new function to infer the data type (for different use cases).
* `JsonRDD` has been refactored to use changes introduced by this PR.
* Add a field `containsNull` to `ArrayType`. So, we can explicitly mark if an `ArrayType` can contain null values. The default value of `containsNull` is `false`.

New APIs are introduced in the sql package object and SQLContext. You can find the scaladoc at
[sql package object](http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.sql.package) and [SQLContext](http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.sql.SQLContext).

An example of using `applySchema` is shown below.
```scala
import org.apache.spark.sql._
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val schema =
  StructType(
    StructField("name", StringType, false) ::
    StructField("age", IntegerType, true) :: Nil)

val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Row(p(0), p(1).trim.toInt))
val peopleSchemaRDD = sqlContext. applySchema(people, schema)
peopleSchemaRDD.printSchema
// root
// |-- name: string (nullable = false)
// |-- age: integer (nullable = true)

peopleSchemaRDD.registerAsTable("people")
sqlContext.sql("select name from people").collect.foreach(println)
```

I will add new contents to the SQL programming guide later.

JIRA: https://issues.apache.org/jira/browse/SPARK-2179

Author: Yin Huai <huai@cse.ohio-state.edu>

Closes #1346 from yhuai/dataTypeAndSchema and squashes the following commits:

1d45977 [Yin Huai] Clean up.
a6e08b4 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
c712fbf [Yin Huai] Converts types of values based on defined schema.
4ceeb66 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
e5f8df5 [Yin Huai] Scaladoc.
122d1e7 [Yin Huai] Address comments.
03bfd95 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
2476ed0 [Yin Huai] Minor updates.
ab71f21 [Yin Huai] Format.
fc2bed1 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
bd40a33 [Yin Huai] Address comments.
991f860 [Yin Huai] Move "asJavaDataType" and "asScalaDataType" to DataTypeConversions.scala.
1cb35fe [Yin Huai] Add "valueContainsNull" to MapType.
3edb3ae [Yin Huai] Python doc.
692c0b9 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
1d93395 [Yin Huai] Python APIs.
246da96 [Yin Huai] Add java data type APIs to javadoc index.
1db9531 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
d48fc7b [Yin Huai] Minor updates.
33c4fec [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
b9f3071 [Yin Huai] Java API for applySchema.
1c9f33c [Yin Huai] Java APIs for DataTypes and Row.
624765c [Yin Huai] Tests for applySchema.
aa92e84 [Yin Huai] Update data type tests.
8da1a17 [Yin Huai] Add Row.fromSeq.
9c99bc0 [Yin Huai] Several minor updates.
1d9c13a [Yin Huai] Update applySchema API.
85e9b51 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
e495e4e [Yin Huai] More comments.
42d47a3 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
c3f4a02 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
2e58dbd [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
b8b7db4 [Yin Huai] 1. Move sql package object and package-info to sql-core. 2. Minor updates on APIs. 3. Update scala doc.
68525a2 [Yin Huai] Update JSON unit test.
3209108 [Yin Huai] Add unit tests.
dcaf22f [Yin Huai] Add a field containsNull to ArrayType to indicate if an array can contain null values or not. If an ArrayType is constructed by "ArrayType(elementType)" (the existing constructor), the value of containsNull is false.
9168b83 [Yin Huai] Update comments.
fc649d7 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
eca7d04 [Yin Huai] Add two apply methods which will be used to extract StructField(s) from a StructType.
949d6bb [Yin Huai] When creating a SchemaRDD for a JSON dataset, users can apply an existing schema.
7a6a7e5 [Yin Huai] Fix bug introduced by the change made on SQLContext.inferSchema.
43a45e1 [Yin Huai] Remove sql.util.package introduced in a previous commit.
0266761 [Yin Huai] Format
03eec4c [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
90460ac [Yin Huai] Infer the Catalyst data type from an object and cast a data value to the expected type.
3fa0df5 [Yin Huai] Provide easier ways to construct a StructType.
16be3e5 [Yin Huai] This commit contains three changes: * Expose `DataType`s in the sql package (internal details are private to sql). * Introduce `createSchemaRDD` to create a `SchemaRDD` from an `RDD` with a provided schema (represented by a `StructType`) and a provided function to construct `Row`, * Add a function `simpleString` to every `DataType`. Also, the schema represented by a `StructType` can be visualized by `printSchema`.
2014-07-30 00:15:31 -07:00
Michael Armbrust 84467468d4 [SPARK-2054][SQL] Code Generation for Expression Evaluation
Adds a new method for evaluating expressions using code that is generated though Scala reflection.  This functionality is configured by the SQLConf option `spark.sql.codegen` and is currently turned off by default.

Evaluation can be done in several specialized ways:
 - *Projection* - Given an input row, produce a new row from a set of expressions that define each column in terms of the input row.  This can either produce a new Row object or perform the projection in-place on an existing Row (MutableProjection).
 - *Ordering* - Compares two rows based on a list of `SortOrder` expressions
 - *Condition* - Returns `true` or `false` given an input row.

For each of the above operations there is both a Generated and Interpreted version.  When generation for a given expression type is undefined, the code generator falls back on calling the `eval` function of the expression class.  Even without custom code, there is still a potential speed up, as loops are unrolled and code can still be inlined by JIT.

This PR also contains a new type of Aggregation operator, `GeneratedAggregate`, that performs aggregation by using generated `Projection` code.  Currently the required expression rewriting only works for simple aggregations like `SUM` and `COUNT`.  This functionality will be extended in a future PR.

This PR also performs several clean ups that simplified the implementation:
 - The notion of `Binding` all expressions in a tree automatically before query execution has been removed.  Instead it is the responsibly of an operator to provide the input schema when creating one of the specialized evaluators defined above.  In cases when the standard eval method is going to be called, binding can still be done manually using `BindReferences`.  There are a few reasons for this change:  First, there were many operators where it just didn't work before.  For example, operators with more than one child, and operators like aggregation that do significant rewriting of the expression. Second, the semantics of equality with `BoundReferences` are broken.  Specifically, we have had a few bugs where partitioning breaks because of the binding.
 - A copy of the current `SQLContext` is automatically propagated to all `SparkPlan` nodes by the query planner.  Before this was done ad-hoc for the nodes that needed this.  However, this required a lot of boilerplate as one had to always remember to make it `transient` and also had to modify the `otherCopyArgs`.

Author: Michael Armbrust <michael@databricks.com>

Closes #993 from marmbrus/newCodeGen and squashes the following commits:

96ef82c [Michael Armbrust] Merge remote-tracking branch 'apache/master' into newCodeGen
f34122d [Michael Armbrust] Merge remote-tracking branch 'apache/master' into newCodeGen
67b1c48 [Michael Armbrust] Use conf variable in SQLConf object
4bdc42c [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
41a40c9 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
de22aac [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
fed3634 [Michael Armbrust] Inspectors are not serializable.
ef8d42b [Michael Armbrust] comments
533fdfd [Michael Armbrust] More logging of expression rewriting for GeneratedAggregate.
3cd773e [Michael Armbrust] Allow codegen for Generate.
64b2ee1 [Michael Armbrust] Implement copy
3587460 [Michael Armbrust] Drop unused string builder function.
9cce346 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
1a61293 [Michael Armbrust] Address review comments.
0672e8a [Michael Armbrust] Address comments.
1ec2d6e [Michael Armbrust] Address comments
033abc6 [Michael Armbrust] off by default
4771fab [Michael Armbrust] Docs, more test coverage.
d30fee2 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
d2ad5c5 [Michael Armbrust] Refactor putting SQLContext into SparkPlan. Fix ordering, other test cases.
be2cd6b [Michael Armbrust] WIP: Remove old method for reference binding, more work on configuration.
bc88ecd [Michael Armbrust] Style
6cc97ca [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
4220f1e [Michael Armbrust] Better config, docs, etc.
ca6cc6b [Michael Armbrust] WIP
9d67d85 [Michael Armbrust] Fix hive planner
fc522d5 [Michael Armbrust] Hook generated aggregation in to the planner.
e742640 [Michael Armbrust] Remove unneeded changes and code.
675e679 [Michael Armbrust] Upgrade paradise.
0093376 [Michael Armbrust] Comment / indenting cleanup.
d81f998 [Michael Armbrust] include schema for binding.
0e889e8 [Michael Armbrust] Use typeOf instead tq
f623ffd [Michael Armbrust] Quiet logging from test suite.
efad14f [Michael Armbrust] Remove some half finished functions.
92e74a4 [Michael Armbrust] add overrides
a2b5408 [Michael Armbrust] WIP: Code generation with scala reflection.
2014-07-29 20:58:05 -07:00