Apache Spark - A unified analytics engine for large-scale data processing

Ram Sriharsha 595a67589a [SPARK-7015] [MLLIB] [WIP] Multiclass to Binary Reduction: One Against All	2015-05-12 13:35:12 -07:00

initial cut of one against all. test code is a scaffolding, not fully implemented. This WIP is to gather early feedback.

Author: Ram Sriharsha <rsriharsha@hw11853.local>

Closes #5830 from harsha2010/reduction and squashes the following commits:

5f4b495 [Ram Sriharsha] Fix Test
386e98b [Ram Sriharsha] Style fix
49b4a17 [Ram Sriharsha] Simplify the test
02279cc [Ram Sriharsha] Output Label Metadata in Prediction Col
bc78032 [Ram Sriharsha] Code Review Updates
8ce4845 [Ram Sriharsha] Merge with Master
2a807be [Ram Sriharsha] Merge branch 'master' into reduction
e21bfcc [Ram Sriharsha] Style Fix
5614f23 [Ram Sriharsha] Style Fix
c75583a [Ram Sriharsha] Cleanup
7a5f136 [Ram Sriharsha] Fix TODOs
804826b [Ram Sriharsha] Merge with Master
1448a5f [Ram Sriharsha] Style Fix
6e47807 [Ram Sriharsha] Style Fix
d63e46b [Ram Sriharsha] Incorporate Code Review Feedback
ced68b5 [Ram Sriharsha] Refactor OneVsAll to implement Predictor
78fa82a [Ram Sriharsha] extra line
0dfa1fb [Ram Sriharsha] Fix inexhaustive match cases that may arise from UnresolvedAttribute
a59a4f4 [Ram Sriharsha] @Experimental
4167234 [Ram Sriharsha] Merge branch 'master' into reduction
868a4fd [Ram Sriharsha] @Experimental
041d905 [Ram Sriharsha] Code Review Fixes
df188d8 [Ram Sriharsha] Style fix
612ec48 [Ram Sriharsha] Style Fix
6ef43d3 [Ram Sriharsha] Prefer Unresolved Attribute to Option: Java APIs are cleaner
6bf6bff [Ram Sriharsha] Update OneHotEncoder to new API
e29cb89 [Ram Sriharsha] Merge branch 'master' into reduction
1c7fa44 [Ram Sriharsha] Fix Tests
ca83672 [Ram Sriharsha] Incorporate Code Review Feedback + Rename to OneVsRestClassifier
221beeed [Ram Sriharsha] Upgrade to use Copy method for cloning Base Classifiers
26f1ddb [Ram Sriharsha] Merge with SPARK-5956 API changes
9738744 [Ram Sriharsha] Merge branch 'master' into reduction
1a3e375 [Ram Sriharsha] More efficient Implementation: Use withColumn to generate label column dynamically
32e0189 [Ram Sriharsha] Restrict reduction to Margin Based Classifiers
ff272da [Ram Sriharsha] Style fix
28771f5 [Ram Sriharsha] Add Tests for Multiclass to Binary Reduction
b60f874 [Ram Sriharsha] Fix Style issues in Test
3191cdf [Ram Sriharsha] Remove this test, accidental commit
23f056c [Ram Sriharsha] Fix Headers for test
1b5e929 [Ram Sriharsha] Fix Style issues and add Header
8752863 [Ram Sriharsha] [SPARK-7015][MLLib][WIP] Multiclass to Binary Reduction: One Against All

assembly	[SPARK-6869] [PYSPARK] Add pyspark archives path to PYTHONPATH	2015-05-08 08:44:46 -05:00
bagel	[SPARK-6758]block the right jetty package in log	2015-04-09 17:44:08 -04:00
bin	Limit help option regex	2015-05-01 19:26:55 +01:00
build	SPARK-5856: In Maven build script, launch Zinc with more memory	2015-02-17 10:10:01 -08:00
conf	[SPARK-2691] [MESOS] Support for Mesos DockerInfo	2015-05-01 18:41:22 -07:00
core	[HOT FIX #6076 ] DAG visualization: curve the edges	2015-05-12 12:06:30 -07:00
data/mllib	[SPARK-5939][MLLib] make FPGrowth example app take parameters	2015-02-23 08:47:28 -08:00
dev	[SPARK-6908] [SQL] Use isolated Hive client	2015-05-07 19:36:24 -07:00
docker	[SPARK-2691] [MESOS] Support for Mesos DockerInfo	2015-05-01 18:41:22 -07:00
docs	[SPARK-6994][SQL] Update docs for fetching Row fields by name	2015-05-11 22:29:24 -07:00
ec2	updated ec2 instance types	2015-05-08 15:59:34 -07:00
examples	[SPARK-7522] [EXAMPLES] Removed angle brackets from dataFormat option	2015-05-11 09:23:47 -07:00
external	[SPARK-7113] [STREAMING] Support input information reporting for Direct Kafka stream	2015-05-05 02:01:06 -07:00
extras	[SPARK-6440][CORE]Handle IPv6 addresses properly when constructing URI	2015-04-13 12:55:25 +01:00
graphx	[SPARK-5854] personalized page rank	2015-05-01 11:55:43 -07:00
launcher	[SPARK-7031] [THRIFTSERVER] let thrift server take SPARK_DAEMON_MEMORY and SPARK_DAEMON_JAVA_OPTS	2015-05-03 00:47:47 +01:00
mllib	[SPARK-7015] [MLLIB] [WIP] Multiclass to Binary Reduction: One Against All	2015-05-12 13:35:12 -07:00
network	[SPARK-6955] Perform port retries at NettyBlockTransferService level	2015-05-08 17:13:55 -07:00
project	[SPARK-3928] [SPARK-5182] [SQL] Partitioning support for the data sources API	2015-05-13 01:32:28 +08:00
python	[SPARK-7487] [ML] Feature Parity in PySpark for ml.regression	2015-05-12 12:17:05 -07:00
R	[SPARK-7435] [SPARKR] Make DataFrame.show() consistent with that of Scala and pySpark	2015-05-11 21:04:32 -07:00
repl	[SPARK-7489] [SPARK SHELL] Spark shell crashes when compiled with scala 2.11	2015-05-08 14:07:53 -07:00
sbin	[SPARK-5338] [MESOS] Add cluster mode support for Mesos	2015-04-28 13:33:57 -07:00
sbt	Adde LICENSE Header to build/mvn, build/sbt and sbt/sbt	2014-12-29 10:48:53 -08:00
sql	[SPARK-7276] [DATAFRAME] speed up DataFrame.select by collapsing Project	2015-05-12 11:51:55 -07:00
streaming	[SPARK-7532] [STREAMING] StreamingContext.start() made to logWarning and not throw exception	2015-05-12 08:48:24 -07:00
tools	[SPARK-4550] In sort-based shuffle, store map outputs in serialized form	2015-04-30 23:14:14 -07:00
unsafe	[SPARK-7450] Use UNSAFE.getLong() to speed up BitSetMethods#anySet()	2015-05-07 16:55:34 -07:00
yarn	[SPARK-6470] [YARN] Add support for YARN node labels.	2015-05-11 12:09:39 -07:00
.gitattributes	[SPARK-3870] EOL character enforcement	2014-10-31 12:39:52 -07:00
.gitignore	[MINOR] Ignore python/lib/pyspark.zip	2015-05-08 14:06:02 -07:00
.rat-excludes	[WEBUI] Remove debug feature for vis.js	2015-05-08 14:06:37 -07:00
CONTRIBUTING.md	[SPARK-6889] [DOCS] CONTRIBUTING.md updates to accompany contribution doc updates	2015-04-21 22:34:31 -07:00
LICENSE	[SPARK-7403] [WEBUI] Link URL in objects on Timeline View is wrong in case of running on YARN	2015-05-09 10:10:29 +01:00
make-distribution.sh	[SPARK-7302] [DOCS] SPARK building documentation still mentions building for yarn 0.23	2015-05-03 21:22:31 +01:00
NOTICE	SPARK-1827. LICENSE and NOTICE files need a refresh to contain transitive dependency info	2014-05-14 09:38:33 -07:00
pom.xml	[SPARK-2018] [CORE] Upgrade LZF library to fix endian serialization p…	2015-05-12 20:48:26 +01:00
README.md	[MINOR] [DOCS] Fix the link to test building info on the wiki	2015-05-12 00:25:43 +01:00
scalastyle-config.xml	[SPARK-6428] Turn on explicit type checking for public methods.	2015-04-03 01:25:02 -07:00
tox.ini	[SPARK-7427] [PYSPARK] Make sharedParams match in Scala, Python	2015-05-10 19:18:32 -07:00

README.md

Apache Spark

Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, and Python, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing.

http://spark.apache.org/

Online Documentation

You can find the latest Spark documentation, including a programming guide, on the project web page and project wiki. This README file only contains basic setup instructions.

Building Spark

Spark is built using Apache Maven. To build Spark and its example programs, run:

mvn -DskipTests clean package

(You do not need to do this if you downloaded a pre-built package.) More detailed documentation is available from the project site, at "Building Spark".

Interactive Scala Shell

The easiest way to start using Spark is through the Scala shell:

./bin/spark-shell

Try the following command, which should return 1000:

scala> sc.parallelize(1 to 1000).count()

Interactive Python Shell

Alternatively, if you prefer Python, you can use the Python shell:

./bin/pyspark

And run the following command, which should also return 1000:

>>> sc.parallelize(range(1000)).count()
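
Both shell commands distribute a range of 1000 numbers and count its elements. As a rough illustration only, the plain-Python sketch below (no Spark involved) mimics the idea: split the data into partitions, count each partition independently, and sum the partial counts. The helper names `split_into_partitions` and `distributed_count` are hypothetical and are not part of Spark's API, and the thread pool merely stands in for a cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into num_partitions contiguous slices of near-equal size."""
    n = len(data)
    bounds = [(i * n) // num_partitions for i in range(num_partitions + 1)]
    return [data[bounds[i]:bounds[i + 1]] for i in range(num_partitions)]


def distributed_count(data, num_partitions=4):
    """Count elements by counting each partition separately, then summing."""
    partitions = split_into_partitions(list(data), num_partitions)
    # Each partition is counted independently, as a worker would do.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_counts = list(pool.map(len, partitions))
    # The driver-side step: combine the partial results.
    return sum(partial_counts)


if __name__ == "__main__":
    print(distributed_count(range(1000)))
```

The number of partitions changes how the work is divided, not the result.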

Example Programs

Spark also comes with several sample programs in the examples directory. To run one of them, use ./bin/run-example <class> [params]. For example:

./bin/run-example SparkPi

will run the Pi example locally.
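
The Pi example estimates π by Monte Carlo sampling: it draws random points in a square and counts how many fall inside the inscribed circle. The sketch below reproduces that idea in plain Python, without Spark. The function name `estimate_pi`, the sample count, and the seed are assumptions chosen for illustration and are not taken from the example's source.

```python
import random


def estimate_pi(num_samples=100_000, seed=42):
    """Estimate pi from the fraction of random points inside the unit circle."""
    rng = random.Random(seed)  # seeded so repeated runs give the same estimate
    inside = 0
    for _ in range(num_samples):
        # Draw a point uniformly from the square [-1, 1] x [-1, 1].
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            inside += 1
    # Area of circle / area of square = pi / 4.
    return 4.0 * inside / num_samples


if __name__ == "__main__":
    print("Pi is roughly", estimate_pi())
```

In a distributed run the samples would be spread across workers and the per-worker hit counts summed; the arithmetic stays the same.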

You can set the MASTER environment variable when running examples to submit them to a cluster. The value can be a mesos:// or spark:// URL, "yarn-cluster" or "yarn-client" to run on YARN, "local" to run locally with one thread, or "local[N]" to run locally with N threads. You can also use an abbreviated class name if the class is in the examples package. For instance:

MASTER=spark://host:7077 ./bin/run-example SparkPi

Many of the example programs print usage help if no params are given.
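
To make the accepted MASTER forms concrete, here is a small Python sketch that classifies a value according to the forms listed in this section. The function `describe_master` and its return strings are hypothetical and exist only for illustration; Spark's own parsing may accept other forms and is not reproduced here.

```python
import re


def describe_master(master):
    """Return a short description of a MASTER value (illustrative only)."""
    if master.startswith("spark://"):
        return "Spark cluster URL"
    if master.startswith("mesos://"):
        return "Mesos cluster URL"
    if master in ("yarn-cluster", "yarn-client"):
        return "YARN"
    if master == "local":
        return "local, 1 thread"
    # "local[N]" runs locally with N threads.
    match = re.fullmatch(r"local\[(\d+)\]", master)
    if match:
        return "local, {} threads".format(match.group(1))
    raise ValueError("unrecognized MASTER value: {!r}".format(master))


if __name__ == "__main__":
    for value in ("spark://host:7077", "yarn-client", "local", "local[4]"):
        print(value, "->", describe_master(value))
```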

Running Tests

Testing first requires building Spark. Once Spark is built, tests can be run using:

./dev/run-tests

Please see the guidance on how to run tests for a module, or individual tests.

A Note About Hadoop Versions

Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported storage systems. Because the protocols have changed in different versions of Hadoop, you must build Spark against the same version that your cluster runs.

Please refer to the build documentation at "Specifying the Hadoop Version" for detailed guidance on building for a particular distribution of Hadoop, including building for particular Hive and Hive Thriftserver distributions. See also "Third Party Hadoop Distributions" for guidance on building a Spark application that works with a particular distribution.

Configuration

Please refer to the Configuration guide in the online documentation for an overview of how to configure Spark.