Commit graph

677 commits

Author SHA1 Message Date
Matei Zaharia b547f69bdb SPARK-2680: Lower spark.shuffle.memoryFraction to 0.2 by default
Author: Matei Zaharia <matei@databricks.com>

Closes #1593 from mateiz/spark-2680 and squashes the following commits:

3c949c4 [Matei Zaharia] Lower spark.shuffle.memoryFraction to 0.2 by default
2014-07-26 22:44:17 -07:00
Hossein 66f26a4610 [SPARK-2696] Reduce default value of spark.serializer.objectStreamReset
The current default value of spark.serializer.objectStreamReset is 10,000.
When trying to re-partition (e.g., to 64 partitions) a large file (e.g., 500MB), containing 1MB records, the serializer will cache 10000 x 1MB x 64 ~= 640 GB which will cause out of memory errors.

This patch sets the default value to a more reasonable default value (100).

Author: Hossein <hossein@databricks.com>

Closes #1595 from falaki/objectStreamReset and squashes the following commits:

650a935 [Hossein] Updated documentation
1aa0df8 [Hossein] Reduce default value of spark.serializer.objectStreamReset
2014-07-26 01:04:56 -07:00
Michael Armbrust afd757a241 Revert "[SPARK-2410][SQL] Merging Hive Thrift/JDBC server"
This reverts commit 06dc0d2c6b.

#1399 is making Jenkins fail.  We should investigate and put this back after its passing tests.

Author: Michael Armbrust <michael@databricks.com>

Closes #1594 from marmbrus/revertJDBC and squashes the following commits:

59748da [Michael Armbrust] Revert "[SPARK-2410][SQL] Merging Hive Thrift/JDBC server"
2014-07-25 15:36:57 -07:00
Cheng Lian 06dc0d2c6b [SPARK-2410][SQL] Merging Hive Thrift/JDBC server
JIRA issue:

- Main: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)
- Related: [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678)

Cherry picked the Hive Thrift/JDBC server from [branch-1.0-jdbc](https://github.com/apache/spark/tree/branch-1.0-jdbc).

(Thanks chenghao-intel for his initial contribution of the Spark SQL CLI.)

TODO

- [x] Use `spark-submit` to launch the server, the CLI and beeline
- [x] Migration guideline draft for Shark users

----

Hit by a bug in `SparkSubmitArguments` while working on this PR: all application options that are recognized by `SparkSubmitArguments` are stolen as `SparkSubmit` options. For example:

```bash
$ spark-submit --class org.apache.hive.beeline.BeeLine spark-internal --help
```

This actually shows usage information of `SparkSubmit` rather than `BeeLine`.

~~Fixed this bug here since the `spark-internal` related stuff also touches `SparkSubmitArguments` and I'd like to avoid conflict.~~

**UPDATE** The bug mentioned above is now tracked by [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678). Decided to revert changes to this bug since it involves more subtle considerations and worth a separate PR.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1399 from liancheng/thriftserver and squashes the following commits:

090beea [Cheng Lian] Revert changes related to SPARK-2678, decided to move them to another PR
21c6cf4 [Cheng Lian] Updated Spark SQL programming guide docs
fe0af31 [Cheng Lian] Reordered spark-submit options in spark-shell[.cmd]
199e3fb [Cheng Lian] Disabled MIMA for hive-thriftserver
1083e9d [Cheng Lian] Fixed failed test suites
7db82a1 [Cheng Lian] Fixed spark-submit application options handling logic
9cc0f06 [Cheng Lian] Starts beeline with spark-submit
cfcf461 [Cheng Lian] Updated documents and build scripts for the newly added hive-thriftserver profile
061880f [Cheng Lian] Addressed all comments by @pwendell
7755062 [Cheng Lian] Adapts test suites to spark-submit settings
40bafef [Cheng Lian] Fixed more license header issues
e214aab [Cheng Lian] Added missing license headers
b8905ba [Cheng Lian] Fixed minor issues in spark-sql and start-thriftserver.sh
f975d22 [Cheng Lian] Updated docs for Hive compatibility and Shark migration guide draft
3ad4e75 [Cheng Lian] Starts spark-sql shell with spark-submit
a5310d1 [Cheng Lian] Make HiveThriftServer2 play well with spark-submit
61f39f4 [Cheng Lian] Starts Hive Thrift server via spark-submit
2c4c539 [Cheng Lian] Cherry picked the Hive Thrift server
2014-07-25 12:20:49 -07:00
Davies Liu 14174abd42 [SPARK-2538] [PySpark] Hash based disk spilling aggregation
During aggregation in Python worker, if the memory usage is above spark.executor.memory, it will do disk spilling aggregation.

It will split the aggregation into multiple stage, in each stage, it will partition the aggregated data by hash and dump them into disks. After all the data are aggregated, it will merge all the stages together (partition by partition).

Author: Davies Liu <davies.liu@gmail.com>

Closes #1460 from davies/spill and squashes the following commits:

cad91bf [Davies Liu] call gc.collect() after data.clear() to release memory as much as possible.
37d71f7 [Davies Liu] balance the partitions
902f036 [Davies Liu] add shuffle.py into run-tests
dcf03a9 [Davies Liu] fix memory_info() of psutil
67e6eba [Davies Liu] comment for MAX_TOTAL_PARTITIONS
f6bd5d6 [Davies Liu] rollback next_limit() again, the performance difference is huge:
e74b785 [Davies Liu] fix code style and change next_limit to memory_limit
400be01 [Davies Liu] address all the comments
6178844 [Davies Liu] refactor and improve docs
fdd0a49 [Davies Liu] add long doc string for ExternalMerger
1a97ce4 [Davies Liu] limit used memory and size of objects in partitionBy()
e6cc7f9 [Davies Liu] Merge branch 'master' into spill
3652583 [Davies Liu] address comments
e78a0a0 [Davies Liu] fix style
24cec6a [Davies Liu] get local directory by SPARK_LOCAL_DIR
57ee7ef [Davies Liu] update docs
286aaff [Davies Liu] let spilled aggregation in Python configurable
e9a40f6 [Davies Liu] recursive merger
6edbd1f [Davies Liu] Hash based disk spilling aggregation
2014-07-24 22:53:47 -07:00
Sandy Ryza e34922a221 SPARK-2310. Support arbitrary Spark properties on the command line with ...
...spark-submit

The PR allows invocations like
  spark-submit --class org.MyClass --spark.shuffle.spill false myjar.jar

Author: Sandy Ryza <sandy@cloudera.com>

Closes #1253 from sryza/sandy-spark-2310 and squashes the following commits:

1dc9855 [Sandy Ryza] More doc and cleanup
00edfb9 [Sandy Ryza] Review comments
91b244a [Sandy Ryza] Change format to --conf PROP=VALUE
8fabe77 [Sandy Ryza] SPARK-2310. Support arbitrary Spark properties on the command line with spark-submit
2014-07-23 23:11:26 -07:00
Ian O Connell efdaeb1119 [SPARK-2102][SQL][CORE] Add option for kryo registration required and use a resource pool in Spark SQL for Kryo instances.
Author: Ian O Connell <ioconnell@twitter.com>

Closes #1377 from ianoc/feature/SPARK-2102 and squashes the following commits:

5498566 [Ian O Connell] Docs update suggested by Patrick
20e8555 [Ian O Connell] Slight style change
f92c294 [Ian O Connell] Add docs for new KryoSerializer option
f3735c8 [Ian O Connell] Add using a kryo resource pool for the SqlSerializer
4e5c342 [Ian O Connell] Register the SparkConf for kryo, it gets swept into serialization
665805a [Ian O Connell] Add a spark.kryo.registrationRequired option for configuring the Kryo Serializer
2014-07-23 16:30:11 -07:00
Michael Giannakopoulos db56f2df1b [SPARK-1945][MLLIB] Documentation Improvements for Spark 1.0
Standalone application examples are added to 'mllib-linear-methods.md' file written in Java.
This commit is related to the issue [Add full Java Examples in MLlib docs](https://issues.apache.org/jira/browse/SPARK-1945).
Also I changed the name of the sigmoid function from 'logit' to 'f'. This is because the logit function
is the inverse of sigmoid.

Thanks,
Michael

Author: Michael Giannakopoulos <miccagiann@gmail.com>

Closes #1311 from miccagiann/master and squashes the following commits:

8ffe5ab [Michael Giannakopoulos] Update code so as to comply with code standards.
f7ad5cc [Michael Giannakopoulos] Merge remote-tracking branch 'upstream/master'
38d92c7 [Michael Giannakopoulos] Adding PCA, SVD and LBFGS examples in Java. Performing minor updates in the already committed examples so as to eradicate the call of 'productElement' function whenever is possible.
cc0a089 [Michael Giannakopoulos] Modyfied Java examples so as to comply with coding standards.
b1141b2 [Michael Giannakopoulos] Added Java examples for Clustering and Collaborative Filtering [mllib-clustering.md & mllib-collaborative-filtering.md].
837f7a8 [Michael Giannakopoulos] Merge remote-tracking branch 'upstream/master'
15f0eb4 [Michael Giannakopoulos] Java examples included in 'mllib-linear-methods.md' file.
2014-07-20 20:48:44 -07:00
lianhuiwang 4da01e3813 [SPARK-2524] missing document about spark.deploy.retainedDrivers
https://issues.apache.org/jira/browse/SPARK-2524
The configuration on spark.deploy.retainedDrivers is undocumented but actually used
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L60

Author: lianhuiwang <lianhuiwang09@gmail.com>
Author: Wang Lianhui <lianhuiwang09@gmail.com>
Author: unknown <Administrator@taguswang-PC1.tencent.com>

Closes #1443 from lianhuiwang/SPARK-2524 and squashes the following commits:

64660fd [Wang Lianhui] address pwendell's comments
5f6bbb7 [Wang Lianhui] missing document about spark.deploy.retainedDrivers
44a3f50 [unknown] Merge remote-tracking branch 'upstream/master'
eacf933 [lianhuiwang] Merge remote-tracking branch 'upstream/master'
8bbfe76 [lianhuiwang] Merge remote-tracking branch 'upstream/master'
480ce94 [lianhuiwang] address aarondav comments
f2b5970 [lianhuiwang] bugfix worker DriverStateChanged state should match DriverState.FAILED
2014-07-19 20:46:59 -07:00
Cesar Arevalo 0d01e85f42 Typo fix to the programming guide in the docs
Typo fix to the programming guide in the docs. Changed the word "distibuted" to "distributed".

Author: Cesar Arevalo <cesar@zephyrhealthinc.com>

Closes #1495 from cesararevalo/master and squashes the following commits:

0c2e3a7 [Cesar Arevalo] Typo fix to the programming guide in the docs
2014-07-19 20:20:07 -07:00
Manish Amde d88f6be446 [MLlib] SPARK-1536: multiclass classification support for decision tree
The ability to perform multiclass classification is a big advantage for using decision trees and was a highly requested feature for mllib. This pull request adds multiclass classification support to the MLlib decision tree. It also adds sample weights support using WeightedLabeledPoint class for handling unbalanced datasets during classification. It will also support algorithms such as AdaBoost which requires instances to be weighted.

It handles the special case where the categorical variables cannot be ordered for multiclass classification and thus the optimizations used for speeding up binary classification cannot be directly used for multiclass classification with categorical variables. More specifically, for m categories in a categorical feature, it analyses all the ```2^(m-1) - 1``` categorical splits provided that #splits are less than the maxBins provided in the input. This condition will not be met for features with large number of categories -- using decision trees is not recommended for such datasets in general since the categorical features are favored over continuous features. Moreover, the user can use a combination of tricks (increasing bin size of the tree algorithms, use binary encoding for categorical features or use one-vs-all classification strategy) to avoid these constraints.

The new code is accompanied by unit tests and has also been tested on the iris and covtype datasets.

cc: mengxr, etrain, hirakendu, atalwalkar, srowen

Author: Manish Amde <manish9ue@gmail.com>
Author: manishamde <manish9ue@gmail.com>
Author: Evan Sparks <sparks@cs.berkeley.edu>

Closes #886 from manishamde/multiclass and squashes the following commits:

26f8acc [Manish Amde] another attempt at fixing mima
c5b2d04 [Manish Amde] more MIMA fixes
1ce7212 [Manish Amde] change problem filter for mima
10fdd82 [Manish Amde] fixing MIMA excludes
e1c970d [Manish Amde] merged master
abf2901 [Manish Amde] adding classes to MimaExcludes.scala
45e767a [Manish Amde] adding developer api annotation for overriden methods
c8428c4 [Manish Amde] fixing weird multiline bug
afced16 [Manish Amde] removed label weights support
2d85a48 [Manish Amde] minor: fixed scalastyle issues reprise
4e85f2c [Manish Amde] minor: fixed scalastyle issues
b2ae41f [Manish Amde] minor: scalastyle
e4c1321 [Manish Amde] using while loop for regression histograms
d75ac32 [Manish Amde] removed WeightedLabeledPoint from this PR
0fecd38 [Manish Amde] minor: add newline to EOF
2061cf5 [Manish Amde] merged from master
06b1690 [Manish Amde] fixed off-by-one error in bin to split conversion
9cc3e31 [Manish Amde] added implicit conversion import
5c1b2ca [Manish Amde] doc for PointConverter class
485eaae [Manish Amde] implicit conversion from LabeledPoint to WeightedLabeledPoint
3d7f911 [Manish Amde] updated doc
8e44ab8 [Manish Amde] updated doc
adc7315 [Manish Amde] support ordered categorical splits for multiclass classification
e3e8843 [Manish Amde] minor code formatting
23d4268 [Manish Amde] minor: another minor code style
34ee7b9 [Manish Amde] minor: code style
237762d [Manish Amde] renaming functions
12e6d0a [Manish Amde] minor: removing line in doc
9a90c93 [Manish Amde] Merge branch 'master' into multiclass
1892a2c [Manish Amde] tests and use multiclass binaggregate length when atleast one categorical feature is present
f5f6b83 [Manish Amde] multiclass for continous variables
8cfd3b6 [Manish Amde] working for categorical multiclass classification
828ff16 [Manish Amde] added categorical variable test
bce835f [Manish Amde] code cleanup
7e5f08c [Manish Amde] minor doc
1dd2735 [Manish Amde] bin search logic for multiclass
f16a9bb [Manish Amde] fixing while loop
d811425 [Manish Amde] multiclass bin aggregate logic
ab5cb21 [Manish Amde] multiclass logic
d8e4a11 [Manish Amde] sample weights
ed5a2df [Manish Amde] fixed classification requirements
d012be7 [Manish Amde] fixed while loop
18d2835 [Manish Amde] changing default values for num classes
6b912dc [Manish Amde] added numclasses to tree runner, predict logic for multiclass, add multiclass option to train
75f2bfc [Manish Amde] minor code style fix
e547151 [Manish Amde] minor modifications
34549d0 [Manish Amde] fixing error during merge
098e8c5 [Manish Amde] merged master
e006f9d [Manish Amde] changing variable names
5c78e1a [Manish Amde] added multiclass support
6c7af22 [Manish Amde] prepared for multiclass without breaking binary classification
46e06ee [Manish Amde] minor mods
3f85a17 [Manish Amde] tests for multiclass classification
4d5f70c [Manish Amde] added multiclass support for find splits bins
46f909c [Manish Amde] todo for multiclass support
455bea9 [Manish Amde] fixed tests
14aea48 [Manish Amde] changing instance format to weighted labeled point
a1a6e09 [Manish Amde] added weighted point class
968ca9d [Manish Amde] merged master
7fc9545 [Manish Amde] added docs
ce004a1 [Manish Amde] minor formatting
b27ad2c [Manish Amde] formatting
426bb28 [Manish Amde] programming guide blurb
8053fed [Manish Amde] more formatting
5eca9e4 [Manish Amde] grammar
4731cda [Manish Amde] formatting
5e82202 [Manish Amde] added documentation, fixed off by 1 error in max level calculation
cbd9f14 [Manish Amde] modified scala.math to math
dad9652 [Manish Amde] removed unused imports
e0426ee [Manish Amde] renamed parameter
718506b [Manish Amde] added unit test
1517155 [Manish Amde] updated documentation
9dbdabe [Manish Amde] merge from master
719d009 [Manish Amde] updating user documentation
fecf89a [manishamde] Merge pull request #6 from etrain/deep_tree
0287772 [Evan Sparks] Fixing scalastyle issue.
2f1e093 [Manish Amde] minor: added doc for maxMemory parameter
2f6072c [manishamde] Merge pull request #5 from etrain/deep_tree
abc5a23 [Evan Sparks] Parameterizing max memory.
50b143a [Manish Amde] adding support for very deep trees
2014-07-18 14:00:13 -07:00
Xiangrui Meng 96f28c9726 [SPARK-2522] set default broadcast factory to torrent
HttpBroadcastFactory is the current default broadcast factory. It sends the broadcast data to each worker one by one, which is slow when the cluster is big. TorrentBroadcastFactory scales much better than http. Maybe we should make torrent the default broadcast method.

Author: Xiangrui Meng <meng@databricks.com>

Closes #1437 from mengxr/bt-broadcast and squashes the following commits:

ed492fe [Xiangrui Meng] set default broadcast factory to torrent
2014-07-16 11:27:51 -07:00
Reynold Xin e7ec815d9a Added LZ4 to compression codec in configuration page.
Author: Reynold Xin <rxin@apache.org>

Closes #1417 from rxin/lz4 and squashes the following commits:

472f6a1 [Reynold Xin] Set the proper default.
9cf0b2f [Reynold Xin] Added LZ4 to compression codec in configuration page.
2014-07-15 13:13:33 -07:00
witgo 9dd635eb5d SPARK-2480: Resolve sbt warnings "NOTE: SPARK_YARN is deprecated, please use -Pyarn flag"
Author: witgo <witgo@qq.com>

Closes #1404 from witgo/run-tests and squashes the following commits:

f703aee [witgo] fix Note: implicit method fromPairDStream is not applicable here because it comes after the application point and it lacks an explicit result type
2944f51 [witgo] Remove "NOTE: SPARK_YARN is deprecated, please use -Pyarn flag"
ef59c70 [witgo] fix Note: implicit method fromPairDStream is not applicable here because it comes after the application point and it lacks an explicit result type
6cefee5 [witgo] Remove "NOTE: SPARK_YARN is deprecated, please use -Pyarn flag"
2014-07-15 10:46:17 -07:00
Reynold Xin dd95abada7 [SPARK-2399] Add support for LZ4 compression.
Based on Greg Bowyer's patch from JIRA https://issues.apache.org/jira/browse/SPARK-2399

Author: Reynold Xin <rxin@apache.org>

Closes #1416 from rxin/lz4 and squashes the following commits:

6c8fefe [Reynold Xin] Fixed typo.
8a14d38 [Reynold Xin] [SPARK-2399] Add support for LZ4 compression.
2014-07-15 01:46:57 -07:00
li-zhihui 3dd8af7a66 [SPARK-1946] Submit tasks after (configured ratio) executors have been registered
Because submitting tasks and registering executors are asynchronous, in most situation, early stages' tasks run without preferred locality.

A simple solution is sleeping few seconds in application, so that executors have enough time to register.

The PR add 2 configuration properties to make TaskScheduler submit tasks after a few of executors have been registered.

\# Submit tasks only after (registered executors / total executors) arrived the ratio, default value is 0
spark.scheduler.minRegisteredExecutorsRatio = 0.8

\# Whatever minRegisteredExecutorsRatio is arrived, submit tasks after the maxRegisteredWaitingTime(millisecond), default value is 30000
spark.scheduler.maxRegisteredExecutorsWaitingTime = 5000

Author: li-zhihui <zhihui.li@intel.com>

Closes #900 from li-zhihui/master and squashes the following commits:

b9f8326 [li-zhihui] Add logs & edit docs
1ac08b1 [li-zhihui] Add new configs to user docs
22ead12 [li-zhihui] Move waitBackendReady to postStartHook
c6f0522 [li-zhihui] Bug fix: numExecutors wasn't set & use constant DEFAULT_NUMBER_EXECUTORS
4d6d847 [li-zhihui] Move waitBackendReady to TaskSchedulerImpl.start & some code refactor
0ecee9a [li-zhihui] Move waitBackendReady from DAGScheduler.submitStage to TaskSchedulerImpl.submitTasks
4261454 [li-zhihui] Add docs for new configs & code style
ce0868a [li-zhihui] Code style, rename configuration property name of minRegisteredRatio & maxRegisteredWaitingTime
6cfb9ec [li-zhihui] Code style, revert default minRegisteredRatio of yarn to 0, driver get --num-executors in yarn/alpha
812c33c [li-zhihui] Fix driver lost --num-executors option in yarn-cluster mode
e7b6272 [li-zhihui] support yarn-cluster
37f7dc2 [li-zhihui] support yarn mode(percentage style)
3f8c941 [li-zhihui] submit stage after (configured ratio of) executors have been registered
2014-07-14 15:32:49 -05:00
Sean Owen 635888cbed SPARK-2363. Clean MLlib's sample data files
(Just made a PR for this, mengxr was the reporter of:)

MLlib has sample data under serveral folders:
1) data/mllib
2) data/
3) mllib/data/*
Per previous discussion with Matei Zaharia, we want to put them under `data/mllib` and clean outdated files.

Author: Sean Owen <sowen@cloudera.com>

Closes #1394 from srowen/SPARK-2363 and squashes the following commits:

54313dd [Sean Owen] Move ML example data from /mllib/data/ and /data/ into /data/mllib/
2014-07-13 19:27:43 -07:00
Issac Buenrostro 2dd6724850 [SPARK-1341] [Streaming] Throttle BlockGenerator to limit rate of data consumption.
Author: Issac Buenrostro <buenrostro@ooyala.com>

Closes #945 from ibuenros/SPARK-1341-throttle and squashes the following commits:

5514916 [Issac Buenrostro] Formatting changes, added documentation for streaming throttling, stricter unit tests for throttling.
62f395f [Issac Buenrostro] Add comments and license to streaming RateLimiter.scala
7066438 [Issac Buenrostro] Moved throttle code to RateLimiter class, smoother pushing when throttling active
ccafe09 [Issac Buenrostro] Throttle BlockGenerator to limit rate of data consumption.
2014-07-10 16:01:08 -07:00
CrazyJvm b520b6453e SPARK-2400 : fix spark.yarn.max.executor.failures explaination
According to
```scala
  private val maxNumExecutorFailures = sparkConf.getInt("spark.yarn.max.executor.failures",
    sparkConf.getInt("spark.yarn.max.worker.failures", math.max(args.numExecutors * 2, 3)))
```
default value should be numExecutors * 2, with minimum of 3,  and it's same to the config
`spark.yarn.max.worker.failures`

Author: CrazyJvm <crazyjvm@gmail.com>

Closes #1282 from CrazyJvm/yarn-doc and squashes the following commits:

1a5f25b [CrazyJvm] remove deprecated config
c438aec [CrazyJvm] fix style
86effa6 [CrazyJvm] change expression
211f130 [CrazyJvm] fix html tag
2900d23 [CrazyJvm] fix style
a4b2e27 [CrazyJvm] fix configuration spark.yarn.max.executor.failures
2014-07-08 13:55:42 -05:00
Rishi Verma 0128905eea Updated programming-guide.md
Made sure that readers know the random number generator seed argument, within the 'takeSample' method, is optional.

Author: Rishi Verma <riverma@apache.org>

Closes #1324 from riverma/patch-1 and squashes the following commits:

4699676 [Rishi Verma] Updated programming-guide.md
2014-07-08 00:29:23 -07:00
Clément MATHIEU fdc4c112e7 Streaming programming guide typos
Fix a bad Java code sample and a broken link in the streaming programming guide.

Author: Clément MATHIEU <clement@unportant.info>

Closes #1286 from cykl/streaming-programming-guide-typos and squashes the following commits:

b0908cb [Clément MATHIEU] Fix broken URL
9d3c535 [Clément MATHIEU] Spark streaming requires at least two working threads (scala version was OK)
2014-07-03 18:31:18 -07:00
Sean Owen 04fa1223ee SPARK-2293. Replace RDD.zip usage by map with predict inside.
This is the only occurrence of this pattern in the examples that needs to be replaced. It only addresses the example change.

Author: Sean Owen <sowen@cloudera.com>

Closes #1250 from srowen/SPARK-2293 and squashes the following commits:

6b1b28c [Sean Owen] Compute prediction-and-label RDD directly rather than by zipping, for efficiency
2014-06-30 16:03:38 -07:00
Matthew Farrellee 0e0686d3ef [SPARK-2003] Fix python SparkContext example
Author: Matthew Farrellee <matt@redhat.com>

Closes #1246 from mattf/SPARK-2003 and squashes the following commits:

b12e7ca [Matthew Farrellee] [SPARK-2003] Fix python SparkContext example
2014-06-27 18:20:33 -07:00
Andrew Or f17510e371 [SPARK-2259] Fix highly misleading docs on cluster / client deploy modes
The existing docs are highly misleading. For standalone mode, for example, it encourages the user to use standalone-cluster mode, which is not officially supported. The safeguards have been added in Spark submit itself to prevent bad documentation from leading users down the wrong path in the future.

This PR is prompted by countless headaches users of Spark have run into on the mailing list.

Author: Andrew Or <andrewor14@gmail.com>

Closes #1200 from andrewor14/submit-docs and squashes the following commits:

5ea2460 [Andrew Or] Rephrase cluster vs client explanation
c827f32 [Andrew Or] Clarify spark submit messages
9f7ed8f [Andrew Or] Clarify client vs cluster deploy mode + add safeguards
2014-06-27 16:11:31 -07:00
Reynold Xin ac06a85da5 Replace doc reference to Shark with Spark SQL. 2014-06-25 01:01:23 -07:00
Reynold Xin 420c1c3e1b [SPARK-2252] Fix MathJax for HTTPs.
Found out about this from the Hacker News link to GraphX which was using HTTPs.

@mengxr

Author: Reynold Xin <rxin@apache.org>

Closes #1189 from rxin/mllib-doc and squashes the following commits:

5328be0 [Reynold Xin] [SPARK-2252] Fix MathJax for HTTPs.
2014-06-23 23:18:47 -07:00
Marcelo Vanzin 21ddd7d1e9 [SPARK-1768] History server enhancements.
Two improvements to the history server:

- Separate the HTTP handling from history fetching, so that it's easy to add
  new backends later (thinking about SPARK-1537 in the long run)

- Avoid loading all UIs in memory. Do lazy loading instead, keeping a few in
  memory for faster access. This allows the app limit to go away, since holding
  just the listing in memory shouldn't be too expensive unless the user has millions
  of completed apps in the history (at which point I'd expect other issues to arise
  aside from history server memory usage, such as FileSystem.listStatus()
  starting to become ridiculously expensive).

I also fixed a few minor things along the way which aren't really worth mentioning.
I also removed the app's log path from the UI since that information may not even
exist depending on which backend is used (even though there is only one now).

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #718 from vanzin/hist-server and squashes the following commits:

53620c9 [Marcelo Vanzin] Add mima exclude, fix scaladoc wording.
c21f8d8 [Marcelo Vanzin] Feedback: formatting, docs.
dd8cc4b [Marcelo Vanzin] Standardize on using spark.history.* configuration.
4da3a52 [Marcelo Vanzin] Remove UI from ApplicationHistoryInfo.
2a7f68d [Marcelo Vanzin] Address review feedback.
4e72c77 [Marcelo Vanzin] Remove comment about ordering.
249bcea [Marcelo Vanzin] Remove offset / count from provider interface.
ca5d320 [Marcelo Vanzin] Remove code that deals with unfinished apps.
6e2432f [Marcelo Vanzin] Second round of feedback.
b2c570a [Marcelo Vanzin] Make class package-private.
4406f61 [Marcelo Vanzin] Cosmetic change to listing header.
e852149 [Marcelo Vanzin] Initialize new app array to expected size.
e8026f4 [Marcelo Vanzin] Review feedback.
49d2fd3 [Marcelo Vanzin] Fix a comment.
91e96ca [Marcelo Vanzin] Fix scalastyle issues.
6fbe0d8 [Marcelo Vanzin] Better handle failures when loading app info.
eee2f5a [Marcelo Vanzin] Ensure server.stop() is called when shutting down.
bda2fa1 [Marcelo Vanzin] Rudimentary paging support for the history UI.
b284478 [Marcelo Vanzin] Separate history server from history backend.
2014-06-23 13:53:44 -07:00
Vlad b88238faee Fixed small running on YARN docs typo
The backslash is needed for multiline command

Author: Vlad <frolvlad@gmail.com>

Closes #1158 from frol/patch-1 and squashes the following commits:

e258044 [Vlad] Fixed small running on YARN docs typo
2014-06-23 10:55:49 -05:00
Marcelo Vanzin e380767de3 [SPARK-1395] Fix "local:" URI support in Yarn mode (again).
Recent changes ignored the fact that path may be defined with "local:"
URIs, which means they need to be explicitly added to the classpath
everywhere a remote process is started. This change fixes that by:

- Using the correct methods to add paths to the classpath
- Creating SparkConf settings for the Spark jar itself and for the
  user's jar
- Propagating those two settings to the remote processes where needed

This ensures that both in client and in cluster mode, the driver has
the necessary info to build the executor's classpath and have things
still work when they contain "local:" references.

The change also fixes some confusion in ClientBase about whether
to use SparkConf or system properties to propagate config options to
the driver and executors, by standardizing on using data held by
SparkConf.

On the cleanup front, I removed the hacky way that log4j configuration
was being propagated to handle the "local:" case. It's much more cleanly
(and generically) handled by using spark-submit arguments (--files to
upload a config file, or setting spark.executor.extraJavaOptions to pass
JVM arguments and use a local file).

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #560 from vanzin/yarn-local-2 and squashes the following commits:

4e7f066 [Marcelo Vanzin] Correctly propagate SPARK_JAVA_OPTS to driver/executor.
6a454ea [Marcelo Vanzin] Use constants for PWD in test.
6dd5943 [Marcelo Vanzin] Fix propagation of config options to driver / executor.
b2e377f [Marcelo Vanzin] Review feedback.
93c3f85 [Marcelo Vanzin] Fix ClassCastException in test.
e5c682d [Marcelo Vanzin] Fix cluster mode, restore SPARK_LOG4J_CONF.
1dfbb40 [Marcelo Vanzin] Add documentation for spark.yarn.jar.
bbdce05 [Marcelo Vanzin] [SPARK-1395] Fix "local:" URI support in Yarn mode (again).
2014-06-23 08:51:11 -05:00
Sean Owen 1db9cbc336 SPARK-1996. Remove use of special Maven repo for Akka
Just following up Matei's suggestion to remove the Akka repo references. Builds and the audit-release script appear OK.

Author: Sean Owen <sowen@cloudera.com>

Closes #1170 from srowen/SPARK-1996 and squashes the following commits:

5ca2930 [Sean Owen] Remove outdated Akka repository references
2014-06-21 23:29:57 -07:00
witgo bce0897bc6 [SPARK-2051]In yarn.ClientBase spark.yarn.dist.* do not work
Author: witgo <witgo@qq.com>

Closes #969 from witgo/yarn_ClientBase and squashes the following commits:

8117765 [witgo] review commit
3bdbc52 [witgo] Merge branch 'master' of https://github.com/apache/spark into yarn_ClientBase
5261b6c [witgo] fix sys.props.get("SPARK_YARN_DIST_FILES")
e3c1107 [witgo] update docs
b6a9aa1 [witgo] merge master
c8b4554 [witgo] review commit
2f48789 [witgo] Merge branch 'master' of https://github.com/apache/spark into yarn_ClientBase
8d7b82f [witgo] Merge branch 'master' of https://github.com/apache/spark into yarn_ClientBase
1048549 [witgo] remove Utils.resolveURIs
871f1db [witgo] add spark.yarn.dist.* documentation
41bce59 [witgo] review commit
35d6fa0 [witgo] move to ClientArguments
55d72fc [witgo] Merge branch 'master' of https://github.com/apache/spark into yarn_ClientBase
9cdff16 [witgo] review commit
8bc2f4b [witgo] review commit
20e667c [witgo] Merge branch 'master' into yarn_ClientBase
0961151 [witgo] merge master
ce609fc [witgo] Merge branch 'master' into yarn_ClientBase
8362489 [witgo] yarn.ClientBase spark.yarn.dist.* do not work
2014-06-19 12:11:26 -05:00
Yin Huai d2f4f30b12 [SPARK-2060][SQL] Querying JSON Datasets with SQL and DSL in Spark SQL
JIRA: https://issues.apache.org/jira/browse/SPARK-2060

Programming guide: http://yhuai.github.io/site/sql-programming-guide.html

Scala doc of SQLContext: http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.sql.SQLContext

Author: Yin Huai <huai@cse.ohio-state.edu>

Closes #999 from yhuai/newJson and squashes the following commits:

227e89e [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
ce8eedd [Yin Huai] rxin's comments.
bc9ac51 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
94ffdaa [Yin Huai] Remove "get" from method names.
ce31c81 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
e2773a6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
79ea9ba [Yin Huai] Fix typos.
5428451 [Yin Huai] Newline
1f908ce [Yin Huai] Remove extra line.
d7a005c [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
7ea750e [Yin Huai] marmbrus's comments.
6a5f5ef [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
83013fb [Yin Huai] Update Java Example.
e7a6c19 [Yin Huai] SchemaRDD.javaToPython should convert a field with the StructType to a Map.
6d20b85 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
4fbddf0 [Yin Huai] Programming guide.
9df8c5a [Yin Huai] Python API.
7027634 [Yin Huai] Java API.
cff84cc [Yin Huai] Use a SchemaRDD for a JSON dataset.
d0bd412 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
ab810b0 [Yin Huai] Make JsonRDD private.
6df0891 [Yin Huai] Apache header.
8347f2e [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
66f9e76 [Yin Huai] Update docs and use the entire dataset to infer the schema.
8ffed79 [Yin Huai] Update the example.
a5a4b52 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
4325475 [Yin Huai] If a sampled dataset is used for schema inferring, update the schema of the JsonTable after first execution.
65b87f0 [Yin Huai] Fix sampling...
8846af5 [Yin Huai] API doc.
52a2275 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
0387523 [Yin Huai] Address PR comments.
666b957 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
a2313a6 [Yin Huai] Address PR comments.
f3ce176 [Yin Huai] After type conflict resolution, if a NullType is found, StringType is used.
0576406 [Yin Huai] Add Apache license header.
af91b23 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
f45583b [Yin Huai] Infer the schema of a JSON dataset (a text file with one JSON object per line or a RDD[String] with one JSON object per string) and returns a SchemaRDD.
f31065f [Yin Huai] A query plan or a SchemaRDD can print out its schema.
2014-06-17 19:14:59 -07:00
Anatoli Fomenko 7afa912e74 MLlib documentation fix
Synchronized mllib-optimization.md with Spark Scaladoc: removed reference to GradientDescent.runMiniBatchSGD method

This is a temporary fix to remove  a link from http://spark.apache.org/docs/latest/mllib-optimization.html to GradientDescent.runMiniBatchSGD which is not in the current online GradientDescent Scaladoc.
FIXME: revert this commit after GradientDescent Scaladoc is updated.
See images for details.

![mllib-docs-fix-1](https://cloud.githubusercontent.com/assets/1375501/3294410/ccf19bb8-f5a8-11e3-93f1-f593016209eb.png)
![mllib-docs-fix-2](https://cloud.githubusercontent.com/assets/1375501/3294411/d0b59a7e-f5a8-11e3-8fc8-329c177ef8c8.png)

Author: Anatoli Fomenko <fa@apache.org>

Closes #1098 from afomenko/master and squashes the following commits:

5cb0758 [Anatoli Fomenko] MLlib documentation fix
2014-06-16 23:10:36 -07:00
witgo cdf2b04570 [SPARK-1930] The Container is running beyond physical memory limits, so as to be killed
Author: witgo <witgo@qq.com>

Closes #894 from witgo/SPARK-1930 and squashes the following commits:

564307e [witgo] Update the running-on-yarn.md
3747515 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1930
172647b [witgo] add memoryOverhead docs
a0ff545 [witgo] leaving only two configs
a17bda2 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1930
478ca15 [witgo] Merge branch 'master' into SPARK-1930
d1244a1 [witgo] Merge branch 'master' into SPARK-1930
8b967ae [witgo] Merge branch 'master' into SPARK-1930
655a820 [witgo] review commit
71859a7 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1930
e3c531d [witgo] review commit
e16f190 [witgo] different memoryOverhead
ffa7569 [witgo] review commit
5c9581f [witgo] Merge branch 'master' into SPARK-1930
9a6bcf2 [witgo] review commit
8fae45a [witgo] fix NullPointerException
e0dcc16 [witgo] Adding  configuration items
b6a989c [witgo] Fix container memory beyond limit, were killed
2014-06-16 14:27:31 -05:00
Ali Ghodsi 119b06a04f Updating docs to include missing information about reducers and clarify ...
...how the OFFHEAP storage level works (there has been confusion around this).

Author: Ali Ghodsi <alig@cs.berkeley.edu>

Closes #1089 from alig/master and squashes the following commits:

ca8114d [Ali Ghodsi] Updating docs to include missing information about reducers and clarify how the OFFHEAP storage level works (there has been confusion around this).
2014-06-15 23:44:30 -07:00
Andrew Ash 9672ee07fb SPARK-2148 Add link to requirements for custom equals() and hashcode() methods
https://issues.apache.org/jira/browse/SPARK-2148

Author: Andrew Ash <andrew@andrewash.com>

Closes #1092 from ash211/SPARK-2148 and squashes the following commits:

93513df [Andrew Ash] SPARK-2148 Add link to requirements for custom equals() and hashcode() methods
2014-06-15 23:33:22 -07:00
Kan Zhang b52603b039 [SPARK-2013] Documentation for saveAsPickleFile and pickleFile in Python
Author: Kan Zhang <kzhang@apache.org>

Closes #983 from kanzhang/SPARK-2013 and squashes the following commits:

0e128bb [Kan Zhang] [SPARK-2013] minor update
e728516 [Kan Zhang] [SPARK-2013] Documentation for saveAsPickleFile and pickleFile in Python
2014-06-14 13:22:30 -07:00
akkomar edb1f0e316 Small correction in Streaming Programming Guide doc
Corrected description of `repartition` function under 'Level of Parallelism in Data Receiving'.

Author: akkomar <ak.komar@gmail.com>

Closes #1079 from akkomar/streaming-guide-doc and squashes the following commits:

32dfc62 [akkomar] Corrected description of `repartition` function under 'Level of Parallelism in Data Receiving'.
2014-06-13 15:37:26 -07:00
Sandy Ryza ce92a9c18f SPARK-554. Add aggregateByKey.
Author: Sandy Ryza <sandy@cloudera.com>

Closes #705 from sryza/sandy-spark-554 and squashes the following commits:

2302b8f [Sandy Ryza] Add MIMA exclude
f52e0ad [Sandy Ryza] Fix Python tests for real
2f3afa3 [Sandy Ryza] Fix Python test
0b735e9 [Sandy Ryza] Fix line lengths
ae56746 [Sandy Ryza] Fix doc (replace T with V)
c2be415 [Sandy Ryza] Java and Python aggregateByKey
23bf400 [Sandy Ryza] SPARK-554.  Add aggregateByKey.
2014-06-12 08:14:25 -07:00
Tathagata Das 4823bf470e [SPARK-1940] Enabling rolling of executor logs, and automatic cleanup of old executor logs
Currently, in the default log4j configuration, all the executor logs get sent to the file <code>[executor-working-dir]/stderr</code>. This does not all log files to be rolled, so old logs cannot be removed.

Using log4j RollingFileAppender allows log4j logs to be rolled, but all the logs get sent to a different set of files, other than the files <code>stdout</code> and <code>stderr</code> . So the logs are not visible in the Spark web UI any more as Spark web UI only reads the files <code>stdout</code> and <code>stderr</code>. Furthermore, it still does not allow the stdout and stderr to be cleared periodically in case a large amount of stuff gets written to them (e.g. by explicit `println` inside map function).

This PR solves this by implementing a simple `RollingFileAppender` within Spark (disabled by default). When enabled (using configuration parameter `spark.executor.rollingLogs.enabled`), the logs can get rolled over either by time interval (set with `spark.executor.rollingLogs.interval`, set to daily by default), or by size of logs (set with  `spark.executor.rollingLogs.size`). Finally, old logs can be automatically deleted by specifying how many of the latest log files to keep (set with `spark.executor.rollingLogs.keepLastN`).  The web UI has also been modified to show the logs across the rolled-over files.

You can test this locally (without waiting a whole day) by setting  configuration `spark.executor.rollingLogs.enabled=true` and `spark.executor.rollingLogs.interval=minutely`. Continuously generate logs by running spark jobs and the generated logs files would look like this (`stderr` and `stdout` are the most current log file that are being written to).

```
stderr
stderr--2014-05-27--14-37
stderr--2014-05-27--14-47
stderr--2014-05-27--15-05
stdout
stdout--2014-05-27--14-47
```

The web ui should show logs across these files.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #895 from tdas/rolling-logs and squashes the following commits:

fd8f87f [Tathagata Das] Minor change.
d326aee [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into rolling-logs
ad956c1 [Tathagata Das] Scala style fix.
1f0a6ec [Tathagata Das] Some more changes based on Patrick's PR comments.
c8bfe4e [Tathagata Das] Refactore FileAppender to a package spark.util.logging and broke up the file into multiple files. Changed configuration parameter names.
4224409 [Tathagata Das] Style fix.
108a9f8 [Tathagata Das] Added better constraint handling for rolling policies.
f7da977 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into rolling-logs
9134495 [Tathagata Das] Simplified rolling logs by removing Daily/Hourly/MinutelyRollingFileAppender, and removing the setting rollingLogs.enabled
312d874 [Tathagata Das] Minor fixes based on PR comments.
8a67d83 [Tathagata Das] Fixed comments.
b36cfd6 [Tathagata Das] Implemented RollingPolicy, TimeBasedRollingPolicy and SizeBasedRollingPolicy, and changed RollingFileAppender accordingly.
b7e8272 [Tathagata Das] Style fix,
374c9a9 [Tathagata Das] Added missing license.
24354ea [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into rolling-logs
6cc09c7 [Tathagata Das] Fixed bugs in rolling logs, and added more debug statements.
adf4910 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into rolling-logs
931f8fb [Tathagata Das] Changed log viewer in Spark web UI to handle rolling log files.
cb4fb6d [Tathagata Das] Added FileAppender and RollingFileAppender to generate rolling executor logs.
2014-06-10 20:22:02 -07:00
Nick Pentreath f971d6cb60 SPARK-1416: PySpark support for SequenceFile and Hadoop InputFormats
So I finally resurrected this PR. It seems the old one against the incubator mirror is no longer available, so I cannot reference it.

This adds initial support for reading Hadoop ```SequenceFile```s, as well as arbitrary Hadoop ```InputFormat```s, in PySpark.

# Overview
The basics are as follows:
1. ```PythonRDD``` object contains the relevant methods, that are in turn invoked by ```SparkContext``` in PySpark
2. The SequenceFile or InputFormat is read on the Scala side and converted from ```Writable``` instances to the relevant Scala classes (in the case of primitives)
3. Pyrolite is used to serialize Java objects. If this fails, the fallback is ```toString```
4. ```PickleSerializer``` on the Python side deserializes.

This works "out the box" for simple ```Writable```s:
* ```Text```
* ```IntWritable```, ```DoubleWritable```, ```FloatWritable```
* ```NullWritable```
* ```BooleanWritable```
* ```BytesWritable```
* ```MapWritable```

It also works for simple, "struct-like" classes. Due to the way Pyrolite works, this requires that the classes satisfy the JavaBeans convenstions (i.e. with fields and a no-arg constructor and getters/setters). (Perhaps in future some sugar for case classes and reflection could be added).

I've tested it out with ```ESInputFormat```  as an example and it works very nicely:
```python
conf = {"es.resource" : "index/type" }
rdd = sc.newAPIHadoopRDD("org.elasticsearch.hadoop.mr.EsInputFormat", "org.apache.hadoop.io.NullWritable", "org.elasticsearch.hadoop.mr.LinkedMapWritable", conf=conf)
rdd.first()
```

I suspect for things like HBase/Cassandra it will be a bit trickier to get it to work out the box.

# Some things still outstanding:
1. ~~Requires ```msgpack-python``` and will fail without it. As originally discussed with Josh, add a ```as_strings``` argument that defaults to ```False```, that can be used if ```msgpack-python``` is not available~~
2. ~~I see from https://github.com/apache/spark/pull/363 that Pyrolite is being used there for SerDe between Scala and Python. @ahirreddy @mateiz what is the plan behind this - is Pyrolite preferred? It seems from a cursory glance that adapting the ```msgpack```-based SerDe here to use Pyrolite wouldn't be too hard~~
3. ~~Support the key and value "wrapper" that would allow a Scala/Java function to be plugged in that would transform whatever the key/value Writable class is into something that can be serialized (e.g. convert some custom Writable to a JavaBean or ```java.util.Map``` that can be easily serialized)~~
4. Support ```saveAsSequenceFile``` and ```saveAsHadoopFile``` etc. This would require SerDe in the reverse direction, that can be handled by Pyrolite. Will work on this as a separate PR

Author: Nick Pentreath <nick.pentreath@gmail.com>

Closes #455 from MLnick/pyspark-inputformats and squashes the following commits:

268df7e [Nick Pentreath] Documentation changes mer @pwendell comments
761269b [Nick Pentreath] Address @pwendell comments, simplify default writable conversions and remove registry.
4c972d8 [Nick Pentreath] Add license headers
d150431 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
cde6af9 [Nick Pentreath] Parameterize converter trait
5ebacfa [Nick Pentreath] Update docs for PySpark input formats
a985492 [Nick Pentreath] Move Converter examples to own package
365d0be [Nick Pentreath] Make classes private[python]. Add docs and @Experimental annotation to Converter interface.
eeb8205 [Nick Pentreath] Fix path relative to SPARK_HOME in tests
1eaa08b [Nick Pentreath] HBase -> Cassandra app name oversight
3f90c3e [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
2c18513 [Nick Pentreath] Add examples for reading HBase and Cassandra InputFormats from Python
b65606f [Nick Pentreath] Add converter interface
5757f6e [Nick Pentreath] Default key/value classes for sequenceFile asre None
085b55f [Nick Pentreath] Move input format tests to tests.py and clean up docs
43eb728 [Nick Pentreath] PySpark InputFormats docs into programming guide
94beedc [Nick Pentreath] Clean up args in PythonRDD. Set key/value converter defaults to None for PySpark context.py methods
1a4a1d6 [Nick Pentreath] Address @mateiz style comments
01e0813 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
15a7d07 [Nick Pentreath] Remove default args for key/value classes. Arg names to camelCase
9fe6bd5 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
84fe8e3 [Nick Pentreath] Python programming guide space formatting
d0f52b6 [Nick Pentreath] Python programming guide
7caa73a [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
93ef995 [Nick Pentreath] Add back context.py changes
9ef1896 [Nick Pentreath] Recover earlier changes lost in previous merge for serializers.py
077ecb2 [Nick Pentreath] Recover earlier changes lost in previous merge for context.py
5af4770 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
35b8e3a [Nick Pentreath] Another fix for test ordering
bef3afb [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
e001b94 [Nick Pentreath] Fix test failures due to ordering
78978d9 [Nick Pentreath] Add doc for SequenceFile and InputFormat support to Python programming guide
64eb051 [Nick Pentreath] Scalastyle fix
e7552fa [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
44f2857 [Nick Pentreath] Remove msgpack dependency and switch serialization to Pyrolite, plus some clean up and refactoring
c0ebfb6 [Nick Pentreath] Change sequencefile test data generator to easily be called from PySpark tests
1d7c17c [Nick Pentreath] Amend tests to auto-generate sequencefile data in temp dir
17a656b [Nick Pentreath] remove binary sequencefile for tests
f60959e [Nick Pentreath] Remove msgpack dependency and serializer from PySpark
450e0a2 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
31a2fff [Nick Pentreath] Scalastyle fixes
fc5099e [Nick Pentreath] Add Apache license headers
4e08983 [Nick Pentreath] Clean up docs for PySpark context methods
b20ec7e [Nick Pentreath] Clean up merge duplicate dependencies
951c117 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
f6aac55 [Nick Pentreath] Bring back msgpack
9d2256e [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
1bbbfb0 [Nick Pentreath] Clean up SparkBuild from merge
a67dfad [Nick Pentreath] Clean up Msgpack serialization and registering
7237263 [Nick Pentreath] Add back msgpack serializer and hadoop file code lost during merging
25da1ca [Nick Pentreath] Add generator for nulls, bools, bytes and maps
65360d5 [Nick Pentreath] Adding test SequenceFiles
0c612e5 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
d72bf18 [Nick Pentreath] msgpack
dd57922 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
e67212a [Nick Pentreath] Add back msgpack dependency
f2d76a0 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
41856a5 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
97ef708 [Nick Pentreath] Remove old writeToStream
2beeedb [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
795a763 [Nick Pentreath] Change name to WriteInputFormatTestDataGenerator. Cleanup some var names. Use SPARK_HOME in path for writing test sequencefile data.
174f520 [Nick Pentreath] Add back graphx settings
703ee65 [Nick Pentreath] Add back msgpack
619c0fa [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
1c8efbc [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
eb40036 [Nick Pentreath] Remove unused comment lines
4d7ef2e [Nick Pentreath] Fix indentation
f1d73e3 [Nick Pentreath] mergeConfs returns a copy rather than mutating one of the input arguments
0f5cd84 [Nick Pentreath] Remove unused pair UTF8 class. Add comments to msgpack deserializer
4294cbb [Nick Pentreath] Add old Hadoop api methods. Clean up and expand comments. Clean up argument names
818a1e6 [Nick Pentreath] Add seqencefile and Hadoop InputFormat support to PythonRDD
4e7c9e3 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
c304cc8 [Nick Pentreath] Adding supporting sequncefiles for tests. Cleaning up
4b0a43f [Nick Pentreath] Refactoring utils into own objects. Cleaning up old commented-out code
d86325f [Nick Pentreath] Initial WIP of PySpark support for SequenceFile and arbitrary Hadoop InputFormat
2014-06-09 22:21:03 -07:00
CodingCat 89cdbb087c SPARK-1677: allow user to disable output dir existence checking
https://issues.apache.org/jira/browse/SPARK-1677

For compatibility with older versions of Spark it would be nice to have an option `spark.hadoop.validateOutputSpecs` (default true)  for the user to disable the output directory existence checking

Author: CodingCat <zhunansjtu@gmail.com>

Closes #947 from CodingCat/SPARK-1677 and squashes the following commits:

7930f83 [CodingCat] miao
c0c0e03 [CodingCat] bug fix and doc update
5318562 [CodingCat] bug fix
13219b5 [CodingCat] allow user to disable output dir existence checking
2014-06-05 11:39:35 -07:00
Henry Saputra 6c044ed100 SPARK-2001 : Remove docs/spark-debugger.md from master
Per discussion in dev list:
"
Seemed like the spark-debugger.md is no longer accurate (see
http://spark.apache.org/docs/latest/spark-debugger.html) and since it
was originally written Spark has evolved that makes the doc obsolete.
There are already work pending for new replay debugging (I could not
find the PR links for it) so I
With version control we could always reinstate the old doc if needed,
but as of today the doc is no longer reflect the current state of
Spark's RDD.
"

Author: Henry Saputra <henry.saputra@gmail.com>

Closes #953 from hsaputra/SPARK-2001-hsaputra and squashes the following commits:

dc324aa [Henry Saputra] SPARK-2001 : Remove docs/spark-debugger.md from master since it is obsolete
2014-06-03 13:03:51 -07:00
Yadid Ayzenberg 366c0c4c30 updated java code blocks in spark SQL guide such that ctx will refer to ...
...a JavaSparkContext and sqlCtx will refer to a JavaSQLContext

Author: Yadid Ayzenberg <yadid@media.mit.edu>

Closes #932 from yadid/master and squashes the following commits:

f92fb3a [Yadid Ayzenberg] updated java code blocks in spark SQL guide such that ctx will refer to a JavaSparkContext and sqlCtx will refer to a JavaSQLContext
2014-05-31 19:44:13 -07:00
CodingCat 41bfdda3cc SPARK-1976: fix the misleading part in streaming docs
Spark streaming requires at least two working threads, but the document gives the example like

import org.apache.spark.api.java.function._
import org.apache.spark.streaming._
import org.apache.spark.streaming.api._
// Create a StreamingContext with a local master
val ssc = new StreamingContext("local", "NetworkWordCount", Seconds(1))
http://spark.apache.org/docs/latest/streaming-programming-guide.html

Author: CodingCat <zhunansjtu@gmail.com>

Closes #924 from CodingCat/master and squashes the following commits:

bb89f20 [CodingCat] update streaming docs
2014-05-30 22:06:08 -07:00
nchammas 23ae36630a updated link to mailing list
Author: nchammas <nicholas.chammas@gmail.com>

Closes #923 from nchammas/patch-1 and squashes the following commits:

65c4d18 [nchammas] updated link to mailing list
2014-05-30 22:04:57 -07:00
Andrew Ash 9c1f204d80 Typo: and -> an
Author: Andrew Ash <andrew@andrewash.com>

Closes #927 from ash211/patch-5 and squashes the following commits:

79b577d [Andrew Ash] Typo: and -> an
2014-05-30 22:02:04 -07:00
Matei Zaharia c8bf4131bc [SPARK-1566] consolidate programming guide, and general doc updates
This is a fairly large PR to clean up and update the docs for 1.0. The major changes are:

* A unified programming guide for all languages replaces language-specific ones and shows language-specific info in tabs
* New programming guide sections on key-value pairs, unit testing, input formats beyond text, migrating from 0.9, and passing functions to Spark
* Spark-submit guide moved to a separate page and expanded slightly
* Various cleanups of the menu system, security docs, and others
* Updated look of title bar to differentiate the docs from previous Spark versions

You can find the updated docs at http://people.apache.org/~matei/1.0-docs/_site/ and in particular http://people.apache.org/~matei/1.0-docs/_site/programming-guide.html.

Author: Matei Zaharia <matei@databricks.com>

Closes #896 from mateiz/1.0-docs and squashes the following commits:

03e6853 [Matei Zaharia] Some tweaks to configuration and YARN docs
0779508 [Matei Zaharia] tweak
ef671d4 [Matei Zaharia] Keep frames in JavaDoc links, and other small tweaks
1bf4112 [Matei Zaharia] Review comments
4414f88 [Matei Zaharia] tweaks
d04e979 [Matei Zaharia] Fix some old links to Java guide
a34ed33 [Matei Zaharia] tweak
541bb3b [Matei Zaharia] miscellaneous changes
fcefdec [Matei Zaharia] Moved submitting apps to separate doc
61d72b4 [Matei Zaharia] stuff
181f217 [Matei Zaharia] migration guide, remove old language guides
e11a0da [Matei Zaharia] Add more API functions
6a030a9 [Matei Zaharia] tweaks
8db0ae3 [Matei Zaharia] Added key-value pairs section
318d2c9 [Matei Zaharia] tweaks
1c81477 [Matei Zaharia] New section on basics and function syntax
e38f559 [Matei Zaharia] Actually added programming guide to Git
a33d6fe [Matei Zaharia] First pass at updating programming guide to support all languages, plus other tweaks throughout
3b6a876 [Matei Zaharia] More CSS tweaks
01ec8bf [Matei Zaharia] More CSS tweaks
e6d252e [Matei Zaharia] Change color of doc title bar to differentiate from 0.9.0
2014-05-30 00:34:33 -07:00
Patrick Wendell 7801d44fd3 Organize configuration docs
This PR improves and organizes the config option page
and makes a few other changes to config docs. See a preview here:
http://people.apache.org/~pwendell/config-improvements/configuration.html

The biggest changes are:
1. The configs for the standalone master/workers were moved to the
standalone page and out of the general config doc.
2. SPARK_LOCAL_DIRS was missing from the standalone docs.
3. Expanded discussion of injecting configs with spark-submit, including an
example.
4. Config options were organized into the following categories:
- Runtime Environment
- Shuffle Behavior
- Spark UI
- Compression and Serialization
- Execution Behavior
- Networking
- Scheduling
- Security
- Spark Streaming

Author: Patrick Wendell <pwendell@gmail.com>

Closes #880 from pwendell/config-cleanup and squashes the following commits:

93f56c3 [Patrick Wendell] Feedback from Matei
6f66efc [Patrick Wendell] More feedback
16ae776 [Patrick Wendell] Adding back header section
d9c264f [Patrick Wendell] Small fix
e0c1728 [Patrick Wendell] Response to Matei's review
27d57db [Patrick Wendell] Reverting changes to index.html (covered in #896)
e230ef9 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into config-cleanup
a374369 [Patrick Wendell] Line wrapping fixes
fdff7fc [Patrick Wendell] Merge remote-tracking branch 'apache/master' into config-cleanup
3289ea4 [Patrick Wendell] Pulling in changes from #856
106ee31 [Patrick Wendell] Small link fix
f7e79bc [Patrick Wendell] Re-organizing config options.
54b184d [Patrick Wendell] Adding standalone configs to the standalone page
592e94a [Patrick Wendell] Stash
29b5446 [Patrick Wendell] Better discussion of spark-submit in configuration docs
2d719ef [Patrick Wendell] Small fix
4af9e07 [Patrick Wendell] Adding SPARK_LOCAL_DIRS docs
204b248 [Patrick Wendell] Small fixes
2014-05-28 15:49:54 -07:00
jmu 82eadc3b07 Fix doc about NetworkWordCount/JavaNetworkWordCount usage of spark streaming
Usage: NetworkWordCount <master> <hostname> <port>
-->
Usage: NetworkWordCount <hostname> <port>

Usage: JavaNetworkWordCount <master> <hostname> <port>
-->
Usage: JavaNetworkWordCount <hostname> <port>

Author: jmu <jmujmu@gmail.com>

Closes #826 from jmu/master and squashes the following commits:

9fb7980 [jmu] Merge branch 'master' of https://github.com/jmu/spark
b9a6b02 [jmu] Fix doc for NetworkWordCount/JavaNetworkWordCount Usage: NetworkWordCount <master> <hostname> <port> --> Usage: NetworkWordCount <hostname> <port>
2014-05-27 22:41:47 -07:00