Commit graph

8265 commits

Nicholas Chammas 1651cc117d [EC2] Cleanup Python parens and disk dict
Minor fixes:
* Remove unnecessary parens (Python style)
* Sort `disks_by_instance` dict and remove duplicate `t1.micro` key

Author: Nicholas Chammas <nicholas.chammas@gmail.com>

Closes #2571 from nchammas/ec2-polish and squashes the following commits:

9d203d5 [Nicholas Chammas] paren and dict cleanup
2014-09-28 21:55:09 -07:00
Joseph K. Bradley 0dc2b6361d [SPARK-1545] [mllib] Add Random Forests
This PR adds RandomForest to MLlib.  The implementation is basic, and future performance optimizations will be important.  (Note: RFs = Random Forests.)

# Overview

## RandomForest
* trains multiple trees at once to reduce the number of passes over the data
* allows feature subsets at each node
* uses a queue of nodes instead of fixed groups for each level

This implementation is based on an implementation by manishamde and the [Alpine Labs Sequoia Forest](https://github.com/AlpineNow/SparkML2) by codedeft (in particular, the TreePoint, BaggedPoint, and node queue implementations).  Thank you for your inputs!
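To make the queue-of-nodes idea concrete, here is a minimal sketch of the training loop; `SimpleNode`, `selectNodesToSplit`, and `findSplitsAndGrow` are hypothetical stand-ins for illustration, not the PR's actual signatures:

```scala
import scala.collection.mutable

// Hypothetical simplified node type; the real code tracks much more state.
case class SimpleNode(id: Int)

// Stub: pick as many queued (treeIndex, node) pairs as fit in the memory
// budget. The real selectNodesToSplit also chooses each node's random
// feature subset here, since the subset size drives the memory estimate.
def selectNodesToSplit(
    queue: mutable.Queue[(Int, SimpleNode)],
    maxMemoryUsage: Long): Array[(Int, SimpleNode)] =
  Array.fill(math.min(queue.size, 8))(queue.dequeue())

// Stub: one pass over the training data computes sufficient statistics for
// the whole batch at once, and returns children that still need splitting.
def findSplitsAndGrow(batch: Array[(Int, SimpleNode)]): Seq[(Int, SimpleNode)] =
  Seq.empty

def trainForest(roots: Array[SimpleNode], maxMemoryUsage: Long): Unit = {
  // One queue shared across all trees, instead of fixed per-level groups.
  val nodeQueue = mutable.Queue[(Int, SimpleNode)]()
  roots.zipWithIndex.foreach { case (root, tree) => nodeQueue.enqueue((tree, root)) }
  while (nodeQueue.nonEmpty) {
    val batch = selectNodesToSplit(nodeQueue, maxMemoryUsage)
    findSplitsAndGrow(batch).foreach(child => nodeQueue.enqueue(child))
  }
}
```

Because the queue mixes nodes from many trees, each pass over the data can serve several trees at once, which is where the multi-model speedup comes from.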

## Testing

Correctness: This has been tested for correctness with the test suites and with DecisionTreeRunner on example datasets.

Performance: This has been performance tested using [this branch of spark-perf](https://github.com/jkbradley/spark-perf/tree/rfs).  Results below.

### Regression tests for DecisionTree

Summary: For training 1 tree, there are small regressions, especially from feature subsampling.

In the table below, each row is a single (random) dataset.  The 2 different sets of result columns are for 2 different RF implementations:
* (numTrees): This is from an earlier commit, after implementing RandomForest to train multiple trees at once.  It does not include any code for feature subsampling.
* (feature subsets): This is from this current PR's code, after implementing feature subsampling.
These tests were to identify regressions in DecisionTree, so they are training 1 tree with all of the features (i.e., no feature subsampling).

These were run on an EC2 cluster with 15 workers, training 1 tree with maxDepth = 5 (= 6 levels).  Speedup values < 1 indicate slowdowns relative to the old DecisionTree implementation.

numInstances | numFeatures | runtime (sec) (numTrees) | speedup (numTrees) | runtime (sec) (feature subsets) | speedup (feature subsets)
---- | ---- | ---- | ---- | ---- | ----
20000 | 100 | 4.051 | 1.044433473 | 4.478 | 0.9448414471
20000 | 500 | 8.472 | 1.104461756 | 9.315 | 1.004508857
20000 | 1500 | 19.354 | 1.05854087 | 20.863 | 0.9819776638
20000 | 3500 | 43.674 | 1.072033704 | 45.887 | 1.020332556
200000 | 100 | 4.196 | 1.171830315 | 4.848 | 1.014232673
200000 | 500 | 8.926 | 1.082791844 | 9.771 | 0.989151571
200000 | 1500 | 20.58 | 1.068415938 | 22.134 | 0.9934038131
200000 | 3500 | 48.043 | 1.075203464 | 52.249 | 0.9886505005
2000000 | 100 | 4.944 | 1.01355178 | 5.796 | 0.8645617667
2000000 | 500 | 11.11 | 1.016831683 | 12.482 | 0.9050632911
2000000 | 1500 | 31.144 | 1.017852556 | 35.274 | 0.8986789136
2000000 | 3500 | 79.981 | 1.085382778 | 101.105 | 0.8586123337
20000000 | 100 | 8.304 | 0.9270231214 | 9.073 | 0.8484514494
20000000 | 500 | 28.174 | 1.083268262 | 34.236 | 0.8914592826
20000000 | 1500 | 143.97 | 0.9579634646 | 159.275 | 0.8659111599

### Tests for forests

I have run other tests with numTrees=10 and with sqrt(numFeatures), and those indicate that multi-model training and feature subsets can speed up training for forests, especially when training deeper trees.

# Details on specific classes

## Changes to DecisionTree
* Main train() method is now in RandomForest.
* findBestSplits() is no longer needed.  (It split levels into groups, but we now use a queue of nodes.)
* Many small changes to support RFs.  (Note: These methods should be moved to RandomForest.scala in a later PR, but are in DecisionTree.scala to make code comparison easier.)

## RandomForest
* Main train() method is from old DecisionTree.
* selectNodesToSplit: Note that it selects nodes and feature subsets jointly to track memory usage.

## RandomForestModel
* Stores an Array[DecisionTreeModel]
* Prediction:
 * For classification, most common label.  For regression, mean.
 * We could support other methods later.
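A minimal sketch of that combining rule, with simplified stand-in types rather than the real MLlib classes:

```scala
// Simplified stand-ins for the real MLlib classes, to show the combining rule.
trait SimpleTreeModel { def predict(features: Array[Double]): Double }

class SimpleForestModel(trees: Array[SimpleTreeModel], isClassification: Boolean) {
  def predict(features: Array[Double]): Double = {
    val predictions = trees.map(_.predict(features))
    if (isClassification) {
      // Classification: the most common label among the trees (majority vote).
      predictions.groupBy(identity).maxBy(_._2.length)._1
    } else {
      // Regression: the mean of the trees' predictions.
      predictions.sum / predictions.length
    }
  }
}
```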

## examples/.../DecisionTreeRunner
* This now takes numTrees and featureSubsetStrategy, to support RFs.

## DTStatsAggregator
* 2 types of functionality (with and without feature subsampling): These require different indexing methods.  (We could treat both as subsampling, but this is less efficient.)
* DTStatsAggregator is now abstract, and 2 child classes implement these 2 types of functionality.

## impurities
* These now take instance weights.
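For instance, a weighted Gini impurity replaces raw label counts with sums of instance weights; this is a sketch of the idea, not the MLlib code:

```scala
// Sketch: Gini impurity over weighted class counts. With bootstrap
// resampling, an instance's weight is the number of times it was sampled.
def weightedGini(weightedClassCounts: Array[Double]): Double = {
  val total = weightedClassCounts.sum
  if (total == 0.0) 0.0
  else 1.0 - weightedClassCounts.map { w => val f = w / total; f * f }.sum
}
```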

## Node
* Some vals changed to vars.
 * This is unfortunately a public API change (DeveloperApi).  This could be avoided by creating a LearningNode struct, but would be awkward.

## RandomForestSuite
Please let me know if there are missing tests!

## BaggedPoint
This wraps TreePoint and holds bootstrap weights/counts.
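As a rough sketch of the idea (type and field names are illustrative, and the Poisson(1) approximation to the bootstrap is an assumption, not necessarily what the PR does):

```scala
import scala.util.Random

// Illustrative: one bootstrap count per tree, so a single cached dataset
// serves every tree in the forest.
case class BaggedPointSketch[T](datum: T, subsampleWeights: Array[Double])

def bag[T](datum: T, numTrees: Int, seed: Long): BaggedPointSketch[T] = {
  val rng = new Random(seed)
  // Approximate bootstrap via Poisson(1) counts, a common trick that avoids
  // a second pass over the data to sample with replacement exactly.
  def poisson1(): Double = {
    var k = 0; var p = rng.nextDouble()
    while (p > math.exp(-1.0)) { k += 1; p *= rng.nextDouble() }
    k.toDouble
  }
  BaggedPointSketch(datum, Array.fill(numTrees)(poisson1()))
}
```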

# Design decisions

* BaggedPoint: BaggedPoint is separate from TreePoint since it may be useful for other bagging algorithms later on.

* RandomForest public API: What options should be easily supported by the train* methods?  Should ALL options be in the Java-friendly constructors?  Should there be a constructor taking Strategy?

* Feature subsampling options: What options should be supported?  scikit-learn supports the same options, except for "onethird."  One option would be to allow users to specify fractions (e.g., "0.1"): the current options could be supported, and any unrecognized values would be parsed as Doubles in [0,1].  (See the parsing sketch after this list.)

* Splits and bins are computed before bootstrapping, so all trees use the same discretization.

* One queue, instead of one queue per tree.
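A sketch of the fraction-parsing proposal (illustrative only; "all", "sqrt", "log2", and "onethird" are the named options discussed above, and the fraction branch is the proposal):

```scala
// Sketch: named strategies first; any unrecognized value is tried as a
// fraction in (0, 1].
def numFeaturesPerNode(featureSubsetStrategy: String, numFeatures: Int): Int =
  featureSubsetStrategy match {
    case "all"      => numFeatures
    case "sqrt"     => math.max(1, math.sqrt(numFeatures).ceil.toInt)
    case "log2"     => math.max(1, (math.log(numFeatures) / math.log(2)).floor.toInt)
    case "onethird" => math.max(1, (numFeatures / 3.0).ceil.toInt)
    case fraction =>
      // Throws NumberFormatException for junk input; a real implementation
      // would report a friendlier error.
      val f = fraction.toDouble
      require(f > 0.0 && f <= 1.0, s"Invalid feature subset fraction: $f")
      math.max(1, (f * numFeatures).ceil.toInt)
  }
```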

CC: mengxr manishamde codedeft chouqin  Please let me know if you have suggestions---thanks!

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
Author: qiping.lqp <qiping.lqp@alibaba-inc.com>
Author: chouqin <liqiping1991@gmail.com>

Closes #2435 from jkbradley/rfs-new and squashes the following commits:

c694174 [Joseph K. Bradley] Fixed typo
cc59d78 [Joseph K. Bradley] fixed imports
e25909f [Joseph K. Bradley] Simplified node group maps.  Specifically, created NodeIndexInfo to store node index in agg and feature subsets, and no longer create extra maps in findBestSplits
fbe9a1e [Joseph K. Bradley] Changed default featureSubsetStrategy to be sqrt for classification, onethird for regression.  Updated docs with references.
ef7c293 [Joseph K. Bradley] Updates based on code review.  Most substantial changes: * Simplified DTStatsAggregator * Made RandomForestModel.trees public * Added test for regression to RandomForestSuite
593b13c [Joseph K. Bradley] Fixed bug in metadata for computing log2(num features).  Now it checks >= 1.
a1a08df [Joseph K. Bradley] Removed old comments
866e766 [Joseph K. Bradley] Changed RandomForestSuite randomized tests to use multiple fixed random seeds.
ff8bb96 [Joseph K. Bradley] removed usage of null from RandomForest and replaced with Option
bf1a4c5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into rfs-new
6b79c07 [Joseph K. Bradley] Added RandomForestSuite, and fixed small bugs, style issues.
d7753d4 [Joseph K. Bradley] Added numTrees and featureSubsetStrategy to DecisionTreeRunner (to support RandomForest).  Fixed bugs so that RandomForest now runs.
746d43c [Joseph K. Bradley] Implemented feature subsampling.  Tested DecisionTree but not RandomForest.
6309d1d [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into rfs-new.  Added RandomForestModel.toString
b7ae594 [Joseph K. Bradley] Updated docs.  Small fix for bug which does not cause errors: No longer allocate unused child nodes for leaf nodes.
121c74e [Joseph K. Bradley] Basic random forests are implemented.  Random features per node not yet implemented.  Test suite not implemented.
325d18a [Joseph K. Bradley] Merge branch 'chouqin-dt-preprune' into rfs-new
4ef9bf1 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into rfs-new
61b2e72 [Joseph K. Bradley] Added max of 10GB for maxMemoryInMB in Strategy.
a95e7c8 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into chouqin-dt-preprune
6da8571 [Joseph K. Bradley] RFs partly implemented, not done yet
eddd1eb [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into rfs-new
5c4ac33 [Joseph K. Bradley] Added check in Strategy to make sure minInstancesPerNode >= 1
0dd4d87 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-spark-3160
95c479d [Joseph K. Bradley] * Fixed typo in tree suite test "do not choose split that does not satisfy min instance per node requirements" * small style fixes
e2628b6 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into chouqin-dt-preprune
19b01af [Joseph K. Bradley] Merge remote-tracking branch 'chouqin/dt-preprune' into chouqin-dt-preprune
f1d11d1 [chouqin] fix typo
c7ebaf1 [chouqin] fix typo
39f9b60 [chouqin] change edge `minInstancesPerNode` to 2 and add one more test
c6e2dfc [Joseph K. Bradley] Added minInstancesPerNode and minInfoGain parameters to DecisionTreeRunner.scala and to Python API in tree.py
306120f [Joseph K. Bradley] Fixed typo in DecisionTreeModel.scala doc
eaa1dcf [Joseph K. Bradley] Added topNode doc in DecisionTree and scalastyle fix
d4d7864 [Joseph K. Bradley] Marked Node.build as deprecated
d4dbb99 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-spark-3160
1a8f0ad [Joseph K. Bradley] Eliminated pre-allocated nodes array in main train() method. * Nodes are constructed and added to the tree structure as needed during training.
0278a11 [chouqin] remove `noSplit` and set `Predict` private to tree
d593ec7 [chouqin] fix docs and change minInstancesPerNode to 1
2ab763b [Joseph K. Bradley] Simplifications to DecisionTree code:
efcc736 [qiping.lqp] fix bug
10b8012 [qiping.lqp] fix style
6728fad [qiping.lqp] minor fix: remove empty lines
bb465ca [qiping.lqp] Merge branch 'master' of https://github.com/apache/spark into dt-preprune
cadd569 [qiping.lqp] add api docs
46b891f [qiping.lqp] fix bug
e72c7e4 [qiping.lqp] add comments
845c6fa [qiping.lqp] fix style
f195e83 [qiping.lqp] fix style
987cbf4 [qiping.lqp] fix bug
ff34845 [qiping.lqp] separate calculation of predict of node from calculation of info gain
ac42378 [qiping.lqp] add min info gain and min instances per node parameters in decision tree
2014-09-28 21:44:50 -07:00
Reynold Xin f350cd3070 [SPARK-3543] TaskContext remaining cleanup work.
Author: Reynold Xin <rxin@apache.org>

Closes #2560 from rxin/TaskContext and squashes the following commits:

9eff95a [Reynold Xin] [SPARK-3543] remaining cleanup work.
2014-09-28 20:32:54 -07:00
Jim Lim 25164a89dd SPARK-2761 refactor #maybeSpill into Spillable
Moved `#maybeSpill` in ExternalSorter and ExternalAppendOnlyMap (EAOM) into `Spillable`.
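The extracted trait could look roughly like this (a simplified sketch; the real `Spillable` also requests memory from the shuffle memory manager and tracks spill metrics):

```scala
// Rough shape of the extracted trait.
trait SpillableSketch[C] {
  // Each concrete collection (ExternalSorter, ExternalAppendOnlyMap)
  // supplies its own way of spilling itself to disk.
  protected def spill(collection: C): Unit

  private val initialMemoryThreshold: Long = 5L * 1024 * 1024
  private var myMemoryThreshold: Long = initialMemoryThreshold

  // The logic that used to be duplicated in both collections: spill once the
  // estimated in-memory size crosses the threshold, then reset it.
  protected def maybeSpill(collection: C, currentMemory: Long): Boolean = {
    if (currentMemory >= myMemoryThreshold) {
      spill(collection)
      myMemoryThreshold = initialMemoryThreshold
      true
    } else {
      false
    }
  }
}
```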

Author: Jim Lim <jim@quixey.com>

Closes #2416 from jimjh/SPARK-2761 and squashes the following commits:

cf8be9a [Jim Lim] SPARK-2761 fix documentation, reorder code
f94d522 [Jim Lim] SPARK-2761 refactor Spillable to simplify sig
e75a24e [Jim Lim] SPARK-2761 use protected over protected[this]
7270e0d [Jim Lim] SPARK-2761 refactor #maybeSpill into Spillable
2014-09-28 19:04:24 -07:00
Reynold Xin 8e874185ed Revert "[SPARK-1021] Defer the data-driven computation of partition bounds in so..."
This reverts commit 2d972fd84a.

The commit was causing the correlationoptimizer14 test to hang.
2014-09-28 18:33:11 -07:00
WangTaoTheTonic 1f13a40ccd [SPARK-3715][Docs]minor typo
https://issues.apache.org/jira/browse/SPARK-3715

Author: WangTaoTheTonic <barneystinson@aliyun.com>

Closes #2567 from WangTaoTheTonic/minortypo and squashes the following commits:

9cc3f7a [WangTaoTheTonic] minor typo
2014-09-28 18:30:13 -07:00
William Benton 6918012d0f SPARK-3699: SQL and Hive console tasks now clean up appropriately
The sbt tasks sql/console and hive/console will now `stop()`
the `SparkContext` upon exit.  Previously, they left an ugly stack
trace when quitting.

Author: William Benton <willb@redhat.com>

Closes #2547 from willb/consoleCleanup and squashes the following commits:

d5e431f [William Benton] SQL and Hive console tasks now clean up.
2014-09-28 01:01:27 -07:00
Reynold Xin 66e1c40c67 Minor fix for the previous commit. 2014-09-27 22:18:02 -07:00
Dale 9966d1a8aa SPARK-CORE [SPARK-3651] Group common CoarseGrainedSchedulerBackend variables together
from [SPARK-3651]
In CoarseGrainedSchedulerBackend, we have:

    private val executorActor = new HashMap[String, ActorRef]
    private val executorAddress = new HashMap[String, Address]
    private val executorHost = new HashMap[String, String]
    private val freeCores = new HashMap[String, Int]
    private val totalCores = new HashMap[String, Int]

We only ever put / remove stuff from these maps together. It would simplify the code if we consolidate these all into one map as we have done in JobProgressListener in https://issues.apache.org/jira/browse/SPARK-2299.
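A sketch of the consolidated shape (field names are illustrative):

```scala
import scala.collection.mutable.HashMap
import akka.actor.{ActorRef, Address}

// One value object per executor instead of five parallel maps keyed by the
// same executor ID.
case class ExecutorDataSketch(
    executorActor: ActorRef,
    executorAddress: Address,
    executorHost: String,
    var freeCores: Int,
    totalCores: Int)

class CoarseGrainedSchedulerBackendSketch {
  // All per-executor state now lives (and is added/removed) together.
  private val executorDataMap = new HashMap[String, ExecutorDataSketch]
}
```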

Author: Dale <tigerquoll@outlook.com>

Closes #2533 from tigerquoll/SPARK-3651 and squashes the following commits:

d1be0a9 [Dale] [SPARK-3651]  implemented suggested changes. Changed a reference from executorInfo to executorData to be consistent with other usages
6890663 [Dale] [SPARK-3651]  implemented suggested changes
7d671cf [Dale] [SPARK-3651]  Grouped variables under a ExecutorDataObject, and reference them via a map entry as they are all retrieved under the same key
2014-09-27 22:08:10 -07:00
Uri Laserson 248232936e [SPARK-3389] Add Converter for ease of Parquet reading in PySpark
https://issues.apache.org/jira/browse/SPARK-3389

Author: Uri Laserson <laserson@cloudera.com>

Closes #2256 from laserson/SPARK-3389 and squashes the following commits:

0ed363e [Uri Laserson] PEP8'd the python file
0b4b380 [Uri Laserson] Moved converter to examples and added python example
eecf4dc [Uri Laserson] [SPARK-3389] Add Converter for ease of Parquet reading in PySpark
2014-09-27 21:48:05 -07:00
Reynold Xin 5b922bb458 [SPARK-3543] Clean up Java TaskContext implementation.
This addresses some minor issues in https://github.com/apache/spark/pull/2425

Author: Reynold Xin <rxin@apache.org>

Closes #2557 from rxin/TaskContext and squashes the following commits:

a51e5f6 [Reynold Xin] [SPARK-3543] Clean up Java TaskContext implementation.
2014-09-27 14:46:00 -07:00
Davies Liu 0d8cdf0ede [SPARK-3681] [SQL] [PySpark] fix serialization of List and Map in SchemaRDD
Currently, the schema of objects in ArrayType or MapType is attached lazily. This gives better performance but introduces issues during serialization and when accessing nested objects.

This patch applies the schema to objects of ArrayType or MapType immediately when they are accessed. This will be a little bit slower, but much more robust.

Author: Davies Liu <davies.liu@gmail.com>

Closes #2526 from davies/nested and squashes the following commits:

2399ae5 [Davies Liu] fix serialization of List and Map in SchemaRDD
2014-09-27 12:21:37 -07:00
Michael Armbrust f0c7e19550 [SPARK-3680][SQL] Fix bug caused by eager typing of HiveGenericUDFs
Typing of UDFs should be lazy as it is often not valid to call `dataType` on an expression until after all of its children are `resolved`.
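The fix boils down to deferring that computation; here is a sketch with illustrative types, not the actual Catalyst classes:

```scala
// Sketch: defer the dataType computation until first use, so it is not
// evaluated while children are still unresolved.
trait ExpressionSketch {
  def resolved: Boolean
  def dataType: String
}

class UdfSketch(children: Seq[ExpressionSketch]) extends ExpressionSketch {
  def resolved: Boolean = children.forall(_.resolved)
  // `lazy val` instead of `val`: computing this eagerly at construction time
  // would inspect children before they are resolved and fail.
  lazy val dataType: String = {
    require(resolved, "dataType is only valid once children are resolved")
    children.map(_.dataType).mkString("(", ", ", ")")
  }
}
```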

Author: Michael Armbrust <michael@databricks.com>

Closes #2525 from marmbrus/concatBug and squashes the following commits:

5b8efe7 [Michael Armbrust] fix bug with eager typing of udfs
2014-09-27 12:10:16 -07:00
w00228970 0800881051 [SPARK-3676][SQL] Fix hive test suite failure due to diffs in JDK 1.6/1.7
This is a bug in JDK6: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4428022

This is because different JDK versions produce different string output for `double` values:
`System.out.println(1/500d)` prints different results depending on the JDK:
jdk 1.6.0(_31) ---- 0.0020
jdk 1.7.0(_05) ---- 0.002
This causes HiveQuerySuite to fail when the golden answers are generated under JDK 1.7 and the tests are run under JDK 1.6: the results do not match.

Author: w00228970 <wangfei1@huawei.com>

Closes #2517 from scwf/HiveQuerySuite and squashes the following commits:

0cb5e8d [w00228970] delete golden answer of division-0 and timestamp cast #1
1df3964 [w00228970] Jdk version leads to different query output for Double, this make HiveQuerySuite failed
2014-09-27 12:06:16 -07:00
CrazyJvm 66107f46f3 Docs: use "--total-executor-cores" rather than "--cores" after spark-shell
Author: CrazyJvm <crazyjvm@gmail.com>

Closes #2540 from CrazyJvm/standalone-core and squashes the following commits:

66d9fc6 [CrazyJvm] use "--total-executor-cores" rather than "--cores" after spark-shell
2014-09-27 09:42:01 -07:00
Reynold Xin 436a7730b6 Minor cleanup to tighten visibility and remove compilation warning.
Author: Reynold Xin <rxin@apache.org>

Closes #2555 from rxin/cleanup and squashes the following commits:

6add199 [Reynold Xin] Minor cleanup to tighten visibility and remove compilation warning.
2014-09-27 00:57:26 -07:00
Erik Erlandson 2d972fd84a [SPARK-1021] Defer the data-driven computation of partition bounds in sortByKey() until evaluation.
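The core idea, as a simplified sketch (the real `RangePartitioner` samples an RDD, not a local collection):

```scala
// Sketch: make the sampling-based bounds a lazy val, so the sampling job
// runs on first use (i.e., at evaluation) rather than when sortByKey()
// merely constructs the partitioner.
class RangeBoundsSketch(partitions: Int, sample: => Seq[Int]) {
  lazy val rangeBounds: Array[Int] = {
    val sorted = sample.sorted  // the "expensive" data-driven work happens here
    if (sorted.isEmpty) Array.empty[Int]
    else Array.tabulate(partitions - 1) { i =>
      sorted(((i + 1) * sorted.length) / partitions)
    }
  }
  def getPartition(key: Int): Int = rangeBounds.count(_ < key)
}
```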

Author: Erik Erlandson <eerlands@redhat.com>

Closes #1689 from erikerlandson/spark-1021-pr and squashes the following commits:

50b6da6 [Erik Erlandson] use standard getIteratorSize in countAsync
4e334a9 [Erik Erlandson] exception mystery fixed by fixing bug in ComplexFutureAction
b88b5d4 [Erik Erlandson] tweak async actions to use ComplexFutureAction[T] so they handle RangePartitioner sampling job properly
b2b20e8 [Erik Erlandson] Fix bug in exception passing with ComplexFutureAction[T]
ca8913e [Erik Erlandson] RangePartition sampling job -> FutureAction
7143f97 [Erik Erlandson] [SPARK-1021] modify range bounds variable to be thread safe
ac67195 [Erik Erlandson] [SPARK-1021] Defer the data-driven computation of partition bounds in sortByKey() until evaluation.
2014-09-26 23:15:10 -07:00
Jeff Steinmetz 9e8ced7847 stop, start and destroy require the EC2_REGION
i.e.
./spark-ec2 --region=us-west-1 stop yourclustername

Author: Jeff Steinmetz <jeffrey.steinmetz@gmail.com>

Closes #2473 from jeffsteinmetz/master and squashes the following commits:

7491f2c [Jeff Steinmetz] fix case in EC2 cluster setup documentation
bd3d777 [Jeff Steinmetz] standardized ec2 documenation to use <lower-case> sample args
2bf4a57 [Jeff Steinmetz] standardized ec2 documenation to use <lower-case> sample args
68d8372 [Jeff Steinmetz] standardized ec2 documenation to use <lower-case> sample args
d2ab6e2 [Jeff Steinmetz] standardized ec2 documenation to use <lower-case> sample args
520e6dc [Jeff Steinmetz] standardized ec2 documenation to use <lower-case> sample args
37fc876 [Jeff Steinmetz] stop, start and destroy require the EC2_REGION
2014-09-26 23:00:40 -07:00
Michael Armbrust d8a9d1d442 [SPARK-3675][SQL] Allow starting a JDBC server on an existing context
Author: Michael Armbrust <michael@databricks.com>

Closes #2515 from marmbrus/jdbcExistingContext and squashes the following commits:

7866fad [Michael Armbrust] Allows starting a JDBC server on an existing context.
2014-09-26 22:30:12 -07:00
Michael Armbrust f0eea76d94 [SQL][DOCS] Clarify that the server is for JDBC and ODBC
Author: Michael Armbrust <michael@databricks.com>

Closes #2527 from marmbrus/patch-1 and squashes the following commits:

a0f9f1c [Michael Armbrust] [SQL][DOCS] Clarify that the server is for JDBC and ODBC
2014-09-26 22:24:34 -07:00
wangfei 0cdcdd2c9d [Build] Remove spark-staging-1030
Since 1.1.0 has been published, remove spark-staging-1030.

Author: wangfei <wangfei1@huawei.com>

Closes #2532 from scwf/patch-2 and squashes the following commits:

bc9e00b [wangfei] remove spark-staging-1030
2014-09-26 22:23:49 -07:00
Sarah Gerweck e976ca236f Slaves file is now a template.
Change 0dc868e removed the `conf/slaves` file and made it a template like most of the other configuration files. This means you can no longer run `make-distribution.sh` unless you manually create a slaves file to be statically bundled in your distribution, which seems at odds with making it a template file.

Author: Sarah Gerweck <sarah.a180@gmail.com>

Closes #2549 from sarahgerweck/noMoreSlaves and squashes the following commits:

d11d99a [Sarah Gerweck] Slaves file is now a template.
2014-09-26 22:21:50 -07:00
Reynold Xin a3feaf04dc Close #2194. 2014-09-26 21:44:10 -07:00
Prashant Sharma 5e34855cf0 [SPARK-3543] Write TaskContext in Java and expose it through a static accessor.
Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Shashank Sharma <shashank21j@gmail.com>

Closes #2425 from ScrapCodes/SPARK-3543/withTaskContext and squashes the following commits:

8ae414c [Shashank Sharma] CR
ee8bd00 [Prashant Sharma] Added internal API in docs comments.
ddb8cbe [Prashant Sharma] Moved setting the thread local to where TaskContext is instantiated.
a7d5e23 [Prashant Sharma] Added doc comments.
edf945e [Prashant Sharma] Code review git add -A
f716fd1 [Prashant Sharma] introduced thread local for getting the task context.
333c7d6 [Prashant Sharma] Translated Task context from scala to java.
2014-09-26 21:29:54 -07:00
Josh Rosen f872e4fb80 Revert "[SPARK-3478] [PySpark] Profile the Python tasks"
This reverts commit 1aa549ba98.
2014-09-26 14:47:14 -07:00
Cheng Hao 7364fa5a17 [SPARK-3393] [SQL] Align the log4j configuration for Spark & SparkSQLCLI
Users may be confused by the HQL logging & configuration, so we'd better provide default templates.

Both files are copied from Hive.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #2263 from chenghao-intel/hive_template and squashes the following commits:

53bffa9 [Cheng Hao] Remove the hive-log4j.properties initialization
2014-09-26 12:06:01 -07:00
Daoyuan Wang 0ec2d2e8f0 [SPARK-3531][SQL] select null from table would throw a MatchError
Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #2396 from adrian-wang/selectnull and squashes the following commits:

2458229 [Daoyuan Wang] rebase solution
2014-09-26 12:04:37 -07:00
Andrew Or 8da10bf146 [SPARK-3476] Remove outdated memory checks in Yarn
See description in [JIRA](https://issues.apache.org/jira/browse/SPARK-3476).

Author: Andrew Or <andrewor14@gmail.com>

Closes #2528 from andrewor14/yarn-memory-checks and squashes the following commits:

c5400cd [Andrew Or] Simplify checks
e30ffac [Andrew Or] Remove outdated memory checks
2014-09-26 11:50:48 -07:00
Daoyuan Wang 30461c6ac3 [SPARK-3695] shuffle fetch fail output
Shuffle fetch failures should output the detailed host and port in the error message.

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #2539 from adrian-wang/fetchfail and squashes the following commits:

6c1b1e0 [Daoyuan Wang] shuffle fetch fail output
2014-09-26 11:26:53 -07:00
RJ Nowling ec9df6a765 [SPARK-3614][MLLIB] Add minimumOccurence filtering to IDF
This PR for [SPARK-3614](https://issues.apache.org/jira/browse/SPARK-3614) adds functionality for filtering out terms which do not appear in at least a minimum number of documents.

This is implemented using a minimumOccurence parameter (default 0).  When terms' document frequencies are less than minimumOccurence, their IDFs are set to 0, just like when the DF is 0.  As a result, the TF-IDFs for the terms are found to be 0, as if the terms were not present in the documents.
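A sketch of the filtering rule (the smoothed log formula is an assumption for illustration, and the parameter uses the later `minDocFreq` name from the commit history; the exact formula is in IDF.scala):

```scala
// Sketch: terms appearing in fewer than minDocFreq documents get IDF 0,
// which zeroes out their TF-IDF, as if the terms were absent.
def idfWithFilter(docFreq: Array[Long], numDocs: Long, minDocFreq: Long): Array[Double] =
  docFreq.map { df =>
    if (df < math.max(minDocFreq, 1L)) 0.0
    else math.log((numDocs + 1.0) / (df + 1.0))
  }
```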

This PR makes the following changes:
* Add a minimumOccurence parameter to the IDF and DocumentFrequencyAggregator classes.
* Create a parameter-less constructor for IDF with a default minimumOccurence value of 0 to retain backwards compatibility with the original IDF API.
* Set the IDFs to 0 for terms whose DFs are less than minimumOccurence.
* Add tests to the Spark IDFSuite and Java JavaTfIdfSuite test suites.
* Update the MLlib Feature Extraction programming guide to describe the new feature.

Author: RJ Nowling <rnowling@gmail.com>

Closes #2494 from rnowling/spark-3614-idf-filter and squashes the following commits:

0aa3c63 [RJ Nowling] Fix identation
e6523a8 [RJ Nowling] Remove unnecessary toDouble's from IDFSuite
bfa82ec [RJ Nowling] Add space after if
30d20b3 [RJ Nowling] Add spaces around equals signs
9013447 [RJ Nowling] Add space before division operator
79978fc [RJ Nowling] Remove unnecessary semi-colon
40fd70c [RJ Nowling] Change minimumOccurence to minDocFreq in code and docs
47850ab [RJ Nowling] Changed minimumOccurence to Int from Long
9fb4093 [RJ Nowling] Remove unnecessary lines from IDF class docs
1fc09d8 [RJ Nowling] Add backwards-compatible constructor to DocumentFrequencyAggregator
1801fd2 [RJ Nowling] Fix style errors in IDF.scala
6897252 [RJ Nowling] Preface minimumOccurence members with val to make them final and immutable
a200bab [RJ Nowling] Remove unnecessary else statement
4b974f5 [RJ Nowling] Remove accidentally-added import from testing
c0cc643 [RJ Nowling] Add minimumOccurence filtering to IDF
2014-09-26 09:58:47 -07:00
aniketbhatnagar d16e161d74 SPARK-3639 | Removed settings master in examples
This patch removes the setting of master as local in the Kinesis examples so that users can set it when submitting the job.

Author: aniketbhatnagar <aniket.bhatnagar@gmail.com>

Closes #2536 from aniketbhatnagar/Kinesis-Examples-Master-Unset and squashes the following commits:

c9723ac [aniketbhatnagar] Merge remote-tracking branch 'origin/Kinesis-Examples-Master-Unset' into Kinesis-Examples-Master-Unset
fec8ead [aniketbhatnagar] SPARK-3639 | Removed settings master in examples
31cdc59 [aniketbhatnagar] SPARK-3639 | Removed settings master in examples
2014-09-26 09:48:46 -07:00
Davies Liu 1aa549ba98 [SPARK-3478] [PySpark] Profile the Python tasks
This patch adds profiling support for PySpark. It will show the profiling results
before the driver exits; here is one example:

```
============================================================
Profile of RDD<id=3>
============================================================
         5146507 function calls (5146487 primitive calls) in 71.094 seconds

   Ordered by: internal time, cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  5144576   68.331    0.000   68.331    0.000 statcounter.py:44(merge)
       20    2.735    0.137   71.071    3.554 statcounter.py:33(__init__)
       20    0.017    0.001    0.017    0.001 {cPickle.dumps}
     1024    0.003    0.000    0.003    0.000 t.py:16(<lambda>)
       20    0.001    0.000    0.001    0.000 {reduce}
       21    0.001    0.000    0.001    0.000 {cPickle.loads}
       20    0.001    0.000    0.001    0.000 copy_reg.py:95(_slotnames)
       41    0.001    0.000    0.001    0.000 serializers.py:461(read_int)
       40    0.001    0.000    0.002    0.000 serializers.py:179(_batched)
       62    0.000    0.000    0.000    0.000 {method 'read' of 'file' objects}
       20    0.000    0.000   71.072    3.554 rdd.py:863(<lambda>)
       20    0.000    0.000    0.001    0.000 serializers.py:198(load_stream)
    40/20    0.000    0.000   71.072    3.554 rdd.py:2093(pipeline_func)
       41    0.000    0.000    0.002    0.000 serializers.py:130(load_stream)
       40    0.000    0.000   71.072    1.777 rdd.py:304(func)
       20    0.000    0.000   71.094    3.555 worker.py:82(process)
```

Also, users can show the profile results manually via `sc.show_profiles()` or dump them to disk
via `sc.dump_profiles(path)`, for example:

```python
>>> sc._conf.set("spark.python.profile", "true")
>>> rdd = sc.parallelize(range(100)).map(str)
>>> rdd.count()
100
>>> sc.show_profiles()
============================================================
Profile of RDD<id=1>
============================================================
         284 function calls (276 primitive calls) in 0.001 seconds

   Ordered by: internal time, cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        4    0.000    0.000    0.000    0.000 serializers.py:198(load_stream)
        4    0.000    0.000    0.000    0.000 {reduce}
     12/4    0.000    0.000    0.001    0.000 rdd.py:2092(pipeline_func)
        4    0.000    0.000    0.000    0.000 {cPickle.loads}
        4    0.000    0.000    0.000    0.000 {cPickle.dumps}
      104    0.000    0.000    0.000    0.000 rdd.py:852(<genexpr>)
        8    0.000    0.000    0.000    0.000 serializers.py:461(read_int)
       12    0.000    0.000    0.000    0.000 rdd.py:303(func)
```
Profiling is disabled by default; it can be enabled by setting "spark.python.profile=true".

Also, users can have the results dumped to disk automatically for future analysis by setting "spark.python.profile.dump=path_to_dump".

Author: Davies Liu <davies.liu@gmail.com>

Closes #2351 from davies/profiler and squashes the following commits:

7ef2aa0 [Davies Liu] bugfix, add tests for show_profiles and dump_profiles()
2b0daf2 [Davies Liu] fix docs
7a56c24 [Davies Liu] bugfix
cba9463 [Davies Liu] move show_profiles and dump_profiles to SparkContext
fb9565b [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
116d52a [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
09d02c3 [Davies Liu] Merge branch 'master' into profiler
c23865c [Davies Liu] Merge branch 'master' into profiler
15d6f18 [Davies Liu] add docs for two configs
dadee1a [Davies Liu] add docs string and clear profiles after show or dump
4f8309d [Davies Liu] address comment, add tests
0a5b6eb [Davies Liu] fix Python UDF
4b20494 [Davies Liu] add profile for python
2014-09-26 09:27:42 -07:00
Hari Shreedharan b235e01363 [SPARK-3686][STREAMING] Wait for sink to commit the channel before checking for the channel size.

Author: Hari Shreedharan <hshreedharan@apache.org>

Closes #2531 from harishreedharan/sparksinksuite-fix and squashes the following commits:

30393c1 [Hari Shreedharan] Use more deterministic method to figure out when batches come in.
6ce9d8b [Hari Shreedharan] [SPARK-3686][STREAMING] Wait for sink to commit the channel before checking for the channel size.
2014-09-25 22:56:43 -07:00
zsxwing 86bce76498 SPARK-2634: Change MapOutputTrackerWorker.mapStatuses to ConcurrentHashMap
MapOutputTrackerWorker.mapStatuses is used concurrently, so it should be thread-safe. This bug has already been fixed in #1328. Nevertheless, considering that #1328 won't be merged soon, I am sending this trivial fix in the hope that this issue can be solved soon.

Author: zsxwing <zsxwing@gmail.com>

Closes #1541 from zsxwing/SPARK-2634 and squashes the following commits:

d450053 [zsxwing] SPARK-2634: Change MapOutputTrackerWorker.mapStatuses to ConcurrentHashMap
2014-09-25 18:24:01 -07:00
Kousuke Saruta 0dc868e787 [SPARK-3584] sbin/slaves doesn't work when we use password authentication for SSH
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2444 from sarutak/slaves-scripts-modification and squashes the following commits:

eff7394 [Kousuke Saruta] Improve the description about Cluster Launch Script in docs/spark-standalone.md
7858225 [Kousuke Saruta] Modified sbin/slaves to use the environment variable "SPARK_SSH_FOREGROUND" as a flag
53d7121 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into slaves-scripts-modification
e570431 [Kousuke Saruta] Added a description for SPARK_SSH_FOREGROUND variable
7120a0c [Kousuke Saruta] Added a description about default host for sbin/slaves
1bba8a9 [Kousuke Saruta] Added SPARK_SSH_FOREGROUND flag to sbin/slaves
88e2f17 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into slaves-scripts-modification
297e75d [Kousuke Saruta] Modified sbin/slaves not to export HOSTLIST
2014-09-25 16:49:15 -07:00
Aaron Staple ff637c9380 [SPARK-1484][MLLIB] Warn when running an iterative algorithm on uncached data.
Add warnings to KMeans, GeneralizedLinearAlgorithm, and computeSVD when called with input data that is not cached. KMeans is implemented iteratively, and I believe that GeneralizedLinearAlgorithm’s current optimizers are iterative and its future optimizers are also likely to be iterative. RowMatrix’s computeSVD is iterative against an RDD when run in DistARPACK mode. ALS and DecisionTree are iterative as well, but they implement RDD caching internally so do not require a warning.

I added a warning to GeneralizedLinearAlgorithm rather than inside its optimizers, where the iteration actually occurs, because internally GeneralizedLinearAlgorithm maps its input data to an uncached RDD before passing it to an optimizer. (In other words, if the warning were in GradientDescent or another optimizer, it would be printed for every GeneralizedLinearAlgorithm run, regardless of whether the input is cached.) I assume that the use of an uncached RDD by GeneralizedLinearAlgorithm is intentional, and that the mapping there (adding label, intercepts and scaling) is a lightweight operation. Arguably a user calling an optimizer such as GradientDescent directly will be knowledgeable enough to cache their data without a log warning, so the lack of a warning in the optimizers may be OK.
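A sketch of the kind of check involved (the real code reports through Spark's logging, and the exact message differs):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Warn when an iterative algorithm is handed an uncached input RDD.
def warnIfUncached(input: RDD[_], algorithm: String): Unit = {
  if (input.getStorageLevel == StorageLevel.NONE) {
    Console.err.println(s"WARN: $algorithm is iterative but its input RDD " +
      "is not cached, which may hurt performance. Consider calling .cache().")
  }
}
```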

Some of the documentation examples making use of these iterative algorithms did not cache their training RDDs (while others did). I updated the examples to always cache. I also fixed some (unrelated) minor errors in the documentation examples.

Author: Aaron Staple <aaron.staple@gmail.com>

Closes #2347 from staple/SPARK-1484 and squashes the following commits:

bd49701 [Aaron Staple] Address review comments.
ab2d4a4 [Aaron Staple] Disable warnings on python code path.
a7a0f99 [Aaron Staple] Change code comments per review comments.
7cca1dc [Aaron Staple] Change warning message text.
c77e939 [Aaron Staple] [SPARK-1484][MLLIB] Warn when running an iterative algorithm on uncached data.
3b6c511 [Aaron Staple] Minor doc example fixes.
2014-09-25 16:11:00 -07:00
epahomov 9b56e249e0 [SPARK-3690] Closing shuffle writers we swallow more important exception
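A sketch of the general pattern being fixed (not the actual shuffle-writer code):

```scala
// If close() throws inside `finally`, it replaces (swallows) the original,
// more important exception from the write path.
def writeAndClose(write: () => Unit, close: () => Unit): Unit = {
  var success = false
  try {
    write()
    success = true
  } finally {
    if (success) {
      close()  // success path: let a close failure propagate
    } else {
      // Failure path: an exception is already propagating from write();
      // don't let a secondary close() failure hide it.
      try close() catch { case _: Exception => () /* log it instead */ }
    }
  }
}
```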
Author: epahomov <pahomov.egor@gmail.com>

Closes #2537 from epahomov/SPARK-3690 and squashes the following commits:

a0b7de4 [epahomov] [SPARK-3690] Closing shuffle writers we swallow more important exception
2014-09-25 14:50:12 -07:00
Sean Owen c3f2a8588e SPARK-2932 [STREAMING] Move MasterFailureTest out of "main" source directory
(HT @vanzin) Whatever the reason was for having this test class in `main`, if there was one, appears to be moot. This may have been a result of earlier streaming test reorganization.

This simply puts `MasterFailureTest` back under `test/`, removes some redundant copied code, and touches up a few tiny inspection warnings along the way.

Author: Sean Owen <sowen@cloudera.com>

Closes #2399 from srowen/SPARK-2932 and squashes the following commits:

3909411 [Sean Owen] Move MasterFailureTest to src/test, and remove redundant TestOutputStream
2014-09-25 23:20:17 +05:30
Marcelo Vanzin b8487713d3 [SPARK-2778] [yarn] Add yarn integration tests.
This patch adds a couple of, currently, very simple integration tests
to make sure both client and cluster modes are working. The tests don't
do much yet other than run a simple job, but the plan is to enhance
them after we get the framework in.

The cluster tests are noisy, so redirect all log output to a file
like other tests do. Copying the conf around sucks but it's less
work than messing with maven/sbt and having to clean up other
projects.

Note the test is only added for yarn-stable. The code compiles
against yarn-alpha but there are two issues I ran into that I
could not overcome:
- an old netty dependency kept creeping into the classpath and
  causing akka to not work, when using sbt; the old netty was
  correctly suppressed under maven.
- MiniYARNCluster kept failing to execute containers because it
  did not create the NM's local dir itself; this is apparently
  a known behavior, but I'm not sure how to work around it.

None of those issues are present with the stable Yarn.

Also, these tests are a little slow to run. Apparently Spark doesn't
yet tag tests (so that these could be isolated in a "slow" batch),
so this is something to keep in mind.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #2257 from vanzin/yarn-tests and squashes the following commits:

6d5b84e [Marcelo Vanzin] Fix wrong system property being set.
8b0933d [Marcelo Vanzin] Merge branch 'master' into yarn-tests
5c2b56f [Marcelo Vanzin] Use custom log4j conf for Yarn containers.
ec73f17 [Marcelo Vanzin] More review feedback.
67f5b02 [Marcelo Vanzin] Review feedback.
f01517c [Marcelo Vanzin] Review feedback.
68fbbbf [Marcelo Vanzin] Use older constructor available in older Hadoop releases.
d07ef9a [Marcelo Vanzin] Merge branch 'master' into yarn-tests
add8416 [Marcelo Vanzin] [SPARK-2778] [yarn] Add yarn integration tests.
2014-09-24 23:10:26 -07:00
Aaron Staple 8ca4ecb6a5 [SPARK-546] Add full outer join to RDD and DStream.
leftOuterJoin and rightOuterJoin are already implemented.  This patch adds fullOuterJoin.
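A small usage sketch of the new API, run script-style against a local master:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair-RDD functions (Spark 1.x style)

// The result is keyed on the union of keys, and each side becomes an Option
// that is None where that side had no matching key.
val sc = new SparkContext(new SparkConf().setAppName("foj-demo").setMaster("local"))
val left  = sc.parallelize(Seq(1 -> "a", 2 -> "b"))
val right = sc.parallelize(Seq(2 -> "x", 3 -> "y"))
// Expected output (order aside):
//   (1,(Some(a),None)), (2,(Some(b),Some(x))), (3,(None,Some(y)))
left.fullOuterJoin(right).collect().foreach(println)
sc.stop()
```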

Author: Aaron Staple <aaron.staple@gmail.com>

Closes #1395 from staple/SPARK-546 and squashes the following commits:

1f5595c [Aaron Staple] Fix python style
7ac0aa9 [Aaron Staple] [SPARK-546] Add full outer join to RDD and DStream.
3b5d137 [Aaron Staple] In JavaPairDStream, make class tag specification in rightOuterJoin consistent with other functions.
31f2956 [Aaron Staple] Fix left outer join documentation comments.
2014-09-24 20:39:09 -07:00
jerryshao 74fb2ecf7a [SPARK-3615][Streaming] Fix Kafka unit test hard coded Zookeeper port issue
Details can be seen in [SPARK-3615](https://issues.apache.org/jira/browse/SPARK-3615).

Author: jerryshao <saisai.shao@intel.com>

Closes #2483 from jerryshao/SPARK_3615 and squashes the following commits:

8555563 [jerryshao] Fix Kafka unit test hard coded Zookeeper port issue
2014-09-24 17:18:55 -07:00
Davies Liu bb96012b73 [SPARK-3679] [PySpark] pickle the exact globals of functions
function.func_code.co_names has all the names used in the function, including the names of attributes. It will pickle some unnecessary globals if there is a global with the same name as an attribute (in co_names).

There is a regression introduced by #2144, revert part of changes in that PR.

cc JoshRosen

Author: Davies Liu <davies.liu@gmail.com>

Closes #2522 from davies/globals and squashes the following commits:

dfbccf5 [Davies Liu] fix bug while pickle globals of function
2014-09-24 13:00:05 -07:00
Davies Liu c854b9fcb5 [SPARK-3634] [PySpark] User's module should take precedence over system modules
Python modules added through addPyFile should take precedence over system modules.

This patch puts the path for user-added modules at the front of sys.path (just after '').

Author: Davies Liu <davies.liu@gmail.com>

Closes #2492 from davies/path and squashes the following commits:

4a2af78 [Davies Liu] fix tests
f7ff4da [Davies Liu] ad license header
6b0002f [Davies Liu] add tests
c16c392 [Davies Liu] put addPyFile in front of sys.path
2014-09-24 12:10:09 -07:00
Shivaram Venkataraman 50f8633653 [SPARK-3659] Set EC2 version to 1.1.0 and update version map
This brings the master branch in sync with branch-1.1

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #2510 from shivaram/spark-ec2-version and squashes the following commits:

bb0dd16 [Shivaram Venkataraman] Set EC2 version to 1.1.0 and update version map
2014-09-24 11:34:39 -07:00
Nicholas Chammas c429126066 [Build] Diff from branch point
Sometimes Jenkins posts [spurious reports of new classes being added](https://github.com/apache/spark/pull/2339#issuecomment-56570170). I believe this stems from diffing the patch against `master`, as opposed to against `master...`, which starts from the commit the PR was branched from.

This patch fixes that behavior.

Author: Nicholas Chammas <nicholas.chammas@gmail.com>

Closes #2512 from nchammas/diff-only-commits-ahead and squashes the following commits:

c065599 [Nicholas Chammas] comment typo fix
a453c67 [Nicholas Chammas] diff from branch point
2014-09-24 11:33:58 -07:00
Mubarak Seyed 729952a5ef [SPARK-1853] Show Streaming application code context (file, line number) in Spark Stages UI
This is a refactored version of the original PR https://github.com/apache/spark/pull/1723 by mubarak

Please take a look andrewor14, mubarak

Author: Mubarak Seyed <mubarak.seyed@gmail.com>
Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #2464 from tdas/streaming-callsite and squashes the following commits:

dc54c71 [Tathagata Das] Made changes based on PR comments.
390b45d [Tathagata Das] Fixed minor bugs.
904cd92 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into streaming-callsite
7baa427 [Tathagata Das] Refactored getCallSite and setCallSite to make it simpler. Also added unit test for DStream creation site.
b9ed945 [Mubarak Seyed] Adding streaming utils
c461cf4 [Mubarak Seyed] Merge remote-tracking branch 'upstream/master'
ceb43da [Mubarak Seyed] Changing default regex function name
8c5d443 [Mubarak Seyed] Merge remote-tracking branch 'upstream/master'
196121b [Mubarak Seyed] Merge remote-tracking branch 'upstream/master'
491a1eb [Mubarak Seyed] Removing streaming visibility from getRDDCreationCallSite in DStream
33a7295 [Mubarak Seyed] Fixing review comments: Merging both setCallSite methods
c26d933 [Mubarak Seyed] Merge remote-tracking branch 'upstream/master'
f51fd9f [Mubarak Seyed] Fixing scalastyle, Regex for Utils.getCallSite, and changing method names in DStream
5051c58 [Mubarak Seyed] Getting return value of compute() into variable and call setCallSite(prevCallSite) only once. Adding return for other code paths (for None)
a207eb7 [Mubarak Seyed] Fixing code review comments
ccde038 [Mubarak Seyed] Removing Utils import from MappedDStream
2a09ad6 [Mubarak Seyed] Changes in Utils.scala for SPARK-1853
1d90cc3 [Mubarak Seyed] Changes for SPARK-1853
5f3105a [Mubarak Seyed] Merge remote-tracking branch 'upstream/master'
70f494f [Mubarak Seyed] Changes for SPARK-1853
1500deb [Mubarak Seyed] Changes in Spark Streaming UI
9d38d3c [Mubarak Seyed] [SPARK-1853] Show Streaming application code context (file, line number) in Spark Stages UI
d466d75 [Mubarak Seyed] Changes for spark streaming UI
2014-09-23 15:09:12 -07:00
Andrew Or b3fef50e22 [SPARK-3653] Respect SPARK_*_MEMORY for cluster mode
`SPARK_DRIVER_MEMORY` was only used to start the `SparkSubmit` JVM, which becomes the driver only in client mode but not cluster mode. In cluster mode, this property is simply not propagated to the worker nodes.

`SPARK_EXECUTOR_MEMORY` is picked up from `SparkContext`, but in cluster mode the driver runs on one of the worker machines, where this environment variable may not be set.

Author: Andrew Or <andrewor14@gmail.com>

Closes #2500 from andrewor14/memory-env-vars and squashes the following commits:

6217b38 [Andrew Or] Respect SPARK_*_MEMORY for cluster mode
2014-09-23 14:00:33 -07:00
Sandy Ryza d79238d03a SPARK-3612. Executor shouldn't quit if heartbeat message fails to reach the driver

Author: Sandy Ryza <sandy@cloudera.com>

Closes #2487 from sryza/sandy-spark-3612 and squashes the following commits:

2b7353d [Sandy Ryza] SPARK-3612. Executor shouldn't quit if heartbeat message fails to reach the driver
2014-09-23 13:44:18 -07:00
Marcelo Vanzin 8dfe79ffb2 [SPARK-3647] Add more exceptions to Guava relocation.
Guava's Optional refers to some package private classes / methods, and
when those are relocated the code stops working, throwing exceptions.
So add the affected classes to the exception list too, and add a unit
test.

(Note that this unit test only really makes sense in maven, since we
don't relocate in the sbt build. Also, JavaAPISuite doesn't seem to
be run by "mvn test" - I had to manually add command line options to
enable it.)

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #2496 from vanzin/SPARK-3647 and squashes the following commits:

84f58d7 [Marcelo Vanzin] [SPARK-3647] Add more exceptions to Guava relocation.
2014-09-23 13:42:00 -07:00
Michael Armbrust a08153f8a3 [SPARK-3646][SQL] Copy SQL configuration from SparkConf when a SQLContext is created.
This will allow us to take advantage of things like the spark.defaults file.

Author: Michael Armbrust <michael@databricks.com>

Closes #2493 from marmbrus/copySparkConf and squashes the following commits:

0bd1377 [Michael Armbrust] Copy SQL configuration from SparkConf when a SQLContext is created.
2014-09-23 12:27:12 -07:00