spark-instrumented-optimizer/python/pyspark
Davies Liu 1c53a5db99 [SPARK-4439] [MLlib] add python api for random forest
```
    class RandomForestModel
     |  A model trained by RandomForest.
     |
     |  numTrees(self)
     |      Get number of trees in forest.
     |
     |  predict(self, x)
     |      Predict values for a single data point or an RDD of points using the trained model.
     |
     |  toDebugString(self)
     |      Full description of the model, including every tree in the forest.
     |
     |  totalNumNodes(self)
     |      Get total number of nodes, summed over all trees in the forest.
     |

    class RandomForest
     |  trainClassifier(cls, data, numClassesForClassification, categoricalFeaturesInfo, numTrees, featureSubsetStrategy='auto', impurity='gini', maxDepth=4, maxBins=32, seed=None):
     |      Method to train a random forest model for binary or multiclass classification.
     |
     |      :param data: Training dataset: RDD of LabeledPoint.
     |                   Labels should take values {0, 1, ..., numClasses-1}.
     |      :param numClassesForClassification: Number of classes for classification.
     |      :param categoricalFeaturesInfo: Map storing arity of categorical features.
     |                                  E.g., an entry (n -> k) indicates that feature n is categorical
     |                                  with k categories indexed from 0: {0, 1, ..., k-1}.
     |      :param numTrees: Number of trees in the random forest.
     |      :param featureSubsetStrategy: Number of features to consider for splits at each node.
     |                                Supported: "auto" (default), "all", "sqrt", "log2", "onethird".
     |                                If "auto" is set, this parameter is set based on numTrees:
     |                                  if numTrees == 1, set to "all";
     |                                  if numTrees > 1 (forest), set to "sqrt".
     |      :param impurity: Criterion used for information gain calculation.
     |                   Supported values: "gini" (recommended) or "entropy".
     |      :param maxDepth: Maximum depth of the tree. E.g., depth 0 means 1 leaf node; depth 1 means
     |                       1 internal node + 2 leaf nodes. (default: 4)
     |      :param maxBins: Maximum number of bins used for splitting features (default: 32)
     |      :param seed:  Random seed for bootstrapping and choosing feature subsets.
     |      :return: RandomForestModel that can be used for prediction
     |
     |  trainRegressor(cls, data, categoricalFeaturesInfo, numTrees, featureSubsetStrategy='auto', impurity='variance', maxDepth=4, maxBins=32, seed=None):
     |      Method to train a random forest model for regression.
     |
     |      :param data: Training dataset: RDD of LabeledPoint.
     |                   Labels are real numbers.
     |      :param categoricalFeaturesInfo: Map storing arity of categorical features.
     |                                   E.g., an entry (n -> k) indicates that feature n is categorical
     |                                   with k categories indexed from 0: {0, 1, ..., k-1}.
     |      :param numTrees: Number of trees in the random forest.
     |      :param featureSubsetStrategy: Number of features to consider for splits at each node.
     |                                 Supported: "auto" (default), "all", "sqrt", "log2", "onethird".
     |                                 If "auto" is set, this parameter is set based on numTrees:
     |                                 if numTrees == 1, set to "all";
     |                                 if numTrees > 1 (forest), set to "onethird".
     |      :param impurity: Criterion used for information gain calculation.
     |                       Supported values: "variance".
     |      :param maxDepth: Maximum depth of the tree. E.g., depth 0 means 1 leaf node; depth 1 means
     |                       1 internal node + 2 leaf nodes. (default: 4)
     |      :param maxBins: Maximum number of bins used for splitting features (default: 32)
     |      :param seed:  Random seed for bootstrapping and choosing feature subsets.
     |      :return: RandomForestModel that can be used for prediction
     |
```
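
A minimal usage sketch of the new classification API above. The SparkContext setup and toy dataset are illustrative assumptions, not part of this commit:

```
from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import RandomForest

# Hypothetical setup for illustration only.
sc = SparkContext(appName="RandomForestSketch")

# Toy training set: labels in {0, 1}, two continuous features.
data = sc.parallelize([
    LabeledPoint(0.0, [0.0, 1.0]),
    LabeledPoint(0.0, [1.0, 1.0]),
    LabeledPoint(1.0, [2.0, 0.0]),
    LabeledPoint(1.0, [3.0, 0.0]),
])

# An empty categoricalFeaturesInfo map means all features are continuous.
model = RandomForest.trainClassifier(
    data, numClassesForClassification=2, categoricalFeaturesInfo={},
    numTrees=3, seed=42)

print(model.numTrees())        # 3
print(model.totalNumNodes())   # node count summed over all 3 trees
print(model.predict([2.0, 0.0]))                                # single point
print(model.predict(data.map(lambda p: p.features)).collect())  # RDD of points
```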
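
A companion sketch for the regression trainer, reusing the assumed setup above (real-valued labels would be typical; the 0/1 labels just keep the example short):

```
# "variance" is the only supported impurity for regression forests.
model = RandomForest.trainRegressor(
    data, categoricalFeaturesInfo={}, numTrees=3,
    featureSubsetStrategy="auto", impurity="variance", seed=42)

print(model.predict([2.0, 0.0]))  # predicted real value for one point
print(model.toDebugString())      # full description of every tree
```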

Author: Davies Liu <davies@databricks.com>

Closes #3320 from davies/forest and squashes the following commits:

8003dfc [Davies Liu] reorder
53cf510 [Davies Liu] fix docs
4ca593d [Davies Liu] fix docs
e0df852 [Davies Liu] fix docs
0431746 [Davies Liu] rebased
2b6f239 [Davies Liu] Merge branch 'master' of github.com:apache/spark into forest
885abee [Davies Liu] address comments
dae7fc0 [Davies Liu] address comments
89a000f [Davies Liu] fix docs
565d476 [Davies Liu] add python api for random forest
mllib [SPARK-4439] [MLlib] add python api for random forest 2014-11-20 15:31:28 -08:00
streaming [DOC][PySpark][Streaming] Fix docstring for sphinx 2014-11-19 14:23:18 -08:00
__init__.py [SPARK-4348] [PySpark] [MLlib] rename random.py to rand.py 2014-11-13 10:24:54 -08:00
accumulators.py [SPARK-3478] [PySpark] Profile the Python tasks 2014-09-30 18:24:57 -07:00
broadcast.py [SPARK-3721] [PySpark] broadcast objects larger than 2G 2014-11-18 16:17:51 -08:00
cloudpickle.py [SPARK-3679] [PySpark] pickle the exact globals of functions 2014-09-24 13:00:05 -07:00
conf.py [SPARK-3412] [PySpark] Replace Epydoc with Sphinx to generate Python API docs 2014-10-07 18:09:27 -07:00
context.py [SPARK-3721] [PySpark] broadcast objects larger than 2G 2014-11-18 16:17:51 -08:00
daemon.py [SPARK-4088] [PySpark] Python worker should exit after socket is closed by JVM 2014-10-25 01:20:39 -07:00
files.py [SPARK-3309] [PySpark] Put all public API in __all__ 2014-09-03 11:49:45 -07:00
heapq3.py [SPARK-3073] [PySpark] use external sort in sortBy() and sortByKey() 2014-08-26 16:57:40 -07:00
java_gateway.py [SPARK-4415] [PySpark] JVM should exit after Python exit 2014-11-14 20:14:33 -08:00
join.py [SPARK-546] Add full outer join to RDD and DStream. 2014-09-24 20:39:09 -07:00
rdd.py [SPARK-4327] [PySpark] Python API for RDD.randomSplit() 2014-11-18 16:37:35 -08:00
rddsampler.py [SPARK-4327] [PySpark] Python API for RDD.randomSplit() 2014-11-18 16:37:35 -08:00
resultiterable.py [SPARK-2627] [PySpark] have the build enforce PEP 8 automatically 2014-08-06 12:58:24 -07:00
serializers.py [SPARK-3721] [PySpark] broadcast objects larger than 2G 2014-11-18 16:17:51 -08:00
shell.py [SPARK-3273][SPARK-3301]We should read the version information from the same place 2014-09-06 15:08:43 -07:00
shuffle.py [SPARK-4384] [PySpark] improve sort spilling 2014-11-19 15:45:37 -08:00
sql.py [SPARK-4228][SQL] SchemaRDD to JSON 2014-11-20 13:44:19 -08:00
statcounter.py StatCounter on NumPy arrays [PYSPARK][SPARK-2012] 2014-08-01 22:33:25 -07:00
storagelevel.py [SPARK-3417] Use new-style classes in PySpark 2014-09-08 15:45:36 -07:00
tests.py [SPARK-3721] [PySpark] broadcast objects larger than 2G 2014-11-18 16:17:51 -08:00
traceback_utils.py [SPARK-1087] Move python traceback utilities into new traceback_utils.py file. 2014-09-15 19:28:17 -07:00
worker.py [SPARK-3721] [PySpark] broadcast objects larger than 2G 2014-11-18 16:17:51 -08:00