spark-instrumented-optimizer/python/pyspark
Davies Liu 3cedc4f4d7 [SPARK-2871] [PySpark] add histgram() API
RDD.histogram(buckets)

        Compute a histogram using the provided buckets. The buckets
        are all open to the right except for the last which is closed.
        e.g. [1, 10, 20, 50] means the buckets are [1, 10) [10, 20) [20, 50],
        i.e. 1<=x<10, 10<=x<20, 20<=x<=50. For the input values 1 and
        50 we would have a histogram of 1, 0, 1.

        If your buckets are evenly spaced (e.g. [0, 10, 20, 30]),
        bucket lookup can be switched from an O(log n) insertion to an
        O(1) computation per element (where n = # buckets).
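
        For illustration, a minimal sketch of the two lookup strategies
        (the helpers bucket_index and even_bucket_index are hypothetical,
        not part of the API, and out-of-range values, which histogram()
        simply ignores, are not handled here):

            from bisect import bisect_right

            def bucket_index(x, buckets):
                # general case: O(log n) binary search over the sorted
                # bucket boundaries
                i = bisect_right(buckets, x) - 1
                # the last bucket is closed on the right
                return min(i, len(buckets) - 2)

            def even_bucket_index(x, minv, maxv, n):
                # evenly spaced case: O(1) arithmetic, n = # buckets
                i = int(float(x - minv) / (maxv - minv) * n)
                return min(i, n - 1)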

        Buckets must be sorted, must not contain any duplicates, and
        must contain at least two elements.

        If `buckets` is a number, it will generate buckets that are
        evenly spaced between the minimum and maximum of the RDD. For
        example, if the min value is 0 and the max is 100, given buckets
        as 2, the resulting buckets will be [0, 50) [50, 100]. buckets
        must be at least 1. An exception is raised if the RDD contains
        infinity or NaN. If the elements in the RDD do not vary
        (max == min), a single bucket is always returned.
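
        As an illustration, such evenly spaced boundaries could be
        derived from the min and max like this (a sketch only; the
        helper name even_buckets is hypothetical):

            def even_buckets(minv, maxv, n):
                # n + 1 boundaries spanning [minv, maxv], e.g.
                # even_buckets(0, 100, 2) -> [0, 50.0, 100]
                inc = float(maxv - minv) / n
                return [minv + i * inc for i in range(n)] + [maxv]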

        It returns a tuple of bucket boundaries and histogram counts.

        >>> rdd = sc.parallelize(range(51))
        >>> rdd.histogram(2)
        ([0, 25, 50], [25, 26])
        >>> rdd.histogram([0, 5, 25, 50])
        ([0, 5, 25, 50], [5, 20, 26])
        >>> rdd.histogram([0, 15, 30, 45, 60], True)
        ([0, 15, 30, 45, 60], [15, 15, 15, 6])
        >>> rdd = sc.parallelize(["ab", "ac", "b", "bd", "ef"])
        >>> rdd.histogram(("a", "b", "c"))
        (('a', 'b', 'c'), [2, 2])

Closes #122, which is a duplicate of this PR.

Author: Davies Liu <davies.liu@gmail.com>

Closes #2091 from davies/histgram and squashes the following commits:

a322f8a [Davies Liu] fix deprecation of e.message
84e85fa [Davies Liu] remove evenBuckets, add more tests (including str)
d9a0722 [Davies Liu] address comments
0e18a2d [Davies Liu] add histgram() API
2014-08-26 13:04:30 -07:00
mllib [SPARK-3136][MLLIB] Create Java-friendly methods in RandomRDDs 2014-08-19 16:06:48 -07:00
__init__.py [SPARK-2724] Python version of RandomRDDGenerators 2014-07-31 20:32:57 -07:00
accumulators.py [SPARK-2627] [PySpark] have the build enforce PEP 8 automatically 2014-08-06 12:58:24 -07:00
broadcast.py [SPARK-1065] [PySpark] improve supporting for large broadcast 2014-08-16 16:59:34 -07:00
cloudpickle.py [SPARK-791] [PySpark] fix pickle itemgetter with cloudpickle 2014-07-29 01:02:18 -07:00
conf.py [SPARK-2627] [PySpark] have the build enforce PEP 8 automatically 2014-08-06 12:58:24 -07:00
context.py [SPARK-1065] [PySpark] improve supporting for large broadcast 2014-08-16 16:59:34 -07:00
daemon.py [SPARK-2898] [PySpark] fix bugs in deamon.py 2014-08-10 13:00:38 -07:00
files.py [SPARK-2627] [PySpark] have the build enforce PEP 8 automatically 2014-08-06 12:58:24 -07:00
java_gateway.py [SPARK-3140] Clarify confusing PySpark exception message 2014-08-20 17:07:39 -07:00
join.py [SPARK-2470] PEP8 fixes to PySpark 2014-07-21 22:30:53 -07:00
rdd.py [SPARK-2871] [PySpark] add histgram() API 2014-08-26 13:04:30 -07:00
rddsampler.py [SPARK-2627] [PySpark] have the build enforce PEP 8 automatically 2014-08-06 12:58:24 -07:00
resultiterable.py [SPARK-2627] [PySpark] have the build enforce PEP 8 automatically 2014-08-06 12:58:24 -07:00
serializers.py [SPARK-2790] [PySpark] fix zip with serializers which have different batch sizes. 2014-08-19 14:46:32 -07:00
shell.py [SPARK-2470] PEP8 fixes to PySpark 2014-07-21 22:30:53 -07:00
shuffle.py [SPARK-2974] [SPARK-2975] Fix two bugs related to spark.local.dirs 2014-08-19 22:42:50 -07:00
sql.py [SQL] Using safe floating-point numbers in doctest 2014-08-16 11:26:51 -07:00
statcounter.py StatCounter on NumPy arrays [PYSPARK][SPARK-2012] 2014-08-01 22:33:25 -07:00
storagelevel.py [SPARK-2627] [PySpark] have the build enforce PEP 8 automatically 2014-08-06 12:58:24 -07:00
tests.py [SPARK-2871] [PySpark] add histgram() API 2014-08-26 13:04:30 -07:00
worker.py [SPARK-3114] [PySpark] Fix Python UDFs in Spark SQL. 2014-08-18 20:42:19 -07:00