spark-instrumented-optimizer/python/pyspark
Doris Xin 1de1d703bf SPARK-1939 Refactor takeSample method in RDD to use ScaSRS
Modified the takeSample method in RDD to use the ScaSRS sampling technique to improve performance. Added a private method that computes a sampling rate greater than sample_size / total, so that a single sampling pass yields a sufficient sample size with probability >= 0.9999. Added a unit test for the private method to validate the choice of sampling rate.
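For context, a minimal sketch of the kind of fraction computation the description refers to, based on the ScaSRS-style bounds; the helper name compute_fraction, the delta = 1e-4 failure bound, and the standard-deviation constants are illustrative assumptions, not the exact implementation in rdd.py:

    from math import log, sqrt

    def compute_fraction(sample_size, total, with_replacement):
        # Illustrative sketch: return a sampling fraction larger than
        # sample_size / total so that one pass collects at least
        # sample_size items with probability >= 0.9999.
        fraction = float(sample_size) / total
        if with_replacement:
            # Counts behave like a Poisson variable: pad the mean by a
            # few standard deviations (constants are assumptions).
            num_std_dev = 9 if sample_size < 12 else 5
            return fraction + num_std_dev * sqrt(fraction / total)
        # Bernoulli sampling: Chernoff-style padding of the binomial mean
        # with an assumed failure bound delta.
        delta = 1e-4
        gamma = -log(delta) / total
        return min(1.0, fraction + gamma + sqrt(gamma * gamma + 2 * gamma * fraction))

takeSample samples each partition with this inflated fraction in a single pass; the while loop mentioned in the commit log below handles the rare case where the collected sample still falls short and a resample is needed.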

Author: Doris Xin <doris.s.xin@gmail.com>
Author: dorx <doris.s.xin@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #916 from dorx/takeSample and squashes the following commits:

5b061ae [Doris Xin] merge master
444e750 [Doris Xin] edge cases
3de882b [dorx] Merge pull request #2 from mengxr/SPARK-1939
82dde31 [Xiangrui Meng] update pyspark's takeSample
48d954d [Doris Xin] remove unused imports from RDDSuite
fb1452f [Doris Xin] allowing num to be greater than count in all cases
1481b01 [Doris Xin] washing test tubes and making coffee
dc699f3 [Doris Xin] give back imports removed by accident in rdd.py
64e445b [Doris Xin] log warning as soon as it enters the while loop
55518ed [Doris Xin] added TODO for logging in rdd.py
eff89e2 [Doris Xin] addressed reviewer comments.
ecab508 [Doris Xin] "fixed checkstyle violation
0a9b3e3 [Doris Xin] "reviewer comment addressed"
f80f270 [Doris Xin] Merge branch 'master' into takeSample
ae3ad04 [Doris Xin] fixed edge cases to prevent overflow
065ebcd [Doris Xin] Merge branch 'master' into takeSample
9bdd36e [Doris Xin] Check sample size and move computeFraction
e3fd6a6 [Doris Xin] Merge branch 'master' into takeSample
7cab53a [Doris Xin] fixed import bug in rdd.py
ffea61a [Doris Xin] SPARK-1939: Refactor takeSample method in RDD
1441977 [Doris Xin] SPARK-1939 Refactor takeSample method in RDD to use ScaSRS
2014-06-12 19:44:27 -07:00
mllib [SPARK-2091][MLLIB] use numpy.dot instead of ndarray.dot 2014-06-11 00:22:40 -07:00
__init__.py SPARK-1004. PySpark on YARN 2014-04-29 23:24:34 -07:00
accumulators.py Add custom serializer support to PySpark. 2013-11-10 16:45:38 -08:00
broadcast.py Fix some Python docs and make sure to unset SPARK_TESTING in Python 2013-12-29 20:15:07 -05:00
cloudpickle.py SPARK-1917: fix PySpark import of scipy.special functions 2014-05-31 14:59:09 -07:00
conf.py [FIX] do not load defaults when testing SparkConf in pyspark 2014-05-14 14:57:17 -07:00
context.py SPARK-1416: PySpark support for SequenceFile and Hadoop InputFormats 2014-06-09 22:21:03 -07:00
daemon.py SPARK-1579: Clean up PythonRDD and avoid swallowing IOExceptions 2014-05-07 09:48:31 -07:00
files.py Initial work to rename package to org.apache.spark 2013-09-01 14:13:13 -07:00
java_gateway.py [SPARK-1808] Route bin/pyspark through Spark submit 2014-05-16 22:34:38 -07:00
join.py Spark 1271: Co-Group and Group-By should pass Iterable[X] 2014-04-08 18:15:59 -07:00
rdd.py SPARK-1939 Refactor takeSample method in RDD to use ScaSRS 2014-06-12 19:44:27 -07:00
rddsampler.py SPARK-1438 RDD.sample() make seed param optional 2014-04-24 17:27:16 -07:00
resultiterable.py Spark 1271: Co-Group and Group-By should pass Iterable[X] 2014-04-08 18:15:59 -07:00
serializers.py SPARK-1421. Make MLlib work on Python 2.6 2014-04-05 20:52:05 -07:00
shell.py [SPARK-1808] Route bin/pyspark through Spark submit 2014-05-16 22:34:38 -07:00
sql.py HOTFIX: PySpark tests should be order insensitive. 2014-06-11 15:54:41 -07:00
statcounter.py Spark 1246 add min max to stat counter 2014-03-18 00:45:47 -07:00
storagelevel.py SPARK-1305: Support persisting RDD's directly to Tachyon 2014-04-04 20:38:20 -07:00
tests.py SPARK-554. Add aggregateByKey. 2014-06-12 08:14:25 -07:00
worker.py Add Python includes to path before depickling broadcast values 2014-05-10 13:02:13 -07:00