spark-instrumented-optimizer

Zhenhua Wang 655f6f86f8 [SPARK-22208][SQL] Improve percentile_approx by not rounding up targetError and starting from index 0	2017-10-11 00:16:12 -07:00

## What changes were proposed in this pull request?

Currently, percentile_approx never returns the first element when the percentile is in (relativeError, 1/N], where relativeError defaults to 1/10000 and N is the total number of elements. Ideally, percentiles in [0, 1/N] should all return the first element as the answer.

For example, given input data 1 to 10, if a user queries the 10% (or an even lower) percentile, it should return 1, because the first value 1 already reaches 10%. Currently it returns 2.

Based on the paper, targetError is not rounded up, and the search index should start from 0 instead of 1. By following the paper, we should be able to fix the cases mentioned above.

## How was this patch tested?

Added a new test case and fixed existing test cases.

Author: Zhenhua Wang <wzh_zju@163.com>

Closes #19438 from wzhfy/improve_percentile_approx.
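The expected behavior described in the commit message can be illustrated with a small exact computation. This is only a sketch of the intended semantics, not Spark's implementation: percentile_approx works on an approximate summary, whereas the hypothetical helper `first_value_reaching` below (a name introduced here for illustration) scans a fully sorted list. It shows why starting the search at index 0 lets any percentile in [0, 1/N] return the first element.

```python
def first_value_reaching(sorted_values, percentage):
    """Return the first value whose cumulative share reaches `percentage`.

    Hypothetical helper for illustration only; it assumes `sorted_values`
    is a non-empty, ascending list and `percentage` is in [0, 1].
    """
    n = len(sorted_values)
    # Start from index 0: the first element alone already covers 1/n
    # of the data, so it must be a candidate answer.
    for i in range(n):
        if (i + 1) / n >= percentage:
            return sorted_values[i]
    # Only reached if percentage > 1; fall back to the largest value.
    return sorted_values[-1]


data = list(range(1, 11))  # input data 1 to 10, as in the commit message

print(first_value_reaching(data, 0.10))  # 1
print(first_value_reaching(data, 0.05))  # 1
print(first_value_reaching(data, 0.50))  # 5
```

A search that started at index 1 would skip the first element entirely and return 2 for the 10% query, which is the behavior the commit message reports as incorrect.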
ml	[SPARK-20679][ML] Support recommending for a subset of users/items in ALSModel	2017-10-09 10:42:33 +02:00
mllib	[SPARK-20862][MLLIB][PYTHON] Avoid passing float to ndarray.reshape in LogisticRegressionModel	2017-05-24 22:55:38 +08:00
sql	[SPARK-22208][SQL] Improve percentile_approx by not rounding up targetError and starting from index 0	2017-10-11 00:16:12 -07:00
streaming	[SPARK-22142][BUILD][STREAMING] Move Flume support behind a profile, take 2	2017-10-06 15:08:28 +01:00
__init__.py	[MINOR] Fix some typo of the document	2017-06-19 20:35:58 +01:00
accumulators.py	[SPARK-8652] [PYSPARK] Check return value for all uses of doctest.testmod()	2015-06-26 08:12:22 -07:00
broadcast.py	[SPARK-12717][PYTHON] Adding thread-safe broadcast pickle registry	2017-08-02 07:12:23 +09:00
cloudpickle.py	[SPARK-21070][PYSPARK] Attempt to update cloudpickle again	2017-08-22 11:17:53 +09:00
conf.py	[SPARK-18447][DOCS] Fix the markdown for `Note:`/`NOTE:`/`Note that` across Python API documentation	2016-11-22 11:40:18 +00:00
context.py	[SPARK-22043][PYTHON] Improves error message for show_profiles and dump_profiles	2017-09-18 13:20:11 +09:00
daemon.py	[SPARK-4897] [PySpark] Python 3 support	2015-04-16 16:20:57 -07:00
files.py	[SPARK-3309] [PySpark] Put all public API in __all__	2014-09-03 11:49:45 -07:00
find_spark_home.py	[SPARK-1267][SPARK-18129] Allow PySpark to be pip installed	2016-11-16 14:22:15 -08:00
heapq3.py	[SPARK-8652] [PYSPARK] Check return value for all uses of doctest.testmod()	2015-06-26 08:12:22 -07:00
java_gateway.py	[SPARK-1267][SPARK-18129] Allow PySpark to be pip installed	2016-11-16 14:22:15 -08:00
join.py	[SPARK-14202] [PYTHON] Use generator expression instead of list comp in python_full_outer_jo…	2016-03-28 14:51:36 -07:00
profiler.py	[SPARK-8652] [PYSPARK] Check return value for all uses of doctest.testmod()	2015-06-26 08:12:22 -07:00
rdd.py	[SPARK-21551][PYTHON] Increase timeout for PythonRDD.serveIterator	2017-08-09 14:03:18 -07:00
rddsampler.py	[SPARK-4897] [PySpark] Python 3 support	2015-04-16 16:20:57 -07:00
resultiterable.py	[SPARK-3074] [PySpark] support groupByKey() with single huge key	2015-04-09 17:07:23 -07:00
serializers.py	[MINOR] Fixed up pandas_udf related docs and formatting	2017-09-28 10:24:51 +09:00
shell.py	[SPARK-19570][PYSPARK] Allow to disable hive in pyspark shell	2017-04-12 10:54:50 -07:00
shuffle.py	[SPARK-10710] Remove ability to disable spilling in core and SQL	2015-09-19 21:40:21 -07:00
statcounter.py	[SPARK-6919] [PYSPARK] Add asDict method to StatCounter	2015-09-29 13:38:15 -07:00
status.py	[SPARK-4172] [PySpark] Progress API in Python	2015-02-17 13:36:43 -08:00
storagelevel.py	[SPARK-13992][CORE][PYSPARK][FOLLOWUP] Update OFF_HEAP semantics for Java api and Python api	2016-04-12 23:06:55 -07:00
taskcontext.py	[SPARK-18576][PYTHON] Add basic TaskContext information to PySpark	2016-12-20 15:51:21 -08:00
tests.py	[SPARK-22043][PYTHON] Improves error message for show_profiles and dump_profiles	2017-09-18 13:20:11 +09:00
traceback_utils.py	[SPARK-1087] Move python traceback utilities into new traceback_utils.py file.	2014-09-15 19:28:17 -07:00
util.py	[SPARK-19505][PYTHON] AttributeError on Exception.message in Python3	2017-04-11 12:18:31 -07:00
version.py	[MINOR] Bump SparkR and PySpark version to 2.3.0.	2017-06-19 11:13:03 +01:00
worker.py	[SPARK-20396][SQL][PYSPARK] groupby().apply() with pandas udf	2017-10-11 07:32:01 +09:00