spark-instrumented-optimizer

History

Marek Novotny a6e883feb3 [SPARK-23935][SQL] Adding map_entries function ## What changes were proposed in this pull request? This PR adds `map_entries` function that returns an unordered array of all entries in the given map. ## How was this patch tested? New tests added into: - `CollectionExpressionSuite` - `DataFrameFunctionsSuite` ## CodeGen examples ### Primitive types ``` val df = Seq(Map(1 -> 5, 2 -> 6)).toDF("m") df.filter('m.isNotNull).select(map_entries('m)).debugCodegen ``` Result: ``` /* 042 / boolean project_isNull_0 = false; / 043 / / 044 / ArrayData project_value_0 = null; / 045 / / 046 / final int project_numElements_0 = inputadapter_value_0.numElements(); / 047 / final ArrayData project_keys_0 = inputadapter_value_0.keyArray(); / 048 / final ArrayData project_values_0 = inputadapter_value_0.valueArray(); / 049 / / 050 / final long project_size_0 = UnsafeArrayData.calculateSizeOfUnderlyingByteArray( / 051 / project_numElements_0, / 052 / 32); / 053 / if (project_size_0 > 2147483632) { / 054 / final Object[] project_internalRowArray_0 = new Object[project_numElements_0]; / 055 / for (int z = 0; z < project_numElements_0; z++) { / 056 / project_internalRowArray_0[z] = new org.apache.spark.sql.catalyst.expressions.GenericInternalRow(new Object[]{project_keys_0.getInt(z), project_values_0.getInt(z)}); / 057 / } / 058 / project_value_0 = new org.apache.spark.sql.catalyst.util.GenericArrayData(project_internalRowArray_0); / 059 / / 060 / } else { / 061 / final byte[] project_arrayBytes_0 = new byte[(int)project_size_0]; / 062 / UnsafeArrayData project_unsafeArrayData_0 = new UnsafeArrayData(); / 063 / Platform.putLong(project_arrayBytes_0, 16, project_numElements_0); / 064 / project_unsafeArrayData_0.pointTo(project_arrayBytes_0, 16, (int)project_size_0); / 065 / / 066 / final int project_structsOffset_0 = UnsafeArrayData.calculateHeaderPortionInBytes(project_numElements_0) + project_numElements_0 8; /* 067 / UnsafeRow project_unsafeRow_0 = new UnsafeRow(2); / 068 / for (int z = 0; z < project_numElements_0; z++) { / 069 / long offset = project_structsOffset_0 + z 24L; /* 070 / project_unsafeArrayData_0.setLong(z, (offset << 32) + 24L); / 071 / project_unsafeRow_0.pointTo(project_arrayBytes_0, 16 + offset, 24); / 072 / project_unsafeRow_0.setInt(0, project_keys_0.getInt(z)); / 073 / project_unsafeRow_0.setInt(1, project_values_0.getInt(z)); / 074 / } / 075 / project_value_0 = project_unsafeArrayData_0; / 076 / / 077 / } ``` ### Non-primitive types ``` val df = Seq(Map("a" -> "foo", "b" -> null)).toDF("m") df.filter('m.isNotNull).select(map_entries('m)).debugCodegen ``` Result: ``` / 042 / boolean project_isNull_0 = false; / 043 / / 044 / ArrayData project_value_0 = null; / 045 / / 046 / final int project_numElements_0 = inputadapter_value_0.numElements(); / 047 / final ArrayData project_keys_0 = inputadapter_value_0.keyArray(); / 048 / final ArrayData project_values_0 = inputadapter_value_0.valueArray(); / 049 / / 050 / final Object[] project_internalRowArray_0 = new Object[project_numElements_0]; / 051 / for (int z = 0; z < project_numElements_0; z++) { / 052 / project_internalRowArray_0[z] = new org.apache.spark.sql.catalyst.expressions.GenericInternalRow(new Object[]{project_keys_0.getUTF8String(z), project_values_0.getUTF8String(z)}); / 053 / } / 054 */ project_value_0 = new org.apache.spark.sql.catalyst.util.GenericArrayData(project_internalRowArray_0); ``` Author: Marek Novotny <mn.mikke@gmail.com> Closes #21236 from mn-mikke/feature/array-api-map_entries-to-master.		2018-05-21 23:14:03 +09:00
..
ml	[SPARK-24058][ML][PYSPARK] Default Params in ML should be saved separately: Python API	2018-05-15 16:50:09 -07:00
mllib	[SPARK-15750][MLLIB][PYSPARK] Constructing FPGrowth fails when no numPartitions specified in pyspark	2018-05-07 14:47:58 -07:00
sql	[SPARK-23935][SQL] Adding map_entries function	2018-05-21 23:14:03 +09:00
streaming	[SPARK-24126][PYSPARK] Use build-specific temp directory for pyspark tests.	2018-05-07 13:00:18 +08:00
__init__.py	[SPARK-23328][PYTHON] Disallow default value None in na.replace/replace when 'to_replace' is not a dictionary	2018-02-09 14:21:10 +08:00
_globals.py	[SPARK-23328][PYTHON] Disallow default value None in na.replace/replace when 'to_replace' is not a dictionary	2018-02-09 14:21:10 +08:00
accumulators.py	[SPARK-23522][PYTHON] always use sys.exit over builtin exit	2018-03-08 20:38:34 +09:00
broadcast.py	[SPARK-23522][PYTHON] always use sys.exit over builtin exit	2018-03-08 20:38:34 +09:00
cloudpickle.py	[SPARK-24303][PYTHON] Update cloudpickle to v0.4.4	2018-05-18 09:53:24 -07:00
conf.py	[SPARK-23522][PYTHON] always use sys.exit over builtin exit	2018-03-08 20:38:34 +09:00
context.py	[SPARK-21945][YARN][PYTHON] Make --py-files work with PySpark shell in Yarn client mode	2018-05-17 12:07:58 +08:00
daemon.py	[PYSPARK] Update py4j to version 0.10.7.	2018-05-09 10:47:35 -07:00
files.py	[SPARK-3309] [PySpark] Put all public API in __all__	2014-09-03 11:49:45 -07:00
find_spark_home.py	[SPARK-23522][PYTHON] always use sys.exit over builtin exit	2018-03-08 20:38:34 +09:00
heapq3.py	[SPARK-23522][PYTHON] always use sys.exit over builtin exit	2018-03-08 20:38:34 +09:00
java_gateway.py	[PYSPARK] Update py4j to version 0.10.7.	2018-05-09 10:47:35 -07:00
join.py	[SPARK-14202] [PYTHON] Use generator expression instead of list comp in python_full_outer_jo…	2016-03-28 14:51:36 -07:00
profiler.py	[SPARK-23522][PYTHON] always use sys.exit over builtin exit	2018-03-08 20:38:34 +09:00
rdd.py	[PYSPARK] Update py4j to version 0.10.7.	2018-05-09 10:47:35 -07:00
rddsampler.py	[SPARK-4897] [PySpark] Python 3 support	2015-04-16 16:20:57 -07:00
resultiterable.py	[SPARK-3074] [PySpark] support groupByKey() with single huge key	2015-04-09 17:07:23 -07:00
serializers.py	[SPARK-23522][PYTHON] always use sys.exit over builtin exit	2018-03-08 20:38:34 +09:00
shell.py	[SPARK-19570][PYSPARK] Allow to disable hive in pyspark shell	2017-04-12 10:54:50 -07:00
shuffle.py	[SPARK-23522][PYTHON] always use sys.exit over builtin exit	2018-03-08 20:38:34 +09:00
statcounter.py	[SPARK-6919] [PYSPARK] Add asDict method to StatCounter	2015-09-29 13:38:15 -07:00
status.py	[SPARK-4172] [PySpark] Progress API in Python	2015-02-17 13:36:43 -08:00
storagelevel.py	[SPARK-13992][CORE][PYSPARK][FOLLOWUP] Update OFF_HEAP semantics for Java api and Python api	2016-04-12 23:06:55 -07:00
taskcontext.py	[SPARK-18576][PYTHON] Add basic TaskContext information to PySpark	2016-12-20 15:51:21 -08:00
tests.py	[SPARK-24131][PYSPARK][FOLLOWUP] Add majorMinorVersion API to PySpark for determining Spark versions	2018-05-08 21:22:54 +08:00
traceback_utils.py	[SPARK-1087] Move python traceback utilities into new traceback_utils.py file.	2014-09-15 19:28:17 -07:00
util.py	[SPARK-24131][PYSPARK][FOLLOWUP] Add majorMinorVersion API to PySpark for determining Spark versions	2018-05-08 21:22:54 +08:00
version.py	[SPARK-23028] Bump master branch version to 2.4.0-SNAPSHOT	2018-01-13 00:37:59 +08:00
worker.py	[SPARK-24262][PYTHON] Fix typo in UDF type match error message	2018-05-13 13:19:03 -07:00