ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
gatorsmile	446c45bd87	[SPARK-14182][SQL] Parse DDL Command: Alter View This PR provides native parsing support for the DDL command `Alter View`. Since its AST trees are highly similar to those of `Alter Table`, both implementations are integrated into one. Based on the Hive DDL documents: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL and https://cwiki.apache.org/confluence/display/Hive/PartitionedViews Syntax: ```SQL ALTER VIEW view_name RENAME TO new_view_name ``` - to change the name of a view to a different name Syntax: ```SQL ALTER VIEW view_name SET TBLPROPERTIES ('comment' = new_comment); ``` - to add metadata to a view Syntax: ```SQL ALTER VIEW view_name UNSET TBLPROPERTIES [IF EXISTS] ('comment', 'key') ``` - to remove metadata from a view Syntax: ```SQL ALTER VIEW view_name ADD [IF NOT EXISTS] PARTITION spec1[, PARTITION spec2, ...] ``` - to add the partitioning metadata for a view. - the syntax of partition spec in `ALTER VIEW` is identical to `ALTER TABLE`, EXCEPT that it is ILLEGAL to specify a `LOCATION` clause. Syntax: ```SQL ALTER VIEW view_name DROP [IF EXISTS] PARTITION spec1[, PARTITION spec2, ...] ``` - to drop the related partition metadata for a view. Added the related test cases to `DDLCommandSuite` Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #11987 from gatorsmile/parseAlterView.	2016-03-31 12:04:03 -07:00
Sameer Agarwal	3586929320	[SPARK-14278][SQL] Initialize columnar batch with proper memory mode ## What changes were proposed in this pull request? Fixes a minor bug in the record reader constructor that was possibly introduced during refactoring. ## How was this patch tested? N/A Author: Sameer Agarwal <sameer@databricks.com> Closes #12070 from sameeragarwal/vectorized-rr.	2016-03-31 11:56:28 -07:00
Sameer Agarwal	8d6207206c	[SPARK-14263][SQL] Benchmark Vectorized HashMap for GroupBy Aggregates ## What changes were proposed in this pull request? This PR proposes a new data-structure based on a vectorized hashmap that can be potentially _codegened_ in `TungstenAggregate` to speed up aggregates with group by. Micro-benchmarks show a 10x improvement over the current `BytesToBytes` aggregation map. ## How was this patch tested?
Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz
BytesToBytesMap:               Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
hash                                   108 / 119         96.9          10.3       1.0X
fast hash                                63 / 70        166.2           6.0       1.7X
arrayEqual                               70 / 73        150.8           6.6       1.6X
Java HashMap (Long)                    141 / 200         74.3          13.5       0.8X
Java HashMap (two ints)                145 / 185         72.3          13.8       0.7X
Java HashMap (UnsafeRow)               499 / 524         21.0          47.6       0.2X
BytesToBytesMap (off Heap)             483 / 548         21.7          46.0       0.2X
BytesToBytesMap (on Heap)              485 / 562         21.6          46.2       0.2X
Vectorized Hashmap                       54 / 60        193.7           5.2       2.0X
Author: Sameer Agarwal <sameer@databricks.com> Closes #12055 from sameeragarwal/vectorized-hashmap.	2016-03-31 11:53:13 -07:00
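The columns of the SPARK-14263 benchmark table appear to be related by simple arithmetic. The sketch below is a reading of the table, not something the commit states: it assumes `Per Row(ns)` is 1000 divided by `Rate(M/s)` and that `Relative` is each row's rate divided by the rate of the first (`hash`) row. The rates are copied from the table; the function names are illustrative only.

```python
# Rates (million rows per second) copied from the SPARK-14263 benchmark table.
rates = {
    "hash": 96.9,
    "fast hash": 166.2,
    "arrayEqual": 150.8,
    "Java HashMap (Long)": 74.3,
    "Java HashMap (two ints)": 72.3,
    "Java HashMap (UnsafeRow)": 21.0,
    "BytesToBytesMap (off Heap)": 21.7,
    "BytesToBytesMap (on Heap)": 21.6,
    "Vectorized Hashmap": 193.7,
}


def per_row_ns(rate_m_per_s):
    # 1e9 ns per second / (rate * 1e6 rows per second) = 1000 / rate
    return 1000.0 / rate_m_per_s


def speedup(name, over):
    # Ratio of throughputs; assumed to be how the "Relative" column is derived.
    return rates[name] / rates[over]


for name, rate in rates.items():
    print(f"{name:28s} {per_row_ns(rate):6.1f} ns/row  {speedup(name, 'hash'):4.1f}X")

print(speedup("Vectorized Hashmap", "BytesToBytesMap (off Heap)"))
```

Because the published figures are rounded, the recomputed values match the table only to within rounding. Under this reading, the last line gives the ratio between the vectorized hashmap and the off-heap `BytesToBytesMap` rows, which is a little under 9 for these numbers.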
Herman van Hovell	a9b93e0739	[SPARK-14211][SQL] Remove ANTLR3 based parser ### What changes were proposed in this pull request? This PR removes the ANTLR3-based parser and moves the new ANTLR4-based parser into the `org.apache.spark.sql.catalyst.parser` package. ### How was this patch tested? Existing unit tests. cc rxin andrewor14 yhuai Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #12071 from hvanhovell/SPARK-14211.	2016-03-31 09:25:09 -07:00
Cheng Lian	26445c2e47	[SPARK-14206][SQL] buildReader() implementation for CSV ## What changes were proposed in this pull request? Major changes: 1. Implement `FileFormat.buildReader()` for the CSV data source. 2. Add an extra argument to `FileFormat.buildReader()`, `physicalSchema`, which is either the result of `FileFormat.inferSchema` or the user-specified schema. This argument is necessary because the CSV data source needs to know all the columns of the underlying files in order to read them. ## How was this patch tested? Existing tests should cover this change. Author: Cheng Lian <lian@databricks.com> Closes #12002 from liancheng/spark-14206-csv-build-reader.	2016-03-30 18:21:06 -07:00
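The reason SPARK-14206 gives for the extra `physicalSchema` argument can be shown with a small sketch: a CSV line carries no column names, so a reader can only pick out the required columns if it knows the full file layout. This is a toy in plain Python, not Spark's `FileFormat.buildReader()`; the function name `build_reader` and its parameters are hypothetical.

```python
import csv
import io


def build_reader(physical_schema, required_columns):
    """Return a function that parses CSV text and yields only the required columns.

    physical_schema: names of ALL columns in the file, in file order.
    required_columns: the subset of names the query actually needs.
    """
    # Field positions can only be resolved when the full file layout is known.
    positions = [physical_schema.index(name) for name in required_columns]

    def read(text):
        for fields in csv.reader(io.StringIO(text)):
            yield tuple(fields[i] for i in positions)

    return read


reader = build_reader(["id", "name", "city"], ["city", "id"])
rows = list(reader("1,Ann,Oslo\n2,Bob,Lima\n"))
print(rows)  # [('Oslo', '1'), ('Lima', '2')]
```

Given only the required column names `["city", "id"]` and no physical schema, the reader would have no way to know that `city` is the third field of each line.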
Travis Crawford	da54abfd87	[SPARK-14081][SQL] - Preserve DataFrame column types when filling nulls. ## What changes were proposed in this pull request? This change resolves an issue where `DataFrameNaFunctions.fill` changes a `FloatType` column to a `DoubleType`. We also clarify the contract that replacement values will be cast to the column data type, which may change the replacement value when casting to a lower precision type. ## How was this patch tested? This patch has associated unit tests. Author: Travis Crawford <travis@medium.com> Closes #11967 from traviscrawford/SPARK-14081-dataframena.	2016-03-30 16:59:52 -07:00
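The contract described in SPARK-14081, that a replacement value is cast to the column type and may change when the type has lower precision, can be illustrated without Spark. The sketch below uses Python's `struct` module to round-trip a 64-bit float through a 32-bit float; `fill_nulls` and `cast_to_float32` are hypothetical helpers, not `DataFrameNaFunctions.fill`.

```python
import struct


def cast_to_float32(value):
    """Round-trip a Python float (64-bit) through a 32-bit IEEE 754 float."""
    return struct.unpack("f", struct.pack("f", value))[0]


def fill_nulls(column, replacement, cast):
    """Replace None entries, casting the replacement to the column's type first."""
    casted = cast(replacement)
    return [casted if v is None else v for v in column]


column = [1.5, None, 2.25]  # these values are exactly representable in float32
filled = fill_nulls(column, 0.1, cast_to_float32)
print(filled[1] == 0.1)  # False: 0.1 is not exactly representable in float32
```

The filled value is the nearest 32-bit float to 0.1, so it differs slightly from the 64-bit replacement that was passed in, while the column keeps its original type.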
Dongjoon Hyun	258a243419	[SPARK-14282][SQL] CodeFormatter should handle oneline comment with /* */ properly ## What changes were proposed in this pull request? This PR improves `CodeFormatter` to fix the following malformed indentations. ```java /* 019 */ public java.lang.Object apply(java.lang.Object _i) { /* 020 */ InternalRow i = (InternalRow) _i; /* 021 */ /* createexternalrow(if (isnull(input[0, double])) null else input[0, double], if (isnull(input[1, int])) null else input[1, int], ... */ /* 022 */ boolean isNull = false; /* 023 */ final Object[] values = new Object[2]; /* 024 */ /* if (isnull(input[0, double])) null else input[0, double] */ /* 025 */ /* isnull(input[0, double]) */ ... /* 053 */ if (!false && false) { /* 054 */ /* null */ /* 055 */ final int value9 = -1; /* 056 */ isNull6 = true; /* 057 */ value6 = value9; /* 058 */ } else { ... /* 077 */ return mutableRow; /* 078 */ } /* 079 */ } /* 080 */ ``` After this PR, the code will be formatted like the following. ```java /* 019 */ public java.lang.Object apply(java.lang.Object _i) { /* 020 */ InternalRow i = (InternalRow) _i; /* 021 */ /* createexternalrow(if (isnull(input[0, double])) null else input[0, double], if (isnull(input[1, int])) null else input[1, int], ... */ /* 022 */ boolean isNull = false; /* 023 */ final Object[] values = new Object[2]; /* 024 */ /* if (isnull(input[0, double])) null else input[0, double] */ /* 025 */ /* isnull(input[0, double]) */ ... /* 053 */ if (!false && false) { /* 054 */ /* null */ /* 055 */ final int value9 = -1; /* 056 */ isNull6 = true; /* 057 */ value6 = value9; /* 058 */ } else { ... /* 077 */ return mutableRow; /* 078 */ } /* 079 */ } /* 080 */ ``` Also, this issue fixes the following too. 
(Similar with [SPARK-14185](https://issues.apache.org/jira/browse/SPARK-14185)) ```java 16/03/30 12:39:24 DEBUG WholeStageCodegen: /* 001 */ public Object generate(Object[] references) { /* 002 */ return new GeneratedIterator(references); /* 003 */ } ``` ```java 16/03/30 12:46:32 DEBUG WholeStageCodegen: /* 001 */ public Object generate(Object[] references) { /* 002 */ return new GeneratedIterator(references); /* 003 */ } ``` ## How was this patch tested? Pass the Jenkins tests (including new CodeFormatterSuite testcases.) Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12072 from dongjoon-hyun/SPARK-14282.	2016-03-30 16:15:37 -07:00
Takeshi YAMAMURO	dadf0138b3	[SPARK-14259][SQL] Add a FileSourceStrategy option for limiting #files in a partition ## What changes were proposed in this pull request? This PR adds a config that caps the number of files packed into a partition, since even small files have a non-trivial fixed cost. The current packing can put many small files together, which causes straggler tasks. ## How was this patch tested? I added tests to check that many files get split into partitions in FileSourceStrategySuite. Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes #12068 from maropu/SPARK-14259.	2016-03-30 16:02:48 -07:00
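The effect of a per-partition file limit, as motivated in SPARK-14259, can be sketched with a simplified greedy packer. This is not the `FileSourceStrategy` implementation: the function and the parameter names `max_bytes` and `max_files` are hypothetical, and the sketch only considers a size budget and a file-count budget.

```python
def pack_files(file_sizes, max_bytes, max_files):
    """Greedily pack files into partitions.

    A partition is closed when adding the next file would exceed max_bytes,
    or when the partition already holds max_files files.
    """
    partitions, current, current_bytes = [], [], 0
    for size in file_sizes:
        too_big = len(current) > 0 and current_bytes + size > max_bytes
        too_many = len(current) >= max_files
        if too_big or too_many:
            partitions.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        partitions.append(current)
    return partitions


small_files = [1] * 10
# Without a meaningful file limit, all ten small files land in one partition.
print(pack_files(small_files, max_bytes=100, max_files=10**9))
# With a limit of 4 files, the same input is spread over three partitions.
print(pack_files(small_files, max_bytes=100, max_files=4))
```

With only a byte budget, a task can end up opening a very large number of tiny files; the file-count cap bounds that fixed per-file cost for each task.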
Wenchen Fan	d46c71b39d	[SPARK-14268][SQL] rename toRowExpressions and fromRowExpression to serializer and deserializer in ExpressionEncoder ## What changes were proposed in this pull request? In `ExpressionEncoder`, we use `constructorFor` to build `fromRowExpression` as the `deserializer` in `ObjectOperator`. This is confusing; we should make the names consistent. ## How was this patch tested? Existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #12058 from cloud-fan/rename.	2016-03-30 11:03:15 -07:00
Wenchen Fan	816f359cf0	[SPARK-14114][SQL] implement buildReader for text data source ## What changes were proposed in this pull request? This PR implements `buildReader` for the text data source and enables it in the new data source code path. ## How was this patch tested? Existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #11934 from cloud-fan/text.	2016-03-30 17:32:53 +08:00
gatorsmile	b66b97cd04	[SPARK-14124][SQL] Implement Database-related DDL Commands #### What changes were proposed in this pull request? This PR implements the following four Database-related DDL commands: - `CREATE DATABASE|SCHEMA [IF NOT EXISTS] database_name` - `DROP DATABASE [IF EXISTS] database_name [RESTRICT|CASCADE]` - `DESCRIBE DATABASE [EXTENDED] db_name` - `ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...)` Another PR will be submitted to handle the unsupported commands. Among the Database-related DDL commands, we will raise an exception for `ALTER (DATABASE|SCHEMA) database_name SET OWNER [USER|ROLE] user_or_role`. cc yhuai andrewor14 rxin Could you review the changes? Is it in the right direction? Thanks! #### How was this patch tested? Added a few test cases in `command/DDLSuite.scala` for testing DDL command execution in `SQLContext`. Since `HiveContext` also shares the same implementation, the existing test cases in `\hive` also verify the correctness of these commands. Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #12009 from gatorsmile/dbDDL.	2016-03-29 17:39:52 -07:00
Davies Liu	a7a93a116d	[SPARK-14215] [SQL] [PYSPARK] Support chained Python UDFs ## What changes were proposed in this pull request? This PR adds support for chained Python UDFs, for example: ```sql select udf1(udf2(a)) select udf1(udf2(a) + 3) select udf1(udf2(a) + udf3(b)) ``` Directly chained unary Python UDFs are put in a single batch of Python UDFs; others may require multiple batches. For example, ```python >>> sqlContext.sql("select double(double(1))").explain() == Physical Plan == WholeStageCodegen : +- Project [pythonUDF#10 AS double(double(1))#9] : +- INPUT +- !BatchPythonEvaluation double(double(1)), [pythonUDF#10] +- Scan OneRowRelation[] >>> sqlContext.sql("select double(double(1) + double(2))").explain() == Physical Plan == WholeStageCodegen : +- Project [pythonUDF#19 AS double((double(1) + double(2)))#16] : +- INPUT +- !BatchPythonEvaluation double((pythonUDF#17 + pythonUDF#18)), [pythonUDF#17,pythonUDF#18,pythonUDF#19] +- !BatchPythonEvaluation double(2), [pythonUDF#17,pythonUDF#18] +- !BatchPythonEvaluation double(1), [pythonUDF#17] +- Scan OneRowRelation[] ``` TODO: will support multiple unrelated Python UDFs in one batch (another PR). ## How was this patch tested? Added new unit tests for chained UDFs. Author: Davies Liu <davies@databricks.com> Closes #12014 from davies/py_udfs.	2016-03-29 15:06:29 -07:00
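The idea in SPARK-14215 of putting directly chained unary UDFs into a single batch can be illustrated in plain Python. This is only a conceptual sketch and says nothing about how `BatchPythonEvaluation` is implemented; `fuse_chain` and `evaluate_batch` are hypothetical names.

```python
def fuse_chain(*udfs):
    """Compose unary functions given outermost-first.

    fuse_chain(udf1, udf2)(x) evaluates udf1(udf2(x)).
    """
    def fused(x):
        for f in reversed(udfs):
            x = f(x)
        return x
    return fused


def evaluate_batch(fn, rows):
    """Stand-in for one batch evaluation: a single pass over the input rows."""
    return [fn(row) for row in rows]


def double(x):
    return x * 2


# select double(double(a)): one fused pass instead of two separate passes.
print(evaluate_batch(fuse_chain(double, double), [1, 2, 3]))  # [4, 8, 12]
```

An expression such as `udf1(udf2(a) + udf3(b))` is not a direct unary chain, which matches the commit's remark that such cases may need more than one batch.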
Eric Liang	e58c4cb3c5	[SPARK-14227][SQL] Add method for printing out generated code for debugging ## What changes were proposed in this pull request? This adds `debugCodegen` to the debug package for query execution. ## How was this patch tested? Unit and manual testing. Output example: ``` scala> import org.apache.spark.sql.execution.debug._ import org.apache.spark.sql.execution.debug._ scala> sqlContext.range(100).groupBy("id").count().orderBy("id").debugCodegen() Found 3 WholeStageCodegen subtrees. == Subtree 1 / 3 == WholeStageCodegen : +- TungstenAggregate(key=[id#0L], functions=[(count(1),mode=Partial,isDistinct=false)], output=[id#0L,count#9L]) : +- Range 0, 1, 1, 100, [id#0L] Generated code: /* 001 / public Object generate(Object[] references) { / 002 / return new GeneratedIterator(references); / 003 / } / 004 / / 005 / /* Codegened pipeline for: /* 006 / TungstenAggregate(key=[id#0L], functions=[(count(1),mode=Partial,isDistinct=false)], output=[id#0L,count#9L]) /* 007 / +- Range 0, 1, 1, 100, [id#0L] / 008 / / /* 009 / final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator { / 010 / private Object[] references; / 011 / private boolean agg_initAgg; / 012 / private org.apache.spark.sql.execution.aggregate.TungstenAggregate agg_plan; / 013 / private org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap agg_hashMap; / 014 / private org.apache.spark.sql.execution.UnsafeKVExternalSorter agg_sorter; / 015 / private org.apache.spark.unsafe.KVIterator agg_mapIter; / 016 / private org.apache.spark.sql.execution.metric.LongSQLMetric range_numOutputRows; / 017 / private org.apache.spark.sql.execution.metric.LongSQLMetricValue range_metricValue; / 018 / private boolean range_initRange; / 019 / private long range_partitionEnd; / 020 / private long range_number; / 021 / private boolean range_overflow; / 022 / private scala.collection.Iterator range_input; / 023 / private UnsafeRow range_result; / 024 / private 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder range_holder; / 025 / private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter range_rowWriter; / 026 / private UnsafeRow agg_result; / 027 / private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder agg_holder; / 028 / private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter agg_rowWriter; / 029 / private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowJoiner agg_unsafeRowJoiner; / 030 / private org.apache.spark.sql.execution.metric.LongSQLMetric wholestagecodegen_numOutputRows; / 031 / private org.apache.spark.sql.execution.metric.LongSQLMetricValue wholestagecodegen_metricValue; / 032 / / 033 / public GeneratedIterator(Object[] references) { / 034 / this.references = references; / 035 / } / 036 / / 037 / public void init(scala.collection.Iterator inputs[]) { / 038 / agg_initAgg = false; / 039 / this.agg_plan = (org.apache.spark.sql.execution.aggregate.TungstenAggregate) references[0]; / 040 / agg_hashMap = agg_plan.createHashMap(); / 041 / / 042 / this.range_numOutputRows = (org.apache.spark.sql.execution.metric.LongSQLMetric) references[1]; / 043 / range_metricValue = (org.apache.spark.sql.execution.metric.LongSQLMetricValue) range_numOutputRows.localValue(); / 044 / range_initRange = false; / 045 / range_partitionEnd = 0L; / 046 / range_number = 0L; / 047 / range_overflow = false; / 048 / range_input = inputs[0]; / 049 / range_result = new UnsafeRow(1); / 050 / this.range_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(range_result, 0); / 051 / this.range_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(range_holder, 1); / 052 / agg_result = new UnsafeRow(1); / 053 / this.agg_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(agg_result, 0); / 054 / this.agg_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(agg_holder, 1); / 
055 / agg_unsafeRowJoiner = agg_plan.createUnsafeJoiner(); / 056 / this.wholestagecodegen_numOutputRows = (org.apache.spark.sql.execution.metric.LongSQLMetric) references[2]; / 057 / wholestagecodegen_metricValue = (org.apache.spark.sql.execution.metric.LongSQLMetricValue) wholestagecodegen_numOutputRows.localValue(); / 058 / } / 059 / / 060 / private void agg_doAggregateWithKeys() throws java.io.IOException { / 061 / /** PRODUCE: Range 0, 1, 1, 100, [id#0L] / / 062 / / 063 / // initialize Range / 064 / if (!range_initRange) { / 065 / range_initRange = true; / 066 / if (range_input.hasNext()) { / 067 / initRange(((InternalRow) range_input.next()).getInt(0)); / 068 / } else { / 069 / return; / 070 / } / 071 / } / 072 / / 073 / while (!range_overflow && range_number < range_partitionEnd) { / 074 / long range_value = range_number; / 075 / range_number += 1L; / 076 / if (range_number < range_value ^ 1L < 0) { / 077 / range_overflow = true; / 078 / } / 079 / / 080 / /** CONSUME: TungstenAggregate(key=[id#0L], functions=[(count(1),mode=Partial,isDistinct=false)], output=[id#0L,count#9L]) / / 081 / / 082 / // generate grouping key / 083 / agg_rowWriter.write(0, range_value); / 084 / / hash(input[0, bigint], 42) / / 085 / int agg_value1 = 42; / 086 / / 087 / agg_value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashLong(range_value, agg_value1); / 088 / UnsafeRow agg_aggBuffer = null; / 089 / if (true) { / 090 / // try to get the buffer from hash map / 091 / agg_aggBuffer = agg_hashMap.getAggregationBufferFromUnsafeRow(agg_result, agg_value1); / 092 / } / 093 / if (agg_aggBuffer == null) { / 094 / if (agg_sorter == null) { / 095 / agg_sorter = agg_hashMap.destructAndCreateExternalSorter(); / 096 / } else { / 097 / agg_sorter.merge(agg_hashMap.destructAndCreateExternalSorter()); / 098 / } / 099 / / 100 / // the hash map had be spilled, it should have enough memory now, / 101 / // try to allocate buffer again. 
/ 102 / agg_aggBuffer = agg_hashMap.getAggregationBufferFromUnsafeRow(agg_result, agg_value1); / 103 / if (agg_aggBuffer == null) { / 104 / // failed to allocate the first page / 105 / throw new OutOfMemoryError("No enough memory for aggregation"); / 106 / } / 107 / } / 108 / / 109 / // evaluate aggregate function / 110 / / (input[0, bigint] + 1) / / 111 / / input[0, bigint] / / 112 / long agg_value4 = agg_aggBuffer.getLong(0); / 113 / / 114 / long agg_value3 = -1L; / 115 / agg_value3 = agg_value4 + 1L; / 116 / // update aggregate buffer / 117 / agg_aggBuffer.setLong(0, agg_value3); / 118 / / 119 / if (shouldStop()) return; / 120 / } / 121 / / 122 / agg_mapIter = agg_plan.finishAggregate(agg_hashMap, agg_sorter); / 123 / } / 124 / / 125 / private void initRange(int idx) { / 126 / java.math.BigInteger index = java.math.BigInteger.valueOf(idx); / 127 / java.math.BigInteger numSlice = java.math.BigInteger.valueOf(1L); / 128 / java.math.BigInteger numElement = java.math.BigInteger.valueOf(100L); / 129 / java.math.BigInteger step = java.math.BigInteger.valueOf(1L); / 130 / java.math.BigInteger start = java.math.BigInteger.valueOf(0L); / 131 / / 132 / java.math.BigInteger st = index.multiply(numElement).divide(numSlice).multiply(step).add(start); / 133 / if (st.compareTo(java.math.BigInteger.valueOf(Long.MAX_VALUE)) > 0) { / 134 / range_number = Long.MAX_VALUE; / 135 / } else if (st.compareTo(java.math.BigInteger.valueOf(Long.MIN_VALUE)) < 0) { / 136 / range_number = Long.MIN_VALUE; / 137 / } else { / 138 / range_number = st.longValue(); / 139 / } / 140 / / 141 / java.math.BigInteger end = index.add(java.math.BigInteger.ONE).multiply(numElement).divide(numSlice) / 142 / .multiply(step).add(start); / 143 / if (end.compareTo(java.math.BigInteger.valueOf(Long.MAX_VALUE)) > 0) { / 144 / range_partitionEnd = Long.MAX_VALUE; / 145 / } else if (end.compareTo(java.math.BigInteger.valueOf(Long.MIN_VALUE)) < 0) { / 146 / range_partitionEnd = Long.MIN_VALUE; / 147 / } else { / 148 
/ range_partitionEnd = end.longValue(); / 149 / } / 150 / / 151 / range_metricValue.add((range_partitionEnd - range_number) / 1L); / 152 / } / 153 / / 154 / protected void processNext() throws java.io.IOException { / 155 / /** PRODUCE: TungstenAggregate(key=[id#0L], functions=[(count(1),mode=Partial,isDistinct=false)], output=[id#0L,count#9L]) / / 156 / / 157 / if (!agg_initAgg) { / 158 / agg_initAgg = true; / 159 / agg_doAggregateWithKeys(); / 160 / } / 161 / / 162 / // output the result / 163 / while (agg_mapIter.next()) { / 164 / wholestagecodegen_metricValue.add(1); / 165 / UnsafeRow agg_aggKey = (UnsafeRow) agg_mapIter.getKey(); / 166 / UnsafeRow agg_aggBuffer1 = (UnsafeRow) agg_mapIter.getValue(); / 167 / / 168 / UnsafeRow agg_resultRow = agg_unsafeRowJoiner.join(agg_aggKey, agg_aggBuffer1); / 169 / / 170 / /** CONSUME: WholeStageCodegen / / 171 / / 172 / append(agg_resultRow); / 173 / / 174 / if (shouldStop()) return; / 175 / } / 176 / / 177 / agg_mapIter.close(); / 178 / if (agg_sorter == null) { / 179 / agg_hashMap.free(); / 180 / } / 181 / } / 182 / } == Subtree 2 / 3 == WholeStageCodegen : +- Sort [id#0L ASC], true, 0 : +- INPUT +- Exchange rangepartitioning(id#0L ASC, 200), None +- WholeStageCodegen : +- TungstenAggregate(key=[id#0L], functions=[(count(1),mode=Final,isDistinct=false)], output=[id#0L,count#4L]) : +- INPUT +- Exchange hashpartitioning(id#0L, 200), None +- WholeStageCodegen : +- TungstenAggregate(key=[id#0L], functions=[(count(1),mode=Partial,isDistinct=false)], output=[id#0L,count#9L]) : +- Range 0, 1, 1, 100, [id#0L] Generated code: / 001 / public Object generate(Object[] references) { / 002 / return new GeneratedIterator(references); / 003 / } / 004 / / 005 / /* Codegened pipeline for: /* 006 / Sort [id#0L ASC], true, 0 /* 007 / +- INPUT / 008 / / /* 009 / final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator { / 010 / private Object[] references; / 011 / private boolean sort_needToSort; / 012 / 
private org.apache.spark.sql.execution.Sort sort_plan; / 013 / private org.apache.spark.sql.execution.UnsafeExternalRowSorter sort_sorter; / 014 / private org.apache.spark.executor.TaskMetrics sort_metrics; / 015 / private scala.collection.Iterator<UnsafeRow> sort_sortedIter; / 016 / private scala.collection.Iterator inputadapter_input; / 017 / private org.apache.spark.sql.execution.metric.LongSQLMetric sort_dataSize; / 018 / private org.apache.spark.sql.execution.metric.LongSQLMetricValue sort_metricValue; / 019 / private org.apache.spark.sql.execution.metric.LongSQLMetric sort_spillSize; / 020 / private org.apache.spark.sql.execution.metric.LongSQLMetricValue sort_metricValue1; / 021 / / 022 / public GeneratedIterator(Object[] references) { / 023 / this.references = references; / 024 / } / 025 / / 026 / public void init(scala.collection.Iterator inputs[]) { / 027 / sort_needToSort = true; / 028 / this.sort_plan = (org.apache.spark.sql.execution.Sort) references[0]; / 029 / sort_sorter = sort_plan.createSorter(); / 030 / sort_metrics = org.apache.spark.TaskContext.get().taskMetrics(); / 031 / / 032 / inputadapter_input = inputs[0]; / 033 / this.sort_dataSize = (org.apache.spark.sql.execution.metric.LongSQLMetric) references[1]; / 034 / sort_metricValue = (org.apache.spark.sql.execution.metric.LongSQLMetricValue) sort_dataSize.localValue(); / 035 / this.sort_spillSize = (org.apache.spark.sql.execution.metric.LongSQLMetric) references[2]; / 036 / sort_metricValue1 = (org.apache.spark.sql.execution.metric.LongSQLMetricValue) sort_spillSize.localValue(); / 037 / } / 038 / / 039 / private void sort_addToSorter() throws java.io.IOException { / 040 / /** PRODUCE: INPUT / / 041 / / 042 / while (inputadapter_input.hasNext()) { / 043 / InternalRow inputadapter_row = (InternalRow) inputadapter_input.next(); / 044 / /** CONSUME: Sort [id#0L ASC], true, 0 / / 045 / / 046 / sort_sorter.insertRow((UnsafeRow)inputadapter_row); / 047 / if (shouldStop()) return; / 048 / } / 049 / / 
050 / } / 051 / / 052 / protected void processNext() throws java.io.IOException { / 053 / /** PRODUCE: Sort [id#0L ASC], true, 0 / / 054 / if (sort_needToSort) { / 055 / sort_addToSorter(); / 056 / Long sort_spillSizeBefore = sort_metrics.memoryBytesSpilled(); / 057 / sort_sortedIter = sort_sorter.sort(); / 058 / sort_metricValue.add(sort_sorter.getPeakMemoryUsage()); / 059 / sort_metricValue1.add(sort_metrics.memoryBytesSpilled() - sort_spillSizeBefore); / 060 / sort_metrics.incPeakExecutionMemory(sort_sorter.getPeakMemoryUsage()); / 061 / sort_needToSort = false; / 062 / } / 063 / / 064 / while (sort_sortedIter.hasNext()) { / 065 / UnsafeRow sort_outputRow = (UnsafeRow)sort_sortedIter.next(); / 066 / / 067 / /** CONSUME: WholeStageCodegen / / 068 / / 069 / append(sort_outputRow); / 070 / / 071 / if (shouldStop()) return; / 072 / } / 073 / } / 074 / } == Subtree 3 / 3 == WholeStageCodegen : +- TungstenAggregate(key=[id#0L], functions=[(count(1),mode=Final,isDistinct=false)], output=[id#0L,count#4L]) : +- INPUT +- Exchange hashpartitioning(id#0L, 200), None +- WholeStageCodegen : +- TungstenAggregate(key=[id#0L], functions=[(count(1),mode=Partial,isDistinct=false)], output=[id#0L,count#9L]) : +- Range 0, 1, 1, 100, [id#0L] Generated code: / 001 / public Object generate(Object[] references) { / 002 / return new GeneratedIterator(references); / 003 / } / 004 / / 005 / /* Codegened pipeline for: /* 006 / TungstenAggregate(key=[id#0L], functions=[(count(1),mode=Final,isDistinct=false)], output=[id#0L,count#4L]) /* 007 / +- INPUT / 008 / / /* 009 / final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator { / 010 / private Object[] references; / 011 / private boolean agg_initAgg; / 012 / private org.apache.spark.sql.execution.aggregate.TungstenAggregate agg_plan; / 013 / private org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap agg_hashMap; / 014 / private org.apache.spark.sql.execution.UnsafeKVExternalSorter agg_sorter; / 015 
/ private org.apache.spark.unsafe.KVIterator agg_mapIter; / 016 / private scala.collection.Iterator inputadapter_input; / 017 / private UnsafeRow agg_result; / 018 / private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder agg_holder; / 019 / private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter agg_rowWriter; / 020 / private UnsafeRow agg_result1; / 021 / private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder agg_holder1; / 022 / private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter agg_rowWriter1; / 023 / private org.apache.spark.sql.execution.metric.LongSQLMetric wholestagecodegen_numOutputRows; / 024 / private org.apache.spark.sql.execution.metric.LongSQLMetricValue wholestagecodegen_metricValue; / 025 / / 026 / public GeneratedIterator(Object[] references) { / 027 / this.references = references; / 028 / } / 029 / / 030 / public void init(scala.collection.Iterator inputs[]) { / 031 / agg_initAgg = false; / 032 / this.agg_plan = (org.apache.spark.sql.execution.aggregate.TungstenAggregate) references[0]; / 033 / agg_hashMap = agg_plan.createHashMap(); / 034 / / 035 / inputadapter_input = inputs[0]; / 036 / agg_result = new UnsafeRow(1); / 037 / this.agg_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(agg_result, 0); / 038 / this.agg_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(agg_holder, 1); / 039 / agg_result1 = new UnsafeRow(2); / 040 / this.agg_holder1 = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(agg_result1, 0); / 041 / this.agg_rowWriter1 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(agg_holder1, 2); / 042 / this.wholestagecodegen_numOutputRows = (org.apache.spark.sql.execution.metric.LongSQLMetric) references[1]; / 043 / wholestagecodegen_metricValue = (org.apache.spark.sql.execution.metric.LongSQLMetricValue) wholestagecodegen_numOutputRows.localValue(); / 044 / } / 045 / / 
046 / private void agg_doAggregateWithKeys() throws java.io.IOException { / 047 / /** PRODUCE: INPUT / / 048 / / 049 / while (inputadapter_input.hasNext()) { / 050 / InternalRow inputadapter_row = (InternalRow) inputadapter_input.next(); / 051 / /** CONSUME: TungstenAggregate(key=[id#0L], functions=[(count(1),mode=Final,isDistinct=false)], output=[id#0L,count#4L]) / / 052 / / input[0, bigint] / / 053 / long inputadapter_value = inputadapter_row.getLong(0); / 054 / / input[1, bigint] / / 055 / long inputadapter_value1 = inputadapter_row.getLong(1); / 056 / / 057 / // generate grouping key / 058 / agg_rowWriter.write(0, inputadapter_value); / 059 / / hash(input[0, bigint], 42) / / 060 / int agg_value1 = 42; / 061 / / 062 / agg_value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashLong(inputadapter_value, agg_value1); / 063 / UnsafeRow agg_aggBuffer = null; / 064 / if (true) { / 065 / // try to get the buffer from hash map / 066 / agg_aggBuffer = agg_hashMap.getAggregationBufferFromUnsafeRow(agg_result, agg_value1); / 067 / } / 068 / if (agg_aggBuffer == null) { / 069 / if (agg_sorter == null) { / 070 / agg_sorter = agg_hashMap.destructAndCreateExternalSorter(); / 071 / } else { / 072 / agg_sorter.merge(agg_hashMap.destructAndCreateExternalSorter()); / 073 / } / 074 / / 075 / // the hash map had be spilled, it should have enough memory now, / 076 / // try to allocate buffer again. 
/ 077 / agg_aggBuffer = agg_hashMap.getAggregationBufferFromUnsafeRow(agg_result, agg_value1); / 078 / if (agg_aggBuffer == null) { / 079 / // failed to allocate the first page / 080 / throw new OutOfMemoryError("No enough memory for aggregation"); / 081 / } / 082 / } / 083 / / 084 / // evaluate aggregate function / 085 / / (input[0, bigint] + input[2, bigint]) / / 086 / / input[0, bigint] / / 087 / long agg_value4 = agg_aggBuffer.getLong(0); / 088 / / 089 / long agg_value3 = -1L; / 090 / agg_value3 = agg_value4 + inputadapter_value1; / 091 / // update aggregate buffer / 092 / agg_aggBuffer.setLong(0, agg_value3); / 093 / if (shouldStop()) return; / 094 / } / 095 / / 096 / agg_mapIter = agg_plan.finishAggregate(agg_hashMap, agg_sorter); / 097 / } / 098 / / 099 / protected void processNext() throws java.io.IOException { / 100 / /** PRODUCE: TungstenAggregate(key=[id#0L], functions=[(count(1),mode=Final,isDistinct=false)], output=[id#0L,count#4L]) / / 101 / / 102 / if (!agg_initAgg) { / 103 / agg_initAgg = true; / 104 / agg_doAggregateWithKeys(); / 105 / } / 106 / / 107 / // output the result / 108 / while (agg_mapIter.next()) { / 109 / wholestagecodegen_metricValue.add(1); / 110 / UnsafeRow agg_aggKey = (UnsafeRow) agg_mapIter.getKey(); / 111 / UnsafeRow agg_aggBuffer1 = (UnsafeRow) agg_mapIter.getValue(); / 112 / / 113 / / input[0, bigint] / / 114 / long agg_value6 = agg_aggKey.getLong(0); / 115 / / input[0, bigint] / / 116 / long agg_value7 = agg_aggBuffer1.getLong(0); / 117 / / 118 / /** CONSUME: WholeStageCodegen / / 119 / / 120 / agg_rowWriter1.write(0, agg_value6); / 121 / / 122 / agg_rowWriter1.write(1, agg_value7); / 123 / append(agg_result1); / 124 / / 125 / if (shouldStop()) return; / 126 / } / 127 / / 128 / agg_mapIter.close(); / 129 / if (agg_sorter == null) { / 130 / agg_hashMap.free(); / 131 / } / 132 / } / 133 */ } ``` rxin Author: Eric Liang <ekl@databricks.com> Closes #12025 from ericl/spark-14227.	2016-03-29 13:31:51 -07:00
Dongjoon Hyun	838cb4583d	[MINOR][SQL] Fix exception message to print string-array correctly. ## What changes were proposed in this pull request? This PR is a simple fix for an exception message to print `string[]` content correctly. ```java String[] colPath = requestedSchema.getPaths().get(i); ... - throw new IOException("Required column is missing in data file. Col: " + colPath); + throw new IOException("Required column is missing in data file. Col: " + Arrays.toString(colPath)); ``` ## How was this patch tested? Manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12041 from dongjoon-hyun/fix_exception_message_with_string_array.	2016-03-29 12:47:30 -07:00
Cheng Lian	a632bb56f8	[SPARK-14208][SQL] Renames spark.sql.parquet.fileScan ## What changes were proposed in this pull request? Renames SQL option `spark.sql.parquet.fileScan` since now all `HadoopFsRelation` based data sources are being migrated to `FileScanRDD` code path. ## How was this patch tested? None. Author: Cheng Lian <lian@databricks.com> Closes #12003 from liancheng/spark-14208-option-renaming.	2016-03-29 20:56:01 +08:00
Wenchen Fan	83775bc78e	[SPARK-14158][SQL] implement buildReader for json data source ## What changes were proposed in this pull request? This PR implements `buildReader` for the JSON data source and enables it in the new data source code path. ## How was this patch tested? Existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #11960 from cloud-fan/json.	2016-03-29 14:34:12 +08:00
Nong Li	a180286b79	[SPARK-14210] [SQL] Add a metric for time spent in scans. ## What changes were proposed in this pull request? This adds a metric to parquet scans that measures the time in just the scan phase. This is only possible when the scan returns ColumnarBatches, otherwise the overhead is too high. This combined with the pipeline metric lets us easily see what percent of the time was in the scan. Author: Nong Li <nong@databricks.com> Closes #12007 from nongli/spark-14210.	2016-03-28 21:37:46 -07:00
Nong Li	4a55c33639	[SPARK-13981][SQL] Defer evaluating variables within Filter operator. ## What changes were proposed in this pull request? This improves the Filter codegen for NULLs by deferring loading the values for IsNotNull. Instead of generating code like: boolean isNull = ... int value = ... if (isNull) continue; we will generate: boolean isNull = ... if (isNull) continue; int value = ... This is useful since retrieving the values can be non-trivial (they can be dictionary encoded among other things). This currently only works when the attribute comes from the column batch but could be extended to other cases in the future. ## How was this patch tested? On tpcds q55, this fixes the regression from introducing the IsNotNull predicates. ``` TPCDS Snappy: Best/Avg Time(ms) Rate(M/s) Per Row(ns) -------------------------------------------------------------------------------- q55 4564 / 5036 25.2 39.6 q55 4064 / 4340 28.3 35.3 ``` Author: Nong Li <nong@databricks.com> Closes #11792 from nongli/spark-13981.	2016-03-28 20:32:58 -07:00
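The deferred-evaluation idea in SPARK-13981 can be illustrated with a hand-written loop. This is only a sketch under assumptions: `DeferredFilterSketch`, `load`, and the `loads` counter are hypothetical stand-ins, and the real change lives in generated code rather than in a loop like this one.

```java
class DeferredFilterSketch {
    static int loads = 0;  // counts how many values were actually loaded

    // Stand-in for a potentially expensive read (for example, dictionary decoding).
    static int load(int[] values, int i) {
        loads++;
        return values[i];
    }

    // Sums the non-null values, checking the null flag before loading the value.
    static long sumNonNull(boolean[] isNull, int[] values) {
        long sum = 0;
        for (int i = 0; i < values.length; i++) {
            if (isNull[i]) continue;      // null check first
            int value = load(values, i);  // value is loaded only for rows that survive
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        boolean[] isNull = {false, true, false, true};
        int[] values = {1, 0, 3, 0};
        System.out.println(sumNonNull(isNull, values) + " summed with " + loads + " loads");
    }
}
```

With the null check placed first, rows that are filtered out never pay the cost of `load`.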
Wenchen Fan	38326cad87	[SPARK-14205][SQL] remove trait Queryable ## What changes were proposed in this pull request? After DataFrame and Dataset are merged, the trait `Queryable` becomes unnecessary as it has only one implementation. We should remove it. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #12001 from cloud-fan/df-ds.	2016-03-28 18:53:47 -07:00
Andrew Or	27aab80695	[SPARK-14013][SQL] Proper temp function support in catalog ## What changes were proposed in this pull request? Session catalog was added in #11750. However, it doesn't really support temporary functions properly; right now we only store the metadata in the form of `CatalogFunction`, but this doesn't make sense for temporary functions because there is no class name. This patch moves the `FunctionRegistry` into the `SessionCatalog`. With this, the user can call `catalog.createTempFunction` and `catalog.lookupFunction` to use the function they registered previously. This is currently still dead code, however. ## How was this patch tested? `SessionCatalogSuite`. Author: Andrew Or <andrew@databricks.com> Closes #11972 from andrewor14/temp-functions.	2016-03-28 16:45:02 -07:00
Shixiong Zhu	2f98ee67df	[SPARK-14169][CORE] Add UninterruptibleThread ## What changes were proposed in this pull request? Extract the workaround for HADOOP-10622 introduced by #11940 into UninterruptibleThread so that we can test and reuse it. ## How was this patch tested? Unit tests Author: Shixiong Zhu <shixiong@databricks.com> Closes #11971 from zsxwing/uninterrupt.	2016-03-28 16:29:11 -07:00
Andrew Or	eebc8c1c95	[SPARK-13923][SPARK-14014][SQL] Session catalog follow-ups ## What changes were proposed in this pull request? This patch addresses the remaining comments left in #11750 and #11918 after they are merged. For a full list of changes in this patch, just trace the commits. ## How was this patch tested? `SessionCatalogSuite` and `CatalogTestCases` Author: Andrew Or <andrew@databricks.com> Closes #12006 from andrewor14/session-catalog-followup.	2016-03-28 16:25:15 -07:00
Herman van Hovell	328c71161b	[SPARK-14086][SQL] Add DDL commands to ANTLR4 parser #### What changes were proposed in this pull request? This PR adds all the current Spark SQL DDL commands to the new ANTLR 4 based SQL parser. I have found a few inconsistencies in the current commands: - Function has an alias field. This is actually the class name of the function. - Partition specifications should contain nulls in some commands, and contain `None`s in others. - `AlterTableSkewedLocation`: Should define which columns have skewed values, and should allow us to define storage for each skewed combination of values. We currently only allow one value per field. - `AlterTableSetFileFormat`: Should only have one file format, it currently supports both. I have implemented all these commands as they were, and I propose to improve them in follow-up PRs. #### How was this patch tested? The existing DDLCommandSuite. cc rxin andrewor14 yhuai Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #12011 from hvanhovell/SPARK-14086.	2016-03-28 16:22:02 -07:00
Davies Liu	d7b58f1461	[SPARK-14052] [SQL] build a BytesToBytesMap directly in HashedRelation ## What changes were proposed in this pull request? Currently, for keys that cannot fit within a long, we build a hash map for UnsafeHashedRelation, which is converted to a BytesToBytesMap after serialization and deserialization. We should build a BytesToBytesMap directly to have better memory efficiency. In order to do that, BytesToBytesMap should support multiple (K,V) pairs with the same K, so Location.putNewKey() is renamed to Location.append(), which can append multiple values for the same key (same Location). `Location.newValue()` is added to find the next value for the same key. ## How was this patch tested? Existing tests. Added benchmark for broadcast hash join with duplicated keys. Author: Davies Liu <davies@databricks.com> Closes #11870 from davies/map2.	2016-03-28 13:07:32 -07:00
Herman van Hovell	600c0b69ca	[SPARK-13713][SQL] Migrate parser from ANTLR3 to ANTLR4 ### What changes were proposed in this pull request? The current ANTLR3 parser is quite complex to maintain and suffers from code blow-ups. This PR introduces a new parser that is based on ANTLR4. This parser is based on [Presto's SQL parser](https://github.com/facebook/presto/blob/master/presto-parser/src/main/antlr4/com/facebook/presto/sql/parser/SqlBase.g4). The current implementation can parse and create Catalyst and SQL plans. Large parts of the HiveQl DDL and some of the DML functionality are currently missing; the plan is to add these in follow-up PRs. This PR is a work in progress, and work needs to be done in the following areas: - [x] Error handling should be improved. - [x] Documentation should be improved. - [x] Multi-Insert needs to be tested. - [ ] Naming and package locations. ### How was this patch tested? Catalyst and SQL unit tests. Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #11557 from hvanhovell/ngParser.	2016-03-28 12:31:12 -07:00
gatorsmile	a01b6a92b5	[SPARK-14177][SQL] Native Parsing for DDL Command "Describe Database" and "Alter Database" #### What changes were proposed in this pull request? This PR is to provide native parsing support for two DDL commands: ```Describe Database``` and ```Alter Database Set Properties``` Based on the Hive DDL document: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL ##### 1. ALTER DATABASE Syntax: ```SQL ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...) ``` - `ALTER DATABASE` is to add new (key, value) pairs into `DBPROPERTIES` ##### 2. DESCRIBE DATABASE Syntax: ```SQL DESCRIBE DATABASE [EXTENDED] db_name ``` - `DESCRIBE DATABASE` shows the name of the database, its comment (if one has been set), and its root location on the filesystem. When `extended` is true, it also shows the database's properties #### How was this patch tested? Added the related test cases to `DDLCommandSuite` Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> This patch had conflicts when merged, resolved by Committer: Yin Huai <yhuai@databricks.com> Closes #11977 from gatorsmile/parseAlterDatabase.	2016-03-26 20:12:30 -07:00
Liang-Chi Hsieh	bc925b73a6	[SPARK-14157][SQL] Parse Drop Function DDL command ## What changes were proposed in this pull request? JIRA: https://issues.apache.org/jira/browse/SPARK-14157 We only parse create function command. In order to support native drop function command, we need to parse it too. From Hive [manual](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/ReloadFunction), the drop function command has syntax as: DROP [TEMPORARY] FUNCTION [IF EXISTS] function_name; ## How was this patch tested? Added test into `DDLCommandSuite`. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #11959 from viirya/parse-drop-func.	2016-03-26 20:09:01 -07:00
Cheng Lian	b547de8a60	[SPARK-14116][SQL] Implements buildReader() for ORC data source ## What changes were proposed in this pull request? This PR implements `FileFormat.buildReader()` for our ORC data source. It also fixed several minor styling issues related to `HadoopFsRelation` planning code path. Note that `OrcNewInputFormat` doesn't rely on `OrcNewSplit` for creating `OrcRecordReader`s, plain `FileSplit` is just fine. That's why we can simply create the record reader with the help of `OrcNewInputFormat` and `FileSplit`. ## How was this patch tested? Existing test cases should do the work Author: Cheng Lian <lian@databricks.com> Closes #11936 from liancheng/spark-14116-build-reader-for-orc.	2016-03-26 16:10:35 -07:00
gatorsmile	8989d3a396	[SPARK-14161][SQL] Native Parsing for DDL Command Drop Database ### What changes were proposed in this pull request? Based on the Hive DDL document https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL The syntax of DDL command for Drop Database is ```SQL DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE]; ``` - If `IF EXISTS` is not specified, the default behavior is to issue a warning message if `database_name` doesn't exist - `RESTRICT` is the default behavior. This PR is to provide native parsing support for `DROP DATABASE`. #### How was this patch tested? Added a test case to `DDLCommandSuite` Author: gatorsmile <gatorsmile@gmail.com> Closes #11962 from gatorsmile/parseDropDatabase.	2016-03-26 14:11:13 -07:00
Davies Liu	bd94ea4c80	[SPARK-14175][SQL] whole stage codegen interface refactor ## What changes were proposed in this pull request? 1. merge consumeChild into consume() 2. always generate code for input variables and UnsafeRow, a plan can use either of them. ## How was this patch tested? Existing tests. Author: Davies Liu <davies@databricks.com> Closes #11975 from davies/gen_refactor.	2016-03-26 11:03:05 -07:00
Dongjoon Hyun	1808465855	[MINOR] Fix newly added java-lint errors ## What changes were proposed in this pull request? This PR fixes some newly added java-lint errors (unused-imports, line-length). ## How was this patch tested? Pass the Jenkins tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11968 from dongjoon-hyun/SPARK-14167.	2016-03-26 11:55:49 +00:00
Tathagata Das	13945dd83b	[SPARK-14109][SQL] Fix HDFSMetadataLog to fallback from FileContext to FileSystem API ## What changes were proposed in this pull request? HDFSMetadataLog uses the newer FileContext API to achieve atomic renaming. However, FileContext implementations may not exist for many schemes for which there may be FileSystem implementations. In those cases, rather than failing completely, we should fall back to the FileSystem based implementation, and log a warning that there may be file consistency issues in case the log directory is concurrently modified. In addition I have also added more tests to increase the code coverage. ## How was this patch tested? Unit test. Tested on cluster with custom file system. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #11925 from tdas/SPARK-14109.	2016-03-25 20:07:54 -07:00
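The fallback behaviour described in SPARK-14109 might look roughly like the sketch below. Everything here is a hypothetical stand-in: the `FileManager` interface, the `create` helper, and the use of `UnsupportedOperationException` are assumptions for illustration and do not reproduce the Hadoop or Spark APIs.

```java
import java.util.function.Supplier;
import java.util.logging.Logger;

class FallbackSketch {
    private static final Logger LOG = Logger.getLogger("FallbackSketch");

    // Hypothetical abstraction over the two file APIs.
    interface FileManager {
        String name();
    }

    // Tries the preferred implementation and falls back, with a warning, when it is unavailable.
    static FileManager create(Supplier<FileManager> preferred, Supplier<FileManager> fallback) {
        try {
            return preferred.get();
        } catch (UnsupportedOperationException e) {
            LOG.warning("Preferred file API unavailable; falling back. "
                + "Concurrent modification of the log directory may cause consistency issues.");
            return fallback.get();
        }
    }

    public static void main(String[] args) {
        FileManager fm = create(
            () -> { throw new UnsupportedOperationException("no FileContext for this scheme"); },
            () -> () -> "FileSystem");
        System.out.println("Using " + fm.name());
    }
}
```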
Shixiong Zhu	24587ce433	[SPARK-14073][STREAMING][TEST-MAVEN] Move flume back to Spark ## What changes were proposed in this pull request? This PR moves flume back to Spark as per the discussion in the dev mail-list. ## How was this patch tested? Existing Jenkins tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #11895 from zsxwing/move-flume-back.	2016-03-25 17:37:16 -07:00
Shixiong Zhu	b554b3c46b	[SPARK-14131][SQL] Add a workaround for HADOOP-10622 to fix DataFrameReaderWriterSuite ## What changes were proposed in this pull request? There is a potential dead-lock in Hadoop Shell.runCommand before 2.5.0 ([HADOOP-10622](https://issues.apache.org/jira/browse/HADOOP-10622)). If we interrupt some thread running Shell.runCommand, we may hit this issue. This PR adds some protection to avoid interrupting the microBatchThread when we may run into Shell.runCommand. Two places call Shell.runCommand now: - offsetLog.add - FileStreamSource.getOffset They will create a file using HDFS API and call Shell.runCommand to set the file permission. ## How was this patch tested? Existing unit tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #11940 from zsxwing/workaround-for-HADOOP-10622.	2016-03-25 13:28:26 -07:00
Tathagata Das	11fa8741ca	[SQL][HOTFIX] Fix flakiness in StateStoreRDDSuite ## What changes were proposed in this pull request? StateStoreCoordinator.reportActiveInstance is async, so subsequent state checks must be wrapped in `eventually`. ## How was this patch tested? Jenkins tests Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #11924 from tdas/state-store-flaky-fix.	2016-03-25 12:04:47 -07:00
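The retry-until-true pattern behind this hotfix can be sketched in plain Java. This is an assumed stand-in for a test helper of the `eventually` kind, not the helper the suite uses; the class name, method signature, and timings are all hypothetical.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

class EventuallySketch {
    // Polls the check until it passes, or fails once the timeout has elapsed.
    static void eventually(long timeoutMillis, long intervalMillis, BooleanSupplier check)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (check.getAsBoolean()) return;
            if (System.nanoTime() >= deadline) {
                throw new AssertionError("condition not met within " + timeoutMillis + " ms");
            }
            Thread.sleep(intervalMillis);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicBoolean reported = new AtomicBoolean(false);
        // Simulates an asynchronous report that lands a little later.
        Thread reporter = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            reported.set(true);
        });
        reporter.start();
        eventually(2000, 10, reported::get);
        System.out.println("state observed after async report");
    }
}
```

Asserting on `reported` immediately after starting the thread would be racy; polling with a deadline removes the flakiness without a fixed sleep.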
Sameer Agarwal	b5f8c36e3c	[SPARK-14144][SQL] Explicitly identify/catch UnsupportedOperationException during parquet reader initialization ## What changes were proposed in this pull request? This PR is a minor cleanup task as part of https://issues.apache.org/jira/browse/SPARK-14008 to explicitly identify/catch the `UnsupportedOperationException` while initializing the vectorized parquet reader. Other exceptions will simply be thrown back to `SqlNewHadoopPartition`. ## How was this patch tested? N/A (cleanup only; no new functionality added) Author: Sameer Agarwal <sameer@databricks.com> Closes #11950 from sameeragarwal/parquet-cleanup.	2016-03-25 11:48:05 -07:00
Wenchen Fan	43b15e01c4	[SPARK-14061][SQL] implement CreateMap ## What changes were proposed in this pull request? As we have `CreateArray` and `CreateStruct`, we should also have `CreateMap`. This PR adds the `CreateMap` expression, and the DataFrame API, and python API. ## How was this patch tested? various new tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #11879 from cloud-fan/create_map.	2016-03-25 09:50:06 -07:00
Reynold Xin	70a6f0bb57	[SPARK-14149] Log exceptions in tryOrIOException ## What changes were proposed in this pull request? We ran into a problem today debugging some class loading problem during deserialization, and JVM was masking the underlying exception which made it very difficult to debug. We can however log the exceptions using try/catch ourselves in serialization/deserialization. The good thing is that all these methods are already using Utils.tryOrIOException, so we can just put the try catch and logging in a single place. ## How was this patch tested? A logging change with a manual test. Author: Reynold Xin <rxin@databricks.com> Closes #11951 from rxin/SPARK-14149.	2016-03-25 01:17:23 -07:00
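The idea of logging in a single wrapper can be sketched as follows. This is a Java approximation under assumptions, not the Scala `Utils.tryOrIOException` itself: the class name, the logger, and the message text are illustrative.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.logging.Level;
import java.util.logging.Logger;

class TryOrIOExceptionSketch {
    private static final Logger LOG = Logger.getLogger("TryOrIOExceptionSketch");

    // Runs the block, logs any failure, and surfaces it as an IOException.
    static <T> T tryOrIOException(Callable<T> block) throws IOException {
        try {
            return block.call();
        } catch (IOException e) {
            LOG.log(Level.SEVERE, "Exception encountered", e);
            throw e;
        } catch (Exception e) {
            LOG.log(Level.SEVERE, "Exception encountered", e);
            throw new IOException(e);  // keep the original exception as the cause
        }
    }

    public static void main(String[] args) {
        try {
            TryOrIOExceptionSketch.<String>tryOrIOException(() -> {
                throw new IllegalStateException("class loading failed");
            });
        } catch (IOException e) {
            System.out.println("wrapped cause: " + e.getCause());
        }
    }
}
```

Because every caller goes through the one wrapper, the underlying exception is logged even if a later layer masks it.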
Andrew Or	20ddf5fddf	[SPARK-14014][SQL] Integrate session catalog (attempt #2 ) ## What changes were proposed in this pull request? This reopens #11836, which was merged but promptly reverted because it introduced flaky Hive tests. ## How was this patch tested? See `CatalogTestCases`, `SessionCatalogSuite` and `HiveContextSuite`. Author: Andrew Or <andrew@databricks.com> Closes #11938 from andrewor14/session-catalog-again.	2016-03-24 22:59:35 -07:00
Reynold Xin	1c70b7650f	[SPARK-14145][SQL] Remove the untyped version of Dataset.groupByKey ## What changes were proposed in this pull request? Dataset has two variants of groupByKey, one for untyped and the other for typed. It actually doesn't make as much sense to have an untyped API here, since apps that want to use untyped APIs should just use the groupBy "DataFrame" API. ## How was this patch tested? This patch removes a method, and removes the associated tests. Author: Reynold Xin <rxin@databricks.com> Closes #11949 from rxin/SPARK-14145.	2016-03-24 22:56:34 -07:00
Reynold Xin	3619fec1ec	[SPARK-14142][SQL] Replace internal use of unionAll with union ## What changes were proposed in this pull request? unionAll has been deprecated in SPARK-14088. ## How was this patch tested? Should be covered by all existing tests. Author: Reynold Xin <rxin@databricks.com> Closes #11946 from rxin/SPARK-14142.	2016-03-24 22:34:55 -07:00
gatorsmile	05f652d6c2	[SPARK-13957][SQL] Support Group By Ordinal in SQL #### What changes were proposed in this pull request? This PR is to support group by position in SQL. For example, when users input the following query ```SQL select c1 as a, c2, c3, sum() from tbl group by 1, 3, c4 ``` The ordinals are recognized as the positions in the select list. Thus, `Analyzer` converts it to ```SQL select c1, c2, c3, sum() from tbl group by c1, c3, c4 ``` This is controlled by the config option `spark.sql.groupByOrdinal`. - When true, the ordinal numbers in group by clauses are treated as the position in the select list. - When false, the ordinal numbers are ignored. - Only integer literals are converted (not foldable expressions). Foldable expressions, if found, are ignored. - When the positions specified in the group by clauses correspond to the aggregate functions in the select list, output an exception message. - Star is not allowed in the select list when users specify ordinals in group by Note: This PR is taken from https://github.com/apache/spark/pull/10731. When merging this PR, please give the credit to zhichao-li Also cc all the people who are involved in the previous discussion: rxin cloud-fan marmbrus yhuai hvanhovell adrian-wang chenghao-intel tejasapatil #### How was this patch tested? Added a few positive and negative test cases. Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #11846 from gatorsmile/groupByOrdinal.	2016-03-25 12:55:58 +08:00
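The ordinal substitution described in SPARK-13957 can be sketched on plain strings. This is a toy model, not the `Analyzer` rule: `GroupByOrdinalSketch` and `resolve` are hypothetical names, expressions are represented as strings, and the aggregate-function and star checks listed in the message are omitted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class GroupByOrdinalSketch {
    // Replaces integer literals in the GROUP BY list with the select-list
    // expression at that 1-based position; other entries pass through unchanged.
    static List<String> resolve(List<String> selectList, List<String> groupBy) {
        List<String> resolved = new ArrayList<>();
        for (String item : groupBy) {
            if (item.matches("\\d{1,9}")) {
                int position = Integer.parseInt(item);
                if (position < 1 || position > selectList.size()) {
                    throw new IllegalArgumentException(
                        "GROUP BY position " + position + " is not in the select list");
                }
                resolved.add(selectList.get(position - 1));
            } else {
                resolved.add(item);
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        List<String> select = Arrays.asList("c1", "c2", "c3");
        List<String> groupBy = Arrays.asList("1", "3", "c4");
        System.out.println(resolve(select, groupBy));
    }
}
```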
Andrew Or	c44d140cae	Revert "[SPARK-14014][SQL] Replace existing catalog with SessionCatalog" This reverts commit `5dfc01976b`.	2016-03-23 22:21:15 -07:00
gatorsmile	f42eaf42bd	[SPARK-14085][SQL] Star Expansion for Hash #### What changes were proposed in this pull request? This PR is to support star expansion in hash. For example, ```SQL val structDf = testData2.select("a", "b").as("record") structDf.select(hash($"*")) ``` In addition, it refactors the code for the rule `ResolveStar` and fixes a regression for star expansion in group by when using SQL API. For example, ```SQL SELECT * FROM testData2 group by a, b ``` cc cloud-fan Now, the code for star resolution is much cleaner. The coverage is better. Could you check if this refactoring is good? Thanks! #### How was this patch tested? Added a few test cases to cover it. Author: gatorsmile <gatorsmile@gmail.com> Closes #11904 from gatorsmile/starResolution.	2016-03-24 11:13:36 +08:00
Andrew Or	5dfc01976b	[SPARK-14014][SQL] Replace existing catalog with SessionCatalog ## What changes were proposed in this pull request? `SessionCatalog`, introduced in #11750, is a catalog that keeps track of temporary functions and tables, and delegates metastore operations to `ExternalCatalog`. This functionality overlaps a lot with the existing `analysis.Catalog`. As of this commit, `SessionCatalog` and `ExternalCatalog` will no longer be dead code. There are still things that need to be done after this patch, namely: - SPARK-14013: Properly implement temporary functions in `SessionCatalog` - SPARK-13879: Decide which DDL/DML commands to support natively in Spark - SPARK-?????: Implement the ones we do want to support through `SessionCatalog`. - SPARK-?????: Merge SQL/HiveContext ## How was this patch tested? This is largely a refactoring task so there are no new tests introduced. The particularly relevant tests are `SessionCatalogSuite` and `ExternalCatalogSuite`. Author: Andrew Or <andrew@databricks.com> Author: Yin Huai <yhuai@databricks.com> Closes #11836 from andrewor14/use-session-catalog.	2016-03-23 13:34:22 -07:00
Michael Armbrust	6bc4be64f8	[SPARK-14078] Streaming Parquet Based FileSink This PR adds a new `Sink` implementation that writes out Parquet files. In order to correctly handle partial failures while maintaining exactly once semantics, the files for each batch are written out to a unique directory and then atomically appended to a metadata log. When a parquet based `DataSource` is initialized for reading, we first check for this log directory and use it instead of file listing when present. Unit tests are added, as well as a stress test that checks the answer after non-deterministic injected failures. Author: Michael Armbrust <michael@databricks.com> Closes #11897 from marmbrus/fileSink.	2016-03-23 13:03:25 -07:00
Tathagata Das	8c826880f5	[SPARK-13809][SQL] State store for streaming aggregations ## What changes were proposed in this pull request? In this PR, I am implementing a new abstraction for management of streaming state data - State Store. It is a key-value store for persisting running aggregates for aggregate operations in streaming dataframes. The motivation and design is discussed here. https://docs.google.com/document/d/1-ncawFx8JS5Zyfq1HAEGBx56RDet9wfVp_hDM8ZL254/edit# ## How was this patch tested? - [x] Unit tests - [x] Cluster tests Coverage from unit tests <img width="952" alt="screen shot 2016-03-21 at 3 09 40 pm" src="https://cloud.githubusercontent.com/assets/663212/13935872/fdc8ba86-ef76-11e5-93e8-9fa310472c7b.png"> ## TODO - [x] Fix updates() iterator to avoid duplicate updates for same key - [x] Use Coordinator in ContinuousQueryManager - [x] Plugging in hadoop conf and other confs - [x] Unit tests - [x] StateStore object lifecycle and methods - [x] StateStoreCoordinator communication and logic - [x] StateStoreRDD fault-tolerance - [x] StateStoreRDD preferred location using StateStoreCoordinator - [ ] Cluster tests - [ ] Whether preferred locations are set correctly - [ ] Whether recovery works correctly with distributed storage - [x] Basic performance tests - [x] Docs Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #11645 from tdas/state-store.	2016-03-23 12:48:05 -07:00
Sameer Agarwal	0a64294fcb	[SPARK-14015][SQL] Support TimestampType in vectorized parquet reader ## What changes were proposed in this pull request? This PR adds support for TimestampType in the vectorized parquet reader ## How was this patch tested? 1. `VectorizedColumnReader` initially had a gating condition on `primitiveType.getPrimitiveTypeName() == PrimitiveType.PrimitiveTypeName.INT96)` that made us fall back on parquet-mr for handling timestamps. This condition is now removed. 2. The `ParquetHadoopFsRelationSuite` (that tests for all supported hive types -- including `TimestampType`) fails when the gating condition is removed (https://github.com/apache/spark/pull/11808) and should now pass with this change. Similarly, the `ParquetHiveCompatibilitySuite.SPARK-10177 timestamp` test that fails when the gating condition is removed, should now pass as well. 3. Added tests in `HadoopFsRelationTest` that test both the dictionary encoded and non-encoded versions across all supported datatypes. Author: Sameer Agarwal <sameer@databricks.com> Closes #11882 from sameeragarwal/timestamp-parquet.	2016-03-23 12:13:32 -07:00
Davies Liu	02d9c352c7	[SPARK-14092] [SQL] move shouldStop() to end of while loop ## What changes were proposed in this pull request? This PR rolls back some changes in #11274, which introduced a performance regression when doing a simple aggregation on a parquet scan with one integer column. It is not really understood how that change has such a large impact; it may be related to how the JIT compiler inlines functions (profiling showed very different stats). ## How was this patch tested? Manually run the parquet reader benchmark, before this change: ``` Intel(R) Core(TM) i7-4558U CPU 2.80GHz Int and String Scan: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------- SQL Parquet Vectorized 2391 / 3107 43.9 22.8 1.0X ``` After this change ``` Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-b19 on Mac OS X 10.9.5 Intel(R) Core(TM) i7-4558U CPU 2.80GHz Int and String Scan: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------- SQL Parquet Vectorized 2032 / 2626 51.6 19.4 1.0X ``` Author: Davies Liu <davies@databricks.com> Closes #11912 from davies/fix_regression.	2016-03-23 11:58:43 -07:00
Josh Rosen	3de24ae2ed	[SPARK-14075] Refactor MemoryStore to be testable independent of BlockManager This patch refactors the `MemoryStore` so that it can be tested without needing to construct / mock an entire `BlockManager`. - The block manager's serialization- and compression-related methods have been moved from `BlockManager` to `SerializerManager`. - `BlockInfoManager` is now passed directly to classes that need it, rather than being passed via the `BlockManager`. - The `MemoryStore` now calls `dropFromMemory` via a new `BlockEvictionHandler` interface rather than directly calling the `BlockManager`. This change helps to enforce a narrow interface between the `MemoryStore` and `BlockManager` functionality and makes this interface easier to mock in tests. - Several of the block unrolling tests have been moved from `BlockManagerSuite` into a new `MemoryStoreSuite`. Author: Josh Rosen <joshrosen@databricks.com> Closes #11899 from JoshRosen/reduce-memorystore-blockmanager-coupling.	2016-03-23 10:15:23 -07:00

2044 commits