ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
gaborgsomogyi	33d43bf1b6	[SPARK-22484][DOC] Document PySpark DataFrame csv writer behavior whe… ## What changes were proposed in this pull request? In PySpark API Document, DataFrame.write.csv() says that setting the quote parameter to an empty string should turn off quoting. Instead, it uses the [null character](https://en.wikipedia.org/wiki/Null_character) as the quote. This PR fixes the doc. ## How was this patch tested? Manual. ``` cd python/docs make html open _build/html/pyspark.sql.html ``` Author: gaborgsomogyi <gabor.g.somogyi@gmail.com> Closes #19814 from gaborgsomogyi/SPARK-22484.	2017-11-28 10:14:35 +09:00
Marco Gaido	087879a77a	[SPARK-22520][SQL] Support code generation for large CaseWhen ## What changes were proposed in this pull request? Code generation is disabled for CaseWhen when the number of branches is higher than `spark.sql.codegen.maxCaseBranches` (which defaults to 20). This was done to prevent the well known 64KB method limit exception. This PR proposes to support code generation also in those cases (without causing exceptions of course). As a side effect, we could get rid of the `spark.sql.codegen.maxCaseBranches` configuration. ## How was this patch tested? existing UTs Author: Marco Gaido <mgaido@hortonworks.com> Author: Marco Gaido <marcogaido91@gmail.com> Closes #19752 from mgaido91/SPARK-22520.	2017-11-28 07:46:18 +08:00
Zhenhua Wang	1ff4a77be4	[SPARK-22529][SQL] Relation stats should be consistent with other plans based on cbo config ## What changes were proposed in this pull request? Currently, relation stats are the same whether CBO is enabled or not. Although a relation (`LogicalRelation` or `HiveTableRelation`) is a `LogicalPlan`, its behavior is inconsistent with other plans. This can cause confusion when a user runs EXPLAIN COST commands. Besides, when CBO is disabled, we apply the size-only estimation strategy, so there's no need to propagate other catalog statistics to the relation. ## How was this patch tested? Enhanced existing test cases and added a new test case. Author: Zhenhua Wang <wangzhenhua@huawei.com> Closes #19757 from wzhfy/catalog_stats_conversion.	2017-11-28 01:13:44 +08:00
Kazuaki Ishizaki	2dbe275b2d	[SPARK-22603][SQL] Fix 64KB JVM bytecode limit problem with FormatString ## What changes were proposed in this pull request? This PR changes `FormatString` code generation to place the generated code for argument expressions into separate methods if their size could be large. It passes variable arguments using an `Object` array. ## How was this patch tested? Added new test cases into `StringExpressionsSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19817 from kiszk/SPARK-22603.	2017-11-27 20:32:01 +08:00
Wenchen Fan	5a02e3a2ac	[SPARK-22602][SQL] remove ColumnVector#loadBytes ## What changes were proposed in this pull request? `ColumnVector#loadBytes` is only used as an optimization for reading UTF8String in `WritableColumnVector`. This PR moves this optimization into `WritableColumnVector` and simplifies it. ## How was this patch tested? existing test Author: Wenchen Fan <wenchen@databricks.com> Closes #19815 from cloud-fan/load-bytes.	2017-11-26 21:49:09 -08:00
Sean Owen	fba63c1a7b	[SPARK-22607][BUILD] Set large stack size consistently for tests to avoid StackOverflowError ## What changes were proposed in this pull request? Set `-ea` and `-Xss4m` consistently for tests, to fix in particular: ``` OrderingSuite: ... - GenerateOrdering with ShortType * RUN ABORTED * java.lang.StackOverflowError: at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:370) at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) ... ``` ## How was this patch tested? Existing tests. Manually verified it resolves the StackOverflowError this intends to resolve. Author: Sean Owen <sowen@cloudera.com> Closes #19820 from srowen/SPARK-22607.	2017-11-26 07:42:44 -06:00
Wenchen Fan	e3fd93f149	[SPARK-22604][SQL] remove the get address methods from ColumnVector ## What changes were proposed in this pull request? `nullsNativeAddress` and `valuesNativeAddress` are only used in tests and benchmarks, so there is no need for them to be part of the top-level class API. ## How was this patch tested? existing tests Author: Wenchen Fan <wenchen@databricks.com> Closes #19818 from cloud-fan/minor.	2017-11-24 22:43:47 -08:00
Wenchen Fan	70221903f5	[SPARK-22596][SQL] set ctx.currentVars in CodegenSupport.consume ## What changes were proposed in this pull request? `ctx.currentVars` means the input variables for the current operator, which is already decided in `CodegenSupport`, we can set it there instead of `doConsume`. also add more comments to help people understand the codegen framework. After this PR, we now have a principle about setting `ctx.currentVars` and `ctx.INPUT_ROW`: 1. for non-whole-stage-codegen path, never set them. (permit some special cases like generating ordering) 2. for whole-stage-codegen `produce` path, mostly we don't need to set them, but blocking operators may need to set them for expressions that produce data from data source, sort buffer, aggregate buffer, etc. 3. for whole-stage-codegen `consume` path, mostly we don't need to set them because `currentVars` is automatically set to child input variables and `INPUT_ROW` is mostly not used. A few plans need to tweak them as they may have different inputs, or they use the input row. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #19803 from cloud-fan/codegen.	2017-11-24 21:50:30 -08:00
Kazuaki Ishizaki	554adc77d2	[SPARK-22595][SQL] fix flaky test: CastSuite.SPARK-22500: cast for struct should not generate codes beyond 64KB ## What changes were proposed in this pull request? This PR reduces the number of fields in the test case of `CastSuite` to fix an issue that is pointed at [here](https://github.com/apache/spark/pull/19800#issuecomment-346634950). ``` java.lang.OutOfMemoryError: GC overhead limit exceeded java.lang.OutOfMemoryError: GC overhead limit exceeded at org.codehaus.janino.UnitCompiler.findClass(UnitCompiler.java:10971) at org.codehaus.janino.UnitCompiler.findTypeByName(UnitCompiler.java:7607) at org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:5758) at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:5732) at org.codehaus.janino.UnitCompiler.access$13200(UnitCompiler.java:206) at org.codehaus.janino.UnitCompiler$18.visitReferenceType(UnitCompiler.java:5668) at org.codehaus.janino.UnitCompiler$18.visitReferenceType(UnitCompiler.java:5660) at org.codehaus.janino.Java$ReferenceType.accept(Java.java:3356) at org.codehaus.janino.UnitCompiler.getType(UnitCompiler.java:5660) at org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:2892) at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2764) at org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262) at org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377) at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369) at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) at 
org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) at org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377) at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369) at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) at org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) ... ``` ## How was this patch tested? Used existing test case Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19806 from kiszk/SPARK-22595.	2017-11-24 12:08:49 +01:00
Liang-Chi Hsieh	62a826f17c	[SPARK-22591][SQL] GenerateOrdering shouldn't change CodegenContext.INPUT_ROW ## What changes were proposed in this pull request? When I played with codegen in developing another PR, I found the value of `CodegenContext.INPUT_ROW` is not reliable. Under wholestage codegen, it is assigned to null first and then suddenly changed to `i`. The reason is `GenerateOrdering` changes `CodegenContext.INPUT_ROW` but doesn't restore it back. ## How was this patch tested? Added test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #19800 from viirya/SPARK-22591.	2017-11-24 11:46:58 +01:00
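The save-and-restore discipline described in SPARK-22591 can be sketched outside Spark. The following is a minimal Python illustration, not Spark's actual code: `CodegenContext` here is a toy stand-in, and `temporary_input_row` and `generate_ordering` are hypothetical helpers invented for this example.

```python
from contextlib import contextmanager


class CodegenContext:
    """Toy stand-in for Spark's CodegenContext (hypothetical, heavily simplified)."""

    def __init__(self):
        # Name of the variable holding the current input row, or None if unset.
        self.INPUT_ROW = None


@contextmanager
def temporary_input_row(ctx, name):
    """Set ctx.INPUT_ROW inside the block and restore the previous value afterwards."""
    saved = ctx.INPUT_ROW
    ctx.INPUT_ROW = name
    try:
        yield ctx
    finally:
        # Restoring here is the step the buggy code path skipped.
        ctx.INPUT_ROW = saved


def generate_ordering(ctx):
    """Pretend to emit comparison code that reads from a row variable named `i`."""
    with temporary_input_row(ctx, "i"):
        return f"compare({ctx.INPUT_ROW}.a, {ctx.INPUT_ROW}.b)"


ctx = CodegenContext()
code = generate_ordering(ctx)
```

After `generate_ordering` returns, `ctx.INPUT_ROW` is back to its earlier value, so later code generation does not observe the temporary `i`.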
Wenchen Fan	c1217565e2	[SPARK-22592][SQL] cleanup filter converting for hive ## What changes were proposed in this pull request? We have two different methods to convert filters for Hive, selected by a config. This introduces duplicated and inconsistent code (e.g. one uses helper objects for pattern matching and the other doesn't). ## How was this patch tested? existing tests Author: Wenchen Fan <wenchen@databricks.com> Closes #19801 from cloud-fan/cleanup.	2017-11-23 15:33:26 -08:00
Wenchen Fan	42f83d7c40	[SPARK-17920][FOLLOWUP] simplify the schema file creation in test ## What changes were proposed in this pull request? a followup of https://github.com/apache/spark/pull/19779 , to simplify the file creation. ## How was this patch tested? test only change Author: Wenchen Fan <wenchen@databricks.com> Closes #19799 from cloud-fan/minor.	2017-11-23 18:20:16 +01:00
Wenchen Fan	0605ad7614	[SPARK-22543][SQL] fix java 64kb compile error for deeply nested expressions ## What changes were proposed in this pull request? A frequently reported issue of Spark is the Java 64kb compile error. This is because Spark generates a very big method and it's usually caused by 3 reasons: 1. a deep expression tree, e.g. a very complex filter condition 2. many individual expressions, e.g. expressions can have many children, operators can have many expressions. 3. a deep query plan tree (with whole stage codegen) This PR focuses on 1. There are already several patches(#15620 #18972 #18641) trying to fix this issue and some of them are already merged. However this is an endless job as every non-leaf expression has this issue. This PR proposes to fix this issue in `Expression.genCode`, to make sure the code for a single expression won't grow too big. According to maropu 's benchmark, no regression is found with TPCDS (thanks maropu !): https://docs.google.com/spreadsheets/d/1K3_7lX05-ZgxDXi9X_GleNnDjcnJIfoSlSCDZcL4gdg/edit?usp=sharing ## How was this patch tested? existing test Author: Wenchen Fan <wenchen@databricks.com> Author: Wenchen Fan <cloud0fan@gmail.com> Closes #19767 from cloud-fan/codegen.	2017-11-22 10:05:46 -08:00
Kazuaki Ishizaki	572af5027e	[SPARK-20101][SQL][FOLLOW-UP] use correct config name "spark.sql.columnVector.offheap.enabled" ## What changes were proposed in this pull request? This PR fixes [the misspelling](https://github.com/apache/spark/pull/17436#discussion_r152189670) of the config name: `spark.sql.columnVector.offheap.enable` should be `spark.sql.columnVector.offheap.enabled`. ## How was this patch tested? Existing tests Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19794 from kiszk/SPARK-20101-follow.	2017-11-22 13:27:20 +01:00
Takeshi Yamamuro	2c0fe818a6	[SPARK-22445][SQL][FOLLOW-UP] Respect stream-side child's needCopyResult in BroadcastHashJoin ## What changes were proposed in this pull request? I found #19656 causes some bugs, for example, it changed the result set of `q6` in tpcds (I keep tracking TPCDS results daily [here](https://github.com/maropu/spark-tpcds-datagen/tree/master/reports/tests)): - w/o pr19658 ``` +-----+---+ \|state\|cnt\| +-----+---+ \| MA\| 10\| \| AK\| 10\| \| AZ\| 11\| \| ME\| 13\| \| VT\| 14\| \| NV\| 15\| \| NH\| 16\| \| UT\| 17\| \| NJ\| 21\| \| MD\| 22\| \| WY\| 25\| \| NM\| 26\| \| OR\| 31\| \| WA\| 36\| \| ND\| 38\| \| ID\| 39\| \| SC\| 45\| \| WV\| 50\| \| FL\| 51\| \| OK\| 53\| \| MT\| 53\| \| CO\| 57\| \| AR\| 58\| \| NY\| 58\| \| PA\| 62\| \| AL\| 63\| \| LA\| 63\| \| SD\| 70\| \| WI\| 80\| \| null\| 81\| \| MI\| 82\| \| NC\| 82\| \| MS\| 83\| \| CA\| 84\| \| MN\| 85\| \| MO\| 88\| \| IL\| 95\| \| IA\|102\| \| TN\|102\| \| IN\|103\| \| KY\|104\| \| NE\|113\| \| OH\|114\| \| VA\|130\| \| KS\|139\| \| GA\|168\| \| TX\|216\| +-----+---+ ``` - w/ pr19658 ``` +-----+---+ \|state\|cnt\| +-----+---+ \| RI\| 14\| \| AK\| 16\| \| FL\| 20\| \| NJ\| 21\| \| NM\| 21\| \| NV\| 22\| \| MA\| 22\| \| MD\| 22\| \| UT\| 22\| \| AZ\| 25\| \| SC\| 28\| \| AL\| 36\| \| MT\| 36\| \| WA\| 39\| \| ND\| 41\| \| MI\| 44\| \| AR\| 45\| \| OR\| 47\| \| OK\| 52\| \| PA\| 53\| \| LA\| 55\| \| CO\| 55\| \| NY\| 64\| \| WV\| 66\| \| SD\| 72\| \| MS\| 73\| \| NC\| 79\| \| IN\| 82\| \| null\| 85\| \| ID\| 88\| \| MN\| 91\| \| WI\| 95\| \| IL\| 96\| \| MO\| 97\| \| CA\|109\| \| CA\|109\| \| TN\|114\| \| NE\|115\| \| KY\|128\| \| OH\|131\| \| IA\|156\| \| TX\|160\| \| VA\|182\| \| KS\|211\| \| GA\|230\| +-----+---+ ``` This pr is to keep the original logic of `CodegenContext.copyResult` in `BroadcastHashJoinExec`. ## How was this patch tested? Existing tests Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #19781 from maropu/SPARK-22445-bugfix.	2017-11-22 09:09:50 +01:00
vinodkc	e0d7665cec	[SPARK-17920][SPARK-19580][SPARK-19878][SQL] Support writing to Hive table which uses Avro schema url 'avro.schema.url' ## What changes were proposed in this pull request? SPARK-19580 Support for avro.schema.url while writing to hive table SPARK-19878 Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala SPARK-17920 HiveWriterContainer passes null configuration to serde.initialize, causing NullPointerException in AvroSerde when using avro.schema.url Support writing to Hive table which uses Avro schema url 'avro.schema.url' For ex: create external table avro_in (a string) stored as avro location '/avro-in/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc'); create external table avro_out (a string) stored as avro location '/avro-out/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc'); insert overwrite table avro_out select * from avro_in; // fails with java.lang.NullPointerException WARN AvroSerDe: Encountered exception determining schema. Returning signal schema to indicate problem java.lang.NullPointerException at org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:182) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:174) ## Changes proposed in this fix Currently 'null' value is passed to serializer, which causes NPE during insert operation, instead pass Hadoop configuration object ## How was this patch tested? Added new test case in VersionsSuite Author: vinodkc <vinod.kc.in@gmail.com> Closes #19779 from vinodkc/br_Fix_SPARK-17920.	2017-11-21 22:31:46 -08:00
Jia Li	881c5c8073	[SPARK-22548][SQL] Incorrect nested AND expression pushed down to JDBC data source ## What changes were proposed in this pull request? Suppose we have the nested AND expression `(p1 AND p2) OR p3`, where p2 cannot be pushed down. In the current Spark code, during data source filter translation, (p1 AND p2) is returned as p1 only and p2 is simply lost. This issue occurs with the JDBC data source and is similar to [SPARK-12218](https://github.com/apache/spark/pull/10362) for Parquet. When an AND is nested below another expression, we should either push both legs or nothing. Note that: - The current Spark code always splits conjunctive predicates before it determines whether a predicate can be pushed down - Given (p1 AND p2) AND p3, it is split into p1, p2, p3, so there is no nested AND expression - The current Spark code logic for OR is OK: it either pushes both legs or nothing. The same translation method is also called by Data Source V2. ## How was this patch tested? Added new unit test cases to JDBCSuite gatorsmile Author: Jia Li <jiali@us.ibm.com> Closes #19776 from jliwork/spark-22548.	2017-11-21 17:30:02 -08:00
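The "push both legs or nothing" rule from SPARK-22548 can be illustrated with a small standalone Python sketch. This is not Spark's translation code: `Pred`, `And`, `Or`, `translate` and `split_conjuncts` are hypothetical names, and a returned `None` stands for "cannot be pushed down".

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass(frozen=True)
class Pred:
    name: str
    pushable: bool  # whether the data source can evaluate this predicate


@dataclass(frozen=True)
class And:
    left: "Expr"
    right: "Expr"


@dataclass(frozen=True)
class Or:
    left: "Expr"
    right: "Expr"


Expr = Union[Pred, And, Or]


def translate(expr: Expr) -> Optional[str]:
    """Translate an expression to a source filter string, or None if any part cannot be pushed."""
    if isinstance(expr, Pred):
        return expr.name if expr.pushable else None
    left, right = translate(expr.left), translate(expr.right)
    # For a nested AND or OR, dropping one leg would change the result,
    # so both legs must translate or the whole node is rejected.
    if left is None or right is None:
        return None
    op = "AND" if isinstance(expr, And) else "OR"
    return f"({left} {op} {right})"


def split_conjuncts(expr: Expr) -> List[Expr]:
    """Split a top-level conjunction into its independent conjuncts."""
    if isinstance(expr, And):
        return split_conjuncts(expr.left) + split_conjuncts(expr.right)
    return [expr]


p1, p2, p3 = Pred("p1", True), Pred("p2", False), Pred("p3", True)
nested = translate(Or(And(p1, p2), p3))
top_level = [translate(c) for c in split_conjuncts(And(And(p1, p2), p3))]
```

Here `nested` is `None` because p2 cannot be pushed, whereas the top-level conjunction still pushes p1 and p3 individually; that is safe only because the conjuncts are split before translation.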
Kazuaki Ishizaki	ac10171bea	[SPARK-22500][SQL] Fix 64KB JVM bytecode limit problem with cast ## What changes were proposed in this pull request? This PR changes `cast` code generation to place the generated code for the fields of a struct into separate methods if their size could be large. ## How was this patch tested? Added new test cases into `CastSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19730 from kiszk/SPARK-22500.	2017-11-21 22:24:43 +01:00
Marco Gaido	b96f61b6b2	[SPARK-22475][SQL] show histogram in DESC COLUMN command ## What changes were proposed in this pull request? Added the histogram representation to the output of the `DESCRIBE EXTENDED table_name column_name` command. ## How was this patch tested? Modified SQL UT and checked output Author: Marco Gaido <mgaido@hortonworks.com> Closes #19774 from mgaido91/SPARK-22475.	2017-11-21 20:55:24 +01:00
hyukjinkwon	6d7ebf2f9f	[SPARK-22165][SQL] Fixes type conflicts between double, long, decimals, dates and timestamps in partition column ## What changes were proposed in this pull request? This PR proposes to add a rule that re-uses `TypeCoercion.findWiderCommonType` when resolving type conflicts in partition values. Currently, this uses numeric precedence-like comparison; therefore, it appears to introduce failures for type conflicts between timestamps, dates and decimals. Please see: ```scala private val upCastingOrder: Seq[DataType] = Seq(NullType, IntegerType, LongType, FloatType, DoubleType, StringType) ... literals.map(_.dataType).maxBy(upCastingOrder.indexOf(_)) ``` The code below: ```scala val df = Seq((1, "2015-01-01"), (2, "2016-01-01 00:00:00")).toDF("i", "ts") df.write.format("parquet").partitionBy("ts").save("/tmp/foo") spark.read.load("/tmp/foo").printSchema() val df = Seq((1, "1"), (2, "1" * 30)).toDF("i", "decimal") df.write.format("parquet").partitionBy("decimal").save("/tmp/bar") spark.read.load("/tmp/bar").printSchema() ``` produces output as below: Before ``` root \|-- i: integer (nullable = true) \|-- ts: date (nullable = true) root \|-- i: integer (nullable = true) \|-- decimal: integer (nullable = true) ``` After ``` root \|-- i: integer (nullable = true) \|-- ts: timestamp (nullable = true) root \|-- i: integer (nullable = true) \|-- decimal: decimal(30,0) (nullable = true) ``` ### Type coercion table: This PR proposes the type conflict resolution as below: Before \|InputA \ InputB\|`NullType`\|`IntegerType`\|`LongType`\|`DecimalType(38,0)`\|`DoubleType`\|`DateType`\|`TimestampType`\|`StringType`\| \|------------------------\|----------\|----------\|----------\|----------\|----------\|----------\|----------\|----------\| \|`NullType`\|`StringType`\|`IntegerType`\|`LongType`\|`StringType`\|`DoubleType`\|`StringType`\|`StringType`\|`StringType`\|
\|`IntegerType`\|`IntegerType`\|`IntegerType`\|`LongType`\|`IntegerType`\|`DoubleType`\|`IntegerType`\|`IntegerType`\|`StringType`\| \|`LongType`\|`LongType`\|`LongType`\|`LongType`\|`LongType`\|`DoubleType`\|`LongType`\|`LongType`\|`StringType`\| \|`DecimalType(38,0)`\|`StringType`\|`IntegerType`\|`LongType`\|`DecimalType(38,0)`\|`DoubleType`\|`DecimalType(38,0)`\|`DecimalType(38,0)`\|`StringType`\| \|`DoubleType`\|`DoubleType`\|`DoubleType`\|`DoubleType`\|`DoubleType`\|`DoubleType`\|`DoubleType`\|`DoubleType`\|`StringType`\| \|`DateType`\|`StringType`\|`IntegerType`\|`LongType`\|`DateType`\|`DoubleType`\|`DateType`\|`DateType`\|`StringType`\| \|`TimestampType`\|`StringType`\|`IntegerType`\|`LongType`\|`TimestampType`\|`DoubleType`\|`TimestampType`\|`TimestampType`\|`StringType`\| \|`StringType`\|`StringType`\|`StringType`\|`StringType`\|`StringType`\|`StringType`\|`StringType`\|`StringType`\|`StringType`\| After \|InputA \ InputB\|`NullType`\|`IntegerType`\|`LongType`\|`DecimalType(38,0)`\|`DoubleType`\|`DateType`\|`TimestampType`\|`StringType`\| \|------------------------\|----------\|----------\|----------\|----------\|----------\|----------\|----------\|----------\| \|`NullType`\|`NullType`\|`IntegerType`\|`LongType`\|`DecimalType(38,0)`\|`DoubleType`\|`DateType`\|`TimestampType`\|`StringType`\| \|`IntegerType`\|`IntegerType`\|`IntegerType`\|`LongType`\|`DecimalType(38,0)`\|`DoubleType`\|`StringType`\|`StringType`\|`StringType`\| \|`LongType`\|`LongType`\|`LongType`\|`LongType`\|`DecimalType(38,0)`\|`StringType`\|`StringType`\|`StringType`\|`StringType`\| \|`DecimalType(38,0)`\|`DecimalType(38,0)`\|`DecimalType(38,0)`\|`DecimalType(38,0)`\|`DecimalType(38,0)`\|`StringType`\|`StringType`\|`StringType`\|`StringType`\| \|`DoubleType`\|`DoubleType`\|`DoubleType`\|`StringType`\|`StringType`\|`DoubleType`\|`StringType`\|`StringType`\|`StringType`\| 
\|`DateType`\|`DateType`\|`StringType`\|`StringType`\|`StringType`\|`StringType`\|`DateType`\|`TimestampType`\|`StringType`\| \|`TimestampType`\|`TimestampType`\|`StringType`\|`StringType`\|`StringType`\|`StringType`\|`TimestampType`\|`TimestampType`\|`StringType`\| \|`StringType`\|`StringType`\|`StringType`\|`StringType`\|`StringType`\|`StringType`\|`StringType`\|`StringType`\|`StringType`\| This was produced by: ```scala test("Print out chart") { val supportedTypes: Seq[DataType] = Seq( NullType, IntegerType, LongType, DecimalType(38, 0), DoubleType, DateType, TimestampType, StringType) // Old type conflict resolution: val upCastingOrder: Seq[DataType] = Seq(NullType, IntegerType, LongType, FloatType, DoubleType, StringType) def oldResolveTypeConflicts(dataTypes: Seq[DataType]): DataType = { val topType = dataTypes.maxBy(upCastingOrder.indexOf(_)) if (topType == NullType) StringType else topType } println(s"\|InputA \\ InputB\|${supportedTypes.map(dt => s"`${dt.toString}`").mkString("\|")}\|") println(s"\|------------------------\|${supportedTypes.map(_ => "----------").mkString("\|")}\|") supportedTypes.foreach { inputA => val types = supportedTypes.map(inputB => oldResolveTypeConflicts(Seq(inputA, inputB))) println(s"\|`$inputA`\|${types.map(dt => s"`${dt.toString}`").mkString("\|")}\|") } // New type conflict resolution: def newResolveTypeConflicts(dataTypes: Seq[DataType]): DataType = { dataTypes.fold[DataType](NullType)(findWiderTypeForPartitionColumn) } println(s"\|InputA \\ InputB\|${supportedTypes.map(dt => s"`${dt.toString}`").mkString("\|")}\|") println(s"\|------------------------\|${supportedTypes.map(_ => "----------").mkString("\|")}\|") supportedTypes.foreach { inputA => val types = supportedTypes.map(inputB => newResolveTypeConflicts(Seq(inputA, inputB))) println(s"\|`$inputA`\|${types.map(dt => s"`${dt.toString}`").mkString("\|")}\|") } } ``` ## How was this patch tested? Unit tests added in `ParquetPartitionDiscoverySuite`. 
Author: hyukjinkwon <gurwls223@gmail.com> Closes #19389 from HyukjinKwon/partition-type-coercion.	2017-11-21 20:53:38 +01:00
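To see why the old `maxBy(upCastingOrder.indexOf(_))` rule behaves oddly for dates, timestamps and decimals, here is a small Python re-creation of the old resolution only. It is a sketch for illustration: type names are plain strings, and `index_of` mimics Scala's `indexOf` returning -1 for types missing from the list, which is the assumption the explanation rests on.

```python
# Same ordering as the `upCastingOrder` shown in the PR description.
UPCASTING_ORDER = [
    "NullType", "IntegerType", "LongType", "FloatType", "DoubleType", "StringType",
]


def index_of(data_type):
    """Mimic Scala's Seq.indexOf: position in the list, or -1 when the type is absent."""
    return UPCASTING_ORDER.index(data_type) if data_type in UPCASTING_ORDER else -1


def old_resolve_type_conflicts(data_types):
    """Pick the type with the highest precedence; a lone NullType falls back to StringType."""
    # max() returns the first maximal element, like Scala's maxBy.
    top_type = max(data_types, key=index_of)
    return "StringType" if top_type == "NullType" else top_type


int_vs_date = old_resolve_type_conflicts(["IntegerType", "DateType"])
date_vs_timestamp = old_resolve_type_conflicts(["DateType", "TimestampType"])
null_vs_decimal = old_resolve_type_conflicts(["NullType", "DecimalType(38,0)"])
```

Because `DateType`, `TimestampType` and `DecimalType(38,0)` all rank at -1, any listed type wins over them and ties go to whichever comes first, which matches the corresponding cells of the "Before" table.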
gatorsmile	96e947ed6c	[SPARK-22569][SQL] Clean usage of addMutableState and splitExpressions ## What changes were proposed in this pull request? This PR cleans up the usage of addMutableState and splitExpressions: 1. replace hardcoded type strings with ctx.JAVA_BOOLEAN etc. 2. create a default value of the initCode for ctx.addMutableState 3. Use named arguments when calling `splitExpressions` ## How was this patch tested? The existing test cases Author: gatorsmile <gatorsmile@gmail.com> Closes #19790 from gatorsmile/codeClean.	2017-11-21 13:48:09 +01:00
Kazuaki Ishizaki	9bdff0bcd8	[SPARK-22550][SQL] Fix 64KB JVM bytecode limit problem with elt ## What changes were proposed in this pull request? This PR changes `elt` code generation to place the generated code for argument expressions into separate methods if their size could be large. This PR resolves the case of `elt` with a large number of arguments. ## How was this patch tested? Added new test cases into `StringExpressionsSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19778 from kiszk/SPARK-22550.	2017-11-21 12:19:11 +01:00
Kazuaki Ishizaki	c957714806	[SPARK-22508][SQL] Fix 64KB JVM bytecode limit problem with GenerateUnsafeRowJoiner.create() ## What changes were proposed in this pull request? This PR changes `GenerateUnsafeRowJoiner.create()` code generation to place the generated code for the statements that operate on the bitmap and offsets into separate methods if their size could be large. ## How was this patch tested? Added a new test case into `GenerateUnsafeRowJoinerSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19737 from kiszk/SPARK-22508.	2017-11-21 12:16:54 +01:00
Kazuaki Ishizaki	41c6f36018	[SPARK-22549][SQL] Fix 64KB JVM bytecode limit problem with concat_ws ## What changes were proposed in this pull request? This PR changes `concat_ws` code generation to place the generated code for argument expressions into separate methods if their size could be large. This PR resolves the case of `concat_ws` with a large number of arguments. ## How was this patch tested? Added new test cases into `StringExpressionsSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19777 from kiszk/SPARK-22549.	2017-11-21 01:42:05 +01:00
Kazuaki Ishizaki	3c3eebc873	[SPARK-20101][SQL] Use OffHeapColumnVector when "spark.sql.columnVector.offheap.enable" is set to "true" This PR enables the use of ``OffHeapColumnVector`` when ``spark.sql.columnVector.offheap.enable`` is set to ``true``. Although ``ColumnVector`` has two implementations, ``OnHeapColumnVector`` and ``OffHeapColumnVector``, only ``OnHeapColumnVector`` is currently used. This PR implements the following: - Pass ``OffHeapColumnVector`` to ``ColumnarBatch.allocate()`` when ``spark.sql.columnVector.offheap.enable`` is set to ``true`` - Free all off-heap memory regions via ``OffHeapColumnVector.close()`` - Ensure that ``OffHeapColumnVector.close()`` is called Use existing tests Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #17436 from kiszk/SPARK-20101.	2017-11-20 12:40:26 +01:00
Dongjoon Hyun	b10837ab1a	[SPARK-22557][TEST] Use ThreadSignaler explicitly ## What changes were proposed in this pull request? ScalaTest 3.0 uses an implicit `Signaler`. This PR makes sure all Spark tests use `ThreadSignaler` explicitly, which has the same default behavior of interrupting a thread on the JVM as ScalaTest 2.2.x. This will reduce potential flakiness. ## How was this patch tested? This is a test-suite-only update. It should pass the Jenkins tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #19784 from dongjoon-hyun/use_thread_signaler.	2017-11-20 13:32:01 +09:00
Kazuaki Ishizaki	d54bfec2e0	[SPARK-22498][SQL] Fix 64KB JVM bytecode limit problem with concat ## What changes were proposed in this pull request? This PR changes `concat` code generation to place the generated code for argument expressions into separate methods if their size could be large. This PR resolves the case of `concat` with a large number of arguments. ## How was this patch tested? Added new test cases into `StringExpressionsSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19728 from kiszk/SPARK-22498.	2017-11-18 19:40:06 +01:00
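The common idea behind these 64KB fixes, moving per-argument code out of one huge method into several small ones, can be sketched in plain Python. This is an illustrative sketch, not Spark's `splitExpressions`: the function name `split_into_methods`, the `max_chars` threshold, and the use of source length as a proxy for bytecode size are all assumptions made for the example.

```python
def split_into_methods(snippets, max_chars=1024, prefix="apply"):
    """Group code snippets into helper methods whose bodies stay under max_chars.

    Returns (method_definitions, call_sequence), both as Java-like source text.
    """
    groups, current, size = [], [], 0
    for snippet in snippets:
        # Start a new helper method once adding this snippet would exceed the budget.
        if current and size + len(snippet) > max_chars:
            groups.append(current)
            current, size = [], 0
        current.append(snippet)
        size += len(snippet)
    if current:
        groups.append(current)

    methods = [
        f"private void {prefix}_{i}(InternalRow row) {{\n  " + "\n  ".join(group) + "\n}"
        for i, group in enumerate(groups)
    ]
    # The original call site shrinks to one short call per helper method.
    calls = "\n".join(f"{prefix}_{i}(row);" for i in range(len(groups)))
    return methods, calls


# 100 hypothetical argument evaluations, as for `concat` with many arguments.
snippets = [f"args[{i}] = eval{i}(row);" for i in range(100)]
methods, calls = split_into_methods(snippets, max_chars=200)
```

Each helper body stays small, and the caller only contains the short sequence of calls in `calls`, so no single method grows with the number of arguments.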
Shixiong Zhu	bf0c0ae2dc	[SPARK-22544][SS] FileStreamSource should use its own hadoop conf to call globPathIfNecessary ## What changes were proposed in this pull request? Pass the FileSystem created using the correct Hadoop conf into `globPathIfNecessary` so that it can pick up user's hadoop configurations, such as credentials. ## How was this patch tested? Jenkins Author: Shixiong Zhu <zsxwing@gmail.com> Closes #19771 from zsxwing/fix-file-stream-conf.	2017-11-17 15:35:24 -08:00
Li Jin	7d039e0c0a	[SPARK-22409] Introduce function type argument in pandas_udf ## What changes were proposed in this pull request? * Add a "function type" argument to pandas_udf. * Add a new public enum class `PandasUDFType` in pyspark.sql.functions * Refactor udf related code from pyspark.sql.functions to pyspark.sql.udf * Merge "PythonUdfType" and "PythonEvalType" into a single enum class "PythonEvalType" Example:
```
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('double', PandasUDFType.SCALAR)
def plus_one(v):
    return v + 1
```
## Design doc https://docs.google.com/document/d/1KlLaa-xJ3oz28xlEJqXyCAHU3dwFYkFs_ixcUXrJNTc/edit ## How was this patch tested? Added PandasUDFTests ## TODO: * [x] Implement proper enum type for `PandasUDFType` * [x] Update documentation * [x] Add more tests in PandasUDFTests Author: Li Jin <ice.xelloss@gmail.com> Closes #19630 from icexelloss/spark-22409-pandas-udf-type.	2017-11-17 16:43:08 +01:00
Wenchen Fan	b9dcbe5e1b	[SPARK-22542][SQL] remove unused features in ColumnarBatch ## What changes were proposed in this pull request? `ColumnarBatch` provides features to do fast filter and project in a columnar fashion; however, these features are never used by Spark, as Spark uses whole stage codegen and processes the data in a row fashion. This PR proposes to remove these unused features as we won't switch to columnar execution in the near future. Even if we do, I think this part needs a proper redesign. This is also a step to make `ColumnVector` public, as we don't wanna expose these features to users. ## How was this patch tested? existing tests Author: Wenchen Fan <wenchen@databricks.com> Closes #19766 from cloud-fan/vector.	2017-11-16 18:23:00 -08:00
Kazuaki Ishizaki	7f2e62ee6b	[SPARK-22501][SQL] Fix 64KB JVM bytecode limit problem with in ## What changes were proposed in this pull request? This PR changes `In` code generation to place the generated code for argument expressions into separate methods if their size could be large. ## How was this patch tested? Added new test cases into `PredicateSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19733 from kiszk/SPARK-22501.	2017-11-16 18:24:49 +01:00
Marco Gaido	4e7f07e255	[SPARK-22494][SQL] Fix 64KB limit exception with Coalesce and AtleastNNonNulls ## What changes were proposed in this pull request? Both `Coalesce` and `AtLeastNNonNulls` can cause the 64KB limit exception when used with a lot of arguments and/or complex expressions. This PR splits their expressions in order to avoid the issue. ## How was this patch tested? Added UTs Author: Marco Gaido <marcogaido91@gmail.com> Author: Marco Gaido <mgaido@hortonworks.com> Closes #19720 from mgaido91/SPARK-22494.	2017-11-16 18:19:13 +01:00
Kazuaki Ishizaki	ed885e7a65	[SPARK-22499][SQL] Fix 64KB JVM bytecode limit problem with least and greatest ## What changes were proposed in this pull request? This PR changes `least` and `greatest` code generation to place the generated code for argument expressions into separate methods if their size could be large. This PR resolves two cases: * `least` with a large number of arguments * `greatest` with a large number of arguments ## How was this patch tested? Added a new test case into `ArithmeticExpressionsSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19729 from kiszk/SPARK-22499.	2017-11-16 17:56:21 +01:00
osatici	2014e7a789	[SPARK-22479][SQL] Exclude credentials from SaveintoDataSourceCommand.simpleString ## What changes were proposed in this pull request? Do not include jdbc properties which may contain credentials in logging a logical plan with `SaveIntoDataSourceCommand` in it. ## How was this patch tested? building locally and trying to reproduce (per the steps in https://issues.apache.org/jira/browse/SPARK-22479): ``` == Parsed Logical Plan == SaveIntoDataSourceCommand org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider570127fa, Map(dbtable -> test20, driver -> org.postgresql.Driver, url -> *******(redacted), password -> *****(redacted)), ErrorIfExists +- Range (0, 100, step=1, splits=Some(8)) == Analyzed Logical Plan == SaveIntoDataSourceCommand org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider570127fa, Map(dbtable -> test20, driver -> org.postgresql.Driver, url -> *****(redacted), password -> *****(redacted)), ErrorIfExists +- Range (0, 100, step=1, splits=Some(8)) == Optimized Logical Plan == SaveIntoDataSourceCommand org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider570127fa, Map(dbtable -> test20, driver -> org.postgresql.Driver, url -> *****(redacted), password -> *****(redacted)), ErrorIfExists +- Range (0, 100, step=1, splits=Some(8)) == Physical Plan == Execute SaveIntoDataSourceCommand +- SaveIntoDataSourceCommand org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider570127fa, Map(dbtable -> test20, driver -> org.postgresql.Driver, url -> *****(redacted), password -> *******(redacted)), ErrorIfExists +- Range (0, 100, step=1, splits=Some(8)) ``` Author: osatici <osatici@palantir.com> Closes #19708 from onursatici/os/redact-jdbc-creds.	2017-11-15 14:08:51 -08:00
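The redaction shown in the plan output above can be approximated with a short Python sketch. It is only an illustration of the idea: the key pattern `SENSITIVE_KEYS`, the placeholder text, and the helpers `redact` and `simple_string` are hypothetical and do not reproduce Spark's actual redaction utilities or configuration.

```python
import re

# Hypothetical pattern for option keys whose values may carry credentials.
SENSITIVE_KEYS = re.compile(r"(?i)url|password|secret|token")
PLACEHOLDER = "*********(redacted)"


def redact(options):
    """Return a copy of the options with values of sensitive keys replaced."""
    return {
        key: PLACEHOLDER if SENSITIVE_KEYS.search(key) else value
        for key, value in options.items()
    }


def simple_string(provider, options, mode):
    """Render a one-line plan description that never includes raw credentials."""
    rendered = ", ".join(f"{k} -> {v}" for k, v in redact(options).items())
    return f"SaveIntoDataSourceCommand {provider}, Map({rendered}), {mode}"


options = {
    "dbtable": "test20",
    "driver": "org.postgresql.Driver",
    "url": "jdbc:postgresql://host/db?password=hunter2",
    "password": "hunter2",
}
plan_line = simple_string("JdbcRelationProvider", options, "ErrorIfExists")
```

Redacting the `url` value as well as `password` matters because JDBC URLs can embed credentials as query parameters, as in the made-up URL above.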
liutang123	bc0848b4c1	[SPARK-22469][SQL] Accuracy problem in comparison with string and numeric ## What changes were proposed in this pull request? This fixes a problem caused by #15880 `select '1.5' > 0.5; // Result is NULL in Spark but is true in Hive. ` When comparing a string with a numeric value, cast both to double, as Hive does. Author: liutang123 <liutang123@yeah.net> Closes #19692 from liutang123/SPARK-22469.	2017-11-15 09:02:54 -08:00
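As a rough illustration of the cast-to-double rule in this commit, the plain-Scala sketch below compares a string with a numeric value by parsing the string as a double first. It does not use Spark; `StringNumericCompareSketch` and `greaterThan` are hypothetical names, and `None` is used here to stand in for SQL NULL when the string cannot be parsed.

```scala
import scala.util.Try

object StringNumericCompareSketch {
  // Hypothetical helper: parse the string side as a double, then compare.
  // None models a NULL result for strings that are not valid numbers.
  def greaterThan(left: String, right: Double): Option[Boolean] =
    Try(left.trim.toDouble).toOption.map(_ > right)
}
```

With this rule, `greaterThan("1.5", 0.5)` yields a defined result rather than a missing one, which is the behaviour the commit message describes for Hive.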
Wenchen Fan	dce1610ae3	[SPARK-22514][SQL] move ColumnVector.Array and ColumnarBatch.Row to individual files ## What changes were proposed in this pull request? Logically the `Array` doesn't belong to `ColumnVector`, and `Row` doesn't belong to `ColumnarBatch`. e.g. `ColumnVector` needs to return `Array` for `getArray`, and `Row` for `getStruct`. `Array` and `Row` can return each other with the `getArray`/`getStruct` methods. This is also a step toward making `ColumnVector` public; it's cleaner to have `Array` and `Row` as top-level classes. This PR only moves code around, with two renames: `Array` -> `VectorBasedArray`, `Row` -> `VectorBasedRow`. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #19740 from cloud-fan/vector.	2017-11-15 14:42:37 +01:00
Marcelo Vanzin	0ffa7c488f	[SPARK-20652][SQL] Store SQL UI data in the new app status store. This change replaces the SQLListener with a new implementation that saves the data to the same store used by the SparkContext's status store. For that, the types used by the old SQLListener had to be updated a bit so that they're more serialization-friendly. The interface for getting data from the store was abstracted into a new class, SQLAppStatusStore (following the convention used in core). Another change is the way that the SQL UI hooks up into the core UI or the SHS. The old "SparkHistoryListenerFactory" was replaced with a new "AppStatePlugin" that more explicitly differentiates between the two use cases: processing events, and showing the UI. Both live apps and the SHS use this new API (previously, it was restricted to the SHS). Note on the above: this causes a slight change of behavior for live apps; the SQL tab will only show up after the first execution is started. The metrics gathering code was re-worked a bit so that the types used are less memory hungry and more serialization-friendly. This reduces memory usage when using in-memory stores, and reduces load times when using disk stores. Tested with existing and added unit tests. Note one unit test was disabled because it depends on SPARK-20653, which isn't in yet. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #19681 from vanzin/SPARK-20652.	2017-11-14 15:28:22 -06:00
Zhenhua Wang	11b60af737	[SPARK-17074][SQL] Generate equi-height histogram in column statistics ## What changes were proposed in this pull request? An equi-height histogram is effective in cardinality estimation and more accurate than basic column stats (min, max, ndv, etc.), especially for skewed distributions, so we need to support it. In an equi-height histogram, all buckets (intervals) have the same height (frequency). In this PR, we use a two-step method to generate an equi-height histogram: 1. use `ApproximatePercentile` to get percentiles `p(0), p(1/n), p(2/n) ... p((n-1)/n), p(1)`; 2. construct range values of buckets, e.g. `[p(0), p(1/n)], [p(1/n), p(2/n)] ... [p((n-1)/n), p(1)]`, and use `ApproxCountDistinctForIntervals` to count ndv in each bucket. Each bucket is of the form: `(lowerBound, higherBound, ndv)`. ## How was this patch tested? Added new test cases and modified some existing test cases. Author: Zhenhua Wang <wangzhenhua@huawei.com> Author: Zhenhua Wang <wzh_zju@163.com> Closes #19479 from wzhfy/generate_histogram.	2017-11-14 16:41:43 +01:00
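The two-step method in this commit can be sketched in plain Scala without Spark. This is only an approximation of the idea: exact sorting stands in for `ApproximatePercentile`, an exact distinct count stands in for `ApproxCountDistinctForIntervals`, and the names `Bucket`, `EquiHeightHistogramSketch`, `percentiles`, and `histogram` are invented for the example. Treating every bucket after the first as open at its lower bound is also an assumption, made so that a boundary value is counted once.

```scala
// Each bucket has the form (lowerBound, higherBound, ndv), as in the commit message.
final case class Bucket(lowerBound: Double, higherBound: Double, ndv: Long)

object EquiHeightHistogramSketch {
  // Step 1: read the percentiles p(0), p(1/n), ..., p(1) off the sorted values.
  def percentiles(sorted: Vector[Double], n: Int): Vector[Double] =
    (0 to n).map { k =>
      val idx = math.round(k.toDouble / n * (sorted.length - 1)).toInt
      sorted(idx)
    }.toVector

  // Step 2: pair adjacent percentiles into buckets and count the distinct
  // values that fall into each one.
  def histogram(values: Seq[Double], n: Int): Vector[Bucket] = {
    require(values.nonEmpty && n > 0, "need at least one value and one bucket")
    val sorted = values.sorted.toVector
    val ps = percentiles(sorted, n)
    ps.zip(ps.tail).zipWithIndex.map { case ((lo, hi), i) =>
      // First bucket is [lo, hi]; later buckets are (lo, hi] (an assumption).
      val inBucket = sorted.filter(v => (if (i == 0) v >= lo else v > lo) && v <= hi)
      Bucket(lo, hi, inBucket.distinct.size.toLong)
    }
  }
}
```

For example, `EquiHeightHistogramSketch.histogram((1 to 9).map(_.toDouble), 4)` produces four buckets whose bounds are the percentiles of the input, each carrying its own distinct-value count.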
hyukjinkwon	673c670465	[SPARK-17310][SQL] Add an option to disable record-level filter in Parquet-side ## What changes were proposed in this pull request? There is a concern that Spark-side codegen row-by-row filtering might generally be faster than Parquet's, because Spark's implementation tries to avoid the type boxing and additional function calls that Parquet's incurs. So, this PR adds an option to disable/enable record-by-record filtering on the Parquet side. It sets the default to `false` to take advantage of the improvement. This was also discussed in https://github.com/apache/spark/pull/14671. ## How was this patch tested? Benchmarks were run manually. I generated a billion (1,000,000,000) records and tested equality comparisons concatenated with `OR`. The number of filters combined ranged from 5 to 30. Spark-side filtering does seem to be faster in this test case, and the gap increases as the filter tree becomes larger. The details are below: Code ``` scala test("Parquet-side filter vs Spark-side filter - record by record") { withTempPath { path => val N = 1000 * 1000 * 1000 val df = spark.range(N).toDF("a") df.write.parquet(path.getAbsolutePath) val benchmark = new Benchmark("Parquet-side vs Spark-side", N) Seq(5, 10, 20, 30).foreach { num => val filterExpr = (0 to num).map(i => s"a = $i").mkString(" OR ") benchmark.addCase(s"Parquet-side filter - number of filters [$num]", 3) { _ => withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> false.toString, SQLConf.PARQUET_RECORD_FILTER_ENABLED.key -> true.toString) { // We should strip Spark-side filter to compare correctly. 
stripSparkFilter( spark.read.parquet(path.getAbsolutePath).filter(filterExpr)).count() } } benchmark.addCase(s"Spark-side filter - number of filters [$num]", 3) { _ => withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> false.toString, SQLConf.PARQUET_RECORD_FILTER_ENABLED.key -> false.toString) { spark.read.parquet(path.getAbsolutePath).filter(filterExpr).count() } } } benchmark.run() } } ``` Result ``` Parquet-side vs Spark-side: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ Parquet-side filter - number of filters [5] 4268 / 4367 234.3 4.3 0.8X Spark-side filter - number of filters [5] 3709 / 3741 269.6 3.7 0.9X Parquet-side filter - number of filters [10] 5673 / 5727 176.3 5.7 0.6X Spark-side filter - number of filters [10] 3588 / 3632 278.7 3.6 0.9X Parquet-side filter - number of filters [20] 8024 / 8440 124.6 8.0 0.4X Spark-side filter - number of filters [20] 3912 / 3946 255.6 3.9 0.8X Parquet-side filter - number of filters [30] 11936 / 12041 83.8 11.9 0.3X Spark-side filter - number of filters [30] 3929 / 3978 254.5 3.9 0.8X ``` Author: hyukjinkwon <gurwls223@gmail.com> Closes #15049 from HyukjinKwon/SPARK-17310.	2017-11-14 12:34:21 +01:00
Wenchen Fan	f7534b37ee	[SPARK-22487][SQL][FOLLOWUP] still keep spark.sql.hive.version ## What changes were proposed in this pull request? A follow-up of https://github.com/apache/spark/pull/19712 that adds back `spark.sql.hive.version`, so that users who try to read this config still get a default value instead of null. ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #19719 from cloud-fan/minor.	2017-11-13 13:10:13 -08:00
Bryan Cutler	209b9361ac	[SPARK-20791][PYSPARK] Use Arrow to create Spark DataFrame from Pandas ## What changes were proposed in this pull request? This change uses Arrow to optimize the creation of a Spark DataFrame from a Pandas DataFrame. The input df is sliced according to the default parallelism. The optimization is enabled with the existing conf "spark.sql.execution.arrow.enabled" and is disabled by default. ## How was this patch tested? Added new unit test to create DataFrame with and without the optimization enabled, then compare results. Author: Bryan Cutler <cutlerb@gmail.com> Author: Takuya UESHIN <ueshin@databricks.com> Closes #19459 from BryanCutler/arrow-createDataFrame-from_pandas-SPARK-20791.	2017-11-13 13:16:01 +09:00
Kazuaki Ishizaki	9bf696dbec	[SPARK-21720][SQL] Fix 64KB JVM bytecode limit problem with AND or OR ## What changes were proposed in this pull request? This PR changes `AND` or `OR` code generation to place condition and then expressions' generated code into separate methods if its size could be large. When the method is newly generated, variables for `isNull` and `value` are declared as instance variables to pass these values (e.g. `isNull1409` and `value1409`) to the callers of the generated method. This PR resolves two cases: * large code size of left expression * large code size of right expression ## How was this patch tested? Added a new test case to `CodeGenerationSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #18972 from kiszk/SPARK-21720.	2017-11-12 22:44:47 +01:00
Wenchen Fan	21a7bfd5c3	[SPARK-10365][SQL] Support Parquet logical type TIMESTAMP_MICROS ## What changes were proposed in this pull request? This PR enables Spark to read Parquet TIMESTAMP_MICROS values, and adds a new config that allows Spark to write timestamp values to Parquet as the TIMESTAMP_MICROS type. ## How was this patch tested? new test Author: Wenchen Fan <wenchen@databricks.com> Closes #19702 from cloud-fan/parquet.	2017-11-11 22:40:26 +01:00
gatorsmile	d6ee69e776	[SPARK-22488][SQL] Fix the view resolution issue in the SparkSession internal table() API ## What changes were proposed in this pull request? The current internal `table()` API of `SparkSession` bypasses the Analyzer and directly calls the `sessionState.catalog.lookupRelation` API. This skips the view resolution logic in our Analyzer rule `ResolveRelations`. This internal API is widely used by various DDL commands and by public and internal APIs. Users might get a strange error caused by view resolution when the default database is different. ``` Table or view not found: t1; line 1 pos 14 org.apache.spark.sql.AnalysisException: Table or view not found: t1; line 1 pos 14 at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) ``` This PR fixes it by enforcing the use of `ResolveRelations` to resolve the table. ## How was this patch tested? Added a test case and modified the existing test cases Author: gatorsmile <gatorsmile@gmail.com> Closes #19713 from gatorsmile/viewResolution.	2017-11-11 18:20:11 +01:00
Liang-Chi Hsieh	154351e6db	[SPARK-22462][SQL] Make rdd-based actions in Dataset trackable in SQL UI ## What changes were proposed in this pull request? For a few Dataset actions such as `foreach`, no SQL metrics are currently visible in the SQL tab of the Spark UI. This is because the action binds wrongly to the Dataset's `QueryExecution`. As these actions evaluate directly on the RDD, which has its own `QueryExecution`, we should bind to the RDD's `QueryExecution` to show correct SQL metrics in the UI. ## How was this patch tested? Tested manually. A screenshot is attached to the PR. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #19689 from viirya/SPARK-22462.	2017-11-11 12:34:30 +01:00
Rekha Joshi	808e886b96	[SPARK-21667][STREAMING] ConsoleSink should not fail streaming query with checkpointLocation option ## What changes were proposed in this pull request? Fix to allow recovery with the console sink and avoid the checkpoint exception. ## How was this patch tested? Existing tests and manual tests (replicating the error and seeing no checkpoint error after the fix). Author: Rekha Joshi <rekhajoshm@gmail.com> Author: rjoshi2 <rekhajoshm@gmail.com> Closes #19407 from rekhajoshm/SPARK-21667.	2017-11-10 15:18:11 -08:00
Kazuaki Ishizaki	f2da738c76	[SPARK-22284][SQL] Fix 64KB JVM bytecode limit problem in calculating hash for nested structs ## What changes were proposed in this pull request? This PR avoids generating a huge method for calculating a murmur3 hash for nested structs by splitting a huge method (e.g. `apply_4`) into multiple smaller methods. Sample program ``` val structOfString = new StructType().add("str", StringType) var inner = new StructType() for (_ <- 0 until 800) { inner = inner.add("structOfString", structOfString) } var schema = new StructType() for (_ <- 0 until 50) { schema = schema.add("structOfStructOfStrings", inner) } GenerateMutableProjection.generate(Seq(Murmur3Hash(exprs, 42))) ``` Without this PR ``` /* 005 / class SpecificMutableProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection { / 006 / / 007 / private Object[] references; / 008 / private InternalRow mutableRow; / 009 / private int value; / 010 / private int value_0; ... / 034 / public java.lang.Object apply(java.lang.Object _i) { / 035 / InternalRow i = (InternalRow) _i; / 036 / / 037 / / 038 / / 039 / value = 42; / 040 / apply_0(i); / 041 / apply_1(i); / 042 / apply_2(i); / 043 / apply_3(i); / 044 / apply_4(i); / 045 / nestedClassInstance.apply_5(i); ... / 089 / nestedClassInstance8.apply_49(i); / 090 / value_0 = value; / 091 / / 092 / // copy all the results into MutableRow / 093 / mutableRow.setInt(0, value_0); / 094 / return mutableRow; / 095 / } / 096 / / 097 / / 098 / private void apply_4(InternalRow i) { / 099 / / 100 / boolean isNull5 = i.isNullAt(4); / 101 / InternalRow value5 = isNull5 ? 
null : (i.getStruct(4, 800)); / 102 / if (!isNull5) { / 103 / / 104 / if (!value5.isNullAt(0)) { / 105 / / 106 / final InternalRow element6400 = value5.getStruct(0, 1); / 107 / / 108 / if (!element6400.isNullAt(0)) { / 109 / / 110 / final UTF8String element6401 = element6400.getUTF8String(0); / 111 / value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element6401.getBaseObject(), element6401.getBaseOffset(), element6401.numBytes(), value); / 112 / / 113 / } / 114 / / 115 / / 116 / } / 117 / / 118 / / 119 / if (!value5.isNullAt(1)) { / 120 / / 121 / final InternalRow element6402 = value5.getStruct(1, 1); / 122 / / 123 / if (!element6402.isNullAt(0)) { / 124 / / 125 / final UTF8String element6403 = element6402.getUTF8String(0); / 126 / value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element6403.getBaseObject(), element6403.getBaseOffset(), element6403.numBytes(), value); / 127 / / 128 / } / 128 / } / 129 / / 130 / / 131 / } / 132 / / 133 / / 134 / if (!value5.isNullAt(2)) { / 135 / / 136 / final InternalRow element6404 = value5.getStruct(2, 1); / 137 / / 138 / if (!element6404.isNullAt(0)) { / 139 / / 140 / final UTF8String element6405 = element6404.getUTF8String(0); / 141 / value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element6405.getBaseObject(), element6405.getBaseOffset(), element6405.numBytes(), value); / 142 / / 143 / } / 144 / / 145 / / 146 / } / 147 / ... 
/ 12074 / if (!value5.isNullAt(798)) { / 12075 / / 12076 / final InternalRow element7996 = value5.getStruct(798, 1); / 12077 / / 12078 / if (!element7996.isNullAt(0)) { / 12079 / / 12080 / final UTF8String element7997 = element7996.getUTF8String(0); / 12083 / } / 12084 / / 12085 / / 12086 / } / 12087 / / 12088 / / 12089 / if (!value5.isNullAt(799)) { / 12090 / / 12091 / final InternalRow element7998 = value5.getStruct(799, 1); / 12092 / / 12093 / if (!element7998.isNullAt(0)) { / 12094 / / 12095 / final UTF8String element7999 = element7998.getUTF8String(0); / 12096 / value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element7999.getBaseObject(), element7999.getBaseOffset(), element7999.numBytes(), value); / 12097 / / 12098 / } / 12099 / / 12100 / / 12101 / } / 12102 / / 12103 / } / 12104 / / 12105 / } / 12106 / / 12106 / / 12107 / / 12108 / private void apply_1(InternalRow i) { ... ``` With this PR ``` / 005 / class SpecificMutableProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection { / 006 / / 007 / private Object[] references; / 008 / private InternalRow mutableRow; / 009 / private int value; / 010 / private int value_0; / 011 / ... / 034 / public java.lang.Object apply(java.lang.Object _i) { / 035 / InternalRow i = (InternalRow) _i; / 036 / / 037 / / 038 / / 039 / value = 42; / 040 / nestedClassInstance11.apply50_0(i); / 041 / nestedClassInstance11.apply50_1(i); ... / 088 / nestedClassInstance11.apply50_48(i); / 089 / nestedClassInstance11.apply50_49(i); / 090 / value_0 = value; / 091 / / 092 / // copy all the results into MutableRow / 093 / mutableRow.setInt(0, value_0); / 094 / return mutableRow; / 095 / } / 096 / ... 
/ 37717 / private void apply4_0(InternalRow value5, InternalRow i) { / 37718 / / 37719 / if (!value5.isNullAt(0)) { / 37720 / / 37721 / final InternalRow element6400 = value5.getStruct(0, 1); / 37722 / / 37723 / if (!element6400.isNullAt(0)) { / 37724 / / 37725 / final UTF8String element6401 = element6400.getUTF8String(0); / 37726 / value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element6401.getBaseObject(), element6401.getBaseOffset(), element6401.numBytes(), value); / 37727 / / 37728 / } / 37729 / / 37730 / / 37731 / } / 37732 / / 37733 / if (!value5.isNullAt(1)) { / 37734 / / 37735 / final InternalRow element6402 = value5.getStruct(1, 1); / 37736 / / 37737 / if (!element6402.isNullAt(0)) { / 37738 / / 37739 / final UTF8String element6403 = element6402.getUTF8String(0); / 37740 / value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element6403.getBaseObject(), element6403.getBaseOffset(), element6403.numBytes(), value); / 37741 / / 37742 / } / 37743 / / 37744 / / 37745 / } / 37746 / / 37747 / if (!value5.isNullAt(2)) { / 37748 / / 37749 / final InternalRow element6404 = value5.getStruct(2, 1); / 37750 / / 37751 / if (!element6404.isNullAt(0)) { / 37752 / / 37753 / final UTF8String element6405 = element6404.getUTF8String(0); / 37754 / value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element6405.getBaseObject(), element6405.getBaseOffset(), element6405.numBytes(), value); / 37755 / / 37756 / } / 37757 / / 37758 / / 37759 / } / 37760 / / 37761 / } ... / 218470 / / 218471 / private void apply50_4(InternalRow i) { / 218472 / / 218473 / boolean isNull5 = i.isNullAt(4); / 218474 / InternalRow value5 = isNull5 ? null : (i.getStruct(4, 800)); / 218475 / if (!isNull5) { / 218476 / apply4_0(value5, i); / 218477 / apply4_1(value5, i); / 218478 / apply4_2(value5, i); ... / 218742 / nestedClassInstance.apply4_266(value5, i); / 218743 / } / 218744 / / 218745 */ } ``` ## How was this patch tested? 
Added new test to `HashExpressionsSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19563 from kiszk/SPARK-22284.	2017-11-10 21:17:49 +01:00
Shixiong Zhu	24ea781cd3	[SPARK-19644][SQL] Clean up Scala reflection garbage after creating Encoder ## What changes were proposed in this pull request? Because of the memory leak issue in `scala.reflect.api.Types.TypeApi.<:<` (https://github.com/scala/bug/issues/8302), creating an encoder may leak memory. This PR adds `cleanUpReflectionObjects` to clean up these leaking objects for methods calling `scala.reflect.api.Types.TypeApi.<:<`. ## How was this patch tested? The updated unit tests. Author: Shixiong Zhu <zsxwing@gmail.com> Closes #19687 from zsxwing/SPARK-19644.	2017-11-10 11:27:28 -08:00
Marco Gaido	5b41cbf13b	[SPARK-22473][TEST] Replace deprecated AsyncAssertions.Waiter and methods of java.sql.Date ## What changes were proposed in this pull request? In the `spark-sql` module tests there are deprecation warnings caused by the use of deprecated methods of `java.sql.Date` and of the deprecated `AsyncAssertions.Waiter` class. This PR replaces the deprecated methods of `java.sql.Date` with non-deprecated ones (using `Calendar` where needed). It also replaces the deprecated `org.scalatest.concurrent.AsyncAssertions.Waiter` with `org.scalatest.concurrent.Waiters._`. ## How was this patch tested? existing UTs Author: Marco Gaido <mgaido@hortonworks.com> Closes #19696 from mgaido91/SPARK-22473.	2017-11-10 11:24:24 -06:00
Kent Yao	28ab5bf597	[SPARK-22487][SQL][HIVE] Remove the unused HIVE_EXECUTION_VERSION property ## What changes were proposed in this pull request? Originally, https://github.com/apache/spark/pull/2843 added `spark.sql.hive.version` to reveal the underlying Hive version for JDBC connections. For some time afterwards, it was used as a version identifier for the execution Hive client. There is now no Hive client for execution in Spark, and no usages of HIVE_EXECUTION_VERSION were found in the whole Spark project. HIVE_EXECUTION_VERSION is set by `spark.sql.hive.version`, which is still set internally in some places or by users; this may lead developers and users to confuse it with HIVE_METASTORE_VERSION (spark.sql.hive.metastore.version). It would be better to remove it. ## How was this patch tested? Modified some existing unit tests. cc cloud-fan gatorsmile Author: Kent Yao <yaooqinn@hotmail.com> Closes #19712 from yaooqinn/SPARK-22487.	2017-11-10 12:01:02 +01:00


5990 commits