ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Liang-Chi Hsieh	00d176d2fe	[SPARK-20392][SQL] Set barrier to prevent re-entering a tree ## What changes were proposed in this pull request? The SQL `Analyzer` goes through a whole query plan even when most of it is already analyzed. This increases the time spent on query analysis, especially for long pipelines in ML. This patch adds a logical node called `AnalysisBarrier` that wraps an analyzed logical plan to prevent it from being analyzed again. The barrier is applied to the analyzed logical plan in `Dataset`. It doesn't change the output of the wrapped logical plan and just acts as a wrapper to hide it from the analyzer. New operations on the dataset are put on top of the barrier, so only the newly created nodes are analyzed. The analysis barrier is removed at the end of the analysis stage. ## How was this patch tested? Added tests. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #19873 from viirya/SPARK-20392-reopen.	2017-12-05 21:43:41 -08:00
Dongjoon Hyun	82183f7b57	[SPARK-22686][SQL] DROP TABLE IF EXISTS should not show AnalysisException ## What changes were proposed in this pull request? The fix for the view resolution issue in [SPARK-22488](https://github.com/apache/spark/pull/19713) introduced the following regression on the `2.2.1` and `master` branches. This PR fixes that. ```scala scala> spark.version res2: String = 2.2.1 scala> sql("DROP TABLE IF EXISTS t").show 17/12/04 21:01:06 WARN DropTableCommand: org.apache.spark.sql.AnalysisException: Table or view not found: t; org.apache.spark.sql.AnalysisException: Table or view not found: t; ``` ## How was this patch tested? Manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #19888 from dongjoon-hyun/SPARK-22686.	2017-12-06 10:52:29 +08:00
Dongjoon Hyun	326f1d6728	[SPARK-20728][SQL] Make OrcFileFormat configurable between sql/hive and sql/core ## What changes were proposed in this pull request? This PR aims to provide a configuration to choose the default `OrcFileFormat` from the legacy `sql/hive` module or the new `sql/core` module. For example, this configuration affects the following operations. ```scala spark.read.orc(...) ``` ```sql CREATE TABLE t USING ORC ... ``` ## How was this patch tested? Pass the Jenkins with new test suites. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #19871 from dongjoon-hyun/spark-sql-orc-enabled.	2017-12-05 20:46:35 +08:00
gatorsmile	53e5251bb3	[SPARK-22675][SQL] Refactoring PropagateTypes in TypeCoercion ## What changes were proposed in this pull request? `PropagateTypes` is called twice in `TypeCoercion`. We do not need to call it twice. Instead, we should call it after each change to the types. ## How was this patch tested? The existing tests Author: gatorsmile <gatorsmile@gmail.com> Closes #19874 from gatorsmile/deduplicatePropagateTypes.	2017-12-05 20:43:02 +08:00
Wenchen Fan	295df746ec	[SPARK-22677][SQL] cleanup whole stage codegen for hash aggregate ## What changes were proposed in this pull request? The `HashAggregateExec` whole stage codegen path is a little messy and hard to understand, this code cleans it up a little bit, especially for the fast hash map part. ## How was this patch tested? existing tests Author: Wenchen Fan <wenchen@databricks.com> Closes #19869 from cloud-fan/hash-agg.	2017-12-05 12:38:26 +08:00
Dongjoon Hyun	f23dddf105	[SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileFormat based on ORC 1.4.1 ## What changes were proposed in this pull request? Since [SPARK-2883](https://issues.apache.org/jira/browse/SPARK-2883), Apache Spark supports Apache ORC inside `sql/hive` module with Hive dependency. This PR aims to add a new ORC data source inside `sql/core` and to replace the old ORC data source eventually. This PR resolves the following three issues. - [SPARK-20682](https://issues.apache.org/jira/browse/SPARK-20682): Add new ORCFileFormat based on Apache ORC 1.4.1 - [SPARK-15474](https://issues.apache.org/jira/browse/SPARK-15474): ORC data source fails to write and read back empty dataframe - [SPARK-21791](https://issues.apache.org/jira/browse/SPARK-21791): ORC should support column names with dot ## How was this patch tested? Pass the Jenkins with the existing all tests and new tests for SPARK-15474 and SPARK-21791. Author: Dongjoon Hyun <dongjoon@apache.org> Author: Wenchen Fan <wenchen@databricks.com> Closes #19651 from dongjoon-hyun/SPARK-20682.	2017-12-03 22:21:44 +08:00
Shixiong Zhu	ee10ca7ec6	[SPARK-22638][SS] Use a separate queue for StreamingQueryListenerBus ## What changes were proposed in this pull request? Use a separate Spark event queue for StreamingQueryListenerBus so that if there are many non-streaming events, streaming query listeners don't need to wait for other Spark listeners and can catch up. ## How was this patch tested? Jenkins Author: Shixiong Zhu <zsxwing@gmail.com> Closes #19838 from zsxwing/SPARK-22638.	2017-12-01 13:02:03 -08:00
sujith71955	16adaf634b	[SPARK-22601][SQL] Data load is getting displayed successful on providing non existing nonlocal file path ## What changes were proposed in this pull request? When a user tries to load data from a nonexistent HDFS file path, the system does not validate the path and the load command reports success. This is misleading to the user. Validation already exists for a nonexistent local file path. This PR adds validation for a nonexistent HDFS file path. ## How was this patch tested? A UT has been added to verify the issue; snapshots were also added after verification on a Spark YARN cluster. Author: sujith71955 <sujithchacko.2010@gmail.com> Closes #19823 from sujith71955/master_LoadComand_Issue.	2017-11-30 20:45:30 -08:00
Adrian Ionescu	f5f8e84d9d	[SPARK-22614] Dataset API: repartitionByRange(...) ## What changes were proposed in this pull request? This PR introduces a way to explicitly range-partition a Dataset. So far, only round-robin and hash partitioning were possible via `df.repartition(...)`, but sometimes range partitioning might be desirable: e.g. when writing to disk, for better compression without the cost of global sort. The current implementation piggybacks on the existing `RepartitionByExpression` `LogicalPlan` and simply adds the following logic: If its expressions are of type `SortOrder`, then it will do `RangePartitioning`; otherwise `HashPartitioning`. This was by far the least intrusive solution I could come up with. ## How was this patch tested? Unit test for `RepartitionByExpression` changes, a test to ensure we're not changing the behavior of existing `.repartition()` and a few end-to-end tests in `DataFrameSuite`. Author: Adrian Ionescu <adrian@databricks.com> Closes #19828 from adrian-ionescu/repartitionByRange.	2017-11-30 15:41:34 -08:00
Yuming Wang	bcceab6495	[SPARK-22489][SQL] Shouldn't change broadcast join buildSide if user clearly specified ## What changes were proposed in this pull request? How to reproduce: ```scala import org.apache.spark.sql.execution.joins.BroadcastHashJoinExec spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value").createTempView("table1") spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", "value").createTempView("table2") val bl = sql("SELECT /*+ MAPJOIN(t1) */ * FROM table1 t1 JOIN table2 t2 ON t1.key = t2.key").queryExecution.executedPlan println(bl.children.head.asInstanceOf[BroadcastHashJoinExec].buildSide) ``` The result is `BuildRight`, but should be `BuildLeft`. This PR fixes this issue. ## How was this patch tested? unit tests Author: Yuming Wang <wgyumg@gmail.com> Closes #19714 from wangyum/SPARK-22489.	2017-11-30 15:36:26 -08:00
Wenchen Fan	9c29c55763	[SPARK-22643][SQL] ColumnarArray should be an immutable view ## What changes were proposed in this pull request? To make `ColumnVector` public, `ColumnarArray` needs to be public too, and we should not have mutable public fields in a public class. This PR proposes to make `ColumnarArray` an immutable view of the data, and to always create a new instance of `ColumnarArray` in `ColumnVector#getArray`. ## How was this patch tested? new benchmark in `ColumnarBatchBenchmark` Author: Wenchen Fan <wenchen@databricks.com> Closes #19842 from cloud-fan/column-vector.	2017-11-30 18:34:38 +08:00
Wenchen Fan	444a2bbb67	[SPARK-22652][SQL] remove set methods in ColumnarRow ## What changes were proposed in this pull request? As a step to make `ColumnVector` public, the `ColumnarRow` returned by `ColumnVector#getStruct` should be immutable. However, we do need the mutability of `ColumnarRow` for the fast vectorized hashmap in hash aggregate. To solve this, this PR introduces a `MutableColumnarRow` for this use case. ## How was this patch tested? existing test. Author: Wenchen Fan <wenchen@databricks.com> Closes #19847 from cloud-fan/mutable-row.	2017-11-30 18:28:58 +08:00
Wang Gengliang	57687280d4	[SPARK-22615][SQL] Handle more cases in PropagateEmptyRelation ## What changes were proposed in this pull request? Currently, in the optimizer rule `PropagateEmptyRelation`, the following cases are not handled: 1. empty relation as right child in left outer join 2. empty relation as left child in right outer join 3. empty relation as right child in left semi join 4. empty relation as right child in left anti join 5. only one empty relation in full outer join. Cases 1, 2 and 5 can be treated as a Cartesian product and cause an exception. See the new test cases. ## How was this patch tested? Unit test Author: Wang Gengliang <ltnwgl@gmail.com> Closes #19825 from gengliangwang/SPARK-22615.	2017-11-29 09:17:39 -08:00
Wenchen Fan	20b239845b	[SPARK-22605][SQL] SQL write job should also set Spark task output metrics ## What changes were proposed in this pull request? For SQL write jobs, we only set metrics for the SQL listener and display them in the SQL plan UI. We should also set metrics for Spark task output metrics, which will be shown in spark job UI. ## How was this patch tested? test it manually. For a simple write job ``` spark.range(1000).write.parquet("/tmp/p1") ``` now the spark job UI looks like ![ui](https://user-images.githubusercontent.com/3182036/33326478-05a25b7c-d490-11e7-96ef-806117774356.jpg) Author: Wenchen Fan <wenchen@databricks.com> Closes #19833 from cloud-fan/ui.	2017-11-29 19:18:47 +08:00
Herman van Hovell	475a29f11e	[SPARK-22637][SQL] Only refresh a logical plan once. ## What changes were proposed in this pull request? `CatalogImpl.refreshTable` uses `foreach(..)` to refresh all tables in a view. This traverses all nodes in the subtree and calls `LogicalPlan.refresh()` on these nodes. However `LogicalPlan.refresh()` is also refreshing its children, as a result refreshing a large view can be quite expensive. This PR just calls `LogicalPlan.refresh()` on the top node. ## How was this patch tested? Existing tests. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #19837 from hvanhovell/SPARK-22637.	2017-11-28 16:03:47 -08:00
Sunitha Kambhampati	a10b328dbc	[SPARK-22431][SQL] Ensure that the datatype in the schema for the table/view metadata is parseable by Spark before persisting it ## What changes were proposed in this pull request? * JIRA: [SPARK-22431](https://issues.apache.org/jira/browse/SPARK-22431) : Creating Permanent view with illegal type Description: - It is possible in Spark SQL to create a permanent view that uses an nested field with an illegal name. - For example if we create the following view: ```create view x as select struct('a' as `$q`, 1 as b) q``` - A simple select fails with the following exception: ``` select * from x; org.apache.spark.SparkException: Cannot recognize hive type string: struct<$q:string,b:int> at org.apache.spark.sql.hive.client.HiveClientImpl$.fromHiveColumn(HiveClientImpl.scala:812) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378) ... ``` Issue/Analysis: Right now, we can create a view with a schema that cannot be read back by Spark from the Hive metastore. For more details, please see the discussion about the analysis and proposed fix options in comment 1 and comment 2 in the [SPARK-22431](https://issues.apache.org/jira/browse/SPARK-22431) Proposed changes: - Fix the hive table/view codepath to check whether the schema datatype is parseable by Spark before persisting it in the metastore. This change is localized to HiveClientImpl to do the check similar to the check in FromHiveColumn. This is fail-fast and we will avoid the scenario where we write something to the metastore that we are unable to read it back. 
- Added new unit tests - Ran the sql related unit test suites ( hive/test, sql/test, catalyst/test) OK With the fix: ``` create view x as select struct('a' as `$q`, 1 as b) q; 17/11/28 10:44:55 ERROR SparkSQLDriver: Failed in [create view x as select struct('a' as `$q`, 1 as b) q] org.apache.spark.SparkException: Cannot recognize hive type string: struct<$q:string,b:int> at org.apache.spark.sql.hive.client.HiveClientImpl$.org$apache$spark$sql$hive$client$HiveClientImpl$$getSparkSQLDataType(HiveClientImpl.scala:884) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$org$apache$spark$sql$hive$client$HiveClientImpl$$verifyColumnDataType$1.apply(HiveClientImpl.scala:906) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$org$apache$spark$sql$hive$client$HiveClientImpl$$verifyColumnDataType$1.apply(HiveClientImpl.scala:906) at scala.collection.Iterator$class.foreach(Iterator.scala:893) ... ``` ## How was this patch tested? - New unit tests have been added. hvanhovell, Please review and share your thoughts/comments. Thank you so much. Author: Sunitha Kambhampati <skambha@us.ibm.com> Closes #19747 from skambha/spark22431.	2017-11-28 22:01:01 +01:00
Zhenhua Wang	da35574297	[SPARK-22515][SQL] Estimation relation size based on numRows * rowSize ## What changes were proposed in this pull request? Currently, relation size is computed as the sum of file size, which is error-prone because storage format like parquet may have a much smaller file size compared to in-memory size. When we choose broadcast join based on file size, there's a risk of OOM. But if the number of rows is available in statistics, we can get a better estimation by `numRows * rowSize`, which helps to alleviate this problem. ## How was this patch tested? Added a new test case for data source table and hive table. Author: Zhenhua Wang <wzh_zju@163.com> Author: Zhenhua Wang <wangzhenhua@huawei.com> Closes #19743 from wzhfy/better_leaf_size.	2017-11-28 11:43:21 -08:00
Takuya UESHIN	64817c423c	[SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp values for Pandas to respect session timezone ## What changes were proposed in this pull request? When converting Pandas DataFrame/Series from/to Spark DataFrame using `toPandas()` or pandas udfs, timestamp values respect the Python system timezone instead of the session timezone. For example, let's say we use `"America/Los_Angeles"` as the session timezone and have a timestamp value `"1970-01-01 00:00:01"` in that timezone. Btw, I'm in Japan so the Python timezone would be `"Asia/Tokyo"`. The timestamp value from the current `toPandas()` will be the following: ``` >>> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") >>> df = spark.createDataFrame([28801], "long").selectExpr("timestamp(value) as ts") >>> df.show() +-------------------+ | ts| +-------------------+ |1970-01-01 00:00:01| +-------------------+ >>> df.toPandas() ts 0 1970-01-01 17:00:01 ``` As you can see, the value becomes `"1970-01-01 17:00:01"` because it respects the Python timezone. As we discussed in #18664, we consider this behavior a bug and the value should be `"1970-01-01 00:00:01"`. ## How was this patch tested? Added tests and existing tests. Author: Takuya UESHIN <ueshin@databricks.com> Closes #19607 from ueshin/issues/SPARK-22395.	2017-11-28 16:45:22 +08:00
gaborgsomogyi	33d43bf1b6	[SPARK-22484][DOC] Document PySpark DataFrame csv writer behavior whe… ## What changes were proposed in this pull request? In PySpark API Document, DataFrame.write.csv() says that setting the quote parameter to an empty string should turn off quoting. Instead, it uses the [null character](https://en.wikipedia.org/wiki/Null_character) as the quote. This PR fixes the doc. ## How was this patch tested? Manual. ``` cd python/docs make html open _build/html/pyspark.sql.html ``` Author: gaborgsomogyi <gabor.g.somogyi@gmail.com> Closes #19814 from gaborgsomogyi/SPARK-22484.	2017-11-28 10:14:35 +09:00
Marco Gaido	087879a77a	[SPARK-22520][SQL] Support code generation for large CaseWhen ## What changes were proposed in this pull request? Code generation is disabled for CaseWhen when the number of branches is higher than `spark.sql.codegen.maxCaseBranches` (which defaults to 20). This was done to prevent the well known 64KB method limit exception. This PR proposes to support code generation also in those cases (without causing exceptions of course). As a side effect, we could get rid of the `spark.sql.codegen.maxCaseBranches` configuration. ## How was this patch tested? existing UTs Author: Marco Gaido <mgaido@hortonworks.com> Author: Marco Gaido <marcogaido91@gmail.com> Closes #19752 from mgaido91/SPARK-22520.	2017-11-28 07:46:18 +08:00
Zhenhua Wang	1ff4a77be4	[SPARK-22529][SQL] Relation stats should be consistent with other plans based on cbo config ## What changes were proposed in this pull request? Currently, relation stats are the same whether CBO is enabled or not. Although a relation (`LogicalRelation` or `HiveTableRelation`) is a `LogicalPlan`, its behavior is inconsistent with other plans. This can cause confusion when a user runs EXPLAIN COST commands. Besides, when CBO is disabled, we apply the size-only estimation strategy, so there's no need to propagate other catalog statistics to the relation. ## How was this patch tested? Enhanced existing test cases and added a test case. Author: Zhenhua Wang <wangzhenhua@huawei.com> Closes #19757 from wzhfy/catalog_stats_conversion.	2017-11-28 01:13:44 +08:00
Wenchen Fan	5a02e3a2ac	[SPARK-22602][SQL] remove ColumnVector#loadBytes ## What changes were proposed in this pull request? `ColumnVector#loadBytes` is only used as an optimization for reading UTF8String in `WritableColumnVector`. This PR moves this optimization to `WritableColumnVector` and simplifies it. ## How was this patch tested? existing test Author: Wenchen Fan <wenchen@databricks.com> Closes #19815 from cloud-fan/load-bytes.	2017-11-26 21:49:09 -08:00
Wenchen Fan	e3fd93f149	[SPARK-22604][SQL] remove the get address methods from ColumnVector ## What changes were proposed in this pull request? `nullsNativeAddress` and `valuesNativeAddress` are only used in tests and benchmark, no need to be top class API. ## How was this patch tested? existing tests Author: Wenchen Fan <wenchen@databricks.com> Closes #19818 from cloud-fan/minor.	2017-11-24 22:43:47 -08:00
Wenchen Fan	70221903f5	[SPARK-22596][SQL] set ctx.currentVars in CodegenSupport.consume ## What changes were proposed in this pull request? `ctx.currentVars` means the input variables for the current operator, which is already decided in `CodegenSupport`, we can set it there instead of `doConsume`. also add more comments to help people understand the codegen framework. After this PR, we now have a principle about setting `ctx.currentVars` and `ctx.INPUT_ROW`: 1. for non-whole-stage-codegen path, never set them. (permit some special cases like generating ordering) 2. for whole-stage-codegen `produce` path, mostly we don't need to set them, but blocking operators may need to set them for expressions that produce data from data source, sort buffer, aggregate buffer, etc. 3. for whole-stage-codegen `consume` path, mostly we don't need to set them because `currentVars` is automatically set to child input variables and `INPUT_ROW` is mostly not used. A few plans need to tweak them as they may have different inputs, or they use the input row. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #19803 from cloud-fan/codegen.	2017-11-24 21:50:30 -08:00
Wenchen Fan	0605ad7614	[SPARK-22543][SQL] fix java 64kb compile error for deeply nested expressions ## What changes were proposed in this pull request? A frequently reported issue of Spark is the Java 64kb compile error. This is because Spark generates a very big method and it's usually caused by 3 reasons: 1. a deep expression tree, e.g. a very complex filter condition 2. many individual expressions, e.g. expressions can have many children, operators can have many expressions. 3. a deep query plan tree (with whole stage codegen) This PR focuses on 1. There are already several patches(#15620 #18972 #18641) trying to fix this issue and some of them are already merged. However this is an endless job as every non-leaf expression has this issue. This PR proposes to fix this issue in `Expression.genCode`, to make sure the code for a single expression won't grow too big. According to maropu 's benchmark, no regression is found with TPCDS (thanks maropu !): https://docs.google.com/spreadsheets/d/1K3_7lX05-ZgxDXi9X_GleNnDjcnJIfoSlSCDZcL4gdg/edit?usp=sharing ## How was this patch tested? existing test Author: Wenchen Fan <wenchen@databricks.com> Author: Wenchen Fan <cloud0fan@gmail.com> Closes #19767 from cloud-fan/codegen.	2017-11-22 10:05:46 -08:00
Takeshi Yamamuro	2c0fe818a6	[SPARK-22445][SQL][FOLLOW-UP] Respect stream-side child's needCopyResult in BroadcastHashJoin ## What changes were proposed in this pull request? I found #19656 causes some bugs, for example, it changed the result set of `q6` in tpcds (I keep tracking TPCDS results daily [here](https://github.com/maropu/spark-tpcds-datagen/tree/master/reports/tests)): - w/o pr19658 ``` +-----+---+ \|state\|cnt\| +-----+---+ \| MA\| 10\| \| AK\| 10\| \| AZ\| 11\| \| ME\| 13\| \| VT\| 14\| \| NV\| 15\| \| NH\| 16\| \| UT\| 17\| \| NJ\| 21\| \| MD\| 22\| \| WY\| 25\| \| NM\| 26\| \| OR\| 31\| \| WA\| 36\| \| ND\| 38\| \| ID\| 39\| \| SC\| 45\| \| WV\| 50\| \| FL\| 51\| \| OK\| 53\| \| MT\| 53\| \| CO\| 57\| \| AR\| 58\| \| NY\| 58\| \| PA\| 62\| \| AL\| 63\| \| LA\| 63\| \| SD\| 70\| \| WI\| 80\| \| null\| 81\| \| MI\| 82\| \| NC\| 82\| \| MS\| 83\| \| CA\| 84\| \| MN\| 85\| \| MO\| 88\| \| IL\| 95\| \| IA\|102\| \| TN\|102\| \| IN\|103\| \| KY\|104\| \| NE\|113\| \| OH\|114\| \| VA\|130\| \| KS\|139\| \| GA\|168\| \| TX\|216\| +-----+---+ ``` - w/ pr19658 ``` +-----+---+ \|state\|cnt\| +-----+---+ \| RI\| 14\| \| AK\| 16\| \| FL\| 20\| \| NJ\| 21\| \| NM\| 21\| \| NV\| 22\| \| MA\| 22\| \| MD\| 22\| \| UT\| 22\| \| AZ\| 25\| \| SC\| 28\| \| AL\| 36\| \| MT\| 36\| \| WA\| 39\| \| ND\| 41\| \| MI\| 44\| \| AR\| 45\| \| OR\| 47\| \| OK\| 52\| \| PA\| 53\| \| LA\| 55\| \| CO\| 55\| \| NY\| 64\| \| WV\| 66\| \| SD\| 72\| \| MS\| 73\| \| NC\| 79\| \| IN\| 82\| \| null\| 85\| \| ID\| 88\| \| MN\| 91\| \| WI\| 95\| \| IL\| 96\| \| MO\| 97\| \| CA\|109\| \| CA\|109\| \| TN\|114\| \| NE\|115\| \| KY\|128\| \| OH\|131\| \| IA\|156\| \| TX\|160\| \| VA\|182\| \| KS\|211\| \| GA\|230\| +-----+---+ ``` This pr is to keep the original logic of `CodegenContext.copyResult` in `BroadcastHashJoinExec`. ## How was this patch tested? Existing tests Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #19781 from maropu/SPARK-22445-bugfix.	2017-11-22 09:09:50 +01:00
Jia Li	881c5c8073	[SPARK-22548][SQL] Incorrect nested AND expression pushed down to JDBC data source ## What changes were proposed in this pull request? Let’s say I have a nested AND expression shown below and p2 can not be pushed down, (p1 AND p2) OR p3 In current Spark code, during data source filter translation, (p1 AND p2) is returned as p1 only and p2 is simply lost. This issue occurs with JDBC data source and is similar to [SPARK-12218](https://github.com/apache/spark/pull/10362) for Parquet. When we have AND nested below another expression, we should either push both legs or nothing. Note that: - The current Spark code will always split conjunctive predicate before it determines if a predicate can be pushed down or not - If I have (p1 AND p2) AND p3, it will be split into p1, p2, p3. There won't be nested AND expression. - The current Spark code logic for OR is OK. It either pushes both legs or nothing. The same translation method is also called by Data Source V2. ## How was this patch tested? Added new unit test cases to JDBCSuite gatorsmile Author: Jia Li <jiali@us.ibm.com> Closes #19776 from jliwork/spark-22548.	2017-11-21 17:30:02 -08:00
Marco Gaido	b96f61b6b2	[SPARK-22475][SQL] show histogram in DESC COLUMN command ## What changes were proposed in this pull request? Added the histogram representation to the output of the `DESCRIBE EXTENDED table_name column_name` command. ## How was this patch tested? Modified SQL UT and checked output Author: Marco Gaido <mgaido@hortonworks.com> Closes #19774 from mgaido91/SPARK-22475.	2017-11-21 20:55:24 +01:00
hyukjinkwon	6d7ebf2f9f	[SPARK-22165][SQL] Fixes type conflicts between double, long, decimals, dates and timestamps in partition column ## What changes were proposed in this pull request? This PR proposes to add a rule that re-uses `TypeCoercion.findWiderCommonType` when resolving type conflicts in partition values. Currently, this uses numeric precedence-like comparison; therefore, it looks introducing failures for type conflicts between timestamps, dates and decimals, please see: ```scala private val upCastingOrder: Seq[DataType] = Seq(NullType, IntegerType, LongType, FloatType, DoubleType, StringType) ... literals.map(_.dataType).maxBy(upCastingOrder.indexOf(_)) ``` The codes below: ```scala val df = Seq((1, "2015-01-01"), (2, "2016-01-01 00:00:00")).toDF("i", "ts") df.write.format("parquet").partitionBy("ts").save("/tmp/foo") spark.read.load("/tmp/foo").printSchema() val df = Seq((1, "1"), (2, "1" * 30)).toDF("i", "decimal") df.write.format("parquet").partitionBy("decimal").save("/tmp/bar") spark.read.load("/tmp/bar").printSchema() ``` produces output as below: Before ``` root \|-- i: integer (nullable = true) \|-- ts: date (nullable = true) root \|-- i: integer (nullable = true) \|-- decimal: integer (nullable = true) ``` After ``` root \|-- i: integer (nullable = true) \|-- ts: timestamp (nullable = true) root \|-- i: integer (nullable = true) \|-- decimal: decimal(30,0) (nullable = true) ``` ### Type coercion table: This PR proposes the type conflict resolusion as below: Before \|InputA \ InputB\|`NullType`\|`IntegerType`\|`LongType`\|`DecimalType(38,0)`\|`DoubleType`\|`DateType`\|`TimestampType`\|`StringType`\| \|------------------------\|----------\|----------\|----------\|----------\|----------\|----------\|----------\|----------\| \|`NullType`\|`StringType`\|`IntegerType`\|`LongType`\|`StringType`\|`DoubleType`\|`StringType`\|`StringType`\|`StringType`\| 
\|`IntegerType`\|`IntegerType`\|`IntegerType`\|`LongType`\|`IntegerType`\|`DoubleType`\|`IntegerType`\|`IntegerType`\|`StringType`\| \|`LongType`\|`LongType`\|`LongType`\|`LongType`\|`LongType`\|`DoubleType`\|`LongType`\|`LongType`\|`StringType`\| \|`DecimalType(38,0)`\|`StringType`\|`IntegerType`\|`LongType`\|`DecimalType(38,0)`\|`DoubleType`\|`DecimalType(38,0)`\|`DecimalType(38,0)`\|`StringType`\| \|`DoubleType`\|`DoubleType`\|`DoubleType`\|`DoubleType`\|`DoubleType`\|`DoubleType`\|`DoubleType`\|`DoubleType`\|`StringType`\| \|`DateType`\|`StringType`\|`IntegerType`\|`LongType`\|`DateType`\|`DoubleType`\|`DateType`\|`DateType`\|`StringType`\| \|`TimestampType`\|`StringType`\|`IntegerType`\|`LongType`\|`TimestampType`\|`DoubleType`\|`TimestampType`\|`TimestampType`\|`StringType`\| \|`StringType`\|`StringType`\|`StringType`\|`StringType`\|`StringType`\|`StringType`\|`StringType`\|`StringType`\|`StringType`\| After \|InputA \ InputB\|`NullType`\|`IntegerType`\|`LongType`\|`DecimalType(38,0)`\|`DoubleType`\|`DateType`\|`TimestampType`\|`StringType`\| \|------------------------\|----------\|----------\|----------\|----------\|----------\|----------\|----------\|----------\| \|`NullType`\|`NullType`\|`IntegerType`\|`LongType`\|`DecimalType(38,0)`\|`DoubleType`\|`DateType`\|`TimestampType`\|`StringType`\| \|`IntegerType`\|`IntegerType`\|`IntegerType`\|`LongType`\|`DecimalType(38,0)`\|`DoubleType`\|`StringType`\|`StringType`\|`StringType`\| \|`LongType`\|`LongType`\|`LongType`\|`LongType`\|`DecimalType(38,0)`\|`StringType`\|`StringType`\|`StringType`\|`StringType`\| \|`DecimalType(38,0)`\|`DecimalType(38,0)`\|`DecimalType(38,0)`\|`DecimalType(38,0)`\|`DecimalType(38,0)`\|`StringType`\|`StringType`\|`StringType`\|`StringType`\| \|`DoubleType`\|`DoubleType`\|`DoubleType`\|`StringType`\|`StringType`\|`DoubleType`\|`StringType`\|`StringType`\|`StringType`\| 
\|`DateType`\|`DateType`\|`StringType`\|`StringType`\|`StringType`\|`StringType`\|`DateType`\|`TimestampType`\|`StringType`\| \|`TimestampType`\|`TimestampType`\|`StringType`\|`StringType`\|`StringType`\|`StringType`\|`TimestampType`\|`TimestampType`\|`StringType`\| \|`StringType`\|`StringType`\|`StringType`\|`StringType`\|`StringType`\|`StringType`\|`StringType`\|`StringType`\|`StringType`\| This was produced by: ```scala test("Print out chart") { val supportedTypes: Seq[DataType] = Seq( NullType, IntegerType, LongType, DecimalType(38, 0), DoubleType, DateType, TimestampType, StringType) // Old type conflict resolution: val upCastingOrder: Seq[DataType] = Seq(NullType, IntegerType, LongType, FloatType, DoubleType, StringType) def oldResolveTypeConflicts(dataTypes: Seq[DataType]): DataType = { val topType = dataTypes.maxBy(upCastingOrder.indexOf(_)) if (topType == NullType) StringType else topType } println(s"\|InputA \\ InputB\|${supportedTypes.map(dt => s"`${dt.toString}`").mkString("\|")}\|") println(s"\|------------------------\|${supportedTypes.map(_ => "----------").mkString("\|")}\|") supportedTypes.foreach { inputA => val types = supportedTypes.map(inputB => oldResolveTypeConflicts(Seq(inputA, inputB))) println(s"\|`$inputA`\|${types.map(dt => s"`${dt.toString}`").mkString("\|")}\|") } // New type conflict resolution: def newResolveTypeConflicts(dataTypes: Seq[DataType]): DataType = { dataTypes.fold[DataType](NullType)(findWiderTypeForPartitionColumn) } println(s"\|InputA \\ InputB\|${supportedTypes.map(dt => s"`${dt.toString}`").mkString("\|")}\|") println(s"\|------------------------\|${supportedTypes.map(_ => "----------").mkString("\|")}\|") supportedTypes.foreach { inputA => val types = supportedTypes.map(inputB => newResolveTypeConflicts(Seq(inputA, inputB))) println(s"\|`$inputA`\|${types.map(dt => s"`${dt.toString}`").mkString("\|")}\|") } } ``` ## How was this patch tested? Unit tests added in `ParquetPartitionDiscoverySuite`. 
Author: hyukjinkwon <gurwls223@gmail.com> Closes #19389 from HyukjinKwon/partition-type-coercion.	2017-11-21 20:53:38 +01:00
gatorsmile	96e947ed6c	[SPARK-22569][SQL] Clean usage of addMutableState and splitExpressions ## What changes were proposed in this pull request? This PR cleans up the usage of addMutableState and splitExpressions: 1. replace hardcoded type strings with ctx.JAVA_BOOLEAN etc. 2. create a default value of the initCode for ctx.addMutableState 3. Use named arguments when calling `splitExpressions` ## How was this patch tested? The existing test cases Author: gatorsmile <gatorsmile@gmail.com> Closes #19790 from gatorsmile/codeClean.	2017-11-21 13:48:09 +01:00
Kazuaki Ishizaki	3c3eebc873	[SPARK-20101][SQL] Use OffHeapColumnVector when "spark.sql.columnVector.offheap.enable" is set to "true" This PR enables the use of ``OffHeapColumnVector`` when ``spark.sql.columnVector.offheap.enable`` is set to ``true``. ``ColumnVector`` has two implementations, ``OnHeapColumnVector`` and ``OffHeapColumnVector``, but currently only ``OnHeapColumnVector`` is ever used. This PR implements the following: - Pass ``OffHeapColumnVector`` to ``ColumnarBatch.allocate()`` when ``spark.sql.columnVector.offheap.enable`` is set to ``true`` - Free all off-heap memory regions with ``OffHeapColumnVector.close()`` - Ensure that ``OffHeapColumnVector.close()`` is called Use existing tests Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #17436 from kiszk/SPARK-20101.	2017-11-20 12:40:26 +01:00
Dongjoon Hyun	b10837ab1a	[SPARK-22557][TEST] Use ThreadSignaler explicitly ## What changes were proposed in this pull request? ScalaTest 3.0 uses an implicit `Signaler`. This PR makes sure all Spark tests use `ThreadSignaler` explicitly, which has the same default behavior of interrupting a thread on the JVM as ScalaTest 2.2.x. This will reduce potential flakiness. ## How was this patch tested? This is a test-suite-only update. It should pass the Jenkins tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #19784 from dongjoon-hyun/use_thread_signaler.	2017-11-20 13:32:01 +09:00
Shixiong Zhu	bf0c0ae2dc	[SPARK-22544][SS] FileStreamSource should use its own hadoop conf to call globPathIfNecessary ## What changes were proposed in this pull request? Pass the FileSystem created using the correct Hadoop conf into `globPathIfNecessary` so that it can pick up user's hadoop configurations, such as credentials. ## How was this patch tested? Jenkins Author: Shixiong Zhu <zsxwing@gmail.com> Closes #19771 from zsxwing/fix-file-stream-conf.	2017-11-17 15:35:24 -08:00
Li Jin	7d039e0c0a	[SPARK-22409] Introduce function type argument in pandas_udf ## What changes were proposed in this pull request? * Add a "function type" argument to pandas_udf. * Add a new public enum class `PandasUDFType` in pyspark.sql.functions * Refactor udf related code from pyspark.sql.functions to pyspark.sql.udf * Merge "PythonUdfType" and "PythonEvalType" into a single enum class "PythonEvalType" Example: ``` from pyspark.sql.functions import pandas_udf, PandasUDFType @pandas_udf('double', PandasUDFType.SCALAR) def plus_one(v): return v + 1 ``` ## Design doc https://docs.google.com/document/d/1KlLaa-xJ3oz28xlEJqXyCAHU3dwFYkFs_ixcUXrJNTc/edit ## How was this patch tested? Added PandasUDFTests ## TODO: * [x] Implement proper enum type for `PandasUDFType` * [x] Update documentation * [x] Add more tests in PandasUDFTests Author: Li Jin <ice.xelloss@gmail.com> Closes #19630 from icexelloss/spark-22409-pandas-udf-type.	2017-11-17 16:43:08 +01:00
Wenchen Fan	b9dcbe5e1b	[SPARK-22542][SQL] remove unused features in ColumnarBatch ## What changes were proposed in this pull request? `ColumnarBatch` provides features to do fast filter and project in a columnar fashion; however, these features are never used by Spark, as Spark uses whole stage codegen and processes the data in a row fashion. This PR proposes to remove these unused features as we won't switch to columnar execution in the near future. Even if we do, I think this part needs a proper redesign. This is also a step to make `ColumnVector` public, as we don't want to expose these features to users. ## How was this patch tested? existing tests Author: Wenchen Fan <wenchen@databricks.com> Closes #19766 from cloud-fan/vector.	2017-11-16 18:23:00 -08:00
osatici	2014e7a789	[SPARK-22479][SQL] Exclude credentials from SaveintoDataSourceCommand.simpleString ## What changes were proposed in this pull request? Do not include jdbc properties which may contain credentials in logging a logical plan with `SaveIntoDataSourceCommand` in it. ## How was this patch tested? building locally and trying to reproduce (per the steps in https://issues.apache.org/jira/browse/SPARK-22479): ``` == Parsed Logical Plan == SaveIntoDataSourceCommand org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider570127fa, Map(dbtable -> test20, driver -> org.postgresql.Driver, url -> *******(redacted), password -> *****(redacted)), ErrorIfExists +- Range (0, 100, step=1, splits=Some(8)) == Analyzed Logical Plan == SaveIntoDataSourceCommand org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider570127fa, Map(dbtable -> test20, driver -> org.postgresql.Driver, url -> *****(redacted), password -> *****(redacted)), ErrorIfExists +- Range (0, 100, step=1, splits=Some(8)) == Optimized Logical Plan == SaveIntoDataSourceCommand org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider570127fa, Map(dbtable -> test20, driver -> org.postgresql.Driver, url -> *****(redacted), password -> *****(redacted)), ErrorIfExists +- Range (0, 100, step=1, splits=Some(8)) == Physical Plan == Execute SaveIntoDataSourceCommand +- SaveIntoDataSourceCommand org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider570127fa, Map(dbtable -> test20, driver -> org.postgresql.Driver, url -> *****(redacted), password -> *******(redacted)), ErrorIfExists +- Range (0, 100, step=1, splits=Some(8)) ``` Author: osatici <osatici@palantir.com> Closes #19708 from onursatici/os/redact-jdbc-creds.	2017-11-15 14:08:51 -08:00
liutang123	bc0848b4c1	[SPARK-22469][SQL] Accuracy problem in comparison with string and numeric ## What changes were proposed in this pull request? This fixes a problem caused by #15880 `select '1.5' > 0.5; // Result is NULL in Spark but is true in Hive. ` When comparing a string and a numeric, cast both to double, as Hive does. Author: liutang123 <liutang123@yeah.net> Closes #19692 from liutang123/SPARK-22469.	2017-11-15 09:02:54 -08:00
Wenchen Fan	dce1610ae3	[SPARK-22514][SQL] move ColumnVector.Array and ColumnarBatch.Row to individual files ## What changes were proposed in this pull request? Logically the `Array` doesn't belong to `ColumnVector`, and `Row` doesn't belong to `ColumnarBatch`. e.g. `ColumnVector` needs to return `Array` for `getArray`, and `Row` for `getStruct`. `Array` and `Row` can return each other with the `getArray`/`getStruct` methods. This is also a step to make `ColumnVector` public, it's cleaner to have `Array` and `Row` as top-level classes. This PR is just code moving around, with 2 renaming: `Array` -> `VectorBasedArray`, `Row` -> `VectorBasedRow`. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #19740 from cloud-fan/vector.	2017-11-15 14:42:37 +01:00
Marcelo Vanzin	0ffa7c488f	[SPARK-20652][SQL] Store SQL UI data in the new app status store. This change replaces the SQLListener with a new implementation that saves the data to the same store used by the SparkContext's status store. For that, the types used by the old SQLListener had to be updated a bit so that they're more serialization-friendly. The interface for getting data from the store was abstracted into a new class, SQLAppStatusStore (following the convention used in core). Another change is the way that the SQL UI hooks up into the core UI or the SHS. The old "SparkHistoryListenerFactory" was replaced with a new "AppStatePlugin" that more explicitly differentiates between the two use cases: processing events, and showing the UI. Both live apps and the SHS use this new API (previously, it was restricted to the SHS). Note on the above: this causes a slight change of behavior for live apps; the SQL tab will only show up after the first execution is started. The metrics gathering code was re-worked a bit so that the types used are less memory hungry and more serialization-friendly. This reduces memory usage when using in-memory stores, and reduces load times when using disk stores. Tested with existing and added unit tests. Note one unit test was disabled because it depends on SPARK-20653, which isn't in yet. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #19681 from vanzin/SPARK-20652.	2017-11-14 15:28:22 -06:00
Zhenhua Wang	11b60af737	[SPARK-17074][SQL] Generate equi-height histogram in column statistics ## What changes were proposed in this pull request? Equi-height histogram is effective in cardinality estimation, and more accurate than basic column stats (min, max, ndv, etc) especially in skew distribution. So we need to support it. For equi-height histogram, all buckets (intervals) have the same height (frequency). In this PR, we use a two-step method to generate an equi-height histogram: 1. use `ApproximatePercentile` to get percentiles `p(0), p(1/n), p(2/n) ... p((n-1)/n), p(1)`; 2. construct range values of buckets, e.g. `[p(0), p(1/n)], [p(1/n), p(2/n)] ... [p((n-1)/n), p(1)]`, and use `ApproxCountDistinctForIntervals` to count ndv in each bucket. Each bucket is of the form: `(lowerBound, higherBound, ndv)`. ## How was this patch tested? Added new test cases and modified some existing test cases. Author: Zhenhua Wang <wangzhenhua@huawei.com> Author: Zhenhua Wang <wzh_zju@163.com> Closes #19479 from wzhfy/generate_histogram.	2017-11-14 16:41:43 +01:00
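The two-step method described in the SPARK-17074 entry above can be illustrated with a minimal sketch in plain Python. This is not Spark's implementation: exact nearest-rank percentiles and exact distinct counts stand in for `ApproximatePercentile` and `ApproxCountDistinctForIntervals`, the function names `nearest_rank_percentile` and `equi_height_histogram` are hypothetical, and the interval convention (first bucket closed on both sides, later buckets open on the left) is an assumption made for the example.

```python
from typing import Iterable, List, Sequence, Tuple


def nearest_rank_percentile(sorted_vals: Sequence[float], num: int, den: int) -> float:
    """Exact nearest-rank percentile p(num/den) over already-sorted data."""
    n = len(sorted_vals)
    rank = -(-num * n // den)  # ceil(num * n / den) using integer arithmetic
    return sorted_vals[max(rank, 1) - 1]


def equi_height_histogram(
    values: Iterable[float], num_buckets: int
) -> List[Tuple[float, float, int]]:
    """Return buckets of the form (lowerBound, higherBound, ndv)."""
    data = sorted(values)
    if not data or num_buckets < 1:
        raise ValueError("need at least one value and one bucket")

    # Step 1: bucket endpoints p(0), p(1/n), ..., p(1).
    endpoints = [
        nearest_rank_percentile(data, i, num_buckets)
        for i in range(num_buckets + 1)
    ]

    # Step 2: count distinct values (ndv) inside each interval.
    # Assumed convention: [lo, hi] for the first bucket, (lo, hi] afterwards,
    # so a value sitting on a shared endpoint is counted only once.
    buckets = []
    for i in range(num_buckets):
        lo, hi = endpoints[i], endpoints[i + 1]
        if i == 0:
            members = {v for v in data if lo <= v <= hi}
        else:
            members = {v for v in data if lo < v <= hi}
        buckets.append((lo, hi, len(members)))
    return buckets


if __name__ == "__main__":
    print(equi_height_histogram(range(1, 101), 4))
    # [(1, 25, 25), (25, 50, 25), (50, 75, 25), (75, 100, 25)]
```

On uniformly spread data every bucket covers the same number of rows and the same number of distinct values. On skewed data the endpoints crowd around the frequent values, which is what makes this kind of histogram more informative than min/max/ndv alone.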
hyukjinkwon	673c670465	[SPARK-17310][SQL] Add an option to disable record-level filter in Parquet-side ## What changes were proposed in this pull request? There is a concern that Spark-side codegen row-by-row filtering might be faster than Parquet's in general, due to type-boxing and additional function calls which Spark's filtering tries to avoid. So, this PR adds an option to disable/enable record-by-record filtering on the Parquet side. It sets the default to `false` to take advantage of the improvement. This was also discussed in https://github.com/apache/spark/pull/14671. ## How was this patch tested? Manual benchmarks were performed. I generated a billion (1,000,000,000) records and tested equality comparisons concatenated with `OR`. The number of combined filters ranged from 5 to 30. Spark-side filtering is indeed faster in this test case, and the gap increases as the filter tree becomes larger. The details are as below: Code ```scala test("Parquet-side filter vs Spark-side filter - record by record") { withTempPath { path => val N = 1000 * 1000 * 1000 val df = spark.range(N).toDF("a") df.write.parquet(path.getAbsolutePath) val benchmark = new Benchmark("Parquet-side vs Spark-side", N) Seq(5, 10, 20, 30).foreach { num => val filterExpr = (0 to num).map(i => s"a = $i").mkString(" OR ") benchmark.addCase(s"Parquet-side filter - number of filters [$num]", 3) { _ => withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> false.toString, SQLConf.PARQUET_RECORD_FILTER_ENABLED.key -> true.toString) { // We should strip Spark-side filter to compare correctly. 
stripSparkFilter( spark.read.parquet(path.getAbsolutePath).filter(filterExpr)).count() } } benchmark.addCase(s"Spark-side filter - number of filters [$num]", 3) { _ => withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> false.toString, SQLConf.PARQUET_RECORD_FILTER_ENABLED.key -> false.toString) { spark.read.parquet(path.getAbsolutePath).filter(filterExpr).count() } } } benchmark.run() } } ``` Result ``` Parquet-side vs Spark-side: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ Parquet-side filter - number of filters [5] 4268 / 4367 234.3 4.3 0.8X Spark-side filter - number of filters [5] 3709 / 3741 269.6 3.7 0.9X Parquet-side filter - number of filters [10] 5673 / 5727 176.3 5.7 0.6X Spark-side filter - number of filters [10] 3588 / 3632 278.7 3.6 0.9X Parquet-side filter - number of filters [20] 8024 / 8440 124.6 8.0 0.4X Spark-side filter - number of filters [20] 3912 / 3946 255.6 3.9 0.8X Parquet-side filter - number of filters [30] 11936 / 12041 83.8 11.9 0.3X Spark-side filter - number of filters [30] 3929 / 3978 254.5 3.9 0.8X ``` Author: hyukjinkwon <gurwls223@gmail.com> Closes #15049 from HyukjinKwon/SPARK-17310.	2017-11-14 12:34:21 +01:00
Bryan Cutler	209b9361ac	[SPARK-20791][PYSPARK] Use Arrow to create Spark DataFrame from Pandas ## What changes were proposed in this pull request? This change uses Arrow to optimize the creation of a Spark DataFrame from a Pandas DataFrame. The input df is sliced according to the default parallelism. The optimization is enabled with the existing conf "spark.sql.execution.arrow.enabled" and is disabled by default. ## How was this patch tested? Added new unit test to create DataFrame with and without the optimization enabled, then compare results. Author: Bryan Cutler <cutlerb@gmail.com> Author: Takuya UESHIN <ueshin@databricks.com> Closes #19459 from BryanCutler/arrow-createDataFrame-from_pandas-SPARK-20791.	2017-11-13 13:16:01 +09:00
Kazuaki Ishizaki	9bf696dbec	[SPARK-21720][SQL] Fix 64KB JVM bytecode limit problem with AND or OR ## What changes were proposed in this pull request? This PR changes `AND` or `OR` code generation to place condition and then expressions' generated code into separate methods if their size could be large. When a method is newly generated, variables for `isNull` and `value` are declared as instance variables to pass these values (e.g. `isNull1409` and `value1409`) to the callers of the generated method. This PR resolves two cases: * large code size of left expression * large code size of right expression ## How was this patch tested? Added a new test case into `CodeGenerationSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #18972 from kiszk/SPARK-21720.	2017-11-12 22:44:47 +01:00
Wenchen Fan	21a7bfd5c3	[SPARK-10365][SQL] Support Parquet logical type TIMESTAMP_MICROS ## What changes were proposed in this pull request? This PR makes Spark able to read Parquet TIMESTAMP_MICROS values, and adds a new config to allow Spark to write timestamp values to Parquet as TIMESTAMP_MICROS type. ## How was this patch tested? new test Author: Wenchen Fan <wenchen@databricks.com> Closes #19702 from cloud-fan/parquet.	2017-11-11 22:40:26 +01:00
gatorsmile	d6ee69e776	[SPARK-22488][SQL] Fix the view resolution issue in the SparkSession internal table() API ## What changes were proposed in this pull request? The current internal `table()` API of `SparkSession` bypasses the Analyzer and directly calls the `sessionState.catalog.lookupRelation` API. This skips the view resolution logic in our Analyzer rule `ResolveRelations`. This internal API is widely used by various DDL commands and by public and internal APIs. Users might get a strange error caused by view resolution when the default database is different. ``` Table or view not found: t1; line 1 pos 14 org.apache.spark.sql.AnalysisException: Table or view not found: t1; line 1 pos 14 at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) ``` This PR fixes it by enforcing the use of `ResolveRelations` to resolve the table. ## How was this patch tested? Added a test case and modified the existing test cases Author: gatorsmile <gatorsmile@gmail.com> Closes #19713 from gatorsmile/viewResolution.	2017-11-11 18:20:11 +01:00
Liang-Chi Hsieh	154351e6db	[SPARK-22462][SQL] Make rdd-based actions in Dataset trackable in SQL UI ## What changes were proposed in this pull request? For the few Dataset actions such as `foreach`, currently no SQL metrics are visible in the SQL tab of SparkUI. It is because it binds wrongly to Dataset's `QueryExecution`. As the actions directly evaluate on the RDD which has individual `QueryExecution`, to show correct SQL metrics on UI, we should bind to RDD's `QueryExecution`. ## How was this patch tested? Manually test. Screenshot is attached in the PR. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #19689 from viirya/SPARK-22462.	2017-11-11 12:34:30 +01:00
Rekha Joshi	808e886b96	[SPARK-21667][STREAMING] ConsoleSink should not fail streaming query with checkpointLocation option ## What changes were proposed in this pull request? Fix to allow recovery on console and avoid the checkpoint exception. ## How was this patch tested? Existing tests and manual tests (replicating the error and seeing no checkpoint error after the fix). Author: Rekha Joshi <rekhajoshm@gmail.com> Author: rjoshi2 <rekhajoshm@gmail.com> Closes #19407 from rekhajoshm/SPARK-21667.	2017-11-10 15:18:11 -08:00
Marco Gaido	5b41cbf13b	[SPARK-22473][TEST] Replace deprecated AsyncAssertions.Waiter and methods of java.sql.Date ## What changes were proposed in this pull request? In the `spark-sql` module tests there are deprecation warnings caused by the usage of deprecated methods of `java.sql.Date` and of the deprecated `AsyncAssertions.Waiter` class. This PR replaces the deprecated methods of `java.sql.Date` with non-deprecated ones (using `Calendar` where needed). It also replaces the deprecated `org.scalatest.concurrent.AsyncAssertions.Waiter` with `org.scalatest.concurrent.Waiters._`. ## How was this patch tested? existing UTs Author: Marco Gaido <mgaido@hortonworks.com> Closes #19696 from mgaido91/SPARK-22473.	2017-11-10 11:24:24 -06:00
Wenchen Fan	0025ddeb1d	[SPARK-22472][SQL] add null check for top-level primitive values ## What changes were proposed in this pull request? One powerful feature of `Dataset` is that we can easily map SQL rows to Scala/Java objects and do runtime null checks automatically. For example, let's say we have a parquet file with schema `<a: int, b: string>`, and we have a `case class Data(a: Int, b: String)`. Users can easily read this parquet file into `Data` objects, and Spark will throw an NPE if column `a` has null values. However, the null checking is left behind for top-level primitive values. For example, let's say we have a parquet file with schema `<a: Int>`, and we read it into Scala `Int`. If column `a` has null values, we will get some weird results. ``` scala> val ds = spark.read.parquet(...).as[Int] scala> ds.show() +----+ \|v \| +----+ \|null\| \|1 \| +----+ scala> ds.collect res0: Array[Long] = Array(0, 1) scala> ds.map(_ * 2).show +-----+ \|value\| +-----+ \|-2 \| \|2 \| +-----+ ``` This is because internally Spark uses some special default values for primitive types, but never expects users to see or operate on these default values directly. This PR adds a null check for top-level primitive values. ## How was this patch tested? new test Author: Wenchen Fan <wenchen@databricks.com> Closes #19707 from cloud-fan/bug.	2017-11-09 21:56:20 -08:00
Nathan Kronenfeld	b57ed2245c	[SPARK-22308][TEST-MAVEN] Support alternative unit testing styles in external applications Continuation of PR#19528 (https://github.com/apache/spark/pull/19529#issuecomment-340252119) The problem with the maven build in the previous PR was the new tests.... the creation of a spark session outside the tests meant there was more than one spark session around at a time. I was using the spark session outside the tests so that the tests could share data; I've changed it so that each test creates the data anew. Author: Nathan Kronenfeld <nicole.oresme@gmail.com> Author: Nathan Kronenfeld <nkronenfeld@uncharted.software> Closes #19705 from nkronenfeld/alternative-style-tests-2.	2017-11-09 19:11:30 -08:00


4212 commits