Commit graph

7666 commits

Author SHA1 Message Date
Bryan Cutler be08b415da [SPARK-27163][PYTHON] Cleanup and consolidate Pandas UDF functionality
## What changes were proposed in this pull request?

This change is a cleanup and consolidation of 3 areas related to Pandas UDFs:

1) `ArrowStreamPandasSerializer` now inherits from `ArrowStreamSerializer` and uses the base class `dump_stream`, `load_stream` to create Arrow reader/writer and send Arrow record batches.  `ArrowStreamPandasSerializer` makes the conversions to/from Pandas and converts to Arrow record batch iterators. This change removed duplicated creation of Arrow readers/writers.

2) `createDataFrame` with Arrow now uses `ArrowStreamPandasSerializer` instead of doing its own conversions from Pandas to Arrow and sending record batches through `ArrowStreamSerializer`.

3) Grouped Map UDFs now reuse existing logic in `ArrowStreamPandasSerializer` to send Pandas DataFrame results as a `StructType` instead of separating each column from the DataFrame. This makes the code a little more consistent with the Python worker, but does require that the returned StructType column is flattened out in `FlatMapGroupsInPandasExec` in Scala.

## How was this patch tested?

Existing tests, and ran the tests with pyarrow 0.12.0.

Closes #24095 from BryanCutler/arrow-refactor-cleanup-UDFs.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-03-21 17:44:51 +09:00
Venkata krishnan Sowrirajan b1857a4d7d [SPARK-26894][SQL] Handle Alias as well in AggregateEstimation to propagate child stats
## What changes were proposed in this pull request?

Currently, aliases are not handled in AggregateEstimation, so stats are not propagated. This causes CBO join reordering to produce suboptimal join plans. ProjectEstimation already takes care of aliases; we need the same logic in AggregateEstimation to properly propagate stats when CBO is enabled.
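
A minimal, self-contained sketch of the idea using hypothetical types (not Spark's actual `AggregateEstimation` code): an aliased output column is resolved back to the underlying child attribute before its statistics are looked up, so the stats survive the Aggregate node.

```scala
case class ColumnStat(distinctCount: BigInt)

sealed trait Expr
case class Attr(name: String) extends Expr
case class Aliased(child: Expr, alias: String) extends Expr

// Look up stats for each output column, resolving aliases back to the child attribute.
def outputStats(
    output: Seq[Expr],
    childStats: Map[String, ColumnStat]): Map[String, ColumnStat] =
  output.flatMap {
    case Attr(n)              => childStats.get(n).map(s => n -> s)
    case Aliased(Attr(n), as) => childStats.get(n).map(s => as -> s) // propagate stats through the alias
    case _                    => None
  }.toMap
```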

## How was this patch tested?

This patch was manually tested using query Q83 of the TPC-DS benchmark (scale factor 1000).

Closes #23803 from venkata91/aggstats.

Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@qubole.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2019-03-21 11:21:56 +09:00
Shixiong Zhu c26379b446 [SPARK-27221][SQL] Improve the assert error message in TreeNode.parseToJson
## What changes were proposed in this pull request?

`TreeNode.parseToJson` may throw an assert error without any error message when a TreeNode is not implemented properly, which makes it hard to find the bad TreeNode implementation.

This PR adds the assert message to improve the error, like what `TreeNode.jsonFields` does.
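
Below is a minimal, hypothetical sketch of the kind of change described (not the actual Spark code): the assert carries a message that names the offending class, so the bad implementation is easy to locate.

```scala
// Hypothetical helper: fail with a message naming the class instead of a bare assert.
def assertConvertible(node: Any, convertible: Boolean): Unit = {
  assert(convertible,
    s"Unable to convert ${node.getClass.getName} to JSON; " +
      "check its parseToJson handling")
}
```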

## How was this patch tested?

Jenkins

Closes #24159 from zsxwing/SPARK-27221.

Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-03-21 11:15:05 +09:00
maryannxue 2e090ba628 [SPARK-27223][SQL] Remove private methods that skip conversion when passing user schemas for constructing a DataFrame
## What changes were proposed in this pull request?

When passing in a user schema to create a DataFrame, there might be mismatched nullability between the user schema and the actual data. All related public interfaces now perform catalyst conversion using the user-provided schema, which catches such mismatches to avoid runtime errors later on. However, there are private methods that allow this conversion to be skipped, so we need to remove them, as they may lead to confusion and potential issues.

## How was this patch tested?

Passed existing tests. No new tests were added since this PR removed the private interfaces that would potentially cause null problems and other interfaces are covered already by existing tests.

Closes #24162 from maryannxue/spark-27223.

Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-03-21 11:13:25 +09:00
wangguangxin.cn 46f9f44918 [SPARK-27202][MINOR][SQL] Update comments to keep according with code
## What changes were proposed in this pull request?

Update comments in `InMemoryFileIndex.listLeafFiles` to keep them consistent with the code.

## How was this patch tested?

existing test cases

Closes #24146 from WangGuangxin/SPARK-27202.

Authored-by: wangguangxin.cn <wangguangxin.cn@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-20 17:54:28 -05:00
Sean Owen c65f9b2bc3 [SPARK-26839][SQL] Work around classloader changes in Java 9 for Hive isolation
Note: this doesn't fully resolve the JIRA, but it makes the changes we can make so far that would be required to solve it.

## What changes were proposed in this pull request?

Java 9+ changed how ClassLoaders work. The two most salient points (a hedged workaround sketch follows the list):
- The boot classloader no longer 'sees' the platform classes. A new 'platform classloader' does and should be the parent of new ClassLoaders
- The system classloader is no longer a URLClassLoader, so we can't get the URLs of JARs in its classpath
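
A hedged sketch of one way to work around the second point (assumed helper name, not Spark's exact code): fall back to parsing `java.class.path` when the loader is no longer a `URLClassLoader`.

```scala
import java.io.File
import java.net.{URL, URLClassLoader}

// On Java 8 the system classloader is a URLClassLoader and exposes its JAR URLs directly;
// on Java 9+ it is not, so we reconstruct the classpath from the java.class.path property.
def classPathUrls(loader: ClassLoader): Seq[URL] = loader match {
  case urlLoader: URLClassLoader => urlLoader.getURLs.toSeq
  case _ =>
    sys.props.getOrElse("java.class.path", "")
      .split(File.pathSeparator)
      .filter(_.nonEmpty)
      .map(p => new File(p).toURI.toURL)
      .toSeq
}
```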

## How was this patch tested?

We'll see whether Java 8 tests still pass here. Java 11 tests do not fully pass at this point; more notes below. This does make progress on the failures though.

(NB: to test with Java 11, you need to build with Java 8 first, setting JAVA_HOME and java's executable correctly, then switch both to Java 11 for testing.)

Closes #24057 from srowen/SPARK-26839.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-20 09:12:52 -05:00
Maxim Gekk 1882912cca [SPARK-27199][SQL] Replace TimeZone by ZoneId in TimestampFormatter API
## What changes were proposed in this pull request?

In the PR, I propose to use `ZoneId` instead of `TimeZone` in:
- the `apply` and `getFractionFormatter` methods of the `TimestampFormatter` object,
- and in implementations of the `TimestampFormatter` trait like `FractionTimestampFormatter`.

The reason for the changes is to avoid unnecessary conversion from `TimeZone` to `ZoneId`, because `ZoneId` is used internally in `TimestampFormatter` implementations and the conversion is performed via `String`, which is not free. Also, taking into account that `TimeZone` instances are converted from `String` in some cases, the worst case looks like `String` -> `TimeZone` -> `String` -> `ZoneId`. The PR eliminates the unneeded conversions.
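
The snippet below is a small illustration of the conversion being avoided (hypothetical variables, not the formatter code): going from `TimeZone` to `ZoneId` passes through the `String` id, so accepting `ZoneId` directly in the API skips that hop.

```scala
import java.time.ZoneId
import java.util.TimeZone

val tz: TimeZone = TimeZone.getTimeZone("Europe/Moscow")
val viaString: ZoneId = ZoneId.of(tz.getID) // the indirect path the PR removes
val direct: ZoneId = tz.toZoneId            // or better: take ZoneId as the parameter type
```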

## How was this patch tested?

It was tested by `DateExpressionsSuite`, `DateTimeUtilsSuite` and `TimestampFormatterSuite`.

Closes #24141 from MaxGekk/zone-id.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-03-20 21:28:11 +09:00
Huon Wilson b67d369572 [SPARK-27099][SQL] Add 'xxhash64' for hashing arbitrary columns to Long
## What changes were proposed in this pull request?

This introduces a new SQL function 'xxhash64' for getting a 64-bit hash of an arbitrary number of columns.

This is designed to exactly mimic the 32-bit `hash`, which uses MurmurHash3. The name is intended to be more future-proof than `hash`, by indicating the exact algorithm used, similar to md5 and the sha hashes.
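
A usage sketch, assuming a spark-shell session built with this patch: `xxhash64` accepts an arbitrary number of columns and returns a 64-bit hash, mirroring the existing 32-bit `hash`.

```scala
// Compare the 32-bit and 64-bit hash results for the same inputs.
spark.sql("SELECT hash('Spark', 42) AS h32, xxhash64('Spark', 42) AS h64").show(false)
```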

## How was this patch tested?

The tests for the existing `hash` function were duplicated to run with `xxhash64`.

Closes #24019 from huonw/hash64.

Authored-by: Huon Wilson <Huon.Wilson@data61.csiro.au>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-03-20 16:34:34 +08:00
Darcy Shen 9a43852f17 [SPARK-27160][SQL] Fix DecimalType when building orc filters
## What changes were proposed in this pull request?
A DecimalType Literal should not be cast to Long.

E.g., for `df.filter("x < 3.14")`, assuming df (with x of DecimalType) reads from an ORC table and uses the native ORC reader with predicate push down enabled, we will push down the `x < 3.14` predicate to the ORC reader via a SearchArgument.

OrcFilters will construct the SearchArgument, but does not handle the DecimalType correctly.

The previous implementation would construct `x < 3` from `x < 3.14`.
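
The following self-contained snippet (not the OrcFilters code itself) illustrates the truncation: building the pushed-down literal by casting the decimal bound to Long silently changes the predicate value.

```scala
val bound = BigDecimal("3.14")
val truncated = bound.toLong      // 3 -> the reader would be asked for "x < 3"
val preserved = bound.bigDecimal  // keep the java.math.BigDecimal so "x < 3.14" survives push down
```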

## How was this patch tested?
```
$ sbt
> sql/testOnly *OrcFilterSuite
> sql/testOnly *OrcQuerySuite -- -z "27160"
```

Closes #24092 from sadhen/spark27160.

Authored-by: Darcy Shen <sadhen@zoho.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-19 20:28:46 -07:00
Dongjoon Hyun 257391497b [SPARK-26975][SQL] Support nested-column pruning over limit/sample/repartition
## What changes were proposed in this pull request?

As [SPARK-26958](https://github.com/apache/spark/pull/23862/files) benchmark shows, nested-column pruning has limitations. This PR aims to remove the limitations on `limit/repartition/sample`. Here, repartition means `Repartition`, not `RepartitionByExpression`.

**PREPARATION**
```scala
scala> spark.range(100).map(x => (x, (x, s"$x" * 100))).toDF("col1", "col2").write.mode("overwrite").save("/tmp/p")
scala> sql("set spark.sql.optimizer.nestedSchemaPruning.enabled=true")
scala> spark.read.parquet("/tmp/p").createOrReplaceTempView("t")
```

**BEFORE**
```scala
scala> sql("SELECT col2._1 FROM (SELECT col2 FROM t LIMIT 1000000)").explain
== Physical Plan ==
CollectLimit 1000000
+- *(1) Project [col2#22._1 AS _1#28L]
   +- *(1) FileScan parquet [col2#22] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/p], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col2:struct<_1:bigint>>

scala> sql("SELECT col2._1 FROM (SELECT /*+ REPARTITION(1) */ col2 FROM t)").explain
== Physical Plan ==
*(2) Project [col2#22._1 AS _1#33L]
+- Exchange RoundRobinPartitioning(1)
   +- *(1) Project [col2#22]
      +- *(1) FileScan parquet [col2#22] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/p], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col2:struct<_1:bigint,_2:string>>
```

**AFTER**
```scala
scala> sql("SELECT col2._1 FROM (SELECT /*+ REPARTITION(1) */ col2 FROM t)").explain
== Physical Plan ==
Exchange RoundRobinPartitioning(1)
+- *(1) Project [col2#5._1 AS _1#11L]
   +- *(1) FileScan parquet [col2#5] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/p], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col2:struct<_1:bigint>>
```

This supersedes https://github.com/apache/spark/pull/23542 and https://github.com/apache/spark/pull/23873 .

## How was this patch tested?

Pass the Jenkins with a newly added test suite.

Closes #23964 from dongjoon-hyun/SPARK-26975-ALIAS.

Lead-authored-by: Dongjoon Hyun <dhyun@apple.com>
Co-authored-by: DB Tsai <d_tsai@apple.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-19 20:24:22 -07:00
Dongjoon Hyun 4d5247778a [SPARK-27197][SQL][TEST] Add ReadNestedSchemaTest for file-based data sources
## What changes were proposed in this pull request?

The reader schema is said to be evolved (or projected) when it changes after the data has been written. Apache Spark file-based data sources have test coverage for that; e.g. [ReadSchemaSuite.scala](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/ReadSchemaSuite.scala). This PR aims to add test coverage for nested columns by adding and hiding nested columns.

## How was this patch tested?

Pass the Jenkins with newly added tests.

Closes #24139 from dongjoon-hyun/SPARK-27197.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2019-03-20 00:22:05 +00:00
s71955 e402de5fd0 [SPARK-26176][SQL] Verify column names for CTAS with STORED AS
## What changes were proposed in this pull request?
Currently, users hit job abortions while creating a table using the Hive serde "STORED AS" with invalid column names. We had better prevent this by raising an **AnalysisException** with a guide to use aliases instead, like Parquet data source tables, thus making it consistent with the error message shown when creating Parquet/ORC native tables.

**BEFORE**
```scala
scala> sql("set spark.sql.hive.convertMetastoreParquet=false")
scala> sql("CREATE TABLE a STORED AS PARQUET AS SELECT 1 AS `COUNT(ID)`")
Caused by: java.lang.IllegalArgumentException: No enum constant parquet.schema.OriginalType.col1
```

**AFTER**
```scala
scala> sql("CREATE TABLE a STORED AS PARQUET AS SELECT 1 AS `COUNT(ID)`")
org.apache.spark.sql.AnalysisException: Attribute name "count(ID)" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
```

## How was this patch tested?
Pass the Jenkins with the newly added test case.

Closes #24075 from sujith71955/master_serde.

Authored-by: s71955 <sujithchacko.2010@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-03-19 20:29:47 +08:00
Takeshi Yamamuro 901c7408a4 [SPARK-27161][SQL][FOLLOWUP] Drops non-keywords from docs/sql-keywords.md
## What changes were proposed in this pull request?
This PR is a follow-up of #24093 and includes the fixes below:
 - Lists only the keywords of Spark (that is, drops the non-keywords there); I listed all the keywords of ANSI SQL-2011 in the previous commit (SPARK-26215).
 - Sorts the keywords in `SqlBase.g4` in alphabetical order.

## How was this patch tested?
Pass Jenkins.

Closes #24125 from maropu/SPARK-27161-FOLLOWUP.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-03-19 20:18:40 +08:00
mwlon d5c08fcaab [SPARK-26555][SQL] make ScalaReflection subtype checking thread safe
## What changes were proposed in this pull request?

Make ScalaReflection subtype checking thread safe by adding a lock. There is a thread safety bug in the <:< operator in all versions of Scala (https://github.com/scala/bug/issues/10766).
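
A minimal sketch of the approach, with a hypothetical helper and lock object rather than Spark's actual code: every subtype check is funneled through a single lock, since scala-reflect's `<:<` is not thread safe.

```scala
import scala.reflect.runtime.universe.Type

// Single global lock guarding all subtype checks.
object SubtypeLock

def isSubtype(tpe: Type, parent: Type): Boolean = SubtypeLock.synchronized {
  tpe <:< parent
}
```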

## How was this patch tested?

Existing tests and a new one for the new subtype checking function.

Closes #24085 from mwlon/SPARK-26555.

Authored-by: mwlon <mloncaric@hmc.edu>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-03-19 18:22:01 +08:00
wuyi a8af23d7ab [SPARK-27193][SQL] CodeFormatter should format multiple comment lines correctly
## What changes were proposed in this pull request?

When `spark.sql.codegen.comments` is enabled, there can be multiple comment lines. However, CodeFormatter currently cannot handle multiple comment lines:

```
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIteratorForCodegenStage1(references);
/* 003 */ }
/* 004 */
/* 005 */ /**
 * Codegend pipeline for stage (id=1)
 * *(1) Project [(id#0L + 1) AS (id + 1)#3L]
 * +- *(1) Filter (id#0L = 1)
 *    +- *(1) Range (0, 10, step=1, splits=4)
 */
/* 006 */ // codegenStageId=1
/* 007 */ final class GeneratedIteratorForCodegenStage1 extends org.apache.spark.sql.execution.BufferedRowIterator {
```

After applying this pr:

```
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIteratorForCodegenStage1(references);
/* 003 */ }
/* 004 */
/* 005 */ /**
/* 006 */  * Codegend pipeline for stage (id=1)
/* 007 */  * *(1) Project [(id#0L + 1) AS (id + 1)#4L]
/* 008 */  * +- *(1) Filter (id#0L = 1)
/* 009 */  *    +- *(1) Range (0, 10, step=1, splits=2)
/* 010 */  */
/* 011 */ // codegenStageId=1
/* 012 */ final class GeneratedIteratorForCodegenStage1 extends org.apache.spark.sql.execution.BufferedRowIterator {
```

## How was this patch tested?

Tested Manually.

Closes #24133 from Ngone51/fix-codeformatter-for-multi-comment-lines.

Authored-by: wuyi <ngone_5451@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-03-19 14:47:51 +08:00
Gengliang Wang 28d35c8578 [SPARK-27162][SQL] Add new method asCaseSensitiveMap in CaseInsensitiveStringMap
## What changes were proposed in this pull request?

Currently, DataFrameReader/DataFrameWriter supports setting Hadoop configurations via the method `.option()`.
E.g., the following test case should pass in both ORC V1 and V2:
```
  import org.apache.hadoop.fs.{Path, PathFilter}

  class TestFileFilter extends PathFilter {
    override def accept(path: Path): Boolean = path.getParent.getName != "p=2"
  }

  withTempPath { dir =>
    val path = dir.getCanonicalPath

    val df = spark.range(2)
    df.write.orc(path + "/p=1")
    df.write.orc(path + "/p=2")
    val extraOptions = Map(
      "mapred.input.pathFilter.class" -> classOf[TestFileFilter].getName,
      "mapreduce.input.pathFilter.class" -> classOf[TestFileFilter].getName
    )
    assert(spark.read.options(extraOptions).orc(path).count() === 2)
  }
```
While Hadoop configurations are case sensitive, the current data source V2 APIs use `CaseInsensitiveStringMap` in the top-level entry `TableProvider`.
To create Hadoop configurations correctly, I suggest the following (a rough sketch of the idea follows):
1. adding a new method `asCaseSensitiveMap` in `CaseInsensitiveStringMap`;
2. making `CaseInsensitiveStringMap` read-only to avoid ambiguous conversion in `asCaseSensitiveMap`.
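
A rough, simplified sketch of the idea (hypothetical class, not the actual `CaseInsensitiveStringMap`): keep the original-cased entries alongside the lower-cased lookup view, and expose them unmodified so case-sensitive Hadoop configurations can be built from the options.

```scala
class CaseInsensitiveOptions(original: Map[String, String]) {
  // Lower-cased view used for case-insensitive option lookup.
  private val lookup: Map[String, String] =
    original.map { case (k, v) => k.toLowerCase -> v }

  def get(key: String): Option[String] = lookup.get(key.toLowerCase)

  // Hadoop configuration keys are case sensitive, so hand back the untouched casing.
  def asCaseSensitiveMap: Map[String, String] = original
}
```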

## How was this patch tested?

Unit test

Closes #24094 from gengliangwang/originalMap.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-03-19 13:35:47 +08:00
Dongjoon Hyun 26e9849cb4 [SPARK-27195][SQL][TEST] Add AvroReadSchemaSuite
## What changes were proposed in this pull request?

The reader schema is said to be evolved (or projected) when it changes after the data has been written. Apache Spark file-based data sources have test coverage for that in [ReadSchemaSuite.scala](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/ReadSchemaSuite.scala). This PR aims to add `AvroReadSchemaSuite` to ensure minimal consistency among file-based data sources and prevent a future regression in the Avro data source.

## How was this patch tested?

Pass the Jenkins with the newly added test suite.

Closes #24135 from dongjoon-hyun/SPARK-27195.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-18 20:10:30 -07:00
Ryan Blue e348f14259 [SPARK-26811][SQL] Add capabilities to v2.Table
## What changes were proposed in this pull request?

This adds a new method, `capabilities`, to `v2.Table` that returns a set of `TableCapability`. Capabilities are used to fail queries during analysis checks (`V2WriteSupportCheck`) when the table does not support an operation, like truncation.
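
A hedged sketch of how such a capability check could look, using assumed names rather than the exact v2 API: the analysis check fails fast when a table does not advertise the capability that an operation needs.

```scala
sealed trait TableCapability
case object Truncate extends TableCapability
case object BatchWrite extends TableCapability

trait Table {
  def capabilities: Set[TableCapability]
}

// Analysis-time check: reject the plan if the table cannot be truncated.
def checkTruncate(table: Table): Unit =
  require(table.capabilities.contains(Truncate),
    "Table does not support truncation in overwrite mode")
```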

## How was this patch tested?

Existing tests for regressions, added new analysis suite, `V2WriteSupportCheckSuite`, for new capability checks.

Closes #24012 from rdblue/SPARK-26811-add-capabilities.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-03-18 18:25:11 +08:00
Wenchen Fan dbcb4792f2 [SPARK-27161][SQL] improve the document of SQL keywords
## What changes were proposed in this pull request?

Make it clearer how Spark categorizes keywords with regard to the config `spark.sql.parser.ansi.enabled`.

## How was this patch tested?

existing tests

Closes #24093 from cloud-fan/parser.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2019-03-18 15:19:52 +09:00
Jungtaek Lim (HeartSaVioR) 4adbcdc424 [SPARK-22000][SQL][FOLLOW-UP] Fix bad test to ensure it can test properly
## What changes were proposed in this pull request?

There was a mistake in the test code: it had a wrong assertion. The patch proposes fixing it, as well as other issues, so that the test actually passes.

## How was this patch tested?

Fixed unit test.

Closes #24112 from HeartSaVioR/SPARK-22000-hotfix.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2019-03-17 08:25:40 +09:00
Dilip Biswal aea9a574c4 [SPARK-27134][SQL] array_distinct function does not work correctly with columns containing array of array
## What changes were proposed in this pull request?
Correct the logic that computes the distinct elements.

Below is a small repro snippet.

```
scala> val df = Seq(Seq(Seq(1, 2), Seq(1, 2), Seq(1, 2), Seq(3, 4), Seq(4, 5))).toDF("array_col")
df: org.apache.spark.sql.DataFrame = [array_col: array<array<int>>]

scala> val distinctDF = df.select(array_distinct(col("array_col")))
distinctDF: org.apache.spark.sql.DataFrame = [array_distinct(array_col): array<array<int>>]

scala> df.show(false)
+----------------------------------------+
|array_col                               |
+----------------------------------------+
|[[1, 2], [1, 2], [1, 2], [3, 4], [4, 5]]|
+----------------------------------------+
```
Error
```
scala> distinctDF.show(false)
+-------------------------+
|array_distinct(array_col)|
+-------------------------+
|[[1, 2], [1, 2], [1, 2]] |
+-------------------------+
```
Expected result
```
scala> distinctDF.show(false)
+-------------------------+
|array_distinct(array_col)|
+-------------------------+
|[[1, 2], [3, 4], [4, 5]] |
+-------------------------+
```
## How was this patch tested?
Added an additional test.

Closes #24073 from dilipbiswal/SPARK-27134.

Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-16 14:30:42 -05:00
Dilip Biswal 7a136f8670 [SPARK-27096][SQL][FOLLOWUP] Do the correct validation of join types in R side and fix join docs for scala, python and r
## What changes were proposed in this pull request?
This is a minor follow-up PR for SPARK-27096. The original PR reconciled the join types supported between the Dataset and SQL interfaces. In the case of R, we do the join type validation on the R side. In this PR we do the correct validation and add tests in R for all the join types along with the error condition. Along with this, I made the necessary doc corrections.

## How was this patch tested?
Add R tests.

Closes #24087 from dilipbiswal/joinfix_followup.

Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-03-16 13:04:54 +09:00
Zhu, Lipeng 8ee09f26d5 [SPARK-27159][SQL] update mssql server dialect to support binary type
## What changes were proposed in this pull request?

Change the binary type mapping from the default BLOB to VARBINARY(MAX) for MS SQL Server.
https://docs.microsoft.com/en-us/sql/t-sql/data-types/binary-and-varbinary-transact-sql?view=sql-server-2017
![image](https://user-images.githubusercontent.com/698621/54351715-0e8c8780-468b-11e9-8931-7ecb85c5ad6b.png)
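
A hedged sketch of the dialect change (assumed shape, not the exact `MsSqlServerDialect` code): Spark's `BinaryType` is mapped to `VARBINARY(MAX)` instead of the default BLOB when writing to SQL Server.

```scala
import java.sql.Types
import org.apache.spark.sql.jdbc.JdbcType
import org.apache.spark.sql.types.{BinaryType, DataType}

// Dialect-level override for the JDBC type used when writing binary columns.
def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
  case BinaryType => Some(JdbcType("VARBINARY(MAX)", Types.VARBINARY))
  case _          => None
}
```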

## How was this patch tested?

Unit test.

Closes #24091 from lipzhu/SPARK-27159.

Authored-by: Zhu, Lipeng <lipzhu@ebay.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-15 20:21:59 -05:00
Gengliang Wang 2a37d6ed93 [SPARK-27132][SQL] Improve file source V2 framework
## What changes were proposed in this pull request?

During the migration of CSV V2 (https://github.com/apache/spark/pull/24005), I found that we can improve the file source V2 framework by:
1. checking duplicated column names in both read and write;
2. removing `SupportsPushDownFilters` from FileScanBuilder, since not all file sources support filter push down;
3. adding a new member `options` to FileScan, since the method `isSplitable` might require data source options;
4. making `FileTable.schema` a lazy value instead of a method.

## How was this patch tested?

Unit test

Closes #24066 from gengliangwang/reviseFileSourceV2.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-03-15 11:58:03 +08:00
Dongjoon Hyun 74d2f04183 [SPARK-27166][SQL] Improve printSchema to print up to the given level
## What changes were proposed in this pull request?

This PR aims to improve `printSchema` to be able to print up to the given level of the schema.

```scala
scala> val df = Seq((1,(2,(3,4)))).toDF
df: org.apache.spark.sql.DataFrame = [_1: int, _2: struct<_1: int, _2: struct<_1: int, _2: int>>]

scala> df.printSchema
root
|-- _1: integer (nullable = false)
|-- _2: struct (nullable = true)
| |-- _1: integer (nullable = false)
| |-- _2: struct (nullable = true)
| | |-- _1: integer (nullable = false)
| | |-- _2: integer (nullable = false)

scala> df.printSchema(1)
root
|-- _1: integer (nullable = false)
|-- _2: struct (nullable = true)

scala> df.printSchema(2)
root
|-- _1: integer (nullable = false)
|-- _2: struct (nullable = true)
| |-- _1: integer (nullable = false)
| |-- _2: struct (nullable = true)

scala> df.printSchema(3)
root
|-- _1: integer (nullable = false)
|-- _2: struct (nullable = true)
| |-- _1: integer (nullable = false)
| |-- _2: struct (nullable = true)
| | |-- _1: integer (nullable = false)
| | |-- _2: integer (nullable = false)
```

## How was this patch tested?

Pass the Jenkins with the newly added test case.

Closes #24098 from dongjoon-hyun/SPARK-27166.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-14 20:27:55 -07:00
Gengliang Wang 6d22ee3969 [SPARK-27136][SQL] Remove data source option check_files_exist
## What changes were proposed in this pull request?

The data source option check_files_exist was introduced in #23383 when the file source V2 framework was implemented. In that PR, FileIndex was created as a member of FileTable, so that we could implement partition pruning like 0f9fcab in the future. At that time, `FileIndex`es would always be created for file writes, so we needed the option to decide whether to check file existence.

After https://github.com/apache/spark/pull/23774, the option is not needed anymore, since DataFrame writes won't create an unnecessary FileIndex. This PR removes the option.

## How was this patch tested?

Unit test.

Closes #24069 from gengliangwang/removeOptionCheckFilesExist.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-03-15 10:19:26 +08:00
Dave DeCaprio 8819eaba4d [SPARK-26917][SQL] Further reduce locks in CacheManager
## What changes were proposed in this pull request?

Further load increases in our production environment have shown that even the read locks can cause some contention, since they contain a mechanism that turns a read lock into an exclusive lock if a writer has been starved out. This PR reduces the potential for lock contention even further than https://github.com/apache/spark/pull/23833. Additionally, it uses more idiomatic Scala than the previous implementation.
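
A hedged sketch of the general locking strategy (assumed shape, not the actual CacheManager code): keep the cached entries in an immutable list so lookups need no lock at all, and synchronize only the comparatively rare writes that swap in a new list.

```scala
class SimpleCache[K, V] {
  @volatile private var entries: List[(K, V)] = Nil

  // Lock-free read: each lookup works on an immutable snapshot of the list.
  def lookup(key: K): Option[V] =
    entries.find(_._1 == key).map(_._2)

  // Writers serialize on the instance lock and publish a new immutable list.
  def add(key: K, value: V): Unit = synchronized {
    entries = (key, value) :: entries
  }
}
```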

cloud-fan & gatorsmile This is a relatively minor improvement to the previous CacheManager changes.  At this point, I think we finally are doing the minimum possible amount of locking.

## How was this patch tested?

Has been tested on a live system where the blocking was causing major issues and it is working well.
CacheManager has no explicit unit test but is used in many places internally as part of the SharedState.

Closes #24028 from DaveDeCaprio/read-locks-master.

Lead-authored-by: Dave DeCaprio <daved@alum.mit.edu>
Co-authored-by: David DeCaprio <daved@alum.mit.edu>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-03-15 10:13:34 +08:00
Shahid 8b5224097b [SPARK-27145][MINOR] Close store in the SQLAppStatusListenerSuite after test
## What changes were proposed in this pull request?
We create many stores in the SQLAppStatusListenerSuite, but we need to close the stores after the tests.

## How was this patch tested?
Existing tests

Closes #24079 from shahidki31/SPARK-27145.

Authored-by: Shahid <shahidki31@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-03-14 13:08:41 -07:00
Yuming Wang da7db9abf6 [SPARK-23749][SQL] Replace built-in Hive API (isSub/toKryo) and remove OrcProto.Type usage
## What changes were proposed in this pull request?

In order to make the built-in Hive upgrade changes smaller,
this PR works around the 3 simplest API changes first.

## How was this patch tested?

manual tests

Closes #24018 from wangyum/SPARK-23749.

Lead-authored-by: Yuming Wang <yumwang@ebay.com>
Co-authored-by: Yuming Wang <wgyumg@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-14 11:41:40 -07:00
Takeshi Yamamuro 66c5cd2d9c [SPARK-27151][SQL] ClearCacheCommand extends IgnoreCahedData to avoid plan node copys
## What changes were proposed in this pull request?
In SPARK-27011, we introduced `IgnoreCachedData` to avoid plan node copies in `CacheManager`.
Since `ClearCacheCommand` has no argument, it can also extend `IgnoreCachedData`.

## How was this patch tested?
Pass Jenkins.

Closes #24081 from maropu/SPARK-27011-2.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-14 11:36:16 -07:00
Takeshi Yamamuro bacffb8810 [SPARK-23264][SQL] Make INTERVAL keyword optional in INTERVAL clauses when ANSI mode enabled
## What changes were proposed in this pull request?
This PR updates parsing rules in `SqlBase.g4` to support the SQL query below when ANSI mode is enabled:
```
SELECT CAST('2017-08-04' AS DATE) + 1 days;
```
The current master cannot parse it, though other DBMS-like systems support the syntax (e.g., Hive and MySQL). Also, the syntax is frequently used in the official TPC-DS queries.

This PR adds new tokens as follows:
```
YEAR | YEARS | MONTH | MONTHS | WEEK | WEEKS | DAY | DAYS | HOUR | HOURS | MINUTE
MINUTES | SECOND | SECONDS | MILLISECOND | MILLISECONDS | MICROSECOND | MICROSECONDS
```
Then, it registers the keywords below as ANSI reserved keywords (this follows SQL-2011):
```
 DAY | HOUR | MINUTE | MONTH | SECOND | YEAR
```

## How was this patch tested?
Added tests in `SQLQuerySuite`, `ExpressionParserSuite`, and `TableIdentifierParserSuite`.

Closes #20433 from maropu/SPARK-23264.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2019-03-14 10:45:29 +09:00
Dongjoon Hyun 250946ff93 [SPARK-27123][SQL][FOLLOWUP] Use isRenaming check for limit too.
## What changes were proposed in this pull request?

This is a follow-up for https://github.com/apache/spark/pull/24049 to reduce the scope of the pattern based on the review comments.

## How was this patch tested?

Pass the existing test.

Closes #24082 from dongjoon-hyun/SPARK-27123-2.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-13 15:01:01 -07:00
Jungtaek Lim (HeartSaVioR) 733f2c0b98 [MINOR][SQL] Deduplicate huge if statements in get between specialized getters
## What changes were proposed in this pull request?

This patch deduplicates the huge if statements for getting values among the specialized getters.

## How was this patch tested?

Existing UT.

Closes #24016 from HeartSaVioR/MINOR-deduplicate-get-from-specialized-getters.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-13 15:52:21 -05:00
Dongjoon Hyun 3221bf4cd5 [SPARK-27034][SPARK-27123][SQL][FOLLOWUP] Update Nested Schema Pruning BM result with EC2
## What changes were proposed in this pull request?

This is a follow up PR for #23943 in order to update the benchmark result with EC2 `r3.xlarge` instance.

## How was this patch tested?

N/A. (Manually compare the diff)

Closes #24078 from dongjoon-hyun/SPARK-27034.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2019-03-13 20:27:10 +00:00
Wenchen Fan 2a80a4cd39 [SPARK-27106][SQL] merge CaseInsensitiveStringMap and DataSourceOptions
## What changes were proposed in this pull request?

It's a little awkward to have 2 different classes (`CaseInsensitiveStringMap` and `DataSourceOptions`) to represent the options in the data source and catalog APIs.

This PR merges these 2 classes, keeping the name `CaseInsensitiveStringMap`, which is more precise.

## How was this patch tested?

existing tests

Closes #24025 from cloud-fan/option.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-03-14 01:23:27 +08:00
Dave DeCaprio 812ad55461 [SPARK-26103][SQL] Limit the length of debug strings for query plans
## What changes were proposed in this pull request?

The PR puts a limit on the size of a debug string generated for a tree node. This helps to fix out-of-memory errors when large plans have huge debug strings. In addition to SPARK-26103, this should also address SPARK-23904 and SPARK-25380. An alternative solution was proposed in #23076, but that solution doesn't address all the cases that can cause a large query string. This limit only applies to calls to treeString that don't pass a Writer, which makes it play nicely with #22429, #23018 and #23039. Full plans can be written to files, but truncated plans will be used when strings are held in memory, such as for the UI. A rough sketch of the truncation idea follows the list below.

- A new configuration parameter called spark.sql.debug.maxPlanLength was added to control the length of the plans.
- When plans are truncated, "..." is printed to indicate that it isn't a full plan
- A warning is printed out the first time a truncated plan is displayed. The warning explains what happened and how to adjust the limit.
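
A minimal sketch of the truncation idea, using a hypothetical class rather than the PR's SizeLimitedWriter: stop appending once the configured maximum plan length is reached and mark the cut with "...".

```scala
class LengthLimitedBuilder(maxLength: Int) {
  private val sb = new StringBuilder
  private var truncated = false

  // Append until the limit is hit, then cut the string and append the truncation marker once.
  def append(s: String): Unit = if (!truncated) {
    if (sb.length + s.length <= maxLength) {
      sb.append(s)
    } else {
      sb.append(s.take(maxLength - sb.length)).append("...")
      truncated = true
    }
  }

  override def toString: String = sb.toString
}
```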

## How was this patch tested?

Unit tests were created for the new SizeLimitedWriter.  Also a unit test for TreeNode was created that checks that a long plan is correctly truncated.

Closes #23169 from DaveDeCaprio/text-plan-size.

Lead-authored-by: Dave DeCaprio <daved@alum.mit.edu>
Co-authored-by: David DeCaprio <daved@alum.mit.edu>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-03-13 09:58:43 -07:00
Wenchen Fan d3813d8b21 [SPARK-27064][SS] create StreamingWrite at the beginning of streaming execution
## What changes were proposed in this pull request?

According to the [design](https://docs.google.com/document/d/1vI26UEuDpVuOjWw4WPoH2T6y8WAekwtI7qoowhOFnI4/edit?usp=sharing), the life cycle of `StreamingWrite` should be the same as the read side `MicroBatch/ContinuousStream`, i.e. each run of the stream query, instead of each epoch.

This PR fixes it.

## How was this patch tested?

existing tests

Closes #23981 from cloud-fan/dsv2.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-03-13 19:47:54 +08:00
Liang-Chi Hsieh f55c760df6 [SPARK-27034][SQL][FOLLOWUP] Rename ParquetSchemaPruning to SchemaPruning
## What changes were proposed in this pull request?

This is a followup to #23943. This proposes to rename ParquetSchemaPruning to SchemaPruning as ParquetSchemaPruning supports both Parquet and ORC v1 now.

## How was this patch tested?

Existing tests.

Closes #24077 from viirya/nested-schema-pruning-orc-followup.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-03-13 20:12:01 +09:00
Jungtaek Lim (HeartSaVioR) 1b06cda532 [MINOR][SQL] Refactor RowEncoder to use existing (De)serializerBuildHelper methods
## What changes were proposed in this pull request?

This patch proposes to reuse existing methods in (De)serializerBuildHelper in RowEncoder to achieve deduplication as well as consistent creation of serialization/deserialization of same type.

## How was this patch tested?

Existing UT.

Closes #24014 from HeartSaVioR/SPARK-27092.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-03-13 10:54:47 +08:00
Takeshi Yamamuro 1e9469bb7a [SPARK-26976][SQL] Forbid reserved keywords as identifiers when ANSI mode is on
## What changes were proposed in this pull request?
This PR adds code to forbid reserved keywords as identifiers when ANSI mode is on.
This is a follow-up of SPARK-26215 (#23259).

## How was this patch tested?
Added tests in `TableIdentifierParserSuite`.

Closes #23880 from maropu/SPARK-26976.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2019-03-13 11:20:27 +09:00
Ajith e60d8fce0b [SPARK-27045][SQL] SQL tab in UI shows actual SQL instead of callsite in case of SparkSQLDriver
## What changes were proposed in this pull request?

When we run SQL in Spark via SparkSQLDriver (thrift server, spark-sql), the SQL string is set via `setJobDescription`. The Spark UI SQL tab should show the SQL instead of the stacktrace when `setJobDescription` is set, which is more useful to the end user. Instead, it currently shows the callsite short form in the description column, which is less useful.

![image](https://user-images.githubusercontent.com/22072336/53734682-aaa7d900-3eaa-11e9-957b-0e5006db417e.png)

## How was this patch tested?

Manually:
![image](https://user-images.githubusercontent.com/22072336/53734657-9f54ad80-3eaa-11e9-8dc5-2b38f6970f4e.png)

Closes #23958 from ajithme/sqlui.

Authored-by: Ajith <ajith2489@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-12 16:14:29 -07:00
Liang-Chi Hsieh b0c2b3bfd9 [SPARK-27034][SQL] Nested schema pruning for ORC
## What changes were proposed in this pull request?

We only supported nested schema pruning for Parquet previously. This proposes to support nested schema pruning for ORC too.

Note: This only covers ORC v1. For ORC v2, the necessary change is at the schema pruning rule. We should deal with ORC v2 as a TODO item, in order to reduce review burden.

## How was this patch tested?

Added tests.

Closes #23943 from viirya/nested-schema-pruning-orc.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-12 15:39:16 -07:00
Dongjoon Hyun 78314af580 [SPARK-27123][SQL] Improve CollapseProject to handle projects cross limit/repartition/sample
## What changes were proposed in this pull request?

`CollapseProject` optimizer rule simplifies some plans by merging the adjacent projects and performing alias substitutions.
```scala
scala> sql("SELECT b c FROM (SELECT a b FROM t)").explain
== Physical Plan ==
*(1) Project [a#5 AS c#1]
+- Scan hive default.t [a#5], HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#5]
```

We can do that for more complex cases like the following. This PR aims to handle adjacent projects across limit/repartition/sample. Here, repartition means `Repartition`, not `RepartitionByExpression`.

**BEFORE**
```scala
scala> sql("SELECT b c FROM (SELECT /*+ REPARTITION(1) */ a b FROM t)").explain
== Physical Plan ==
*(2) Project [b#0 AS c#1]
+- Exchange RoundRobinPartitioning(1)
   +- *(1) Project [a#5 AS b#0]
      +- Scan hive default.t [a#5], HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#5]
```

**AFTER**
```scala
scala> sql("SELECT b c FROM (SELECT /*+ REPARTITION(1) */ a b FROM t)").explain
== Physical Plan ==
Exchange RoundRobinPartitioning(1)
+- *(1) Project [a#11 AS c#7]
   +- Scan hive default.t [a#11], HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#11]
```

## How was this patch tested?

Pass the Jenkins with the newly added and updated test cases.

Closes #24049 from dongjoon-hyun/SPARK-27123.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2019-03-12 21:45:40 +00:00
zuotingbing 3f9247de1e [SPARK-27010][SQL] Find out the actual port number when hive.server2.thrift.port=0
## What changes were proposed in this pull request?
Currently, if we set hive.server2.thrift.port=0, it is hard to find out the actual port number that we should use when connecting with beeline.

before:
![2019-02-28_170942](https://user-images.githubusercontent.com/24823338/53557240-779ad800-3b80-11e9-9567-175f28aa61da.png)

after:
![2019-02-28_170904](https://user-images.githubusercontent.com/24823338/53557255-7f5a7c80-3b80-11e9-8ba6-9764c03e5407.png)

use beeline to connect success:
![2019-02-28_170844](https://user-images.githubusercontent.com/24823338/53557267-85e8f400-3b80-11e9-90a5-f7f53a51cc32.png)
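
The behaviour boils down to the standard ephemeral-port mechanism; the snippet below is a plain illustration with the JDK socket API, not the HiveThriftServer2 code: binding to port 0 lets the OS pick a free port, so only the actually bound port is useful to log.

```scala
import java.net.ServerSocket

val socket = new ServerSocket(0)                                   // 0 -> ephemeral port chosen by the OS
println(s"Thrift server bound to port ${socket.getLocalPort}")     // log this, not the configured 0
socket.close()
```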

## How was this patch tested?
 manual tests

Closes #23917 from zuotingbing/SPARK-27010.

Authored-by: zuotingbing <zuo.tingbing9@zte.com.cn>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-12 13:38:41 -05:00
shivusondur 4b6d39d85d [SPARK-27090][CORE] Removing old LEGACY_DRIVER_IDENTIFIER ("<driver>")
## What changes were proposed in this pull request?
LEGACY_DRIVER_IDENTIFIER and its references are removed.
The corresponding tests are updated.

## How was this patch tested?
Tested with the existing unit test cases.

Closes #24026 from shivusondur/newjira2.

Authored-by: shivusondur <shivusondur@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-12 13:29:39 -05:00
Shahid 1853db3186 [SPARK-27125][SQL][TEST] Add test suite for sql execution page
## What changes were proposed in this pull request?
Added test suite for AllExecutionsPage class. Checked the scenarios for SPARK-27019 and SPARK-27075.

## How was this patch tested?
Added UT, manually tested

Closes #24052 from shahidki31/SPARK-27125.

Authored-by: Shahid <shahidki31@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-12 10:15:28 -05:00
Ajith b8dd84b9e4 [SPARK-27011][SQL] reset command fails with cache
## What changes were proposed in this pull request?

When the cache is enabled (i.e. once a CACHE TABLE command is executed), any following SQL triggers
CacheManager#lookupCachedData, which creates a copy of the tree node and in turn calls TreeNode#makeCopy. The problem is that makeCopy tries to create a copy instance, but as ResetCommand is a case object, this fails.

## How was this patch tested?

Added UT to reproduce the issue

Closes #23918 from ajithme/reset.

Authored-by: Ajith <ajith2489@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-03-12 11:02:09 +08:00
Maxim Gekk 60be6d2ea3 [SPARK-27109][SQL] Refactoring of TimestampFormatter and DateFormatter
## What changes were proposed in this pull request?

In the PR, I propose to refactor the `parse()` methods of `Iso8601DateFormatter`/`Iso8601TimestampFormatter` and their `toInstantWithZoneId` helper to achieve the following (a rough sketch follows the list):
- Avoid unnecessary conversion of the parsed input to `java.time.Instant` before converting it to micros and days. The necessary information already exists in `ZonedDateTime`, and micros/days can be extracted from it.
- Avoid additional extraction of LocalTime from the parsed object, more precisely, the double query of `TemporalQueries.localTime` from `temporalAccessor`.
- Avoid additional extraction of the zone id from the parsed object, in particular, the double query of `TemporalQueries.offset()`.
- Use `ZoneOffset.UTC` instead of `ZoneId.of` in `DateFormatter`. This avoids looking up the zone offset by zone id.
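
A rough sketch of the first point, with a hypothetical helper: the parsed `ZonedDateTime` already carries enough information to compute epoch microseconds, so no intermediate `java.time.Instant` is needed.

```scala
import java.time.ZonedDateTime

// Epoch microseconds computed straight from the parsed ZonedDateTime.
def toMicros(zdt: ZonedDateTime): Long =
  Math.addExact(Math.multiplyExact(zdt.toEpochSecond, 1000000L), zdt.getNano / 1000L)
```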

## How was this patch tested?

By existing test suite `DateTimeUtilsSuite`, `TimestampFormatterSuite` and `DateFormatterSuite`.

Closes #24030 from MaxGekk/query-localtime.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-11 19:02:30 -05:00
Hyukjin Kwon 3725b1324f [SPARK-26923][SQL][R] Refactor ArrowRRunner and RRunner to share one BaseRRunner
## What changes were proposed in this pull request?

This PR proposes to have one base R runner.

At a high level:

Previously, it had `ArrowRRunner` and it inherited `RRunner`:

```
└── RRunner
    └── ArrowRRunner
```

After this PR, now it has a `BaseRRunner`, and `ArrowRRunner` and `RRunner` inherit `BaseRRunner`:

```
└── BaseRRunner
    ├── ArrowRRunner
    └── RRunner
```

This way is consistent with Python's.

In more details, see below:

```scala
class BaseRRunner[IN, OUT] {

  def compute: Iterator[OUT] = {
    ...
    newWriterThread(...).start()
    ...
    newReaderIterator(...)
    ...
  }

  // Make a thread that writes data from JVM to R process
  abstract protected def newWriterThread(..., iter: Iterator[IN], ...): WriterThread

  // Make an iterator that reads data from the R process to JVM
  abstract protected def newReaderIterator(...): ReaderIterator

  abstract class WriterThread(..., iter: Iterator[IN], ...) extends Thread {
    override def run(): Unit {
      ...
      writeIteratorToStream(...)
      ...
    }

    // Actually writing logic to the socket stream.
    abstract protected def writeIteratorToStream(dataOut: DataOutputStream): Unit
  }

  abstract class ReaderIterator extends Iterator[OUT] {
    override def hasNext(): Boolean = {
      ...
      read(...)
      ...
    }

    override def next(): OUT = {
      ...
      hasNext()
      ...
    }

    // Actually reading logic from the socket stream.
    abstract protected def read(...): OUT
  }
}
```

```scala
case [Arrow]RRunner extends BaseRRunner {
  override def newWriterThread(...) {
    new WriterThread(...) {
      override def writeIteratorToStream(...) {
        ...
      }
    }
  }

  override def newReaderIterator(...) {
    new ReaderIterator(...) {
      override def read(...) {
        ...
      }
    }
  }
}
```

## How was this patch tested?

Manually tested and existing tests should cover.

Closes #23977 from HyukjinKwon/SPARK-26923.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-03-12 08:45:29 +09:00
Wenchen Fan 31878c9daa [SPARK-27119][SQL] Do not infer schema when reading Hive serde table with native data source
## What changes were proposed in this pull request?

In Spark 2.1, we hit a correctness bug. When reading a Hive serde parquet table with the native parquet data source, if the actual file schema doesn't match the table schema in the Hive metastore (only an upper/lower case difference), the query returns 0 results.

The reason is that, the parquet reader is case sensitive. If we push down filters with column names that don't match the file physical schema case-sensitively, no data will be returned.

To fix this bug, there were 2 solutions proposed at that time:
1. Add a config to optionally disable parquet filter pushdown, and make parquet column pruning case insensitive.
https://github.com/apache/spark/pull/16797

2. Infer the actual schema from data files, when reading Hive serde table with native data source. A config is provided to disable it.
https://github.com/apache/spark/pull/17229

Solution 2 was accepted and merged to Spark 2.1.1

In Spark 2.4, we refactored the parquet data source a little:
1. do parquet filter pushdown with the actual file schema.
https://github.com/apache/spark/pull/21696

2. make parquet filter pushdown case insensitive.
https://github.com/apache/spark/pull/22197

3. make parquet column pruning case insensitive.
https://github.com/apache/spark/pull/22148

With these patches, the correctness bug in Spark 2.1 no longer exists, and the schema inference becomes unnecessary.

To be safe, this PR just changes the default value to NEVER_INFER, so that users can set it back to INFER_AND_SAVE. If we don't receive any bug reports for it, we can remove the related code in the next release.
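
A usage sketch; the config name below is given as I recall it and should be verified against the release. Users who still rely on schema inference for Hive serde tables can switch the behaviour back:

```scala
// Assumed config key: restore the pre-change inference behaviour if needed.
spark.conf.set("spark.sql.hive.caseSensitiveInferenceMode", "INFER_AND_SAVE")
```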

## How was this patch tested?

existing tests

Closes #24041 from cloud-fan/infer.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2019-03-11 09:44:29 -07:00