ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Hyukjin Kwon	84ec06d95e	Revert "[SPARK-27262][R] Add explicit UTF-8 Encoding to DESCRIPTION" This reverts commit `240c6a4d75`.	2019-03-25 11:02:14 +09:00
Dongjoon Hyun	6ef94e0f18	[SPARK-27260][SS] Upgrade to Kafka 2.2.0 ## What changes were proposed in this pull request? This PR aims to update the Kafka dependency to 2.2.0 to bring in the following improvements and bug fixes. - https://issues.apache.org/jira/projects/KAFKA/versions/12344063 Due to [KAFKA-4453](https://issues.apache.org/jira/browse/KAFKA-4453), data plane API and controller plane API are separated. Apache Spark needs the following changes. ```scala - servers.head.apis.metadataCache + servers.head.dataPlaneRequestProcessor.metadataCache ``` ## How was this patch tested? Pass the Jenkins with the existing tests. Closes #24190 from dongjoon-hyun/SPARK-27260. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-24 17:39:57 -07:00
Maxim Gekk	52671d631d	[SPARK-27008][SQL][FOLLOWUP] Fix typo from `_EANBLED` to `_ENABLED` ## What changes were proposed in this pull request? This fixes a typo in the SQL config value: DATETIME_JAVA8API_EANBLED -> DATETIME_JAVA8API_ENABLED. ## How was this patch tested? This was tested by `RowEncoderSuite` and `LiteralExpressionSuite`. Closes #24194 from MaxGekk/date-localdate-followup. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-24 17:16:33 -07:00
pgandhi	a6c207c9c0	[SPARK-24935][SQL] fix Hive UDAF with two aggregation buffers ## What changes were proposed in this pull request? Hive UDAF knows the aggregation mode when creating the aggregation buffer, so that it can create different buffers for different inputs: the original data or the aggregation buffer. Please see an example in the [sketches library](`7f9e76e9e0/src/main/java/com/yahoo/sketches/hive/cpc/DataToSketchUDAF.java (L107)`). However, the Hive UDAF adapter in Spark always creates the buffer with partial1 mode, which can only deal with one input: the original data. This PR fixes it. All credit goes to pgandhi999, who investigated the problem, studied the Hive UDAF behaviors, and wrote the tests. close https://github.com/apache/spark/pull/23778 ## How was this patch tested? a new test Closes #24144 from cloud-fan/hive. Lead-authored-by: pgandhi <pgandhi@verizonmedia.com> Co-authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2019-03-24 16:07:35 -07:00
John Zhuge	a15f17ce27	[SPARK-27250][TEST-MAVEN][BUILD] Scala 2.11 maven compile should target Java 1.8 ## What changes were proposed in this pull request? Fix Scala 2.11 maven build issue after merging SPARK-26946. ## How was this patch tested? Maven Scala 2.11 and 2.12 builds with `-Phadoop-provided -Phadoop-2.7 -Pyarn -Phive -Phive-thriftserver`. Closes #24184 from jzhuge/SPARK-26946-1. Authored-by: John Zhuge <jzhuge@apache.org> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-24 09:05:41 -05:00
Liang-Chi Hsieh	6f18ac9e99	[SPARK-27241][SQL] Support map_keys and map_values in SelectedField ## What changes were proposed in this pull request? `SelectedField` doesn't support map_keys and map_values for now. When a map key or value is a complex struct, we should be able to prune unnecessary fields from keys/values. This proposes to add map_keys and map_values support to `SelectedField`. ## How was this patch tested? Added tests. Closes #24179 from viirya/SPARK-27241. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-23 23:13:31 -07:00
Takeshi Yamamuro	01e63053df	[SPARK-25196][SPARK-27251][SQL][FOLLOWUP] Add synchronized for InMemoryRelation.statsOfPlanToCache ## What changes were proposed in this pull request? This is a follow-up of #24047; to follow the `CacheManager.cachedData` lock semantics, this pr wrapped the `statsOfPlanToCache` update with `synchronized`. ## How was this patch tested? Pass Jenkins Closes #24178 from maropu/SPARK-24047-FOLLOWUP. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-23 22:54:27 -07:00
Gengliang Wang	624288556d	[SPARK-27085][SQL] Migrate CSV to File Data Source V2 ## What changes were proposed in this pull request? Migrate CSV to File Data Source V2. ## How was this patch tested? Unit test Closes #24005 from gengliangwang/CSVDataSourceV2. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-23 15:43:46 -07:00
Michael Chirico	240c6a4d75	[SPARK-27262][R] Add explicit UTF-8 Encoding to DESCRIPTION ## What changes were proposed in this pull request? I got this warning when following the recommended approach to generating documentation: ``` Warning message: roxygen2 requires Encoding: UTF-8 ``` As can be seen in [other](https://github.com/tidyverse/tidyverse/blob/master/DESCRIPTION) [`tidyverse`](https://github.com/tidyverse/dplyr/blob/master/DESCRIPTION) [`DESCRIPTION`s](https://github.com/tidyverse/readr/blob/master/DESCRIPTION), this is standard practice. This PR adds `Encoding: UTF-8` to `R/pkg/DESCRIPTION`. ## How was this patch tested? Pass the Jenkins without warning. Closes #23823 from MichaelChirico/patch-1. Authored-by: Michael Chirico <michaelchirico4@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-23 15:24:54 -07:00
Maxim Gekk	027ed2d11b	[SPARK-23643][CORE][SQL][ML] Shrinking the buffer in hashSeed up to size of the seed parameter ## What changes were proposed in this pull request? The hashSeed method allocates 64 bytes instead of 8. Other bytes are always zeros (thanks to default behavior of ByteBuffer). And they could be excluded from hash calculation because they don't differentiate inputs. ## How was this patch tested? By running the existing tests - XORShiftRandomSuite Closes #20793 from MaxGekk/hash-buff-size. Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Co-authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-23 11:26:09 -05:00
Marco Gaido	fe317dc74e	[SPARK-27243][SQL] RuleExecutor.dumpTimeSpent should not throw exception when empty ## What changes were proposed in this pull request? `RuleExecutor.dumpTimeSpent` currently throws an exception when invoked before any rule is run or immediately after `RuleExecutor.reset`. The PR makes it return an empty summary instead, which is the expected output. ## How was this patch tested? added UT Closes #24180 from mgaido91/SPARK-27243. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-03-23 09:49:20 +09:00
hehuiyuan	68abf77b1a	[SPARK-27184][CORE] Avoid hardcoded 'spark.jars', 'spark.files', 'spark.submit.pyFiles' and 'spark.submit.deployMode' ## What changes were proposed in this pull request? For [SPARK-27184](https://issues.apache.org/jira/browse/SPARK-27184) In the `org.apache.spark.internal.config`, we define the variables of `FILES` and `JARS`, we can use them instead of "spark.jars" and "spark.files". ```scala private[spark] val JARS = ConfigBuilder("spark.jars") .stringConf .toSequence .createWithDefault(Nil) ``` ```scala private[spark] val FILES = ConfigBuilder("spark.files") .stringConf .toSequence .createWithDefault(Nil) ``` Other : In the `org.apache.spark.internal.config`, we define the variables of `SUBMIT_PYTHON_FILES ` and `SUBMIT_DEPLOY_MODE `, we can use them instead of "spark.submit.pyFiles" and "spark.submit.deployMode". ```scala private[spark] val SUBMIT_PYTHON_FILES = ConfigBuilder("spark.submit.pyFiles") .stringConf .toSequence .createWithDefault(Nil) ``` ```scala private[spark] val SUBMIT_DEPLOY_MODE = ConfigBuilder("spark.submit.deployMode") .stringConf .createWithDefault("client") ``` Closes #24123 from hehuiyuan/hehuiyuan-patch-6. Authored-by: hehuiyuan <hehuiyuan@ZBMAC-C02WD3K5H.local> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-03-23 09:43:00 +09:00
Jungtaek Lim (HeartSaVioR)	8a9eb05137	[SPARK-26606][CORE] Handle driver options properly when submitting to standalone cluster mode via legacy Client ## What changes were proposed in this pull request? This patch fixes the issue that ClientEndpoint in standalone cluster doesn't recognize driver options which are passed to SparkConf instead of system properties. When `Client` is executed via cli they should be provided as system properties, but with `spark-submit` they can be provided as SparkConf. (SparkSubmit will call `ClientApp.start` with SparkConf which would contain these options.) ## How was this patch tested? Manually tested via following steps: 1) setup standalone cluster (launch master and worker via `./sbin/start-all.sh`) 2) submit one of example app with standalone cluster mode ``` ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master "spark://localhost:7077" --conf "spark.driver.extraJavaOptions=-Dfoo=BAR" --deploy-mode "cluster" --num-executors 1 --driver-memory 512m --executor-memory 512m --executor-cores 1 examples/jars/spark-examples*.jar 10 ``` 3) check whether `foo=BAR` is provided in system properties in Spark UI <img width="877" alt="Screen Shot 2019-03-21 at 8 18 04 AM" src="https://user-images.githubusercontent.com/1317309/54728501-97db1700-4bc1-11e9-89da-078445c71e9b.png"> Closes #24163 from HeartSaVioR/SPARK-26606. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-03-22 15:07:49 -07:00
Ryan Blue	34e3cc7060	[SPARK-27108][SQL] Add parsed SQL plans for create, CTAS. ## What changes were proposed in this pull request? This moves parsing `CREATE TABLE ... USING` statements into catalyst. Catalyst produces logical plans with the parsed information and those plans are converted to v1 `DataSource` plans in `DataSourceAnalysis`. This prepares for adding v2 create plans that should receive the information parsed from SQL without being translated to v1 plans first. This also makes it possible to parse in catalyst instead of breaking the parser across the abstract `AstBuilder` in catalyst and `SparkSqlParser` in core. For more information, see the [mailing list thread](https://lists.apache.org/thread.html/54f4e1929ceb9a2b0cac7cb058000feb8de5d6c667b2e0950804c613%3Cdev.spark.apache.org%3E). ## How was this patch tested? This uses existing tests to catch regressions. This introduces no behavior changes. Closes #24029 from rdblue/SPARK-27108-add-parsed-create-logical-plans. Authored-by: Ryan Blue <blue@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-22 13:58:54 -07:00
Jungtaek Lim (HeartSaVioR)	78d546fe15	[SPARK-27210][SS] Cleanup incomplete output files in ManifestFileCommitProtocol if task is aborted ## What changes were proposed in this pull request? This patch makes ManifestFileCommitProtocol clean up incomplete output files at the task level if a task aborts. Please note that this works as 'best-effort', not as a guarantee, as we have in HadoopMapReduceCommitProtocol. ## How was this patch tested? Added UT. Closes #24154 from HeartSaVioR/SPARK-27210. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>	2019-03-22 11:26:53 -07:00
Martin Junghanns	8efc5ec72e	[SPARK-27174][SQL] Add support for casting integer types to binary Co-authored-by: Philip Stutz <philip.stutzgmail.com> ## What changes were proposed in this pull request? This PR adds support for casting * `ByteType` * `ShortType` * `IntegerType` * `LongType` to `BinaryType`. ## How was this patch tested? We added unit tests for casting instances of the above types. For validation, we used Java's `DataOutputStream` to compare the resulting byte array with the result of `Cast`. We state that the contribution is our original work and that we license the work to the project under the project’s open source license. cloud-fan we'd appreciate a review if you find the time, thx Closes #24107 from s1ck/cast_to_binary. Authored-by: Martin Junghanns <martin.junghanns@neotechnology.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-22 10:09:35 -07:00
10087686	8204dc1e54	[SPARK-27141][YARN] Use ConfigEntry for hardcoded configs for Yarn ## What changes were proposed in this pull request? There are some hardcoded configs in the code, and I think it is best to change them. ## How was this patch tested? Existing tests Closes #24103 from wangjiaochun/yarnHardCode. Authored-by: 10087686 <wang.jiaochun@zte.com.cn> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-22 05:29:29 -05:00
Jungtaek Lim (HeartSaVioR)	174531c183	[MINOR][CORE] Leverage modified Utils.classForName to reduce scalastyle off for Class.forName ## What changes were proposed in this pull request? This patch modifies Utils.classForName to take optional parameters - initialize, noSparkClassLoader - so that callers of Class.forName with the thread context classloader can use it instead. This helps to reduce scalastyle off for Class.forName. ## How was this patch tested? Existing UTs. Closes #24148 from HeartSaVioR/MINOR-reduce-scalastyle-off-for-class-forname. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-22 05:28:46 -05:00
Maxim Gekk	a529be2930	[SPARK-27212][SQL] Eliminate TimeZone to ZoneId conversion in stringToTimestamp ## What changes were proposed in this pull request? In the PR, I propose to avoid the `TimeZone` to `ZoneId` conversion in `DateTimeUtils.stringToTimestamp` by changing the signature of the method to require a parameter of `ZoneId` type. This avoids an unnecessary conversion (`TimeZone` -> `String` -> `ZoneId`) for each row. The PR also avoids creating `ZoneId` instances from `ZoneOffset`, because `ZoneOffset` is a sub-class and that conversion is unnecessary too. ## How was this patch tested? It was tested by `DateTimeUtilsSuite` and `CastSuite`. Closes #24155 from MaxGekk/stringtotimestamp-zoneid. Authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-03-22 18:01:29 +09:00
maryannxue	9f58d3b436	[SPARK-27236][TEST] Refactor log-appender pattern in tests ## What changes were proposed in this pull request? Refactored code in tests regarding the "withLogAppender()" pattern by creating a general helper method in SparkFunSuite. ## How was this patch tested? Passed existing tests. Closes #24172 from maryannxue/log-appender. Authored-by: maryannxue <maryannxue@apache.org> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2019-03-21 19:18:30 -07:00
John Zhuge	80565ce253	[SPARK-26946][SQL] Identifiers for multi-catalog ## What changes were proposed in this pull request? - Support N-part identifier in SQL - N-part identifier extractor in Analyzer ## How was this patch tested? - A new unit test suite ResolveMultipartRelationSuite - CatalogLoadingSuite rblue cloud-fan mccheah Closes #23848 from jzhuge/SPARK-26946. Authored-by: John Zhuge <jzhuge@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-21 18:04:50 -07:00
Maxim Gekk	0f4f8160e6	[SPARK-27222][SQL] Support Instant and LocalDate in Literal.apply ## What changes were proposed in this pull request? In the PR, I propose to extend `Literal.apply` to support constructing literals of `TimestampType` and `DateType` from `java.time.Instant` and `java.time.LocalDate`. These Java classes are already supported as external types for `TimestampType` and `DateType` by the PRs #23811 and #23913. ## How was this patch tested? Added new tests to `LiteralExpressionSuite`. Closes #24161 from MaxGekk/literal-instant-localdate. Authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-21 12:57:32 -07:00
Takeshi Yamamuro	0627850b7e	[SPARK-25196][SQL] Extends the analyze column command for cached tables ## What changes were proposed in this pull request? This pr extended `ANALYZE` commands to analyze column stats for cached table. In common use cases, users read catalog table data, join/aggregate them, and then cache the result for following reuse. Since we are only allowed to analyze column statistics in catalog tables via ANALYZE commands, the current optimization depends on non-existing or inaccurate column statistics of cached data. So, it would be great if we could analyze cached data as follows; ```scala scala> def printColumnStats(tableName: String) = { \| spark.table(tableName).queryExecution.optimizedPlan.stats.attributeStats.foreach { \| case (k, v) => println(s"[$k]: $v") \| } \| } scala> sql("SET spark.sql.cbo.enabled=true") scala> sql("SET spark.sql.statistics.histogram.enabled=true") scala> spark.range(1000).selectExpr("id % 33 AS c0", "rand() AS c1", "0 AS c2").write.saveAsTable("t") scala> sql("ANALYZE TABLE t COMPUTE STATISTICS FOR COLUMNS c0, c1, c2") scala> spark.table("t").groupBy("c0").agg(count("c1").as("v1"), sum("c2").as("v2")).createTempView("temp") // Prints column statistics in catalog table `t` scala> printColumnStats("t") [c0#7073L]: ColumnStat(Some(33),Some(0),Some(32),Some(0),Some(8),Some(8),Some(Histogram(3.937007874015748,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;9f7c1c)),2) [c1#7074]: ColumnStat(Some(944),Some(3.2108484832404915E-4),Some(0.997584797423909),Some(0),Some(8),Some(8),Some(Histogram(3.937007874015748,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;60a386b1)),2) [c2#7075]: ColumnStat(Some(1),Some(0),Some(0),Some(0),Some(4),Some(4),Some(Histogram(3.937007874015748,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;5ffd29e8)),2) // Prints column statistics on cached table `temp` scala> sql("CACHE TABLE temp") scala> printColumnStats("temp") <No Column Statistics> // Analyzes columns `v1` 
and `v2` on cached table `temp` scala> sql("ANALYZE TABLE temp COMPUTE STATISTICS FOR COLUMNS v1, v2") // Then, prints again scala> printColumnStats("temp") [v1#7084L]: ColumnStat(Some(2),Some(30),Some(31),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;49f7bb6f)),2) [v2#7086L]: ColumnStat(Some(1),Some(0),Some(0),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;12701677)),2) // Analyzes one left column and prints again scala> sql("ANALYZE TABLE temp COMPUTE STATISTICS FOR COLUMNS c0") scala> printColumnStats("temp") [v1#7084L]: ColumnStat(Some(2),Some(30),Some(31),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;49f7bb6f)),2) [v2#7086L]: ColumnStat(Some(1),Some(0),Some(0),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;12701677)),2) [c0#7073L]: ColumnStat(Some(33),Some(0),Some(32),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;1f5c1b81)),2) ``` ## How was this patch tested? Added tests in `CachedTableSuite` and `StatisticsCollectionSuite`. Closes #24047 from maropu/SPARK-25196-4. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-21 09:20:35 -07:00
gengjiaan	22c9ed6a9c	[MINOR][SQL] Put the grammar of database together, because this is good for maintenance and readability. ## What changes were proposed in this pull request? The SQL grammar for `SHOW DATABASES` is mixed in with some of the table grammar. I think the database grammar should be grouped together. This is good for maintenance and readability. ## How was this patch tested? No UT Closes #24138 from beliefer/arrange-sql-grammar. Authored-by: gengjiaan <gengjiaan@360.cn> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-21 06:41:04 -05:00
Bryan Cutler	be08b415da	[SPARK-27163][PYTHON] Cleanup and consolidate Pandas UDF functionality ## What changes were proposed in this pull request? This change is a cleanup and consolidation of 3 areas related to Pandas UDFs: 1) `ArrowStreamPandasSerializer` now inherits from `ArrowStreamSerializer` and uses the base class `dump_stream`, `load_stream` to create Arrow reader/writer and send Arrow record batches. `ArrowStreamPandasSerializer` makes the conversions to/from Pandas and converts to Arrow record batch iterators. This change removed duplicated creation of Arrow readers/writers. 2) `createDataFrame` with Arrow now uses `ArrowStreamPandasSerializer` instead of doing its own conversions from Pandas to Arrow and sending record batches through `ArrowStreamSerializer`. 3) Grouped Map UDFs now reuse existing logic in `ArrowStreamPandasSerializer` to send Pandas DataFrame results as a `StructType` instead of separating each column from the DataFrame. This makes the code a little more consistent with the Python worker, but does require that the returned StructType column is flattened out in `FlatMapGroupsInPandasExec` in Scala. ## How was this patch tested? Existing tests and ran tests with pyarrow 0.12.0 Closes #24095 from BryanCutler/arrow-refactor-cleanup-UDFs. Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-03-21 17:44:51 +09:00
Venkata krishnan Sowrirajan	b1857a4d7d	[SPARK-26894][SQL] Handle Alias as well in AggregateEstimation to propagate child stats ## What changes were proposed in this pull request? Currently aliases are not handled in the Aggregate Estimation due to which stats are not getting propagated. This causes CBO join-reordering to not give optimal join plans. ProjectEstimation is already taking care of aliases, we need same logic for AggregateEstimation as well to properly propagate stats when CBO is enabled. ## How was this patch tested? This patch is manually tested using the query Q83 of TPCDS benchmark (scale 1000) Closes #23803 from venkata91/aggstats. Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@qubole.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-03-21 11:21:56 +09:00
Shixiong Zhu	c26379b446	[SPARK-27221][SQL] Improve the assert error message in TreeNode.parseToJson ## What changes were proposed in this pull request? `TreeNode.parseToJson` may throw an assert error without any error message when a TreeNode is not implemented properly, and it's hard to find the bad TreeNode implementation. This PR adds the assert message to improve the error, like what `TreeNode.jsonFields` does. ## How was this patch tested? Jenkins Closes #24159 from zsxwing/SPARK-27221. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-03-21 11:15:05 +09:00
maryannxue	2e090ba628	[SPARK-27223][SQL] Remove private methods that skip conversion when passing user schemas for constructing a DataFrame ## What changes were proposed in this pull request? When passing in a user schema to create a DataFrame, there might be mismatched nullability between the user schema and the actual data. All related public interfaces now perform catalyst conversion using the user provided schema, which catches such mismatches to avoid runtime errors later on. However, there are private methods which allow this conversion to be skipped, so we need to remove these private methods, which may lead to confusion and potential issues. ## How was this patch tested? Passed existing tests. No new tests were added since this PR removed the private interfaces that would potentially cause null problems and other interfaces are covered already by existing tests. Closes #24162 from maryannxue/spark-27223. Authored-by: maryannxue <maryannxue@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-03-21 11:13:25 +09:00
Ruocheng Jiang	d6ee2f331d	[MINOR][EXAMPLES] Add missing return keyword streaming word count example This is a very low level error. Closes #24153 from jiangruocheng/master. Authored-by: Ruocheng Jiang <jiangruocheng@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-20 17:59:12 -05:00
Jungtaek Lim (HeartSaVioR)	a8d9531edc	[SPARK-27205][CORE] Remove complicated logic for just leaving warning log when main class is scala.App ## What changes were proposed in this pull request? [SPARK-26977](https://issues.apache.org/jira/browse/SPARK-26977) introduced a very strange bug in which spark-shell is no longer able to load classes which are provided via `--packages`. TBH I don't know the details of why it is broken, but it looks like initializing `object class` brings the weirdness (maybe due to static initialization done twice?). This patch removes the logic to leave a warning log when main class is scala.App, to not deal with such complexity for just leaving a warning message. ## How was this patch tested? Manual test: suppose we run spark-shell with `--packages` option like below: ``` ./bin/spark-shell --verbose --master "local[*]" --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0 ``` Before this patch, importing class in transitive dependency fails: ``` Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). Spark context Web UI available at http://localhost:4040 Spark context available as 'sc' (master = local[*], app id = local-1553005771597). Spark session available as 'spark'. Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.0.0-SNAPSHOT /_/ Using Scala version 2.12.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_191) Type in expressions to have them evaluated. Type :help for more information. scala> import org.apache.kafka <console>:23: error: object kafka is not a member of package org.apache import org.apache.kafka ``` After this patch, importing class in transitive dependency succeeds: ``` Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". 
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). Spark context Web UI available at http://localhost:4040 Spark context available as 'sc' (master = local[*], app id = local-1553004095542). Spark session available as 'spark'. Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.0.0-SNAPSHOT /_/ Using Scala version 2.12.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_191) Type in expressions to have them evaluated. Type :help for more information. scala> import org.apache.kafka import org.apache.kafka ``` Closes #24147 from HeartSaVioR/SPARK-27205. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-20 17:55:48 -05:00
wangguangxin.cn	46f9f44918	[SPARK-27202][MINOR][SQL] Update comments to keep according with code ## What changes were proposed in this pull request? Update comments in `InMemoryFileIndex.listLeafFiles` to keep them in accordance with the code. ## How was this patch tested? existing test cases Closes #24146 from WangGuangxin/SPARK-27202. Authored-by: wangguangxin.cn <wangguangxin.cn@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-20 17:54:28 -05:00
Rob Vesse	61d99462a0	[SPARK-26729][K8S] Make image names under test configurable ## What changes were proposed in this pull request? Allow specifying system properties to customise the image names for the images used in the integration testing. Useful if your CI/CD pipeline or policy requires using a different naming format. This is one part of addressing SPARK-26729, I plan to have a follow up patch that will also make the names configurable when using `docker-image-tool.sh` ## How was this patch tested? Ran integration tests against custom images generated by our CI/CD pipeline that do not follow Spark's existing hardcoded naming conventions using the new system properties to override the image names appropriately: ``` mvn clean integration-test -pl :spark-kubernetes-integration-tests_${SCALA_VERSION} \ -Pkubernetes -Pkubernetes-integration-tests \ -P${SPARK_HADOOP_PROFILE} -Dhadoop.version=${HADOOP_VERSION} \ -Dspark.kubernetes.test.sparkTgz=${TARBALL} \ -Dspark.kubernetes.test.imageTag=${TAG} \ -Dspark.kubernetes.test.imageRepo=${REPO} \ -Dspark.kubernetes.test.namespace=${K8S_NAMESPACE} \ -Dspark.kubernetes.test.kubeConfigContext=${K8S_CONTEXT} \ -Dspark.kubernetes.test.deployMode=${K8S_TEST_DEPLOY_MODE} \ -Dspark.kubernetes.test.jvmImage=apache-spark \ -Dspark.kubernetes.test.pythonImage=apache-spark-py \ -Dspark.kubernetes.test.rImage=apache-spark-r \ -Dtest.include.tags=k8s ... [INFO] --- scalatest-maven-plugin:1.0:test (integration-test) spark-kubernetes-integration-tests_2.12 --- Discovery starting. Discovery completed in 230 milliseconds. Run starting. Expected test count is: 15 KubernetesSuite: - Run SparkPi with no resources - Run SparkPi with a very long application name. - Use SparkLauncher.NO_RESOURCE - Run SparkPi with a master URL without a scheme. - Run SparkPi with an argument. - Run SparkPi with custom labels, annotations, and environment variables. 
- Run extraJVMOptions check on driver - Run SparkRemoteFileTest using a remote data file - Run SparkPi with env and mount secrets. - Run PySpark on simple pi.py example - Run PySpark with Python2 to test a pyfiles example - Run PySpark with Python3 to test a pyfiles example - Run PySpark with memory customization - Run in client mode. - Start pod creation from template Run completed in 8 minutes, 33 seconds. Total number of tests run: 15 Suites: completed 2, aborted 0 Tests: succeeded 15, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` Closes #23846 from rvesse/SPARK-26729. Authored-by: Rob Vesse <rvesse@dotnetrdf.org> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-03-20 14:28:27 -07:00
Lantao Jin	93c6d2a198	[SPARK-27215][CORE] Correct the kryo configurations ## What changes were proposed in this pull request? ```scala val KRYO_USE_UNSAFE = ConfigBuilder("spark.kyro.unsafe") .booleanConf .createWithDefault(false) val KRYO_USE_POOL = ConfigBuilder("spark.kyro.pool") .booleanConf .createWithDefault(true) ``` kyro should be kryo ## How was this patch tested? no need Closes #24156 from LantaoJin/SPARK-27215. Authored-by: Lantao Jin <jinlantao@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-03-20 14:27:05 -07:00
Marcelo Vanzin	ec5e34205a	[SPARK-27094][YARN] Work around RackResolver swallowing thread interrupt. To avoid the case where the YARN libraries would swallow the exception and prevent YarnAllocator from shutting down, call the offending code in a separate thread, so that the parent thread can respond appropriately to the shut down. As a safeguard, also explicitly stop the executor launch thread pool when shutting down the application, to prevent new executors from coming up after the application started its shutdown. Tested with unit tests + some internal tests on real cluster. Closes #24017 from vanzin/SPARK-27094. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-03-20 11:48:06 -07:00
Sean Owen	c65f9b2bc3	[SPARK-26839][SQL] Work around classloader changes in Java 9 for Hive isolation Note, this doesn't really resolve the JIRA, but makes the changes we can make so far that would be required to solve it. ## What changes were proposed in this pull request? Java 9+ changed how ClassLoaders work. The two most salient points: - The boot classloader no longer 'sees' the platform classes. A new 'platform classloader' does and should be the parent of new ClassLoaders - The system classloader is no longer a URLClassLoader, so we can't get the URLs of JARs in its classpath ## How was this patch tested? We'll see whether Java 8 tests still pass here. Java 11 tests do not fully pass at this point; more notes below. This does make progress on the failures though. (NB: to test with Java 11, you need to build with Java 8 first, setting JAVA_HOME and java's executable correctly, then switch both to Java 11 for testing.) Closes #24057 from srowen/SPARK-26839. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-20 09:12:52 -05:00
Gengliang Wang	ef2d63bfb1	[SPARK-27201][WEBUI] Toggle full job description on click ## What changes were proposed in this pull request? Previously, in https://github.com/apache/spark/pull/6646 there was an improvement to show full job description after double clicks. I think this is a bit hard to be noticed by some users. I suggest changing the event to one click. Also, after the full description is shown, another click should be able to hide the overflow text again. Before click: ![short](https://user-images.githubusercontent.com/1097932/54608784-79bfca80-4a8c-11e9-912b-30799be0d6cb.png) After click: ![full](https://user-images.githubusercontent.com/1097932/54608790-7b898e00-4a8c-11e9-9251-86061158db68.png) Click again: ![short](https://user-images.githubusercontent.com/1097932/54608784-79bfca80-4a8c-11e9-912b-30799be0d6cb.png) ## How was this patch tested? Manually check. Closes #24145 from gengliangwang/showDescriptionDetail. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-03-20 21:29:13 +09:00
Maxim Gekk	1882912cca	[SPARK-27199][SQL] Replace TimeZone by ZoneId in TimestampFormatter API ## What changes were proposed in this pull request? In the PR, I propose to use `ZoneId` instead of `TimeZone` in: - the `apply` and `getFractionFormatter` methods of the `TimestampFormatter` object, - and in implementations of the `TimestampFormatter` trait like `FractionTimestampFormatter`. The reason for the changes is to avoid unnecessary conversion from `TimeZone` to `ZoneId` because `ZoneId` is used in `TimestampFormatter` implementations internally, and the conversion is performed via `String`, which is not free. Also taking into account that `TimeZone` instances are converted from `String` in some cases, the worst case looks like `String` -> `TimeZone` -> `String` -> `ZoneId`. The PR eliminates the unneeded conversions. ## How was this patch tested? It was tested by `DateExpressionsSuite`, `DateTimeUtilsSuite` and `TimestampFormatterSuite`. Closes #24141 from MaxGekk/zone-id. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-03-20 21:28:11 +09:00
Ajith	1f692e522c	[SPARK-27200][WEBUI][HISTORYSERVER] History Environment tab must sort Configurations/Properties by default The Environment page in the Spark UI shows all configurations sorted by key, but this is not the case in the History Server. To keep the UX the same, we can sort them in the History Server too. ## What changes were proposed in this pull request? When the Environment page is rendered, the properties are sorted before the page is created. ## How was this patch tested? Manually tested in the UI. Closes #24143 from ajithme/historyenv. Authored-by: Ajith <ajith2489@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-03-20 20:16:17 +09:00
Huon Wilson	b67d369572	[SPARK-27099][SQL] Add 'xxhash64' for hashing arbitrary columns to Long ## What changes were proposed in this pull request? This introduces a new SQL function 'xxhash64' for getting a 64-bit hash of an arbitrary number of columns. This is designed to exactly mimic the 32-bit `hash`, which uses MurmurHash3. The name is designed to be more future-proof than the 'hash', by indicating the exact algorithm used, similar to md5 and the sha hashes. ## How was this patch tested? The tests for the existing `hash` function were duplicated to run with `xxhash64`. Closes #24019 from huonw/hash64. Authored-by: Huon Wilson <Huon.Wilson@data61.csiro.au> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-20 16:34:34 +08:00
Darcy Shen	9a43852f17	[SPARK-27160][SQL] Fix DecimalType when building orc filters ## What changes were proposed in this pull request? A DecimalType literal should not be cast to Long. E.g., for `df.filter("x < 3.14")`, assuming df (x in DecimalType) reads from an ORC table and uses the native ORC reader with predicate push down enabled, we will push down the `x < 3.14` predicate to the ORC reader via a SearchArgument. OrcFilters constructs the SearchArgument, but does not handle DecimalType correctly. The previous implementation constructs `x < 3` from `x < 3.14`. ## How was this patch tested? ``` $ sbt > sql/testOnly OrcFilterSuite > sql/testOnly OrcQuerySuite -- -z "27160" ``` Closes #24092 from sadhen/spark27160. Authored-by: Darcy Shen <sadhen@zoho.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-19 20:28:46 -07:00
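The effect of the truncation described in SPARK-27160 can be illustrated outside Spark. The following is a minimal Python sketch, not the `OrcFilters` code: the sample values and the helper `matches` are hypothetical, and `int(...)` merely stands in for the cast to Long.

```python
from decimal import Decimal

# Hypothetical values of a DecimalType column `x` (not taken from the commit).
rows = [Decimal("2.50"), Decimal("3.05"), Decimal("3.14"), Decimal("4.00")]

literal = Decimal("3.14")


def matches(values, bound):
    """Return the values that satisfy `x < bound`."""
    return [v for v in values if v < bound]


# Predicate as written by the user: x < 3.14
expected = matches(rows, literal)

# Predicate after the literal is truncated to an integer: x < 3
truncated = matches(rows, Decimal(int(literal)))

# Values that satisfy x < 3.14 but would be filtered out by a pushed-down x < 3.
lost = [v for v in expected if v not in truncated]
```

Under these assumptions, any value in the interval [3, 3.14) satisfies the original predicate but not the truncated one, which is why the literal has to keep its decimal type when the filter is built.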
Dongjoon Hyun	257391497b	[SPARK-26975][SQL] Support nested-column pruning over limit/sample/repartition ## What changes were proposed in this pull request? As [SPARK-26958](https://github.com/apache/spark/pull/23862/files) benchmark shows, nested-column pruning has limitations. This PR aims to remove the limitations on `limit/repartition/sample`. Here, repartition means `Repartition`, not `RepartitionByExpression`. PREPARATION ```scala scala> spark.range(100).map(x => (x, (x, s"$x" * 100))).toDF("col1", "col2").write.mode("overwrite").save("/tmp/p") scala> sql("set spark.sql.optimizer.nestedSchemaPruning.enabled=true") scala> spark.read.parquet("/tmp/p").createOrReplaceTempView("t") ``` BEFORE ```scala scala> sql("SELECT col2._1 FROM (SELECT col2 FROM t LIMIT 1000000)").explain == Physical Plan == CollectLimit 1000000 +- (1) Project [col2#22._1 AS _1#28L] +- (1) FileScan parquet [col2#22] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/p], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col2:struct<_1:bigint>> scala> sql("SELECT col2._1 FROM (SELECT /*+ REPARTITION(1) */ col2 FROM t)").explain == Physical Plan == (2) Project [col2#22._1 AS _1#33L] +- Exchange RoundRobinPartitioning(1) +- (1) Project [col2#22] +- (1) FileScan parquet [col2#22] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/p], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col2:struct<_1:bigint,_2:string>> ``` AFTER ```scala scala> sql("SELECT col2._1 FROM (SELECT /*+ REPARTITION(1) */ col2 FROM t)").explain == Physical Plan == Exchange RoundRobinPartitioning(1) +- (1) Project [col2#5._1 AS _1#11L] +- (1) FileScan parquet [col2#5] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/p], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col2:struct<_1:bigint>> ``` This supersedes https://github.com/apache/spark/pull/23542 and https://github.com/apache/spark/pull/23873. ## How was this patch tested? Pass the Jenkins with a newly added test suite. Closes #23964 from dongjoon-hyun/SPARK-26975-ALIAS. Lead-authored-by: Dongjoon Hyun <dhyun@apple.com> Co-authored-by: DB Tsai <d_tsai@apple.com> Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-19 20:24:22 -07:00
Dongjoon Hyun	4d5247778a	[SPARK-27197][SQL][TEST] Add ReadNestedSchemaTest for file-based data sources ## What changes were proposed in this pull request? The reader schema is said to be evolved (or projected) when it changed after the data is written by writers. Apache Spark file-based data sources have a test coverage for that; e.g. [ReadSchemaSuite.scala](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/ReadSchemaSuite.scala). This PR aims to add a test coverage for nested columns by adding and hiding nested columns. ## How was this patch tested? Pass the Jenkins with newly added tests. Closes #24139 from dongjoon-hyun/SPARK-27197. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: DB Tsai <d_tsai@apple.com>	2019-03-20 00:22:05 +00:00
Hyukjin Kwon	c99463d4cf	[SPARK-26979][PYTHON][FOLLOW-UP] Make binary math/string functions take string as columns as well ## What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/23882 to handle binary math/string functions. For instance, see the cases below: Before: ```python >>> from pyspark.sql.functions import lit, ascii >>> spark.range(1).select(lit('a').alias("value")).select(ascii("value")) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../spark/python/pyspark/sql/functions.py", line 51, in _ jc = getattr(sc._jvm.functions, name)(col._jc if isinstance(col, Column) else col) File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1286, in __call__ File "/.../spark/python/pyspark/sql/utils.py", line 63, in deco return f(a, kw) File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py", line 332, in get_return_value py4j.protocol.Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.ascii. 
Trace: py4j.Py4JException: Method ascii([class java.lang.String]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339) at py4j.Gateway.invoke(Gateway.java:276) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) ``` ```python >>> from pyspark.sql.functions import atan2 >>> spark.range(1).select(atan2("id", "id")) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../spark/python/pyspark/sql/functions.py", line 78, in _ jc = getattr(sc._jvm.functions, name)(col1._jc if isinstance(col1, Column) else float(col1), ValueError: could not convert string to float: id ``` After: ```python >>> from pyspark.sql.functions import lit, ascii >>> spark.range(1).select(lit('a').alias("value")).select(ascii("value")) DataFrame[ascii(value): int] ``` ```python >>> from pyspark.sql.functions import atan2 >>> spark.range(1).select(atan2("id", "id")) DataFrame[ATAN2(id, id): double] ``` Note that, - This PR causes a slight behaviour change for math functions. For instance, numbers as strings (e.g., `"1"`) were supported as arguments of binary math functions before. After this PR, they are recognised as column names. - I also intentionally didn't document this behaviour change since we're going ahead for Spark 3.0 and I don't think numbers as strings make much sense in math functions. - There is another exception, `when`, which takes strings as literal values as below. This PR doesn't fix this ambiguity. 
```python >>> spark.range(1).select(when(lit(True), col("id"))).show() ``` ``` +--------------------------+ |CASE WHEN true THEN id END| +--------------------------+ | 0| +--------------------------+ ``` ```python >>> spark.range(1).select(when(lit(True), "id")).show() ``` ``` +--------------------------+ |CASE WHEN true THEN id END| +--------------------------+ | id| +--------------------------+ ``` This PR also changes the function naming as below. https://github.com/apache/spark/pull/23882 changed it to: - Rename `_create_function` to `_create_name_function` - Define new `_create_function` to take strings as column names. In this PR, I propose to: - Revert `_create_name_function` name to `_create_function`. - Define new `_create_function_over_column` to take strings as column names. ## How was this patch tested? Some unit tests were added for binary math / string functions. Closes #24121 from HyukjinKwon/SPARK-26979. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-03-20 08:06:10 +09:00
weixiuli	8b0aa59218	[SPARK-26288][CORE] add initRegisteredExecutorsDB ## What changes were proposed in this pull request? Spark on YARN uses a DB (https://github.com/apache/spark/pull/7943) to record RegisteredExecutors information, which can be reloaded and used again when the ExternalShuffleService is restarted. The RegisteredExecutors information is not recorded in Spark standalone mode or in Spark on K8s, so it is lost when the ExternalShuffleService is restarted. To solve this problem, this PR adds `initRegisteredExecutorsDB`. ## How was this patch tested? New unit tests. Closes #23393 from weixiuli/SPARK-26288. Authored-by: weixiuli <weixiuli@jd.com> Signed-off-by: Imran Rashid <irashid@cloudera.com>	2019-03-19 16:16:43 -05:00
Yuming Wang	6783831f68	[SPARK-27179][BUILD] Exclude javax.ws.rs:jsr311-api from hadoop-client ## What changes were proposed in this pull request? Since [YARN-7113](https://issues.apache.org/jira/browse/YARN-7113) (Hadoop-3.1.0), `hadoop-client` adds `javax.ws.rs:jsr311-api` to its dependencies. This conflicts with [javax.ws.rs-api-2.0.1.jar](`f26a1f3d37/dev/deps/spark-deps-hadoop-3.1 (L105)`). ```shell build/sbt "core/testOnly *.UISeleniumSuite *.HistoryServerSuite" -Phadoop-3.2 ... [info] <pre> Server Error</pre></p><h3>Caused by:</h3><pre>java.lang.NoSuchMethodError: javax.ws.rs.core.Application.getProperties()Ljava/util/Map; ... ``` This PR excludes `javax.ws.rs:jsr311-api` from hadoop-client. ## How was this patch tested? manual tests: ```shell build/sbt "core/testOnly *.UISeleniumSuite *.HistoryServerSuite" -Phadoop-3.2 ``` Closes #24114 from wangyum/SPARK-27179. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-19 13:31:40 -04:00
Zhu, Lipeng	99c427b1d3	[SPARK-27168][SQL][TEST] Add docker integration test for MsSql server ## What changes were proposed in this pull request? This PR aims to add a JDBC integration test for MsSql server. ## How was this patch tested? ``` ./build/mvn clean install -DskipTests ./build/mvn test -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.12 \ -Dtest=none -DwildcardSuites=org.apache.spark.sql.jdbc.MsSqlServerIntegrationSuite ``` Closes #24099 from lipzhu/SPARK-27168. Lead-authored-by: Zhu, Lipeng <lipzhu@ebay.com> Co-authored-by: Dongjoon Hyun <dhyun@apple.com> Co-authored-by: Lipeng Zhu <lipzhu@icloud.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-19 08:43:23 -07:00
s71955	e402de5fd0	[SPARK-26176][SQL] Verify column names for CTAS with `STORED AS` ## What changes were proposed in this pull request? Currently, users meet job abortions while creating a table using the Hive serde "STORED AS" with invalid column names. We had better prevent this by raising an AnalysisException with a guide to use aliases instead, as Parquet data source tables do, thus making the behaviour compatible with the error message shown while creating a Parquet/ORC native table. BEFORE ```scala scala> sql("set spark.sql.hive.convertMetastoreParquet=false") scala> sql("CREATE TABLE a STORED AS PARQUET AS SELECT 1 AS `COUNT(ID)`") Caused by: java.lang.IllegalArgumentException: No enum constant parquet.schema.OriginalType.col1 ``` AFTER ```scala scala> sql("CREATE TABLE a STORED AS PARQUET AS SELECT 1 AS `COUNT(ID)`") org.apache.spark.sql.AnalysisException: Attribute name "count(ID)" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.; ``` ## How was this patch tested? Pass the Jenkins with the newly added test case. Closes #24075 from sujith71955/master_serde. Authored-by: s71955 <sujithchacko.2010@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-19 20:29:47 +08:00
Takeshi Yamamuro	901c7408a4	[SPARK-27161][SQL][FOLLOWUP] Drops non-keywords from docs/sql-keywords.md ## What changes were proposed in this pull request? This PR is a follow-up of #24093 and includes the fixes below: - Lists only the keywords of Spark (that is, drops non-keywords); I listed all the keywords of ANSI SQL-2011 in the previous commit (SPARK-26215). - Sorts the keywords in `SqlBase.g4` in alphabetical order ## How was this patch tested? Pass Jenkins. Closes #24125 from maropu/SPARK-27161-FOLLOWUP. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-19 20:18:40 +08:00
mwlon	d5c08fcaab	[SPARK-26555][SQL] make ScalaReflection subtype checking thread safe ## What changes were proposed in this pull request? Make ScalaReflection subtype checking thread safe by adding a lock. There is a thread safety bug in the <:< operator in all versions of scala (https://github.com/scala/bug/issues/10766). ## How was this patch tested? Existing tests and a new one for the new subtype checking function. Closes #24085 from mwlon/SPARK-26555. Authored-by: mwlon <mloncaric@hmc.edu> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-19 18:22:01 +08:00
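The locking approach described in SPARK-26555 follows a general pattern: route every call to a check that is not thread safe through one shared lock. Below is a minimal Python sketch of that pattern, not the ScalaReflection code. The name `is_subtype_locked` is hypothetical, and `issubclass` only stands in for the Scala `<:<` operator.

```python
import threading

# One process-wide lock shared by every caller of the guarded check.
_subtype_lock = threading.Lock()


def is_subtype_locked(sub: type, sup: type) -> bool:
    """Run the subtype check while holding the lock, so calls never overlap."""
    with _subtype_lock:
        # Stand-in for a check that is assumed not to be thread safe.
        return issubclass(sub, sup)


results = []


def worker() -> None:
    # bool is a subclass of int in Python, so each call returns True.
    results.append(is_subtype_locked(bool, int))


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The trade-off of this design is that all subtype checks are serialized, which favours correctness over parallel throughput.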
wuyi	a8af23d7ab	[SPARK-27193][SQL] CodeFormatter should format multiple comment lines correctly ## What changes were proposed in this pull request? When `spark.sql.codegen.comments` is enabled, there will be multiple comment lines. However, CodeFormatter can not handle multiple comment lines currently: ``` /* 001 */ public Object generate(Object[] references) { /* 002 */ return new GeneratedIteratorForCodegenStage1(references); /* 003 */ } /* 004 */ /* 005 */ /** * Codegend pipeline for stage (id=1) * *(1) Project [(id#0L + 1) AS (id + 1)#3L] * +- *(1) Filter (id#0L = 1) * +- *(1) Range (0, 10, step=1, splits=4) */ /* 006 */ // codegenStageId=1 /* 007 */ final class GeneratedIteratorForCodegenStage1 extends org.apache.spark.sql.execution.BufferedRowIterator { ``` After applying this PR: ``` /* 001 */ public Object generate(Object[] references) { /* 002 */ return new GeneratedIteratorForCodegenStage1(references); /* 003 */ } /* 004 */ /* 005 */ /** /* 006 */ * Codegend pipeline for stage (id=1) /* 007 */ * *(1) Project [(id#0L + 1) AS (id + 1)#4L] /* 008 */ * +- *(1) Filter (id#0L = 1) /* 009 */ * +- *(1) Range (0, 10, step=1, splits=2) /* 010 */ */ /* 011 */ // codegenStageId=1 /* 012 */ final class GeneratedIteratorForCodegenStage1 extends org.apache.spark.sql.execution.BufferedRowIterator { ``` ## How was this patch tested? Tested Manually. Closes #24133 from Ngone51/fix-codeformatter-for-multi-comment-lines. Authored-by: wuyi <ngone_5451@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-19 14:47:51 +08:00
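The numbering behaviour that SPARK-27193 aims for, where every physical line gets its own prefix even inside a multi-line comment, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Spark's `CodeFormatter`: the function `number_lines` and the sample input are hypothetical.

```python
def number_lines(code: str) -> str:
    """Prefix every physical line with a zero-padded `/* NNN */` marker.

    Lines inside a multi-line comment are numbered like any other line.
    """
    numbered = []
    for index, line in enumerate(code.split("\n"), start=1):
        # rstrip() avoids a trailing space after the marker on empty lines.
        numbered.append(f"/* {index:03d} */ {line}".rstrip())
    return "\n".join(numbered)


# Hypothetical generated code containing a multi-line comment.
sample = "\n".join([
    "public Object generate(Object[] references) {",
    "  return null;",
    "}",
    "/**",
    " * Codegend pipeline for stage (id=1)",
    " */",
    "// codegenStageId=1",
])

formatted = number_lines(sample)
print(formatted)
```

Splitting on newlines before numbering is the key step: a formatter that treats a whole comment as one token would emit a single marker for it, which matches the "before" output quoted in the commit message.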

24072 commits