### What changes were proposed in this pull request?
In this PR, I propose to update the Spark SQL guide to document the SQL configs related to datetime rebasing:
- spark.sql.parquet.int96RebaseModeInWrite
- spark.sql.parquet.datetimeRebaseModeInWrite
- spark.sql.parquet.int96RebaseModeInRead
- spark.sql.parquet.datetimeRebaseModeInRead
- spark.sql.avro.datetimeRebaseModeInWrite
- spark.sql.avro.datetimeRebaseModeInRead
Parquet options added by #31489:
- datetimeRebaseMode
- int96RebaseMode
and Avro options added by #31529:
- datetimeRebaseMode
<img width="998" alt="Screenshot 2021-02-17 at 21 42 09" src="https://user-images.githubusercontent.com/1580697/108252043-3afb8900-7169-11eb-8568-511e21fa7f78.png">
### Why are the changes needed?
To inform users about the supported data source options and SQL configs.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By generating the doc and manually checking:
```
$ SKIP_API=1 SKIP_SCALADOC=1 SKIP_PYTHONDOC=1 SKIP_RDOC=1 jekyll serve --watch
```
Closes #31564 from MaxGekk/doc-rebase-options.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to update the Parquet website link from http://parquet.io to https://parquet.apache.org.
### Why are the changes needed?
The old website goes to the incubator site.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
N/A
Closes #31208 from williamhyun/minor-parquet.
Authored-by: William Hyun <williamhyun3@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Add a SQL example for UDAF
### Why are the changes needed?
To make SQL Reference complete
### Does this PR introduce any user-facing change?
Yes.
Add the following page, and change `Sql` to `SQL` in the example tab for all the SQL examples.
<img width="1110" alt="Screen Shot 2020-04-13 at 6 09 24 PM" src="https://user-images.githubusercontent.com/13592258/79175240-06cd7400-7db2-11ea-8f3e-af71a591a64b.png">
### How was this patch tested?
Manually build and check
Closes #28209 from huaxingao/udf_followup.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR fixes the docs to use proper HTML tags.
### Why are the changes needed?
To fix documentation formatting errors.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
N/A
Closes #26302 from uncleGen/minor-doc.
Authored-by: uncleGen <hustyugm@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Add AL2 license to metadata of all .md files.
This seemed to be the tidiest way, as front-matter metadata is ignored by .md renderers and other tools. Attempts to write the license as a markdown comment revealed that there is no such standard thing.
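For illustration, the metadata block looks roughly like this (an abridged sketch; the exact fields vary per page, and Jekyll drops the front matter from rendered output):
```
---
layout: global
title: Spark SQL Guide
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  ...
---
```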
## How was this patch tested?
Doc build
Closes #24243 from srowen/SPARK-26918.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Even if `spark.sql.hive.convertMetastoreParquet` is true, Spark SQL still cannot use its own Parquet support instead of the Hive SerDe when writing to partitioned Hive metastore Parquet tables.
Related code:
d53e11ffce/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala (L198)
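A short sketch of the documented behavior, assuming a Hive-enabled SparkSession `spark` (table name is illustrative):
```scala
// Conversion to Spark's native Parquet support is enabled (the default).
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "true")

spark.sql("CREATE TABLE hive_part (id INT) PARTITIONED BY (p INT) STORED AS PARQUET")

// Reads of the table can use Spark's native Parquet reader, but this
// write into a partitioned Hive metastore Parquet table still goes
// through the Hive SerDe rather than Spark's own Parquet writer.
spark.sql("INSERT INTO hive_part PARTITION (p = 1) SELECT 1")
```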
## How was this patch tested?
N/A
Closes #23671 from 10110346/parquetdoc.
Authored-by: liuxian <liu.xian3@zte.com.cn>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
When reading from empty tables, the optimization `OptimizeMetadataOnlyQuery` may return wrong results:
```
sql("CREATE TABLE t (col1 INT, p1 INT) USING PARQUET PARTITIONED BY (p1)")
sql("INSERT INTO TABLE t PARTITION (p1 = 5) SELECT ID FROM range(1, 1)")
sql("SELECT MAX(p1) FROM t")
```
The result is supposed to be `null`. However, with the optimization the result is `5`.
The rule was originally ported from https://issues.apache.org/jira/browse/HIVE-1003 in #13494. In Hive, the rule was disabled by default in a later release (https://issues.apache.org/jira/browse/HIVE-15397) due to the same problem.
It is hard to avoid the correctness issue completely: data sources like Parquet can answer such queries from metadata alone, and Spark can't tell whether a table is empty without actually reading it. This PR disables the optimization by default.
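A minimal sketch of the new default, assuming a SparkSession `spark` and the table `t` from above (the config key is `spark.sql.optimizer.metadataOnly`):
```scala
// Disabled by default after this change: MAX(p1) is computed from the
// data and correctly returns NULL for the empty partition.
spark.sql("SELECT MAX(p1) FROM t").show()

// Users who accept the risk can still re-enable the rule per session.
spark.conf.set("spark.sql.optimizer.metadataOnly", "true")
```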
## How was this patch tested?
Unit test
Closes #23635 from gengliangwang/optimizeMetadata.
Lead-authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Co-authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
Parquet files written by Spark do carry nullability info; it is only on read that the info is dropped and every column is reported as nullable.
## How was this patch tested?
Some test code (running Spark 2.3, but the relevant code in DataSource looks identical on master):
```
case class NullTest(bo: Boolean, opbol: Option[Boolean])
val testDf = spark.createDataFrame(Seq(NullTest(true, Some(false))))

defined class NullTest
testDf: org.apache.spark.sql.DataFrame = [bo: boolean, opbol: boolean]

testDf.write.parquet("s3://asana-stats/tmp_dima/parquet_check_schema")
spark.read.parquet("s3://asana-stats/tmp_dima/parquet_check_schema/part-00000-b1bf4a19-d9fe-4ece-a2b4-9bbceb490857-c000.snappy.parquet").printSchema()

root
 |-- bo: boolean (nullable = true)
 |-- opbol: boolean (nullable = true)
```
Meanwhile, the Parquet file produced does have the nullability info:
```
[]batchprod-report000:/tmp/dimakamalov-batch$ aws s3 ls s3://asana-stats/tmp_dima/parquet_check_schema/
2018-10-17 21:03:52          0 _SUCCESS
2018-10-17 21:03:50        504 part-00000-b1bf4a19-d9fe-4ece-a2b4-9bbceb490857-c000.snappy.parquet
[]batchprod-report000:/tmp/dimakamalov-batch$ aws s3 cp s3://asana-stats/tmp_dima/parquet_check_schema/part-00000-b1bf4a19-d9fe-4ece-a2b4-9bbceb490857-c000.snappy.parquet .
download: s3://asana-stats/tmp_dima/parquet_check_schema/part-00000-b1bf4a19-d9fe-4ece-a2b4-9bbceb490857-c000.snappy.parquet to ./part-00000-b1bf4a19-d9fe-4ece-a2b4-9bbceb490857-c000.snappy.parquet
[]batchprod-report000:/tmp/dimakamalov-batch$ java -jar parquet-tools-1.8.2.jar schema part-00000-b1bf4a19-d9fe-4ece-a2b4-9bbceb490857-c000.snappy.parquet
message spark_schema {
  required boolean bo;
  optional boolean opbol;
}
```
Closes #22759 from dima-asana/dima-asana-nullable-parquet-doc.
Authored-by: dima-asana <42555784+dima-asana@users.noreply.github.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
1. Split the main page of sql-programming-guide into 7 parts:
- Getting Started
- Data Sources
- Performance Tuning
- Distributed SQL Engine
- PySpark Usage Guide for Pandas with Apache Arrow
- Migration Guide
- Reference
2. Add a left menu for sql-programming-guide, keeping the first-level index of each part in the menu.
![image](https://user-images.githubusercontent.com/4833765/47016859-6332e180-d183-11e8-92e8-ce62518a83c4.png)
## How was this patch tested?
Local test with jekyll build/serve.
Closes #22746 from xuanyuanking/SPARK-24499.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>