ODIn/spark-instrumented-optimizer

25560 commits 6 branches 181 tags 397 MiB

Author	SHA1	Message	Date
Jules Damji	b71abd654d	[MINOR][DOC] Avro data source documentation change ## What changes were proposed in this pull request? This is a minor documentation change whereby the https://spark.apache.org/docs/latest/sql-data-sources-avro.html mentions "The date type and naming of record fields should match the input Avro data or Catalyst data," The term Catalyst data is confusing. It should instead say, Spark's internal data type such as String Type or IntegerType. ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) There are no code changes; only doc changes. Please review https://spark.apache.org/contributing.html before opening a pull request. Closes #24787 from dmatrix/br-orc-ds.doc.changes. Authored-by: Jules Damji <dmatrix@comcast.net> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2019-06-04 16:17:53 -07:00
Sean Owen	754f820035	[SPARK-26918][DOCS] All .md should have ASF license header ## What changes were proposed in this pull request? Add AL2 license to metadata of all .md files. This seemed to be the tidiest way as it will get ignored by .md renderers and other tools. Attempts to write them as markdown comments revealed that there is no such standard thing. ## How was this patch tested? Doc build Closes #24243 from srowen/SPARK-26918. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-30 19:49:45 -05:00
Gabor Somogyi	3729efb4d0	[SPARK-26856][PYSPARK] Python support for from_avro and to_avro APIs ## What changes were proposed in this pull request? Avro is built-in but external data source module since Spark 2.4 but `from_avro` and `to_avro` APIs not yet supported in pyspark. In this PR I've made them available from pyspark. ## How was this patch tested? Please see the python API examples what I've added. cd docs/ SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll build Manual webpage check. Closes #23797 from gaborgsomogyi/SPARK-26856. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-03-11 10:15:07 +09:00
Hyukjin Kwon	c406472970	[SPARK-26870][SQL] Move to_avro/from_avro into functions object due to Java compatibility ## What changes were proposed in this pull request? Currently, looks, to use `from_avro` and `to_avro` in Java APIs side, ```java import static org.apache.spark.sql.avro.package$.MODULE$; MODULE$.to_avro MODULE$.from_avro ``` This PR targets to deprecate and move both functions under `avro` package into `functions` object like the way of our `org.apache.spark.sql.functions`. Therefore, Java side can import: ```java import static org.apache.spark.sql.avro.functions.*; ``` and Scala side can import: ```scala import org.apache.spark.sql.avro.functions._ ``` ## How was this patch tested? Manually tested, and unit tests for Java APIs were added. Closes #23784 from HyukjinKwon/SPARK-26870. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-02-15 10:24:35 +08:00
Keiji Yoshida	de42281527	[MINOR][DOCS][WIP] Fix Typos ## What changes were proposed in this pull request? Fix Typos. ## How was this patch tested? NA Closes #23145 from kjmrknsn/docUpdate. Authored-by: Keiji Yoshida <kjmrknsn@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-11-29 10:39:00 -06:00
Gengliang Wang	24e8c27dfe	[SPARK-25819][SQL] Support parse mode option for the function `from_avro` ## What changes were proposed in this pull request? Current the function `from_avro` throws exception on reading corrupt records. In practice, there could be various reasons of data corruption. It would be good to support `PERMISSIVE` mode and allow the function from_avro to process all the input file/streaming, which is consistent with from_json and from_csv. There is no obvious down side for supporting `PERMISSIVE` mode. Different from `from_csv` and `from_json`, the default parse mode is `FAILFAST` for the following reasons: 1. Since Avro is structured data format, input data is usually able to be parsed by certain schema. In such case, exposing the problems of input data to users is better than hiding it. 2. For `PERMISSIVE` mode, we have to force the data schema as fully nullable. This seems quite unnecessary for Avro. Reversing non-null schema might archive more perf optimizations in Spark. 3. To be consistent with the behavior in Spark 2.4 . ## How was this patch tested? Unit test Manual previewing generated html for the Avro data source doc: ![image](https://user-images.githubusercontent.com/1097932/47510100-02558880-d8aa-11e8-9d57-a43daee4c6b9.png) Closes #22814 from gengliangwang/improve_from_avro. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-10-26 11:39:38 +08:00
Yuanjian Li	987f386588	[SPARK-24499][SQL][DOC] Split the page of sql-programming-guide.html to multiple separate pages ## What changes were proposed in this pull request? 1. Split the main page of sql-programming-guide into 7 parts: - Getting Started - Data Sources - Performance Turing - Distributed SQL Engine - PySpark Usage Guide for Pandas with Apache Arrow - Migration Guide - Reference 2. Add left menu for sql-programming-guide, keep first level index for each part in the menu. ![image](https://user-images.githubusercontent.com/4833765/47016859-6332e180-d183-11e8-92e8-ce62518a83c4.png) ## How was this patch tested? Local test with jekyll build/serve. Closes #22746 from xuanyuanking/SPARK-24499. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2018-10-18 11:59:06 -07:00

Renamed from docs/avro-data-source-guide.md (Browse further)

7 commits