Pablo Langa bbb2cba615 [SPARK-32025][SQL] Csv schema inference problems with different types in the same column

### What changes were proposed in this pull request?

This pull request fixes a bug present in the csv type inference.
We have problems when we have different types in the same column.

Previously:

```
$ cat /example/f1.csv
col1
43200000
true

spark.read.csv(path="file:///example/*.csv", header=True, inferSchema=True).show()
+----+
|col1|
+----+
|null|
|true|
+----+

root
 |-- col1: boolean (nullable = true)
```

Now:

```
spark.read.csv(path="file:///example/*.csv", header=True, inferSchema=True).show()
+-------------+
|col1         |
+-------------+
|43200000     |
|true         |
+-------------+

root
 |-- col1: string (nullable = true)
```

Previously the hierarchy of type inference is the following:

> IntegerType
> > LongType
> > > DecimalType
> > > > DoubleType
> > > > > TimestampType
> > > > > > BooleanType
> > > > > > > StringType

So, when, for example, we have integers in one column, and the last element is a boolean, all the column is inferred as a boolean column incorrectly and all the number are shown as null when you see the data

We need the following hierarchy. When we have different numeric types in the column it will be resolved correctly. And when we have other different types it will be resolved as a String type column

> IntegerType
> > LongType
> > > DecimalType
> > > > DoubleType
> > > > > StringType

> TimestampType
> > StringType

> BooleanType
> > StringType

> StringType

### Why are the changes needed?

Fix the bug explained

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test and manual tests

Closes #28896 from planga82/feature/SPARK-32025_csv_inference.

Authored-by: Pablo Langa <soypab@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>

2020-06-26 10:41:27 +09:00
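
The corrected hierarchy in the commit message can be read as a merge rule: two different numeric types widen to the later one in the numeric chain, and any other mix falls back to a string. The following is a minimal Python sketch of that rule only, not Spark's actual implementation. The names `NUMERIC_ORDER`, `merge_types`, and `infer_column_type` are hypothetical, types are represented as plain strings, and treating a column with no observed values as `StringType` is an assumption made for this sketch.

```python
# Numeric types in widening order, as listed in the commit message.
NUMERIC_ORDER = ["IntegerType", "LongType", "DecimalType", "DoubleType"]


def merge_types(left: str, right: str) -> str:
    """Return a type that can represent values of both inputs."""
    if left == right:
        return left
    if left in NUMERIC_ORDER and right in NUMERIC_ORDER:
        # Two different numeric types widen to the later one in the chain.
        return max(left, right, key=NUMERIC_ORDER.index)
    # Any other mix (for example a number and a boolean) becomes a string.
    return "StringType"


def infer_column_type(observed):
    """Fold merge_types over the per-value types seen in one column."""
    result = None
    for value_type in observed:
        result = value_type if result is None else merge_types(result, value_type)
    # Assumption for this sketch: an empty column is treated as a string column.
    return result if result is not None else "StringType"
```

Under this rule the column from the example, an integer followed by a boolean, resolves to `StringType` instead of `BooleanType`, so the numeric value is kept as text rather than shown as null.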
catalyst	[SPARK-32025][SQL] Csv schema inference problems with different types in the same column	2020-06-26 10:41:27 +09:00
core	[SPARK-32025][SQL] Csv schema inference problems with different types in the same column	2020-06-26 10:41:27 +09:00
hive	[SPARK-31710][SQL] Fail casting numeric to timestamp by default	2020-06-16 08:35:35 +00:00
hive-thriftserver	[SPARK-32034][SQL] Port HIVE-14817: Shutdown the SessionManager timeoutChecker thread properly upon shutdown	2020-06-21 16:28:00 -07:00
create-docs.sh	[SPARK-31550][SQL][DOCS] Set nondeterministic configurations with general meanings in sql configuration doc	2020-04-27 17:08:52 +09:00
gen-sql-api-docs.py	[SPARK-31474][SQL][FOLLOWUP] Replace _FUNC_ placeholder with functionname in the note field of expression info	2020-04-23 13:33:04 +09:00
gen-sql-config-docs.py	[SPARK-31550][SQL][DOCS] Set nondeterministic configurations with general meanings in sql configuration doc	2020-04-27 17:08:52 +09:00
gen-sql-functions-docs.py	[SPARK-31562][SQL] Update ExpressionDescription for substring, current_date, and current_timestamp	2020-04-26 11:46:52 -07:00
mkdocs.yml	[SPARK-30731] Update deprecated Mkdocs option	2020-02-19 17:28:58 +09:00
README.md	[SPARK-30510][SQL][DOCS] Publicly document Spark SQL configuration options	2020-02-09 19:20:47 +09:00

README.md

Spark SQL

This module provides support for executing relational queries expressed in either SQL or the DataFrame/Dataset API.

Spark SQL is broken up into four subprojects:

Catalyst (sql/catalyst) - An implementation-agnostic framework for manipulating trees of relational operators and expressions.
Execution (sql/core) - A query planner / execution engine for translating Catalyst's logical query plans into Spark RDDs. This component also includes a new public interface, SQLContext, that allows users to execute SQL or LINQ statements against existing RDDs and Parquet files.
Hive Support (sql/hive) - Includes extensions that allow users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes. There are also wrappers that allow users to run queries that include Hive UDFs, UDAFs, and UDTFs.
HiveServer and CLI support (sql/hive-thriftserver) - Includes support for the SQL CLI (bin/spark-sql) and a HiveServer2 (for JDBC/ODBC) compatible server.
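
To illustrate what "manipulating trees of relational operators and expressions" means for the Catalyst subproject, here is a small Python sketch of a bottom-up tree rewrite that folds constant additions. It is an illustration of the general technique only and does not use Catalyst's Scala API. The classes `Literal`, `Attribute`, and `Add` and the functions `transform_up` and `fold_constants` are hypothetical stand-ins defined for this example.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Attribute:
    name: str


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Literal, Attribute, Add]


def transform_up(expr: Expr, rule: Callable[[Expr], Expr]) -> Expr:
    """Rewrite the children first, then apply the rule to the rebuilt node."""
    if isinstance(expr, Add):
        expr = Add(transform_up(expr.left, rule), transform_up(expr.right, rule))
    return rule(expr)


def fold_constants(expr: Expr) -> Expr:
    """Replace Add(Literal, Literal) with one Literal; leave other nodes alone."""
    if (
        isinstance(expr, Add)
        and isinstance(expr.left, Literal)
        and isinstance(expr.right, Literal)
    ):
        return Literal(expr.left.value + expr.right.value)
    return expr


# The expression x + (1 + 2) is rewritten to x + 3.
tree = Add(Attribute("x"), Add(Literal(1), Literal(2)))
folded = transform_up(tree, fold_constants)
```

Because the rule is a plain function from node to node, it stays independent of how the tree is eventually executed, which is the sense in which such a framework can be implementation-agnostic.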

Running ./sql/create-docs.sh generates SQL documentation for built-in functions under sql/site, and SQL configuration documentation that gets included as part of configuration.md in the main docs directory.