spark-instrumented-optimizer/R/pkg/inst/tests/testthat
Burak Yavuz 0d1bf2b6c8 [SPARK-18510] Fix data corruption from inferred partition column dataTypes
## What changes were proposed in this pull request?

### The Issue

If I specify my schema when doing
```scala
spark.read
  .schema(someSchemaWherePartitionColumnsAreStrings)
```
but partition inference infers that column as IntegerType (or, I assume, LongType or DoubleType; basically any fixed-size type), then once UnsafeRows are generated, your data will be corrupted.
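
To make the failure mode concrete, here is a minimal spark-shell style sketch (the `spark` session is assumed, and the path, column names, and values are hypothetical; the corruption only occurs with the pre-fix behavior):

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// The user wants the partition column kept as a string
// (e.g. to preserve leading zeros in values like "00", "01").
val userSchema = StructType(Seq(
  StructField("id", LongType),
  StructField("part", StringType)
))

// Write data partitioned by "part"; directories become part=00, part=01.
spark.range(10)
  .withColumn("part", format_string("%02d", col("id") % 2))
  .write.partitionBy("part").parquet("/tmp/partitioned_table")

// Partition inference sees part=00 / part=01 and infers IntegerType.
// Before this fix, the inferred type silently overrode the user's StringType,
// and the resulting fixed-size layout corrupted the generated UnsafeRows.
val df = spark.read.schema(userSchema).parquet("/tmp/partitioned_table")
df.show()
```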

### Proposed solution

The partition handling code path is kind of a mess. My fix probably adds to the mess, but at least it tries to standardize the code path.

The real issue is that a user who goes through the `spark.read` code path can never clearly specify what the partition columns are. If the user tries to specify the fields in `schema`, we practically ignore what they provide and fall back to our inferred data types. The end result is data corruption.

My solution is to always infer the partition columns the first time the table is specified. Once we know what the partition columns are, we look them up in the user-specified schema and use the dataType provided there, or fall back to the smallest common data type.

We will ALWAYS append partition columns to the user's schema, even if they didn't ask for them. We will only use the data type they provided if they specified it. While this is confusing, it has been the behavior since Spark 1.6, and I didn't want to change it during the QA period of Spark 2.1. We may revisit this decision later.
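
For illustration, a rough Scala sketch of the reconciliation rule described above; this is not the actual Spark code, and `combineSchemas` is a made-up helper name:

```scala
import org.apache.spark.sql.types.{StructField, StructType}

// Sketch only: `userSchema` is what the caller passed to .schema(),
// `inferredPartitionSchema` is what partition discovery found on disk.
def combineSchemas(userSchema: StructType, inferredPartitionSchema: StructType): StructType = {
  val partitionFields = inferredPartitionSchema.map { inferred =>
    // Prefer the user-specified type for a partition column if one exists,
    // otherwise fall back to the inferred (smallest common) type.
    userSchema.find(_.name == inferred.name)
      .map(user => inferred.copy(dataType = user.dataType))
      .getOrElse(inferred)
  }
  // Data columns come first; partition columns are always appended,
  // matching the behavior kept since Spark 1.6.
  val dataFields = userSchema.filterNot(f => inferredPartitionSchema.fieldNames.contains(f.name))
  StructType(dataFields ++ partitionFields)
}
```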

A side effect of this PR is that, if it goes in, we won't need https://github.com/apache/spark/pull/15942.

## How was this patch tested?

Regression tests

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #15951 from brkyvz/partition-corruption.
2016-11-23 11:48:59 -08:00
jarTest.R [SPARK-10683][SPARK-16510][SPARKR] Move SparkR include jar test to SparkSubmitSuite 2016-07-19 19:28:08 -07:00
packageInAJarTest.R [SPARKR][MINOR] R examples and test updates 2016-07-13 13:33:34 -07:00
test_binary_function.R [SPARK-16519][SPARKR] Handle SparkR RDD generics that create warnings in R CMD check 2016-08-16 11:19:18 -07:00
test_binaryFile.R [SPARK-16519][SPARKR] Handle SparkR RDD generics that create warnings in R CMD check 2016-08-16 11:19:18 -07:00
test_broadcast.R [SPARK-16519][SPARKR] Handle SparkR RDD generics that create warnings in R CMD check 2016-08-16 11:19:18 -07:00
test_client.R [MINOR] [SPARKR] Update data-manipulation.R to use native csv reader 2016-05-09 09:58:36 -07:00
test_context.R [SPARK-17577][FOLLOW-UP][SPARKR] SparkR spark.addFile supports adding directory recursively 2016-09-26 16:47:57 -07:00
test_includePackage.R [SPARK-16519][SPARKR] Handle SparkR RDD generics that create warnings in R CMD check 2016-08-16 11:19:18 -07:00
test_jvm_api.R [SPARK-16581][SPARKR] Fix JVM API tests in SparkR 2016-08-31 16:56:41 -07:00
test_mllib.R [SPARK-18501][ML][SPARKR] Fix spark.glm errors when fitting on collinear data 2016-11-22 19:17:48 -08:00
test_parallelize_collect.R [SPARK-16519][SPARKR] Handle SparkR RDD generics that create warnings in R CMD check 2016-08-16 11:19:18 -07:00
test_rdd.R [SPARK-16519][SPARKR] Handle SparkR RDD generics that create warnings in R CMD check 2016-08-16 11:19:18 -07:00
test_Serde.R [SPARK-16027][SPARKR] Fix R tests SparkSession init/stop 2016-07-17 19:02:21 -07:00
test_shuffle.R [SPARK-16519][SPARKR] Handle SparkR RDD generics that create warnings in R CMD check 2016-08-16 11:19:18 -07:00
test_sparkR.R [SPARK-18444][SPARKR] SparkR running in yarn-cluster mode should not download Spark package. 2016-11-22 00:05:30 -08:00
test_sparkSQL.R [SPARK-18510] Fix data corruption from inferred partition column dataTypes 2016-11-23 11:48:59 -08:00
test_take.R [SPARK-16519][SPARKR] Handle SparkR RDD generics that create warnings in R CMD check 2016-08-16 11:19:18 -07:00
test_textFile.R [SPARK-16519][SPARKR] Handle SparkR RDD generics that create warnings in R CMD check 2016-08-16 11:19:18 -07:00
test_utils.R [SPARK-17838][SPARKR] Check named arguments for options and use formatted R friendly message from JVM exception message 2016-11-01 22:14:53 -07:00
test_Windows.R [SPARK-8603][SPARKR] Use shell() instead of system2() for SparkR on Windows 2016-05-26 20:55:06 -07:00