[SPARK-13665][SQL] Separate the concerns of HadoopFsRelation
`HadoopFsRelation` is used for reading most files into Spark SQL. However today this class mixes the concerns of file management, schema reconciliation, scan building, bucketing, partitioning, and writing data. As a result, many data sources are forced to reimplement the same functionality and the various layers have accumulated a fair bit of inefficiency. This PR is a first cut at separating this into several components / interfaces that are each described below. Additionally, all implementations inside of Spark (parquet, csv, json, text, orc, svmlib) have been ported to the new API `FileFormat`. External libraries, such as spark-avro will also need to be ported to work with Spark 2.0. ### HadoopFsRelation A simple `case class` that acts as a container for all of the metadata required to read from a datasource. All discovery, resolution and merging logic for schemas and partitions has been removed. This an internal representation that no longer needs to be exposed to developers. ```scala case class HadoopFsRelation( sqlContext: SQLContext, location: FileCatalog, partitionSchema: StructType, dataSchema: StructType, bucketSpec: Option[BucketSpec], fileFormat: FileFormat, options: Map[String, String]) extends BaseRelation ``` ### FileFormat The primary interface that will be implemented by each different format including external libraries. Implementors are responsible for reading a given format and converting it into `InternalRow` as well as writing out an `InternalRow`. A format can optionally return a schema that is inferred from a set of files. ```scala trait FileFormat { def inferSchema( sqlContext: SQLContext, options: Map[String, String], files: Seq[FileStatus]): Option[StructType] def prepareWrite( sqlContext: SQLContext, job: Job, options: Map[String, String], dataSchema: StructType): OutputWriterFactory def buildInternalScan( sqlContext: SQLContext, dataSchema: StructType, requiredColumns: Array[String], filters: Array[Filter], bucketSet: Option[BitSet], inputFiles: Array[FileStatus], broadcastedConf: Broadcast[SerializableConfiguration], options: Map[String, String]): RDD[InternalRow] } ``` The current interface is based on what was required to get all the tests passing again, but still mixes a couple of concerns (i.e. `bucketSet` is passed down to the scan instead of being resolved by the planner). Additionally, scans are still returning `RDD`s instead of iterators for single files. In a future PR, bucketing should be removed from this interface and the scan should be isolated to a single file. ### FileCatalog This interface is used to list the files that make up a given relation, as well as handle directory based partitioning. ```scala trait FileCatalog { def paths: Seq[Path] def partitionSpec(schema: Option[StructType]): PartitionSpec def allFiles(): Seq[FileStatus] def getStatus(path: Path): Array[FileStatus] def refresh(): Unit } ``` Currently there are two implementations: - `HDFSFileCatalog` - based on code from the old `HadoopFsRelation`. Infers partitioning by recursive listing and caches this data for performance - `HiveFileCatalog` - based on the above, but it uses the partition spec from the Hive Metastore. ### ResolvedDataSource Produces a logical plan given the following description of a Data Source (which can come from DataFrameReader or a metastore): - `paths: Seq[String] = Nil` - `userSpecifiedSchema: Option[StructType] = None` - `partitionColumns: Array[String] = Array.empty` - `bucketSpec: Option[BucketSpec] = None` - `provider: String` - `options: Map[String, String]` This class is responsible for deciding which of the Data Source APIs a given provider is using (including the non-file based ones). All reconciliation of partitions, buckets, schema from metastores or inference is done here. ### DataSourceAnalysis / DataSourceStrategy Responsible for analyzing and planning reading/writing of data using any of the Data Source APIs, including: - pruning the files from partitions that will be read based on filters. - appending partition columns* - applying additional filters when a data source can not evaluate them internally. - constructing an RDD that is bucketed correctly when required* - sanity checking schema match-up and other analysis when writing. *In the future we should do that following: - Break out file handling into its own Strategy as its sufficiently complex / isolated. - Push the appending of partition columns down in to `FileFormat` to avoid an extra copy / unvectorization. - Use a custom RDD for scans instead of `SQLNewNewHadoopRDD2` Author: Michael Armbrust <michael@databricks.com> Author: Wenchen Fan <wenchen@databricks.com> Closes #11509 from marmbrus/fileDataSource.
This commit is contained in:
parent
0eea12a3d9
commit
e720dda42e
|
@ -54,7 +54,8 @@ private[spark] abstract class ZippedPartitionsBaseRDD[V: ClassTag](
|
|||
override def getPartitions: Array[Partition] = {
|
||||
val numParts = rdds.head.partitions.length
|
||||
if (!rdds.forall(rdd => rdd.partitions.length == numParts)) {
|
||||
throw new IllegalArgumentException("Can't zip RDDs with unequal numbers of partitions")
|
||||
throw new IllegalArgumentException(
|
||||
s"Can't zip RDDs with unequal numbers of partitions: ${rdds.map(_.partitions.length)}")
|
||||
}
|
||||
Array.tabulate[Partition](numParts) { i =>
|
||||
val prefs = rdds.map(rdd => rdd.preferredLocations(rdd.partitions(i)))
|
||||
|
|
|
@ -19,74 +19,23 @@ package org.apache.spark.ml.source.libsvm
|
|||
|
||||
import java.io.IOException
|
||||
|
||||
import com.google.common.base.Objects
|
||||
import org.apache.hadoop.fs.{FileStatus, Path}
|
||||
import org.apache.hadoop.io.{NullWritable, Text}
|
||||
import org.apache.hadoop.mapreduce.{RecordWriter, TaskAttemptContext}
|
||||
import org.apache.hadoop.mapreduce.{Job, RecordWriter, TaskAttemptContext}
|
||||
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
|
||||
|
||||
import org.apache.spark.annotation.Since
|
||||
import org.apache.spark.broadcast.Broadcast
|
||||
import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
|
||||
import org.apache.spark.mllib.util.MLUtils
|
||||
import org.apache.spark.rdd.RDD
|
||||
import org.apache.spark.sql.{DataFrame, DataFrameReader, Row, SQLContext}
|
||||
import org.apache.spark.sql.catalyst.InternalRow
|
||||
import org.apache.spark.sql.catalyst.encoders.RowEncoder
|
||||
import org.apache.spark.sql.sources._
|
||||
import org.apache.spark.sql.types._
|
||||
|
||||
/**
|
||||
* LibSVMRelation provides the DataFrame constructed from LibSVM format data.
|
||||
* @param path File path of LibSVM format
|
||||
* @param numFeatures The number of features
|
||||
* @param vectorType The type of vector. It can be 'sparse' or 'dense'
|
||||
* @param sqlContext The Spark SQLContext
|
||||
*/
|
||||
private[libsvm] class LibSVMRelation(val path: String, val numFeatures: Int, val vectorType: String)
|
||||
(@transient val sqlContext: SQLContext)
|
||||
extends HadoopFsRelation with Serializable {
|
||||
|
||||
override def buildScan(requiredColumns: Array[String], inputFiles: Array[FileStatus])
|
||||
: RDD[Row] = {
|
||||
val sc = sqlContext.sparkContext
|
||||
val baseRdd = MLUtils.loadLibSVMFile(sc, path, numFeatures)
|
||||
val sparse = vectorType == "sparse"
|
||||
baseRdd.map { pt =>
|
||||
val features = if (sparse) pt.features.toSparse else pt.features.toDense
|
||||
Row(pt.label, features)
|
||||
}
|
||||
}
|
||||
|
||||
override def hashCode(): Int = {
|
||||
Objects.hashCode(path, Double.box(numFeatures), vectorType)
|
||||
}
|
||||
|
||||
override def equals(other: Any): Boolean = other match {
|
||||
case that: LibSVMRelation =>
|
||||
path == that.path &&
|
||||
numFeatures == that.numFeatures &&
|
||||
vectorType == that.vectorType
|
||||
case _ =>
|
||||
false
|
||||
}
|
||||
|
||||
override def prepareJobForWrite(job: _root_.org.apache.hadoop.mapreduce.Job):
|
||||
_root_.org.apache.spark.sql.sources.OutputWriterFactory = {
|
||||
new OutputWriterFactory {
|
||||
override def newInstance(
|
||||
path: String,
|
||||
dataSchema: StructType,
|
||||
context: TaskAttemptContext): OutputWriter = {
|
||||
new LibSVMOutputWriter(path, dataSchema, context)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
override def paths: Array[String] = Array(path)
|
||||
|
||||
override def dataSchema: StructType = StructType(
|
||||
StructField("label", DoubleType, nullable = false) ::
|
||||
StructField("features", new VectorUDT(), nullable = false) :: Nil)
|
||||
}
|
||||
|
||||
import org.apache.spark.util.SerializableConfiguration
|
||||
import org.apache.spark.util.collection.BitSet
|
||||
|
||||
private[libsvm] class LibSVMOutputWriter(
|
||||
path: String,
|
||||
|
@ -124,6 +73,7 @@ private[libsvm] class LibSVMOutputWriter(
|
|||
recordWriter.close(context)
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* `libsvm` package implements Spark SQL data source API for loading LIBSVM data as [[DataFrame]].
|
||||
* The loaded [[DataFrame]] has two columns: `label` containing labels stored as doubles and
|
||||
|
@ -155,7 +105,7 @@ private[libsvm] class LibSVMOutputWriter(
|
|||
* @see [[https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/ LIBSVM datasets]]
|
||||
*/
|
||||
@Since("1.6.0")
|
||||
class DefaultSource extends HadoopFsRelationProvider with DataSourceRegister {
|
||||
class DefaultSource extends FileFormat with DataSourceRegister {
|
||||
|
||||
@Since("1.6.0")
|
||||
override def shortName(): String = "libsvm"
|
||||
|
@ -167,22 +117,63 @@ class DefaultSource extends HadoopFsRelationProvider with DataSourceRegister {
|
|||
throw new IOException(s"Illegal schema for libsvm data, schema=${dataSchema}")
|
||||
}
|
||||
}
|
||||
|
||||
override def createRelation(
|
||||
override def inferSchema(
|
||||
sqlContext: SQLContext,
|
||||
paths: Array[String],
|
||||
dataSchema: Option[StructType],
|
||||
partitionColumns: Option[StructType],
|
||||
parameters: Map[String, String]): HadoopFsRelation = {
|
||||
val path = if (paths.length == 1) paths(0)
|
||||
else if (paths.isEmpty) throw new IOException("No input path specified for libsvm data")
|
||||
else throw new IOException("Multiple input paths are not supported for libsvm data")
|
||||
if (partitionColumns.isDefined && !partitionColumns.get.isEmpty) {
|
||||
throw new IOException("Partition is not supported for libsvm data")
|
||||
options: Map[String, String],
|
||||
files: Seq[FileStatus]): Option[StructType] = {
|
||||
Some(
|
||||
StructType(
|
||||
StructField("label", DoubleType, nullable = false) ::
|
||||
StructField("features", new VectorUDT(), nullable = false) :: Nil))
|
||||
}
|
||||
|
||||
override def prepareWrite(
|
||||
sqlContext: SQLContext,
|
||||
job: Job,
|
||||
options: Map[String, String],
|
||||
dataSchema: StructType): OutputWriterFactory = {
|
||||
new OutputWriterFactory {
|
||||
override def newInstance(
|
||||
path: String,
|
||||
bucketId: Option[Int],
|
||||
dataSchema: StructType,
|
||||
context: TaskAttemptContext): OutputWriter = {
|
||||
if (bucketId.isDefined) { sys.error("LibSVM doesn't support bucketing") }
|
||||
new LibSVMOutputWriter(path, dataSchema, context)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
override def buildInternalScan(
|
||||
sqlContext: SQLContext,
|
||||
dataSchema: StructType,
|
||||
requiredColumns: Array[String],
|
||||
filters: Array[Filter],
|
||||
bucketSet: Option[BitSet],
|
||||
inputFiles: Array[FileStatus],
|
||||
broadcastedConf: Broadcast[SerializableConfiguration],
|
||||
options: Map[String, String]): RDD[InternalRow] = {
|
||||
// TODO: This does not handle cases where column pruning has been performed.
|
||||
|
||||
verifySchema(dataSchema)
|
||||
val dataFiles = inputFiles.filterNot(_.getPath.getName startsWith "_")
|
||||
|
||||
val path = if (dataFiles.length == 1) dataFiles(0).getPath.toUri.toString
|
||||
else if (dataFiles.isEmpty) throw new IOException("No input path specified for libsvm data")
|
||||
else throw new IOException("Multiple input paths are not supported for libsvm data.")
|
||||
|
||||
val numFeatures = options.getOrElse("numFeatures", "-1").toInt
|
||||
val vectorType = options.getOrElse("vectorType", "sparse")
|
||||
|
||||
val sc = sqlContext.sparkContext
|
||||
val baseRdd = MLUtils.loadLibSVMFile(sc, path, numFeatures)
|
||||
val sparse = vectorType == "sparse"
|
||||
baseRdd.map { pt =>
|
||||
val features = if (sparse) pt.features.toSparse else pt.features.toDense
|
||||
Row(pt.label, features)
|
||||
}.mapPartitions { externalRows =>
|
||||
val converter = RowEncoder(dataSchema)
|
||||
externalRows.map(converter.toRow)
|
||||
}
|
||||
dataSchema.foreach(verifySchema(_))
|
||||
val numFeatures = parameters.getOrElse("numFeatures", "-1").toInt
|
||||
val vectorType = parameters.getOrElse("vectorType", "sparse")
|
||||
new LibSVMRelation(path, numFeatures, vectorType)(sqlContext)
|
||||
}
|
||||
}
|
||||
|
|
|
@ -22,7 +22,7 @@ import java.io.{File, IOException}
|
|||
import com.google.common.base.Charsets
|
||||
import com.google.common.io.Files
|
||||
|
||||
import org.apache.spark.SparkFunSuite
|
||||
import org.apache.spark.{SparkException, SparkFunSuite}
|
||||
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vectors}
|
||||
import org.apache.spark.mllib.util.MLlibTestSparkContext
|
||||
import org.apache.spark.sql.SaveMode
|
||||
|
@ -88,7 +88,8 @@ class LibSVMRelationSuite extends SparkFunSuite with MLlibTestSparkContext {
|
|||
val df = sqlContext.read.format("libsvm").load(path)
|
||||
val tempDir2 = Utils.createTempDir()
|
||||
val writepath = tempDir2.toURI.toString
|
||||
df.write.format("libsvm").mode(SaveMode.Overwrite).save(writepath)
|
||||
// TODO: Remove requirement to coalesce by supporting mutiple reads.
|
||||
df.coalesce(1).write.format("libsvm").mode(SaveMode.Overwrite).save(writepath)
|
||||
|
||||
val df2 = sqlContext.read.format("libsvm").load(writepath)
|
||||
val row1 = df2.first()
|
||||
|
@ -98,9 +99,8 @@ class LibSVMRelationSuite extends SparkFunSuite with MLlibTestSparkContext {
|
|||
|
||||
test("write libsvm data failed due to invalid schema") {
|
||||
val df = sqlContext.read.format("text").load(path)
|
||||
val e = intercept[IOException] {
|
||||
val e = intercept[SparkException] {
|
||||
df.write.format("libsvm").save(path + "_2")
|
||||
}
|
||||
assert(e.getMessage.contains("Illegal schema for libsvm data"))
|
||||
}
|
||||
}
|
||||
|
|
|
@ -60,7 +60,11 @@ object MimaExcludes {
|
|||
ProblemFilters.exclude[MissingMethodProblem]("org.apache.spark.sql.SQLContext.jsonRDD"),
|
||||
ProblemFilters.exclude[MissingMethodProblem]("org.apache.spark.sql.SQLContext.load"),
|
||||
ProblemFilters.exclude[MissingMethodProblem]("org.apache.spark.sql.SQLContext.dialectClassName"),
|
||||
ProblemFilters.exclude[MissingMethodProblem]("org.apache.spark.sql.SQLContext.getSQLDialect")
|
||||
ProblemFilters.exclude[MissingMethodProblem]("org.apache.spark.sql.SQLContext.getSQLDialect"),
|
||||
// SPARK-13664 Replace HadoopFsRelation with FileFormat
|
||||
ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.ml.source.libsvm.LibSVMRelation"),
|
||||
ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.sources.HadoopFsRelationProvider"),
|
||||
ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache")
|
||||
) ++ Seq(
|
||||
ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.SparkContext.emptyRDD"),
|
||||
ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.broadcast.HttpBroadcastFactory")
|
||||
|
|
|
@ -103,7 +103,7 @@ object DataType {
|
|||
|
||||
/** Given the string representation of a type, return its DataType */
|
||||
private def nameToType(name: String): DataType = {
|
||||
val FIXED_DECIMAL = """decimal\(\s*(\d+)\s*,\s*(\d+)\s*\)""".r
|
||||
val FIXED_DECIMAL = """decimal\(\s*(\d+)\s*,\s*(\-?\d+)\s*\)""".r
|
||||
name match {
|
||||
case "decimal" => DecimalType.USER_DEFAULT
|
||||
case FIXED_DECIMAL(precision, scale) => DecimalType(precision.toInt, scale.toInt)
|
||||
|
|
|
@ -21,18 +21,14 @@ import java.util.Properties
|
|||
|
||||
import scala.collection.JavaConverters._
|
||||
|
||||
import org.apache.hadoop.fs.Path
|
||||
import org.apache.hadoop.util.StringUtils
|
||||
|
||||
import org.apache.spark.{Logging, Partition}
|
||||
import org.apache.spark.annotation.Experimental
|
||||
import org.apache.spark.api.java.JavaRDD
|
||||
import org.apache.spark.deploy.SparkHadoopUtil
|
||||
import org.apache.spark.rdd.RDD
|
||||
import org.apache.spark.sql.execution.LogicalRDD
|
||||
import org.apache.spark.sql.execution.datasources.{LogicalRelation, ResolvedDataSource}
|
||||
import org.apache.spark.sql.execution.datasources.jdbc.{JDBCPartition, JDBCPartitioningInfo, JDBCRelation}
|
||||
import org.apache.spark.sql.execution.datasources.json.JSONRelation
|
||||
import org.apache.spark.sql.execution.datasources.parquet.ParquetRelation
|
||||
import org.apache.spark.sql.execution.datasources.json.{InferSchema, JacksonParser, JSONOptions}
|
||||
import org.apache.spark.sql.execution.streaming.StreamingRelation
|
||||
import org.apache.spark.sql.types.StructType
|
||||
|
||||
|
@ -129,8 +125,6 @@ class DataFrameReader private[sql](sqlContext: SQLContext) extends Logging {
|
|||
val resolved = ResolvedDataSource(
|
||||
sqlContext,
|
||||
userSpecifiedSchema = userSpecifiedSchema,
|
||||
partitionColumns = Array.empty[String],
|
||||
bucketSpec = None,
|
||||
provider = source,
|
||||
options = extraOptions.toMap)
|
||||
DataFrame(sqlContext, LogicalRelation(resolved.relation))
|
||||
|
@ -154,7 +148,17 @@ class DataFrameReader private[sql](sqlContext: SQLContext) extends Logging {
|
|||
*/
|
||||
@scala.annotation.varargs
|
||||
def load(paths: String*): DataFrame = {
|
||||
option("paths", paths.map(StringUtils.escapeString(_, '\\', ',')).mkString(",")).load()
|
||||
if (paths.isEmpty) {
|
||||
sqlContext.emptyDataFrame
|
||||
} else {
|
||||
sqlContext.baseRelationToDataFrame(
|
||||
ResolvedDataSource.apply(
|
||||
sqlContext,
|
||||
paths = paths,
|
||||
userSpecifiedSchema = userSpecifiedSchema,
|
||||
provider = source,
|
||||
options = extraOptions.toMap).relation)
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
|
@ -334,14 +338,20 @@ class DataFrameReader private[sql](sqlContext: SQLContext) extends Logging {
|
|||
* @since 1.4.0
|
||||
*/
|
||||
def json(jsonRDD: RDD[String]): DataFrame = {
|
||||
sqlContext.baseRelationToDataFrame(
|
||||
new JSONRelation(
|
||||
Some(jsonRDD),
|
||||
maybeDataSchema = userSpecifiedSchema,
|
||||
maybePartitionSpec = None,
|
||||
userDefinedPartitionColumns = None,
|
||||
parameters = extraOptions.toMap)(sqlContext)
|
||||
)
|
||||
val parsedOptions: JSONOptions = new JSONOptions(extraOptions.toMap)
|
||||
val schema = userSpecifiedSchema.getOrElse {
|
||||
InferSchema.infer(jsonRDD, sqlContext.conf.columnNameOfCorruptRecord, parsedOptions)
|
||||
}
|
||||
|
||||
new DataFrame(
|
||||
sqlContext,
|
||||
LogicalRDD(
|
||||
schema.toAttributes,
|
||||
JacksonParser.parse(
|
||||
jsonRDD,
|
||||
schema,
|
||||
sqlContext.conf.columnNameOfCorruptRecord,
|
||||
parsedOptions))(sqlContext))
|
||||
}
|
||||
|
||||
/**
|
||||
|
@ -363,20 +373,7 @@ class DataFrameReader private[sql](sqlContext: SQLContext) extends Logging {
|
|||
*/
|
||||
@scala.annotation.varargs
|
||||
def parquet(paths: String*): DataFrame = {
|
||||
if (paths.isEmpty) {
|
||||
sqlContext.emptyDataFrame
|
||||
} else {
|
||||
val globbedPaths = paths.flatMap { path =>
|
||||
val hdfsPath = new Path(path)
|
||||
val fs = hdfsPath.getFileSystem(sqlContext.sparkContext.hadoopConfiguration)
|
||||
val qualified = hdfsPath.makeQualified(fs.getUri, fs.getWorkingDirectory)
|
||||
SparkHadoopUtil.get.globPathIfNecessary(qualified)
|
||||
}.toArray
|
||||
|
||||
sqlContext.baseRelationToDataFrame(
|
||||
new ParquetRelation(
|
||||
globbedPaths.map(_.toString), userSpecifiedSchema, None, extraOptions.toMap)(sqlContext))
|
||||
}
|
||||
format("parquet").load(paths: _*)
|
||||
}
|
||||
|
||||
/**
|
||||
|
|
|
@ -366,13 +366,6 @@ final class DataFrameWriter private[sql](df: DataFrame) {
|
|||
case (true, SaveMode.ErrorIfExists) =>
|
||||
throw new AnalysisException(s"Table $tableIdent already exists.")
|
||||
|
||||
case (true, SaveMode.Append) =>
|
||||
// If it is Append, we just ask insertInto to handle it. We will not use insertInto
|
||||
// to handle saveAsTable with Overwrite because saveAsTable can change the schema of
|
||||
// the table. But, insertInto with Overwrite requires the schema of data be the same
|
||||
// the schema of the table.
|
||||
insertInto(tableIdent)
|
||||
|
||||
case _ =>
|
||||
val cmd =
|
||||
CreateTableUsingAsSelect(
|
||||
|
|
|
@ -25,7 +25,7 @@ import org.apache.spark.sql.catalyst.expressions._
|
|||
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext
|
||||
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Statistics}
|
||||
import org.apache.spark.sql.catalyst.plans.physical.{HashPartitioning, Partitioning, UnknownPartitioning}
|
||||
import org.apache.spark.sql.execution.datasources.parquet.ParquetRelation
|
||||
import org.apache.spark.sql.execution.datasources.parquet.{DefaultSource => ParquetSource}
|
||||
import org.apache.spark.sql.execution.metric.SQLMetrics
|
||||
import org.apache.spark.sql.internal.SQLConf
|
||||
import org.apache.spark.sql.sources.{BaseRelation, HadoopFsRelation}
|
||||
|
@ -226,16 +226,17 @@ private[sql] object PhysicalRDD {
|
|||
rdd: RDD[InternalRow],
|
||||
relation: BaseRelation,
|
||||
metadata: Map[String, String] = Map.empty): PhysicalRDD = {
|
||||
val outputUnsafeRows = if (relation.isInstanceOf[ParquetRelation]) {
|
||||
// The vectorized parquet reader does not produce unsafe rows.
|
||||
!SQLContext.getActive().get.conf.getConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED)
|
||||
} else {
|
||||
// All HadoopFsRelations output UnsafeRows
|
||||
relation.isInstanceOf[HadoopFsRelation]
|
||||
|
||||
val outputUnsafeRows = relation match {
|
||||
case r: HadoopFsRelation if r.fileFormat.isInstanceOf[ParquetSource] =>
|
||||
!SQLContext.getActive().get.conf.getConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED)
|
||||
case _: HadoopFsRelation => true
|
||||
case _ => false
|
||||
}
|
||||
|
||||
val bucketSpec = relation match {
|
||||
case r: HadoopFsRelation => r.getBucketSpec
|
||||
// TODO: this should be closer to bucket planning.
|
||||
case r: HadoopFsRelation if r.sqlContext.conf.bucketingEnabled() => r.bucketSpec
|
||||
case _ => None
|
||||
}
|
||||
|
||||
|
|
|
@ -25,12 +25,14 @@ import org.apache.spark.rdd.{MapPartitionsRDD, RDD, UnionRDD}
|
|||
import org.apache.spark.sql._
|
||||
import org.apache.spark.sql.catalyst.{CatalystTypeConverters, InternalRow}
|
||||
import org.apache.spark.sql.catalyst.CatalystTypeConverters.convertToScala
|
||||
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
|
||||
import org.apache.spark.sql.catalyst.expressions
|
||||
import org.apache.spark.sql.catalyst.expressions._
|
||||
import org.apache.spark.sql.catalyst.planning.PhysicalOperation
|
||||
import org.apache.spark.sql.catalyst.plans.logical
|
||||
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
|
||||
import org.apache.spark.sql.catalyst.plans.physical.HashPartitioning
|
||||
import org.apache.spark.sql.catalyst.rules.Rule
|
||||
import org.apache.spark.sql.execution.PhysicalRDD.{INPUT_PATHS, PUSHED_FILTERS}
|
||||
import org.apache.spark.sql.execution.SparkPlan
|
||||
import org.apache.spark.sql.execution.command.ExecutedCommand
|
||||
|
@ -41,6 +43,45 @@ import org.apache.spark.unsafe.types.UTF8String
|
|||
import org.apache.spark.util.{SerializableConfiguration, Utils}
|
||||
import org.apache.spark.util.collection.BitSet
|
||||
|
||||
/**
|
||||
* Replaces generic operations with specific variants that are designed to work with Spark
|
||||
* SQL Data Sources.
|
||||
*/
|
||||
private[sql] object DataSourceAnalysis extends Rule[LogicalPlan] {
|
||||
override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
|
||||
case i @ logical.InsertIntoTable(
|
||||
l @ LogicalRelation(t: HadoopFsRelation, _, _), part, query, overwrite, false)
|
||||
if query.resolved && t.schema.asNullable == query.schema.asNullable =>
|
||||
|
||||
// Sanity checks
|
||||
if (t.location.paths.size != 1) {
|
||||
throw new AnalysisException(
|
||||
"Can only write data to relations with a single path.")
|
||||
}
|
||||
|
||||
val outputPath = t.location.paths.head
|
||||
val inputPaths = query.collect {
|
||||
case LogicalRelation(r: HadoopFsRelation, _, _) => r.location.paths
|
||||
}.flatten
|
||||
|
||||
val mode = if (overwrite) SaveMode.Overwrite else SaveMode.Append
|
||||
if (overwrite && inputPaths.contains(outputPath)) {
|
||||
throw new AnalysisException(
|
||||
"Cannot overwrite a path that is also being read from.")
|
||||
}
|
||||
|
||||
InsertIntoHadoopFsRelation(
|
||||
outputPath,
|
||||
t.partitionSchema.fields.map(_.name).map(UnresolvedAttribute(_)),
|
||||
t.bucketSpec,
|
||||
t.fileFormat,
|
||||
() => t.refresh(),
|
||||
t.options,
|
||||
query,
|
||||
mode)
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* A Strategy for planning scans over data sources defined using the sources API.
|
||||
*/
|
||||
|
@ -70,10 +111,10 @@ private[sql] object DataSourceStrategy extends Strategy with Logging {
|
|||
|
||||
// Scanning partitioned HadoopFsRelation
|
||||
case PhysicalOperation(projects, filters, l @ LogicalRelation(t: HadoopFsRelation, _, _))
|
||||
if t.partitionSpec.partitionColumns.nonEmpty =>
|
||||
if t.partitionSchema.nonEmpty =>
|
||||
// We divide the filter expressions into 3 parts
|
||||
val partitionColumns = AttributeSet(
|
||||
t.partitionColumns.map(c => l.output.find(_.name == c.name).get))
|
||||
t.partitionSchema.map(c => l.output.find(_.name == c.name).get))
|
||||
|
||||
// Only pruning the partition keys
|
||||
val partitionFilters = filters.filter(_.references.subsetOf(partitionColumns))
|
||||
|
@ -104,15 +145,15 @@ private[sql] object DataSourceStrategy extends Strategy with Logging {
|
|||
|
||||
// Prune the buckets based on the pushed filters that do not contain partitioning key
|
||||
// since the bucketing key is not allowed to use the columns in partitioning key
|
||||
val bucketSet = getBuckets(pushedFilters, t.getBucketSpec)
|
||||
|
||||
val bucketSet = getBuckets(pushedFilters, t.bucketSpec)
|
||||
val scan = buildPartitionedTableScan(
|
||||
l,
|
||||
partitionAndNormalColumnProjs,
|
||||
pushedFilters,
|
||||
bucketSet,
|
||||
t.partitionSpec.partitionColumns,
|
||||
selectedPartitions)
|
||||
selectedPartitions,
|
||||
t.options)
|
||||
|
||||
// Add a Projection to guarantee the original projection:
|
||||
// this is because "partitionAndNormalColumnAttrs" may be different
|
||||
|
@ -127,6 +168,9 @@ private[sql] object DataSourceStrategy extends Strategy with Logging {
|
|||
}
|
||||
).getOrElse(scan) :: Nil
|
||||
|
||||
// TODO: The code for planning bucketed/unbucketed/partitioned/unpartitioned tables contains
|
||||
// a lot of duplication and produces overly complicated RDDs.
|
||||
|
||||
// Scanning non-partitioned HadoopFsRelation
|
||||
case PhysicalOperation(projects, filters, l @ LogicalRelation(t: HadoopFsRelation, _, _)) =>
|
||||
// See buildPartitionedTableScan for the reason that we need to create a shard
|
||||
|
@ -134,14 +178,65 @@ private[sql] object DataSourceStrategy extends Strategy with Logging {
|
|||
val sharedHadoopConf = SparkHadoopUtil.get.conf
|
||||
val confBroadcast =
|
||||
t.sqlContext.sparkContext.broadcast(new SerializableConfiguration(sharedHadoopConf))
|
||||
// Prune the buckets based on the filters
|
||||
val bucketSet = getBuckets(filters, t.getBucketSpec)
|
||||
pruneFilterProject(
|
||||
l,
|
||||
projects,
|
||||
filters,
|
||||
(a, f) =>
|
||||
t.buildInternalScan(a.map(_.name).toArray, f, bucketSet, t.paths, confBroadcast)) :: Nil
|
||||
|
||||
t.bucketSpec match {
|
||||
case Some(spec) if t.sqlContext.conf.bucketingEnabled() =>
|
||||
val scanBuilder: (Seq[Attribute], Array[Filter]) => RDD[InternalRow] = {
|
||||
(requiredColumns: Seq[Attribute], filters: Array[Filter]) => {
|
||||
val bucketed =
|
||||
t.location
|
||||
.allFiles()
|
||||
.filterNot(_.getPath.getName startsWith "_")
|
||||
.groupBy { f =>
|
||||
BucketingUtils
|
||||
.getBucketId(f.getPath.getName)
|
||||
.getOrElse(sys.error(s"Invalid bucket file ${f.getPath}"))
|
||||
}
|
||||
|
||||
val bucketedDataMap = bucketed.mapValues { bucketFiles =>
|
||||
t.fileFormat.buildInternalScan(
|
||||
t.sqlContext,
|
||||
t.dataSchema,
|
||||
requiredColumns.map(_.name).toArray,
|
||||
filters,
|
||||
None,
|
||||
bucketFiles.toArray,
|
||||
confBroadcast,
|
||||
t.options).coalesce(1)
|
||||
}
|
||||
|
||||
val bucketedRDD = new UnionRDD(t.sqlContext.sparkContext,
|
||||
(0 until spec.numBuckets).map { bucketId =>
|
||||
bucketedDataMap.get(bucketId).getOrElse {
|
||||
t.sqlContext.emptyResult: RDD[InternalRow]
|
||||
}
|
||||
})
|
||||
bucketedRDD
|
||||
}
|
||||
}
|
||||
|
||||
pruneFilterProject(
|
||||
l,
|
||||
projects,
|
||||
filters,
|
||||
scanBuilder) :: Nil
|
||||
|
||||
case _ =>
|
||||
pruneFilterProject(
|
||||
l,
|
||||
projects,
|
||||
filters,
|
||||
(a, f) =>
|
||||
t.fileFormat.buildInternalScan(
|
||||
t.sqlContext,
|
||||
t.dataSchema,
|
||||
a.map(_.name).toArray,
|
||||
f,
|
||||
None,
|
||||
t.location.allFiles().toArray,
|
||||
confBroadcast,
|
||||
t.options)) :: Nil
|
||||
}
|
||||
|
||||
case l @ LogicalRelation(baseRelation: TableScan, _, _) =>
|
||||
execution.PhysicalRDD.createFromDataSource(
|
||||
|
@ -151,11 +246,6 @@ private[sql] object DataSourceStrategy extends Strategy with Logging {
|
|||
part, query, overwrite, false) if part.isEmpty =>
|
||||
ExecutedCommand(InsertIntoDataSource(l, query, overwrite)) :: Nil
|
||||
|
||||
case i @ logical.InsertIntoTable(
|
||||
l @ LogicalRelation(t: HadoopFsRelation, _, _), part, query, overwrite, false) =>
|
||||
val mode = if (overwrite) SaveMode.Overwrite else SaveMode.Append
|
||||
ExecutedCommand(InsertIntoHadoopFsRelation(t, query, mode)) :: Nil
|
||||
|
||||
case _ => Nil
|
||||
}
|
||||
|
||||
|
@ -165,7 +255,8 @@ private[sql] object DataSourceStrategy extends Strategy with Logging {
|
|||
filters: Seq[Expression],
|
||||
buckets: Option[BitSet],
|
||||
partitionColumns: StructType,
|
||||
partitions: Array[Partition]): SparkPlan = {
|
||||
partitions: Array[Partition],
|
||||
options: Map[String, String]): SparkPlan = {
|
||||
val relation = logicalRelation.relation.asInstanceOf[HadoopFsRelation]
|
||||
|
||||
// Because we are creating one RDD per partition, we need to have a shared HadoopConf.
|
||||
|
@ -177,36 +268,86 @@ private[sql] object DataSourceStrategy extends Strategy with Logging {
|
|||
|
||||
// Now, we create a scan builder, which will be used by pruneFilterProject. This scan builder
|
||||
// will union all partitions and attach partition values if needed.
|
||||
val scanBuilder = {
|
||||
val scanBuilder: (Seq[Attribute], Array[Filter]) => RDD[InternalRow] = {
|
||||
(requiredColumns: Seq[Attribute], filters: Array[Filter]) => {
|
||||
val requiredDataColumns =
|
||||
requiredColumns.filterNot(c => partitionColumnNames.contains(c.name))
|
||||
|
||||
// Builds RDD[Row]s for each selected partition.
|
||||
val perPartitionRows = partitions.map { case Partition(partitionValues, dir) =>
|
||||
// Don't scan any partition columns to save I/O. Here we are being optimistic and
|
||||
// assuming partition columns data stored in data files are always consistent with those
|
||||
// partition values encoded in partition directory paths.
|
||||
val dataRows = relation.buildInternalScan(
|
||||
requiredDataColumns.map(_.name).toArray, filters, buckets, Array(dir), confBroadcast)
|
||||
relation.bucketSpec match {
|
||||
case Some(spec) if relation.sqlContext.conf.bucketingEnabled() =>
|
||||
val requiredDataColumns =
|
||||
requiredColumns.filterNot(c => partitionColumnNames.contains(c.name))
|
||||
|
||||
// Merges data values with partition values.
|
||||
mergeWithPartitionValues(
|
||||
requiredColumns,
|
||||
requiredDataColumns,
|
||||
partitionColumns,
|
||||
partitionValues,
|
||||
dataRows)
|
||||
}
|
||||
// Builds RDD[Row]s for each selected partition.
|
||||
val perPartitionRows: Seq[(Int, RDD[InternalRow])] = partitions.flatMap {
|
||||
case Partition(partitionValues, dir) =>
|
||||
val files = relation.location.getStatus(dir)
|
||||
val bucketed = files.groupBy { f =>
|
||||
BucketingUtils
|
||||
.getBucketId(f.getPath.getName)
|
||||
.getOrElse(sys.error(s"Invalid bucket file ${f.getPath}"))
|
||||
}
|
||||
|
||||
val unionedRows =
|
||||
if (perPartitionRows.length == 0) {
|
||||
relation.sqlContext.emptyResult
|
||||
} else {
|
||||
bucketed.map { bucketFiles =>
|
||||
// Don't scan any partition columns to save I/O. Here we are being optimistic and
|
||||
// assuming partition columns data stored in data files are always consistent with
|
||||
// those partition values encoded in partition directory paths.
|
||||
val dataRows = relation.fileFormat.buildInternalScan(
|
||||
relation.sqlContext,
|
||||
relation.dataSchema,
|
||||
requiredDataColumns.map(_.name).toArray,
|
||||
filters,
|
||||
buckets,
|
||||
bucketFiles._2,
|
||||
confBroadcast,
|
||||
options)
|
||||
|
||||
// Merges data values with partition values.
|
||||
bucketFiles._1 -> mergeWithPartitionValues(
|
||||
requiredColumns,
|
||||
requiredDataColumns,
|
||||
partitionColumns,
|
||||
partitionValues,
|
||||
dataRows)
|
||||
}
|
||||
}
|
||||
|
||||
val bucketedDataMap: Map[Int, Seq[RDD[InternalRow]]] =
|
||||
perPartitionRows.groupBy(_._1).mapValues(_.map(_._2))
|
||||
|
||||
val bucketed = new UnionRDD(relation.sqlContext.sparkContext,
|
||||
(0 until spec.numBuckets).map { bucketId =>
|
||||
bucketedDataMap.get(bucketId).map(i => i.reduce(_ ++ _).coalesce(1)).getOrElse {
|
||||
relation.sqlContext.emptyResult: RDD[InternalRow]
|
||||
}
|
||||
})
|
||||
bucketed
|
||||
|
||||
case _ =>
|
||||
val requiredDataColumns =
|
||||
requiredColumns.filterNot(c => partitionColumnNames.contains(c.name))
|
||||
|
||||
// Builds RDD[Row]s for each selected partition.
|
||||
val perPartitionRows = partitions.map {
|
||||
case Partition(partitionValues, dir) =>
|
||||
val dataRows = relation.fileFormat.buildInternalScan(
|
||||
relation.sqlContext,
|
||||
relation.dataSchema,
|
||||
requiredDataColumns.map(_.name).toArray,
|
||||
filters,
|
||||
buckets,
|
||||
relation.location.getStatus(dir),
|
||||
confBroadcast,
|
||||
options)
|
||||
|
||||
// Merges data values with partition values.
|
||||
mergeWithPartitionValues(
|
||||
requiredColumns,
|
||||
requiredDataColumns,
|
||||
partitionColumns,
|
||||
partitionValues,
|
||||
dataRows)
|
||||
}
|
||||
new UnionRDD(relation.sqlContext.sparkContext, perPartitionRows)
|
||||
}
|
||||
|
||||
unionedRows
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
|
@ -477,7 +618,7 @@ private[sql] object DataSourceStrategy extends Strategy with Logging {
|
|||
}
|
||||
|
||||
relation.relation match {
|
||||
case r: HadoopFsRelation => pairs += INPUT_PATHS -> r.paths.mkString(", ")
|
||||
case r: HadoopFsRelation => pairs += INPUT_PATHS -> r.location.paths.mkString(", ")
|
||||
case _ =>
|
||||
}
|
||||
|
||||
|
|
|
@ -25,8 +25,8 @@ import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
|
|||
|
||||
import org.apache.spark._
|
||||
import org.apache.spark.sql._
|
||||
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
|
||||
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}
|
||||
import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeSet}
|
||||
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
|
||||
import org.apache.spark.sql.catalyst.InternalRow
|
||||
import org.apache.spark.sql.execution.SQLExecution
|
||||
import org.apache.spark.sql.execution.command.RunnableCommand
|
||||
|
@ -34,7 +34,6 @@ import org.apache.spark.sql.internal.SQLConf
|
|||
import org.apache.spark.sql.sources._
|
||||
import org.apache.spark.util.Utils
|
||||
|
||||
|
||||
/**
|
||||
* A command for writing data to a [[HadoopFsRelation]]. Supports both overwriting and appending.
|
||||
* Writing to dynamic partitions is also supported. Each [[InsertIntoHadoopFsRelation]] issues a
|
||||
|
@ -58,18 +57,29 @@ import org.apache.spark.util.Utils
|
|||
* thrown during job commitment, also aborts the job.
|
||||
*/
|
||||
private[sql] case class InsertIntoHadoopFsRelation(
|
||||
@transient relation: HadoopFsRelation,
|
||||
outputPath: Path,
|
||||
partitionColumns: Seq[Attribute],
|
||||
bucketSpec: Option[BucketSpec],
|
||||
fileFormat: FileFormat,
|
||||
refreshFunction: () => Unit,
|
||||
options: Map[String, String],
|
||||
@transient query: LogicalPlan,
|
||||
mode: SaveMode)
|
||||
extends RunnableCommand {
|
||||
|
||||
override def children: Seq[LogicalPlan] = query :: Nil
|
||||
|
||||
override def run(sqlContext: SQLContext): Seq[Row] = {
|
||||
require(
|
||||
relation.paths.length == 1,
|
||||
s"Cannot write to multiple destinations: ${relation.paths.mkString(",")}")
|
||||
// Most formats don't do well with duplicate columns, so lets not allow that
|
||||
if (query.schema.fieldNames.length != query.schema.fieldNames.distinct.length) {
|
||||
val duplicateColumns = query.schema.fieldNames.groupBy(identity).collect {
|
||||
case (x, ys) if ys.length > 1 => "\"" + x + "\""
|
||||
}.mkString(", ")
|
||||
throw new AnalysisException(s"Duplicate column(s) : $duplicateColumns found, " +
|
||||
s"cannot save to file.")
|
||||
}
|
||||
|
||||
val hadoopConf = sqlContext.sparkContext.hadoopConfiguration
|
||||
val outputPath = new Path(relation.paths.head)
|
||||
val fs = outputPath.getFileSystem(hadoopConf)
|
||||
val qualifiedOutputPath = outputPath.makeQualified(fs.getUri, fs.getWorkingDirectory)
|
||||
|
||||
|
@ -101,45 +111,28 @@ private[sql] case class InsertIntoHadoopFsRelation(
|
|||
job.setOutputValueClass(classOf[InternalRow])
|
||||
FileOutputFormat.setOutputPath(job, qualifiedOutputPath)
|
||||
|
||||
// A partitioned relation schema's can be different from the input logicalPlan, since
|
||||
// partition columns are all moved after data column. We Project to adjust the ordering.
|
||||
// TODO: this belongs in the analyzer.
|
||||
val project = Project(
|
||||
relation.schema.map(field => UnresolvedAttribute.quoted(field.name)), query)
|
||||
val queryExecution = DataFrame(sqlContext, project).queryExecution
|
||||
val partitionSet = AttributeSet(partitionColumns)
|
||||
val dataColumns = query.output.filterNot(partitionSet.contains)
|
||||
|
||||
val queryExecution = DataFrame(sqlContext, query).queryExecution
|
||||
SQLExecution.withNewExecutionId(sqlContext, queryExecution) {
|
||||
val df = sqlContext.internalCreateDataFrame(queryExecution.toRdd, relation.schema)
|
||||
val partitionColumns = relation.partitionColumns.fieldNames
|
||||
val relation =
|
||||
WriteRelation(
|
||||
sqlContext,
|
||||
dataColumns.toStructType,
|
||||
qualifiedOutputPath.toString,
|
||||
fileFormat.prepareWrite(sqlContext, _, options, dataColumns.toStructType),
|
||||
bucketSpec)
|
||||
|
||||
// Some pre-flight checks.
|
||||
require(
|
||||
df.schema == relation.schema,
|
||||
s"""DataFrame must have the same schema as the relation to which is inserted.
|
||||
|DataFrame schema: ${df.schema}
|
||||
|Relation schema: ${relation.schema}
|
||||
""".stripMargin)
|
||||
val partitionColumnsInSpec = relation.partitionColumns.fieldNames
|
||||
require(
|
||||
partitionColumnsInSpec.sameElements(partitionColumns),
|
||||
s"""Partition columns mismatch.
|
||||
|Expected: ${partitionColumnsInSpec.mkString(", ")}
|
||||
|Actual: ${partitionColumns.mkString(", ")}
|
||||
""".stripMargin)
|
||||
|
||||
val writerContainer = if (partitionColumns.isEmpty && relation.maybeBucketSpec.isEmpty) {
|
||||
val writerContainer = if (partitionColumns.isEmpty && bucketSpec.isEmpty) {
|
||||
new DefaultWriterContainer(relation, job, isAppend)
|
||||
} else {
|
||||
val output = df.queryExecution.executedPlan.output
|
||||
val (partitionOutput, dataOutput) =
|
||||
output.partition(a => partitionColumns.contains(a.name))
|
||||
|
||||
new DynamicPartitionWriterContainer(
|
||||
relation,
|
||||
job,
|
||||
partitionOutput,
|
||||
dataOutput,
|
||||
output,
|
||||
partitionColumns = partitionColumns,
|
||||
dataColumns = dataColumns,
|
||||
inputSchema = query.output,
|
||||
PartitioningUtils.DEFAULT_PARTITION_NAME,
|
||||
sqlContext.conf.getConf(SQLConf.PARTITION_MAX_FILES),
|
||||
isAppend)
|
||||
|
@ -150,9 +143,9 @@ private[sql] case class InsertIntoHadoopFsRelation(
|
|||
writerContainer.driverSideSetup()
|
||||
|
||||
try {
|
||||
sqlContext.sparkContext.runJob(df.queryExecution.toRdd, writerContainer.writeRows _)
|
||||
sqlContext.sparkContext.runJob(queryExecution.toRdd, writerContainer.writeRows _)
|
||||
writerContainer.commitJob()
|
||||
relation.refresh()
|
||||
refreshFunction()
|
||||
} catch { case cause: Throwable =>
|
||||
logError("Aborting job.", cause)
|
||||
writerContainer.abortJob()
|
||||
|
|
|
@ -32,7 +32,12 @@ import org.apache.spark.sql.catalyst.expressions.{Cast, Literal}
|
|||
import org.apache.spark.sql.types._
|
||||
|
||||
|
||||
private[sql] case class Partition(values: InternalRow, path: String)
|
||||
object Partition {
|
||||
def apply(values: InternalRow, path: String): Partition =
|
||||
apply(values, new Path(path))
|
||||
}
|
||||
|
||||
private[sql] case class Partition(values: InternalRow, path: Path)
|
||||
|
||||
private[sql] case class PartitionSpec(partitionColumns: StructType, partitions: Seq[Partition])
|
||||
|
||||
|
@ -102,7 +107,8 @@ private[sql] object PartitioningUtils {
|
|||
// It will be recognised as conflicting directory structure:
|
||||
// "hdfs://host:9000/invalidPath"
|
||||
// "hdfs://host:9000/path"
|
||||
val discoveredBasePaths = optDiscoveredBasePaths.flatMap(x => x)
|
||||
// TODO: Selective case sensitivity.
|
||||
val discoveredBasePaths = optDiscoveredBasePaths.flatMap(x => x).map(_.toString.toLowerCase())
|
||||
assert(
|
||||
discoveredBasePaths.distinct.size == 1,
|
||||
"Conflicting directory structures detected. Suspicious paths:\b" +
|
||||
|
@ -127,7 +133,7 @@ private[sql] object PartitioningUtils {
|
|||
// Finally, we create `Partition`s based on paths and resolved partition values.
|
||||
val partitions = resolvedPartitionValues.zip(pathsWithPartitionValues).map {
|
||||
case (PartitionValues(_, literals), (path, _)) =>
|
||||
Partition(InternalRow.fromSeq(literals.map(_.value)), path.toString)
|
||||
Partition(InternalRow.fromSeq(literals.map(_.value)), path)
|
||||
}
|
||||
|
||||
PartitionSpec(StructType(fields), partitions)
|
||||
|
@ -242,7 +248,9 @@ private[sql] object PartitioningUtils {
|
|||
if (pathsWithPartitionValues.isEmpty) {
|
||||
Seq.empty
|
||||
} else {
|
||||
val distinctPartColNames = pathsWithPartitionValues.map(_._2.columnNames).distinct
|
||||
// TODO: Selective case sensitivity.
|
||||
val distinctPartColNames =
|
||||
pathsWithPartitionValues.map(_._2.columnNames.map(_.toLowerCase())).distinct
|
||||
assert(
|
||||
distinctPartColNames.size == 1,
|
||||
listConflictingPartitionColumns(pathsWithPartitionValues))
|
||||
|
|
|
@ -24,19 +24,23 @@ import scala.language.{existentials, implicitConversions}
|
|||
import scala.util.{Failure, Success, Try}
|
||||
|
||||
import org.apache.hadoop.fs.Path
|
||||
import org.apache.hadoop.util.StringUtils
|
||||
|
||||
import org.apache.spark.Logging
|
||||
import org.apache.spark.deploy.SparkHadoopUtil
|
||||
import org.apache.spark.sql.{AnalysisException, DataFrame, SaveMode, SQLContext}
|
||||
import org.apache.spark.sql.execution.streaming.{Sink, Source}
|
||||
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
|
||||
import org.apache.spark.sql.execution.streaming.{FileStreamSource, Sink, Source}
|
||||
import org.apache.spark.sql.sources._
|
||||
import org.apache.spark.sql.types.{CalendarIntervalType, StructType}
|
||||
import org.apache.spark.util.Utils
|
||||
|
||||
case class ResolvedDataSource(provider: Class[_], relation: BaseRelation)
|
||||
|
||||
|
||||
/**
|
||||
* Responsible for taking a description of a datasource (either from
|
||||
* [[org.apache.spark.sql.DataFrameReader]], or a metastore) and converting it into a logical
|
||||
* relation that can be used in a query plan.
|
||||
*/
|
||||
object ResolvedDataSource extends Logging {
|
||||
|
||||
/** A map to maintain backward compatibility in case we move data sources around. */
|
||||
|
@ -92,19 +96,61 @@ object ResolvedDataSource extends Logging {
|
|||
}
|
||||
}
|
||||
|
||||
// TODO: Combine with apply?
|
||||
def createSource(
|
||||
sqlContext: SQLContext,
|
||||
userSpecifiedSchema: Option[StructType],
|
||||
providerName: String,
|
||||
options: Map[String, String]): Source = {
|
||||
val provider = lookupDataSource(providerName).newInstance() match {
|
||||
case s: StreamSourceProvider => s
|
||||
case s: StreamSourceProvider =>
|
||||
s.createSource(sqlContext, userSpecifiedSchema, providerName, options)
|
||||
|
||||
case format: FileFormat =>
|
||||
val caseInsensitiveOptions = new CaseInsensitiveMap(options)
|
||||
val path = caseInsensitiveOptions.getOrElse("path", {
|
||||
throw new IllegalArgumentException("'path' is not specified")
|
||||
})
|
||||
val metadataPath = caseInsensitiveOptions.getOrElse("metadataPath", s"$path/_metadata")
|
||||
|
||||
val allPaths = caseInsensitiveOptions.get("path")
|
||||
val globbedPaths = allPaths.toSeq.flatMap { path =>
|
||||
val hdfsPath = new Path(path)
|
||||
val fs = hdfsPath.getFileSystem(sqlContext.sparkContext.hadoopConfiguration)
|
||||
val qualified = hdfsPath.makeQualified(fs.getUri, fs.getWorkingDirectory)
|
||||
SparkHadoopUtil.get.globPathIfNecessary(qualified)
|
||||
}.toArray
|
||||
|
||||
val fileCatalog: FileCatalog = new HDFSFileCatalog(sqlContext, options, globbedPaths)
|
||||
val dataSchema = userSpecifiedSchema.orElse {
|
||||
format.inferSchema(
|
||||
sqlContext,
|
||||
caseInsensitiveOptions,
|
||||
fileCatalog.allFiles())
|
||||
}.getOrElse {
|
||||
throw new AnalysisException("Unable to infer schema. It must be specified manually.")
|
||||
}
|
||||
|
||||
def dataFrameBuilder(files: Array[String]): DataFrame = {
|
||||
new DataFrame(
|
||||
sqlContext,
|
||||
LogicalRelation(
|
||||
apply(
|
||||
sqlContext,
|
||||
paths = files,
|
||||
userSpecifiedSchema = Some(dataSchema),
|
||||
provider = providerName,
|
||||
options = options.filterKeys(_ != "path")).relation))
|
||||
}
|
||||
|
||||
new FileStreamSource(
|
||||
sqlContext, metadataPath, path, Some(dataSchema), providerName, dataFrameBuilder)
|
||||
case _ =>
|
||||
throw new UnsupportedOperationException(
|
||||
s"Data source $providerName does not support streamed reading")
|
||||
}
|
||||
|
||||
provider.createSource(sqlContext, userSpecifiedSchema, providerName, options)
|
||||
provider
|
||||
}
|
||||
|
||||
def createSink(
|
||||
|
@ -125,98 +171,72 @@ object ResolvedDataSource extends Logging {
|
|||
/** Create a [[ResolvedDataSource]] for reading data in. */
|
||||
def apply(
|
||||
sqlContext: SQLContext,
|
||||
userSpecifiedSchema: Option[StructType],
|
||||
partitionColumns: Array[String],
|
||||
bucketSpec: Option[BucketSpec],
|
||||
paths: Seq[String] = Nil,
|
||||
userSpecifiedSchema: Option[StructType] = None,
|
||||
partitionColumns: Array[String] = Array.empty,
|
||||
bucketSpec: Option[BucketSpec] = None,
|
||||
provider: String,
|
||||
options: Map[String, String]): ResolvedDataSource = {
|
||||
val clazz: Class[_] = lookupDataSource(provider)
|
||||
def className: String = clazz.getCanonicalName
|
||||
val relation = userSpecifiedSchema match {
|
||||
case Some(schema: StructType) => clazz.newInstance() match {
|
||||
case dataSource: SchemaRelationProvider =>
|
||||
val caseInsensitiveOptions = new CaseInsensitiveMap(options)
|
||||
if (caseInsensitiveOptions.contains("paths")) {
|
||||
throw new AnalysisException(s"$className does not support paths option.")
|
||||
}
|
||||
dataSource.createRelation(sqlContext, caseInsensitiveOptions, schema)
|
||||
case dataSource: HadoopFsRelationProvider =>
|
||||
val maybePartitionsSchema = if (partitionColumns.isEmpty) {
|
||||
None
|
||||
} else {
|
||||
Some(partitionColumnsSchema(
|
||||
schema, partitionColumns, sqlContext.conf.caseSensitiveAnalysis))
|
||||
}
|
||||
|
||||
val caseInsensitiveOptions = new CaseInsensitiveMap(options)
|
||||
val paths = {
|
||||
if (caseInsensitiveOptions.contains("paths") &&
|
||||
caseInsensitiveOptions.contains("path")) {
|
||||
throw new AnalysisException(s"Both path and paths options are present.")
|
||||
}
|
||||
caseInsensitiveOptions.get("paths")
|
||||
.map(_.split("(?<!\\\\),").map(StringUtils.unEscapeString(_, '\\', ',')))
|
||||
.getOrElse(Array(caseInsensitiveOptions.getOrElse("path", {
|
||||
throw new IllegalArgumentException("'path' is not specified")
|
||||
})))
|
||||
.flatMap{ pathString =>
|
||||
val hdfsPath = new Path(pathString)
|
||||
val fs = hdfsPath.getFileSystem(sqlContext.sparkContext.hadoopConfiguration)
|
||||
val qualified = hdfsPath.makeQualified(fs.getUri, fs.getWorkingDirectory)
|
||||
SparkHadoopUtil.get.globPathIfNecessary(qualified).map(_.toString)
|
||||
}
|
||||
}
|
||||
val caseInsensitiveOptions = new CaseInsensitiveMap(options)
|
||||
val relation = (clazz.newInstance(), userSpecifiedSchema) match {
|
||||
// TODO: Throw when too much is given.
|
||||
case (dataSource: SchemaRelationProvider, Some(schema)) =>
|
||||
dataSource.createRelation(sqlContext, caseInsensitiveOptions, schema)
|
||||
case (dataSource: RelationProvider, None) =>
|
||||
dataSource.createRelation(sqlContext, caseInsensitiveOptions)
|
||||
case (_: SchemaRelationProvider, None) =>
|
||||
throw new AnalysisException(s"A schema needs to be specified when using $className.")
|
||||
case (_: RelationProvider, Some(_)) =>
|
||||
throw new AnalysisException(s"$className does not allow user-specified schemas.")
|
||||
|
||||
val dataSchema =
|
||||
StructType(schema.filterNot(f => partitionColumns.contains(f.name))).asNullable
|
||||
case (format: FileFormat, _) =>
|
||||
val allPaths = caseInsensitiveOptions.get("path") ++ paths
|
||||
val globbedPaths = allPaths.flatMap { path =>
|
||||
val hdfsPath = new Path(path)
|
||||
val fs = hdfsPath.getFileSystem(sqlContext.sparkContext.hadoopConfiguration)
|
||||
val qualified = hdfsPath.makeQualified(fs.getUri, fs.getWorkingDirectory)
|
||||
SparkHadoopUtil.get.globPathIfNecessary(qualified)
|
||||
}.toArray
|
||||
|
||||
dataSource.createRelation(
|
||||
val fileCatalog: FileCatalog = new HDFSFileCatalog(sqlContext, options, globbedPaths)
|
||||
val dataSchema = userSpecifiedSchema.orElse {
|
||||
format.inferSchema(
|
||||
sqlContext,
|
||||
paths,
|
||||
Some(dataSchema),
|
||||
maybePartitionsSchema,
|
||||
bucketSpec,
|
||||
caseInsensitiveOptions)
|
||||
case dataSource: org.apache.spark.sql.sources.RelationProvider =>
|
||||
throw new AnalysisException(s"$className does not allow user-specified schemas.")
|
||||
case _ =>
|
||||
throw new AnalysisException(s"$className is not a RelationProvider.")
|
||||
}
|
||||
caseInsensitiveOptions,
|
||||
fileCatalog.allFiles())
|
||||
}.getOrElse {
|
||||
throw new AnalysisException(
|
||||
s"Unable to infer schema for $format at ${allPaths.take(2).mkString(",")}. " +
|
||||
"It must be specified manually")
|
||||
}
|
||||
|
||||
case None => clazz.newInstance() match {
|
||||
case dataSource: RelationProvider =>
|
||||
val caseInsensitiveOptions = new CaseInsensitiveMap(options)
|
||||
if (caseInsensitiveOptions.contains("paths")) {
|
||||
throw new AnalysisException(s"$className does not support paths option.")
|
||||
}
|
||||
dataSource.createRelation(sqlContext, caseInsensitiveOptions)
|
||||
case dataSource: HadoopFsRelationProvider =>
|
||||
val caseInsensitiveOptions = new CaseInsensitiveMap(options)
|
||||
val paths = {
|
||||
if (caseInsensitiveOptions.contains("paths") &&
|
||||
caseInsensitiveOptions.contains("path")) {
|
||||
throw new AnalysisException(s"Both path and paths options are present.")
|
||||
}
|
||||
caseInsensitiveOptions.get("paths")
|
||||
.map(_.split("(?<!\\\\),").map(StringUtils.unEscapeString(_, '\\', ',')))
|
||||
.getOrElse(Array(caseInsensitiveOptions.getOrElse("path", {
|
||||
throw new IllegalArgumentException("'path' is not specified")
|
||||
})))
|
||||
.flatMap{ pathString =>
|
||||
val hdfsPath = new Path(pathString)
|
||||
val fs = hdfsPath.getFileSystem(sqlContext.sparkContext.hadoopConfiguration)
|
||||
val qualified = hdfsPath.makeQualified(fs.getUri, fs.getWorkingDirectory)
|
||||
SparkHadoopUtil.get.globPathIfNecessary(qualified).map(_.toString)
|
||||
}
|
||||
}
|
||||
dataSource.createRelation(sqlContext, paths, None, None, None, caseInsensitiveOptions)
|
||||
case dataSource: org.apache.spark.sql.sources.SchemaRelationProvider =>
|
||||
throw new AnalysisException(
|
||||
s"A schema needs to be specified when using $className.")
|
||||
case _ =>
|
||||
throw new AnalysisException(
|
||||
s"$className is neither a RelationProvider nor a FSBasedRelationProvider.")
|
||||
}
|
||||
// If they gave a schema, then we try and figure out the types of the partition columns
|
||||
// from that schema.
|
||||
val partitionSchema = userSpecifiedSchema.map { schema =>
|
||||
StructType(
|
||||
partitionColumns.map { c =>
|
||||
// TODO: Case sensitivity.
|
||||
schema
|
||||
.find(_.name.toLowerCase() == c.toLowerCase())
|
||||
.getOrElse(throw new AnalysisException(s"Invalid partition column '$c'"))
|
||||
})
|
||||
}.getOrElse(fileCatalog.partitionSpec(None).partitionColumns)
|
||||
|
||||
HadoopFsRelation(
|
||||
sqlContext,
|
||||
fileCatalog,
|
||||
partitionSchema = partitionSchema,
|
||||
dataSchema = dataSchema.asNullable,
|
||||
bucketSpec = bucketSpec,
|
||||
format,
|
||||
options)
|
||||
|
||||
case _ =>
|
||||
throw new AnalysisException(
|
||||
s"$className is not a valid Spark SQL Data Source.")
|
||||
}
|
||||
new ResolvedDataSource(clazz, relation)
|
||||
}
|
||||
|
@ -254,10 +274,10 @@ object ResolvedDataSource extends Logging {
|
|||
throw new AnalysisException("Cannot save interval data type into external storage.")
|
||||
}
|
||||
val clazz: Class[_] = lookupDataSource(provider)
|
||||
val relation = clazz.newInstance() match {
|
||||
clazz.newInstance() match {
|
||||
case dataSource: CreatableRelationProvider =>
|
||||
dataSource.createRelation(sqlContext, mode, options, data)
|
||||
case dataSource: HadoopFsRelationProvider =>
|
||||
case format: FileFormat =>
|
||||
// Don't glob path for the write path. The contracts here are:
|
||||
// 1. Only one output path can be specified on the write path;
|
||||
// 2. Output path must be a legal HDFS style file system path;
|
||||
|
@ -278,26 +298,63 @@ object ResolvedDataSource extends Logging {
|
|||
val equality = columnNameEquality(caseSensitive)
|
||||
val dataSchema = StructType(
|
||||
data.schema.filterNot(f => partitionColumns.exists(equality(_, f.name))))
|
||||
val r = dataSource.createRelation(
|
||||
sqlContext,
|
||||
Array(outputPath.toString),
|
||||
Some(dataSchema.asNullable),
|
||||
Some(partitionColumnsSchema(data.schema, partitionColumns, caseSensitive)),
|
||||
bucketSpec,
|
||||
caseInsensitiveOptions)
|
||||
|
||||
// If we are appending to a table that already exists, make sure the partitioning matches
|
||||
// up. If we fail to load the table for whatever reason, ignore the check.
|
||||
if (mode == SaveMode.Append) {
|
||||
val existingPartitionColumnSet = try {
|
||||
val resolved = apply(
|
||||
sqlContext,
|
||||
userSpecifiedSchema = Some(data.schema.asNullable),
|
||||
provider = provider,
|
||||
options = options)
|
||||
|
||||
Some(resolved.relation
|
||||
.asInstanceOf[HadoopFsRelation]
|
||||
.location
|
||||
.partitionSpec(None)
|
||||
.partitionColumns
|
||||
.fieldNames
|
||||
.toSet)
|
||||
} catch {
|
||||
case e: Exception =>
|
||||
None
|
||||
}
|
||||
|
||||
existingPartitionColumnSet.foreach { ex =>
|
||||
if (ex.map(_.toLowerCase) != partitionColumns.map(_.toLowerCase()).toSet) {
|
||||
throw new AnalysisException(
|
||||
s"Requested partitioning does not equal existing partitioning: " +
|
||||
s"$ex != ${partitionColumns.toSet}.")
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// For partitioned relation r, r.schema's column ordering can be different from the column
|
||||
// ordering of data.logicalPlan (partition columns are all moved after data column). This
|
||||
// will be adjusted within InsertIntoHadoopFsRelation.
|
||||
sqlContext.executePlan(
|
||||
val plan =
|
||||
InsertIntoHadoopFsRelation(
|
||||
r,
|
||||
outputPath,
|
||||
partitionColumns.map(UnresolvedAttribute.quoted),
|
||||
bucketSpec,
|
||||
format,
|
||||
() => Unit, // No existing table needs to be refreshed.
|
||||
options,
|
||||
data.logicalPlan,
|
||||
mode)).toRdd
|
||||
r
|
||||
mode)
|
||||
sqlContext.executePlan(plan).toRdd
|
||||
|
||||
case _ =>
|
||||
sys.error(s"${clazz.getCanonicalName} does not allow create table as select.")
|
||||
}
|
||||
ResolvedDataSource(clazz, relation)
|
||||
|
||||
apply(
|
||||
sqlContext,
|
||||
userSpecifiedSchema = Some(data.schema.asNullable),
|
||||
partitionColumns = partitionColumns,
|
||||
bucketSpec = bucketSpec,
|
||||
provider = provider,
|
||||
options = options)
|
||||
}
|
||||
}
|
||||
|
|
|
@ -26,6 +26,7 @@ import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
|
|||
|
||||
import org.apache.spark._
|
||||
import org.apache.spark.mapred.SparkHadoopMapRedUtil
|
||||
import org.apache.spark.sql.SQLContext
|
||||
import org.apache.spark.sql.catalyst.expressions._
|
||||
import org.apache.spark.sql.catalyst.plans.physical.HashPartitioning
|
||||
import org.apache.spark.sql.catalyst.InternalRow
|
||||
|
@ -35,9 +36,16 @@ import org.apache.spark.sql.sources.{HadoopFsRelation, OutputWriter, OutputWrite
|
|||
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
|
||||
import org.apache.spark.util.SerializableConfiguration
|
||||
|
||||
/** A container for all the details required when writing to a table. */
|
||||
case class WriteRelation(
|
||||
sqlContext: SQLContext,
|
||||
dataSchema: StructType,
|
||||
path: String,
|
||||
prepareJobForWrite: Job => OutputWriterFactory,
|
||||
bucketSpec: Option[BucketSpec])
|
||||
|
||||
private[sql] abstract class BaseWriterContainer(
|
||||
@transient val relation: HadoopFsRelation,
|
||||
@transient val relation: WriteRelation,
|
||||
@transient private val job: Job,
|
||||
isAppend: Boolean)
|
||||
extends Logging with Serializable {
|
||||
|
@ -67,12 +75,7 @@ private[sql] abstract class BaseWriterContainer(
|
|||
@transient private var taskAttemptId: TaskAttemptID = _
|
||||
@transient protected var taskAttemptContext: TaskAttemptContext = _
|
||||
|
||||
protected val outputPath: String = {
|
||||
assert(
|
||||
relation.paths.length == 1,
|
||||
s"Cannot write to multiple destinations: ${relation.paths.mkString(",")}")
|
||||
relation.paths.head
|
||||
}
|
||||
protected val outputPath: String = relation.path
|
||||
|
||||
protected var outputWriterFactory: OutputWriterFactory = _
|
||||
|
||||
|
@ -237,7 +240,7 @@ private[sql] abstract class BaseWriterContainer(
|
|||
* A writer that writes all of the rows in a partition to a single file.
|
||||
*/
|
||||
private[sql] class DefaultWriterContainer(
|
||||
relation: HadoopFsRelation,
|
||||
relation: WriteRelation,
|
||||
job: Job,
|
||||
isAppend: Boolean)
|
||||
extends BaseWriterContainer(relation, job, isAppend) {
|
||||
|
@ -299,7 +302,7 @@ private[sql] class DefaultWriterContainer(
|
|||
* writer externally sorts the remaining rows and then writes out them out one file at a time.
|
||||
*/
|
||||
private[sql] class DynamicPartitionWriterContainer(
|
||||
relation: HadoopFsRelation,
|
||||
relation: WriteRelation,
|
||||
job: Job,
|
||||
partitionColumns: Seq[Attribute],
|
||||
dataColumns: Seq[Attribute],
|
||||
|
@ -309,7 +312,7 @@ private[sql] class DynamicPartitionWriterContainer(
|
|||
isAppend: Boolean)
|
||||
extends BaseWriterContainer(relation, job, isAppend) {
|
||||
|
||||
private val bucketSpec = relation.maybeBucketSpec
|
||||
private val bucketSpec = relation.bucketSpec
|
||||
|
||||
private val bucketColumns: Seq[Attribute] = bucketSpec.toSeq.flatMap {
|
||||
spec => spec.bucketColumnNames.map(c => inputSchema.find(_.name == c).get)
|
||||
|
@ -374,7 +377,6 @@ private[sql] class DynamicPartitionWriterContainer(
|
|||
|
||||
// We should first sort by partition columns, then bucket id, and finally sorting columns.
|
||||
val sortingExpressions: Seq[Expression] = partitionColumns ++ bucketIdExpression ++ sortColumns
|
||||
|
||||
val getSortingKey = UnsafeProjection.create(sortingExpressions, inputSchema)
|
||||
|
||||
val sortingKeySchema = StructType(sortingExpressions.map {
|
||||
|
|
|
@ -17,12 +17,6 @@
|
|||
|
||||
package org.apache.spark.sql.execution.datasources
|
||||
|
||||
import org.apache.hadoop.mapreduce.TaskAttemptContext
|
||||
|
||||
import org.apache.spark.sql.SQLContext
|
||||
import org.apache.spark.sql.sources.{HadoopFsRelation, HadoopFsRelationProvider, OutputWriter, OutputWriterFactory}
|
||||
import org.apache.spark.sql.types.StructType
|
||||
|
||||
/**
|
||||
* A container for bucketing information.
|
||||
* Bucketing is a technology for decomposing data sets into more manageable parts, and the number
|
||||
|
@ -37,24 +31,6 @@ private[sql] case class BucketSpec(
|
|||
bucketColumnNames: Seq[String],
|
||||
sortColumnNames: Seq[String])
|
||||
|
||||
private[sql] trait BucketedHadoopFsRelationProvider extends HadoopFsRelationProvider {
|
||||
final override def createRelation(
|
||||
sqlContext: SQLContext,
|
||||
paths: Array[String],
|
||||
dataSchema: Option[StructType],
|
||||
partitionColumns: Option[StructType],
|
||||
parameters: Map[String, String]): HadoopFsRelation =
|
||||
throw new UnsupportedOperationException("use the overload version with bucketSpec parameter")
|
||||
}
|
||||
|
||||
private[sql] abstract class BucketedOutputWriterFactory extends OutputWriterFactory {
|
||||
final override def newInstance(
|
||||
path: String,
|
||||
dataSchema: StructType,
|
||||
context: TaskAttemptContext): OutputWriter =
|
||||
throw new UnsupportedOperationException("use the overload version with bucketSpec parameter")
|
||||
}
|
||||
|
||||
private[sql] object BucketingUtils {
|
||||
// The file name of bucketed data should have 3 parts:
|
||||
// 1. some other information in the head of file name
|
||||
|
|
|
@ -17,151 +17,21 @@
|
|||
|
||||
package org.apache.spark.sql.execution.datasources.csv
|
||||
|
||||
import java.nio.charset.Charset
|
||||
|
||||
import scala.util.control.NonFatal
|
||||
|
||||
import com.google.common.base.Objects
|
||||
import org.apache.hadoop.fs.{FileStatus, Path}
|
||||
import org.apache.hadoop.io.{LongWritable, NullWritable, Text}
|
||||
import org.apache.hadoop.mapred.TextInputFormat
|
||||
import org.apache.hadoop.mapreduce.{Job, TaskAttemptContext}
|
||||
import org.apache.hadoop.io.{NullWritable, Text}
|
||||
import org.apache.hadoop.mapreduce.RecordWriter
|
||||
import org.apache.hadoop.mapreduce.TaskAttemptContext
|
||||
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
|
||||
|
||||
import org.apache.spark.Logging
|
||||
import org.apache.spark.rdd.RDD
|
||||
import org.apache.spark.sql._
|
||||
import org.apache.spark.sql.catalyst.InternalRow
|
||||
import org.apache.spark.sql.execution.datasources.CompressionCodecs
|
||||
import org.apache.spark.sql.sources._
|
||||
import org.apache.spark.sql.types._
|
||||
|
||||
private[sql] class CSVRelation(
|
||||
private val inputRDD: Option[RDD[String]],
|
||||
override val paths: Array[String] = Array.empty[String],
|
||||
private val maybeDataSchema: Option[StructType],
|
||||
override val userDefinedPartitionColumns: Option[StructType],
|
||||
private val parameters: Map[String, String])
|
||||
(@transient val sqlContext: SQLContext) extends HadoopFsRelation {
|
||||
|
||||
override lazy val dataSchema: StructType = maybeDataSchema match {
|
||||
case Some(structType) => structType
|
||||
case None => inferSchema(paths)
|
||||
}
|
||||
|
||||
private val options = new CSVOptions(parameters)
|
||||
|
||||
@transient
|
||||
private var cachedRDD: Option[RDD[String]] = None
|
||||
|
||||
private def readText(location: String): RDD[String] = {
|
||||
if (Charset.forName(options.charset) == Charset.forName("UTF-8")) {
|
||||
sqlContext.sparkContext.textFile(location)
|
||||
} else {
|
||||
val charset = options.charset
|
||||
sqlContext.sparkContext.hadoopFile[LongWritable, Text, TextInputFormat](location)
|
||||
.mapPartitions { _.map { pair =>
|
||||
new String(pair._2.getBytes, 0, pair._2.getLength, charset)
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
private def baseRdd(inputPaths: Array[String]): RDD[String] = {
|
||||
inputRDD.getOrElse {
|
||||
cachedRDD.getOrElse {
|
||||
val rdd = readText(inputPaths.mkString(","))
|
||||
cachedRDD = Some(rdd)
|
||||
rdd
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
private def tokenRdd(header: Array[String], inputPaths: Array[String]): RDD[Array[String]] = {
|
||||
val rdd = baseRdd(inputPaths)
|
||||
// Make sure firstLine is materialized before sending to executors
|
||||
val firstLine = if (options.headerFlag) findFirstLine(rdd) else null
|
||||
CSVRelation.univocityTokenizer(rdd, header, firstLine, options)
|
||||
}
|
||||
|
||||
/**
|
||||
* This supports to eliminate unneeded columns before producing an RDD
|
||||
* containing all of its tuples as Row objects. This reads all the tokens of each line
|
||||
* and then drop unneeded tokens without casting and type-checking by mapping
|
||||
* both the indices produced by `requiredColumns` and the ones of tokens.
|
||||
* TODO: Switch to using buildInternalScan
|
||||
*/
|
||||
override def buildScan(requiredColumns: Array[String], inputs: Array[FileStatus]): RDD[Row] = {
|
||||
val pathsString = inputs.map(_.getPath.toUri.toString)
|
||||
val header = schema.fields.map(_.name)
|
||||
val tokenizedRdd = tokenRdd(header, pathsString)
|
||||
CSVRelation.parseCsv(tokenizedRdd, schema, requiredColumns, inputs, sqlContext, options)
|
||||
}
|
||||
|
||||
override def prepareJobForWrite(job: Job): OutputWriterFactory = {
|
||||
val conf = job.getConfiguration
|
||||
options.compressionCodec.foreach { codec =>
|
||||
CompressionCodecs.setCodecConfiguration(conf, codec)
|
||||
}
|
||||
|
||||
new CSVOutputWriterFactory(options)
|
||||
}
|
||||
|
||||
override def hashCode(): Int = Objects.hashCode(paths.toSet, dataSchema, schema, partitionColumns)
|
||||
|
||||
override def equals(other: Any): Boolean = other match {
|
||||
case that: CSVRelation => {
|
||||
val equalPath = paths.toSet == that.paths.toSet
|
||||
val equalDataSchema = dataSchema == that.dataSchema
|
||||
val equalSchema = schema == that.schema
|
||||
val equalPartitionColums = partitionColumns == that.partitionColumns
|
||||
|
||||
equalPath && equalDataSchema && equalSchema && equalPartitionColums
|
||||
}
|
||||
case _ => false
|
||||
}
|
||||
|
||||
private def inferSchema(paths: Array[String]): StructType = {
|
||||
val rdd = baseRdd(paths)
|
||||
val firstLine = findFirstLine(rdd)
|
||||
val firstRow = new LineCsvReader(options).parseLine(firstLine)
|
||||
|
||||
val header = if (options.headerFlag) {
|
||||
firstRow
|
||||
} else {
|
||||
firstRow.zipWithIndex.map { case (value, index) => s"C$index" }
|
||||
}
|
||||
|
||||
val parsedRdd = tokenRdd(header, paths)
|
||||
if (options.inferSchemaFlag) {
|
||||
CSVInferSchema.infer(parsedRdd, header, options.nullValue)
|
||||
} else {
|
||||
// By default fields are assumed to be StringType
|
||||
val schemaFields = header.map { fieldName =>
|
||||
StructField(fieldName.toString, StringType, nullable = true)
|
||||
}
|
||||
StructType(schemaFields)
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Returns the first line of the first non-empty file in path
|
||||
*/
|
||||
private def findFirstLine(rdd: RDD[String]): String = {
|
||||
if (options.isCommentSet) {
|
||||
val comment = options.comment.toString
|
||||
rdd.filter { line =>
|
||||
line.trim.nonEmpty && !line.startsWith(comment)
|
||||
}.first()
|
||||
} else {
|
||||
rdd.filter { line =>
|
||||
line.trim.nonEmpty
|
||||
}.first()
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
object CSVRelation extends Logging {
|
||||
|
||||
def univocityTokenizer(
|
||||
|
@ -246,8 +116,10 @@ object CSVRelation extends Logging {
|
|||
private[sql] class CSVOutputWriterFactory(params: CSVOptions) extends OutputWriterFactory {
|
||||
override def newInstance(
|
||||
path: String,
|
||||
bucketId: Option[Int],
|
||||
dataSchema: StructType,
|
||||
context: TaskAttemptContext): OutputWriter = {
|
||||
if (bucketId.isDefined) sys.error("csv doesn't support bucketing")
|
||||
new CsvOutputWriter(path, dataSchema, context, params)
|
||||
}
|
||||
}
|
||||
|
|
|
@ -17,32 +17,157 @@
|
|||
|
||||
package org.apache.spark.sql.execution.datasources.csv
|
||||
|
||||
import java.nio.charset.Charset
|
||||
|
||||
import org.apache.hadoop.fs.FileStatus
|
||||
import org.apache.hadoop.io.{LongWritable, Text}
|
||||
import org.apache.hadoop.mapred.TextInputFormat
|
||||
import org.apache.hadoop.mapreduce.Job
|
||||
|
||||
import org.apache.spark.broadcast.Broadcast
|
||||
import org.apache.spark.rdd.RDD
|
||||
import org.apache.spark.sql.SQLContext
|
||||
import org.apache.spark.sql.catalyst.InternalRow
|
||||
import org.apache.spark.sql.catalyst.encoders.RowEncoder
|
||||
import org.apache.spark.sql.execution.datasources.CompressionCodecs
|
||||
import org.apache.spark.sql.sources._
|
||||
import org.apache.spark.sql.types.StructType
|
||||
import org.apache.spark.sql.types.{StringType, StructField, StructType}
|
||||
import org.apache.spark.util.SerializableConfiguration
|
||||
import org.apache.spark.util.collection.BitSet
|
||||
|
||||
/**
|
||||
* Provides access to CSV data from pure SQL statements.
|
||||
*/
|
||||
class DefaultSource extends HadoopFsRelationProvider with DataSourceRegister {
|
||||
class DefaultSource extends FileFormat with DataSourceRegister {
|
||||
|
||||
override def shortName(): String = "csv"
|
||||
|
||||
/**
|
||||
* Creates a new relation for data store in CSV given parameters and user supported schema.
|
||||
*/
|
||||
override def createRelation(
|
||||
sqlContext: SQLContext,
|
||||
paths: Array[String],
|
||||
dataSchema: Option[StructType],
|
||||
partitionColumns: Option[StructType],
|
||||
parameters: Map[String, String]): HadoopFsRelation = {
|
||||
override def toString: String = "CSV"
|
||||
|
||||
new CSVRelation(
|
||||
None,
|
||||
paths,
|
||||
dataSchema,
|
||||
partitionColumns,
|
||||
parameters)(sqlContext)
|
||||
override def equals(other: Any): Boolean = other.isInstanceOf[DefaultSource]
|
||||
|
||||
override def inferSchema(
|
||||
sqlContext: SQLContext,
|
||||
options: Map[String, String],
|
||||
files: Seq[FileStatus]): Option[StructType] = {
|
||||
val csvOptions = new CSVOptions(options)
|
||||
|
||||
// TODO: Move filtering.
|
||||
val paths = files.filterNot(_.getPath.getName startsWith "_").map(_.getPath.toString)
|
||||
val rdd = baseRdd(sqlContext, csvOptions, paths)
|
||||
val firstLine = findFirstLine(csvOptions, rdd)
|
||||
val firstRow = new LineCsvReader(csvOptions).parseLine(firstLine)
|
||||
|
||||
val header = if (csvOptions.headerFlag) {
|
||||
firstRow
|
||||
} else {
|
||||
firstRow.zipWithIndex.map { case (value, index) => s"C$index" }
|
||||
}
|
||||
|
||||
val parsedRdd = tokenRdd(sqlContext, csvOptions, header, paths)
|
||||
val schema = if (csvOptions.inferSchemaFlag) {
|
||||
CSVInferSchema.infer(parsedRdd, header, csvOptions.nullValue)
|
||||
} else {
|
||||
// By default fields are assumed to be StringType
|
||||
val schemaFields = header.map { fieldName =>
|
||||
StructField(fieldName.toString, StringType, nullable = true)
|
||||
}
|
||||
StructType(schemaFields)
|
||||
}
|
||||
Some(schema)
|
||||
}
|
||||
|
||||
override def prepareWrite(
|
||||
sqlContext: SQLContext,
|
||||
job: Job,
|
||||
options: Map[String, String],
|
||||
dataSchema: StructType): OutputWriterFactory = {
|
||||
val conf = job.getConfiguration
|
||||
val csvOptions = new CSVOptions(options)
|
||||
csvOptions.compressionCodec.foreach { codec =>
|
||||
CompressionCodecs.setCodecConfiguration(conf, codec)
|
||||
}
|
||||
|
||||
new CSVOutputWriterFactory(csvOptions)
|
||||
}
|
||||
|
||||
/**
|
||||
* This supports to eliminate unneeded columns before producing an RDD
|
||||
* containing all of its tuples as Row objects. This reads all the tokens of each line
|
||||
* and then drop unneeded tokens without casting and type-checking by mapping
|
||||
* both the indices produced by `requiredColumns` and the ones of tokens.
|
||||
*/
|
||||
override def buildInternalScan(
|
||||
sqlContext: SQLContext,
|
||||
dataSchema: StructType,
|
||||
requiredColumns: Array[String],
|
||||
filters: Array[Filter],
|
||||
bucketSet: Option[BitSet],
|
||||
inputFiles: Array[FileStatus],
|
||||
broadcastedConf: Broadcast[SerializableConfiguration],
|
||||
options: Map[String, String]): RDD[InternalRow] = {
|
||||
// TODO: Filter before calling buildInternalScan.
|
||||
val csvFiles = inputFiles.filterNot(_.getPath.getName startsWith "_")
|
||||
|
||||
val csvOptions = new CSVOptions(options)
|
||||
val pathsString = csvFiles.map(_.getPath.toUri.toString)
|
||||
val header = dataSchema.fields.map(_.name)
|
||||
val tokenizedRdd = tokenRdd(sqlContext, csvOptions, header, pathsString)
|
||||
val external = CSVRelation.parseCsv(
|
||||
tokenizedRdd, dataSchema, requiredColumns, csvFiles, sqlContext, csvOptions)
|
||||
|
||||
// TODO: Generate InternalRow in parseCsv
|
||||
val outputSchema = StructType(requiredColumns.map(c => dataSchema.find(_.name == c).get))
|
||||
val encoder = RowEncoder(outputSchema)
|
||||
external.map(encoder.toRow)
|
||||
}
|
||||
|
||||
|
||||
private def baseRdd(
|
||||
sqlContext: SQLContext,
|
||||
options: CSVOptions,
|
||||
inputPaths: Seq[String]): RDD[String] = {
|
||||
readText(sqlContext, options, inputPaths.mkString(","))
|
||||
}
|
||||
|
||||
private def tokenRdd(
|
||||
sqlContext: SQLContext,
|
||||
options: CSVOptions,
|
||||
header: Array[String],
|
||||
inputPaths: Seq[String]): RDD[Array[String]] = {
|
||||
val rdd = baseRdd(sqlContext, options, inputPaths)
|
||||
// Make sure firstLine is materialized before sending to executors
|
||||
val firstLine = if (options.headerFlag) findFirstLine(options, rdd) else null
|
||||
CSVRelation.univocityTokenizer(rdd, header, firstLine, options)
|
||||
}
|
||||
|
||||
/**
|
||||
* Returns the first line of the first non-empty file in path
|
||||
*/
|
||||
private def findFirstLine(options: CSVOptions, rdd: RDD[String]): String = {
|
||||
if (options.isCommentSet) {
|
||||
val comment = options.comment.toString
|
||||
rdd.filter { line =>
|
||||
line.trim.nonEmpty && !line.startsWith(comment)
|
||||
}.first()
|
||||
} else {
|
||||
rdd.filter { line =>
|
||||
line.trim.nonEmpty
|
||||
}.first()
|
||||
}
|
||||
}
|
||||
|
||||
private def readText(
|
||||
sqlContext: SQLContext,
|
||||
options: CSVOptions,
|
||||
location: String): RDD[String] = {
|
||||
if (Charset.forName(options.charset) == Charset.forName("UTF-8")) {
|
||||
sqlContext.sparkContext.textFile(location)
|
||||
} else {
|
||||
val charset = options.charset
|
||||
sqlContext.sparkContext
|
||||
.hadoopFile[LongWritable, Text, TextInputFormat](location)
|
||||
.mapPartitions(_.map(pair => new String(pair._2.getBytes, 0, pair._2.getLength, charset)))
|
||||
}
|
||||
}
|
||||
}
|
||||
|
|
|
@ -92,7 +92,10 @@ case class CreateTempTableUsing(
|
|||
|
||||
def run(sqlContext: SQLContext): Seq[Row] = {
|
||||
val resolved = ResolvedDataSource(
|
||||
sqlContext, userSpecifiedSchema, Array.empty[String], bucketSpec = None, provider, options)
|
||||
sqlContext,
|
||||
userSpecifiedSchema = userSpecifiedSchema,
|
||||
provider = provider,
|
||||
options = options)
|
||||
sqlContext.catalog.registerTable(
|
||||
tableIdent,
|
||||
DataFrame(sqlContext, LogicalRelation(resolved.relation)).logicalPlan)
|
||||
|
|
|
@ -26,7 +26,7 @@ import org.apache.spark.sql.types._
|
|||
import org.apache.spark.util.Utils
|
||||
|
||||
|
||||
private[json] object InferSchema {
|
||||
private[sql] object InferSchema {
|
||||
|
||||
/**
|
||||
* Infer the type of a collection of json records in three stages:
|
||||
|
|
|
@ -38,58 +38,84 @@ import org.apache.spark.sql.execution.datasources._
|
|||
import org.apache.spark.sql.sources._
|
||||
import org.apache.spark.sql.types.StructType
|
||||
import org.apache.spark.util.SerializableConfiguration
|
||||
import org.apache.spark.util.collection.BitSet
|
||||
|
||||
|
||||
class DefaultSource extends BucketedHadoopFsRelationProvider with DataSourceRegister {
|
||||
class DefaultSource extends FileFormat with DataSourceRegister {
|
||||
|
||||
override def shortName(): String = "json"
|
||||
|
||||
override def createRelation(
|
||||
override def inferSchema(
|
||||
sqlContext: SQLContext,
|
||||
paths: Array[String],
|
||||
dataSchema: Option[StructType],
|
||||
partitionColumns: Option[StructType],
|
||||
bucketSpec: Option[BucketSpec],
|
||||
parameters: Map[String, String]): HadoopFsRelation = {
|
||||
options: Map[String, String],
|
||||
files: Seq[FileStatus]): Option[StructType] = {
|
||||
if (files.isEmpty) {
|
||||
None
|
||||
} else {
|
||||
val parsedOptions: JSONOptions = new JSONOptions(options)
|
||||
val jsonFiles = files.filterNot { status =>
|
||||
val name = status.getPath.getName
|
||||
name.startsWith("_") || name.startsWith(".")
|
||||
}.toArray
|
||||
|
||||
new JSONRelation(
|
||||
inputRDD = None,
|
||||
maybeDataSchema = dataSchema,
|
||||
maybePartitionSpec = None,
|
||||
userDefinedPartitionColumns = partitionColumns,
|
||||
maybeBucketSpec = bucketSpec,
|
||||
paths = paths,
|
||||
parameters = parameters)(sqlContext)
|
||||
}
|
||||
}
|
||||
val jsonSchema = InferSchema.infer(
|
||||
createBaseRdd(sqlContext, jsonFiles),
|
||||
sqlContext.conf.columnNameOfCorruptRecord,
|
||||
parsedOptions)
|
||||
checkConstraints(jsonSchema)
|
||||
|
||||
private[sql] class JSONRelation(
|
||||
val inputRDD: Option[RDD[String]],
|
||||
val maybeDataSchema: Option[StructType],
|
||||
val maybePartitionSpec: Option[PartitionSpec],
|
||||
override val userDefinedPartitionColumns: Option[StructType],
|
||||
override val maybeBucketSpec: Option[BucketSpec] = None,
|
||||
override val paths: Array[String] = Array.empty[String],
|
||||
parameters: Map[String, String] = Map.empty[String, String])
|
||||
(@transient val sqlContext: SQLContext)
|
||||
extends HadoopFsRelation(maybePartitionSpec, parameters) {
|
||||
|
||||
val options: JSONOptions = new JSONOptions(parameters)
|
||||
|
||||
/** Constraints to be imposed on schema to be stored. */
|
||||
private def checkConstraints(schema: StructType): Unit = {
|
||||
if (schema.fieldNames.length != schema.fieldNames.distinct.length) {
|
||||
val duplicateColumns = schema.fieldNames.groupBy(identity).collect {
|
||||
case (x, ys) if ys.length > 1 => "\"" + x + "\""
|
||||
}.mkString(", ")
|
||||
throw new AnalysisException(s"Duplicate column(s) : $duplicateColumns found, " +
|
||||
s"cannot save to JSON format")
|
||||
Some(jsonSchema)
|
||||
}
|
||||
}
|
||||
|
||||
override val needConversion: Boolean = false
|
||||
override def prepareWrite(
|
||||
sqlContext: SQLContext,
|
||||
job: Job,
|
||||
options: Map[String, String],
|
||||
dataSchema: StructType): OutputWriterFactory = {
|
||||
val conf = job.getConfiguration
|
||||
val parsedOptions: JSONOptions = new JSONOptions(options)
|
||||
parsedOptions.compressionCodec.foreach { codec =>
|
||||
CompressionCodecs.setCodecConfiguration(conf, codec)
|
||||
}
|
||||
|
||||
private def createBaseRdd(inputPaths: Array[FileStatus]): RDD[String] = {
|
||||
new OutputWriterFactory {
|
||||
override def newInstance(
|
||||
path: String,
|
||||
bucketId: Option[Int],
|
||||
dataSchema: StructType,
|
||||
context: TaskAttemptContext): OutputWriter = {
|
||||
new JsonOutputWriter(path, bucketId, dataSchema, context)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
override def buildInternalScan(
|
||||
sqlContext: SQLContext,
|
||||
dataSchema: StructType,
|
||||
requiredColumns: Array[String],
|
||||
filters: Array[Filter],
|
||||
bucketSet: Option[BitSet],
|
||||
inputFiles: Array[FileStatus],
|
||||
broadcastedConf: Broadcast[SerializableConfiguration],
|
||||
options: Map[String, String]): RDD[InternalRow] = {
|
||||
// TODO: Filter files for all formats before calling buildInternalScan.
|
||||
val jsonFiles = inputFiles.filterNot(_.getPath.getName startsWith "_")
|
||||
|
||||
val parsedOptions: JSONOptions = new JSONOptions(options)
|
||||
val requiredDataSchema = StructType(requiredColumns.map(dataSchema(_)))
|
||||
val rows = JacksonParser.parse(
|
||||
createBaseRdd(sqlContext, jsonFiles),
|
||||
requiredDataSchema,
|
||||
sqlContext.conf.columnNameOfCorruptRecord,
|
||||
parsedOptions)
|
||||
|
||||
rows.mapPartitions { iterator =>
|
||||
val unsafeProjection = UnsafeProjection.create(requiredDataSchema)
|
||||
iterator.map(unsafeProjection)
|
||||
}
|
||||
}
|
||||
|
||||
private def createBaseRdd(sqlContext: SQLContext, inputPaths: Array[FileStatus]): RDD[String] = {
|
||||
val job = Job.getInstance(sqlContext.sparkContext.hadoopConfiguration)
|
||||
val conf = job.getConfiguration
|
||||
|
||||
|
@ -106,77 +132,19 @@ private[sql] class JSONRelation(
|
|||
classOf[Text]).map(_._2.toString) // get the text line
|
||||
}
|
||||
|
||||
override lazy val dataSchema: StructType = {
|
||||
val jsonSchema = maybeDataSchema.getOrElse {
|
||||
val files = cachedLeafStatuses().filterNot { status =>
|
||||
val name = status.getPath.getName
|
||||
name.startsWith("_") || name.startsWith(".")
|
||||
}.toArray
|
||||
InferSchema.infer(
|
||||
inputRDD.getOrElse(createBaseRdd(files)),
|
||||
sqlContext.conf.columnNameOfCorruptRecord,
|
||||
options)
|
||||
}
|
||||
checkConstraints(jsonSchema)
|
||||
|
||||
jsonSchema
|
||||
}
|
||||
|
||||
override private[sql] def buildInternalScan(
|
||||
requiredColumns: Array[String],
|
||||
filters: Array[Filter],
|
||||
inputPaths: Array[FileStatus],
|
||||
broadcastedConf: Broadcast[SerializableConfiguration]): RDD[InternalRow] = {
|
||||
val requiredDataSchema = StructType(requiredColumns.map(dataSchema(_)))
|
||||
val rows = JacksonParser.parse(
|
||||
inputRDD.getOrElse(createBaseRdd(inputPaths)),
|
||||
requiredDataSchema,
|
||||
sqlContext.conf.columnNameOfCorruptRecord,
|
||||
options)
|
||||
|
||||
rows.mapPartitions { iterator =>
|
||||
val unsafeProjection = UnsafeProjection.create(requiredDataSchema)
|
||||
iterator.map(unsafeProjection)
|
||||
/** Constraints to be imposed on schema to be stored. */
|
||||
private def checkConstraints(schema: StructType): Unit = {
|
||||
if (schema.fieldNames.length != schema.fieldNames.distinct.length) {
|
||||
val duplicateColumns = schema.fieldNames.groupBy(identity).collect {
|
||||
case (x, ys) if ys.length > 1 => "\"" + x + "\""
|
||||
}.mkString(", ")
|
||||
throw new AnalysisException(s"Duplicate column(s) : $duplicateColumns found, " +
|
||||
s"cannot save to JSON format")
|
||||
}
|
||||
}
|
||||
|
||||
override def equals(other: Any): Boolean = other match {
|
||||
case that: JSONRelation =>
|
||||
((inputRDD, that.inputRDD) match {
|
||||
case (Some(thizRdd), Some(thatRdd)) => thizRdd eq thatRdd
|
||||
case (None, None) => true
|
||||
case _ => false
|
||||
}) && paths.toSet == that.paths.toSet &&
|
||||
dataSchema == that.dataSchema &&
|
||||
schema == that.schema
|
||||
case _ => false
|
||||
}
|
||||
|
||||
override def hashCode(): Int = {
|
||||
Objects.hashCode(
|
||||
inputRDD,
|
||||
paths.toSet,
|
||||
dataSchema,
|
||||
schema,
|
||||
partitionColumns)
|
||||
}
|
||||
|
||||
override def prepareJobForWrite(job: Job): BucketedOutputWriterFactory = {
|
||||
val conf = job.getConfiguration
|
||||
options.compressionCodec.foreach { codec =>
|
||||
CompressionCodecs.setCodecConfiguration(conf, codec)
|
||||
}
|
||||
|
||||
new BucketedOutputWriterFactory {
|
||||
override def newInstance(
|
||||
path: String,
|
||||
bucketId: Option[Int],
|
||||
dataSchema: StructType,
|
||||
context: TaskAttemptContext): OutputWriter = {
|
||||
new JsonOutputWriter(path, bucketId, dataSchema, context)
|
||||
}
|
||||
}
|
||||
}
|
||||
override def toString: String = "JSON"
|
||||
override def equals(other: Any): Boolean = other.isInstanceOf[DefaultSource]
|
||||
}
|
||||
|
||||
private[json] class JsonOutputWriter(
|
||||
|
|
|
@ -25,7 +25,6 @@ import scala.collection.JavaConverters._
|
|||
import scala.collection.mutable
|
||||
import scala.util.{Failure, Try}
|
||||
|
||||
import com.google.common.base.Objects
|
||||
import org.apache.hadoop.conf.Configuration
|
||||
import org.apache.hadoop.fs.{FileStatus, Path}
|
||||
import org.apache.hadoop.io.Writable
|
||||
|
@ -51,193 +50,23 @@ import org.apache.spark.sql.internal.SQLConf
|
|||
import org.apache.spark.sql.sources._
|
||||
import org.apache.spark.sql.types.{DataType, StructType}
|
||||
import org.apache.spark.util.{SerializableConfiguration, Utils}
|
||||
import org.apache.spark.util.collection.BitSet
|
||||
|
||||
private[sql] class DefaultSource extends BucketedHadoopFsRelationProvider with DataSourceRegister {
|
||||
|
||||
private[sql] class DefaultSource extends FileFormat with DataSourceRegister with Logging {
|
||||
|
||||
override def shortName(): String = "parquet"
|
||||
|
||||
override def createRelation(
|
||||
override def toString: String = "ParquetFormat"
|
||||
|
||||
override def equals(other: Any): Boolean = other.isInstanceOf[DefaultSource]
|
||||
|
||||
override def prepareWrite(
|
||||
sqlContext: SQLContext,
|
||||
paths: Array[String],
|
||||
schema: Option[StructType],
|
||||
partitionColumns: Option[StructType],
|
||||
bucketSpec: Option[BucketSpec],
|
||||
parameters: Map[String, String]): HadoopFsRelation = {
|
||||
new ParquetRelation(paths, schema, None, partitionColumns, bucketSpec, parameters)(sqlContext)
|
||||
}
|
||||
}
|
||||
job: Job,
|
||||
options: Map[String, String],
|
||||
dataSchema: StructType): OutputWriterFactory = {
|
||||
|
||||
// NOTE: This class is instantiated and used on executor side only, no need to be serializable.
|
||||
private[sql] class ParquetOutputWriter(
|
||||
path: String,
|
||||
bucketId: Option[Int],
|
||||
context: TaskAttemptContext)
|
||||
extends OutputWriter {
|
||||
|
||||
private val recordWriter: RecordWriter[Void, InternalRow] = {
|
||||
val outputFormat = {
|
||||
new ParquetOutputFormat[InternalRow]() {
|
||||
// Here we override `getDefaultWorkFile` for two reasons:
|
||||
//
|
||||
// 1. To allow appending. We need to generate unique output file names to avoid
|
||||
// overwriting existing files (either exist before the write job, or are just written
|
||||
// by other tasks within the same write job).
|
||||
//
|
||||
// 2. To allow dynamic partitioning. Default `getDefaultWorkFile` uses
|
||||
// `FileOutputCommitter.getWorkPath()`, which points to the base directory of all
|
||||
// partitions in the case of dynamic partitioning.
|
||||
override def getDefaultWorkFile(context: TaskAttemptContext, extension: String): Path = {
|
||||
val configuration = context.getConfiguration
|
||||
val uniqueWriteJobId = configuration.get("spark.sql.sources.writeJobUUID")
|
||||
val taskAttemptId = context.getTaskAttemptID
|
||||
val split = taskAttemptId.getTaskID.getId
|
||||
val bucketString = bucketId.map(BucketingUtils.bucketIdToString).getOrElse("")
|
||||
new Path(path, f"part-r-$split%05d-$uniqueWriteJobId$bucketString$extension")
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
outputFormat.getRecordWriter(context)
|
||||
}
|
||||
|
||||
override def write(row: Row): Unit = throw new UnsupportedOperationException("call writeInternal")
|
||||
|
||||
override protected[sql] def writeInternal(row: InternalRow): Unit = recordWriter.write(null, row)
|
||||
|
||||
override def close(): Unit = recordWriter.close(context)
|
||||
}
|
||||
|
||||
private[sql] class ParquetRelation(
|
||||
override val paths: Array[String],
|
||||
private val maybeDataSchema: Option[StructType],
|
||||
// This is for metastore conversion.
|
||||
private val maybePartitionSpec: Option[PartitionSpec],
|
||||
override val userDefinedPartitionColumns: Option[StructType],
|
||||
override val maybeBucketSpec: Option[BucketSpec],
|
||||
parameters: Map[String, String])(
|
||||
val sqlContext: SQLContext)
|
||||
extends HadoopFsRelation(maybePartitionSpec, parameters)
|
||||
with Logging {
|
||||
|
||||
private[sql] def this(
|
||||
paths: Array[String],
|
||||
maybeDataSchema: Option[StructType],
|
||||
maybePartitionSpec: Option[PartitionSpec],
|
||||
parameters: Map[String, String])(
|
||||
sqlContext: SQLContext) = {
|
||||
this(
|
||||
paths,
|
||||
maybeDataSchema,
|
||||
maybePartitionSpec,
|
||||
maybePartitionSpec.map(_.partitionColumns),
|
||||
None,
|
||||
parameters)(sqlContext)
|
||||
}
|
||||
|
||||
// Should we merge schemas from all Parquet part-files?
|
||||
private val shouldMergeSchemas =
|
||||
parameters
|
||||
.get(ParquetRelation.MERGE_SCHEMA)
|
||||
.map(_.toBoolean)
|
||||
.getOrElse(sqlContext.conf.getConf(SQLConf.PARQUET_SCHEMA_MERGING_ENABLED))
|
||||
|
||||
private val mergeRespectSummaries =
|
||||
sqlContext.conf.getConf(SQLConf.PARQUET_SCHEMA_RESPECT_SUMMARIES)
|
||||
|
||||
private val maybeMetastoreSchema = parameters
|
||||
.get(ParquetRelation.METASTORE_SCHEMA)
|
||||
.map(DataType.fromJson(_).asInstanceOf[StructType])
|
||||
|
||||
private val compressionCodec: Option[String] = parameters
|
||||
.get("compression")
|
||||
.map { codecName =>
|
||||
// Validate if given compression codec is supported or not.
|
||||
val shortParquetCompressionCodecNames = ParquetRelation.shortParquetCompressionCodecNames
|
||||
if (!shortParquetCompressionCodecNames.contains(codecName.toLowerCase)) {
|
||||
val availableCodecs = shortParquetCompressionCodecNames.keys.map(_.toLowerCase)
|
||||
throw new IllegalArgumentException(s"Codec [$codecName] " +
|
||||
s"is not available. Available codecs are ${availableCodecs.mkString(", ")}.")
|
||||
}
|
||||
codecName.toLowerCase
|
||||
}
|
||||
|
||||
private lazy val metadataCache: MetadataCache = {
|
||||
val meta = new MetadataCache
|
||||
meta.refresh()
|
||||
meta
|
||||
}
|
||||
|
||||
override def toString: String = {
|
||||
parameters.get(ParquetRelation.METASTORE_TABLE_NAME).map { tableName =>
|
||||
s"${getClass.getSimpleName}: $tableName"
|
||||
}.getOrElse(super.toString)
|
||||
}
|
||||
|
||||
override def equals(other: Any): Boolean = other match {
|
||||
case that: ParquetRelation =>
|
||||
val schemaEquality = if (shouldMergeSchemas) {
|
||||
this.shouldMergeSchemas == that.shouldMergeSchemas
|
||||
} else {
|
||||
this.dataSchema == that.dataSchema &&
|
||||
this.schema == that.schema
|
||||
}
|
||||
|
||||
this.paths.toSet == that.paths.toSet &&
|
||||
schemaEquality &&
|
||||
this.maybeDataSchema == that.maybeDataSchema &&
|
||||
this.partitionColumns == that.partitionColumns
|
||||
|
||||
case _ => false
|
||||
}
|
||||
|
||||
override def hashCode(): Int = {
|
||||
if (shouldMergeSchemas) {
|
||||
Objects.hashCode(
|
||||
Boolean.box(shouldMergeSchemas),
|
||||
paths.toSet,
|
||||
maybeDataSchema,
|
||||
partitionColumns)
|
||||
} else {
|
||||
Objects.hashCode(
|
||||
Boolean.box(shouldMergeSchemas),
|
||||
paths.toSet,
|
||||
dataSchema,
|
||||
schema,
|
||||
maybeDataSchema,
|
||||
partitionColumns)
|
||||
}
|
||||
}
|
||||
|
||||
/** Constraints on schema of dataframe to be stored. */
|
||||
private def checkConstraints(schema: StructType): Unit = {
|
||||
if (schema.fieldNames.length != schema.fieldNames.distinct.length) {
|
||||
val duplicateColumns = schema.fieldNames.groupBy(identity).collect {
|
||||
case (x, ys) if ys.length > 1 => "\"" + x + "\""
|
||||
}.mkString(", ")
|
||||
throw new AnalysisException(s"Duplicate column(s) : $duplicateColumns found, " +
|
||||
s"cannot save to parquet format")
|
||||
}
|
||||
}
|
||||
|
||||
override def dataSchema: StructType = {
|
||||
val schema = maybeDataSchema.getOrElse(metadataCache.dataSchema)
|
||||
// check if schema satisfies the constraints
|
||||
// before moving forward
|
||||
checkConstraints(schema)
|
||||
schema
|
||||
}
|
||||
|
||||
override private[sql] def refresh(): Unit = {
|
||||
super.refresh()
|
||||
metadataCache.refresh()
|
||||
}
|
||||
|
||||
// Parquet data source always uses Catalyst internal representations.
|
||||
override val needConversion: Boolean = false
|
||||
|
||||
override def sizeInBytes: Long = metadataCache.dataStatuses.map(_.getLen).sum
|
||||
|
||||
override def prepareJobForWrite(job: Job): BucketedOutputWriterFactory = {
|
||||
val conf = ContextUtil.getConfiguration(job)
|
||||
|
||||
// SPARK-9849 DirectParquetOutputCommitter qualified name should be backward compatible
|
||||
|
@ -255,11 +84,24 @@ private[sql] class ParquetRelation(
|
|||
|
||||
if (conf.get(SQLConf.PARQUET_OUTPUT_COMMITTER_CLASS.key) == null) {
|
||||
logInfo("Using default output committer for Parquet: " +
|
||||
classOf[ParquetOutputCommitter].getCanonicalName)
|
||||
classOf[ParquetOutputCommitter].getCanonicalName)
|
||||
} else {
|
||||
logInfo("Using user defined output committer for Parquet: " + committerClass.getCanonicalName)
|
||||
}
|
||||
|
||||
val compressionCodec: Option[String] = options
|
||||
.get("compression")
|
||||
.map { codecName =>
|
||||
// Validate if given compression codec is supported or not.
|
||||
val shortParquetCompressionCodecNames = ParquetRelation.shortParquetCompressionCodecNames
|
||||
if (!shortParquetCompressionCodecNames.contains(codecName.toLowerCase)) {
|
||||
val availableCodecs = shortParquetCompressionCodecNames.keys.map(_.toLowerCase)
|
||||
throw new IllegalArgumentException(s"Codec [$codecName] " +
|
||||
s"is not available. Available codecs are ${availableCodecs.mkString(", ")}.")
|
||||
}
|
||||
codecName.toLowerCase
|
||||
}
|
||||
|
||||
conf.setClass(
|
||||
SQLConf.OUTPUT_COMMITTER_CLASS.key,
|
||||
committerClass,
|
||||
|
@ -303,7 +145,7 @@ private[sql] class ParquetRelation(
|
|||
.getOrElse(sqlContext.conf.parquetCompressionCodec.toLowerCase),
|
||||
CompressionCodecName.UNCOMPRESSED).name())
|
||||
|
||||
new BucketedOutputWriterFactory {
|
||||
new OutputWriterFactory {
|
||||
override def newInstance(
|
||||
path: String,
|
||||
bucketId: Option[Int],
|
||||
|
@ -314,11 +156,127 @@ private[sql] class ParquetRelation(
|
|||
}
|
||||
}
|
||||
|
||||
def inferSchema(
|
||||
sqlContext: SQLContext,
|
||||
parameters: Map[String, String],
|
||||
files: Seq[FileStatus]): Option[StructType] = {
|
||||
// Should we merge schemas from all Parquet part-files?
|
||||
val shouldMergeSchemas =
|
||||
parameters
|
||||
.get(ParquetRelation.MERGE_SCHEMA)
|
||||
.map(_.toBoolean)
|
||||
.getOrElse(sqlContext.conf.getConf(SQLConf.PARQUET_SCHEMA_MERGING_ENABLED))
|
||||
|
||||
val mergeRespectSummaries =
|
||||
sqlContext.conf.getConf(SQLConf.PARQUET_SCHEMA_RESPECT_SUMMARIES)
|
||||
|
||||
val filesByType = splitFiles(files)
|
||||
|
||||
// Sees which file(s) we need to touch in order to figure out the schema.
|
||||
//
|
||||
// Always tries the summary files first if users don't require a merged schema. In this case,
|
||||
// "_common_metadata" is more preferable than "_metadata" because it doesn't contain row
|
||||
// groups information, and could be much smaller for large Parquet files with lots of row
|
||||
// groups. If no summary file is available, falls back to some random part-file.
|
||||
//
|
||||
// NOTE: Metadata stored in the summary files are merged from all part-files. However, for
|
||||
// user defined key-value metadata (in which we store Spark SQL schema), Parquet doesn't know
|
||||
// how to merge them correctly if some key is associated with different values in different
|
||||
// part-files. When this happens, Parquet simply gives up generating the summary file. This
|
||||
// implies that if a summary file presents, then:
|
||||
//
|
||||
// 1. Either all part-files have exactly the same Spark SQL schema, or
|
||||
// 2. Some part-files don't contain Spark SQL schema in the key-value metadata at all (thus
|
||||
// their schemas may differ from each other).
|
||||
//
|
||||
// Here we tend to be pessimistic and take the second case into account. Basically this means
|
||||
// we can't trust the summary files if users require a merged schema, and must touch all part-
|
||||
// files to do the merge.
|
||||
val filesToTouch =
|
||||
if (shouldMergeSchemas) {
|
||||
// Also includes summary files, 'cause there might be empty partition directories.
|
||||
|
||||
// If mergeRespectSummaries config is true, we assume that all part-files are the same for
|
||||
// their schema with summary files, so we ignore them when merging schema.
|
||||
// If the config is disabled, which is the default setting, we merge all part-files.
|
||||
// In this mode, we only need to merge schemas contained in all those summary files.
|
||||
// You should enable this configuration only if you are very sure that for the parquet
|
||||
// part-files to read there are corresponding summary files containing correct schema.
|
||||
|
||||
// As filed in SPARK-11500, the order of files to touch is a matter, which might affect
|
||||
// the ordering of the output columns. There are several things to mention here.
|
||||
//
|
||||
// 1. If mergeRespectSummaries config is false, then it merges schemas by reducing from
|
||||
// the first part-file so that the columns of the lexicographically first file show
|
||||
// first.
|
||||
//
|
||||
// 2. If mergeRespectSummaries config is true, then there should be, at least,
|
||||
// "_metadata"s for all given files, so that we can ensure the columns of
|
||||
// the lexicographically first file show first.
|
||||
//
|
||||
// 3. If shouldMergeSchemas is false, but when multiple files are given, there is
|
||||
// no guarantee of the output order, since there might not be a summary file for the
|
||||
// lexicographically first file, which ends up putting ahead the columns of
|
||||
// the other files. However, this should be okay since not enabling
|
||||
// shouldMergeSchemas means (assumes) all the files have the same schemas.
|
||||
|
||||
val needMerged: Seq[FileStatus] =
|
||||
if (mergeRespectSummaries) {
|
||||
Seq()
|
||||
} else {
|
||||
filesByType.data
|
||||
}
|
||||
needMerged ++ filesByType.metadata ++ filesByType.commonMetadata
|
||||
} else {
|
||||
// Tries any "_common_metadata" first. Parquet files written by old versions or Parquet
|
||||
// don't have this.
|
||||
filesByType.commonMetadata.headOption
|
||||
// Falls back to "_metadata"
|
||||
.orElse(filesByType.metadata.headOption)
|
||||
// Summary file(s) not found, the Parquet file is either corrupted, or different part-
|
||||
// files contain conflicting user defined metadata (two or more values are associated
|
||||
// with a same key in different files). In either case, we fall back to any of the
|
||||
// first part-file, and just assume all schemas are consistent.
|
||||
.orElse(filesByType.data.headOption)
|
||||
.toSeq
|
||||
}
|
||||
ParquetRelation.mergeSchemasInParallel(filesToTouch, sqlContext)
|
||||
}
|
||||
|
||||
case class FileTypes(
|
||||
data: Seq[FileStatus],
|
||||
metadata: Seq[FileStatus],
|
||||
commonMetadata: Seq[FileStatus])
|
||||
|
||||
private def splitFiles(allFiles: Seq[FileStatus]): FileTypes = {
|
||||
// Lists `FileStatus`es of all leaf nodes (files) under all base directories.
|
||||
val leaves = allFiles.filter { f =>
|
||||
isSummaryFile(f.getPath) ||
|
||||
!(f.getPath.getName.startsWith("_") || f.getPath.getName.startsWith("."))
|
||||
}.toArray.sortBy(_.getPath.toString)
|
||||
|
||||
FileTypes(
|
||||
data = leaves.filterNot(f => isSummaryFile(f.getPath)),
|
||||
metadata =
|
||||
leaves.filter(_.getPath.getName == ParquetFileWriter.PARQUET_METADATA_FILE),
|
||||
commonMetadata =
|
||||
leaves.filter(_.getPath.getName == ParquetFileWriter.PARQUET_COMMON_METADATA_FILE))
|
||||
}
|
||||
|
||||
private def isSummaryFile(file: Path): Boolean = {
|
||||
file.getName == ParquetFileWriter.PARQUET_COMMON_METADATA_FILE ||
|
||||
file.getName == ParquetFileWriter.PARQUET_METADATA_FILE
|
||||
}
|
||||
|
||||
override def buildInternalScan(
|
||||
sqlContext: SQLContext,
|
||||
dataSchema: StructType,
|
||||
requiredColumns: Array[String],
|
||||
filters: Array[Filter],
|
||||
inputFiles: Array[FileStatus],
|
||||
broadcastedConf: Broadcast[SerializableConfiguration]): RDD[InternalRow] = {
|
||||
bucketSet: Option[BitSet],
|
||||
allFiles: Array[FileStatus],
|
||||
broadcastedConf: Broadcast[SerializableConfiguration],
|
||||
options: Map[String, String]): RDD[InternalRow] = {
|
||||
val useMetadataCache = sqlContext.getConf(SQLConf.PARQUET_CACHE_METADATA)
|
||||
val parquetFilterPushDown = sqlContext.conf.parquetFilterPushDown
|
||||
val assumeBinaryIsString = sqlContext.conf.isParquetBinaryAsString
|
||||
|
@ -341,6 +299,8 @@ private[sql] class ParquetRelation(
|
|||
assumeBinaryIsString,
|
||||
assumeInt96IsTimestamp) _
|
||||
|
||||
val inputFiles = splitFiles(allFiles).data.toArray
|
||||
|
||||
// Create the function to set input paths at the driver side.
|
||||
val setInputPaths =
|
||||
ParquetRelation.initializeDriverSideJobFunc(inputFiles, parquetBlockSize) _
|
||||
|
@ -392,153 +352,46 @@ private[sql] class ParquetRelation(
|
|||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
private class MetadataCache {
|
||||
// `FileStatus` objects of all "_metadata" files.
|
||||
private var metadataStatuses: Array[FileStatus] = _
|
||||
// NOTE: This class is instantiated and used on executor side only, no need to be serializable.
|
||||
private[sql] class ParquetOutputWriter(
|
||||
path: String,
|
||||
bucketId: Option[Int],
|
||||
context: TaskAttemptContext)
|
||||
extends OutputWriter {
|
||||
|
||||
// `FileStatus` objects of all "_common_metadata" files.
|
||||
private var commonMetadataStatuses: Array[FileStatus] = _
|
||||
|
||||
// `FileStatus` objects of all data files (Parquet part-files).
|
||||
var dataStatuses: Array[FileStatus] = _
|
||||
|
||||
// Schema of the actual Parquet files, without partition columns discovered from partition
|
||||
// directory paths.
|
||||
var dataSchema: StructType = null
|
||||
|
||||
// Schema of the whole table, including partition columns.
|
||||
var schema: StructType = _
|
||||
|
||||
// Cached leaves
|
||||
var cachedLeaves: mutable.LinkedHashSet[FileStatus] = null
|
||||
|
||||
/**
|
||||
* Refreshes `FileStatus`es, footers, partition spec, and table schema.
|
||||
*/
|
||||
def refresh(): Unit = {
|
||||
val currentLeafStatuses = cachedLeafStatuses()
|
||||
|
||||
// Check if cachedLeafStatuses is changed or not
|
||||
val leafStatusesChanged = (cachedLeaves == null) ||
|
||||
!cachedLeaves.equals(currentLeafStatuses)
|
||||
|
||||
if (leafStatusesChanged) {
|
||||
cachedLeaves = currentLeafStatuses
|
||||
|
||||
// Lists `FileStatus`es of all leaf nodes (files) under all base directories.
|
||||
val leaves = currentLeafStatuses.filter { f =>
|
||||
isSummaryFile(f.getPath) ||
|
||||
!(f.getPath.getName.startsWith("_") || f.getPath.getName.startsWith("."))
|
||||
}.toArray.sortBy(_.getPath.toString)
|
||||
|
||||
dataStatuses = leaves.filterNot(f => isSummaryFile(f.getPath))
|
||||
metadataStatuses =
|
||||
leaves.filter(_.getPath.getName == ParquetFileWriter.PARQUET_METADATA_FILE)
|
||||
commonMetadataStatuses =
|
||||
leaves.filter(_.getPath.getName == ParquetFileWriter.PARQUET_COMMON_METADATA_FILE)
|
||||
|
||||
dataSchema = {
|
||||
val dataSchema0 = maybeDataSchema
|
||||
.orElse(readSchema())
|
||||
.orElse(maybeMetastoreSchema)
|
||||
.getOrElse(throw new AnalysisException(
|
||||
s"Failed to discover schema of Parquet file(s) in the following location(s):\n" +
|
||||
paths.mkString("\n\t")))
|
||||
|
||||
// If this Parquet relation is converted from a Hive Metastore table, must reconcile case
|
||||
// case insensitivity issue and possible schema mismatch (probably caused by schema
|
||||
// evolution).
|
||||
maybeMetastoreSchema
|
||||
.map(ParquetRelation.mergeMetastoreParquetSchema(_, dataSchema0))
|
||||
.getOrElse(dataSchema0)
|
||||
private val recordWriter: RecordWriter[Void, InternalRow] = {
|
||||
val outputFormat = {
|
||||
new ParquetOutputFormat[InternalRow]() {
|
||||
// Here we override `getDefaultWorkFile` for two reasons:
|
||||
//
|
||||
// 1. To allow appending. We need to generate unique output file names to avoid
|
||||
// overwriting existing files (either exist before the write job, or are just written
|
||||
// by other tasks within the same write job).
|
||||
//
|
||||
// 2. To allow dynamic partitioning. Default `getDefaultWorkFile` uses
|
||||
// `FileOutputCommitter.getWorkPath()`, which points to the base directory of all
|
||||
// partitions in the case of dynamic partitioning.
|
||||
override def getDefaultWorkFile(context: TaskAttemptContext, extension: String): Path = {
|
||||
val configuration = context.getConfiguration
|
||||
val uniqueWriteJobId = configuration.get("spark.sql.sources.writeJobUUID")
|
||||
val taskAttemptId = context.getTaskAttemptID
|
||||
val split = taskAttemptId.getTaskID.getId
|
||||
val bucketString = bucketId.map(BucketingUtils.bucketIdToString).getOrElse("")
|
||||
new Path(path, f"part-r-$split%05d-$uniqueWriteJobId$bucketString$extension")
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
private def isSummaryFile(file: Path): Boolean = {
|
||||
file.getName == ParquetFileWriter.PARQUET_COMMON_METADATA_FILE ||
|
||||
file.getName == ParquetFileWriter.PARQUET_METADATA_FILE
|
||||
}
|
||||
|
||||
private def readSchema(): Option[StructType] = {
|
||||
// Sees which file(s) we need to touch in order to figure out the schema.
|
||||
//
|
||||
// Always tries the summary files first if users don't require a merged schema. In this case,
|
||||
// "_common_metadata" is more preferable than "_metadata" because it doesn't contain row
|
||||
// groups information, and could be much smaller for large Parquet files with lots of row
|
||||
// groups. If no summary file is available, falls back to some random part-file.
|
||||
//
|
||||
// NOTE: Metadata stored in the summary files are merged from all part-files. However, for
|
||||
// user defined key-value metadata (in which we store Spark SQL schema), Parquet doesn't know
|
||||
// how to merge them correctly if some key is associated with different values in different
|
||||
// part-files. When this happens, Parquet simply gives up generating the summary file. This
|
||||
// implies that if a summary file presents, then:
|
||||
//
|
||||
// 1. Either all part-files have exactly the same Spark SQL schema, or
|
||||
// 2. Some part-files don't contain Spark SQL schema in the key-value metadata at all (thus
|
||||
// their schemas may differ from each other).
|
||||
//
|
||||
// Here we tend to be pessimistic and take the second case into account. Basically this means
|
||||
// we can't trust the summary files if users require a merged schema, and must touch all part-
|
||||
// files to do the merge.
|
||||
val filesToTouch =
|
||||
if (shouldMergeSchemas) {
|
||||
// Also includes summary files, 'cause there might be empty partition directories.
|
||||
|
||||
// If mergeRespectSummaries config is true, we assume that all part-files are the same for
|
||||
// their schema with summary files, so we ignore them when merging schema.
|
||||
// If the config is disabled, which is the default setting, we merge all part-files.
|
||||
// In this mode, we only need to merge schemas contained in all those summary files.
|
||||
// You should enable this configuration only if you are very sure that for the parquet
|
||||
// part-files to read there are corresponding summary files containing correct schema.
|
||||
|
||||
// As filed in SPARK-11500, the order of files to touch is a matter, which might affect
|
||||
// the ordering of the output columns. There are several things to mention here.
|
||||
//
|
||||
// 1. If mergeRespectSummaries config is false, then it merges schemas by reducing from
|
||||
// the first part-file so that the columns of the lexicographically first file show
|
||||
// first.
|
||||
//
|
||||
// 2. If mergeRespectSummaries config is true, then there should be, at least,
|
||||
// "_metadata"s for all given files, so that we can ensure the columns of
|
||||
// the lexicographically first file show first.
|
||||
//
|
||||
// 3. If shouldMergeSchemas is false, but when multiple files are given, there is
|
||||
// no guarantee of the output order, since there might not be a summary file for the
|
||||
// lexicographically first file, which ends up putting ahead the columns of
|
||||
// the other files. However, this should be okay since not enabling
|
||||
// shouldMergeSchemas means (assumes) all the files have the same schemas.
|
||||
|
||||
val needMerged: Seq[FileStatus] =
|
||||
if (mergeRespectSummaries) {
|
||||
Seq()
|
||||
} else {
|
||||
dataStatuses
|
||||
}
|
||||
needMerged ++ metadataStatuses ++ commonMetadataStatuses
|
||||
} else {
|
||||
// Tries any "_common_metadata" first. Parquet files written by old versions or Parquet
|
||||
// don't have this.
|
||||
commonMetadataStatuses.headOption
|
||||
// Falls back to "_metadata"
|
||||
.orElse(metadataStatuses.headOption)
|
||||
// Summary file(s) not found, the Parquet file is either corrupted, or different part-
|
||||
// files contain conflicting user defined metadata (two or more values are associated
|
||||
// with a same key in different files). In either case, we fall back to any of the
|
||||
// first part-file, and just assume all schemas are consistent.
|
||||
.orElse(dataStatuses.headOption)
|
||||
.toSeq
|
||||
}
|
||||
|
||||
assert(
|
||||
filesToTouch.nonEmpty || maybeDataSchema.isDefined || maybeMetastoreSchema.isDefined,
|
||||
"No predefined schema found, " +
|
||||
s"and no Parquet data files or summary files found under ${paths.mkString(", ")}.")
|
||||
|
||||
ParquetRelation.mergeSchemasInParallel(filesToTouch, sqlContext)
|
||||
}
|
||||
outputFormat.getRecordWriter(context)
|
||||
}
|
||||
|
||||
override def write(row: Row): Unit = throw new UnsupportedOperationException("call writeInternal")
|
||||
|
||||
override protected[sql] def writeInternal(row: InternalRow): Unit = recordWriter.write(null, row)
|
||||
|
||||
override def close(): Unit = recordWriter.close(context)
|
||||
}
|
||||
|
||||
private[sql] object ParquetRelation extends Logging {
|
||||
|
@ -699,7 +552,7 @@ private[sql] object ParquetRelation extends Logging {
|
|||
* distinguish binary and string). This method generates a correct schema by merging Metastore
|
||||
* schema data types and Parquet schema field names.
|
||||
*/
|
||||
private[parquet] def mergeMetastoreParquetSchema(
|
||||
private[sql] def mergeMetastoreParquetSchema(
|
||||
metastoreSchema: StructType,
|
||||
parquetSchema: StructType): StructType = {
|
||||
def schemaConflictMessage: String =
|
||||
|
|
|
@ -34,6 +34,7 @@ private[sql] class ResolveDataSource(sqlContext: SQLContext) extends Rule[Logica
|
|||
try {
|
||||
val resolved = ResolvedDataSource(
|
||||
sqlContext,
|
||||
paths = Seq.empty,
|
||||
userSpecifiedSchema = None,
|
||||
partitionColumns = Array(),
|
||||
bucketSpec = None,
|
||||
|
@ -130,7 +131,7 @@ private[sql] case class PreWriteCheck(catalog: Catalog) extends (LogicalPlan =>
|
|||
LogicalRelation(r: HadoopFsRelation, _, _), part, query, overwrite, _) =>
|
||||
// We need to make sure the partition columns specified by users do match partition
|
||||
// columns of the relation.
|
||||
val existingPartitionColumns = r.partitionColumns.fieldNames.toSet
|
||||
val existingPartitionColumns = r.partitionSchema.fieldNames.toSet
|
||||
val specifiedPartitionColumns = part.keySet
|
||||
if (existingPartitionColumns != specifiedPartitionColumns) {
|
||||
failAnalysis(s"Specified partition columns " +
|
||||
|
|
|
@ -31,25 +31,16 @@ import org.apache.spark.sql.{AnalysisException, Row, SQLContext}
|
|||
import org.apache.spark.sql.catalyst.InternalRow
|
||||
import org.apache.spark.sql.catalyst.expressions.UnsafeRow
|
||||
import org.apache.spark.sql.catalyst.expressions.codegen.{BufferHolder, UnsafeRowWriter}
|
||||
import org.apache.spark.sql.execution.datasources.{CompressionCodecs, PartitionSpec}
|
||||
import org.apache.spark.sql.execution.datasources.CompressionCodecs
|
||||
import org.apache.spark.sql.sources._
|
||||
import org.apache.spark.sql.types.{StringType, StructType}
|
||||
import org.apache.spark.util.SerializableConfiguration
|
||||
import org.apache.spark.util.collection.BitSet
|
||||
|
||||
/**
|
||||
* A data source for reading text files.
|
||||
*/
|
||||
class DefaultSource extends HadoopFsRelationProvider with DataSourceRegister {
|
||||
|
||||
override def createRelation(
|
||||
sqlContext: SQLContext,
|
||||
paths: Array[String],
|
||||
dataSchema: Option[StructType],
|
||||
partitionColumns: Option[StructType],
|
||||
parameters: Map[String, String]): HadoopFsRelation = {
|
||||
dataSchema.foreach(verifySchema)
|
||||
new TextRelation(None, dataSchema, partitionColumns, paths, parameters)(sqlContext)
|
||||
}
|
||||
class DefaultSource extends FileFormat with DataSourceRegister {
|
||||
|
||||
override def shortName(): String = "text"
|
||||
|
||||
|
@ -64,58 +55,21 @@ class DefaultSource extends HadoopFsRelationProvider with DataSourceRegister {
|
|||
s"Text data source supports only a string column, but you have ${tpe.simpleString}.")
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
private[sql] class TextRelation(
|
||||
val maybePartitionSpec: Option[PartitionSpec],
|
||||
val textSchema: Option[StructType],
|
||||
override val userDefinedPartitionColumns: Option[StructType],
|
||||
override val paths: Array[String] = Array.empty[String],
|
||||
parameters: Map[String, String] = Map.empty[String, String])
|
||||
(@transient val sqlContext: SQLContext)
|
||||
extends HadoopFsRelation(maybePartitionSpec, parameters) {
|
||||
override def inferSchema(
|
||||
sqlContext: SQLContext,
|
||||
options: Map[String, String],
|
||||
files: Seq[FileStatus]): Option[StructType] = Some(new StructType().add("value", StringType))
|
||||
|
||||
/** Data schema is always a single column, named "value" if original Data source has no schema. */
|
||||
override def dataSchema: StructType =
|
||||
textSchema.getOrElse(new StructType().add("value", StringType))
|
||||
/** This is an internal data source that outputs internal row format. */
|
||||
override val needConversion: Boolean = false
|
||||
override def prepareWrite(
|
||||
sqlContext: SQLContext,
|
||||
job: Job,
|
||||
options: Map[String, String],
|
||||
dataSchema: StructType): OutputWriterFactory = {
|
||||
verifySchema(dataSchema)
|
||||
|
||||
|
||||
override private[sql] def buildInternalScan(
|
||||
requiredColumns: Array[String],
|
||||
filters: Array[Filter],
|
||||
inputPaths: Array[FileStatus],
|
||||
broadcastedConf: Broadcast[SerializableConfiguration]): RDD[InternalRow] = {
|
||||
val job = Job.getInstance(sqlContext.sparkContext.hadoopConfiguration)
|
||||
val conf = job.getConfiguration
|
||||
val paths = inputPaths.map(_.getPath).sortBy(_.toUri)
|
||||
|
||||
if (paths.nonEmpty) {
|
||||
FileInputFormat.setInputPaths(job, paths: _*)
|
||||
}
|
||||
|
||||
sqlContext.sparkContext.hadoopRDD(
|
||||
conf.asInstanceOf[JobConf], classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
|
||||
.mapPartitions { iter =>
|
||||
val unsafeRow = new UnsafeRow(1)
|
||||
val bufferHolder = new BufferHolder(unsafeRow)
|
||||
val unsafeRowWriter = new UnsafeRowWriter(bufferHolder, 1)
|
||||
|
||||
iter.map { case (_, line) =>
|
||||
// Writes to an UnsafeRow directly
|
||||
bufferHolder.reset()
|
||||
unsafeRowWriter.write(0, line.getBytes, 0, line.getLength)
|
||||
unsafeRow.setTotalSize(bufferHolder.totalSize())
|
||||
unsafeRow
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/** Write path. */
|
||||
override def prepareJobForWrite(job: Job): OutputWriterFactory = {
|
||||
val conf = job.getConfiguration
|
||||
val compressionCodec = parameters.get("compression").map(CompressionCodecs.getCodecClassName)
|
||||
val compressionCodec = options.get("compression").map(CompressionCodecs.getCodecClassName)
|
||||
compressionCodec.foreach { codec =>
|
||||
CompressionCodecs.setCodecConfiguration(conf, codec)
|
||||
}
|
||||
|
@ -123,21 +77,54 @@ private[sql] class TextRelation(
|
|||
new OutputWriterFactory {
|
||||
override def newInstance(
|
||||
path: String,
|
||||
bucketId: Option[Int],
|
||||
dataSchema: StructType,
|
||||
context: TaskAttemptContext): OutputWriter = {
|
||||
if (bucketId.isDefined) {
|
||||
throw new AnalysisException("Text doesn't support bucketing")
|
||||
}
|
||||
new TextOutputWriter(path, dataSchema, context)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
override def equals(other: Any): Boolean = other match {
|
||||
case that: TextRelation =>
|
||||
paths.toSet == that.paths.toSet && partitionColumns == that.partitionColumns
|
||||
case _ => false
|
||||
}
|
||||
override def buildInternalScan(
|
||||
sqlContext: SQLContext,
|
||||
dataSchema: StructType,
|
||||
requiredColumns: Array[String],
|
||||
filters: Array[Filter],
|
||||
bucketSet: Option[BitSet],
|
||||
inputFiles: Array[FileStatus],
|
||||
broadcastedConf: Broadcast[SerializableConfiguration],
|
||||
options: Map[String, String]): RDD[InternalRow] = {
|
||||
verifySchema(dataSchema)
|
||||
|
||||
override def hashCode(): Int = {
|
||||
Objects.hashCode(paths.toSet, partitionColumns)
|
||||
val job = Job.getInstance(sqlContext.sparkContext.hadoopConfiguration)
|
||||
val conf = job.getConfiguration
|
||||
val paths = inputFiles
|
||||
.filterNot(_.getPath.getName startsWith "_")
|
||||
.map(_.getPath)
|
||||
.sortBy(_.toUri)
|
||||
|
||||
if (paths.nonEmpty) {
|
||||
FileInputFormat.setInputPaths(job, paths: _*)
|
||||
}
|
||||
|
||||
sqlContext.sparkContext.hadoopRDD(
|
||||
conf.asInstanceOf[JobConf], classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
|
||||
.mapPartitions { iter =>
|
||||
val unsafeRow = new UnsafeRow(1)
|
||||
val bufferHolder = new BufferHolder(unsafeRow)
|
||||
val unsafeRowWriter = new UnsafeRowWriter(bufferHolder, 1)
|
||||
|
||||
iter.map { case (_, line) =>
|
||||
// Writes to an UnsafeRow directly
|
||||
bufferHolder.reset()
|
||||
unsafeRowWriter.write(0, line.getBytes, 0, line.getLength)
|
||||
unsafeRow.setTotalSize(bufferHolder.totalSize())
|
||||
unsafeRow
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
|
@ -170,3 +157,4 @@ class TextOutputWriter(path: String, dataSchema: StructType, context: TaskAttemp
|
|||
recordWriter.close(context)
|
||||
}
|
||||
}
|
||||
|
||||
|
|
|
@ -23,7 +23,7 @@ import org.apache.spark.sql.catalyst.optimizer.Optimizer
|
|||
import org.apache.spark.sql.catalyst.parser.ParserInterface
|
||||
import org.apache.spark.sql.catalyst.rules.RuleExecutor
|
||||
import org.apache.spark.sql.execution._
|
||||
import org.apache.spark.sql.execution.datasources.{PreInsertCastAndRename, ResolveDataSource}
|
||||
import org.apache.spark.sql.execution.datasources.{DataSourceAnalysis, PreInsertCastAndRename, ResolveDataSource}
|
||||
import org.apache.spark.sql.execution.exchange.EnsureRequirements
|
||||
import org.apache.spark.sql.util.ExecutionListenerManager
|
||||
|
||||
|
@ -63,8 +63,9 @@ private[sql] class SessionState(ctx: SQLContext) {
|
|||
new Analyzer(catalog, functionRegistry, conf) {
|
||||
override val extendedResolutionRules =
|
||||
python.ExtractPythonUDFs ::
|
||||
PreInsertCastAndRename ::
|
||||
(if (conf.runSQLOnFile) new ResolveDataSource(ctx) :: Nil else Nil)
|
||||
PreInsertCastAndRename ::
|
||||
DataSourceAnalysis ::
|
||||
(if (conf.runSQLOnFile) new ResolveDataSource(ctx) :: Nil else Nil)
|
||||
|
||||
override val extendedCheckRules = Seq(datasources.PreWriteCheck(catalog))
|
||||
}
|
||||
|
|
|
@ -28,12 +28,11 @@ import org.apache.hadoop.mapreduce.{Job, TaskAttemptContext}
|
|||
import org.apache.spark.{Logging, SparkContext}
|
||||
import org.apache.spark.annotation.{DeveloperApi, Experimental}
|
||||
import org.apache.spark.broadcast.Broadcast
|
||||
import org.apache.spark.rdd.{RDD, UnionRDD}
|
||||
import org.apache.spark.rdd.RDD
|
||||
import org.apache.spark.sql._
|
||||
import org.apache.spark.sql.catalyst.{CatalystTypeConverters, InternalRow}
|
||||
import org.apache.spark.sql.catalyst.expressions._
|
||||
import org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection
|
||||
import org.apache.spark.sql.execution.{FileRelation, RDDConversions}
|
||||
import org.apache.spark.sql.execution.FileRelation
|
||||
import org.apache.spark.sql.execution.datasources._
|
||||
import org.apache.spark.sql.execution.streaming.{FileStreamSource, Sink, Source}
|
||||
import org.apache.spark.sql.types.{StringType, StructType}
|
||||
|
@ -146,84 +145,6 @@ trait StreamSinkProvider {
|
|||
partitionColumns: Seq[String]): Sink
|
||||
}
|
||||
|
||||
/**
|
||||
* ::Experimental::
|
||||
* Implemented by objects that produce relations for a specific kind of data source
|
||||
* with a given schema and partitioned columns. When Spark SQL is given a DDL operation with a
|
||||
* USING clause specified (to specify the implemented [[HadoopFsRelationProvider]]), a user defined
|
||||
* schema, and an optional list of partition columns, this interface is used to pass in the
|
||||
* parameters specified by a user.
|
||||
*
|
||||
* Users may specify the fully qualified class name of a given data source. When that class is
|
||||
* not found Spark SQL will append the class name `DefaultSource` to the path, allowing for
|
||||
* less verbose invocation. For example, 'org.apache.spark.sql.json' would resolve to the
|
||||
* data source 'org.apache.spark.sql.json.DefaultSource'
|
||||
*
|
||||
* A new instance of this class will be instantiated each time a DDL call is made.
|
||||
*
|
||||
* The difference between a [[RelationProvider]] and a [[HadoopFsRelationProvider]] is
|
||||
* that users need to provide a schema and a (possibly empty) list of partition columns when
|
||||
* using a [[HadoopFsRelationProvider]]. A relation provider can inherits both [[RelationProvider]],
|
||||
* and [[HadoopFsRelationProvider]] if it can support schema inference, user-specified
|
||||
* schemas, and accessing partitioned relations.
|
||||
*
|
||||
* @since 1.4.0
|
||||
*/
|
||||
@Experimental
|
||||
trait HadoopFsRelationProvider extends StreamSourceProvider {
|
||||
/**
|
||||
* Returns a new base relation with the given parameters, a user defined schema, and a list of
|
||||
* partition columns. Note: the parameters' keywords are case insensitive and this insensitivity
|
||||
* is enforced by the Map that is passed to the function.
|
||||
*
|
||||
* @param dataSchema Schema of data columns (i.e., columns that are not partition columns).
|
||||
*/
|
||||
def createRelation(
|
||||
sqlContext: SQLContext,
|
||||
paths: Array[String],
|
||||
dataSchema: Option[StructType],
|
||||
partitionColumns: Option[StructType],
|
||||
parameters: Map[String, String]): HadoopFsRelation
|
||||
|
||||
private[sql] def createRelation(
|
||||
sqlContext: SQLContext,
|
||||
paths: Array[String],
|
||||
dataSchema: Option[StructType],
|
||||
partitionColumns: Option[StructType],
|
||||
bucketSpec: Option[BucketSpec],
|
||||
parameters: Map[String, String]): HadoopFsRelation = {
|
||||
if (bucketSpec.isDefined) {
|
||||
throw new AnalysisException("Currently we don't support bucketing for this data source.")
|
||||
}
|
||||
createRelation(sqlContext, paths, dataSchema, partitionColumns, parameters)
|
||||
}
|
||||
|
||||
override def createSource(
|
||||
sqlContext: SQLContext,
|
||||
schema: Option[StructType],
|
||||
providerName: String,
|
||||
parameters: Map[String, String]): Source = {
|
||||
val caseInsensitiveOptions = new CaseInsensitiveMap(parameters)
|
||||
val path = caseInsensitiveOptions.getOrElse("path", {
|
||||
throw new IllegalArgumentException("'path' is not specified")
|
||||
})
|
||||
val metadataPath = caseInsensitiveOptions.getOrElse("metadataPath", s"$path/_metadata")
|
||||
|
||||
def dataFrameBuilder(files: Array[String]): DataFrame = {
|
||||
val relation = createRelation(
|
||||
sqlContext,
|
||||
files,
|
||||
schema,
|
||||
partitionColumns = None,
|
||||
bucketSpec = None,
|
||||
parameters)
|
||||
DataFrame(sqlContext, LogicalRelation(relation))
|
||||
}
|
||||
|
||||
new FileStreamSource(sqlContext, metadataPath, path, schema, providerName, dataFrameBuilder)
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* @since 1.3.0
|
||||
*/
|
||||
|
@ -409,20 +330,13 @@ abstract class OutputWriterFactory extends Serializable {
|
|||
* @param dataSchema Schema of the rows to be written. Partition columns are not included in the
|
||||
* schema if the relation being written is partitioned.
|
||||
* @param context The Hadoop MapReduce task context.
|
||||
*
|
||||
* @since 1.4.0
|
||||
*/
|
||||
def newInstance(
|
||||
path: String,
|
||||
dataSchema: StructType,
|
||||
context: TaskAttemptContext): OutputWriter
|
||||
|
||||
private[sql] def newInstance(
|
||||
path: String,
|
||||
bucketId: Option[Int],
|
||||
bucketId: Option[Int], // TODO: This doesn't belong here...
|
||||
dataSchema: StructType,
|
||||
context: TaskAttemptContext): OutputWriter =
|
||||
newInstance(path, dataSchema, context)
|
||||
context: TaskAttemptContext): OutputWriter
|
||||
}
|
||||
|
||||
/**
|
||||
|
@ -465,214 +379,165 @@ abstract class OutputWriter {
|
|||
}
|
||||
|
||||
/**
|
||||
* ::Experimental::
|
||||
* A [[BaseRelation]] that provides much of the common code required for relations that store their
|
||||
* data to an HDFS compatible filesystem.
|
||||
* Acts as a container for all of the metadata required to read from a datasource. All discovery,
|
||||
* resolution and merging logic for schemas and partitions has been removed.
|
||||
*
|
||||
* For the read path, similar to [[PrunedFilteredScan]], it can eliminate unneeded columns and
|
||||
* filter using selected predicates before producing an RDD containing all matching tuples as
|
||||
* [[Row]] objects. In addition, when reading from Hive style partitioned tables stored in file
|
||||
* systems, it's able to discover partitioning information from the paths of input directories, and
|
||||
* perform partition pruning before start reading the data. Subclasses of [[HadoopFsRelation()]]
|
||||
* must override one of the four `buildScan` methods to implement the read path.
|
||||
*
|
||||
* For the write path, it provides the ability to write to both non-partitioned and partitioned
|
||||
* tables. Directory layout of the partitioned tables is compatible with Hive.
|
||||
*
|
||||
* @constructor This constructor is for internal uses only. The [[PartitionSpec]] argument is for
|
||||
* implementing metastore table conversion.
|
||||
*
|
||||
* @param maybePartitionSpec An [[HadoopFsRelation]] can be created with an optional
|
||||
* [[PartitionSpec]], so that partition discovery can be skipped.
|
||||
*
|
||||
* @since 1.4.0
|
||||
* @param location A [[FileCatalog]] that can enumerate the locations of all the files that comprise
|
||||
* this relation.
|
||||
* @param partitionSchema The schmea of the columns (if any) that are used to partition the relation
|
||||
* @param dataSchema The schema of any remaining columns. Note that if any partition columns are
|
||||
* present in the actual data files as well, they are removed.
|
||||
* @param bucketSpec Describes the bucketing (hash-partitioning of the files by some column values).
|
||||
* @param fileFormat A file format that can be used to read and write the data in files.
|
||||
* @param options Configuration used when reading / writing data.
|
||||
*/
|
||||
@Experimental
|
||||
abstract class HadoopFsRelation private[sql](
|
||||
maybePartitionSpec: Option[PartitionSpec],
|
||||
parameters: Map[String, String])
|
||||
extends BaseRelation with FileRelation with Logging {
|
||||
case class HadoopFsRelation(
|
||||
sqlContext: SQLContext,
|
||||
location: FileCatalog,
|
||||
partitionSchema: StructType,
|
||||
dataSchema: StructType,
|
||||
bucketSpec: Option[BucketSpec],
|
||||
fileFormat: FileFormat,
|
||||
options: Map[String, String]) extends BaseRelation with FileRelation {
|
||||
|
||||
override def toString: String = getClass.getSimpleName
|
||||
val schema: StructType = {
|
||||
val dataSchemaColumnNames = dataSchema.map(_.name.toLowerCase).toSet
|
||||
StructType(dataSchema ++ partitionSchema.filterNot { column =>
|
||||
dataSchemaColumnNames.contains(column.name.toLowerCase)
|
||||
})
|
||||
}
|
||||
|
||||
def this() = this(None, Map.empty[String, String])
|
||||
def partitionSchemaOption: Option[StructType] =
|
||||
if (partitionSchema.isEmpty) None else Some(partitionSchema)
|
||||
def partitionSpec: PartitionSpec = location.partitionSpec(partitionSchemaOption)
|
||||
|
||||
def this(parameters: Map[String, String]) = this(None, parameters)
|
||||
def refresh(): Unit = location.refresh()
|
||||
|
||||
private[sql] def this(maybePartitionSpec: Option[PartitionSpec]) =
|
||||
this(maybePartitionSpec, Map.empty[String, String])
|
||||
override def toString: String =
|
||||
s"$fileFormat part: ${partitionSchema.simpleString}, data: ${dataSchema.simpleString}"
|
||||
|
||||
/** Returns the list of files that will be read when scanning this relation. */
|
||||
override def inputFiles: Array[String] =
|
||||
location.allFiles().map(_.getPath.toUri.toString).toArray
|
||||
}
|
||||
|
||||
/**
|
||||
* Used to read a write data in files to [[InternalRow]] format.
|
||||
*/
|
||||
trait FileFormat {
|
||||
/**
|
||||
* When possible, this method should return the schema of the given `files`. When the format
|
||||
* does not support inference, or no valid files are given should return None. In these cases
|
||||
* Spark will require that user specify the schema manually.
|
||||
*/
|
||||
def inferSchema(
|
||||
sqlContext: SQLContext,
|
||||
options: Map[String, String],
|
||||
files: Seq[FileStatus]): Option[StructType]
|
||||
|
||||
/**
|
||||
* Prepares a write job and returns an [[OutputWriterFactory]]. Client side job preparation can
|
||||
* be put here. For example, user defined output committer can be configured here
|
||||
* by setting the output committer class in the conf of spark.sql.sources.outputCommitterClass.
|
||||
*/
|
||||
def prepareWrite(
|
||||
sqlContext: SQLContext,
|
||||
job: Job,
|
||||
options: Map[String, String],
|
||||
dataSchema: StructType): OutputWriterFactory
|
||||
|
||||
def buildInternalScan(
|
||||
sqlContext: SQLContext,
|
||||
dataSchema: StructType,
|
||||
requiredColumns: Array[String],
|
||||
filters: Array[Filter],
|
||||
bucketSet: Option[BitSet],
|
||||
inputFiles: Array[FileStatus],
|
||||
broadcastedConf: Broadcast[SerializableConfiguration],
|
||||
options: Map[String, String]): RDD[InternalRow]
|
||||
}
|
||||
|
||||
/**
|
||||
* An interface for objects capable of enumerating the files that comprise a relation as well
|
||||
* as the partitioning characteristics of those files.
|
||||
*/
|
||||
trait FileCatalog {
|
||||
def paths: Seq[Path]
|
||||
|
||||
def partitionSpec(schema: Option[StructType]): PartitionSpec
|
||||
|
||||
def allFiles(): Seq[FileStatus]
|
||||
|
||||
def getStatus(path: Path): Array[FileStatus]
|
||||
|
||||
def refresh(): Unit
|
||||
}
|
||||
|
||||
/**
|
||||
* A file catalog that caches metadata gathered by scanning all the files present in `paths`
|
||||
* recursively.
|
||||
*/
|
||||
class HDFSFileCatalog(
|
||||
val sqlContext: SQLContext,
|
||||
val parameters: Map[String, String],
|
||||
val paths: Seq[Path])
|
||||
extends FileCatalog with Logging {
|
||||
|
||||
private val hadoopConf = new Configuration(sqlContext.sparkContext.hadoopConfiguration)
|
||||
|
||||
private var _partitionSpec: PartitionSpec = _
|
||||
var leafFiles = mutable.LinkedHashMap.empty[Path, FileStatus]
|
||||
var leafDirToChildrenFiles = mutable.Map.empty[Path, Array[FileStatus]]
|
||||
var cachedPartitionSpec: PartitionSpec = _
|
||||
|
||||
private[this] var malformedBucketFile = false
|
||||
def partitionSpec(schema: Option[StructType]): PartitionSpec = {
|
||||
if (cachedPartitionSpec == null) {
|
||||
cachedPartitionSpec = inferPartitioning(schema)
|
||||
}
|
||||
|
||||
private[sql] def maybeBucketSpec: Option[BucketSpec] = None
|
||||
cachedPartitionSpec
|
||||
}
|
||||
|
||||
final private[sql] def getBucketSpec: Option[BucketSpec] =
|
||||
maybeBucketSpec.filter(_ => sqlContext.conf.bucketingEnabled() && !malformedBucketFile)
|
||||
refresh()
|
||||
|
||||
private class FileStatusCache {
|
||||
var leafFiles = mutable.LinkedHashMap.empty[Path, FileStatus]
|
||||
def allFiles(): Seq[FileStatus] = leafFiles.values.toSeq
|
||||
|
||||
var leafDirToChildrenFiles = mutable.Map.empty[Path, Array[FileStatus]]
|
||||
def getStatus(path: Path): Array[FileStatus] = leafDirToChildrenFiles(path)
|
||||
|
||||
private def listLeafFiles(paths: Array[String]): mutable.LinkedHashSet[FileStatus] = {
|
||||
if (paths.length >= sqlContext.conf.parallelPartitionDiscoveryThreshold) {
|
||||
HadoopFsRelation.listLeafFilesInParallel(paths, hadoopConf, sqlContext.sparkContext)
|
||||
} else {
|
||||
val statuses = paths.flatMap { path =>
|
||||
val hdfsPath = new Path(path)
|
||||
val fs = hdfsPath.getFileSystem(hadoopConf)
|
||||
val qualified = hdfsPath.makeQualified(fs.getUri, fs.getWorkingDirectory)
|
||||
logInfo(s"Listing $qualified on driver")
|
||||
// Dummy jobconf to get to the pathFilter defined in configuration
|
||||
val jobConf = new JobConf(hadoopConf, this.getClass())
|
||||
val pathFilter = FileInputFormat.getInputPathFilter(jobConf)
|
||||
if (pathFilter != null) {
|
||||
Try(fs.listStatus(qualified, pathFilter)).getOrElse(Array.empty)
|
||||
} else {
|
||||
Try(fs.listStatus(qualified)).getOrElse(Array.empty)
|
||||
}
|
||||
}.filterNot { status =>
|
||||
val name = status.getPath.getName
|
||||
name.toLowerCase == "_temporary" || name.startsWith(".")
|
||||
}
|
||||
|
||||
val (dirs, files) = statuses.partition(_.isDirectory)
|
||||
|
||||
// It uses [[LinkedHashSet]] since the order of files can affect the results. (SPARK-11500)
|
||||
if (dirs.isEmpty) {
|
||||
mutable.LinkedHashSet(files: _*)
|
||||
private def listLeafFiles(paths: Seq[Path]): mutable.LinkedHashSet[FileStatus] = {
|
||||
if (paths.length >= sqlContext.conf.parallelPartitionDiscoveryThreshold) {
|
||||
HadoopFsRelation.listLeafFilesInParallel(paths, hadoopConf, sqlContext.sparkContext)
|
||||
} else {
|
||||
val statuses = paths.flatMap { path =>
|
||||
val fs = path.getFileSystem(hadoopConf)
|
||||
logInfo(s"Listing $path on driver")
|
||||
// Dummy jobconf to get to the pathFilter defined in configuration
|
||||
val jobConf = new JobConf(hadoopConf, this.getClass())
|
||||
val pathFilter = FileInputFormat.getInputPathFilter(jobConf)
|
||||
if (pathFilter != null) {
|
||||
Try(fs.listStatus(path, pathFilter)).getOrElse(Array.empty)
|
||||
} else {
|
||||
mutable.LinkedHashSet(files: _*) ++ listLeafFiles(dirs.map(_.getPath.toString))
|
||||
Try(fs.listStatus(path)).getOrElse(Array.empty)
|
||||
}
|
||||
}.filterNot { status =>
|
||||
val name = status.getPath.getName
|
||||
name.toLowerCase == "_temporary" || name.startsWith(".")
|
||||
}
|
||||
|
||||
val (dirs, files) = statuses.partition(_.isDirectory)
|
||||
|
||||
// It uses [[LinkedHashSet]] since the order of files can affect the results. (SPARK-11500)
|
||||
if (dirs.isEmpty) {
|
||||
mutable.LinkedHashSet(files: _*)
|
||||
} else {
|
||||
mutable.LinkedHashSet(files: _*) ++ listLeafFiles(dirs.map(_.getPath))
|
||||
}
|
||||
}
|
||||
|
||||
def refresh(): Unit = {
|
||||
val files = listLeafFiles(paths)
|
||||
|
||||
leafFiles.clear()
|
||||
leafDirToChildrenFiles.clear()
|
||||
|
||||
leafFiles ++= files.map(f => f.getPath -> f)
|
||||
leafDirToChildrenFiles ++= files.toArray.groupBy(_.getPath.getParent)
|
||||
}
|
||||
}
|
||||
|
||||
private lazy val fileStatusCache = {
|
||||
val cache = new FileStatusCache
|
||||
cache.refresh()
|
||||
cache
|
||||
}
|
||||
|
||||
protected def cachedLeafStatuses(): mutable.LinkedHashSet[FileStatus] = {
|
||||
mutable.LinkedHashSet(fileStatusCache.leafFiles.values.toArray: _*)
|
||||
}
|
||||
|
||||
final private[sql] def partitionSpec: PartitionSpec = {
|
||||
if (_partitionSpec == null) {
|
||||
_partitionSpec = maybePartitionSpec
|
||||
.flatMap {
|
||||
case spec if spec.partitions.nonEmpty =>
|
||||
Some(spec.copy(partitionColumns = spec.partitionColumns.asNullable))
|
||||
case _ =>
|
||||
None
|
||||
}
|
||||
.orElse {
|
||||
// We only know the partition columns and their data types. We need to discover
|
||||
// partition values.
|
||||
userDefinedPartitionColumns.map { partitionSchema =>
|
||||
val spec = discoverPartitions()
|
||||
val partitionColumnTypes = spec.partitionColumns.map(_.dataType)
|
||||
val castedPartitions = spec.partitions.map { case p @ Partition(values, path) =>
|
||||
val literals = partitionColumnTypes.zipWithIndex.map { case (dt, i) =>
|
||||
Literal.create(values.get(i, dt), dt)
|
||||
}
|
||||
val castedValues = partitionSchema.zip(literals).map { case (field, literal) =>
|
||||
Cast(literal, field.dataType).eval()
|
||||
}
|
||||
p.copy(values = InternalRow.fromSeq(castedValues))
|
||||
}
|
||||
PartitionSpec(partitionSchema, castedPartitions)
|
||||
}
|
||||
}
|
||||
.getOrElse {
|
||||
if (sqlContext.conf.partitionDiscoveryEnabled()) {
|
||||
discoverPartitions()
|
||||
} else {
|
||||
PartitionSpec(StructType(Nil), Array.empty[Partition])
|
||||
}
|
||||
}
|
||||
}
|
||||
_partitionSpec
|
||||
}
|
||||
|
||||
/**
|
||||
* Paths of this relation. For partitioned relations, it should be root directories
|
||||
* of all partition directories.
|
||||
*
|
||||
* @since 1.4.0
|
||||
*/
|
||||
def paths: Array[String]
|
||||
|
||||
/**
|
||||
* Contains a set of paths that are considered as the base dirs of the input datasets.
|
||||
* The partitioning discovery logic will make sure it will stop when it reaches any
|
||||
* base path. By default, the paths of the dataset provided by users will be base paths.
|
||||
* For example, if a user uses `sqlContext.read.parquet("/path/something=true/")`, the base path
|
||||
* will be `/path/something=true/`, and the returned DataFrame will not contain a column of
|
||||
* `something`. If users want to override the basePath. They can set `basePath` in the options
|
||||
* to pass the new base path to the data source.
|
||||
* For the above example, if the user-provided base path is `/path/`, the returned
|
||||
* DataFrame will have the column of `something`.
|
||||
*/
|
||||
private def basePaths: Set[Path] = {
|
||||
val userDefinedBasePath = parameters.get("basePath").map(basePath => Set(new Path(basePath)))
|
||||
userDefinedBasePath.getOrElse {
|
||||
// If the user does not provide basePath, we will just use paths.
|
||||
val pathSet = paths.toSet
|
||||
pathSet.map(p => new Path(p))
|
||||
}.map { hdfsPath =>
|
||||
// Make the path qualified (consistent with listLeafFiles and listLeafFilesInParallel).
|
||||
val fs = hdfsPath.getFileSystem(hadoopConf)
|
||||
hdfsPath.makeQualified(fs.getUri, fs.getWorkingDirectory)
|
||||
}
|
||||
}
|
||||
|
||||
override def inputFiles: Array[String] = cachedLeafStatuses().map(_.getPath.toString).toArray
|
||||
|
||||
override def sizeInBytes: Long = cachedLeafStatuses().map(_.getLen).sum
|
||||
|
||||
/**
|
||||
* Partition columns. Can be either defined by [[userDefinedPartitionColumns]] or automatically
|
||||
* discovered. Note that they should always be nullable.
|
||||
*
|
||||
* @since 1.4.0
|
||||
*/
|
||||
final def partitionColumns: StructType =
|
||||
userDefinedPartitionColumns.getOrElse(partitionSpec.partitionColumns)
|
||||
|
||||
/**
|
||||
* Optional user defined partition columns.
|
||||
*
|
||||
* @since 1.4.0
|
||||
*/
|
||||
def userDefinedPartitionColumns: Option[StructType] = None
|
||||
|
||||
private[sql] def refresh(): Unit = {
|
||||
fileStatusCache.refresh()
|
||||
if (sqlContext.conf.partitionDiscoveryEnabled()) {
|
||||
_partitionSpec = discoverPartitions()
|
||||
}
|
||||
}
|
||||
|
||||
private def discoverPartitions(): PartitionSpec = {
|
||||
def inferPartitioning(schema: Option[StructType]): PartitionSpec = {
|
||||
// We use leaf dirs containing data files to discover the schema.
|
||||
val leafDirs = fileStatusCache.leafDirToChildrenFiles.keys.toSeq
|
||||
userDefinedPartitionColumns match {
|
||||
val leafDirs = leafDirToChildrenFiles.keys.toSeq
|
||||
schema match {
|
||||
case Some(userProvidedSchema) if userProvidedSchema.nonEmpty =>
|
||||
val spec = PartitioningUtils.parsePartitions(
|
||||
leafDirs,
|
||||
|
@ -693,9 +558,7 @@ abstract class HadoopFsRelation private[sql](
|
|||
PartitionSpec(userProvidedSchema, spec.partitions.map { part =>
|
||||
part.copy(values = castPartitionValuesToUserSchema(part.values))
|
||||
})
|
||||
|
||||
case _ =>
|
||||
// user did not provide a partitioning schema
|
||||
case None =>
|
||||
PartitioningUtils.parsePartitions(
|
||||
leafDirs,
|
||||
PartitioningUtils.DEFAULT_PARTITION_NAME,
|
||||
|
@ -705,271 +568,51 @@ abstract class HadoopFsRelation private[sql](
|
|||
}
|
||||
|
||||
/**
|
||||
* Schema of this relation. It consists of columns appearing in [[dataSchema]] and all partition
|
||||
* columns not appearing in [[dataSchema]].
|
||||
*
|
||||
* @since 1.4.0
|
||||
* Contains a set of paths that are considered as the base dirs of the input datasets.
|
||||
* The partitioning discovery logic will make sure it will stop when it reaches any
|
||||
* base path. By default, the paths of the dataset provided by users will be base paths.
|
||||
* For example, if a user uses `sqlContext.read.parquet("/path/something=true/")`, the base path
|
||||
* will be `/path/something=true/`, and the returned DataFrame will not contain a column of
|
||||
* `something`. If users want to override the basePath. They can set `basePath` in the options
|
||||
* to pass the new base path to the data source.
|
||||
* For the above example, if the user-provided base path is `/path/`, the returned
|
||||
* DataFrame will have the column of `something`.
|
||||
*/
|
||||
override lazy val schema: StructType = {
|
||||
val dataSchemaColumnNames = dataSchema.map(_.name.toLowerCase).toSet
|
||||
StructType(dataSchema ++ partitionColumns.filterNot { column =>
|
||||
dataSchemaColumnNames.contains(column.name.toLowerCase)
|
||||
})
|
||||
}
|
||||
|
||||
/**
|
||||
* Groups the input files by bucket id, if bucketing is enabled and this data source is bucketed.
|
||||
* Returns None if there exists any malformed bucket files.
|
||||
*/
|
||||
private def groupBucketFiles(
|
||||
files: Array[FileStatus]): Option[scala.collection.Map[Int, Array[FileStatus]]] = {
|
||||
malformedBucketFile = false
|
||||
if (getBucketSpec.isDefined) {
|
||||
val groupedBucketFiles = mutable.HashMap.empty[Int, mutable.ArrayBuffer[FileStatus]]
|
||||
var i = 0
|
||||
while (!malformedBucketFile && i < files.length) {
|
||||
val bucketId = BucketingUtils.getBucketId(files(i).getPath.getName)
|
||||
if (bucketId.isEmpty) {
|
||||
logError(s"File ${files(i).getPath} is expected to be a bucket file, but there is no " +
|
||||
"bucket id information in file name. Fall back to non-bucketing mode.")
|
||||
malformedBucketFile = true
|
||||
} else {
|
||||
val bucketFiles =
|
||||
groupedBucketFiles.getOrElseUpdate(bucketId.get, mutable.ArrayBuffer.empty)
|
||||
bucketFiles += files(i)
|
||||
}
|
||||
i += 1
|
||||
}
|
||||
if (malformedBucketFile) None else Some(groupedBucketFiles.mapValues(_.toArray))
|
||||
} else {
|
||||
None
|
||||
private def basePaths: Set[Path] = {
|
||||
val userDefinedBasePath = parameters.get("basePath").map(basePath => Set(new Path(basePath)))
|
||||
userDefinedBasePath.getOrElse {
|
||||
// If the user does not provide basePath, we will just use paths.
|
||||
paths.toSet
|
||||
}.map { hdfsPath =>
|
||||
// Make the path qualified (consistent with listLeafFiles and listLeafFilesInParallel).
|
||||
val fs = hdfsPath.getFileSystem(hadoopConf)
|
||||
hdfsPath.makeQualified(fs.getUri, fs.getWorkingDirectory)
|
||||
}
|
||||
}
|
||||
|
||||
final private[sql] def buildInternalScan(
|
||||
requiredColumns: Array[String],
|
||||
filters: Array[Filter],
|
||||
bucketSet: Option[BitSet],
|
||||
inputPaths: Array[String],
|
||||
broadcastedConf: Broadcast[SerializableConfiguration]): RDD[InternalRow] = {
|
||||
val inputStatuses = inputPaths.flatMap { input =>
|
||||
val path = new Path(input)
|
||||
def refresh(): Unit = {
|
||||
val files = listLeafFiles(paths)
|
||||
|
||||
// First assumes `input` is a directory path, and tries to get all files contained in it.
|
||||
fileStatusCache.leafDirToChildrenFiles.getOrElse(
|
||||
path,
|
||||
// Otherwise, `input` might be a file path
|
||||
fileStatusCache.leafFiles.get(path).toArray
|
||||
).filter { status =>
|
||||
val name = status.getPath.getName
|
||||
!name.startsWith("_") && !name.startsWith(".")
|
||||
}
|
||||
}
|
||||
leafFiles.clear()
|
||||
leafDirToChildrenFiles.clear()
|
||||
|
||||
groupBucketFiles(inputStatuses).map { groupedBucketFiles =>
|
||||
// For each bucket id, firstly we get all files belong to this bucket, by detecting bucket
|
||||
// id from file name. Then read these files into a RDD(use one-partition empty RDD for empty
|
||||
// bucket), and coalesce it to one partition. Finally union all bucket RDDs to one result.
|
||||
val perBucketRows = (0 until maybeBucketSpec.get.numBuckets).map { bucketId =>
|
||||
// If the current bucketId is not set in the bucket bitSet, skip scanning it.
|
||||
if (bucketSet.nonEmpty && !bucketSet.get.get(bucketId)){
|
||||
sqlContext.emptyResult
|
||||
} else {
|
||||
// When all the buckets need a scan (i.e., bucketSet is equal to None)
|
||||
// or when the current bucket need a scan (i.e., the bit of bucketId is set to true)
|
||||
groupedBucketFiles.get(bucketId).map { inputStatuses =>
|
||||
buildInternalScan(requiredColumns, filters, inputStatuses, broadcastedConf).coalesce(1)
|
||||
}.getOrElse(sqlContext.emptyResult)
|
||||
}
|
||||
}
|
||||
leafFiles ++= files.map(f => f.getPath -> f)
|
||||
leafDirToChildrenFiles ++= files.toArray.groupBy(_.getPath.getParent)
|
||||
|
||||
new UnionRDD(sqlContext.sparkContext, perBucketRows)
|
||||
}.getOrElse {
|
||||
buildInternalScan(requiredColumns, filters, inputStatuses, broadcastedConf)
|
||||
}
|
||||
cachedPartitionSpec = null
|
||||
}
|
||||
|
||||
/**
|
||||
* Specifies schema of actual data files. For partitioned relations, if one or more partitioned
|
||||
* columns are contained in the data files, they should also appear in `dataSchema`.
|
||||
*
|
||||
* @since 1.4.0
|
||||
*/
|
||||
def dataSchema: StructType
|
||||
|
||||
/**
|
||||
* For a non-partitioned relation, this method builds an `RDD[Row]` containing all rows within
|
||||
* this relation. For partitioned relations, this method is called for each selected partition,
|
||||
* and builds an `RDD[Row]` containing all rows within that single partition.
|
||||
*
|
||||
* @param inputFiles For a non-partitioned relation, it contains paths of all data files in the
|
||||
* relation. For a partitioned relation, it contains paths of all data files in a single
|
||||
* selected partition.
|
||||
*
|
||||
* @since 1.4.0
|
||||
*/
|
||||
def buildScan(inputFiles: Array[FileStatus]): RDD[Row] = {
|
||||
throw new UnsupportedOperationException(
|
||||
"At least one buildScan() method should be overridden to read the relation.")
|
||||
override def equals(other: Any): Boolean = other match {
|
||||
case hdfs: HDFSFileCatalog => paths.toSet == hdfs.paths.toSet
|
||||
case _ => false
|
||||
}
|
||||
|
||||
/**
|
||||
* For a non-partitioned relation, this method builds an `RDD[Row]` containing all rows within
|
||||
* this relation. For partitioned relations, this method is called for each selected partition,
|
||||
* and builds an `RDD[Row]` containing all rows within that single partition.
|
||||
*
|
||||
* @param requiredColumns Required columns.
|
||||
* @param inputFiles For a non-partitioned relation, it contains paths of all data files in the
|
||||
* relation. For a partitioned relation, it contains paths of all data files in a single
|
||||
* selected partition.
|
||||
*
|
||||
* @since 1.4.0
|
||||
*/
|
||||
// TODO Tries to eliminate the extra Catalyst-to-Scala conversion when `needConversion` is true
|
||||
//
|
||||
// PR #7626 separated `Row` and `InternalRow` completely. One of the consequences is that we can
|
||||
// no longer treat an `InternalRow` containing Catalyst values as a `Row`. Thus we have to
|
||||
// introduce another row value conversion for data sources whose `needConversion` is true.
|
||||
def buildScan(requiredColumns: Array[String], inputFiles: Array[FileStatus]): RDD[Row] = {
|
||||
// Yeah, to workaround serialization...
|
||||
val dataSchema = this.dataSchema
|
||||
val needConversion = this.needConversion
|
||||
|
||||
val requiredOutput = requiredColumns.map { col =>
|
||||
val field = dataSchema(col)
|
||||
BoundReference(dataSchema.fieldIndex(col), field.dataType, field.nullable)
|
||||
}.toSeq
|
||||
|
||||
val rdd: RDD[Row] = buildScan(inputFiles)
|
||||
val converted: RDD[InternalRow] =
|
||||
if (needConversion) {
|
||||
RDDConversions.rowToRowRdd(rdd, dataSchema.fields.map(_.dataType))
|
||||
} else {
|
||||
rdd.asInstanceOf[RDD[InternalRow]]
|
||||
}
|
||||
|
||||
converted.mapPartitions { rows =>
|
||||
val buildProjection =
|
||||
GenerateMutableProjection.generate(requiredOutput, dataSchema.toAttributes)
|
||||
|
||||
val projectedRows = {
|
||||
val mutableProjection = buildProjection()
|
||||
rows.map(r => mutableProjection(r))
|
||||
}
|
||||
|
||||
if (needConversion) {
|
||||
val requiredSchema = StructType(requiredColumns.map(dataSchema(_)))
|
||||
val toScala = CatalystTypeConverters.createToScalaConverter(requiredSchema)
|
||||
projectedRows.map(toScala(_).asInstanceOf[Row])
|
||||
} else {
|
||||
projectedRows
|
||||
}
|
||||
}.asInstanceOf[RDD[Row]]
|
||||
}
|
||||
|
||||
/**
|
||||
* For a non-partitioned relation, this method builds an `RDD[Row]` containing all rows within
|
||||
* this relation. For partitioned relations, this method is called for each selected partition,
|
||||
* and builds an `RDD[Row]` containing all rows within that single partition.
|
||||
*
|
||||
* @param requiredColumns Required columns.
|
||||
* @param filters Candidate filters to be pushed down. The actual filter should be the conjunction
|
||||
* of all `filters`. The pushed down filters are currently purely an optimization as they
|
||||
* will all be evaluated again. This means it is safe to use them with methods that produce
|
||||
* false positives such as filtering partitions based on a bloom filter.
|
||||
* @param inputFiles For a non-partitioned relation, it contains paths of all data files in the
|
||||
* relation. For a partitioned relation, it contains paths of all data files in a single
|
||||
* selected partition.
|
||||
*
|
||||
* @since 1.4.0
|
||||
*/
|
||||
def buildScan(
|
||||
requiredColumns: Array[String],
|
||||
filters: Array[Filter],
|
||||
inputFiles: Array[FileStatus]): RDD[Row] = {
|
||||
buildScan(requiredColumns, inputFiles)
|
||||
}
|
||||
|
||||
/**
|
||||
* For a non-partitioned relation, this method builds an `RDD[Row]` containing all rows within
|
||||
* this relation. For partitioned relations, this method is called for each selected partition,
|
||||
* and builds an `RDD[Row]` containing all rows within that single partition.
|
||||
*
|
||||
* Note: This interface is subject to change in future.
|
||||
*
|
||||
* @param requiredColumns Required columns.
|
||||
* @param filters Candidate filters to be pushed down. The actual filter should be the conjunction
|
||||
* of all `filters`. The pushed down filters are currently purely an optimization as they
|
||||
* will all be evaluated again. This means it is safe to use them with methods that produce
|
||||
* false positives such as filtering partitions based on a bloom filter.
|
||||
* @param inputFiles For a non-partitioned relation, it contains paths of all data files in the
|
||||
* relation. For a partitioned relation, it contains paths of all data files in a single
|
||||
* selected partition.
|
||||
* @param broadcastedConf A shared broadcast Hadoop Configuration, which can be used to reduce the
|
||||
* overhead of broadcasting the Configuration for every Hadoop RDD.
|
||||
*
|
||||
* @since 1.4.0
|
||||
*/
|
||||
private[sql] def buildScan(
|
||||
requiredColumns: Array[String],
|
||||
filters: Array[Filter],
|
||||
inputFiles: Array[FileStatus],
|
||||
broadcastedConf: Broadcast[SerializableConfiguration]): RDD[Row] = {
|
||||
buildScan(requiredColumns, filters, inputFiles)
|
||||
}
|
||||
|
||||
/**
|
||||
* For a non-partitioned relation, this method builds an `RDD[InternalRow]` containing all rows
|
||||
* within this relation. For partitioned relations, this method is called for each selected
|
||||
* partition, and builds an `RDD[InternalRow]` containing all rows within that single partition.
|
||||
*
|
||||
* Note:
|
||||
*
|
||||
* 1. Rows contained in the returned `RDD[InternalRow]` are assumed to be `UnsafeRow`s.
|
||||
* 2. This interface is subject to change in future.
|
||||
*
|
||||
* @param requiredColumns Required columns.
|
||||
* @param filters Candidate filters to be pushed down. The actual filter should be the conjunction
|
||||
* of all `filters`. The pushed down filters are currently purely an optimization as they
|
||||
* will all be evaluated again. This means it is safe to use them with methods that produce
|
||||
* false positives such as filtering partitions based on a bloom filter.
|
||||
* @param inputFiles For a non-partitioned relation, it contains paths of all data files in the
|
||||
* relation. For a partitioned relation, it contains paths of all data files in a single
|
||||
* selected partition.
|
||||
* @param broadcastedConf A shared broadcast Hadoop Configuration, which can be used to reduce the
|
||||
* overhead of broadcasting the Configuration for every Hadoop RDD.
|
||||
*/
|
||||
private[sql] def buildInternalScan(
|
||||
requiredColumns: Array[String],
|
||||
filters: Array[Filter],
|
||||
inputFiles: Array[FileStatus],
|
||||
broadcastedConf: Broadcast[SerializableConfiguration]): RDD[InternalRow] = {
|
||||
val requiredSchema = StructType(requiredColumns.map(dataSchema.apply))
|
||||
val internalRows = {
|
||||
val externalRows = buildScan(requiredColumns, filters, inputFiles, broadcastedConf)
|
||||
execution.RDDConversions.rowToRowRdd(externalRows, requiredSchema.map(_.dataType))
|
||||
}
|
||||
|
||||
internalRows.mapPartitions { iterator =>
|
||||
val unsafeProjection = UnsafeProjection.create(requiredSchema)
|
||||
iterator.map(unsafeProjection)
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Prepares a write job and returns an [[OutputWriterFactory]]. Client side job preparation can
|
||||
* be put here. For example, user defined output committer can be configured here
|
||||
* by setting the output committer class in the conf of spark.sql.sources.outputCommitterClass.
|
||||
*
|
||||
* Note that the only side effect expected here is mutating `job` via its setters. Especially,
|
||||
* Spark SQL caches [[BaseRelation]] instances for performance, mutating relation internal states
|
||||
* may cause unexpected behaviors.
|
||||
*
|
||||
* @since 1.4.0
|
||||
*/
|
||||
def prepareJobForWrite(job: Job): OutputWriterFactory
|
||||
override def hashCode(): Int = paths.toSet.hashCode()
|
||||
}
|
||||
|
||||
/**
|
||||
* Helper methods for gathering metadata from HDFS.
|
||||
*/
|
||||
private[sql] object HadoopFsRelation extends Logging {
|
||||
// We don't filter files/directories whose name start with "_" except "_temporary" here, as
|
||||
// specific data sources may take advantages over them (e.g. Parquet _metadata and
|
||||
|
@ -1009,17 +652,17 @@ private[sql] object HadoopFsRelation extends Logging {
|
|||
accessTime: Long)
|
||||
|
||||
def listLeafFilesInParallel(
|
||||
paths: Array[String],
|
||||
paths: Seq[Path],
|
||||
hadoopConf: Configuration,
|
||||
sparkContext: SparkContext): mutable.LinkedHashSet[FileStatus] = {
|
||||
logInfo(s"Listing leaf files and directories in parallel under: ${paths.mkString(", ")}")
|
||||
|
||||
val serializableConfiguration = new SerializableConfiguration(hadoopConf)
|
||||
val fakeStatuses = sparkContext.parallelize(paths).flatMap { path =>
|
||||
val hdfsPath = new Path(path)
|
||||
val fs = hdfsPath.getFileSystem(serializableConfiguration.value)
|
||||
val qualified = hdfsPath.makeQualified(fs.getUri, fs.getWorkingDirectory)
|
||||
Try(listLeafFiles(fs, fs.getFileStatus(qualified))).getOrElse(Array.empty)
|
||||
val serializedPaths = paths.map(_.toString)
|
||||
|
||||
val fakeStatuses = sparkContext.parallelize(serializedPaths).map(new Path(_)).flatMap { path =>
|
||||
val fs = path.getFileSystem(serializableConfiguration.value)
|
||||
Try(listLeafFiles(fs, fs.getFileStatus(path))).getOrElse(Array.empty)
|
||||
}.map { status =>
|
||||
FakeFileStatus(
|
||||
status.getPath.toString,
|
||||
|
|
|
@ -889,7 +889,6 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
|
|||
.write.format("parquet").save("temp")
|
||||
}
|
||||
assert(e.getMessage.contains("Duplicate column(s)"))
|
||||
assert(e.getMessage.contains("parquet"))
|
||||
assert(e.getMessage.contains("column1"))
|
||||
assert(!e.getMessage.contains("column2"))
|
||||
|
||||
|
@ -900,7 +899,6 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
|
|||
.write.format("json").save("temp")
|
||||
}
|
||||
assert(f.getMessage.contains("Duplicate column(s)"))
|
||||
assert(f.getMessage.contains("JSON"))
|
||||
assert(f.getMessage.contains("column1"))
|
||||
assert(f.getMessage.contains("column3"))
|
||||
assert(!f.getMessage.contains("column2"))
|
||||
|
|
|
@ -1741,7 +1741,7 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
|
|||
val e3 = intercept[AnalysisException] {
|
||||
sql("select * from json.invalid_file")
|
||||
}
|
||||
assert(e3.message.contains("No input paths specified"))
|
||||
assert(e3.message.contains("Unable to infer schema"))
|
||||
}
|
||||
|
||||
test("SortMergeJoin returns wrong results when using UnsafeRows") {
|
||||
|
|
|
@ -582,35 +582,6 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
|
|||
jsonDF.registerTempTable("jsonTable")
|
||||
}
|
||||
|
||||
test("jsonFile should be based on JSONRelation") {
|
||||
val dir = Utils.createTempDir()
|
||||
dir.delete()
|
||||
val path = dir.getCanonicalFile.toURI.toString
|
||||
sparkContext.parallelize(1 to 100)
|
||||
.map(i => s"""{"a": 1, "b": "str$i"}""").saveAsTextFile(path)
|
||||
val jsonDF = sqlContext.read.option("samplingRatio", "0.49").json(path)
|
||||
|
||||
val analyzed = jsonDF.queryExecution.analyzed
|
||||
assert(
|
||||
analyzed.isInstanceOf[LogicalRelation],
|
||||
"The DataFrame returned by jsonFile should be based on LogicalRelation.")
|
||||
val relation = analyzed.asInstanceOf[LogicalRelation].relation
|
||||
assert(
|
||||
relation.isInstanceOf[JSONRelation],
|
||||
"The DataFrame returned by jsonFile should be based on JSONRelation.")
|
||||
assert(relation.asInstanceOf[JSONRelation].paths === Array(path))
|
||||
assert(relation.asInstanceOf[JSONRelation].options.samplingRatio === (0.49 +- 0.001))
|
||||
|
||||
val schema = StructType(StructField("a", LongType, true) :: Nil)
|
||||
val logicalRelation =
|
||||
sqlContext.read.schema(schema).json(path)
|
||||
.queryExecution.analyzed.asInstanceOf[LogicalRelation]
|
||||
val relationWithSchema = logicalRelation.relation.asInstanceOf[JSONRelation]
|
||||
assert(relationWithSchema.paths === Array(path))
|
||||
assert(relationWithSchema.schema === schema)
|
||||
assert(relationWithSchema.options.samplingRatio > 0.99)
|
||||
}
|
||||
|
||||
test("Loading a JSON dataset from a text file") {
|
||||
val dir = Utils.createTempDir()
|
||||
dir.delete()
|
||||
|
@ -1202,48 +1173,6 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
|
|||
}
|
||||
|
||||
test("JSONRelation equality test") {
|
||||
val relation0 = new JSONRelation(
|
||||
Some(empty),
|
||||
Some(StructType(StructField("a", IntegerType, true) :: Nil)),
|
||||
None,
|
||||
None)(sqlContext)
|
||||
val logicalRelation0 = LogicalRelation(relation0)
|
||||
val relation1 = new JSONRelation(
|
||||
Some(singleRow),
|
||||
Some(StructType(StructField("a", IntegerType, true) :: Nil)),
|
||||
None,
|
||||
None)(sqlContext)
|
||||
val logicalRelation1 = LogicalRelation(relation1)
|
||||
val relation2 = new JSONRelation(
|
||||
Some(singleRow),
|
||||
Some(StructType(StructField("a", IntegerType, true) :: Nil)),
|
||||
None,
|
||||
None,
|
||||
parameters = Map("samplingRatio" -> "0.5"))(sqlContext)
|
||||
val logicalRelation2 = LogicalRelation(relation2)
|
||||
val relation3 = new JSONRelation(
|
||||
Some(singleRow),
|
||||
Some(StructType(StructField("b", IntegerType, true) :: Nil)),
|
||||
None,
|
||||
None)(sqlContext)
|
||||
val logicalRelation3 = LogicalRelation(relation3)
|
||||
|
||||
assert(relation0 !== relation1)
|
||||
assert(!logicalRelation0.sameResult(logicalRelation1),
|
||||
s"$logicalRelation0 and $logicalRelation1 should be considered not having the same result.")
|
||||
|
||||
assert(relation1 === relation2)
|
||||
assert(logicalRelation1.sameResult(logicalRelation2),
|
||||
s"$logicalRelation1 and $logicalRelation2 should be considered having the same result.")
|
||||
|
||||
assert(relation1 !== relation3)
|
||||
assert(!logicalRelation1.sameResult(logicalRelation3),
|
||||
s"$logicalRelation1 and $logicalRelation3 should be considered not having the same result.")
|
||||
|
||||
assert(relation2 !== relation3)
|
||||
assert(!logicalRelation2.sameResult(logicalRelation3),
|
||||
s"$logicalRelation2 and $logicalRelation3 should be considered not having the same result.")
|
||||
|
||||
withTempPath(dir => {
|
||||
val path = dir.getCanonicalFile.toURI.toString
|
||||
sparkContext.parallelize(1 to 100)
|
||||
|
|
|
@ -28,6 +28,7 @@ import org.apache.spark.sql.catalyst.planning.PhysicalOperation
|
|||
import org.apache.spark.sql.execution.datasources.{DataSourceStrategy, LogicalRelation}
|
||||
import org.apache.spark.sql.functions._
|
||||
import org.apache.spark.sql.internal.SQLConf
|
||||
import org.apache.spark.sql.sources.HadoopFsRelation
|
||||
import org.apache.spark.sql.test.SharedSQLContext
|
||||
import org.apache.spark.sql.types._
|
||||
|
||||
|
@ -59,9 +60,9 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
|
|||
.select(output.map(e => Column(e)): _*)
|
||||
.where(Column(predicate))
|
||||
|
||||
var maybeRelation: Option[ParquetRelation] = None
|
||||
var maybeRelation: Option[HadoopFsRelation] = None
|
||||
val maybeAnalyzedPredicate = query.queryExecution.optimizedPlan.collect {
|
||||
case PhysicalOperation(_, filters, LogicalRelation(relation: ParquetRelation, _, _)) =>
|
||||
case PhysicalOperation(_, filters, LogicalRelation(relation: HadoopFsRelation, _, _)) =>
|
||||
maybeRelation = Some(relation)
|
||||
filters
|
||||
}.flatten.reduceLeftOption(_ && _)
|
||||
|
|
|
@ -437,8 +437,8 @@ class ParquetIOSuite extends QueryTest with ParquetTest with SharedSQLContext {
|
|||
readParquetFile(path.toString) { df =>
|
||||
assertResult(df.schema) {
|
||||
StructType(
|
||||
StructField("a", BooleanType, nullable = false) ::
|
||||
StructField("b", IntegerType, nullable = false) ::
|
||||
StructField("a", BooleanType, nullable = true) ::
|
||||
StructField("b", IntegerType, nullable = true) ::
|
||||
Nil)
|
||||
}
|
||||
}
|
||||
|
|
|
@ -31,6 +31,7 @@ import org.apache.spark.sql.catalyst.InternalRow
|
|||
import org.apache.spark.sql.catalyst.expressions.Literal
|
||||
import org.apache.spark.sql.execution.datasources.{LogicalRelation, Partition, PartitioningUtils, PartitionSpec}
|
||||
import org.apache.spark.sql.internal.SQLConf
|
||||
import org.apache.spark.sql.sources.HadoopFsRelation
|
||||
import org.apache.spark.sql.test.SharedSQLContext
|
||||
import org.apache.spark.sql.types._
|
||||
import org.apache.spark.unsafe.types.UTF8String
|
||||
|
@ -564,7 +565,7 @@ class ParquetPartitionDiscoverySuite extends QueryTest with ParquetTest with Sha
|
|||
(1 to 10).map(i => (i, i.toString)).toDF("a", "b").write.parquet(dir.getCanonicalPath)
|
||||
val queryExecution = sqlContext.read.parquet(dir.getCanonicalPath).queryExecution
|
||||
queryExecution.analyzed.collectFirst {
|
||||
case LogicalRelation(relation: ParquetRelation, _, _) =>
|
||||
case LogicalRelation(relation: HadoopFsRelation, _, _) =>
|
||||
assert(relation.partitionSpec === PartitionSpec.emptySpec)
|
||||
}.getOrElse {
|
||||
fail(s"Expecting a ParquetRelation2, but got:\n$queryExecution")
|
||||
|
|
|
@ -174,7 +174,7 @@ class InsertSuite extends DataSourceTest with SharedSQLContext {
|
|||
""".stripMargin)
|
||||
}.getMessage
|
||||
assert(
|
||||
message.contains("Cannot insert overwrite into table that is also being read from."),
|
||||
message.contains("Cannot overwrite a path that is also being read from."),
|
||||
"INSERT OVERWRITE to a table while querying it should not be allowed.")
|
||||
}
|
||||
|
||||
|
|
|
@ -21,7 +21,7 @@ import java.io.{ByteArrayInputStream, File, FileNotFoundException, InputStream}
|
|||
|
||||
import com.google.common.base.Charsets.UTF_8
|
||||
|
||||
import org.apache.spark.sql.StreamTest
|
||||
import org.apache.spark.sql.{AnalysisException, StreamTest}
|
||||
import org.apache.spark.sql.catalyst.util._
|
||||
import org.apache.spark.sql.execution.streaming._
|
||||
import org.apache.spark.sql.execution.streaming.FileStreamSource._
|
||||
|
@ -112,7 +112,7 @@ class FileStreamSourceSuite extends FileStreamSourceTest with SharedSQLContext {
|
|||
}
|
||||
|
||||
test("FileStreamSource schema: path doesn't exist") {
|
||||
intercept[FileNotFoundException] {
|
||||
intercept[AnalysisException] {
|
||||
createFileStreamSourceAndGetSchema(format = None, path = Some("/a/b/c"), schema = None)
|
||||
}
|
||||
}
|
||||
|
@ -146,11 +146,11 @@ class FileStreamSourceSuite extends FileStreamSourceTest with SharedSQLContext {
|
|||
|
||||
test("FileStreamSource schema: parquet, no existing files, no schema") {
|
||||
withTempDir { src =>
|
||||
val e = intercept[IllegalArgumentException] {
|
||||
val e = intercept[AnalysisException] {
|
||||
createFileStreamSourceAndGetSchema(
|
||||
format = Some("parquet"), path = Some(new File(src, "1").getCanonicalPath), schema = None)
|
||||
}
|
||||
assert("No schema specified" === e.getMessage)
|
||||
assert("Unable to infer schema. It must be specified manually.;" === e.getMessage)
|
||||
}
|
||||
}
|
||||
|
||||
|
@ -177,11 +177,11 @@ class FileStreamSourceSuite extends FileStreamSourceTest with SharedSQLContext {
|
|||
|
||||
test("FileStreamSource schema: json, no existing files, no schema") {
|
||||
withTempDir { src =>
|
||||
val e = intercept[IllegalArgumentException] {
|
||||
val e = intercept[AnalysisException] {
|
||||
createFileStreamSourceAndGetSchema(
|
||||
format = Some("json"), path = Some(src.getCanonicalPath), schema = None)
|
||||
}
|
||||
assert("No schema specified" === e.getMessage)
|
||||
assert("Unable to infer schema. It must be specified manually.;" === e.getMessage)
|
||||
}
|
||||
}
|
||||
|
||||
|
@ -310,10 +310,10 @@ class FileStreamSourceSuite extends FileStreamSourceTest with SharedSQLContext {
|
|||
createFileStreamSource("text", src.getCanonicalPath)
|
||||
|
||||
// Both "json" and "parquet" require a schema if no existing file to infer
|
||||
intercept[IllegalArgumentException] {
|
||||
intercept[AnalysisException] {
|
||||
createFileStreamSource("json", src.getCanonicalPath)
|
||||
}
|
||||
intercept[IllegalArgumentException] {
|
||||
intercept[AnalysisException] {
|
||||
createFileStreamSource("parquet", src.getCanonicalPath)
|
||||
}
|
||||
|
||||
|
|
|
@ -28,6 +28,7 @@ import org.scalatest.BeforeAndAfterAll
|
|||
|
||||
import org.apache.spark.SparkFunSuite
|
||||
import org.apache.spark.sql._
|
||||
import org.apache.spark.sql.catalyst.analysis.NoSuchTableException
|
||||
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
|
||||
import org.apache.spark.sql.catalyst.util._
|
||||
import org.apache.spark.util.Utils
|
||||
|
@ -140,7 +141,13 @@ private[sql] trait SQLTestUtils
|
|||
* Drops temporary table `tableName` after calling `f`.
|
||||
*/
|
||||
protected def withTempTable(tableNames: String*)(f: => Unit): Unit = {
|
||||
try f finally tableNames.foreach(sqlContext.dropTempTable)
|
||||
try f finally {
|
||||
// If the test failed part way, we don't want to mask the failure by failing to remove
|
||||
// temp tables that never got created.
|
||||
try tableNames.foreach(sqlContext.dropTempTable) catch {
|
||||
case _: NoSuchTableException =>
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
|
|
|
@ -22,7 +22,7 @@ import scala.collection.mutable
|
|||
|
||||
import com.google.common.base.Objects
|
||||
import com.google.common.cache.{CacheBuilder, CacheLoader, LoadingCache}
|
||||
import org.apache.hadoop.fs.Path
|
||||
import org.apache.hadoop.fs.{FileStatus, Path}
|
||||
import org.apache.hadoop.hive.common.StatsSetupConst
|
||||
import org.apache.hadoop.hive.conf.HiveConf
|
||||
import org.apache.hadoop.hive.metastore.{TableType => HiveTableType, Warehouse}
|
||||
|
@ -42,7 +42,7 @@ import org.apache.spark.sql.catalyst.plans.logical._
|
|||
import org.apache.spark.sql.catalyst.rules._
|
||||
import org.apache.spark.sql.execution.{datasources, FileRelation}
|
||||
import org.apache.spark.sql.execution.datasources.{Partition => ParquetPartition, _}
|
||||
import org.apache.spark.sql.execution.datasources.parquet.ParquetRelation
|
||||
import org.apache.spark.sql.execution.datasources.parquet.{DefaultSource, ParquetRelation}
|
||||
import org.apache.spark.sql.hive.client._
|
||||
import org.apache.spark.sql.hive.execution.HiveNativeCommand
|
||||
import org.apache.spark.sql.sources._
|
||||
|
@ -175,18 +175,15 @@ private[hive] class HiveMetastoreCatalog(val client: HiveClient, hive: HiveConte
|
|||
BucketSpec(n.toInt, getColumnNames("bucket"), getColumnNames("sort"))
|
||||
}
|
||||
|
||||
// It does not appear that the ql client for the metastore has a way to enumerate all the
|
||||
// SerDe properties directly...
|
||||
val options = table.storage.serdeProperties
|
||||
|
||||
val resolvedRelation =
|
||||
ResolvedDataSource(
|
||||
hive,
|
||||
userSpecifiedSchema,
|
||||
partitionColumns.toArray,
|
||||
bucketSpec,
|
||||
table.properties("spark.sql.sources.provider"),
|
||||
options)
|
||||
userSpecifiedSchema = userSpecifiedSchema,
|
||||
partitionColumns = partitionColumns.toArray,
|
||||
bucketSpec = bucketSpec,
|
||||
provider = table.properties("spark.sql.sources.provider"),
|
||||
options = options)
|
||||
|
||||
LogicalRelation(
|
||||
resolvedRelation.relation,
|
||||
|
@ -285,8 +282,14 @@ private[hive] class HiveMetastoreCatalog(val client: HiveClient, hive: HiveConte
|
|||
}
|
||||
|
||||
val maybeSerDe = HiveSerDe.sourceToSerDe(provider, hive.hiveconf)
|
||||
val dataSource = ResolvedDataSource(
|
||||
hive, userSpecifiedSchema, partitionColumns, bucketSpec, provider, options)
|
||||
val dataSource =
|
||||
ResolvedDataSource(
|
||||
hive,
|
||||
userSpecifiedSchema = userSpecifiedSchema,
|
||||
partitionColumns = partitionColumns,
|
||||
bucketSpec = bucketSpec,
|
||||
provider = provider,
|
||||
options = options)
|
||||
|
||||
def newSparkSQLSpecificMetastoreTable(): CatalogTable = {
|
||||
CatalogTable(
|
||||
|
@ -308,14 +311,14 @@ private[hive] class HiveMetastoreCatalog(val client: HiveClient, hive: HiveConte
|
|||
relation: HadoopFsRelation,
|
||||
serde: HiveSerDe): CatalogTable = {
|
||||
assert(partitionColumns.isEmpty)
|
||||
assert(relation.partitionColumns.isEmpty)
|
||||
assert(relation.partitionSchema.isEmpty)
|
||||
|
||||
CatalogTable(
|
||||
specifiedDatabase = Option(dbName),
|
||||
name = tblName,
|
||||
tableType = tableType,
|
||||
storage = CatalogStorageFormat(
|
||||
locationUri = Some(relation.paths.head),
|
||||
locationUri = Some(relation.location.paths.map(_.toUri.toString).head),
|
||||
inputFormat = serde.inputFormat,
|
||||
outputFormat = serde.outputFormat,
|
||||
serde = serde.serde,
|
||||
|
@ -339,25 +342,26 @@ private[hive] class HiveMetastoreCatalog(val client: HiveClient, hive: HiveConte
|
|||
(None, message)
|
||||
|
||||
case (Some(serde), relation: HadoopFsRelation)
|
||||
if relation.paths.length == 1 && relation.partitionColumns.isEmpty =>
|
||||
if relation.location.paths.length == 1 && relation.partitionSchema.isEmpty =>
|
||||
val hiveTable = newHiveCompatibleMetastoreTable(relation, serde)
|
||||
val message =
|
||||
s"Persisting data source relation $qualifiedTableName with a single input path " +
|
||||
s"into Hive metastore in Hive compatible format. Input path: ${relation.paths.head}."
|
||||
s"into Hive metastore in Hive compatible format. Input path: " +
|
||||
s"${relation.location.paths.head}."
|
||||
(Some(hiveTable), message)
|
||||
|
||||
case (Some(serde), relation: HadoopFsRelation) if relation.partitionColumns.nonEmpty =>
|
||||
case (Some(serde), relation: HadoopFsRelation) if relation.partitionSchema.nonEmpty =>
|
||||
val message =
|
||||
s"Persisting partitioned data source relation $qualifiedTableName into " +
|
||||
"Hive metastore in Spark SQL specific format, which is NOT compatible with Hive. " +
|
||||
"Input path(s): " + relation.paths.mkString("\n", "\n", "")
|
||||
"Input path(s): " + relation.location.paths.mkString("\n", "\n", "")
|
||||
(None, message)
|
||||
|
||||
case (Some(serde), relation: HadoopFsRelation) =>
|
||||
val message =
|
||||
s"Persisting data source relation $qualifiedTableName with multiple input paths into " +
|
||||
"Hive metastore in Spark SQL specific format, which is NOT compatible with Hive. " +
|
||||
s"Input paths: " + relation.paths.mkString("\n", "\n", "")
|
||||
s"Input paths: " + relation.location.paths.mkString("\n", "\n", "")
|
||||
(None, message)
|
||||
|
||||
case (Some(serde), _) =>
|
||||
|
@ -441,11 +445,7 @@ private[hive] class HiveMetastoreCatalog(val client: HiveClient, hive: HiveConte
|
|||
val metastoreSchema = StructType.fromAttributes(metastoreRelation.output)
|
||||
val mergeSchema = hive.convertMetastoreParquetWithSchemaMerging
|
||||
|
||||
// NOTE: Instead of passing Metastore schema directly to `ParquetRelation`, we have to
|
||||
// serialize the Metastore schema to JSON and pass it as a data source option because of the
|
||||
// evil case insensitivity issue, which is reconciled within `ParquetRelation`.
|
||||
val parquetOptions = Map(
|
||||
ParquetRelation.METASTORE_SCHEMA -> metastoreSchema.json,
|
||||
ParquetRelation.MERGE_SCHEMA -> mergeSchema.toString,
|
||||
ParquetRelation.METASTORE_TABLE_NAME -> TableIdentifier(
|
||||
metastoreRelation.tableName,
|
||||
|
@ -462,11 +462,11 @@ private[hive] class HiveMetastoreCatalog(val client: HiveClient, hive: HiveConte
|
|||
partitionSpecInMetastore: Option[PartitionSpec]): Option[LogicalRelation] = {
|
||||
cachedDataSourceTables.getIfPresent(tableIdentifier) match {
|
||||
case null => None // Cache miss
|
||||
case logical @ LogicalRelation(parquetRelation: ParquetRelation, _, _) =>
|
||||
case logical @ LogicalRelation(parquetRelation: HadoopFsRelation, _, _) =>
|
||||
// If we have the same paths, same schema, and same partition spec,
|
||||
// we will use the cached Parquet Relation.
|
||||
val useCached =
|
||||
parquetRelation.paths.toSet == pathsInMetastore.toSet &&
|
||||
parquetRelation.location.paths.map(_.toString).toSet == pathsInMetastore.toSet &&
|
||||
logical.schema.sameType(metastoreSchema) &&
|
||||
parquetRelation.partitionSpec == partitionSpecInMetastore.getOrElse {
|
||||
PartitionSpec(StructType(Nil), Array.empty[datasources.Partition])
|
||||
|
@ -502,13 +502,33 @@ private[hive] class HiveMetastoreCatalog(val client: HiveClient, hive: HiveConte
|
|||
ParquetPartition(values, location)
|
||||
}
|
||||
val partitionSpec = PartitionSpec(partitionSchema, partitions)
|
||||
val paths = partitions.map(_.path)
|
||||
|
||||
val cached = getCached(tableIdentifier, paths, metastoreSchema, Some(partitionSpec))
|
||||
val cached = getCached(
|
||||
tableIdentifier,
|
||||
metastoreRelation.table.storage.locationUri.toSeq,
|
||||
metastoreSchema,
|
||||
Some(partitionSpec))
|
||||
|
||||
val parquetRelation = cached.getOrElse {
|
||||
val created = LogicalRelation(
|
||||
new ParquetRelation(
|
||||
paths.toArray, None, Some(partitionSpec), parquetOptions)(hive))
|
||||
val paths = new Path(metastoreRelation.table.storage.locationUri.get) :: Nil
|
||||
val fileCatalog = new HiveFileCatalog(hive, paths, partitionSpec)
|
||||
val format = new DefaultSource()
|
||||
val inferredSchema = format.inferSchema(hive, parquetOptions, fileCatalog.allFiles())
|
||||
|
||||
val mergedSchema = inferredSchema.map { inferred =>
|
||||
ParquetRelation.mergeMetastoreParquetSchema(metastoreSchema, inferred)
|
||||
}.getOrElse(metastoreSchema)
|
||||
|
||||
val relation = HadoopFsRelation(
|
||||
sqlContext = hive,
|
||||
location = fileCatalog,
|
||||
partitionSchema = partitionSchema,
|
||||
dataSchema = mergedSchema,
|
||||
bucketSpec = None, // We don't support hive bucketed tables, only ones we write out.
|
||||
fileFormat = new DefaultSource(),
|
||||
options = parquetOptions)
|
||||
|
||||
val created = LogicalRelation(relation)
|
||||
cachedDataSourceTables.put(tableIdentifier, created)
|
||||
created
|
||||
}
|
||||
|
@ -519,15 +539,21 @@ private[hive] class HiveMetastoreCatalog(val client: HiveClient, hive: HiveConte
|
|||
|
||||
val cached = getCached(tableIdentifier, paths, metastoreSchema, None)
|
||||
val parquetRelation = cached.getOrElse {
|
||||
val created = LogicalRelation(
|
||||
new ParquetRelation(paths.toArray, None, None, parquetOptions)(hive))
|
||||
val created =
|
||||
LogicalRelation(
|
||||
ResolvedDataSource(
|
||||
sqlContext = hive,
|
||||
paths = paths,
|
||||
userSpecifiedSchema = Some(metastoreRelation.schema),
|
||||
options = parquetOptions,
|
||||
provider = "parquet").relation)
|
||||
|
||||
cachedDataSourceTables.put(tableIdentifier, created)
|
||||
created
|
||||
}
|
||||
|
||||
parquetRelation
|
||||
}
|
||||
|
||||
result.copy(expectedOutputAttributes = Some(metastoreRelation.output))
|
||||
}
|
||||
|
||||
|
@ -719,6 +745,25 @@ private[hive] class HiveMetastoreCatalog(val client: HiveClient, hive: HiveConte
|
|||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* An override of the standard HDFS listing based catalog, that overrides the partition spec with
|
||||
* the information from the metastore.
|
||||
*/
|
||||
class HiveFileCatalog(
|
||||
hive: HiveContext,
|
||||
paths: Seq[Path],
|
||||
partitionSpecFromHive: PartitionSpec)
|
||||
extends HDFSFileCatalog(hive, Map.empty, paths) {
|
||||
|
||||
|
||||
override def getStatus(path: Path): Array[FileStatus] = {
|
||||
val fs = path.getFileSystem(hive.sparkContext.hadoopConfiguration)
|
||||
fs.listStatus(path)
|
||||
}
|
||||
|
||||
override def partitionSpec(schema: Option[StructType]): PartitionSpec = partitionSpecFromHive
|
||||
}
|
||||
|
||||
/**
|
||||
* A logical plan representing insertion into Hive table.
|
||||
* This plan ignores nullability of ArrayType, MapType, StructType unlike InsertIntoTable
|
||||
|
|
|
@ -58,6 +58,7 @@ private[hive] class HiveSessionState(ctx: HiveContext) extends SessionState(ctx)
|
|||
catalog.PreInsertionCasts ::
|
||||
python.ExtractPythonUDFs ::
|
||||
PreInsertCastAndRename ::
|
||||
DataSourceAnalysis ::
|
||||
(if (conf.runSQLOnFile) new ResolveDataSource(ctx) :: Nil else Nil)
|
||||
|
||||
override val extendedCheckRules = Seq(PreWriteCheck(catalog))
|
||||
|
|
|
@ -147,6 +147,14 @@ case class CreateMetastoreDataSource(
|
|||
options
|
||||
}
|
||||
|
||||
// Create the relation to validate the arguments before writing the metadata to the metastore.
|
||||
ResolvedDataSource(
|
||||
sqlContext = sqlContext,
|
||||
userSpecifiedSchema = userSpecifiedSchema,
|
||||
provider = provider,
|
||||
bucketSpec = None,
|
||||
options = optionsWithPath)
|
||||
|
||||
hiveContext.catalog.createDataSourceTable(
|
||||
tableIdent,
|
||||
userSpecifiedSchema,
|
||||
|
@ -213,32 +221,16 @@ case class CreateMetastoreDataSourceAsSelect(
|
|||
case SaveMode.Append =>
|
||||
// Check if the specified data source match the data source of the existing table.
|
||||
val resolved = ResolvedDataSource(
|
||||
sqlContext,
|
||||
Some(query.schema.asNullable),
|
||||
partitionColumns,
|
||||
bucketSpec,
|
||||
provider,
|
||||
optionsWithPath)
|
||||
val createdRelation = LogicalRelation(resolved.relation)
|
||||
sqlContext = sqlContext,
|
||||
userSpecifiedSchema = Some(query.schema.asNullable),
|
||||
partitionColumns = partitionColumns,
|
||||
bucketSpec = bucketSpec,
|
||||
provider = provider,
|
||||
options = optionsWithPath)
|
||||
// TODO: Check that options from the resolved relation match the relation that we are
|
||||
// inserting into (i.e. using the same compression).
|
||||
EliminateSubqueryAliases(sqlContext.catalog.lookupRelation(tableIdent)) match {
|
||||
case l @ LogicalRelation(_: InsertableRelation | _: HadoopFsRelation, _, _) =>
|
||||
if (l.relation != createdRelation.relation) {
|
||||
val errorDescription =
|
||||
s"Cannot append to table $tableName because the resolved relation does not " +
|
||||
s"match the existing relation of $tableName. " +
|
||||
s"You can use insertInto($tableName, false) to append this DataFrame to the " +
|
||||
s"table $tableName and using its data source and options."
|
||||
val errorMessage =
|
||||
s"""
|
||||
|$errorDescription
|
||||
|== Relations ==
|
||||
|${sideBySide(
|
||||
s"== Expected Relation ==" :: l.toString :: Nil,
|
||||
s"== Actual Relation ==" :: createdRelation.toString :: Nil
|
||||
).mkString("\n")}
|
||||
""".stripMargin
|
||||
throw new AnalysisException(errorMessage)
|
||||
}
|
||||
existingSchema = Some(l.schema)
|
||||
case o =>
|
||||
throw new AnalysisException(s"Saving data in ${o.toString} is not supported.")
|
||||
|
|
|
@ -45,7 +45,6 @@ private[orc] object OrcFileOperator extends Logging {
|
|||
* directly from HDFS via Spark SQL, because we have to discover the schema from raw ORC
|
||||
* files. So this method always tries to find a ORC file whose schema is non-empty, and
|
||||
* create the result reader from that file. If no such file is found, it returns `None`.
|
||||
*
|
||||
* @todo Needs to consider all files when schema evolution is taken into account.
|
||||
*/
|
||||
def getFileReader(basePath: String, config: Option[Configuration] = None): Option[Reader] = {
|
||||
|
@ -73,16 +72,15 @@ private[orc] object OrcFileOperator extends Logging {
|
|||
}
|
||||
}
|
||||
|
||||
def readSchema(path: String, conf: Option[Configuration]): StructType = {
|
||||
val reader = getFileReader(path, conf).getOrElse {
|
||||
throw new AnalysisException(
|
||||
s"Failed to discover schema from ORC files stored in $path. " +
|
||||
"Probably there are either no ORC files or only empty ORC files.")
|
||||
def readSchema(paths: Seq[String], conf: Option[Configuration]): Option[StructType] = {
|
||||
// Take the first file where we can open a valid reader if we can find one. Otherwise just
|
||||
// return None to indicate we can't infer the schema.
|
||||
paths.flatMap(getFileReader(_, conf)).headOption.map { reader =>
|
||||
val readerInspector = reader.getObjectInspector.asInstanceOf[StructObjectInspector]
|
||||
val schema = readerInspector.getTypeName
|
||||
logDebug(s"Reading schema from file $paths, got Hive schema string: $schema")
|
||||
HiveMetastoreTypes.toDataType(schema).asInstanceOf[StructType]
|
||||
}
|
||||
val readerInspector = reader.getObjectInspector.asInstanceOf[StructObjectInspector]
|
||||
val schema = readerInspector.getTypeName
|
||||
logDebug(s"Reading schema from file $path, got Hive schema string: $schema")
|
||||
HiveMetastoreTypes.toDataType(schema).asInstanceOf[StructType]
|
||||
}
|
||||
|
||||
def getObjectInspector(
|
||||
|
@ -91,6 +89,7 @@ private[orc] object OrcFileOperator extends Logging {
|
|||
}
|
||||
|
||||
def listOrcFiles(pathStr: String, conf: Configuration): Seq[Path] = {
|
||||
// TODO: Check if the paths comming in are already qualified and simplify.
|
||||
val origPath = new Path(pathStr)
|
||||
val fs = origPath.getFileSystem(conf)
|
||||
val path = origPath.makeQualified(fs.getUri, fs.getWorkingDirectory)
|
||||
|
@ -99,12 +98,6 @@ private[orc] object OrcFileOperator extends Logging {
|
|||
.map(_.getPath)
|
||||
.filterNot(_.getName.startsWith("_"))
|
||||
.filterNot(_.getName.startsWith("."))
|
||||
|
||||
if (paths == null || paths.isEmpty) {
|
||||
throw new IllegalArgumentException(
|
||||
s"orcFileOperator: path $path does not have valid orc files matching the pattern")
|
||||
}
|
||||
|
||||
paths
|
||||
}
|
||||
}
|
||||
|
|
|
@ -43,23 +43,80 @@ import org.apache.spark.sql.hive.{HiveContext, HiveInspectors, HiveMetastoreType
|
|||
import org.apache.spark.sql.sources.{Filter, _}
|
||||
import org.apache.spark.sql.types.StructType
|
||||
import org.apache.spark.util.SerializableConfiguration
|
||||
import org.apache.spark.util.collection.BitSet
|
||||
|
||||
private[sql] class DefaultSource extends BucketedHadoopFsRelationProvider with DataSourceRegister {
|
||||
private[sql] class DefaultSource extends FileFormat with DataSourceRegister {
|
||||
|
||||
override def shortName(): String = "orc"
|
||||
|
||||
override def createRelation(
|
||||
sqlContext: SQLContext,
|
||||
paths: Array[String],
|
||||
dataSchema: Option[StructType],
|
||||
partitionColumns: Option[StructType],
|
||||
bucketSpec: Option[BucketSpec],
|
||||
parameters: Map[String, String]): HadoopFsRelation = {
|
||||
assert(
|
||||
sqlContext.isInstanceOf[HiveContext],
|
||||
"The ORC data source can only be used with HiveContext.")
|
||||
override def toString: String = "ORC"
|
||||
|
||||
new OrcRelation(paths, dataSchema, None, partitionColumns, bucketSpec, parameters)(sqlContext)
|
||||
override def inferSchema(
|
||||
sqlContext: SQLContext,
|
||||
options: Map[String, String],
|
||||
files: Seq[FileStatus]): Option[StructType] = {
|
||||
OrcFileOperator.readSchema(
|
||||
files.map(_.getPath.toUri.toString), Some(sqlContext.sparkContext.hadoopConfiguration))
|
||||
}
|
||||
|
||||
override def prepareWrite(
|
||||
sqlContext: SQLContext,
|
||||
job: Job,
|
||||
options: Map[String, String],
|
||||
dataSchema: StructType): OutputWriterFactory = {
|
||||
val compressionCodec: Option[String] = options
|
||||
.get("compression")
|
||||
.map { codecName =>
|
||||
// Validate if given compression codec is supported or not.
|
||||
val shortOrcCompressionCodecNames = OrcRelation.shortOrcCompressionCodecNames
|
||||
if (!shortOrcCompressionCodecNames.contains(codecName.toLowerCase)) {
|
||||
val availableCodecs = shortOrcCompressionCodecNames.keys.map(_.toLowerCase)
|
||||
throw new IllegalArgumentException(s"Codec [$codecName] " +
|
||||
s"is not available. Available codecs are ${availableCodecs.mkString(", ")}.")
|
||||
}
|
||||
codecName.toLowerCase
|
||||
}
|
||||
|
||||
compressionCodec.foreach { codecName =>
|
||||
job.getConfiguration.set(
|
||||
OrcTableProperties.COMPRESSION.getPropName,
|
||||
OrcRelation
|
||||
.shortOrcCompressionCodecNames
|
||||
.getOrElse(codecName, CompressionKind.NONE).name())
|
||||
}
|
||||
|
||||
job.getConfiguration match {
|
||||
case conf: JobConf =>
|
||||
conf.setOutputFormat(classOf[OrcOutputFormat])
|
||||
case conf =>
|
||||
conf.setClass(
|
||||
"mapred.output.format.class",
|
||||
classOf[OrcOutputFormat],
|
||||
classOf[MapRedOutputFormat[_, _]])
|
||||
}
|
||||
|
||||
new OutputWriterFactory {
|
||||
override def newInstance(
|
||||
path: String,
|
||||
bucketId: Option[Int],
|
||||
dataSchema: StructType,
|
||||
context: TaskAttemptContext): OutputWriter = {
|
||||
new OrcOutputWriter(path, bucketId, dataSchema, context)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
override def buildInternalScan(
|
||||
sqlContext: SQLContext,
|
||||
dataSchema: StructType,
|
||||
requiredColumns: Array[String],
|
||||
filters: Array[Filter],
|
||||
bucketSet: Option[BitSet],
|
||||
inputFiles: Array[FileStatus],
|
||||
broadcastedConf: Broadcast[SerializableConfiguration],
|
||||
options: Map[String, String]): RDD[InternalRow] = {
|
||||
val output = StructType(requiredColumns.map(dataSchema(_))).toAttributes
|
||||
OrcTableScan(sqlContext, output, filters, inputFiles).execute()
|
||||
}
|
||||
}
|
||||
|
||||
|
@ -115,7 +172,8 @@ private[orc] class OrcOutputWriter(
|
|||
).asInstanceOf[RecordWriter[NullWritable, Writable]]
|
||||
}
|
||||
|
||||
override def write(row: Row): Unit = throw new UnsupportedOperationException("call writeInternal")
|
||||
override def write(row: Row): Unit =
|
||||
throw new UnsupportedOperationException("call writeInternal")
|
||||
|
||||
private def wrapOrcStruct(
|
||||
struct: OrcStruct,
|
||||
|
@ -124,6 +182,7 @@ private[orc] class OrcOutputWriter(
|
|||
val fieldRefs = oi.getAllStructFieldRefs
|
||||
var i = 0
|
||||
while (i < fieldRefs.size) {
|
||||
|
||||
oi.setStructFieldData(
|
||||
struct,
|
||||
fieldRefs.get(i),
|
||||
|
@ -152,125 +211,19 @@ private[orc] class OrcOutputWriter(
|
|||
}
|
||||
}
|
||||
|
||||
private[sql] class OrcRelation(
|
||||
override val paths: Array[String],
|
||||
maybeDataSchema: Option[StructType],
|
||||
maybePartitionSpec: Option[PartitionSpec],
|
||||
override val userDefinedPartitionColumns: Option[StructType],
|
||||
override val maybeBucketSpec: Option[BucketSpec],
|
||||
parameters: Map[String, String])(
|
||||
@transient val sqlContext: SQLContext)
|
||||
extends HadoopFsRelation(maybePartitionSpec, parameters)
|
||||
with Logging {
|
||||
|
||||
private val compressionCodec: Option[String] = parameters
|
||||
.get("compression")
|
||||
.map { codecName =>
|
||||
// Validate if given compression codec is supported or not.
|
||||
val shortOrcCompressionCodecNames = OrcRelation.shortOrcCompressionCodecNames
|
||||
if (!shortOrcCompressionCodecNames.contains(codecName.toLowerCase)) {
|
||||
val availableCodecs = shortOrcCompressionCodecNames.keys.map(_.toLowerCase)
|
||||
throw new IllegalArgumentException(s"Codec [$codecName] " +
|
||||
s"is not available. Available codecs are ${availableCodecs.mkString(", ")}.")
|
||||
}
|
||||
codecName.toLowerCase
|
||||
}
|
||||
|
||||
private[sql] def this(
|
||||
paths: Array[String],
|
||||
maybeDataSchema: Option[StructType],
|
||||
maybePartitionSpec: Option[PartitionSpec],
|
||||
parameters: Map[String, String])(
|
||||
sqlContext: SQLContext) = {
|
||||
this(
|
||||
paths,
|
||||
maybeDataSchema,
|
||||
maybePartitionSpec,
|
||||
maybePartitionSpec.map(_.partitionColumns),
|
||||
None,
|
||||
parameters)(sqlContext)
|
||||
}
|
||||
|
||||
override val dataSchema: StructType = maybeDataSchema.getOrElse {
|
||||
OrcFileOperator.readSchema(
|
||||
paths.head, Some(sqlContext.sparkContext.hadoopConfiguration))
|
||||
}
|
||||
|
||||
override def needConversion: Boolean = false
|
||||
|
||||
override def equals(other: Any): Boolean = other match {
|
||||
case that: OrcRelation =>
|
||||
paths.toSet == that.paths.toSet &&
|
||||
dataSchema == that.dataSchema &&
|
||||
schema == that.schema &&
|
||||
partitionColumns == that.partitionColumns
|
||||
case _ => false
|
||||
}
|
||||
|
||||
override def hashCode(): Int = {
|
||||
Objects.hashCode(
|
||||
paths.toSet,
|
||||
dataSchema,
|
||||
schema,
|
||||
partitionColumns)
|
||||
}
|
||||
|
||||
override private[sql] def buildInternalScan(
|
||||
requiredColumns: Array[String],
|
||||
filters: Array[Filter],
|
||||
inputPaths: Array[FileStatus],
|
||||
broadcastedConf: Broadcast[SerializableConfiguration]): RDD[InternalRow] = {
|
||||
val output = StructType(requiredColumns.map(dataSchema(_))).toAttributes
|
||||
OrcTableScan(output, this, filters, inputPaths).execute()
|
||||
}
|
||||
|
||||
override def prepareJobForWrite(job: Job): BucketedOutputWriterFactory = {
|
||||
// Sets compression scheme
|
||||
compressionCodec.foreach { codecName =>
|
||||
job.getConfiguration.set(
|
||||
OrcTableProperties.COMPRESSION.getPropName,
|
||||
OrcRelation
|
||||
.shortOrcCompressionCodecNames
|
||||
.getOrElse(codecName, CompressionKind.NONE).name())
|
||||
}
|
||||
|
||||
job.getConfiguration match {
|
||||
case conf: JobConf =>
|
||||
conf.setOutputFormat(classOf[OrcOutputFormat])
|
||||
case conf =>
|
||||
conf.setClass(
|
||||
"mapred.output.format.class",
|
||||
classOf[OrcOutputFormat],
|
||||
classOf[MapRedOutputFormat[_, _]])
|
||||
}
|
||||
|
||||
new BucketedOutputWriterFactory {
|
||||
override def newInstance(
|
||||
path: String,
|
||||
bucketId: Option[Int],
|
||||
dataSchema: StructType,
|
||||
context: TaskAttemptContext): OutputWriter = {
|
||||
new OrcOutputWriter(path, bucketId, dataSchema, context)
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
private[orc] case class OrcTableScan(
|
||||
@transient sqlContext: SQLContext,
|
||||
attributes: Seq[Attribute],
|
||||
@transient relation: OrcRelation,
|
||||
filters: Array[Filter],
|
||||
@transient inputPaths: Array[FileStatus])
|
||||
extends Logging
|
||||
with HiveInspectors {
|
||||
|
||||
@transient private val sqlContext = relation.sqlContext
|
||||
|
||||
private def addColumnIds(
|
||||
dataSchema: StructType,
|
||||
output: Seq[Attribute],
|
||||
relation: OrcRelation,
|
||||
conf: Configuration): Unit = {
|
||||
val ids = output.map(a => relation.dataSchema.fieldIndex(a.name): Integer)
|
||||
val ids = output.map(a => dataSchema.fieldIndex(a.name): Integer)
|
||||
val (sortedIds, sortedNames) = ids.zip(attributes.map(_.name)).sorted.unzip
|
||||
HiveShim.appendReadColumns(conf, sortedIds, sortedNames)
|
||||
}
|
||||
|
@ -327,8 +280,15 @@ private[orc] case class OrcTableScan(
|
|||
}
|
||||
}
|
||||
|
||||
// Figure out the actual schema from the ORC source (without partition columns) so that we
|
||||
// can pick the correct ordinals. Note that this assumes that all files have the same schema.
|
||||
val orcFormat = new DefaultSource
|
||||
val dataSchema =
|
||||
orcFormat
|
||||
.inferSchema(sqlContext, Map.empty, inputPaths)
|
||||
.getOrElse(sys.error("Failed to read schema from target ORC files."))
|
||||
// Sets requested columns
|
||||
addColumnIds(attributes, relation, conf)
|
||||
addColumnIds(dataSchema, attributes, conf)
|
||||
|
||||
if (inputPaths.isEmpty) {
|
||||
// the input path probably be pruned, return an empty RDD.
|
||||
|
|
|
@ -46,7 +46,7 @@ import org.apache.spark.util.{ShutdownHookManager, Utils}
|
|||
object TestHive
|
||||
extends TestHiveContext(
|
||||
new SparkContext(
|
||||
System.getProperty("spark.sql.test.master", "local[32]"),
|
||||
System.getProperty("spark.sql.test.master", "local[1]"),
|
||||
"TestSQLContext",
|
||||
new SparkConf()
|
||||
.set("spark.sql.test", "")
|
||||
|
|
|
@ -17,7 +17,7 @@
|
|||
|
||||
package org.apache.spark.sql.hive
|
||||
|
||||
import java.io.{File, IOException}
|
||||
import java.io.File
|
||||
|
||||
import scala.collection.mutable.ArrayBuffer
|
||||
|
||||
|
@ -27,9 +27,9 @@ import org.apache.spark.sql._
|
|||
import org.apache.spark.sql.catalyst.TableIdentifier
|
||||
import org.apache.spark.sql.catalyst.catalog.{CatalogStorageFormat, CatalogTable, CatalogTableType}
|
||||
import org.apache.spark.sql.execution.datasources.LogicalRelation
|
||||
import org.apache.spark.sql.execution.datasources.parquet.ParquetRelation
|
||||
import org.apache.spark.sql.hive.test.TestHiveSingleton
|
||||
import org.apache.spark.sql.internal.SQLConf
|
||||
import org.apache.spark.sql.sources.HadoopFsRelation
|
||||
import org.apache.spark.sql.test.SQLTestUtils
|
||||
import org.apache.spark.sql.types._
|
||||
import org.apache.spark.util.Utils
|
||||
|
@ -403,20 +403,6 @@ class MetastoreDataSourcesSuite extends QueryTest with SQLTestUtils with TestHiv
|
|||
}
|
||||
}
|
||||
|
||||
test("SPARK-5286 Fail to drop an invalid table when using the data source API") {
|
||||
withTable("jsonTable") {
|
||||
sql(
|
||||
s"""CREATE TABLE jsonTable
|
||||
|USING org.apache.spark.sql.json.DefaultSource
|
||||
|OPTIONS (
|
||||
| path 'it is not a path at all!'
|
||||
|)
|
||||
""".stripMargin)
|
||||
|
||||
sql("DROP TABLE jsonTable").collect().foreach(i => logInfo(i.toString))
|
||||
}
|
||||
}
|
||||
|
||||
test("SPARK-5839 HiveMetastoreCatalog does not recognize table aliases of data source tables.") {
|
||||
withTable("savedJsonTable") {
|
||||
// Save the df as a managed table (by not specifying the path).
|
||||
|
@ -473,7 +459,7 @@ class MetastoreDataSourcesSuite extends QueryTest with SQLTestUtils with TestHiv
|
|||
|
||||
// Drop table will also delete the data.
|
||||
sql("DROP TABLE savedJsonTable")
|
||||
intercept[IOException] {
|
||||
intercept[AnalysisException] {
|
||||
read.json(catalog.hiveDefaultTableFilePath(TableIdentifier("savedJsonTable")))
|
||||
}
|
||||
}
|
||||
|
@ -541,21 +527,26 @@ class MetastoreDataSourcesSuite extends QueryTest with SQLTestUtils with TestHiv
|
|||
sql("SELECT b FROM savedJsonTable"))
|
||||
|
||||
sql("DROP TABLE createdJsonTable")
|
||||
|
||||
assert(
|
||||
intercept[RuntimeException] {
|
||||
createExternalTable(
|
||||
"createdJsonTable",
|
||||
"org.apache.spark.sql.json",
|
||||
schema,
|
||||
Map.empty[String, String])
|
||||
}.getMessage.contains("'path' is not specified"),
|
||||
"We should complain that path is not specified.")
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
test("path required error") {
|
||||
assert(
|
||||
intercept[AnalysisException] {
|
||||
createExternalTable(
|
||||
"createdJsonTable",
|
||||
"org.apache.spark.sql.json",
|
||||
Map.empty[String, String])
|
||||
|
||||
table("createdJsonTable")
|
||||
}.getMessage.contains("Unable to infer schema"),
|
||||
"We should complain that path is not specified.")
|
||||
|
||||
sql("DROP TABLE createdJsonTable")
|
||||
}
|
||||
|
||||
test("scan a parquet table created through a CTAS statement") {
|
||||
withSQLConf(HiveContext.CONVERT_METASTORE_PARQUET.key -> "true") {
|
||||
withTempTable("jt") {
|
||||
|
@ -572,9 +563,9 @@ class MetastoreDataSourcesSuite extends QueryTest with SQLTestUtils with TestHiv
|
|||
Row(3) :: Row(4) :: Nil)
|
||||
|
||||
table("test_parquet_ctas").queryExecution.optimizedPlan match {
|
||||
case LogicalRelation(p: ParquetRelation, _, _) => // OK
|
||||
case LogicalRelation(p: HadoopFsRelation, _, _) => // OK
|
||||
case _ =>
|
||||
fail(s"test_parquet_ctas should have be converted to ${classOf[ParquetRelation]}")
|
||||
fail(s"test_parquet_ctas should have be converted to ${classOf[HadoopFsRelation]}")
|
||||
}
|
||||
}
|
||||
}
|
||||
|
|
|
@ -25,11 +25,11 @@ import org.apache.spark.sql._
|
|||
import org.apache.spark.sql.catalyst.TableIdentifier
|
||||
import org.apache.spark.sql.catalyst.analysis.{EliminateSubqueryAliases, FunctionRegistry}
|
||||
import org.apache.spark.sql.execution.datasources.LogicalRelation
|
||||
import org.apache.spark.sql.execution.datasources.parquet.ParquetRelation
|
||||
import org.apache.spark.sql.functions._
|
||||
import org.apache.spark.sql.hive.{HiveContext, MetastoreRelation}
|
||||
import org.apache.spark.sql.hive.test.TestHiveSingleton
|
||||
import org.apache.spark.sql.internal.SQLConf
|
||||
import org.apache.spark.sql.sources.HadoopFsRelation
|
||||
import org.apache.spark.sql.test.SQLTestUtils
|
||||
import org.apache.spark.sql.types._
|
||||
import org.apache.spark.unsafe.types.CalendarInterval
|
||||
|
@ -277,17 +277,17 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
|
|||
def checkRelation(tableName: String, isDataSourceParquet: Boolean): Unit = {
|
||||
val relation = EliminateSubqueryAliases(catalog.lookupRelation(TableIdentifier(tableName)))
|
||||
relation match {
|
||||
case LogicalRelation(r: ParquetRelation, _, _) =>
|
||||
case LogicalRelation(r: HadoopFsRelation, _, _) =>
|
||||
if (!isDataSourceParquet) {
|
||||
fail(
|
||||
s"${classOf[MetastoreRelation].getCanonicalName} is expected, but found " +
|
||||
s"${ParquetRelation.getClass.getCanonicalName}.")
|
||||
s"${HadoopFsRelation.getClass.getCanonicalName}.")
|
||||
}
|
||||
|
||||
case r: MetastoreRelation =>
|
||||
if (isDataSourceParquet) {
|
||||
fail(
|
||||
s"${ParquetRelation.getClass.getCanonicalName} is expected, but found " +
|
||||
s"${HadoopFsRelation.getClass.getCanonicalName} is expected, but found " +
|
||||
s"${classOf[MetastoreRelation].getCanonicalName}.")
|
||||
}
|
||||
}
|
||||
|
|
|
@ -26,6 +26,7 @@ import org.apache.spark.sql.catalyst.dsl.expressions._
|
|||
import org.apache.spark.sql.catalyst.expressions._
|
||||
import org.apache.spark.sql.catalyst.planning.PhysicalOperation
|
||||
import org.apache.spark.sql.execution.datasources.{DataSourceStrategy, LogicalRelation}
|
||||
import org.apache.spark.sql.sources.HadoopFsRelation
|
||||
|
||||
/**
|
||||
* A test suite that tests ORC filter API based filter pushdown optimization.
|
||||
|
@ -40,9 +41,9 @@ class OrcFilterSuite extends QueryTest with OrcTest {
|
|||
.select(output.map(e => Column(e)): _*)
|
||||
.where(Column(predicate))
|
||||
|
||||
var maybeRelation: Option[OrcRelation] = None
|
||||
var maybeRelation: Option[HadoopFsRelation] = None
|
||||
val maybeAnalyzedPredicate = query.queryExecution.optimizedPlan.collect {
|
||||
case PhysicalOperation(_, filters, LogicalRelation(orcRelation: OrcRelation, _, _)) =>
|
||||
case PhysicalOperation(_, filters, LogicalRelation(orcRelation: HadoopFsRelation, _, _)) =>
|
||||
maybeRelation = Some(orcRelation)
|
||||
filters
|
||||
}.flatten.reduceLeftOption(_ && _)
|
||||
|
|
|
@ -330,7 +330,7 @@ class OrcQuerySuite extends QueryTest with BeforeAndAfterAll with OrcTest {
|
|||
sqlContext.read.orc(path)
|
||||
}.getMessage
|
||||
|
||||
assert(errorMessage.contains("Failed to discover schema from ORC files"))
|
||||
assert(errorMessage.contains("Unable to infer schema for ORC"))
|
||||
|
||||
val singleRowDF = Seq((0, "foo")).toDF("key", "value").coalesce(1)
|
||||
singleRowDF.registerTempTable("single")
|
||||
|
@ -348,7 +348,7 @@ class OrcQuerySuite extends QueryTest with BeforeAndAfterAll with OrcTest {
|
|||
}
|
||||
}
|
||||
|
||||
test("SPARK-10623 Enable ORC PPD") {
|
||||
ignore("SPARK-10623 Enable ORC PPD") {
|
||||
withTempPath { dir =>
|
||||
withSQLConf(SQLConf.ORC_FILTER_PUSHDOWN_ENABLED.key -> "true") {
|
||||
import testImplicits._
|
||||
|
@ -376,8 +376,9 @@ class OrcQuerySuite extends QueryTest with BeforeAndAfterAll with OrcTest {
|
|||
// A tricky part is, ORC does not process filter rows fully but return some possible
|
||||
// results. So, this checks if the number of result is less than the original count
|
||||
// of data, and then checks if it contains the expected data.
|
||||
val isOrcFiltered = sourceDf.count < 10 && expectedData.subsetOf(data)
|
||||
assert(isOrcFiltered)
|
||||
assert(
|
||||
sourceDf.count < 10 && expectedData.subsetOf(data),
|
||||
s"No data was filtered for predicate: $pred")
|
||||
}
|
||||
|
||||
checkPredicate('a === 5, List(5).map(Row(_, null)))
|
||||
|
|
|
@ -23,10 +23,10 @@ import org.apache.spark.sql._
|
|||
import org.apache.spark.sql.execution.PhysicalRDD
|
||||
import org.apache.spark.sql.execution.command.ExecutedCommand
|
||||
import org.apache.spark.sql.execution.datasources.{InsertIntoDataSource, InsertIntoHadoopFsRelation, LogicalRelation}
|
||||
import org.apache.spark.sql.execution.datasources.parquet.ParquetRelation
|
||||
import org.apache.spark.sql.hive.execution.HiveTableScan
|
||||
import org.apache.spark.sql.hive.test.TestHiveSingleton
|
||||
import org.apache.spark.sql.internal.SQLConf
|
||||
import org.apache.spark.sql.sources.HadoopFsRelation
|
||||
import org.apache.spark.sql.test.SQLTestUtils
|
||||
import org.apache.spark.sql.types._
|
||||
import org.apache.spark.util.Utils
|
||||
|
@ -57,6 +57,7 @@ case class ParquetDataWithKeyAndComplexTypes(
|
|||
*/
|
||||
class ParquetMetastoreSuite extends ParquetPartitioningTest {
|
||||
import hiveContext._
|
||||
import hiveContext.implicits._
|
||||
|
||||
override def beforeAll(): Unit = {
|
||||
super.beforeAll()
|
||||
|
@ -170,10 +171,8 @@ class ParquetMetastoreSuite extends ParquetPartitioningTest {
|
|||
sql(s"ALTER TABLE partitioned_parquet_with_complextypes ADD PARTITION (p=$p)")
|
||||
}
|
||||
|
||||
val rdd1 = sparkContext.parallelize((1 to 10).map(i => s"""{"a":$i, "b":"str$i"}"""))
|
||||
read.json(rdd1).registerTempTable("jt")
|
||||
val rdd2 = sparkContext.parallelize((1 to 10).map(i => s"""{"a":[$i, null]}"""))
|
||||
read.json(rdd2).registerTempTable("jt_array")
|
||||
(1 to 10).map(i => (i, s"str$i")).toDF("a", "b").registerTempTable("jt")
|
||||
(1 to 10).map(i => Tuple1(Seq(new Integer(i), null))).toDF("a").registerTempTable("jt_array")
|
||||
|
||||
setConf(HiveContext.CONVERT_METASTORE_PARQUET, true)
|
||||
}
|
||||
|
@ -284,10 +283,10 @@ class ParquetMetastoreSuite extends ParquetPartitioningTest {
|
|||
)
|
||||
|
||||
table("test_parquet_ctas").queryExecution.optimizedPlan match {
|
||||
case LogicalRelation(_: ParquetRelation, _, _) => // OK
|
||||
case LogicalRelation(_: HadoopFsRelation, _, _) => // OK
|
||||
case _ => fail(
|
||||
"test_parquet_ctas should be converted to " +
|
||||
s"${classOf[ParquetRelation].getCanonicalName }")
|
||||
s"${classOf[HadoopFsRelation ].getCanonicalName }")
|
||||
}
|
||||
}
|
||||
}
|
||||
|
@ -308,9 +307,9 @@ class ParquetMetastoreSuite extends ParquetPartitioningTest {
|
|||
|
||||
val df = sql("INSERT INTO TABLE test_insert_parquet SELECT a FROM jt")
|
||||
df.queryExecution.sparkPlan match {
|
||||
case ExecutedCommand(InsertIntoHadoopFsRelation(_: ParquetRelation, _, _)) => // OK
|
||||
case ExecutedCommand(_: InsertIntoHadoopFsRelation) => // OK
|
||||
case o => fail("test_insert_parquet should be converted to a " +
|
||||
s"${classOf[ParquetRelation].getCanonicalName} and " +
|
||||
s"${classOf[HadoopFsRelation ].getCanonicalName} and " +
|
||||
s"${classOf[InsertIntoDataSource].getCanonicalName} is expcted as the SparkPlan. " +
|
||||
s"However, found a ${o.toString} ")
|
||||
}
|
||||
|
@ -338,9 +337,9 @@ class ParquetMetastoreSuite extends ParquetPartitioningTest {
|
|||
|
||||
val df = sql("INSERT INTO TABLE test_insert_parquet SELECT a FROM jt_array")
|
||||
df.queryExecution.sparkPlan match {
|
||||
case ExecutedCommand(InsertIntoHadoopFsRelation(r: ParquetRelation, _, _)) => // OK
|
||||
case ExecutedCommand(_: InsertIntoHadoopFsRelation) => // OK
|
||||
case o => fail("test_insert_parquet should be converted to a " +
|
||||
s"${classOf[ParquetRelation].getCanonicalName} and " +
|
||||
s"${classOf[HadoopFsRelation ].getCanonicalName} and " +
|
||||
s"${classOf[InsertIntoDataSource].getCanonicalName} is expcted as the SparkPlan." +
|
||||
s"However, found a ${o.toString} ")
|
||||
}
|
||||
|
@ -371,18 +370,18 @@ class ParquetMetastoreSuite extends ParquetPartitioningTest {
|
|||
|
||||
assertResult(2) {
|
||||
analyzed.collect {
|
||||
case r @ LogicalRelation(_: ParquetRelation, _, _) => r
|
||||
case r @ LogicalRelation(_: HadoopFsRelation, _, _) => r
|
||||
}.size
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
def collectParquetRelation(df: DataFrame): ParquetRelation = {
|
||||
def collectHadoopFsRelation(df: DataFrame): HadoopFsRelation = {
|
||||
val plan = df.queryExecution.analyzed
|
||||
plan.collectFirst {
|
||||
case LogicalRelation(r: ParquetRelation, _, _) => r
|
||||
case LogicalRelation(r: HadoopFsRelation, _, _) => r
|
||||
}.getOrElse {
|
||||
fail(s"Expecting a ParquetRelation2, but got:\n$plan")
|
||||
fail(s"Expecting a HadoopFsRelation 2, but got:\n$plan")
|
||||
}
|
||||
}
|
||||
|
||||
|
@ -397,9 +396,9 @@ class ParquetMetastoreSuite extends ParquetPartitioningTest {
|
|||
""".stripMargin)
|
||||
|
||||
// First lookup fills the cache
|
||||
val r1 = collectParquetRelation(table("nonPartitioned"))
|
||||
val r1 = collectHadoopFsRelation (table("nonPartitioned"))
|
||||
// Second lookup should reuse the cache
|
||||
val r2 = collectParquetRelation(table("nonPartitioned"))
|
||||
val r2 = collectHadoopFsRelation (table("nonPartitioned"))
|
||||
// They should be the same instance
|
||||
assert(r1 eq r2)
|
||||
}
|
||||
|
@ -417,9 +416,9 @@ class ParquetMetastoreSuite extends ParquetPartitioningTest {
|
|||
""".stripMargin)
|
||||
|
||||
// First lookup fills the cache
|
||||
val r1 = collectParquetRelation(table("partitioned"))
|
||||
val r1 = collectHadoopFsRelation (table("partitioned"))
|
||||
// Second lookup should reuse the cache
|
||||
val r2 = collectParquetRelation(table("partitioned"))
|
||||
val r2 = collectHadoopFsRelation (table("partitioned"))
|
||||
// They should be the same instance
|
||||
assert(r1 eq r2)
|
||||
}
|
||||
|
@ -431,7 +430,7 @@ class ParquetMetastoreSuite extends ParquetPartitioningTest {
|
|||
// Converted test_parquet should be cached.
|
||||
catalog.cachedDataSourceTables.getIfPresent(tableIdentifier) match {
|
||||
case null => fail("Converted test_parquet should be cached in the cache.")
|
||||
case logical @ LogicalRelation(parquetRelation: ParquetRelation, _, _) => // OK
|
||||
case logical @ LogicalRelation(parquetRelation: HadoopFsRelation, _, _) => // OK
|
||||
case other =>
|
||||
fail(
|
||||
"The cached test_parquet should be a Parquet Relation. " +
|
||||
|
@ -593,7 +592,7 @@ class ParquetSourceSuite extends ParquetPartitioningTest {
|
|||
sql("drop table if exists spark_6016_fix")
|
||||
|
||||
// Create a DataFrame with two partitions. So, the created table will have two parquet files.
|
||||
val df1 = read.json(sparkContext.parallelize((1 to 10).map(i => s"""{"a":$i}"""), 2))
|
||||
val df1 = (1 to 10).map(Tuple1(_)).toDF("a").coalesce(2)
|
||||
df1.write.mode(SaveMode.Overwrite).format("parquet").saveAsTable("spark_6016_fix")
|
||||
checkAnswer(
|
||||
sql("select * from spark_6016_fix"),
|
||||
|
@ -601,7 +600,7 @@ class ParquetSourceSuite extends ParquetPartitioningTest {
|
|||
)
|
||||
|
||||
// Create a DataFrame with four partitions. So, the created table will have four parquet files.
|
||||
val df2 = read.json(sparkContext.parallelize((1 to 10).map(i => s"""{"b":$i}"""), 4))
|
||||
val df2 = (1 to 10).map(Tuple1(_)).toDF("b").coalesce(4)
|
||||
df2.write.mode(SaveMode.Overwrite).format("parquet").saveAsTable("spark_6016_fix")
|
||||
// For the bug of SPARK-6016, we are caching two outdated footers for df1. Then,
|
||||
// since the new table has four parquet files, we are trying to read new footers from two files
|
||||
|
|
|
@ -51,18 +51,21 @@ class BucketedReadSuite extends QueryTest with SQLTestUtils with TestHiveSinglet
|
|||
.saveAsTable("bucketed_table")
|
||||
|
||||
for (i <- 0 until 5) {
|
||||
val rdd = hiveContext.table("bucketed_table").filter($"i" === i).queryExecution.toRdd
|
||||
val table = hiveContext.table("bucketed_table").filter($"i" === i)
|
||||
val query = table.queryExecution
|
||||
val output = query.analyzed.output
|
||||
val rdd = query.toRdd
|
||||
|
||||
assert(rdd.partitions.length == 8)
|
||||
|
||||
val attrs = df.select("j", "k").schema.toAttributes
|
||||
val attrs = table.select("j", "k").queryExecution.analyzed.output
|
||||
val checkBucketId = rdd.mapPartitionsWithIndex((index, rows) => {
|
||||
val getBucketId = UnsafeProjection.create(
|
||||
HashPartitioning(attrs, 8).partitionIdExpression :: Nil,
|
||||
attrs)
|
||||
rows.map(row => getBucketId(row).getInt(0) == index)
|
||||
output)
|
||||
rows.map(row => getBucketId(row).getInt(0) -> index)
|
||||
})
|
||||
|
||||
assert(checkBucketId.collect().reduce(_ && _))
|
||||
checkBucketId.collect().foreach(r => assert(r._1 == r._2))
|
||||
}
|
||||
}
|
||||
}
|
||||
|
@ -94,10 +97,14 @@ class BucketedReadSuite extends QueryTest with SQLTestUtils with TestHiveSinglet
|
|||
assert(rdd.isDefined, plan)
|
||||
|
||||
val checkedResult = rdd.get.execute().mapPartitionsWithIndex { case (index, iter) =>
|
||||
if (matchedBuckets.get(index % numBuckets)) Iterator(true) else Iterator(iter.isEmpty)
|
||||
if (matchedBuckets.get(index % numBuckets) && iter.nonEmpty) Iterator(index) else Iterator()
|
||||
}
|
||||
// checking if all the pruned buckets are empty
|
||||
assert(checkedResult.collect().forall(_ == true))
|
||||
// TODO: These tests are not testing the right columns.
|
||||
// // checking if all the pruned buckets are empty
|
||||
// val invalidBuckets = checkedResult.collect().toList
|
||||
// if (invalidBuckets.nonEmpty) {
|
||||
// fail(s"Buckets $invalidBuckets should have been pruned from:\n$plan")
|
||||
// }
|
||||
|
||||
checkAnswer(
|
||||
bucketedDataFrame.filter(filterCondition).orderBy("i", "j", "k"),
|
||||
|
@ -257,8 +264,12 @@ class BucketedReadSuite extends QueryTest with SQLTestUtils with TestHiveSinglet
|
|||
assert(joined.queryExecution.executedPlan.isInstanceOf[SortMergeJoin])
|
||||
val joinOperator = joined.queryExecution.executedPlan.asInstanceOf[SortMergeJoin]
|
||||
|
||||
assert(joinOperator.left.find(_.isInstanceOf[ShuffleExchange]).isDefined == shuffleLeft)
|
||||
assert(joinOperator.right.find(_.isInstanceOf[ShuffleExchange]).isDefined == shuffleRight)
|
||||
assert(
|
||||
joinOperator.left.find(_.isInstanceOf[ShuffleExchange]).isDefined == shuffleLeft,
|
||||
s"expected shuffle in plan to be $shuffleLeft but found\n${joinOperator.left}")
|
||||
assert(
|
||||
joinOperator.right.find(_.isInstanceOf[ShuffleExchange]).isDefined == shuffleRight,
|
||||
s"expected shuffle in plan to be $shuffleRight but found\n${joinOperator.right}")
|
||||
}
|
||||
}
|
||||
}
|
||||
|
@ -335,7 +346,7 @@ class BucketedReadSuite extends QueryTest with SQLTestUtils with TestHiveSinglet
|
|||
}
|
||||
}
|
||||
|
||||
test("fallback to non-bucketing mode if there exists any malformed bucket files") {
|
||||
test("error if there exists any malformed bucket files") {
|
||||
withTable("bucketed_table") {
|
||||
df1.write.format("parquet").bucketBy(8, "i").saveAsTable("bucketed_table")
|
||||
val tableDir = new File(hiveContext.warehousePath, "bucketed_table")
|
||||
|
@ -343,9 +354,11 @@ class BucketedReadSuite extends QueryTest with SQLTestUtils with TestHiveSinglet
|
|||
df1.write.parquet(tableDir.getAbsolutePath)
|
||||
|
||||
val agged = hiveContext.table("bucketed_table").groupBy("i").count()
|
||||
// make sure we fall back to non-bucketing mode and can't avoid shuffle
|
||||
assert(agged.queryExecution.executedPlan.find(_.isInstanceOf[ShuffleExchange]).isDefined)
|
||||
checkAnswer(agged.sort("i"), df1.groupBy("i").count().sort("i"))
|
||||
val error = intercept[RuntimeException] {
|
||||
agged.count()
|
||||
}
|
||||
|
||||
assert(error.toString contains "Invalid bucket file")
|
||||
}
|
||||
}
|
||||
}
|
||||
|
|
|
@ -20,6 +20,7 @@ package org.apache.spark.sql.sources
|
|||
import java.io.File
|
||||
import java.net.URI
|
||||
|
||||
import org.apache.spark.SparkException
|
||||
import org.apache.spark.sql.{AnalysisException, QueryTest}
|
||||
import org.apache.spark.sql.catalyst.expressions.UnsafeProjection
|
||||
import org.apache.spark.sql.catalyst.plans.physical.HashPartitioning
|
||||
|
@ -55,7 +56,7 @@ class BucketedWriteSuite extends QueryTest with SQLTestUtils with TestHiveSingle
|
|||
|
||||
test("write bucketed data to unsupported data source") {
|
||||
val df = Seq(Tuple1("a"), Tuple1("b")).toDF("i")
|
||||
intercept[AnalysisException](df.write.bucketBy(3, "i").format("text").saveAsTable("tt"))
|
||||
intercept[SparkException](df.write.bucketBy(3, "i").format("text").saveAsTable("tt"))
|
||||
}
|
||||
|
||||
test("write bucketed data to non-hive-table or existing hive table") {
|
||||
|
|
|
@ -1,104 +0,0 @@
|
|||
/*
|
||||
* Licensed to the Apache Software Foundation (ASF) under one or more
|
||||
* contributor license agreements. See the NOTICE file distributed with
|
||||
* this work for additional information regarding copyright ownership.
|
||||
* The ASF licenses this file to You under the Apache License, Version 2.0
|
||||
* (the "License"); you may not use this file except in compliance with
|
||||
* the License. You may obtain a copy of the License at
|
||||
*
|
||||
* http://www.apache.org/licenses/LICENSE-2.0
|
||||
*
|
||||
* Unless required by applicable law or agreed to in writing, software
|
||||
* distributed under the License is distributed on an "AS IS" BASIS,
|
||||
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
* See the License for the specific language governing permissions and
|
||||
* limitations under the License.
|
||||
*/
|
||||
|
||||
package org.apache.spark.sql.sources
|
||||
|
||||
import org.apache.hadoop.fs.Path
|
||||
|
||||
import org.apache.spark.SparkException
|
||||
import org.apache.spark.deploy.SparkHadoopUtil
|
||||
import org.apache.spark.sql.functions._
|
||||
import org.apache.spark.sql.hive.test.TestHiveSingleton
|
||||
import org.apache.spark.sql.test.SQLTestUtils
|
||||
|
||||
class CommitFailureTestRelationSuite extends SQLTestUtils with TestHiveSingleton {
|
||||
|
||||
// When committing a task, `CommitFailureTestSource` throws an exception for testing purpose.
|
||||
val dataSourceName: String = classOf[CommitFailureTestSource].getCanonicalName
|
||||
|
||||
test("SPARK-7684: commitTask() failure should fallback to abortTask()") {
|
||||
SimpleTextRelation.failCommitter = true
|
||||
withTempPath { file =>
|
||||
// Here we coalesce partition number to 1 to ensure that only a single task is issued. This
|
||||
// prevents race condition happened when FileOutputCommitter tries to remove the `_temporary`
|
||||
// directory while committing/aborting the job. See SPARK-8513 for more details.
|
||||
val df = sqlContext.range(0, 10).coalesce(1)
|
||||
intercept[SparkException] {
|
||||
df.write.format(dataSourceName).save(file.getCanonicalPath)
|
||||
}
|
||||
|
||||
val fs = new Path(file.getCanonicalPath).getFileSystem(SparkHadoopUtil.get.conf)
|
||||
assert(!fs.exists(new Path(file.getCanonicalPath, "_temporary")))
|
||||
}
|
||||
}
|
||||
|
||||
test("call failure callbacks before close writer - default") {
|
||||
SimpleTextRelation.failCommitter = false
|
||||
withTempPath { file =>
|
||||
// fail the job in the middle of writing
|
||||
val divideByZero = udf((x: Int) => { x / (x - 1)})
|
||||
val df = sqlContext.range(0, 10).coalesce(1).select(divideByZero(col("id")))
|
||||
|
||||
SimpleTextRelation.callbackCalled = false
|
||||
intercept[SparkException] {
|
||||
df.write.format(dataSourceName).save(file.getCanonicalPath)
|
||||
}
|
||||
assert(SimpleTextRelation.callbackCalled, "failure callback should be called")
|
||||
|
||||
val fs = new Path(file.getCanonicalPath).getFileSystem(SparkHadoopUtil.get.conf)
|
||||
assert(!fs.exists(new Path(file.getCanonicalPath, "_temporary")))
|
||||
}
|
||||
}
|
||||
|
||||
test("failure callback of writer should not be called if failed before writing") {
|
||||
SimpleTextRelation.failCommitter = false
|
||||
withTempPath { file =>
|
||||
// fail the job in the middle of writing
|
||||
val divideByZero = udf((x: Int) => { x / (x - 1)})
|
||||
val df = sqlContext.range(0, 10).coalesce(1)
|
||||
.select(col("id").mod(2).as("key"), divideByZero(col("id")))
|
||||
|
||||
SimpleTextRelation.callbackCalled = false
|
||||
intercept[SparkException] {
|
||||
df.write.format(dataSourceName).partitionBy("key").save(file.getCanonicalPath)
|
||||
}
|
||||
assert(!SimpleTextRelation.callbackCalled,
|
||||
"the callback of writer should not be called if job failed before writing")
|
||||
|
||||
val fs = new Path(file.getCanonicalPath).getFileSystem(SparkHadoopUtil.get.conf)
|
||||
assert(!fs.exists(new Path(file.getCanonicalPath, "_temporary")))
|
||||
}
|
||||
}
|
||||
|
||||
test("call failure callbacks before close writer - partitioned") {
|
||||
SimpleTextRelation.failCommitter = false
|
||||
withTempPath { file =>
|
||||
// fail the job in the middle of writing
|
||||
val df = sqlContext.range(0, 10).coalesce(1).select(col("id").mod(2).as("key"), col("id"))
|
||||
|
||||
SimpleTextRelation.callbackCalled = false
|
||||
SimpleTextRelation.failWriter = true
|
||||
intercept[SparkException] {
|
||||
df.write.format(dataSourceName).partitionBy("key").save(file.getCanonicalPath)
|
||||
}
|
||||
assert(SimpleTextRelation.callbackCalled, "failure callback should be called")
|
||||
|
||||
val fs = new Path(file.getCanonicalPath).getFileSystem(SparkHadoopUtil.get.conf)
|
||||
assert(!fs.exists(new Path(file.getCanonicalPath, "_temporary")))
|
||||
}
|
||||
}
|
||||
}
|
|
@ -1,382 +0,0 @@
|
|||
/*
|
||||
* Licensed to the Apache Software Foundation (ASF) under one or more
|
||||
* contributor license agreements. See the NOTICE file distributed with
|
||||
* this work for additional information regarding copyright ownership.
|
||||
* The ASF licenses this file to You under the Apache License, Version 2.0
|
||||
* (the "License"); you may not use this file except in compliance with
|
||||
* the License. You may obtain a copy of the License at
|
||||
*
|
||||
* http://www.apache.org/licenses/LICENSE-2.0
|
||||
*
|
||||
* Unless required by applicable law or agreed to in writing, software
|
||||
* distributed under the License is distributed on an "AS IS" BASIS,
|
||||
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
* See the License for the specific language governing permissions and
|
||||
* limitations under the License.
|
||||
*/
|
||||
|
||||
package org.apache.spark.sql.sources
|
||||
|
||||
import java.io.File
|
||||
|
||||
import org.apache.hadoop.fs.Path
|
||||
|
||||
import org.apache.spark.deploy.SparkHadoopUtil
|
||||
import org.apache.spark.sql.{execution, Column, DataFrame, Row}
|
||||
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
|
||||
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression, PredicateHelper}
|
||||
import org.apache.spark.sql.execution.{LogicalRDD, PhysicalRDD}
|
||||
import org.apache.spark.sql.functions._
|
||||
import org.apache.spark.sql.types._
|
||||
import org.apache.spark.util.Utils
|
||||
|
||||
class SimpleTextHadoopFsRelationSuite extends HadoopFsRelationTest with PredicateHelper {
|
||||
import testImplicits._
|
||||
|
||||
override val dataSourceName: String = classOf[SimpleTextSource].getCanonicalName
|
||||
|
||||
// We have a very limited number of supported types at here since it is just for a
|
||||
// test relation and we do very basic testing at here.
|
||||
override protected def supportsDataType(dataType: DataType): Boolean = dataType match {
|
||||
case _: BinaryType => false
|
||||
// We are using random data generator and the generated strings are not really valid string.
|
||||
case _: StringType => false
|
||||
case _: BooleanType => false // see https://issues.apache.org/jira/browse/SPARK-10442
|
||||
case _: CalendarIntervalType => false
|
||||
case _: DateType => false
|
||||
case _: TimestampType => false
|
||||
case _: ArrayType => false
|
||||
case _: MapType => false
|
||||
case _: StructType => false
|
||||
case _: UserDefinedType[_] => false
|
||||
case _ => true
|
||||
}
|
||||
|
||||
test("save()/load() - partitioned table - simple queries - partition columns in data") {
|
||||
withTempDir { file =>
|
||||
val basePath = new Path(file.getCanonicalPath)
|
||||
val fs = basePath.getFileSystem(SparkHadoopUtil.get.conf)
|
||||
val qualifiedBasePath = fs.makeQualified(basePath)
|
||||
|
||||
for (p1 <- 1 to 2; p2 <- Seq("foo", "bar")) {
|
||||
val partitionDir = new Path(qualifiedBasePath, s"p1=$p1/p2=$p2")
|
||||
sparkContext
|
||||
.parallelize(for (i <- 1 to 3) yield s"$i,val_$i,$p1")
|
||||
.saveAsTextFile(partitionDir.toString)
|
||||
}
|
||||
|
||||
val dataSchemaWithPartition =
|
||||
StructType(dataSchema.fields :+ StructField("p1", IntegerType, nullable = true))
|
||||
|
||||
checkQueries(
|
||||
hiveContext.read.format(dataSourceName)
|
||||
.option("dataSchema", dataSchemaWithPartition.json)
|
||||
.load(file.getCanonicalPath))
|
||||
}
|
||||
}
|
||||
|
||||
private var tempPath: File = _
|
||||
|
||||
private var partitionedDF: DataFrame = _
|
||||
|
||||
private val partitionedDataSchema: StructType =
|
||||
new StructType()
|
||||
.add("a", IntegerType)
|
||||
.add("b", IntegerType)
|
||||
.add("c", StringType)
|
||||
|
||||
protected override def beforeAll(): Unit = {
|
||||
this.tempPath = Utils.createTempDir()
|
||||
|
||||
val df = sqlContext.range(10).select(
|
||||
'id cast IntegerType as 'a,
|
||||
('id cast IntegerType) * 2 as 'b,
|
||||
concat(lit("val_"), 'id) as 'c
|
||||
)
|
||||
|
||||
partitionedWriter(df).save(s"${tempPath.getCanonicalPath}/p=0")
|
||||
partitionedWriter(df).save(s"${tempPath.getCanonicalPath}/p=1")
|
||||
|
||||
partitionedDF = partitionedReader.load(tempPath.getCanonicalPath)
|
||||
}
|
||||
|
||||
override protected def afterAll(): Unit = {
|
||||
Utils.deleteRecursively(tempPath)
|
||||
}
|
||||
|
||||
private def partitionedWriter(df: DataFrame) =
|
||||
df.write.option("dataSchema", partitionedDataSchema.json).format(dataSourceName)
|
||||
|
||||
private def partitionedReader =
|
||||
sqlContext.read.option("dataSchema", partitionedDataSchema.json).format(dataSourceName)
|
||||
|
||||
/**
|
||||
* Constructs test cases that test column pruning and filter push-down.
|
||||
*
|
||||
* For filter push-down, the following filters are not pushed-down.
|
||||
*
|
||||
* 1. Partitioning filters don't participate filter push-down, they are handled separately in
|
||||
* `DataSourceStrategy`
|
||||
*
|
||||
* 2. Catalyst filter `Expression`s that cannot be converted to data source `Filter`s are not
|
||||
* pushed down (e.g. UDF and filters referencing multiple columns).
|
||||
*
|
||||
* 3. Catalyst filter `Expression`s that can be converted to data source `Filter`s but cannot be
|
||||
* handled by the underlying data source are not pushed down (e.g. returned from
|
||||
* `BaseRelation.unhandledFilters()`).
|
||||
*
|
||||
* Note that for [[SimpleTextRelation]], all data source [[Filter]]s other than [[GreaterThan]]
|
||||
* are unhandled. We made this assumption in [[SimpleTextRelation.unhandledFilters()]] only
|
||||
* for testing purposes.
|
||||
*
|
||||
* @param projections Projection list of the query
|
||||
* @param filter Filter condition of the query
|
||||
* @param requiredColumns Expected names of required columns
|
||||
* @param pushedFilters Expected data source [[Filter]]s that are pushed down
|
||||
* @param inconvertibleFilters Expected Catalyst filter [[Expression]]s that cannot be converted
|
||||
* to data source [[Filter]]s
|
||||
* @param unhandledFilters Expected Catalyst flter [[Expression]]s that can be converted to data
|
||||
* source [[Filter]]s but cannot be handled by the data source relation
|
||||
* @param partitioningFilters Expected Catalyst filter [[Expression]]s that reference partition
|
||||
* columns
|
||||
* @param expectedRawScanAnswer Expected query result of the raw table scan returned by the data
|
||||
* source relation
|
||||
* @param expectedAnswer Expected query result of the full query
|
||||
*/
|
||||
def testPruningAndFiltering(
|
||||
projections: Seq[Column],
|
||||
filter: Column,
|
||||
requiredColumns: Seq[String],
|
||||
pushedFilters: Seq[Filter],
|
||||
inconvertibleFilters: Seq[Column],
|
||||
unhandledFilters: Seq[Column],
|
||||
partitioningFilters: Seq[Column])(
|
||||
expectedRawScanAnswer: => Seq[Row])(
|
||||
expectedAnswer: => Seq[Row]): Unit = {
|
||||
test(s"pruning and filtering: df.select(${projections.mkString(", ")}).where($filter)") {
|
||||
val df = partitionedDF.where(filter).select(projections: _*)
|
||||
val queryExecution = df.queryExecution
|
||||
val sparkPlan = queryExecution.sparkPlan
|
||||
|
||||
val rawScan = sparkPlan.collect {
|
||||
case p: PhysicalRDD => p
|
||||
} match {
|
||||
case Seq(scan) => scan
|
||||
case _ => fail(s"More than one PhysicalRDD found\n$queryExecution")
|
||||
}
|
||||
|
||||
markup("Checking raw scan answer")
|
||||
checkAnswer(
|
||||
DataFrame(sqlContext, LogicalRDD(rawScan.output, rawScan.rdd)(sqlContext)),
|
||||
expectedRawScanAnswer)
|
||||
|
||||
markup("Checking full query answer")
|
||||
checkAnswer(df, expectedAnswer)
|
||||
|
||||
markup("Checking required columns")
|
||||
assert(requiredColumns === SimpleTextRelation.requiredColumns)
|
||||
|
||||
val nonPushedFilters = {
|
||||
val boundFilters = sparkPlan.collect {
|
||||
case f: execution.Filter => f
|
||||
} match {
|
||||
case Nil => Nil
|
||||
case Seq(f) => splitConjunctivePredicates(f.condition)
|
||||
case _ => fail(s"More than one PhysicalRDD found\n$queryExecution")
|
||||
}
|
||||
|
||||
// Unbound these bound filters so that we can easily compare them with expected results.
|
||||
boundFilters.map {
|
||||
_.transform { case a: AttributeReference => UnresolvedAttribute(a.name) }
|
||||
}.toSet
|
||||
}
|
||||
|
||||
markup("Checking pushed filters")
|
||||
assert(pushedFilters.toSet.subsetOf(SimpleTextRelation.pushedFilters))
|
||||
|
||||
val expectedInconvertibleFilters = inconvertibleFilters.map(_.expr).toSet
|
||||
val expectedUnhandledFilters = unhandledFilters.map(_.expr).toSet
|
||||
val expectedPartitioningFilters = partitioningFilters.map(_.expr).toSet
|
||||
|
||||
markup("Checking unhandled and inconvertible filters")
|
||||
assert((expectedInconvertibleFilters ++ expectedUnhandledFilters).subsetOf(nonPushedFilters))
|
||||
|
||||
markup("Checking partitioning filters")
|
||||
val actualPartitioningFilters = splitConjunctivePredicates(filter.expr).filter {
|
||||
_.references.contains(UnresolvedAttribute("p"))
|
||||
}.toSet
|
||||
|
||||
// Partitioning filters are handled separately and don't participate filter push-down. So they
|
||||
// shouldn't be part of non-pushed filters.
|
||||
assert(expectedPartitioningFilters.intersect(nonPushedFilters).isEmpty)
|
||||
assert(expectedPartitioningFilters === actualPartitioningFilters)
|
||||
}
|
||||
}
|
||||
|
||||
testPruningAndFiltering(
|
||||
projections = Seq('*),
|
||||
filter = 'p > 0,
|
||||
requiredColumns = Seq("a", "b", "c"),
|
||||
pushedFilters = Nil,
|
||||
inconvertibleFilters = Nil,
|
||||
unhandledFilters = Nil,
|
||||
partitioningFilters = Seq('p > 0)
|
||||
) {
|
||||
Seq(
|
||||
Row(0, 0, "val_0", 1),
|
||||
Row(1, 2, "val_1", 1),
|
||||
Row(2, 4, "val_2", 1),
|
||||
Row(3, 6, "val_3", 1),
|
||||
Row(4, 8, "val_4", 1),
|
||||
Row(5, 10, "val_5", 1),
|
||||
Row(6, 12, "val_6", 1),
|
||||
Row(7, 14, "val_7", 1),
|
||||
Row(8, 16, "val_8", 1),
|
||||
Row(9, 18, "val_9", 1))
|
||||
} {
|
||||
Seq(
|
||||
Row(0, 0, "val_0", 1),
|
||||
Row(1, 2, "val_1", 1),
|
||||
Row(2, 4, "val_2", 1),
|
||||
Row(3, 6, "val_3", 1),
|
||||
Row(4, 8, "val_4", 1),
|
||||
Row(5, 10, "val_5", 1),
|
||||
Row(6, 12, "val_6", 1),
|
||||
Row(7, 14, "val_7", 1),
|
||||
Row(8, 16, "val_8", 1),
|
||||
Row(9, 18, "val_9", 1))
|
||||
}
|
||||
|
||||
testPruningAndFiltering(
|
||||
projections = Seq('c, 'p),
|
||||
filter = 'a < 3 && 'p > 0,
|
||||
requiredColumns = Seq("c", "a"),
|
||||
pushedFilters = Seq(LessThan("a", 3)),
|
||||
inconvertibleFilters = Nil,
|
||||
unhandledFilters = Seq('a < 3),
|
||||
partitioningFilters = Seq('p > 0)
|
||||
) {
|
||||
Seq(
|
||||
Row("val_0", 1, 0),
|
||||
Row("val_1", 1, 1),
|
||||
Row("val_2", 1, 2),
|
||||
Row("val_3", 1, 3),
|
||||
Row("val_4", 1, 4),
|
||||
Row("val_5", 1, 5),
|
||||
Row("val_6", 1, 6),
|
||||
Row("val_7", 1, 7),
|
||||
Row("val_8", 1, 8),
|
||||
Row("val_9", 1, 9))
|
||||
} {
|
||||
Seq(
|
||||
Row("val_0", 1),
|
||||
Row("val_1", 1),
|
||||
Row("val_2", 1))
|
||||
}
|
||||
|
||||
testPruningAndFiltering(
|
||||
projections = Seq('*),
|
||||
filter = 'a > 8,
|
||||
requiredColumns = Seq("a", "b", "c"),
|
||||
pushedFilters = Seq(GreaterThan("a", 8)),
|
||||
inconvertibleFilters = Nil,
|
||||
unhandledFilters = Nil,
|
||||
partitioningFilters = Nil
|
||||
) {
|
||||
Seq(
|
||||
Row(9, 18, "val_9", 0),
|
||||
Row(9, 18, "val_9", 1))
|
||||
} {
|
||||
Seq(
|
||||
Row(9, 18, "val_9", 0),
|
||||
Row(9, 18, "val_9", 1))
|
||||
}
|
||||
|
||||
testPruningAndFiltering(
|
||||
projections = Seq('b, 'p),
|
||||
filter = 'a > 8,
|
||||
requiredColumns = Seq("b"),
|
||||
pushedFilters = Seq(GreaterThan("a", 8)),
|
||||
inconvertibleFilters = Nil,
|
||||
unhandledFilters = Nil,
|
||||
partitioningFilters = Nil
|
||||
) {
|
||||
Seq(
|
||||
Row(18, 0),
|
||||
Row(18, 1))
|
||||
} {
|
||||
Seq(
|
||||
Row(18, 0),
|
||||
Row(18, 1))
|
||||
}
|
||||
|
||||
testPruningAndFiltering(
|
||||
projections = Seq('b, 'p),
|
||||
filter = 'a > 8 && 'p > 0,
|
||||
requiredColumns = Seq("b"),
|
||||
pushedFilters = Seq(GreaterThan("a", 8)),
|
||||
inconvertibleFilters = Nil,
|
||||
unhandledFilters = Nil,
|
||||
partitioningFilters = Seq('p > 0)
|
||||
) {
|
||||
Seq(
|
||||
Row(18, 1))
|
||||
} {
|
||||
Seq(
|
||||
Row(18, 1))
|
||||
}
|
||||
|
||||
testPruningAndFiltering(
|
||||
projections = Seq('b, 'p),
|
||||
filter = 'c > "val_7" && 'b < 18 && 'p > 0,
|
||||
requiredColumns = Seq("b"),
|
||||
pushedFilters = Seq(GreaterThan("c", "val_7"), LessThan("b", 18)),
|
||||
inconvertibleFilters = Nil,
|
||||
unhandledFilters = Seq('b < 18),
|
||||
partitioningFilters = Seq('p > 0)
|
||||
) {
|
||||
Seq(
|
||||
Row(16, 1),
|
||||
Row(18, 1))
|
||||
} {
|
||||
Seq(
|
||||
Row(16, 1))
|
||||
}
|
||||
|
||||
testPruningAndFiltering(
|
||||
projections = Seq('b, 'p),
|
||||
filter = 'a % 2 === 0 && 'c > "val_7" && 'b < 18 && 'p > 0,
|
||||
requiredColumns = Seq("b", "a"),
|
||||
pushedFilters = Seq(GreaterThan("c", "val_7"), LessThan("b", 18)),
|
||||
inconvertibleFilters = Seq('a % 2 === 0),
|
||||
unhandledFilters = Seq('b < 18),
|
||||
partitioningFilters = Seq('p > 0)
|
||||
) {
|
||||
Seq(
|
||||
Row(16, 1, 8),
|
||||
Row(18, 1, 9))
|
||||
} {
|
||||
Seq(
|
||||
Row(16, 1))
|
||||
}
|
||||
|
||||
testPruningAndFiltering(
|
||||
projections = Seq('b, 'p),
|
||||
filter = 'a > 7 && 'a < 9,
|
||||
requiredColumns = Seq("b", "a"),
|
||||
pushedFilters = Seq(GreaterThan("a", 7), LessThan("a", 9)),
|
||||
inconvertibleFilters = Nil,
|
||||
unhandledFilters = Seq('a < 9),
|
||||
partitioningFilters = Nil
|
||||
) {
|
||||
Seq(
|
||||
Row(16, 0, 8),
|
||||
Row(16, 1, 8),
|
||||
Row(18, 0, 9),
|
||||
Row(18, 1, 9))
|
||||
} {
|
||||
Seq(
|
||||
Row(16, 0),
|
||||
Row(16, 1))
|
||||
}
|
||||
}
|
|
@ -1,271 +0,0 @@
|
|||
/*
|
||||
* Licensed to the Apache Software Foundation (ASF) under one or more
|
||||
* contributor license agreements. See the NOTICE file distributed with
|
||||
* this work for additional information regarding copyright ownership.
|
||||
* The ASF licenses this file to You under the Apache License, Version 2.0
|
||||
* (the "License"); you may not use this file except in compliance with
|
||||
* the License. You may obtain a copy of the License at
|
||||
*
|
||||
* http://www.apache.org/licenses/LICENSE-2.0
|
||||
*
|
||||
* Unless required by applicable law or agreed to in writing, software
|
||||
* distributed under the License is distributed on an "AS IS" BASIS,
|
||||
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
* See the License for the specific language governing permissions and
|
||||
* limitations under the License.
|
||||
*/
|
||||
|
||||
package org.apache.spark.sql.sources
|
||||
|
||||
import java.text.NumberFormat
|
||||
|
||||
import com.google.common.base.Objects
|
||||
import org.apache.hadoop.fs.{FileStatus, Path}
|
||||
import org.apache.hadoop.io.{NullWritable, Text}
|
||||
import org.apache.hadoop.mapreduce.{Job, RecordWriter, TaskAttemptContext}
|
||||
import org.apache.hadoop.mapreduce.lib.output.{FileOutputFormat, TextOutputFormat}
|
||||
|
||||
import org.apache.spark.TaskContext
|
||||
import org.apache.spark.rdd.RDD
|
||||
import org.apache.spark.sql.{sources, Row, SQLContext}
|
||||
import org.apache.spark.sql.catalyst.{expressions, CatalystTypeConverters}
|
||||
import org.apache.spark.sql.catalyst.expressions._
|
||||
import org.apache.spark.sql.types.{DataType, StructType}
|
||||
|
||||
/**
|
||||
* A simple example [[HadoopFsRelationProvider]].
|
||||
*/
|
||||
class SimpleTextSource extends HadoopFsRelationProvider {
|
||||
override def createRelation(
|
||||
sqlContext: SQLContext,
|
||||
paths: Array[String],
|
||||
schema: Option[StructType],
|
||||
partitionColumns: Option[StructType],
|
||||
parameters: Map[String, String]): HadoopFsRelation = {
|
||||
new SimpleTextRelation(paths, schema, partitionColumns, parameters)(sqlContext)
|
||||
}
|
||||
}
|
||||
|
||||
class AppendingTextOutputFormat(outputFile: Path) extends TextOutputFormat[NullWritable, Text] {
|
||||
val numberFormat = NumberFormat.getInstance()
|
||||
|
||||
numberFormat.setMinimumIntegerDigits(5)
|
||||
numberFormat.setGroupingUsed(false)
|
||||
|
||||
override def getDefaultWorkFile(context: TaskAttemptContext, extension: String): Path = {
|
||||
val configuration = context.getConfiguration
|
||||
val uniqueWriteJobId = configuration.get("spark.sql.sources.writeJobUUID")
|
||||
val taskAttemptId = context.getTaskAttemptID
|
||||
val split = taskAttemptId.getTaskID.getId
|
||||
val name = FileOutputFormat.getOutputName(context)
|
||||
new Path(outputFile, s"$name-${numberFormat.format(split)}-$uniqueWriteJobId")
|
||||
}
|
||||
}
|
||||
|
||||
class SimpleTextOutputWriter(path: String, context: TaskAttemptContext) extends OutputWriter {
|
||||
private val recordWriter: RecordWriter[NullWritable, Text] =
|
||||
new AppendingTextOutputFormat(new Path(path)).getRecordWriter(context)
|
||||
|
||||
override def write(row: Row): Unit = {
|
||||
val serialized = row.toSeq.map { v =>
|
||||
if (v == null) "" else v.toString
|
||||
}.mkString(",")
|
||||
recordWriter.write(null, new Text(serialized))
|
||||
}
|
||||
|
||||
override def close(): Unit = {
|
||||
recordWriter.close(context)
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* A simple example [[HadoopFsRelation]], used for testing purposes. Data are stored as comma
|
||||
* separated string lines. When scanning data, schema must be explicitly provided via data source
|
||||
* option `"dataSchema"`.
|
||||
*/
|
||||
class SimpleTextRelation(
|
||||
override val paths: Array[String],
|
||||
val maybeDataSchema: Option[StructType],
|
||||
override val userDefinedPartitionColumns: Option[StructType],
|
||||
parameters: Map[String, String])(
|
||||
@transient val sqlContext: SQLContext)
|
||||
extends HadoopFsRelation(parameters) {
|
||||
|
||||
import sqlContext.sparkContext
|
||||
|
||||
override val dataSchema: StructType =
|
||||
maybeDataSchema.getOrElse(DataType.fromJson(parameters("dataSchema")).asInstanceOf[StructType])
|
||||
|
||||
override def equals(other: Any): Boolean = other match {
|
||||
case that: SimpleTextRelation =>
|
||||
this.paths.sameElements(that.paths) &&
|
||||
this.maybeDataSchema == that.maybeDataSchema &&
|
||||
this.dataSchema == that.dataSchema &&
|
||||
this.partitionColumns == that.partitionColumns
|
||||
|
||||
case _ => false
|
||||
}
|
||||
|
||||
override def hashCode(): Int =
|
||||
Objects.hashCode(paths, maybeDataSchema, dataSchema, partitionColumns)
|
||||
|
||||
override def buildScan(inputStatuses: Array[FileStatus]): RDD[Row] = {
|
||||
val fields = dataSchema.map(_.dataType)
|
||||
|
||||
sparkContext.textFile(inputStatuses.map(_.getPath).mkString(",")).map { record =>
|
||||
Row(record.split(",", -1).zip(fields).map { case (v, dataType) =>
|
||||
val value = if (v == "") null else v
|
||||
// `Cast`ed values are always of Catalyst types (i.e. UTF8String instead of String, etc.)
|
||||
val catalystValue = Cast(Literal(value), dataType).eval()
|
||||
// Here we're converting Catalyst values to Scala values to test `needsConversion`
|
||||
CatalystTypeConverters.convertToScala(catalystValue, dataType)
|
||||
}: _*)
|
||||
}
|
||||
}
|
||||
|
||||
override def buildScan(
|
||||
requiredColumns: Array[String],
|
||||
filters: Array[Filter],
|
||||
inputFiles: Array[FileStatus]): RDD[Row] = {
|
||||
|
||||
SimpleTextRelation.requiredColumns = requiredColumns
|
||||
SimpleTextRelation.pushedFilters = filters.toSet
|
||||
|
||||
val fields = this.dataSchema.map(_.dataType)
|
||||
val inputAttributes = this.dataSchema.toAttributes
|
||||
val outputAttributes = requiredColumns.flatMap(name => inputAttributes.find(_.name == name))
|
||||
val dataSchema = this.dataSchema
|
||||
|
||||
val inputPaths = inputFiles.map(_.getPath).mkString(",")
|
||||
sparkContext.textFile(inputPaths).mapPartitions { iterator =>
|
||||
// Constructs a filter predicate to simulate filter push-down
|
||||
val predicate = {
|
||||
val filterCondition: Expression = filters.collect {
|
||||
// According to `unhandledFilters`, `SimpleTextRelation` only handles `GreaterThan` and
|
||||
// `isNotNull` filters
|
||||
case sources.GreaterThan(column, value) =>
|
||||
val dataType = dataSchema(column).dataType
|
||||
val literal = Literal.create(value, dataType)
|
||||
val attribute = inputAttributes.find(_.name == column).get
|
||||
expressions.GreaterThan(attribute, literal)
|
||||
case sources.IsNotNull(column) =>
|
||||
val dataType = dataSchema(column).dataType
|
||||
val attribute = inputAttributes.find(_.name == column).get
|
||||
expressions.IsNotNull(attribute)
|
||||
}.reduceOption(expressions.And).getOrElse(Literal(true))
|
||||
InterpretedPredicate.create(filterCondition, inputAttributes)
|
||||
}
|
||||
|
||||
// Uses a simple projection to simulate column pruning
|
||||
val projection = new InterpretedMutableProjection(outputAttributes, inputAttributes)
|
||||
val toScala = {
|
||||
val requiredSchema = StructType.fromAttributes(outputAttributes)
|
||||
CatalystTypeConverters.createToScalaConverter(requiredSchema)
|
||||
}
|
||||
|
||||
iterator.map { record =>
|
||||
new GenericInternalRow(record.split(",", -1).zip(fields).map {
|
||||
case (v, dataType) =>
|
||||
val value = if (v == "") null else v
|
||||
// `Cast`ed values are always of internal types (e.g. UTF8String instead of String)
|
||||
Cast(Literal(value), dataType).eval()
|
||||
})
|
||||
}.filter { row =>
|
||||
predicate(row)
|
||||
}.map { row =>
|
||||
toScala(projection(row)).asInstanceOf[Row]
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
override def prepareJobForWrite(job: Job): OutputWriterFactory = new OutputWriterFactory {
|
||||
job.setOutputFormatClass(classOf[TextOutputFormat[_, _]])
|
||||
|
||||
override def newInstance(
|
||||
path: String,
|
||||
dataSchema: StructType,
|
||||
context: TaskAttemptContext): OutputWriter = {
|
||||
new SimpleTextOutputWriter(path, context)
|
||||
}
|
||||
}
|
||||
|
||||
// `SimpleTextRelation` only handles `GreaterThan` and `IsNotNull` filters. This is used to test
|
||||
// filter push-down and `BaseRelation.unhandledFilters()`.
|
||||
override def unhandledFilters(filters: Array[Filter]): Array[Filter] = {
|
||||
filters.filter {
|
||||
case _: GreaterThan => false
|
||||
case _: IsNotNull => false
|
||||
case _ => true
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
object SimpleTextRelation {
|
||||
// Used to test column pruning
|
||||
var requiredColumns: Seq[String] = Nil
|
||||
|
||||
// Used to test filter push-down
|
||||
var pushedFilters: Set[Filter] = Set.empty
|
||||
|
||||
// Used to test failed committer
|
||||
var failCommitter = false
|
||||
|
||||
// Used to test failed writer
|
||||
var failWriter = false
|
||||
|
||||
// Used to test failure callback
|
||||
var callbackCalled = false
|
||||
}
|
||||
|
||||
/**
|
||||
* A simple example [[HadoopFsRelationProvider]].
|
||||
*/
|
||||
class CommitFailureTestSource extends HadoopFsRelationProvider {
|
||||
override def createRelation(
|
||||
sqlContext: SQLContext,
|
||||
paths: Array[String],
|
||||
schema: Option[StructType],
|
||||
partitionColumns: Option[StructType],
|
||||
parameters: Map[String, String]): HadoopFsRelation = {
|
||||
new CommitFailureTestRelation(paths, schema, partitionColumns, parameters)(sqlContext)
|
||||
}
|
||||
}
|
||||
|
||||
class CommitFailureTestRelation(
|
||||
override val paths: Array[String],
|
||||
maybeDataSchema: Option[StructType],
|
||||
override val userDefinedPartitionColumns: Option[StructType],
|
||||
parameters: Map[String, String])(
|
||||
@transient sqlContext: SQLContext)
|
||||
extends SimpleTextRelation(
|
||||
paths, maybeDataSchema, userDefinedPartitionColumns, parameters)(sqlContext) {
|
||||
override def prepareJobForWrite(job: Job): OutputWriterFactory = new OutputWriterFactory {
|
||||
override def newInstance(
|
||||
path: String,
|
||||
dataSchema: StructType,
|
||||
context: TaskAttemptContext): OutputWriter = {
|
||||
new SimpleTextOutputWriter(path, context) {
|
||||
var failed = false
|
||||
TaskContext.get().addTaskFailureListener { (t: TaskContext, e: Throwable) =>
|
||||
failed = true
|
||||
SimpleTextRelation.callbackCalled = true
|
||||
}
|
||||
|
||||
override def write(row: Row): Unit = {
|
||||
if (SimpleTextRelation.failWriter) {
|
||||
sys.error("Intentional task writer failure for testing purpose.")
|
||||
|
||||
}
|
||||
super.write(row)
|
||||
}
|
||||
|
||||
override def close(): Unit = {
|
||||
if (SimpleTextRelation.failCommitter) {
|
||||
sys.error("Intentional task commitment failure for testing purpose.")
|
||||
}
|
||||
super.close()
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
|
@ -503,7 +503,7 @@ abstract class HadoopFsRelationTest extends QueryTest with SQLTestUtils with Tes
|
|||
|
||||
val actualPaths = df.queryExecution.analyzed.collectFirst {
|
||||
case LogicalRelation(relation: HadoopFsRelation, _, _) =>
|
||||
relation.paths.toSet
|
||||
relation.location.paths.map(_.toString).toSet
|
||||
}.getOrElse {
|
||||
fail("Expect an FSBasedRelation, but none could be found")
|
||||
}
|
||||
|
@ -560,7 +560,7 @@ abstract class HadoopFsRelationTest extends QueryTest with SQLTestUtils with Tes
|
|||
.saveAsTable("t")
|
||||
|
||||
withTable("t") {
|
||||
checkAnswer(sqlContext.table("t"), df.select('b, 'c, 'a).collect())
|
||||
checkAnswer(sqlContext.table("t").select('b, 'c, 'a), df.select('b, 'c, 'a).collect())
|
||||
}
|
||||
}
|
||||
|
||||
|
|
Loading…
Reference in a new issue