spark-instrumented-optimizer

Adrian Ionescu 95ad960caf [SPARK-21669] Internal API for collecting metrics/stats during FileFormatWriter jobs	2017-08-10 12:37:10 -07:00

## What changes were proposed in this pull request?

This patch introduces an internal interface for tracking metrics and/or statistics on data on the fly, as it is being written to disk during a `FileFormatWriter` job, and partially reimplements SPARK-20703 in terms of it.

The interface consists of 3 traits:

- `WriteTaskStats`: a tag for classes that represent statistics collected during a `WriteTask`. The only constraint it adds is that the class must be `Serializable`, because instances of it are collected on the driver from all executors at the end of the `WriteJob`.
- `WriteTaskStatsTracker`: a trait for classes that compute statistics from the tuples processed by a given `WriteTask` and eventually produce a `WriteTaskStats` instance.
- `WriteJobStatsTracker`: a trait for classes that act as containers of the `Serializable` state needed to instantiate a `WriteTaskStatsTracker` on executors, and that finally process the resulting collection of `WriteTaskStats` once they are gathered back on the driver.

A potential future use of this interface is, e.g., CBO stats maintenance during `INSERT INTO table ...` operations.

## How was this patch tested?

Existing tests for SPARK-20703 exercise the new code: `hive/SQLMetricsSuite`, `sql/JavaDataFrameReaderWriterSuite`, etc.

Author: Adrian Ionescu <adrian@databricks.com>

Closes #18884 from adrian-ionescu/write-stats-tracker-api.
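The division of labour between the three traits can be illustrated with a small mock. This is only a sketch in Python, not the Scala code from the patch: the method names (`new_row`, `get_final_stats`, `new_task_instance`, `process_stats`), the `RowCount*` classes, and the `run_job` driver loop are all assumptions made for illustration, and serialization between executors and the driver is not simulated.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterable, List


class WriteTaskStats(ABC):
    """Tag type for per-task statistics (must be serializable in the real API)."""


class WriteTaskStatsTracker(ABC):
    """Lives in one write task; observes rows and yields one stats object."""

    @abstractmethod
    def new_row(self, row: tuple) -> None: ...

    @abstractmethod
    def get_final_stats(self) -> WriteTaskStats: ...


class WriteJobStatsTracker(ABC):
    """Lives on the driver; creates task trackers and consumes their results."""

    @abstractmethod
    def new_task_instance(self) -> WriteTaskStatsTracker: ...

    @abstractmethod
    def process_stats(self, stats: List[WriteTaskStats]) -> None: ...


# Hypothetical concrete implementation: count the rows written per task.
@dataclass
class RowCountStats(WriteTaskStats):
    num_rows: int


class RowCountTaskTracker(WriteTaskStatsTracker):
    def __init__(self) -> None:
        self._rows = 0

    def new_row(self, row: tuple) -> None:
        self._rows += 1

    def get_final_stats(self) -> RowCountStats:
        return RowCountStats(self._rows)


class RowCountJobTracker(WriteJobStatsTracker):
    def __init__(self) -> None:
        self.total_rows = 0

    def new_task_instance(self) -> RowCountTaskTracker:
        return RowCountTaskTracker()

    def process_stats(self, stats: List[WriteTaskStats]) -> None:
        # Aggregate the per-task results once they are back on the driver.
        self.total_rows = sum(s.num_rows for s in stats)


def run_job(job_tracker: WriteJobStatsTracker,
            partitions: Iterable[Iterable[tuple]]) -> List[WriteTaskStats]:
    """Stand-in for a write job: one task (and one task tracker) per partition."""
    collected = []
    for rows in partitions:
        task_tracker = job_tracker.new_task_instance()
        for row in rows:
            task_tracker.new_row(row)
        collected.append(task_tracker.get_final_stats())
    job_tracker.process_stats(collected)
    return collected
```

In this mock, calling `run_job(RowCountJobTracker(), partitions)` creates one task tracker per partition and hands all per-task results to the job tracker in a single `process_stats` call, which mirrors the executor-to-driver flow the commit message describes.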
compatibility/src/test/scala/org/apache/spark/sql/hive/execution	[SPARK-20126][SQL] Remove HiveSessionState	2017-03-28 23:14:31 +08:00
src	[SPARK-21669] Internal API for collecting metrics/stats during FileFormatWriter jobs	2017-08-10 12:37:10 -07:00
pom.xml	[MINOR][BUILD] Remove duplicate test-jar:test spark-sql dependency from Hive module	2017-08-06 16:48:49 -07:00