spark-instrumented-optimizer

History

Takeshi Yamamuro 20ca208bcd [SPARK-23880][SQL] Do not trigger any jobs for caching data ## What changes were proposed in this pull request? This pr fixed code so that `cache` could prevent any jobs from being triggered. For example, in the current master, an operation below triggers a actual job; ``` val df = spark.range(10000000000L) .filter('id > 1000) .orderBy('id.desc) .cache() ``` This triggers a job while the cache should be lazy. The problem is that, when creating `InMemoryRelation`, we build the RDD, which calls `SparkPlan.execute` and may trigger jobs, like sampling job for range partitioner, or broadcast job. This pr removed the code to build a cached `RDD` in the constructor of `InMemoryRelation` and added `CachedRDDBuilder` to lazily build the `RDD` in `InMemoryRelation`. Then, the first call of `CachedRDDBuilder.cachedColumnBuffers` triggers a job to materialize the cache in `InMemoryTableScanExec` . ## How was this patch tested? Added tests in `CachedTableSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #21018 from maropu/SPARK-23880.		2018-04-25 19:06:18 +08:00
..
compatibility/src/test/scala/org/apache/spark/sql/hive/execution	[SPARK-23170][SQL] Dump the statistics of effective runs of analyzer and optimizer rules	2018-01-22 04:31:24 -08:00
src	[SPARK-23880][SQL] Do not trigger any jobs for caching data	2018-04-25 19:06:18 +08:00
pom.xml	[SPARK-23028] Bump master branch version to 2.4.0-SNAPSHOT	2018-01-13 00:37:59 +08:00