spark-instrumented-optimizer/sql/hive
Josh Rosen ef6790fdc3 [SPARK-12075][SQL] Speed up HiveComparisionTest by avoiding / speeding up TestHive.reset()
When profiling HiveCompatibilitySuite, I noticed that most of the time seems to be spent in expensive `TestHive.reset()` calls. This patch speeds up suites based on HiveComparisonTest, such as HiveCompatibilitySuite, with the following changes:

- Avoid `TestHive.reset()` whenever possible:
  - Use a simple set of heuristics to guess whether we need to call `reset()` in between tests.
  - As a safety net, automatically re-run failed tests, calling `reset()` before the re-attempt.
- Speed up the expensive parts of `TestHive.reset()`: loading the `src` and `srcpart` tables took roughly 600ms per test, so we now load those tables only for tests that reference them. The check is simple string matching over the test queries, and it errs on the side of loading the tables in more situations than strictly necessary.
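
A rough sketch of the two ideas above, not the actual patch: the object name, `tablesToLoad`, and `runWithRetry` are hypothetical helpers invented for illustration, and plain substring matching is an assumption about what "simple string matching" could look like. Note that substring matching over-approximates on purpose: a query that mentions only `srcpart` also matches `src`.

```scala
import scala.util.control.NonFatal

// Illustrative only: none of these names come from TestHive itself.
object ResetHeuristicSketch {
  // Tables that are expensive to load and should be loaded only on demand.
  val lazilyLoadedTables: Set[String] = Set("src", "srcpart")

  // Returns every lazily loaded table whose name occurs as a substring of
  // any query. This deliberately over-approximates: false positives cost
  // some load time, while false negatives would break the test.
  def tablesToLoad(queries: Seq[String]): Set[String] = {
    val lowered = queries.map(_.toLowerCase)
    lazilyLoadedTables.filter(table => lowered.exists(_.contains(table)))
  }

  // Runs the test once without resetting. If it fails with a non-fatal
  // error, calls reset() and retries exactly once; a second failure
  // propagates to the caller.
  def runWithRetry[T](reset: () => Unit)(test: () => T): T =
    try test()
    catch {
      case NonFatal(_) =>
        reset()
        test()
    }
}
```

With this sketch, a test whose queries never mention either table skips the load entirely, and a test that fails because of state left behind by an earlier test gets a second attempt against a freshly reset environment.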

After these changes, HiveCompatibilitySuite seems to run in about 10 minutes.

This PR is a revival of #6663, an earlier experimental PR from June, where I played around with several possible speedups for this suite.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10055 from JoshRosen/speculative-testhive-reset.
2015-12-02 07:29:45 +08:00
| Name | Last commit | Date |
|---|---|---|
| compatibility/src/test/scala/org/apache/spark/sql/hive/execution | [SPARK-9034][SQL] Reflect field names defined in GenericUDTF | 2015-11-02 23:52:36 -08:00 |
| src | [SPARK-12075][SQL] Speed up HiveComparisionTest by avoiding / speeding up TestHive.reset() | 2015-12-02 07:29:45 +08:00 |
| pom.xml | [SPARK-10300] [BUILD] [TESTS] Add support for test tags in run-tests.py. | 2015-10-07 14:11:21 -07:00 |