8cb415a4b9
This patch adds support for entries larger than the default page size in BytesToBytesMap. These large rows are handled by allocating special overflow pages to hold individual entries.
In addition, this patch integrates BytesToBytesMap with the ShuffleMemoryManager:
- Move BytesToBytesMap from `unsafe` to `core` so that it can import `ShuffleMemoryManager`.
- Before allocating new data pages, ask the ShuffleMemoryManager to reserve the memory:
- `putNewKey()` now returns a boolean to indicate whether the insert succeeded or failed due to a lack of memory. The caller can use this value to respond to the memory pressure (e.g. by spilling).
- `UnsafeFixedWidthAggregationMap. getAggregationBuffer()` now returns `null` to signal failure due to a lack of memory.
- Updated all uses of these classes to handle these error conditions.
- Added new tests for allocating large records and for allocations which fail due to memory pressure.
- Extended the `afterAll()` test teardown methods to detect ShuffleMemoryManager leaks.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #7762 from JoshRosen/large-rows and squashes the following commits:
ae7bc56 [Josh Rosen] Fix compilation
82fc657 [Josh Rosen] Merge remote-tracking branch 'origin/master' into large-rows
34ab943 [Josh Rosen] Remove semi
31a525a [Josh Rosen] Integrate BytesToBytesMap with ShuffleMemoryManager.
|
||
---|---|---|
.. | ||
avro | ||
gen-java/org/apache/spark/sql/parquet/test/avro | ||
java/test/org/apache/spark/sql | ||
resources | ||
scala/org/apache/spark/sql | ||
scripts | ||
thrift | ||
README.md |
Notes for Parquet compatibility tests
The following directories and files are used for Parquet compatibility tests:
.
├── README.md # This file
├── avro
│ ├── parquet-compat.avdl # Testing Avro IDL
│ └── parquet-compat.avpr # !! NO TOUCH !! Protocol file generated from parquet-compat.avdl
├── gen-java # !! NO TOUCH !! Generated Java code
├── scripts
│ └── gen-code.sh # Script used to generate Java code for Thrift and Avro
└── thrift
└── parquet-compat.thrift # Testing Thrift schema
Generated Java code are used in the following test suites:
org.apache.spark.sql.parquet.ParquetAvroCompatibilitySuite
org.apache.spark.sql.parquet.ParquetThriftCompatibilitySuite
To avoid code generation during build time, Java code generated from testing Thrift schema and Avro IDL are also checked in.
When updating the testing Thrift schema and Avro IDL, please run gen-code.sh
to update all the generated Java code.
Prerequisites
Please ensure avro-tools
and thrift
are installed. You may install these two on Mac OS X via:
$ brew install thrift avro-tools