[SPARK-31448][PYTHON] Fix storage level used in persist() in dataframe.py
### What changes were proposed in this pull request?

Since the data is serialized on the Python side, we should make cache() in PySpark dataframes use StorageLevel.MEMORY_AND_DISK mode which has deserialized=false. This change was done to `pyspark/rdd.py` as part of SPARK-2014 but was missed from `pyspark/dataframe.py`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Using existing tests

Closes #29242 from abhishekd0907/SPARK-31448.

Authored-by: Abhishek Dixit <abhishekdixit0907@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
parent 316242b768
commit 6f36db1fa5
@@ -678,13 +678,14 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
         return self
 
     @since(1.3)
-    def persist(self, storageLevel=StorageLevel.MEMORY_AND_DISK):
+    def persist(self, storageLevel=StorageLevel.MEMORY_AND_DISK_DESER):
         """Sets the storage level to persist the contents of the :class:`DataFrame` across
         operations after the first time it is computed. This can only be used to assign
         a new storage level if the :class:`DataFrame` does not have a storage level set yet.
-        If no storage level is specified defaults to (`MEMORY_AND_DISK`).
+        If no storage level is specified defaults to (`MEMORY_AND_DISK_DESER`)
 
-        .. note:: The default storage level has changed to `MEMORY_AND_DISK` to match Scala in 2.0.
+        .. note:: The default storage level has changed to `MEMORY_AND_DISK_DESER` to match Scala
+            in 3.0.
         """
         self.is_cached = True
         javaStorageLevel = self._sc._getJavaStorageLevel(storageLevel)
@@ -57,3 +57,4 @@ StorageLevel.MEMORY_ONLY_2 = StorageLevel(False, True, False, False, 2)
 StorageLevel.MEMORY_AND_DISK = StorageLevel(True, True, False, False)
 StorageLevel.MEMORY_AND_DISK_2 = StorageLevel(True, True, False, False, 2)
 StorageLevel.OFF_HEAP = StorageLevel(True, True, True, False, 1)
+StorageLevel.MEMORY_AND_DISK_DESER = StorageLevel(True, True, False, True)
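To see what separates the two levels in the diff, the sketch below compares `MEMORY_AND_DISK` and `MEMORY_AND_DISK_DESER` flag by flag. It does not import PySpark: the `StorageLevel` dataclass here is a hypothetical stand-in, and its field names (`useDisk`, `useMemory`, `useOffHeap`, `deserialized`, `replication`) are an assumption based on the positional order of PySpark's constructor, which the diff itself does not spell out. `differing_flags` is likewise an illustrative helper, not part of the patch.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class StorageLevel:
    """Stand-in for illustration only; field names are assumed, not taken from the diff."""
    useDisk: bool
    useMemory: bool
    useOffHeap: bool
    deserialized: bool
    replication: int = 1


# Same positional arguments as the two constants shown in the second hunk.
MEMORY_AND_DISK = StorageLevel(True, True, False, False)
MEMORY_AND_DISK_DESER = StorageLevel(True, True, False, True)


def differing_flags(a: StorageLevel, b: StorageLevel) -> list:
    """Return the names of the fields whose values differ between two levels."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


print(differing_flags(MEMORY_AND_DISK, MEMORY_AND_DISK_DESER))  # ['deserialized']
```

Under these assumptions the two constants differ only in the fourth positional argument, so the patch changes whether the default level is marked deserialized and leaves the memory, disk, off-heap, and replication settings alone.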