[SPARK-31448][PYTHON] Fix storage level used in persist() in dataframe.py
### What changes were proposed in this pull request?

Since the data is serialized on the Python side, we should make cache() in PySpark dataframes use StorageLevel.MEMORY_AND_DISK mode which has deserialized=false. This change was done to `pyspark/rdd.py` as part of SPARK-2014 but was missed from `pyspark/dataframe.py`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Using existing tests

Closes #29242 from abhishekd0907/SPARK-31448.

Authored-by: Abhishek Dixit <abhishekdixit0907@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
parent 316242b768
commit 6f36db1fa5
@@ -678,13 +678,14 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
         return self
 
     @since(1.3)
-    def persist(self, storageLevel=StorageLevel.MEMORY_AND_DISK):
+    def persist(self, storageLevel=StorageLevel.MEMORY_AND_DISK_DESER):
         """Sets the storage level to persist the contents of the :class:`DataFrame` across
         operations after the first time it is computed. This can only be used to assign
         a new storage level if the :class:`DataFrame` does not have a storage level set yet.
-        If no storage level is specified defaults to (`MEMORY_AND_DISK`).
+        If no storage level is specified defaults to (`MEMORY_AND_DISK_DESER`)
 
-        .. note:: The default storage level has changed to `MEMORY_AND_DISK` to match Scala in 2.0.
+        .. note:: The default storage level has changed to `MEMORY_AND_DISK_DESER` to match Scala
+            in 3.0.
         """
         self.is_cached = True
         javaStorageLevel = self._sc._getJavaStorageLevel(storageLevel)

@@ -57,3 +57,4 @@ StorageLevel.MEMORY_ONLY_2 = StorageLevel(False, True, False, False, 2)
 StorageLevel.MEMORY_AND_DISK = StorageLevel(True, True, False, False)
 StorageLevel.MEMORY_AND_DISK_2 = StorageLevel(True, True, False, False, 2)
 StorageLevel.OFF_HEAP = StorageLevel(True, True, True, False, 1)
+StorageLevel.MEMORY_AND_DISK_DESER = StorageLevel(True, True, False, True)
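
The positional booleans in the `StorageLevel(...)` constants are easy to misread, so here is a minimal sketch of how the old and new `persist()` defaults differ. It assumes the argument order is (use disk, use memory, use off-heap, deserialized, replication) with replication defaulting to 1. `StorageLevelSketch` and `differing_flags` are hypothetical names used only for illustration; they are not part of PySpark, and the sketch runs without Spark installed.

```python
from typing import NamedTuple


class StorageLevelSketch(NamedTuple):
    """Illustrative stand-in for pyspark.StorageLevel; not the real class."""
    use_disk: bool
    use_memory: bool
    use_off_heap: bool
    deserialized: bool
    replication: int = 1


# Flag values copied from the constants shown in the diff.
MEMORY_AND_DISK = StorageLevelSketch(True, True, False, False)
MEMORY_AND_DISK_DESER = StorageLevelSketch(True, True, False, True)


def differing_flags(a, b):
    """Return the names of the fields whose values differ between a and b."""
    return [name for name in a._fields if getattr(a, name) != getattr(b, name)]


print(differing_flags(MEMORY_AND_DISK, MEMORY_AND_DISK_DESER))
# prints ['deserialized']
```

Under these assumptions the two levels differ only in the `deserialized` flag, so callers who pass an explicit storage level to `persist()` should see no change; only the default is affected.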