[SPARK-31448][PYTHON] Fix storage level used in persist() in dataframe.py

### What changes were proposed in this pull request?
Since the data is serialized on the Python side, we should make cache() in PySpark dataframes use StorageLevel.MEMORY_AND_DISK mode which has deserialized=false. This change was done to `pyspark/rdd.py` as part of SPARK-2014 but was missed from `pyspark/dataframe.py`

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Using existing tests

Closes #29242 from abhishekd0907/SPARK-31448.

Authored-by: Abhishek Dixit <abhishekdixit0907@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
This commit is contained in:
Abhishek Dixit 2020-09-15 08:41:22 -05:00 committed by Sean Owen
parent 316242b768
commit 6f36db1fa5
2 changed files with 5 additions and 3 deletions

View file

@ -678,13 +678,14 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
return self
@since(1.3)
def persist(self, storageLevel=StorageLevel.MEMORY_AND_DISK):
def persist(self, storageLevel=StorageLevel.MEMORY_AND_DISK_DESER):
"""Sets the storage level to persist the contents of the :class:`DataFrame` across
operations after the first time it is computed. This can only be used to assign
a new storage level if the :class:`DataFrame` does not have a storage level set yet.
If no storage level is specified defaults to (`MEMORY_AND_DISK`).
If no storage level is specified defaults to (`MEMORY_AND_DISK_DESER`)
.. note:: The default storage level has changed to `MEMORY_AND_DISK` to match Scala in 2.0.
.. note:: The default storage level has changed to `MEMORY_AND_DISK_DESER` to match Scala
in 3.0.
"""
self.is_cached = True
javaStorageLevel = self._sc._getJavaStorageLevel(storageLevel)

View file

@ -57,3 +57,4 @@ StorageLevel.MEMORY_ONLY_2 = StorageLevel(False, True, False, False, 2)
StorageLevel.MEMORY_AND_DISK = StorageLevel(True, True, False, False)
StorageLevel.MEMORY_AND_DISK_2 = StorageLevel(True, True, False, False, 2)
StorageLevel.OFF_HEAP = StorageLevel(True, True, True, False, 1)
StorageLevel.MEMORY_AND_DISK_DESER = StorageLevel(True, True, False, True)