[SPARK-31448][PYTHON] Fix storage level used in persist() in dataframe.py

### What changes were proposed in this pull request?
Since the data is serialized on the Python side, we should make `cache()` in PySpark DataFrames use `StorageLevel.MEMORY_AND_DISK`, which has `deserialized=false`. This change was made in `pyspark/rdd.py` as part of SPARK-2014 but was missed in `pyspark/dataframe.py`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Using existing tests

Closes #29242 from abhishekd0907/SPARK-31448.

Authored-by: Abhishek Dixit <abhishekdixit0907@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
Abhishek Dixit 2020-09-15 08:41:22 -05:00 committed by Sean Owen
parent 316242b768
commit 6f36db1fa5
2 changed files with 5 additions and 3 deletions

@@ -678,13 +678,14 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
         return self
 
     @since(1.3)
-    def persist(self, storageLevel=StorageLevel.MEMORY_AND_DISK):
+    def persist(self, storageLevel=StorageLevel.MEMORY_AND_DISK_DESER):
         """Sets the storage level to persist the contents of the :class:`DataFrame` across
         operations after the first time it is computed. This can only be used to assign
         a new storage level if the :class:`DataFrame` does not have a storage level set yet.
-        If no storage level is specified defaults to (`MEMORY_AND_DISK`).
+        If no storage level is specified defaults to (`MEMORY_AND_DISK_DESER`)
 
-        .. note:: The default storage level has changed to `MEMORY_AND_DISK` to match Scala in 2.0.
+        .. note:: The default storage level has changed to `MEMORY_AND_DISK_DESER` to match Scala
+            in 3.0.
         """
         self.is_cached = True
         javaStorageLevel = self._sc._getJavaStorageLevel(storageLevel)

@@ -57,3 +57,4 @@ StorageLevel.MEMORY_ONLY_2 = StorageLevel(False, True, False, False, 2)
 StorageLevel.MEMORY_AND_DISK = StorageLevel(True, True, False, False)
 StorageLevel.MEMORY_AND_DISK_2 = StorageLevel(True, True, False, False, 2)
 StorageLevel.OFF_HEAP = StorageLevel(True, True, True, False, 1)
+StorageLevel.MEMORY_AND_DISK_DESER = StorageLevel(True, True, False, True)
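
The positional `StorageLevel(...)` calls above are easier to compare once the flags are named. The following is a minimal sketch, not the PySpark implementation: it uses a hypothetical `namedtuple` stand-in so it runs without Spark, and it assumes the field order `(useDisk, useMemory, useOffHeap, deserialized, replication)` with `replication` defaulting to 1, mirroring the PySpark `StorageLevel` constructor. The helper `differing_flags` is likewise illustrative only.

```python
from collections import namedtuple

# Hypothetical stand-in for pyspark.StorageLevel so the sketch runs without Spark.
# Assumed field order: (useDisk, useMemory, useOffHeap, deserialized, replication).
StorageLevel = namedtuple(
    "StorageLevel",
    ["useDisk", "useMemory", "useOffHeap", "deserialized", "replication"],
    defaults=[1],  # applies to the last field only: replication=1
)

# Same argument values as the two constants in the diff.
MEMORY_AND_DISK = StorageLevel(True, True, False, False)
MEMORY_AND_DISK_DESER = StorageLevel(True, True, False, True)


def differing_flags(a, b):
    """Return the names of the fields whose values differ between two levels."""
    return [name for name in a._fields if getattr(a, name) != getattr(b, name)]


print(differing_flags(MEMORY_AND_DISK, MEMORY_AND_DISK_DESER))
# prints ['deserialized']
```

Under these assumptions, the new constant differs from the Python-side `MEMORY_AND_DISK` only in the `deserialized` flag, which is the flag the changed `persist()` default switches on.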