RDD.persist
Set this RDD’s storage level to persist its values across operations after the first time it is computed. This can only be used to assign a new storage level if the RDD does not have a storage level set yet. If no storage level is specified, it defaults to MEMORY_ONLY.
New in version 0.9.1.
Parameters
storageLevel : StorageLevel
    the target storage level
Returns
RDD
    The same RDD with storage level set to storageLevel.
See also
RDD.cache()
RDD.unpersist()
RDD.getStorageLevel()
Examples
>>> rdd = sc.parallelize(["b", "a", "c"])
>>> rdd.persist().is_cached
True
>>> str(rdd.getStorageLevel())
'Memory Serialized 1x Replicated'
>>> _ = rdd.unpersist()
>>> rdd.is_cached
False

>>> from pyspark import StorageLevel
>>> rdd2 = sc.range(5)
>>> _ = rdd2.persist(StorageLevel.MEMORY_AND_DISK)
>>> rdd2.is_cached
True
>>> str(rdd2.getStorageLevel())
'Disk Memory Serialized 1x Replicated'
Cannot override an existing storage level
>>> _ = rdd2.persist(StorageLevel.MEMORY_ONLY_2)
Traceback (most recent call last):
    ...
py4j.protocol.Py4JJavaError: ...
Assign another storage level after unpersist
>>> _ = rdd2.unpersist()
>>> rdd2.is_cached
False
>>> _ = rdd2.persist(StorageLevel.MEMORY_ONLY_2)
>>> str(rdd2.getStorageLevel())
'Memory Serialized 2x Replicated'
>>> rdd2.is_cached
True
>>> _ = rdd2.unpersist()
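
The last two examples suggest a simple pattern for changing the storage level of an RDD that may already be persisted: check the current level, unpersist if one is set, then persist again. The following is a minimal sketch of that pattern. `set_storage_level` is a hypothetical helper, not part of PySpark; it assumes only the `getStorageLevel()`, `unpersist()` and `persist()` methods shown above and the `useDisk`, `useMemory` and `useOffHeap` attributes of `StorageLevel`.

```python
def set_storage_level(rdd, level):
    """Persist `rdd` at `level`, clearing any previously assigned level first.

    Hypothetical helper (not a PySpark API). It only chains the documented
    RDD.getStorageLevel(), RDD.unpersist() and RDD.persist() calls.
    """
    current = rdd.getStorageLevel()
    # A level that uses neither disk, memory, nor off-heap storage
    # is treated here as "no storage level set".
    already_set = current.useDisk or current.useMemory or current.useOffHeap
    if already_set:
        # persist() cannot override an existing level, so release it first.
        rdd.unpersist()
    # persist() returns the same RDD, so the helper does too.
    return rdd.persist(level)
```

With the RDD from the examples, `set_storage_level(rdd2, StorageLevel.MEMORY_ONLY_2)` would stand in for the explicit unpersist-then-persist sequence. Because the helper unpersists first, any blocks already cached are dropped and the RDD is recomputed the next time it is needed.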