spark.frame
Return the current DataFrame as a Spark DataFrame. DataFrame.spark.frame() is an alias of DataFrame.to_spark().
DataFrame.spark.frame(index_col=None)
DataFrame.to_spark(index_col=None)
Parameters
index_col : str or list of str, optional, default: None
    Column names to be used in Spark to represent pandas-on-Spark's index. The index name in pandas-on-Spark is ignored. By default, the index is always lost.
See also
DataFrame.to_spark
DataFrame.to_pandas_on_spark
DataFrame.spark.frame
Examples
By default, this method loses the index, as shown below.
>>> df = ps.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
>>> df.to_spark().show()
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  4|  7|
|  2|  5|  8|
|  3|  6|  9|
+---+---+---+
>>> df = ps.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
>>> df.spark.frame().show()
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  4|  7|
|  2|  5|  8|
|  3|  6|  9|
+---+---+---+
If index_col is set, it keeps the index column as specified.
>>> df.to_spark(index_col="index").show()
+-----+---+---+---+
|index|  a|  b|  c|
+-----+---+---+---+
|    0|  1|  4|  7|
|    1|  2|  5|  8|
|    2|  3|  6|  9|
+-----+---+---+---+
Keeping the index column is useful when you want to call some Spark APIs and then convert the result back to a pandas-on-Spark DataFrame without creating a default index, which can affect performance.
>>> spark_df = df.to_spark(index_col="index")
>>> spark_df = spark_df.filter("a == 2")
>>> spark_df.to_pandas_on_spark(index_col="index")
       a  b  c
index
1      2  5  8
In the case of a multi-index, pass a list of names to index_col.
>>> new_df = df.set_index("a", append=True)
>>> new_spark_df = new_df.to_spark(index_col=["index_1", "index_2"])
>>> new_spark_df.show()
+-------+-------+---+---+
|index_1|index_2|  b|  c|
+-------+-------+---+---+
|      0|      1|  4|  7|
|      1|      2|  5|  8|
|      2|      3|  6|  9|
+-------+-------+---+---+
Likewise, it can be converted back to a pandas-on-Spark DataFrame.
>>> new_spark_df.to_pandas_on_spark(
...     index_col=["index_1", "index_2"])
                 b  c
index_1 index_2
0       1        4  7
1       2        5  8
2       3        6  9