pyspark.pandas.DataFrame.duplicated
DataFrame.duplicated(subset: Union[Any, Tuple[Any, …], List[Union[Any, Tuple[Any, …]]], None] = None, keep: str = 'first') → Series

Return boolean Series denoting duplicate rows, optionally only considering certain columns.
- Parameters
- subset : column label or sequence of labels, optional
Only consider certain columns for identifying duplicates; by default, use all of the columns. See the subset example below.
- keep : {‘first’, ‘last’, False}, default ‘first’
‘first’ : Mark duplicates as True except for the first occurrence.
‘last’ : Mark duplicates as True except for the last occurrence.
False : Mark all duplicates as True.
- Returns
- duplicated : Series
Examples
>>> df = ps.DataFrame({'a': [1, 1, 1, 3], 'b': [1, 1, 1, 4], 'c': [1, 1, 1, 5]},
...                   columns = ['a', 'b', 'c'])
>>> df
   a  b  c
0  1  1  1
1  1  1  1
2  1  1  1
3  3  4  5
>>> df.duplicated().sort_index()
0    False
1     True
2     True
3    False
dtype: bool
Mark duplicates as True except for the last occurrence.

>>> df.duplicated(keep='last').sort_index()
0     True
1     True
2    False
3    False
dtype: bool
Mark all duplicates as True.

>>> df.duplicated(keep=False).sort_index()
0     True
1     True
2     True
3    False
dtype: bool