pyspark.pandas.DataFrame.duplicated
DataFrame.duplicated(subset: Union[Any, Tuple[Any, …], List[Union[Any, Tuple[Any, …]]], None] = None, keep: str = 'first') → Series

Return boolean Series denoting duplicate rows, optionally only considering certain columns.
- Parameters
- subset : column label or sequence of labels, optional
Only consider certain columns for identifying duplicates; by default, use all of the columns. See the subset example below.
- keep : {‘first’, ‘last’, False}, default ‘first’
‘first’ : Mark duplicates as True except for the first occurrence.
‘last’ : Mark duplicates as True except for the last occurrence.
False : Mark all duplicates as True.
- Returns
- duplicated : Series
Examples
>>> df = ps.DataFrame({'a': [1, 1, 1, 3], 'b': [1, 1, 1, 4], 'c': [1, 1, 1, 5]},
...                   columns = ['a', 'b', 'c'])
>>> df
   a  b  c
0  1  1  1
1  1  1  1
2  1  1  1
3  3  4  5
>>> df.duplicated().sort_index()
0    False
1     True
2     True
3    False
dtype: bool
Mark duplicates as True except for the last occurrence.

>>> df.duplicated(keep='last').sort_index()
0     True
1     True
2    False
3    False
dtype: bool
Mark all duplicates as True.

>>> df.duplicated(keep=False).sort_index()
0     True
1     True
2     True
3    False
dtype: bool