pyspark.sql.functions.approx_count_distinct
pyspark.sql.functions.approx_count_distinct(col: ColumnOrName, rsd: Optional[float] = None) → pyspark.sql.column.Column

Aggregate function: returns a new Column for approximate distinct count of column col.

New in version 2.1.0.
Changed in version 3.4.0: Supports Spark Connect.
- Parameters
  - col : Column or str
  - rsd : float, optional
    maximum relative standard deviation allowed (default = 0.05). For rsd < 0.01, it is more efficient to use count_distinct().
- Returns
  - Column
    the column of computed results.
Examples
>>> df = spark.createDataFrame([1,2,2,3], "INT")
>>> df.agg(approx_count_distinct("value").alias('distinct_values')).show()
+---------------+
|distinct_values|
+---------------+
|              3|
+---------------+