pyspark.sql.functions.approx_count_distinct

pyspark.sql.functions.approx_count_distinct(col: ColumnOrName, rsd: Optional[float] = None) → pyspark.sql.column.Column

Aggregate function: returns a new Column with the approximate distinct count of column col.

New in version 2.1.0.

Changed in version 3.4.0: Supports Spark Connect.

Parameters
col : Column or str
    the column for which to compute the approximate distinct count.
rsd : float, optional
    maximum relative standard deviation allowed (default = 0.05). For rsd < 0.01, it is more efficient to use count_distinct().

Returns
Column

the column of computed results.

Examples

>>> from pyspark.sql.functions import approx_count_distinct
>>> df = spark.createDataFrame([1, 2, 2, 3], "INT")
>>> df.agg(approx_count_distinct("value").alias('distinct_values')).show()
+---------------+
|distinct_values|
+---------------+
|              3|
+---------------+
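
The sketch below is an illustration rather than part of the official documentation; it assumes the same active spark session as the example above. It shows how to pass an explicit rsd and compare the approximation against the exact count_distinct():

>>> from pyspark.sql.functions import approx_count_distinct, count_distinct
>>> df = spark.range(1000)  # single column "id" with 1000 distinct values
>>> row = df.agg(
...     approx_count_distinct("id", rsd=0.01).alias("approx"),  # allow roughly 1% relative error
...     count_distinct("id").alias("exact"),                    # exact, but costlier, count
... ).first()
>>> row["exact"]
1000

row["approx"] will be close to 1000, typically within the requested relative standard deviation; accepting this small error in exchange for lower cost is the reason to prefer approx_count_distinct on large datasets.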