pyspark.pandas.DataFrame.median

DataFrame.median(axis: Union[int, str, None] = None, numeric_only: bool = None, accuracy: int = 10000) → Union[int, float, bool, str, bytes, decimal.Decimal, datetime.date, datetime.datetime, None, Series]

Return the median of the values for the requested axis.

Note

Unlike pandas, pandas-on-Spark returns an approximate median, computed with an approximate percentile algorithm, because computing an exact median across a large dataset is extremely expensive.
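One practical difference between an exact median and a percentile-style estimate can be sketched in plain Python, with no Spark involved. The helper name `compare_medians` is hypothetical, and `statistics.median_low` only stands in for "pick an element of the data at the middle rank"; it is not the algorithm Spark uses. Whether pandas-on-Spark shows the same difference on a given dataset is an assumption to verify.

```python
import statistics


def compare_medians(values):
    """Return (exact median, element-selecting median) for a list of numbers.

    Illustration only: `median_low` selects an element of the data,
    while `median` averages the two middle values when the count is even.
    """
    exact = statistics.median(values)
    element = statistics.median_low(values)
    return exact, element


# Odd count: both approaches agree on the middle element.
odd_result = compare_medians([24.0, 21.0, 25.0, 33.0, 26.0])

# Even count: the exact median averages 2 and 3, the element-selecting one does not.
even_result = compare_medians([1, 2, 3, 4])

print(odd_result)
print(even_result)
```

With an odd number of values the two results coincide, which is why the examples in this page match pandas exactly.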

Parameters
axis : {index (0), columns (1)}

Axis for the function to be applied on.

numeric_only : bool, default None

Include only float, int, and boolean columns. False is not supported. This parameter exists mainly for pandas compatibility.

accuracy : int, optional, default 10000

Accuracy of the approximation. A larger value gives better accuracy. The relative error is 1.0 / accuracy.
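To get a feel for what a given accuracy value means, the arithmetic can be sketched in plain Python. This assumes the relative error bounds the rank error as a fraction of the row count, which is how Spark documents its approximate quantiles. The helper `rank_tolerance` is hypothetical and not part of pyspark.

```python
def rank_tolerance(n_rows: int, accuracy: int = 10000) -> int:
    """Rough upper bound on the rank error of the approximate median.

    Assumption: the returned value's rank may differ from the true median
    rank by at most (1.0 / accuracy) * n_rows, rounded up.
    """
    if n_rows < 0:
        raise ValueError("n_rows must be non-negative")
    if accuracy <= 0:
        raise ValueError("accuracy must be a positive integer")
    # Integer ceiling division avoids floating-point rounding surprises.
    return (n_rows + accuracy - 1) // accuracy


# Compare a few accuracy settings for a 50-million-row column.
for accuracy in (100, 10000, 1000000):
    print(accuracy, rank_tolerance(50_000_000, accuracy))
```

Under this assumption, raising accuracy narrows the band of ranks the result can come from. A larger value typically costs more memory in the underlying percentile computation.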

Returns
median : scalar or Series

Examples

>>> df = ps.DataFrame({
...     'a': [24., 21., 25., 33., 26.], 'b': [1, 2, 3, 4, 5]}, columns=['a', 'b'])
>>> df
      a  b
0  24.0  1
1  21.0  2
2  25.0  3
3  33.0  4
4  26.0  5

On a DataFrame:

>>> df.median()
a    25.0
b     3.0
dtype: float64

On a Series:

>>> df['a'].median()
25.0
>>> (df['b'] + 100).median()
103.0

For multi-index columns:

>>> df.columns = pd.MultiIndex.from_tuples([('x', 'a'), ('y', 'b')])
>>> df
      x  y
      a  b
0  24.0  1
1  21.0  2
2  25.0  3
3  33.0  4
4  26.0  5

On a DataFrame:

>>> df.median()
x  a    25.0
y  b     3.0
dtype: float64
>>> df.median(axis=1)
0    12.5
1    11.5
2    14.0
3    18.5
4    15.5
dtype: float64

On a Series:

>>> df[('x', 'a')].median()
25.0
>>> (df[('y', 'b')] + 100).median()
103.0