DataFrame.sample
Return a random sample of items from an axis of object.
Call this function with a named argument, specifying frac.
You can use random_state for reproducibility. Unlike in pandas, however, specifying a seed in pandas-on-Spark/Spark does not guarantee that the sampled rows will be fixed. The result set depends not only on the seed but also on how the data is distributed across machines and, to some extent, on network randomness when shuffle operations are involved. Even in the simplest case, the result set depends on the system's CPU core count.
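The dependence on data distribution can be illustrated with a toy model in plain Python. This is only a sketch under stated assumptions, not Spark's actual sampler: toy_partitioned_sample is a hypothetical helper, and the per-partition seeding rule (seed plus partition index) is an assumption chosen for illustration. The point is that with independent per-partition samplers, the same seed is only reproducible when the partitioning is also the same.

```python
import random


def toy_partitioned_sample(rows, frac, seed, num_partitions):
    """Toy model only: split rows into contiguous partitions and run an
    independent seeded Bernoulli sampler on each partition."""
    size = -(-len(rows) // num_partitions)  # ceiling division
    picked = []
    for index in range(num_partitions):
        part = rows[index * size:(index + 1) * size]
        # Assumption for illustration: each partition gets its own seed.
        rng = random.Random(seed + index)
        picked.extend(row for row in part if rng.random() < frac)
    return picked


rows = list(range(20))
# Same seed, different partitioning: nothing guarantees the same rows.
one = toy_partitioned_sample(rows, frac=0.25, seed=1, num_partitions=1)
four = toy_partitioned_sample(rows, frac=0.25, seed=1, num_partitions=4)
# Same seed and same partitioning: the result is repeatable.
again = toy_partitioned_sample(rows, frac=0.25, seed=1, num_partitions=4)
print(one, four, again)
```

In this model, four and again always match, while one may contain different rows even though the seed is identical.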
Parameters
n: Number of items to return. This is currently NOT supported. Use frac instead.
frac: Fraction of axis items to return.
replace: Sample with or without replacement.
random_state: Seed for the random number generator (if int).
Returns
A new object of same type as caller containing the sampled items.
Examples
>>> df = ps.DataFrame({'num_legs': [2, 4, 8, 0],
...                    'num_wings': [2, 0, 0, 0],
...                    'num_specimen_seen': [10, 2, 1, 8]},
...                   index=['falcon', 'dog', 'spider', 'fish'],
...                   columns=['num_legs', 'num_wings', 'num_specimen_seen'])
>>> df
        num_legs  num_wings  num_specimen_seen
falcon         2          2                 10
dog            4          0                  2
spider         8          0                  1
fish           0          0                  8
A random 25% sample of the DataFrame. Note that we use random_state to ensure the reproducibility of the examples.
>>> df.sample(frac=0.25, random_state=1)
        num_legs  num_wings  num_specimen_seen
falcon         2          2                 10
fish           0          0                  8
Extract a random 40% sample of the Series df['num_legs'], with replacement, so the same items can appear more than once.
>>> df['num_legs'].sample(frac=0.4, replace=True, random_state=1)
falcon    2
spider    8
spider    8
Name: num_legs, dtype: int64
Specifying the exact number of items to return is not supported at the moment.
>>> df.sample(n=5)
Traceback (most recent call last):
    ...
NotImplementedError: Function sample currently does not support specifying ...
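Because n is not supported, one possible workaround is to derive frac from the desired row count. The following is a minimal sketch, assuming the total number of rows is already known; frac_for_n is a hypothetical helper and not part of pandas-on-Spark. The value is capped at 1.0, which assumes sampling without replacement, and the resulting sample will contain only approximately n rows because frac is applied probabilistically.

```python
def frac_for_n(total_rows: int, n: int) -> float:
    """Hypothetical helper: convert a desired row count into a frac value."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    if n < 0:
        raise ValueError("n must be non-negative")
    # Cap at 1.0 so the fraction stays valid when n exceeds total_rows.
    return min(1.0, n / total_rows)


# For a four-row DataFrame like the one in the examples, one row is a quarter.
frac = frac_for_n(total_rows=4, n=1)
print(frac)

# With a running Spark session, the value could then be passed as:
# df.sample(frac=frac, random_state=1)
```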