Generator

class dataframely.random.Generator(seed: int | None = None)

Type for sampling values of primitive types using a random number generator.

All generator methods are called sample_<type> and, if applicable, allow specifying a lower (inclusive) and an upper (exclusive) bound for the type to be sampled.

These methods can be used to sample higher-level types. To this end, users may also directly access the underlying numpy_generator to reuse the generator’s seeding.

Parameters:

seed – The seed to use for initializing the random number generator used for all sampling methods.

Methods:

  • sample_binary – Sample a list of binary values in the specified length range.

  • sample_bool – Sample a list of booleans.

  • sample_choice – Sample a list of elements from a list of choices with replacement.

  • sample_date – Sample a list of dates in the provided range.

  • sample_datetime – Sample a list of datetimes in the provided range.

  • sample_duration – Sample a list of durations in the provided range.

  • sample_float – Sample a list of floating point numbers in the specified range.

  • sample_int – Sample a list of integers in the specified range.

  • sample_seed – Sample a single integer that can be used as a seed for other RNGs.

  • sample_string – Sample a list of strings adhering to the provided regex.

  • sample_time – Sample a list of times in the provided range.

sample_binary(
n: int = 1,
*,
min_bytes: int,
max_bytes: int,
null_probability: float = 0.0,
) → Series

Sample a list of binary values in the specified length range.

Parameters:
  • n – The number of binary values to sample.

  • min_bytes – The minimum number of bytes for each value.

  • max_bytes – The maximum number of bytes for each value.

  • null_probability – The probability of an element being null.

Returns:

A series with n elements of dtype Binary.

sample_bool(
n: int = 1,
*,
null_probability: float = 0.0,
p_true: float | None = None,
) → Series

Sample a list of booleans.

Parameters:
  • n – The number of booleans to sample.

  • null_probability – The probability of an element being null.

  • p_true – The probability of sampling True among non-null samples. Defaults to 0.5 (uniform sampling) when None.

Returns:

A series with n elements of dtype Boolean.

sample_choice(
n: int = 1,
*,
choices: Sequence[T],
null_probability: float = 0.0,
weights: Sequence[float] | None = None,
) → Series

Sample a list of elements from a list of choices with replacement.

Parameters:
  • n – The number of elements to sample.

  • choices – The choices to sample from.

  • null_probability – The probability of an element being null.

  • weights – An ordered vector of weights, one per choice, in the same order as choices.

Returns:

A series with n elements of auto-inferred dtype.

sample_date(
n: int = 1,
*,
min: date,
max: date | None,
resolution: str | None = None,
null_probability: float = 0.0,
) → Series

Sample a list of dates in the provided range.

Parameters:
  • n – The number of dates to sample.

  • min – The minimum date to sample (inclusive).

  • max – The maximum date to sample (exclusive). Defaults to ‘10000-01-01’ when None.

  • resolution – The resolution that dates in the column must have. This uses the same formatting language as the polars datetime round method (e.g. 1d).

  • null_probability – The probability of an element being null.

Returns:

A series with n elements of dtype Date.

sample_datetime(
n: int = 1,
*,
min: datetime,
max: datetime | None,
resolution: str | None = None,
time_zone: str | tzinfo | None = None,
time_unit: Literal['ns', 'us', 'ms'] = 'us',
null_probability: float = 0.0,
) → Series

Sample a list of datetimes in the provided range.

Parameters:
  • n – The number of datetimes to sample.

  • min – The minimum datetime to sample (inclusive).

  • max – The maximum datetime to sample (exclusive). Defaults to ‘10000-01-01’ when None.

  • resolution – The resolution that datetimes in the column must have. This uses the same formatting language as the polars datetime round method (e.g. 1h).

  • time_zone – The time zone that datetimes in the column must have. The time zone must be a valid IANA time zone identifier, e.g. Etc/UTC or America/New_York.

  • time_unit – The time unit of the datetime column. Defaults to us (microseconds).

  • null_probability – The probability of an element being null.

Returns:

A series with n elements of dtype Datetime.

sample_duration(
n: int = 1,
*,
min: timedelta,
max: timedelta,
resolution: str | None = None,
null_probability: float = 0.0,
) → Series

Sample a list of durations in the provided range.

Parameters:
  • n – The number of durations to sample.

  • min – The minimum duration to sample (inclusive).

  • max – The maximum duration to sample (exclusive).

  • resolution – The resolution that durations in the column must have. This uses the same formatting language as the polars datetime round method (e.g. 1h).

  • null_probability – The probability of an element being null.

Returns:

A series with n elements of dtype Duration.

sample_float(
n: int = 1,
*,
min: float,
max: float,
null_probability: float = 0.0,
nan_probability: float = 0.0,
inf_probability: float = 0.0,
) → Series

Sample a list of floating point numbers in the specified range.

Parameters:
  • n – The number of floats to sample.

  • min – The minimum float to sample (inclusive).

  • max – The maximum float to sample (exclusive).

  • null_probability – The probability of an element being null.

  • nan_probability – The probability of an element being NaN.

  • inf_probability – The probability of an element being infinite (inf).

Returns:

A series with n elements of dtype Float64.

sample_int(
n: int = 1,
*,
min: int,
max: int,
null_probability: float = 0.0,
) → Series

Sample a list of integers in the specified range.

Parameters:
  • n – The number of integers to sample.

  • min – The minimum integer to sample (inclusive).

  • max – The maximum integer to sample (exclusive).

  • null_probability – The probability of an element being null.

Returns:

A series with n elements of dtype Int64.

sample_seed() → int

Sample a single integer that can be used as a seed for other RNGs.

Returns:

An integer seed in the uint32 range.

sample_string(
n: int = 1,
*,
regex: str,
null_probability: float = 0.0,
) → Series

Sample a list of strings adhering to the provided regex.

Parameters:
  • n – The number of strings to sample.

  • regex – The regex that all elements have to adhere to.

  • null_probability – The probability of an element being null.

Returns:

A series with n elements of dtype String.

sample_time(
n: int = 1,
*,
min: time,
max: time | None,
resolution: str | None = None,
null_probability: float = 0.0,
) → Series

Sample a list of times in the provided range.

Parameters:
  • n – The number of times to sample.

  • min – The minimum time to sample (inclusive).

  • max – The maximum time to sample (exclusive). Defaults to midnight when None.

  • resolution – The resolution that times in the column must have. This uses the same formatting language as the polars datetime round method (e.g. 1h).

  • null_probability – The probability of an element being null.

Returns:

A series with n elements of dtype Time.