Schema.sample

classmethod Schema.sample(
num_rows: int | None = None,
*,
overrides: Mapping[str, Iterable[Any]] | Sequence[Mapping[str, Any]] | None = None,
generator: Generator | None = None,
) → DataFrame[Self]

Create a random data frame with a predefined number of rows.

Generally, this method should only be used for testing. To generate _realistic_ test data, you will need to implement custom sampling logic (using the Generator class).

To allow sampling random data frames in the presence of custom rules and primary key constraints, this method performs fuzzy sampling: it samples in a loop until it finds a data frame of length num_rows that adheres to the schema. The maximum number of sampling rounds is configured via max_sampling_iterations in the Config class. If this setting is set to 1, sampling is only reliable for schemas without custom rules and without primary key constraints.
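The loop described above can be illustrated with a minimal, standard-library-only sketch. It is not the library's implementation: the helper sample_until_valid, the row sampler draw_row, the rule is_even and the max_iterations argument are hypothetical stand-ins for the schema's column samplers, custom rules and the max_sampling_iterations setting.

```python
import random
from typing import Callable, Optional


def sample_until_valid(
    num_rows: int,
    draw_row: Callable[[random.Random], dict],
    rule: Callable[[dict], bool],
    key: str,
    max_iterations: int,
    seed: Optional[int] = None,
) -> list:
    """Rejection-sample rows until num_rows satisfy `rule` and have unique keys."""
    rng = random.Random(seed)
    accepted: dict = {}  # primary key value -> row, which enforces key uniqueness
    for _ in range(max_iterations):
        missing = num_rows - len(accepted)
        if missing == 0:
            break
        # Draw only as many candidates as are still missing, then filter them.
        for _ in range(missing):
            row = draw_row(rng)
            if rule(row) and row[key] not in accepted:
                accepted[row[key]] = row
    if len(accepted) < num_rows:
        raise ValueError(f"no valid sample found in {max_iterations} iterations")
    return list(accepted.values())


def draw_row(rng: random.Random) -> dict:
    # Hypothetical column samplers: an integer key and an integer value.
    return {"id": rng.randrange(1000), "value": rng.randrange(100)}


def is_even(row: dict) -> bool:
    # Hypothetical custom rule: the value column must be even.
    return row["value"] % 2 == 0


rows = sample_until_valid(5, draw_row, is_even, key="id", max_iterations=100, seed=42)
```

With max_iterations=1, this sketch succeeds only if every row of the first draw passes the rule and has a distinct key, which is why a single round is reliable only when there are no rules or key constraints to violate.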

Parameters:
  • num_rows – The number of rows to sample for creating the random data frame. Required only if no overrides are provided. If this is None, the number of rows in the data frame is determined by the length of the values in overrides.

  • overrides – Fixed values for a subset of the columns of the sampled data frame. Just like when initializing a polars.DataFrame, overrides may be provided in either “column” or “row” layout, i.e. via a mapping or a list of mappings, respectively. The number of rows in the resulting data frame is equal to the length of the values in overrides. If both overrides and num_rows are provided, the length of the values in overrides must be equal to num_rows. The rows of the returned data frame are guaranteed to be in the same order as the items in overrides. When values are provided for a column, no sampling is performed for that column.

  • generator – The (seeded) generator to use for sampling data. If None, a generator with a random seed is automatically created.
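The interplay between the two overrides layouts and num_rows can be sketched in plain Python. The helpers to_column_layout and expected_num_rows are hypothetical and only mirror the documented rules; they are not part of the library, and the check that all override columns share one length is an assumption of this sketch.

```python
from collections.abc import Mapping
from typing import Optional


def to_column_layout(overrides) -> dict:
    """Convert column- or row-layout overrides into a dict of column lists."""
    if isinstance(overrides, Mapping):
        # Column layout: a mapping from column name to an iterable of values.
        return {name: list(values) for name, values in overrides.items()}
    # Row layout: a sequence of mappings, one per row.
    rows = list(overrides)
    if not rows:
        return {}
    keys = list(rows[0])
    if any(set(row) != set(keys) for row in rows):
        raise ValueError("all row mappings must provide the same keys")
    return {name: [row[name] for row in rows] for name in keys}


def expected_num_rows(overrides, num_rows: Optional[int] = None) -> int:
    """Return the row count implied by overrides, checking it against num_rows."""
    columns = to_column_layout(overrides)
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        # Assumption of this sketch: column-layout values must share one length.
        raise ValueError("override columns have differing lengths")
    length = lengths.pop() if lengths else 0
    if num_rows is not None and num_rows != length:
        raise ValueError(f"num_rows={num_rows} does not match overrides length {length}")
    return length


# The same overrides in row layout and in column layout.
row_layout = [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]
column_layout = {"a": [1, 2], "b": ["x", "y"]}
```

Both layouts describe the same two rows, so either one implies a data frame of length 2, and passing num_rows=2 alongside them is consistent while any other value is rejected.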

Returns:

A data frame valid under the current schema with a number of rows that matches the length of the values in overrides or num_rows.

Raises:
  • ValueError – If num_rows is not equal to the length of the values in overrides.

  • ValueError – If overrides are specified as a sequence of mappings and the mappings do not provide the same keys.

  • ValueError – If no valid data frame can be found in the configured maximum number of iterations.

Attention

Be aware that, because this method samples in a loop, its runtime can be significant for complex schemas. Consider passing a seeded generator and evaluating whether the runtime impact on your tests is acceptable. Alternatively, it can help to provide overrides for columns associated with complex validation rules.