Schema.sample
- classmethod Schema.sample(
- num_rows: int | None = None,
- *,
- overrides: Mapping[str, Iterable[Any]] | Sequence[Mapping[str, Any]] | None = None,
- generator: Generator | None = None,
- )
Create a random data frame with a predefined number of rows.
Generally, this method should only be used for testing. If you want to generate _realistic_ test data, you will need to implement custom sampling logic (by making use of the Generator class).
In order to allow for sampling random data frames in the presence of custom rules and primary key constraints, this method performs fuzzy sampling: it samples in a loop until it finds a data frame of length num_rows that adheres to the schema. The maximum number of sampling rounds is configured via max_sampling_iterations in the Config class. If this setting is fixed to 1, sampling is only reliable for schemas without custom rules and without primary key constraints.
- Parameters:
num_rows – The (optional) number of rows to sample for creating the random data frame. Required only if no overrides are provided. If this is None, the number of rows in the data frame is determined by the length of the values in overrides.
overrides – Fixed values for a subset of the columns of the sampled data frame. Just like when initializing a polars.DataFrame, overrides may be provided in either “column” or “row” layout, i.e. via a mapping or a list of mappings, respectively. The number of rows in the result data frame is equal to the length of the values in overrides. If both overrides and num_rows are provided, the length of the values in overrides must be equal to num_rows. The order of the items is guaranteed to match the ordering in the returned data frame. When providing values for a column, no sampling is performed for that column.
generator – The (seeded) generator to use for sampling data. If None, a generator with a random seed is automatically created.
- Returns:
A data frame valid under the current schema with a number of rows that matches the length of the values in overrides or num_rows.
- Raises:
ValueError – If num_rows is not equal to the length of the values in overrides.
ValueError – If overrides are specified as a sequence of mappings and the mappings do not provide the same keys.
ValueError – If no valid data frame can be found in the configured maximum number of iterations.
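The interplay between num_rows and the two overrides layouts can be illustrated with a small, standard-library-only sketch. The helper resolve_overrides below is hypothetical and is not part of the library; it only mirrors the documented rules (row layout is converted to column layout, the row count is taken from the overrides, and mismatches raise ValueError). The behavior when neither argument is given, and when columns differ in length, is not stated above and is an assumption of this sketch.

```python
from __future__ import annotations

from collections.abc import Iterable, Mapping, Sequence
from typing import Any


def resolve_overrides(
    num_rows: int | None = None,
    overrides: Mapping[str, Iterable[Any]] | Sequence[Mapping[str, Any]] | None = None,
) -> tuple[int, dict[str, list[Any]]]:
    """Hypothetical helper: return (row count, overrides in column layout)."""
    if overrides is None:
        if num_rows is None:
            # Assumption: the documentation does not describe this case.
            raise ValueError("Either num_rows or overrides must be provided.")
        return num_rows, {}

    if isinstance(overrides, Mapping):
        # Column layout: one iterable of values per column name.
        columns = {name: list(values) for name, values in overrides.items()}
        lengths = {len(values) for values in columns.values()}
        if len(lengths) != 1:
            # Assumption: columns must be present and of equal length.
            raise ValueError("Column overrides must all have the same length.")
        length = lengths.pop()
    else:
        # Row layout: one mapping per row, all with identical keys.
        rows = list(overrides)
        keys = list(rows[0]) if rows else []
        if any(set(row) != set(keys) for row in rows):
            raise ValueError("All override rows must provide the same keys.")
        columns = {key: [row[key] for row in rows] for key in keys}
        length = len(rows)

    if num_rows is not None and num_rows != length:
        raise ValueError("num_rows must equal the length of the override values.")
    return length, columns
```

Both layouts resolve to the same column-oriented result, so `resolve_overrides(overrides=[{"a": 1}, {"a": 2}])` and `resolve_overrides(overrides={"a": [1, 2]})` return the same tuple, with the item order preserved.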
Attention
Be aware that, because it samples in a loop, this method can have a significant runtime for complex schemas. Consider passing a seeded generator and evaluating whether the runtime impact on your tests is acceptable. Alternatively, it can be beneficial to provide custom column overrides for columns associated with complex validation rules.
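To make the cost of loop-based sampling more concrete, here is a minimal sketch of the general idea using only the Python standard library. It is not the library's implementation: the function fuzzy_sample, the two-column row shape, the way valid rows are accumulated between rounds, and the default iteration limit are all assumptions made for illustration. The documentation above only states that sampling repeats until a valid data frame of length num_rows is found or the iteration limit is reached.

```python
from __future__ import annotations

import random
from typing import Callable


def fuzzy_sample(
    num_rows: int,
    rng: random.Random,
    is_valid: Callable[[dict[str, int]], bool],
    max_sampling_iterations: int = 1_000,  # hypothetical default
) -> list[dict[str, int]]:
    """Illustrative rejection-sampling loop with a primary key on "id"."""
    rows: list[dict[str, int]] = []
    seen_ids: set[int] = set()
    for _ in range(max_sampling_iterations):
        # Each round only samples as many candidates as are still missing.
        for _ in range(num_rows - len(rows)):
            candidate = {"id": rng.randrange(100), "value": rng.randrange(100)}
            # Keep the candidate only if it satisfies the custom rule
            # and does not violate the primary key constraint.
            if candidate["id"] not in seen_ids and is_valid(candidate):
                seen_ids.add(candidate["id"])
                rows.append(candidate)
        if len(rows) == num_rows:
            return rows
    raise ValueError("No valid sample found within the maximum number of iterations.")


# Usage: a seeded generator makes the result (and the runtime) reproducible.
sample = fuzzy_sample(20, random.Random(0), lambda row: row["value"] % 2 == 0)
```

In this sketch, stricter rules reject more candidates and therefore need more rounds, which is why runtime grows with schema complexity. Fixing the values of a heavily constrained column up front removes it from the rejection step entirely.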