Collection.sample

classmethod Collection.sample(
num_rows: int | None = None,
*,
overrides: Sequence[Mapping[str, Any]] | None = None,
generator: Generator | None = None,
) -> Self

Create a random sample from the members of this collection.

Just like sampling for schemas, this method should only be used for testing. Unlike schema sampling, the core difficulty when sampling data frames with related values is that the members must share primary keys while individual members may have different numbers of rows. For this reason, overrides passed to this method must be “row-oriented” (or “sample-oriented”): each override describes one sample across all members.

Parameters:
  • num_rows – The number of rows to sample for each member. If this is set to None, the number of rows is inferred from the length of the overrides.

  • overrides

    The overrides to set values in member schemas. The overrides must be provided as a list of samples. The structure of the samples must be as follows:

    {
        "<primary_key_1>": <value>,
        "<primary_key_2>": <value>,
        "<member_with_common_primary_key>": {
            "<column_1>": <value>,
            ...
        },
        "<member_with_superkey_of_primary_key>": [
            {
                "<column_1>": <value>,
                ...
            }
        ],
        ...
    }
    

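    As the placeholder names indicate, a member that has exactly the common primary key takes a single dictionary per sample, while a member whose primary key is a superkey of the common primary key takes a list of dictionaries. Any member or value can be left out and will be sampled automatically. Overrides for columns of members that are annotated with inline_for_sampling=True can be supplied at the top level instead of in a nested dictionary.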

  • generator – The (seeded) generator to use for sampling data. If None, a generator with a random seed is created automatically.
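To make the row-oriented override structure concrete, here is a minimal sketch in plain Python. The collection it imagines is hypothetical: a member `invoices` with primary key `invoice_id` and a member `line_items` with primary key (`invoice_id`, `line_no`). The helper `flatten_member` is not part of the library; it only shows how one list of samples maps onto rows of each member.

```python
from typing import Any

# Hypothetical collection layout (an assumption for illustration only):
#   invoices:   primary key (invoice_id)          -> one row per sample
#   line_items: primary key (invoice_id, line_no) -> many rows per sample
overrides: list[dict[str, Any]] = [
    {
        "invoice_id": 1,
        "invoices": {"customer": "alice"},
        "line_items": [
            {"line_no": 1, "amount": 10.0},
            {"line_no": 2, "amount": 5.5},
        ],
    },
    {
        "invoice_id": 2,
        # "invoices" is left out here; the library would sample its values.
        "line_items": [{"line_no": 1, "amount": 7.0}],
    },
]


def flatten_member(
    samples: list[dict[str, Any]], member: str, primary_key: list[str]
) -> list[dict[str, Any]]:
    """Illustrative helper: expand the provided overrides into rows of one member."""
    rows = []
    for sample in samples:
        # Shared primary key values sit at the top level of each sample.
        key = {col: sample[col] for col in primary_key if col in sample}
        value = sample.get(member, {})
        # A dictionary describes a single row, a list describes several rows.
        entries = value if isinstance(value, list) else [value]
        rows.extend({**key, **entry} for entry in entries)
    return rows


invoice_rows = flatten_member(overrides, "invoices", ["invoice_id"])
line_item_rows = flatten_member(overrides, "line_items", ["invoice_id"])
```

With a collection class of that shape, the list would be passed as `overrides=overrides`, and with `num_rows=None` the number of samples would be inferred from the length of the list (two in this sketch).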

Returns:

A collection where all members (including optional ones) have been sampled according to the input parameters.

Attention

If the collection has members with a common primary key, the _preprocess_sample() method must return distinct primary key values for each sample. The default implementation does this on a best-effort basis, so primary key violations can still occur. It is therefore recommended to override this method and ensure that all primary key columns are set.

Raises:
  • ValueError – If the _preprocess_sample() method does not return all common primary key columns for all samples.

  • ValidationError – If the sampled members violate any of the collection filters. If the collection does not have filters, this error is never raised. To prevent validation errors, override the _preprocess_sample() method appropriately.
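The recommendation to override _preprocess_sample() can be sketched as follows. This section does not give the method's signature, so the standalone function below and its `(sample, index)` parameters are assumptions, as is the key column `invoice_id`. The idea is to fill in a missing primary key from the position of the sample so that every sample carries a distinct value.

```python
from typing import Any


def preprocess_sample(sample: dict[str, Any], index: int) -> dict[str, Any]:
    """Hypothetical stand-in for an overridden _preprocess_sample().

    Assumes a single common primary key column named "invoice_id".
    """
    if "invoice_id" in sample:
        # Respect a key that the caller set explicitly in the overrides.
        return sample
    # Derive the key from the sample's position; positions are unique, so
    # samples without an explicit key never collide with one another.
    return {**sample, "invoice_id": index}


raw_samples: list[dict[str, Any]] = [{}, {"invoices": {"customer": "bob"}}, {}]
samples = [preprocess_sample(s, i) for i, s in enumerate(raw_samples)]
keys = [s["invoice_id"] for s in samples]
```

One caveat of this sketch: a key supplied explicitly in the overrides can still coincide with an index-derived key of another sample, so a real override may need to pick values from a range that callers do not use.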