Collection.scan_parquet#

classmethod Collection.scan_parquet(
directory: str | Path,
*,
validation: Literal['allow', 'forbid', 'warn', 'skip'] = 'warn',
**kwargs: Any,
) → Self[source]#

Lazily read all collection members from parquet files in a directory.

This method searches for files named <member>.parquet in the provided directory for all required and optional members of the collection.

Parameters:
  • directory – The directory where the Parquet files should be read from. Parquet files may have been written with Hive partitioning.

  • validation

    The strategy for running validation when reading the data:

    • "allow": The method tries to read the schema data from the parquet files. If the stored collection schema matches this collection schema, the collection is read without validation. If the stored schema mismatches this schema no metadata can be found in the parquets, or the files have conflicting metadata, this method automatically runs validate() with cast=True.

    • "warn": The method behaves similarly to "allow". However, it prints a warning if validation is necessary.

    • "forbid": The method never runs validation automatically and only returns if the metadata stores a collection schema that matches this collection.

    • "skip": The method never runs validation and simply reads the data, entrusting the user that the schema is valid. Use this option carefully.

  • kwargs – Additional keyword arguments passed directly to polars.scan_parquet() for all members.
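A minimal usage sketch follows. The schema and collection definitions (UserSchema, OrderSchema, ShopCollection) and the directory path are illustrative assumptions, not part of dataframely; any extra keyword arguments are forwarded to polars.scan_parquet() for every member:

```python
import dataframely as dy

# Hypothetical schemas and collection, for illustration only.
class UserSchema(dy.Schema):
    user_id = dy.Int64(primary_key=True)
    name = dy.String()

class OrderSchema(dy.Schema):
    order_id = dy.Int64(primary_key=True)
    user_id = dy.Int64()

class ShopCollection(dy.Collection):
    users: dy.LazyFrame[UserSchema]
    orders: dy.LazyFrame[OrderSchema]

# Lazily reads "users.parquet" and "orders.parquet" from the directory.
# The extra keyword argument (n_rows) is forwarded to polars.scan_parquet()
# for every member.
shop = ShopCollection.scan_parquet(
    "data/shop",
    validation="warn",
    n_rows=1_000,
)
```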

Returns:

The initialized collection.

Raises:
  • ValidationRequiredError – If no collection schema can be read from the directory and validation is set to "forbid".

  • ValueError – If the provided directory does not contain parquet files for all required members.

Note

Due to current limitations in dataframely, this method eagerly reads the parquet files into memory if validation is "warn" or "allow" and validation turns out to be required.
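If this eager fallback is undesirable, one option is to forbid implicit validation and handle the failure explicitly. A sketch, reusing the hypothetical ShopCollection from the example above and assuming ValidationRequiredError is importable from dataframely.exc:

```python
import dataframely as dy

try:
    # Succeeds only if the stored collection schema matches this collection;
    # the members stay lazy and nothing is read into memory at this point.
    shop = ShopCollection.scan_parquet("data/shop", validation="forbid")
except dy.exc.ValidationRequiredError:
    # Fall back to an explicit (eager) validation pass, or abort the pipeline.
    shop = ShopCollection.scan_parquet("data/shop", validation="allow")
```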

Note

This method is backward compatible with older versions of dataframely in which the schema metadata was saved to schema.json files instead of being encoded into the parquet files.

Attention

Be aware that this method suffers from the same limitations as serialize().