Primary keys#

Defining primary keys in Schema#

When working with tabular data, it is often useful to define a primary key. A primary key is a set of one or multiple columns, the combined values of which form a unique identifier for every record in a table.

Dataframely supports marking columns as part of the primary key when defining a Schema by setting primary_key=True on the respective column(s).

Note

Primary key columns must not be nullable. Starting in dataframely version 2, attempts to declare a nullable primary key column raise an error.

One-column primary keys#

For example, when managing data about users, we might use an id column to uniquely identify users:

class UserSchema(dy.Schema):
    id = dy.String(primary_key=True)
    name = dy.String()

When we later validate data with this schema, dataframely checks that the values of the primary key are unique, i.e. there are no two users with the same value of id. Having multiple users with the same name but different id is allowed in this case.

Composite primary keys#

In another scenario, we might be tracking line items on invoices. We have many invoices, and each invoice may contain any number of line items. To uniquely identify a line item, we need to specify the invoice, as well as the line item’s position within the invoice. To encode this, we set primary_key=True on both the invoice_id and item_id columns:

class LineItemSchema(dy.Schema):
    invoice_id = dy.Int64(primary_key=True)
    item_id = dy.Int64(primary_key=True)
    price = dy.Decimal()

Validation will now ensure that all pairs of (invoice_id, item_id) are unique.

Primary keys in Collection#

The central idea behind Collection is to unify multiple tables relating to the same set of underlying entities. This is useful because it allows us to write filter()s that use information from multiple tables to identify whether the underlying entity is valid or not. If any filter()s are defined, dataframely requires the tables in a Collection to have an overlapping primary key (i.e., there must be at least one column that is a primary key in all tables).