Skip to content

ehrQL tutorial: Another minimal dataset definition🔗

Danger

This page discusses the new OpenSAFELY Data Builder for accessing OpenSAFELY data sources.

Use OpenSAFELY cohort-extractor, unless you are specifically involved in the development or testing of Data Builder.

OpenSAFELY Data Builder and its documentation are still undergoing extensive development. We will announce when Data Builder is ready for general use on the Platform News page.

Example dataset definition 1b: Adding an extra output column from a minimal data source🔗

By the end of this tutorial, you should be able to:

  • retrieve additional data columns
  • explain what is involved in creating a Dataset

Full Example🔗

We will start with a similar dataset definition that we used in the previous example with the addition of an extra output column, for sex.

Dataset definition: 1b_minimal_dataset_definition.py
from databuilder.ehrql import Dataset
from databuilder.tables.examples.tutorial import patients

dataset = Dataset()

year_of_birth = patients.date_of_birth.year
dataset.define_population(year_of_birth >= 2000)

dataset.sex = patients.sex

If we run this against the sample data provided, it will pick out only patients who were born in 2000 or later, and will also provide information about sex as a new column.

Data table: minimal/patients.csv
patient_id date_of_birth sex
1 1980-05-01 M
2 2005-10-01 F
3 1946-01-01 M
4 1920-11-01 M
5 2010-04-01 M
6 1999-12-01 F
7 2000-01-01 M

In this case, we see patient 2, 5 and 7.

Output dataset: outputs/1b_minimal_dataset_definition.csv
patient_id sex
2 F
5 M
7 M

Line by line explanation🔗

Addition of Sex🔗

Only the last line is new.

Info

In Python code, like in many other languages, the dot operator, ., appears frequently.

Generally, . is used when you want to access something "belonging" to a value. That "something" is typically either a data attribute, or a method (a function). Methods will have brackets, and may accept arguments. Attributes will not have brackets.

sex is a data attribute that is available on the patient table and can be thought of as corresponding to the sex column.

Here, we can simply look at the raw data tables to see the available columns. For real OpenSAFELY data tables, this is not possible, but you can use the contracts documentation to discover which columns are available.

Your turn🔗

Run the dataset definition by:

opensafely exec databuilder:v0 generate-dataset "1b_minimal_dataset_definition.py" --dummy-tables "example-data/minimal/" --output "outputs.csv"

or if you are using project.yaml:

opensafely run extract_1b_minimal_population

Question

Can you modify the dataset definition so that the output shows:

  1. Both date_of_birth and sex columns?
  2. year_of_birth instead of date_of_birth?