Using Data Builder in an OpenSAFELY project

Danger

This page discusses the new OpenSAFELY Data Builder for accessing OpenSAFELY data sources.

Use OpenSAFELY cohort-extractor, unless you are specifically involved in the development or testing of Data Builder.

OpenSAFELY Data Builder and its documentation are still undergoing extensive development. We will announce when Data Builder is ready for general use on the Platform News page.

Todo

We could consider moving all the examples to be project based and covering some of the topics here earlier on.

More likely, however, we will favour opensafely exec and discuss only a simple project.yaml containing one dataset definition here.

The relationship between Data Builder and OpenSAFELY projects

Learning objectives

By the end of this tutorial, you should know how to:

Create an OpenSAFELY project that uses Data Builder.
Run that project to generate the dataset definition output.

Running Data Builder via an OpenSAFELY project

So far in this tutorial, we have run dataset definitions entirely via Data Builder.

This is fine for learning purposes in this tutorial. However, to run against an OpenSAFELY backend, we must create an OpenSAFELY project.

Creating and running an OpenSAFELY project takes three steps:

Create the dataset definition, as we have already covered in these tutorial examples.
Create an OpenSAFELY project that uses Data Builder, by writing a project.yaml file.
Use the OpenSAFELY CLI to run that project.yaml file.

Requirements

In addition to the previous requirements, you will also need the OpenSAFELY CLI installed.

The dataset definition we will work with

We will use a simple dataset definition that we have already seen.

Dataset definition: 1a_minimal_dataset_definition.py

from databuilder.ehrql import Dataset
from databuilder.tables.examples.tutorial import patients

dataset = Dataset()

year_of_birth = patients.date_of_birth.year
dataset.define_population(year_of_birth >= 2000)

The `minimal` data source

Data table: minimal/patients.csv

patient_id	date_of_birth	sex
1	1980-05-01	M
2	2005-10-01	F
3	1946-01-01	M
4	1920-11-01	M
5	2010-04-01	M
6	1999-12-01	F
7	2000-01-01	M
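
To see which patients the dataset definition selects from this table, the population condition can be checked by hand. The following is a plain-Python sketch, not Data Builder code: it inlines the table (comma-separated here) and applies the same "year of birth is 2000 or later" test. The names `PATIENTS_CSV` and `select_population` are illustrative and do not come from Data Builder.

```python
import csv
import io
from datetime import date

# The rows of minimal/patients.csv, inlined so the sketch is self-contained.
PATIENTS_CSV = """patient_id,date_of_birth,sex
1,1980-05-01,M
2,2005-10-01,F
3,1946-01-01,M
4,1920-11-01,M
5,2010-04-01,M
6,1999-12-01,F
7,2000-01-01,M
"""


def select_population(csv_text, min_year=2000):
    """Return the patient_ids whose year of birth is at least min_year."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        int(row["patient_id"])
        for row in reader
        if date.fromisoformat(row["date_of_birth"]).year >= min_year
    ]


print(select_population(PATIENTS_CSV))  # [2, 5, 7]
```

Patient 7 (born 2000-01-01) is included because the comparison is `>=`, while patient 6 (born 1999-12-01) is excluded.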

The `project.yaml`

A project.yaml file configures how analytic code is run for OpenSAFELY projects.

Using Data Builder in a project.yaml is much like using any other OpenSAFELY action.

Project pipeline: project.yaml

version: "3.0"

expectations:
  population_size: 3000

actions:
  extract_1a_minimal_population:
    run: databuilder:v0 generate-dataset "./1a_minimal_dataset_definition.py" --dummy-tables "example-data/minimal" --output "outputs/1a_minimal_dataset_definition.csv"
    outputs:
      highly_sensitive:
        population: outputs/1a_minimal_dataset_definition.csv

  extract_1b_minimal_population:
    run: databuilder:v0 generate-dataset "./1b_minimal_dataset_definition.py" --dummy-tables "example-data/minimal" --output "outputs/1b_minimal_dataset_definition.csv"
    outputs:
      highly_sensitive:
        population: outputs/1b_minimal_dataset_definition.csv

  extract_2a_multiple_population:
    run: databuilder:v0 generate-dataset "./2a_multiple_dataset_definition.py" --dummy-tables "example-data/multiple" --output "outputs/2a_multiple_dataset_definition.csv"
    outputs:
      highly_sensitive:
        population: outputs/2a_multiple_dataset_definition.csv

  extract_3a_multiple2_population:
    run: databuilder:v0 generate-dataset "./3a_multiple2_dataset_definition.py" --dummy-tables "example-data/multiple2" --output "outputs/3a_multiple2_dataset_definition.csv"
    outputs:
      highly_sensitive:
        population: outputs/3a_multiple2_dataset_definition.csv

  extract_3a1_multiple2_population:
    run: databuilder:v0 generate-dataset "./3a1_multiple2_dataset_definition.py" --dummy-tables "example-data/multiple2" --output "outputs/3a1_multiple2_dataset_definition.csv"
    outputs:
      highly_sensitive:
        population: outputs/3a1_multiple2_dataset_definition.csv

  extract_3a2_multiple2_population:
    run: databuilder:v0 generate-dataset "./3a2_multiple2_dataset_definition.py" --dummy-tables "example-data/multiple2" --output "outputs/3a2_multiple2_dataset_definition.csv"
    outputs:
      highly_sensitive:
        population: outputs/3a2_multiple2_dataset_definition.csv

  extract_4a_multiple2_population:
    run: databuilder:v0 generate-dataset "./4a_multiple2_dataset_definition.py" --dummy-tables "example-data/multiple2" --output "outputs/4a_multiple2_dataset_definition.csv"
    outputs:
      highly_sensitive:
        population: outputs/4a_multiple2_dataset_definition.csv

  extract_5a_multiple3_population:
    run: databuilder:v0 generate-dataset "./5a_multiple3_dataset_definition.py" --dummy-tables "example-data/multiple3" --output "outputs/5a_multiple3_dataset_definition.csv"
    outputs:
      highly_sensitive:
        population: outputs/5a_multiple3_dataset_definition.csv

  extract_6a_multiple4_population:
    run: databuilder:v0 generate-dataset "./6a_multiple4_dataset_definition.py" --dummy-tables "example-data/multiple4" --output "outputs/6a_multiple4_dataset_definition.csv"
    outputs:
      highly_sensitive:
        population: outputs/6a_multiple4_dataset_definition.csv
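
Every action above follows the same pattern: the run line differs only in the name of the dataset definition and the dummy data directory. The sketch below makes that pattern explicit by rebuilding the run line from those two values. `build_run_command` is a hypothetical helper written for this illustration; it is not part of the OpenSAFELY CLI or Data Builder.

```python
def build_run_command(definition_name, dummy_tables_dir):
    """Rebuild the `run:` line used by the actions in the example project.yaml.

    definition_name: dataset definition file name without the .py extension
    dummy_tables_dir: directory name under example-data/ holding the CSV tables
    """
    return (
        f'databuilder:v0 generate-dataset "./{definition_name}.py" '
        f'--dummy-tables "example-data/{dummy_tables_dir}" '
        f'--output "outputs/{definition_name}.csv"'
    )


# Reproduces the run line of the extract_1a_minimal_population action.
print(build_run_command("1a_minimal_dataset_definition", "minimal"))
```

Note that the path given to --output matches the path listed under outputs for the same action.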

Running the `project.yaml`

Running a project.yaml which contains a Data Builder action is much the same as for any other OpenSAFELY project.

Use opensafely run to run the project.yaml:

In your terminal, change directory to where you have the example project.yaml file.
Run opensafely run extract_1a_minimal_population
The OpenSAFELY CLI should run Data Builder with the dataset definition. You should then find the output at the relative path shown under outputs for that action in the project.yaml, which here is outputs/1a_minimal_dataset_definition.csv.

Tutorial exercises

Todo

What is required to create a new OpenSAFELY project? Do you need a Git repository configured?

Note

At the moment, the project.yaml contains all the dataset definitions, which renders these questions redundant. Could we include it as a snippet, or provide another project.yaml?

Question

Can you create an OpenSAFELY project for one of the other dataset definitions we have covered in these tutorials?
Try running that project and confirm that the outputs are the same as running the dataset definition directly with Data Builder.
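
One way to confirm that two outputs match is to compare the CSV files while ignoring row order. The following is a minimal sketch under that assumption; `read_csv_rows` and `same_output` are illustrative names, and the two temporary files (with made-up contents) only stand in for the output of the project run and the output of the direct run.

```python
import csv
import os
import tempfile


def read_csv_rows(path):
    """Return (header, sorted data rows) so that row order is ignored."""
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    if not rows:
        return [], []
    return rows[0], sorted(rows[1:])


def same_output(path_a, path_b):
    """True if both CSV files have the same header and the same rows, in any order."""
    return read_csv_rows(path_a) == read_csv_rows(path_b)


# Demonstration with two throwaway files standing in for the two outputs.
with tempfile.TemporaryDirectory() as tmp:
    via_project = os.path.join(tmp, "via_project.csv")
    direct = os.path.join(tmp, "direct.csv")
    with open(via_project, "w", newline="") as f:
        f.write("patient_id\n2\n5\n7\n")
    with open(direct, "w", newline="") as f:
        f.write("patient_id\n7\n2\n5\n")
    print(same_output(via_project, direct))  # True
```

To use it on real files, pass the path listed under outputs in the project.yaml and the path you chose when running the dataset definition directly.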