Using Data Builder in an OpenSAFELY project
Danger
This page discusses the new OpenSAFELY Data Builder for accessing OpenSAFELY data sources.
Use OpenSAFELY cohort-extractor, unless you are specifically involved in the development or testing of Data Builder.
OpenSAFELY Data Builder and its documentation are still undergoing extensive development. We will announce when Data Builder is ready for general use on the Platform News page.
Todo
We could consider making all the examples project-based and covering some of the topics here earlier on.
However, it is more likely that we will favour opensafely exec and only discuss a simple project.yaml containing one dataset definition here.
The relationship between Data Builder and OpenSAFELY projects
Learning objectives
By the end of this tutorial, you should know how to:
- Create an OpenSAFELY project that uses Data Builder.
- Run that project to generate the dataset definition output.
Running Data Builder via an OpenSAFELY project
So far in this tutorial, we have run dataset definitions directly with Data Builder.
This is fine for learning purposes. However, to run a dataset definition against an OpenSAFELY backend, we must create an OpenSAFELY project.
To create an OpenSAFELY project, there are three steps:
- Create the dataset definition, as we have already covered in these tutorial examples.
- Create an OpenSAFELY project that uses Data Builder, by writing a project.yaml file.
- Use the OpenSAFELY CLI to run that project.yaml file.
Requirements
In addition to the previous requirements, you will also need the OpenSAFELY CLI installed.
The dataset definition we will work with
We will use a simple dataset definition that we have already seen.
Dataset definition: 1a_minimal_dataset_definition.py
from databuilder.ehrql import Dataset
from databuilder.tables.examples.tutorial import patients
dataset = Dataset()
year_of_birth = patients.date_of_birth.year
dataset.define_population(year_of_birth >= 2000)
The minimal data source
Data table: minimal/patients.csv
patient_id | date_of_birth | sex |
---|---|---|
1 | 1980-05-01 | M |
2 | 2005-10-01 | F |
3 | 1946-01-01 | M |
4 | 1920-11-01 | M |
5 | 2010-04-01 | M |
6 | 1999-12-01 | F |
7 | 2000-01-01 | M |
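As a rough cross-check of what to expect, the population rule in the dataset definition can be mimicked in plain Python against the rows above. This sketch does not use Data Builder at all; it only re-applies the `year_of_birth >= 2000` condition. The helper name `expected_population` is hypothetical, and the comma-separated layout of the CSV text is an assumption about how `minimal/patients.csv` is stored.

```python
import csv
import io
from datetime import date

# The rows of minimal/patients.csv, copied from the table above.
# Assumption: the real file is comma-separated with these column names.
MINIMAL_PATIENTS_CSV = """\
patient_id,date_of_birth,sex
1,1980-05-01,M
2,2005-10-01,F
3,1946-01-01,M
4,1920-11-01,M
5,2010-04-01,M
6,1999-12-01,F
7,2000-01-01,M
"""


def expected_population(csv_text, min_year=2000):
    """Return the patient IDs whose year of birth is at least min_year."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        int(row["patient_id"])
        for row in reader
        if date.fromisoformat(row["date_of_birth"]).year >= min_year
    ]


print(expected_population(MINIMAL_PATIENTS_CSV))  # prints [2, 5, 7]
```

Patients 2, 5 and 7 are the only ones born in 2000 or later, so these are the patients we would expect to see in the generated dataset.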
The project.yaml
A project.yaml file configures how analytic code is run for OpenSAFELY projects.
Using Data Builder in a project.yaml is much like using any other OpenSAFELY action.
Project pipeline: project.yaml
version: "3.0"

expectations:
  population_size: 3000

actions:
  extract_1a_minimal_population:
    run: databuilder:v0 generate-dataset "./1a_minimal_dataset_definition.py" --dummy-tables "example-data/minimal" --output "outputs/1a_minimal_dataset_definition.csv"
    outputs:
      highly_sensitive:
        population: outputs/1a_minimal_dataset_definition.csv

  extract_1b_minimal_population:
    run: databuilder:v0 generate-dataset "./1b_minimal_dataset_definition.py" --dummy-tables "example-data/minimal" --output "outputs/1b_minimal_dataset_definition.csv"
    outputs:
      highly_sensitive:
        population: outputs/1b_minimal_dataset_definition.csv

  extract_2a_multiple_population:
    run: databuilder:v0 generate-dataset "./2a_multiple_dataset_definition.py" --dummy-tables "example-data/multiple" --output "outputs/2a_multiple_dataset_definition.csv"
    outputs:
      highly_sensitive:
        population: outputs/2a_multiple_dataset_definition.csv

  extract_3a_multiple2_population:
    run: databuilder:v0 generate-dataset "./3a_multiple2_dataset_definition.py" --dummy-tables "example-data/multiple2" --output "outputs/3a_multiple2_dataset_definition.csv"
    outputs:
      highly_sensitive:
        population: outputs/3a_multiple2_dataset_definition.csv

  extract_3a1_multiple2_population:
    run: databuilder:v0 generate-dataset "./3a1_multiple2_dataset_definition.py" --dummy-tables "example-data/multiple2" --output "outputs/3a1_multiple2_dataset_definition.csv"
    outputs:
      highly_sensitive:
        population: outputs/3a1_multiple2_dataset_definition.csv

  extract_3a2_multiple2_population:
    run: databuilder:v0 generate-dataset "./3a2_multiple2_dataset_definition.py" --dummy-tables "example-data/multiple2" --output "outputs/3a2_multiple2_dataset_definition.csv"
    outputs:
      highly_sensitive:
        population: outputs/3a2_multiple2_dataset_definition.csv

  extract_4a_multiple2_population:
    run: databuilder:v0 generate-dataset "./4a_multiple2_dataset_definition.py" --dummy-tables "example-data/multiple2" --output "outputs/4a_multiple2_dataset_definition.csv"
    outputs:
      highly_sensitive:
        population: outputs/4a_multiple2_dataset_definition.csv

  extract_5a_multiple3_population:
    run: databuilder:v0 generate-dataset "./5a_multiple3_dataset_definition.py" --dummy-tables "example-data/multiple3" --output "outputs/5a_multiple3_dataset_definition.csv"
    outputs:
      highly_sensitive:
        population: outputs/5a_multiple3_dataset_definition.csv

  extract_6a_multiple4_population:
    run: databuilder:v0 generate-dataset "./6a_multiple4_dataset_definition.py" --dummy-tables "example-data/multiple4" --output "outputs/6a_multiple4_dataset_definition.csv"
    outputs:
      highly_sensitive:
        population: outputs/6a_multiple4_dataset_definition.csv
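Every action in this project.yaml follows the same pattern, differing only in the dataset definition's file name prefix and the dummy data directory. As an illustration of that pattern (not a tool that OpenSAFELY provides), the sketch below rebuilds the first action in Python. The helper names `databuilder_action` and `render_action` are hypothetical, and the two-space indentation is an assumption about how the file is laid out.

```python
def databuilder_action(prefix, example_dir):
    """Build the action name and fields for one dataset definition.

    prefix is the dataset definition's file name prefix, e.g. "1a_minimal";
    example_dir is the dummy data directory, e.g. "example-data/minimal".
    """
    output = f"outputs/{prefix}_dataset_definition.csv"
    run = (
        "databuilder:v0 generate-dataset "
        f'"./{prefix}_dataset_definition.py" '
        f'--dummy-tables "{example_dir}" '
        f'--output "{output}"'
    )
    name = f"extract_{prefix}_population"
    fields = {"run": run, "outputs": {"highly_sensitive": {"population": output}}}
    return name, fields


def render_action(name, fields):
    """Render one action as indented YAML text, without a YAML library."""
    population = fields["outputs"]["highly_sensitive"]["population"]
    return "\n".join(
        [
            f"  {name}:",
            f"    run: {fields['run']}",
            "    outputs:",
            "      highly_sensitive:",
            f"        population: {population}",
        ]
    )


name, fields = databuilder_action("1a_minimal", "example-data/minimal")
print(render_action(name, fields))
```

The point to notice is that the path given to `--output` in the `run` line and the path listed under `outputs` are the same file.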
Running the project.yaml
Running a project.yaml that contains a Data Builder action is much the same as running any other OpenSAFELY project.
Use opensafely run to run the project.yaml:
- In your terminal, change directory to where you have the example project.yaml file.
- Run opensafely run extract_1a_minimal_population
- The OpenSAFELY CLI should run Data Builder with the dataset definition, and you should find the output at the relative path shown under outputs in the project.yaml.
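If you would rather drive this step from a script, the same command can be wrapped in Python. This is only a sketch: it assumes the `opensafely` command is on your PATH and that the script is run from the directory containing the project.yaml. The helper names `build_command` and `run_action` are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path

ACTION = "extract_1a_minimal_population"
# The output path listed for this action in the project.yaml.
OUTPUT = Path("outputs/1a_minimal_dataset_definition.csv")


def build_command(action):
    """Return the argument list for running one project action."""
    return ["opensafely", "run", action]


def run_action(action, output):
    """Run the action, then report whether the expected output file exists."""
    subprocess.run(build_command(action), check=True)
    return output.exists()


if __name__ == "__main__":
    if shutil.which("opensafely") is None:
        print("The opensafely command was not found on PATH.")
    else:
        print("Output found:", run_action(ACTION, OUTPUT))
```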
Tutorial exercises
Todo
What is required to create a new OpenSAFELY project? Do you need a Git repository configured?
Note
At the moment, the project.yaml contains all the dataset definitions, which renders these questions redundant.
Could we include it as a snippet, or provide another project.yaml?
Question
- Can you create an OpenSAFELY project for one of the other dataset definitions we have covered in these tutorials?
- Try running that project and confirm that the outputs are the same as running the dataset definition directly with Data Builder.
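One way to check the second exercise is to compare the two CSV files programmatically. The sketch below treats two outputs as the same if they have the same header and the same rows, ignoring row order; that tolerance is an assumption, so drop the sorting if you want a strict comparison. The function names and the example paths in the final comment are hypothetical.

```python
import csv
from pathlib import Path


def read_rows(path):
    """Read a CSV file into a header and a sorted list of data rows."""
    with Path(path).open(newline="") as f:
        rows = list(csv.reader(f))
    if not rows:
        return [], []
    header, data = rows[0], rows[1:]
    return header, sorted(data)


def same_dataset(path_a, path_b):
    """Return True if both files have the same header and the same rows."""
    return read_rows(path_a) == read_rows(path_b)


# Example usage, with hypothetical file locations:
# same_dataset("outputs/1b_minimal_dataset_definition.csv", "direct_run_output.csv")
```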