Writing a dataset definition
Danger
This page discusses the new OpenSAFELY Data Builder for accessing OpenSAFELY data sources.
Use OpenSAFELY cohort-extractor, unless you are specifically involved in the development or testing of Data Builder.
OpenSAFELY Data Builder and its documentation are still undergoing extensive development. We will announce when Data Builder is ready for general use on the Platform News page.
How is a dataset constructed?
The OpenSAFELY framework:
- Uses a single dataset definition to query different vendor-specific EHR databases or locally provided dummy data.
- Reads your dataset definition from the Python script (usually analysis/dataset_definition.py).
- Writes the output data frame in a tabular CSV file (usually output/input.csv).
- For queries to vendor databases, the results are stored on a secure server.
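To make the output format concrete, here is a minimal sketch of a tabular CSV with one row per patient and one column per variable. It uses only the Python standard library and is not Data Builder code; the rows, the column names, and the to_csv helper are hypothetical and chosen for illustration only.

```python
import csv
import io

# Hypothetical extracted rows: one dict per patient, one key per variable.
rows = [
    {"patient_id": 1, "code": "A"},
    {"patient_id": 3, "code": "C"},
]


def to_csv(rows, columns):
    """Serialise rows to CSV text with a header line (illustrative helper)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv(rows, ["patient_id", "code"])
print(csv_text)
```

In a real project the framework writes this file for you; the sketch only shows the shape of the result.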
What is a dataset definition?
A dataset definition is a formal specification of the data that you want to extract from the OpenSAFELY database. This includes:
- the patient population (dataset rows)
- the variables (dataset columns)
Dataset definitions are written in ehrQL, a language designed for OpenSAFELY. ehrQL runs on Python, but it is designed to be easily written, read, and reviewed by anyone with some epidemiological knowledge.
dataset_definition.py structure
Before writing a dataset definition, add Data Builder to your OpenSAFELY project.
Importing code building blocks
At the start of the dataset definition, we first import some code from the Data Builder package. Put the following code block at the top of your dataset_definition.py file:
# TODO: replace this when the new DSL is in
# from databuilder.dsl import Cohort
# from databuilder.tables import clinical_events, practice_registrations
Note
This is a simple example. You may need different imports depending on your dataset definition.
A simple example
The Cohort() class (imported above) is used to define both the patient population and the variables.
# TODO: replace this when the new DSL is in
# cohort = Cohort()
# index_date = "2020-11-01"
# registered = (
#     practice_registrations.filter(practice_registrations.date_start <= index_date)
#     .filter(practice_registrations.date_end >= index_date)
#     .exists_for_patient()
# )
# cohort.define_population(registered)
# cohort.code = (
#     clinical_events.sort_by(clinical_events.date)
#     .first_for_patient()
#     .select_column(clinical_events.code)
# )
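Because the example above is still a placeholder, the following plain-Python sketch may help show what it is meant to express: the population is every patient registered on the index date, and the code variable is the code of each patient's earliest clinical event. This is not Data Builder or ehrQL code. The dummy records and the helper names registered_patients and first_event_code are assumptions made purely for illustration.

```python
from datetime import date

# Hypothetical dummy data; real tables come from the data backend.
practice_registrations = [
    {"patient_id": 1, "date_start": date(2015, 1, 1), "date_end": date(2022, 1, 1)},
    {"patient_id": 2, "date_start": date(2021, 3, 1), "date_end": date(2023, 1, 1)},
    {"patient_id": 3, "date_start": date(2010, 6, 1), "date_end": date(2021, 6, 1)},
]
clinical_events = [
    {"patient_id": 1, "date": date(2020, 5, 2), "code": "B"},
    {"patient_id": 1, "date": date(2019, 2, 3), "code": "A"},
    {"patient_id": 3, "date": date(2018, 7, 9), "code": "C"},
]

index_date = date(2020, 11, 1)


def registered_patients(registrations, on_date):
    """Patients with at least one registration that spans on_date."""
    return {
        r["patient_id"]
        for r in registrations
        if r["date_start"] <= on_date <= r["date_end"]
    }


def first_event_code(events):
    """Map each patient to the code of their earliest event."""
    first = {}
    for event in sorted(events, key=lambda e: e["date"]):
        # setdefault keeps the first (earliest) code seen for each patient.
        first.setdefault(event["patient_id"], event["code"])
    return first


population = registered_patients(practice_registrations, index_date)
codes = first_event_code(clinical_events)

# One row per patient in the population, one column per variable.
dataset = [{"patient_id": p, "code": codes.get(p)} for p in sorted(population)]
print(dataset)
```

Patient 2 is excluded because their registration starts after the index date. A patient in the population with no clinical events would get an empty value for code rather than being dropped.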
How do dataset definitions work?
- Dataset definition: The researcher specifies the dataset they want in a dataset definition.
- Query transformation: The researcher then loads that dataset definition into Data Builder. Provided the dataset definition is valid, Data Builder transforms the dataset definition into an internal representation of the query called the query model.
- Query submission: Data Builder then translates the query model into the appropriate query language for the data backend being accessed. This means that the same dataset definition can be run against multiple backends which may have different structures or underlying software.
For a more in-depth technical explanation of how this works, see explaining the query engine.
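The idea of translating one backend-independent query into different query languages can be illustrated with a toy sketch. This is not Data Builder's actual query model or query engine: the FirstRows class, the translator functions, and the backend names are all hypothetical. The sketch only shows that a single internal representation can be rendered into SQL dialects that differ in syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FirstRows:
    """Toy query model node: the first n rows of a table, ordered by a column."""

    table: str
    order_by: str
    n: int


def to_tsql(q):
    # SQL Server (T-SQL) limits rows with TOP.
    return f"SELECT TOP {q.n} * FROM {q.table} ORDER BY {q.order_by}"


def to_sqlite(q):
    # SQLite limits rows with LIMIT.
    return f"SELECT * FROM {q.table} ORDER BY {q.order_by} LIMIT {q.n}"


# Hypothetical backend registry: one translator per backend.
BACKENDS = {"mssql": to_tsql, "sqlite": to_sqlite}


def translate(query, backend):
    """Render the same query model node for the chosen backend."""
    return BACKENDS[backend](query)


query = FirstRows(table="clinical_events", order_by="date", n=1)
for name in BACKENDS:
    print(name, "->", translate(query, name))
```

Keeping the query description separate from the translators is what lets the same definition target several backends: supporting a new backend means adding a translator, not rewriting the definition.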