ehrQL tutorial: Date handling🔗
Danger
This page discusses the new OpenSAFELY Data Builder for accessing OpenSAFELY data sources.
Use OpenSAFELY cohort-extractor, unless you are specifically involved in the development or testing of Data Builder.
OpenSAFELY Data Builder and its documentation are still undergoing extensive development. We will announce when Data Builder is ready for general use on the Platform News page.
Example dataset definition 4a: Date handling🔗
Learning objectives🔗
By the end of this tutorial, you should be able to:
- specify dates in dataset definitions
- explain date operations
- use the key word
where
in dataset definitions
Full Example🔗
Dataset definition: 4a_multiple2_dataset_definition.py
from databuilder.ehrql import Dataset, days
from databuilder.tables.examples.tutorial import hospitalisations, patients
dataset = Dataset()
start_date = "2020-03-01"
final_date = "2020-05-31"
date_of_birth = patients.date_of_birth
hospitalisations_in_range = hospitalisations.where(
hospitalisations.date.is_on_or_after(start_date)
& hospitalisations.date.is_before(final_date)
)
population = (date_of_birth < "2000-01-01") & (
hospitalisations_in_range.exists_for_patient()
)
dataset.define_population(population)
dataset.sex = patients.sex
dataset.year_of_birth = patients.date_of_birth.year
first_hospitalisation_in_range = hospitalisations_in_range.sort_by(
hospitalisations.date
).first_for_patient()
dataset.last_day_of_month_before_first_hospitalisation = (
first_hospitalisation_in_range.date.to_first_of_month() - days(1)
)
In this section, we will be building up even more complex dataset definitions, concentrating on date operations. We will be using two different tables:
patients
hospitalisations
For brevity, the tables will not be displayed here but can be reviewed in the example-data/multiple2/
folder.
The output of the query above should generate a table with the columns:
- sex
- year of birth
- last day of month before first hospitalisation as columns.
Output dataset: outputs/4a_multiple2_dataset_definition.csv
patient_id | sex | year_of_birth | last_day_of_month_before_first_hospitalisation |
---|---|---|---|
1 | M | 1980 | 2020-02-29 |
4 | M | 1920 | 2020-03-31 |
Line by line explanation🔗
This dataset definition finds the patients whose data meet all of the following conditions:
- born before the 1st of January, 2000.
- and hospitalised in a given date range.
For each of these patients, we extract sex, year of birth, and the last day of the month prior to when the patient was first hospitalised in that date range.
Import statements🔗
As in the previous tutorials, we import the tables that we wish to work with. In this case, we only need two tables: patients and hospitalisations.
Remember that hospitalisations
is an event level data table.
This means that it captures information at an event level of hospitalisation
where each row represents a hospitalisation episode.
Patients can have many hospitalisations.
This is in contrast to patient level tables such as patients where each row represents one patient.
Specifying dates in a dataset definition🔗
In this dataset definitions, we specify dates of interest with strings written in the ISO8601 format: YYYY-MM-DD.
We are creating a start date and an end date of the study.
We will use these dates with the hospitalisation
variable
to restrict the population to those with hospitalisations inclusive of this date range.
Hospitalisations within a range🔗
We select the hospitalisation rows that match a condition
with the where
method.
This restricts the hospitalisation data to data that matches the condition passed into the where()
function.
We are passing in a condition
that the date column in the hospitalisation table is on or after the start date created above,
and is before the end date created above.
This creates a subset of the hospitalisation data.
Info
Instead of using start_date
,
we could specify 2020-03-01
.
However, by providing a specific named variable,
it helps to both explain what the value means
and allow the value to be reused.
Define the population🔗
Now we have the hospitalisation subset of data, we can use this to create our dataset. Just as with the date of birth examples in previous tutorials, we can also perform basic comparisons on dates to check equality or to determine if one date is before or after another.
With string provided dates,
we can use Python's standard comparison operators like: <
, >=
, !=
.
We are using a special function called exists_for_patient()
to restrict the population.
This checks if a row exists for a patient and if it does it passes back True
.
In this case we are restricting the population by only those who have been hospitalised,
indicated by their presence as rows in the hospitalisation subset.
Extracting components of data🔗
Next, we restrict the population to patients who meet our criteria.
We then add sex
and year_of_birth
to the dataset.
As we have done in previous dataset definitions, and we do here, we can extract individual components — year, month, day — of a date.
First hospitalisation in range🔗
The aim here is to find the last date of the month that immediately proceeded the first hospitalisation of a patient.
First of all, we need to find the first date of hospitalisation. We have already created the subset of hospitalisation in range as a variable for use in setting the population. We can reuse this subset to find the first date of hospitalisation.
We do this by sorting the hospitalisation for each patient by date.
Remember that this is event level data
so patients can have many hospitalisations and hence rows of data.
We then use first_for_patient()
to take the first of these values.
This means we will take the first date.
Now we need to find the preceding month and the last date of this month. The easiest way to do this is to get the first of the month for the hospitalisation date and then subtract 1 day to get the last day of the preceding month. This allows us to account for varying lengths of months, rather than setting it at 28th of the month.
To do this, we:
- Add the variable to the dataset with
dataset.last_day_of_month_before_first_hospitalisation
- Round the date down to the first of the month (
to_first_of_month()
) - Subtract 1 day (
subtract_days(1)
) to get the last date of the proceeding month.
This function subtract_days()
is an example of the type of date operations that can be carried out on the dates.
Your Turn🔗
Run the dataset definition.
Question
- Can you modify the date conditions in this dataset definition to include both dates in the range, instead of excluding the specified end date? See the ehrQL reference for the operations and methods on dates.
- Can you modify the dataset definition to find the date seven days after hospitalisation?
- Can you modify the dataset definition to find the 15th of the month of hospitalisation?
- Can you add a new column to get the date of the last hospitalisation for each patient?