Skip to content

ehrQL tutorial: Filtering and aggregation🔗

Danger

This page discusses the new OpenSAFELY Data Builder for accessing OpenSAFELY data sources.

Use OpenSAFELY cohort-extractor, unless you are specifically involved in the development or testing of Data Builder.

OpenSAFELY Data Builder and its documentation are still undergoing extensive development. We will announce when Data Builder is ready for general use on the Platform News page.

Example dataset definition 6a: Filtering and aggregation🔗

Learning objectives🔗

By the end of this tutorial, you should be able to:

  • explain how to filter rows
  • implement filtering of rows
  • combine filters
  • implement aggregation of values

In this tutorial, we will develop more complex queries and learn how to combine different filters into one statement. We also learn what sort of inbuild aggregation of values are available within ehrQL.

Full example🔗

Dataset definition: 6a_multiple4_dataset_definition.py
from databuilder.ehrql import Dataset
from databuilder.tables.examples.tutorial import clinical_events, patients

dataset = Dataset()

tutorial_code_system_events = clinical_events.except_where(
    clinical_events.system == "AnotherCodeSystem"
).where(clinical_events.system == "TutorialCodeSystem")

minimum_h1_threshold = 200.0
start_date_of_interest = "2004-01-01"
end_date_of_interest = "2005-12-31"

high_code_h1_events = tutorial_code_system_events.where(
    (tutorial_code_system_events.code == "h1")
    & (tutorial_code_system_events.numeric_value > minimum_h1_threshold)
    & (tutorial_code_system_events.date >= start_date_of_interest)
    & (tutorial_code_system_events.date <= end_date_of_interest)
)

count_of_high_code_h1_events = high_code_h1_events.count_for_patient()
maximum_h1_value = high_code_h1_events.numeric_value.maximum_for_patient()

population = high_code_h1_events.exists_for_patient()
dataset.define_population(population)

dataset.date_of_birth = patients.date_of_birth
dataset.h1_count = count_of_high_code_h1_events
dataset.h1_max = maximum_h1_value

In this section, we will use two different tables: patients and clinical_events. patients is, as before, a patient-level table meaning that each row in the table represents one patient, and patients can only appear in the table once.

clinical_events is another event-level table. This means that a row is an event like a diagnosis. Patients can have multiple events and therefore multiple rows in the table.

For brevity, the tables will not be displayed here but can be reviewed in the example-data/multiple4/ folder.

The output of the query above should generate the table below:

Output dataset: outputs/6a_multiple4_dataset_definition.csv
patient_id date_of_birth h1_count h1_max
1 1998-05-06 2 355
2 1995-03-04 2 455
6 1938-07-05 1 675

Line by line explanation🔗

In this dataset definition, we select details of patients who:

  • have had a particular clinical event code recorded
  • with an associated numeric value higher than a given threshold
  • within a specified date range

We then extract:

  • the patient's date of birth
  • the maximum numeric value recorded for the patient for the specified clinical event code
  • the number of matching clinical events that exceed the given threshold

Filtering clinical events🔗

We create a variable called tutorial_code_system_events. This filters the clinical events table to include only events that belong to a coding system called TutorialCodesystem.

The where() and except_where() methods allow filtering of table rows:

  • where() specifies rows that you wish to include
  • except_where() specifies rows that you wish to exclude

Both where() and except_where() require an expression inside their parentheses that evaluates to a Boolean True or False for each row.

In previous tutorials, we have used where(). In this example, we are going to use except_where to exclude rows for AnotherCodingSystem. Rows that result in a True value for this expression then have the filter applied in the result.

Filter by h1 events🔗

Now we can apply a further filter to generate a new variable called high_code_h1_events. In this filter, we filter by 4 conditions:

  1. code equal h1
  2. numeric_value is greater than 200
  3. date is after start date of interest
  4. date is before end date of interest

We combine these with & which means AND.

Aggregation of values🔗

We can perform simple aggregations per patient and we have already seen some of these such as exists_for_patient().

To our dataset, we use some of the simple numerical aggregations.

First, we add the number of relevant matching clinical events to the dataset, by counting the events with count_for_patient(). and we find the highest value recorded in those clinical events by using maximum_for_patient().

Your Turn🔗

Question

  1. In this dataset definition, we initially filtered all of the clinical events to those using the TutorialCodeSystem code system. How would we rewrite that same selection to use a single except_where()?
  2. How would we find the sum of the numeric values of the m1 clinical events for each patient within the same date range already specified? Refer to the ehrQL reference.
  3. As the dataset definition shows, we can use combine multiple filters using where() and except_where() in different ways. Either we can specify multiple conditions to a single where() or except_where(). Or we can chain multiple where() and except_where() methods. The output of each method is a frame.

    Todo

    Or series? Check! Does this work on series, or just frames?

    You may find either way to express the same process useful: it may make your dataset definition either clearer or more consise to read.

    Can you rewrite the where() with multiple conditions to be a series of chained where() methods? Refer to the ehrQL reference.

    Todo

    In general, do we want to suggest particular ehrQL idioms?