Executing and inspecting BIDS Stats Models with pybids.modeling

A minimalistic tutorial illustrating usage of the tools in the bids.modeling module—most notably, BIDSStatsModelsGraph and its various components.

%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
from bids.modeling import BIDSStatsModelsGraph
from bids.layout import BIDSLayout
from bids.tests import get_test_data_path
from os.path import join
import json

Load BIDSLayout

Standard stuff: load the BIDSLayout object (we’ll use the ds005 dataset bundled with PyBIDS tests) and read in one of the included model files (ds005/models/ds-005_type-test_model.json).

layout_path = join(get_test_data_path(), 'ds005')
layout = BIDSLayout(layout_path)
json_file = join(layout_path, 'models', 'ds-005_type-test_model.json')
with open(json_file) as f:
    spec = json.load(f)

Initialize the graph

The BIDSStatsModelsGraph is the primary interface to design information, model construction, and so on. We build it from a BIDSLayout and the contents of a JSON model file.

graph = BIDSStatsModelsGraph(layout, spec)
/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/bids/modeling/statsmodels.py:63: UserWarning: [Node run]:Transformations reformatted to {'transformer': 'pybids-transforms-v1', 'instructions': [{'name': 'Factor', 'input': 'trial_type'}, {'name': 'Rename', 'input': 'trial_type.parametric gain', 'output': 'gain'}]}
  warnings.warn(f"[Node {node['name']}]:"

Load variables from BIDS dataset

We will typically want to load variables into BIDS “collections” from the BIDS dataset. The scan_length argument is only necessary here because the test dataset doesn’t actually include NIfTI image headers, so scan durations can’t be inferred. In typical use, you can call this without arguments.

graph.load_collections(scan_length=480)

Access specific nodes

Now that the graph is loaded, we can start interacting with its nodes. We’ll typically start with the root node, which will usually contain the run-level model information. We can either access the .root_node attribute, or use get_node() to retrieve a node by its unique name (as defined in the JSON file).

# Equivalent to calling graph.get_node('run') in this case.
root_node = graph.root_node
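
As a quick check (not part of the original tutorial), looking the node up by name should hand back the same object:

# get_node() retrieves a node by the name given in the JSON spec;
# for this model, that should be the same node as graph.root_node.
assert graph.get_node('run') is root_node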

Inspect BIDSVariableCollection

We can take a look at the original variables available to each node prior to running it (i.e., before any transformations are applied).

colls = root_node.get_collections()
first_sub = colls[0] # Collection for the first subject / run 
first_sub.to_df(entities=False)
onset duration PTval RT distance from indifference gain loss parametric gain parametric loss respcat respnum trial_type
0 0 3 5.15 0.000 NaN 20 15 -0.139 NaN -1 0 parametric gain
1 4 3 6.12 1.793 NaN 18 12 -0.189 NaN 1 2 parametric gain
2 8 3 -4.85 1.637 NaN 10 15 -0.389 NaN 0 3 parametric gain
3 18 3 18.16 1.316 NaN 34 16 0.211 NaN 1 1 parametric gain
4 24 3 13.05 1.670 NaN 18 5 -0.189 NaN 1 1 parametric gain
... ... ... ... ... ... ... ... ... ... ... ... ...
81 458 3 -1.82 1.785 NaN 16 18 -0.239 NaN 0 4 parametric gain
82 462 3 24.12 1.280 NaN 36 12 0.261 NaN 1 2 parametric gain
83 466 3 17.09 1.394 NaN 26 9 0.011 NaN 1 1 parametric gain
84 470 3 8.12 1.249 NaN 20 12 -0.139 NaN 1 2 parametric gain
85 474 3 10.14 1.266 NaN 24 14 -0.039 NaN 1 2 parametric gain

86 rows × 12 columns

Note that by default to_df combines both sparse and dense variables.

We can take a look at the individual variable objects as well:

first_sub.variables
{'trial_type': <SparseRunVariable(name='trial_type', source='events')>,
 'loss': <SparseRunVariable(name='loss', source='events')>,
 'RT': <SparseRunVariable(name='RT', source='events')>,
 'distance from indifference': <SparseRunVariable(name='distance from indifference', source='events')>,
 'respcat': <SparseRunVariable(name='respcat', source='events')>,
 'parametric loss': <SparseRunVariable(name='parametric loss', source='events')>,
 'gain': <SparseRunVariable(name='gain', source='events')>,
 'PTval': <SparseRunVariable(name='PTval', source='events')>,
 'respnum': <SparseRunVariable(name='respnum', source='events')>,
 'parametric gain': <SparseRunVariable(name='parametric gain', source='events')>}
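
If you need to distinguish sparse from dense variables programmatically rather than by reading the repr, a simple isinstance check works. This is a small sketch, not part of the original tutorial:

from bids.variables import SparseRunVariable

# All variables in this collection come from events files, so we expect
# every one of them to be a SparseRunVariable.
[name for name, var in first_sub.variables.items()
 if isinstance(var, SparseRunVariable)]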

Executing a node

We can “run” each node to produce design matrices for each unique combination of entities/inputs we want. This is determined by the grouping structure, which, when working with the API directly, is indicated via the group_by argument to the node’s .run() method. In this case, we’re indicating that the design information should be set up separately for every unique combination of run and subject.

Note that we need to include subject even though this is a strictly run-level model: if we only passed ['run'], there would be just 3 separate analyses, one for run 1, one for run 2, and one for run 3. What we actually want is for the procedure to be carried out separately for each unique combination of the 3 runs and 16 subjects (i.e., 48 times).

# force_dense controls whether the output for run-level design matrices
# will be resampled to a uniform "dense" representation, or left alone
# as a sparse representation if possible.
outputs = root_node.run(group_by=['run', 'subject'], force_dense=False, transformation_history=True)

Node outputs

The result is a list of objects of type BIDSStatsModelsNodeOutput. This is a lightweight container that stores the design matrix along with various other useful pieces of information. There should be one element in the list for each unique combination of the grouping variables (in this case, run and subject):

len(outputs)
48
# The first 10 `BIDSStatsModelsNodeOutput` objects
outputs[:10]
[<BIDSStatsModelsNodeOutput(name=run, entities={'run': 1, 'subject': '01'})>,
 <BIDSStatsModelsNodeOutput(name=run, entities={'run': 2, 'subject': '01'})>,
 <BIDSStatsModelsNodeOutput(name=run, entities={'run': 3, 'subject': '01'})>,
 <BIDSStatsModelsNodeOutput(name=run, entities={'run': 1, 'subject': '02'})>,
 <BIDSStatsModelsNodeOutput(name=run, entities={'run': 2, 'subject': '02'})>,
 <BIDSStatsModelsNodeOutput(name=run, entities={'run': 3, 'subject': '02'})>,
 <BIDSStatsModelsNodeOutput(name=run, entities={'run': 1, 'subject': '03'})>,
 <BIDSStatsModelsNodeOutput(name=run, entities={'run': 2, 'subject': '03'})>,
 <BIDSStatsModelsNodeOutput(name=run, entities={'run': 3, 'subject': '03'})>,
 <BIDSStatsModelsNodeOutput(name=run, entities={'run': 1, 'subject': '04'})>]

Let’s take a closer look at the BIDSStatsModelsNodeOutput. First, we have an .entities attribute that tells us what levels of the grouping variables this output corresponds to:

outputs[0].entities
{'run': 1, 'subject': '01'}

Next, we can get the design matrix itself via the .X attribute:

outputs[0].X
RT gain RT:gain
0 0.000 1.0 0.000
1 1.793 1.0 1.793
2 1.637 1.0 1.637
3 1.316 1.0 1.316
4 1.670 1.0 1.670
... ... ... ...
81 1.785 1.0 1.785
82 1.280 1.0 1.280
83 1.394 1.0 1.394
84 1.249 1.0 1.249
85 1.266 1.0 1.266

86 rows × 3 columns

Here we have a column for each of the design variables named in the model (which, in this case, share their names with the contrasts). We can access the contrasts too, via (you guessed it) .contrasts:

outputs[0].contrasts
[ContrastInfo(name='RT:gain', conditions=['RT:gain'], weights=[1], test='t', entities={'run': 1, 'subject': '01', 'contrast': 'RT:gain'}),
 ContrastInfo(name='gain', conditions=['gain'], weights=[1], test='t', entities={'run': 1, 'subject': '01', 'contrast': 'gain'}),
 ContrastInfo(name='RT', conditions=['RT'], weights=[1], test='t', entities={'run': 1, 'subject': '01', 'contrast': 'RT'})]

Each ContrastInfo is a named tuple with fields that map directly onto the definition of contrasts in the BIDS-StatsModels specification. The only addition is an .entities field that stores a dictionary of the entities identifying what subset of our data the contrast applies to.
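
For example, we can pull the individual fields off the first contrast directly (this is plain named-tuple attribute access, nothing pybids-specific):

ci = outputs[0].contrasts[0]
# Fields defined by the BIDS-StatsModels spec, plus the extra .entities dict.
ci.name, ci.conditions, ci.weights, ci.test, ci.entities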

One thing you might be puzzled by, looking at the output of the .X call above, is the absence of any timing information. .X is supposed to give us a design matrix, but how come the output only has the actual values for each column, and no information about event timing? How are we supposed to know what the onsets, durations, etc. of each event are?

The answer is that .X contains only the actual values that go into the design matrix, and not any metadata—no matter how important. Fortunately, that metadata is available to us. It’s conveniently stored in a .metadata attribute on the BIDSStatsModelsNodeOutput object.

outputs[0].metadata
datatype duration onset run subject suffix task
0 func 3 0 1 01 bold mixedgamblestask
1 func 3 4 1 01 bold mixedgamblestask
2 func 3 8 1 01 bold mixedgamblestask
3 func 3 18 1 01 bold mixedgamblestask
4 func 3 24 1 01 bold mixedgamblestask
... ... ... ... ... ... ... ...
81 func 3 458 1 01 bold mixedgamblestask
82 func 3 462 1 01 bold mixedgamblestask
83 func 3 466 1 01 bold mixedgamblestask
84 func 3 470 1 01 bold mixedgamblestask
85 func 3 474 1 01 bold mixedgamblestask

86 rows × 7 columns

There’s a 1-to-1 mapping from rows in .X to rows in .metadata. This means you can, if you like, simply concatenate the two along the column axis to get one big DataFrame with everything. But keeping them separate by default makes it very clear which columns are properly part of the design and which contain additional metadata.
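
For instance, here is what that column-wise concatenation looks like (we use the same pattern again near the end of this tutorial):

# One DataFrame containing the design columns (RT, gain, RT:gain)
# side by side with the timing and entity metadata.
pd.concat([outputs[0].X, outputs[0].metadata], axis=1)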

Transformation History

To generate the final design matrix, pybids applies all transformations specified in the model for that given run.

However, it’s often useful to look at the intermediate outputs of each transformation to perform sanity checks, such as previewing a variable prior to convolution.

Optionally, you can run the node with transformation_history=True (as we did above), and the BIDSStatsModelsNodeOutput object will have a trans_hist attribute: a list of the intermediate collections produced after each transformation, starting with the original, untransformed collection.

ts = outputs[0].trans_hist
ts
[TransformationOutput(index=0, output=<BIDSRunVariableCollection['PTval', 'RT', 'distance from indifference', 'gain', 'loss', 'parametric gain', 'parametric loss', 'respcat', 'respnum', 'trial_type']>, transformation_name=None, transformation_kwargs=None, input_cols=None, level=None),
 TransformationOutput(index=1, output=<BIDSRunVariableCollection['PTval', 'RT', 'distance from indifference', 'gain', 'loss', 'parametric gain', 'parametric loss', 'respcat', 'respnum', 'trial_type.parametric gain']>, transformation_name='Factor', transformation_kwargs={}, input_cols='trial_type', level='run'),
 TransformationOutput(index=2, output=<BIDSRunVariableCollection['PTval', 'RT', 'distance from indifference', 'gain', 'loss', 'parametric gain', 'parametric loss', 'respcat', 'respnum']>, transformation_name='Rename', transformation_kwargs={'output': 'gain'}, input_cols='trial_type.parametric gain', level='run')]

The first item (index=0) is the original collection. We can access the BIDSRunVariableCollection using the output attribute:

ts[0].output
<BIDSRunVariableCollection['PTval', 'RT', 'distance from indifference', 'gain', 'loss', 'parametric gain', 'parametric loss', 'respcat', 'respnum', 'trial_type']>

We can examine the full collection at each step either as a combined DataFrame:

ts[1].output.to_df(entities=False)
onset duration PTval RT distance from indifference gain loss parametric gain parametric loss respcat respnum trial_type.parametric gain
0 0 3 5.15 0.000 NaN 20.0 15.0 -0.139 NaN -1.0 0.0 1.0
1 4 3 6.12 1.793 NaN 18.0 12.0 -0.189 NaN 1.0 2.0 1.0
2 8 3 -4.85 1.637 NaN 10.0 15.0 -0.389 NaN 0.0 3.0 1.0
3 18 3 18.16 1.316 NaN 34.0 16.0 0.211 NaN 1.0 1.0 1.0
4 24 3 13.05 1.670 NaN 18.0 5.0 -0.189 NaN 1.0 1.0 1.0
... ... ... ... ... ... ... ... ... ... ... ... ...
81 458 3 -1.82 1.785 NaN 16.0 18.0 -0.239 NaN 0.0 4.0 1.0
82 462 3 24.12 1.280 NaN 36.0 12.0 0.261 NaN 1.0 2.0 1.0
83 466 3 17.09 1.394 NaN 26.0 9.0 0.011 NaN 1.0 1.0 1.0
84 470 3 8.12 1.249 NaN 20.0 12.0 -0.139 NaN 1.0 2.0 1.0
85 474 3 10.14 1.266 NaN 24.0 14.0 -0.039 NaN 1.0 2.0 1.0

86 rows × 12 columns

Or as individual variables in their native representation (e.g., SparseRunVariable or DenseRunVariable):

ts[1].output.variables
{'loss': <SparseRunVariable(name='loss', source='events')>,
 'RT': <SparseRunVariable(name='RT', source='events')>,
 'distance from indifference': <SparseRunVariable(name='distance from indifference', source='events')>,
 'respcat': <SparseRunVariable(name='respcat', source='events')>,
 'parametric loss': <SparseRunVariable(name='parametric loss', source='events')>,
 'gain': <SparseRunVariable(name='gain', source='events')>,
 'PTval': <SparseRunVariable(name='PTval', source='events')>,
 'respnum': <SparseRunVariable(name='respnum', source='events')>,
 'parametric gain': <SparseRunVariable(name='parametric gain', source='events')>,
 'trial_type.parametric gain': <SparseRunVariable(name='trial_type.parametric gain', source='events')>}
ts[1].output.variables['RT'].to_df(entities=False)
amplitude onset duration condition
0 0.000 0 3 RT
1 1.793 4 3 RT
2 1.637 8 3 RT
3 1.316 18 3 RT
4 1.670 24 3 RT
... ... ... ... ...
81 1.785 458 3 RT
82 1.280 462 3 RT
83 1.394 466 3 RT
84 1.249 470 3 RT
85 1.266 474 3 RT

86 rows × 4 columns

Traversing the graph

So far we’ve executed the root node, which by definition required no inputs from any previous node. But in typical workflows, we’ll be passing outputs from one node in as inputs to another. For example, we often want to take the run-level parameter estimates and pass them to a subject-level model that does nothing but average over runs within each subject. This requires us to somehow traverse the graph based on the edges specified in the BIDS-StatsModel document. We can do that by taking advantage of each node’s .children attribute, which contains a list of BIDSStatsModelsEdge named tuples that specify an edge between two nodes.

root_node.children
[BIDSStatsModelsEdge(source=<BIDSStatsModelsNode(level=run, name=run)>, destination=<BIDSStatsModelsNode(level=subject, name=participant)>, filter={})]

In this case the root node connects to only one other node. We can directly access that node by following the .destination property of the edge:

next_node = root_node.children[0].destination
next_node.level, next_node.name
('subject', 'participant')

We assign the first connected node to next_node, and print out its level and name for inspection (the level is subject and the name is participant).

Passing in inputs

Armed with that, we can run the subject-level node and proceed as before. However, there’s a twist: whereas the root node only needed to know about variables loaded directly from the BIDSLayout (which we achieved with that .load_collections() call earlier), the subject-level node can’t get the inputs it needs from the BIDSLayout, because there aren’t any (at least in this particular dataset). What we want to do at the subject level is average over our run-level estimates within each subject. But to do that, we need to actually pass in information about those runs!

The way we do this is to pass, as the first argument to .run(), a list of ContrastInfo objects informing our node about what inputs it should use to construct its design matrices. The typical usage pattern is to pass a single concatenated list containing all of the outputs from the previous level that we want to pass on. Note that we may not want to pass all of the outputs forward. For example, suppose that 2 out of 48 run-level models failed during the estimation process. We might not want to keep passing information about those 2 runs forward, since we couldn’t compute them. So we can always filter the list of ContrastInfo objects we received from the previous node before we pass them on to the next node. (We could also do other things, like renaming each ContrastInfo to use whatever naming scheme our software prefers, modifying the entities, and so on. But we won’t do any of that here.)
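
As a sketch of that kind of filtering (not something this dataset actually requires), we could exclude the contrasts from a hypothetical failed run by checking the .entities on each ContrastInfo; the subject/run values below are purely illustrative:

from itertools import chain

# Pretend the run-level model failed for run 3 of subject '16'.
all_contrasts = list(chain(*[o.contrasts for o in outputs]))
kept = [c for c in all_contrasts
        if not (c.entities['subject'] == '16' and c.entities['run'] == 3)]
len(all_contrasts), len(kept)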

Let’s concatenate the 48 outputs we got from the previous level and drop the last 2, in preparation for passing them forward to our next_node:

from itertools import chain
contrasts = list(chain(*[s.contrasts for s in outputs[:-2]]))
len(contrasts)
138

Notice that we’re left with 138 individual ContrastInfo objects. Why 138? Because we have ((3 runs x 16 subjects) - 2) * 3 contrasts. Recall that we’re dropping the last two of the 48 outputs from the previous level; each of the remaining 46 outputs contributes three ContrastInfo objects (one per contrast: RT, gain, and the RT:gain interaction). Hence, 138.
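
A quick sanity check on that arithmetic:

# (3 runs * 16 subjects - 2 dropped outputs) * 3 contrasts per output
(3 * 16 - 2) * 3
138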

Now we can call .run() on our subject-level node, passing in the contrasts as inputs. We want the model specification (i.e., the part in the "Model" section of the node) to be applied separately to each unique combination of subject and contrast.

sess_outputs = next_node.run(contrasts, group_by=['subject', 'contrast'])
len(sess_outputs)
48

Again we get back a list of BIDSStatsModelsNodeOutput objects. And again we have 48 of them. It might seem odd that we have the same number of outputs from a subject-level node as we had from the run-level node, but there’s actually a difference. In the run-level case, our 48 results reflected 16 subjects x 3 runs. In the subject-level case, we have successfully aggregated over runs within each subject, but we now have 3 sets of contrasts producing outputs (i.e., 16 subjects x 3 contrasts).
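
As a small check (assuming each output’s .entities carries both grouping keys), we can count the unique subjects and contrast names across the outputs:

# Expect 16 unique subjects and 3 unique contrast names (16 * 3 = 48 outputs).
subjects = {o.entities['subject'] for o in sess_outputs}
contrast_names = {o.entities['contrast'] for o in sess_outputs}
len(subjects), len(contrast_names)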

This becomes clearer if we inspect the same attributes we looked at earlier:

sess_outputs[0].contrasts
[ContrastInfo(name='RT:gain', conditions=['intercept'], weights=[1], test='FEMA', entities={'contrast': 'RT:gain', 'subject': '01'}),
 ContrastInfo(name='RT:gain_neg', conditions=['intercept'], weights=[-1], test='FEMA', entities={'contrast': 'RT:gain_neg', 'subject': '01'})]
# Concatenate the X and metadata DFs for easy reading.
# Note that only the first column is actually part of the
# design matrix; the others are just metadata.
pd.concat([sess_outputs[0].X, sess_outputs[0].metadata], axis=1)
intercept contrast run subject
0 1 RT:gain 1 01
1 1 RT:gain 2 01
2 1 RT:gain 3 01

Notice how the entities differ: the run-level node grouped on run and subject; the subject-level node groups on subject and contrast. The number of outputs is identical in both cases, but this is just an (un)happy accident, not a general principle. You can verify this for yourself by re-running the subject-level node with a different grouping (e.g., only ['subject']).
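
For example (a sketch, not run in the original tutorial), grouping only on subject should pool all three contrasts into a single design per subject, yielding 16 outputs rather than 48:

# Re-run the same node, grouping only on the subject entity.
alt_outputs = next_node.run(contrasts, group_by=['subject'])
len(alt_outputs)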