Executing and inspecting StatsModels with pybids.modeling¶
A minimalistic tutorial illustrating usage of the tools in the bids.modeling module, most notably BIDSStatsModelsGraph and its various components.
%load_ext autoreload
%autoreload 2
import pandas as pd
import numpy as np
from bids.modeling import BIDSStatsModelsGraph
from bids.layout import BIDSLayout
from bids.tests import get_test_data_path
from os.path import join
import json
Load BIDSLayout¶
Standard stuff: load the BIDSLayout object (we’ll use the ds005 dataset bundled with PyBIDS tests) and read in one of the included model files (ds005/models/ds-005_type-test_model.json).
layout_path = join(get_test_data_path(), 'ds005')
layout = BIDSLayout(layout_path)
json_file = join(layout_path, 'models', 'ds-005_type-test_model.json')
spec = json.load(open(json_file, 'r'))
Initialize the graph¶
The BIDSStatsModelsGraph is the primary interface to design data, model construction, etc. We build it from a BIDSLayout and the contents of a JSON model file.
graph = BIDSStatsModelsGraph(layout, spec)
/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/bids/modeling/statsmodels.py:63: UserWarning: [Node run]:Transformations reformatted to {'transformer': 'pybids-transforms-v1', 'instructions': [{'name': 'Factor', 'input': 'trial_type'}, {'name': 'Rename', 'input': 'trial_type.parametric gain', 'output': 'gain'}]}
warnings.warn(f"[Node {node['name']}]:"
Load variables from BIDS dataset¶
We will typically want to load variables into BIDS “collections” from the BIDS project. The scan_length argument is only necessary here because the test dataset doesn’t actually include NIfTI image headers, so the duration of the scans can’t be inferred. In typical use, you can call this without arguments.
graph.load_collections(scan_length=480)
Access specific nodes¶
Now that the graph is loaded, we can start interacting with its nodes. We’ll typically start with the root node, which will usually contain the run-level model information. We can either access the .root_node attribute, or use get_node() to get the node by its unique name (defined in the JSON file).
# Equivalent to calling graph.get_node('run') in this case.
root_node = graph.root_node
Inspect BIDSVariableCollection¶
We can take a look at the original variables available to each node prior to running the node (i.e., before any transformations are applied):
colls = root_node.get_collections()
first_sub = colls[0] # Collection for the first subject / run
first_sub.to_df(entities=False)
|   | onset | duration | PTval | RT | distance from indifference | gain | loss | parametric gain | parametric loss | respcat | respnum | trial_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 5.15 | 0.000 | NaN | 20 | 15 | -0.139 | NaN | -1 | 0 | parametric gain |
| 1 | 4 | 3 | 6.12 | 1.793 | NaN | 18 | 12 | -0.189 | NaN | 1 | 2 | parametric gain |
| 2 | 8 | 3 | -4.85 | 1.637 | NaN | 10 | 15 | -0.389 | NaN | 0 | 3 | parametric gain |
| 3 | 18 | 3 | 18.16 | 1.316 | NaN | 34 | 16 | 0.211 | NaN | 1 | 1 | parametric gain |
| 4 | 24 | 3 | 13.05 | 1.670 | NaN | 18 | 5 | -0.189 | NaN | 1 | 1 | parametric gain |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 81 | 458 | 3 | -1.82 | 1.785 | NaN | 16 | 18 | -0.239 | NaN | 0 | 4 | parametric gain |
| 82 | 462 | 3 | 24.12 | 1.280 | NaN | 36 | 12 | 0.261 | NaN | 1 | 2 | parametric gain |
| 83 | 466 | 3 | 17.09 | 1.394 | NaN | 26 | 9 | 0.011 | NaN | 1 | 1 | parametric gain |
| 84 | 470 | 3 | 8.12 | 1.249 | NaN | 20 | 12 | -0.139 | NaN | 1 | 2 | parametric gain |
| 85 | 474 | 3 | 10.14 | 1.266 | NaN | 24 | 14 | -0.039 | NaN | 1 | 2 | parametric gain |
86 rows × 12 columns
Note that by default to_df combines both sparse and dense variables.
We can take a look at the individual variable objects as well:
first_sub.variables
{'trial_type': <SparseRunVariable(name='trial_type', source='events')>,
'loss': <SparseRunVariable(name='loss', source='events')>,
'RT': <SparseRunVariable(name='RT', source='events')>,
'distance from indifference': <SparseRunVariable(name='distance from indifference', source='events')>,
'respcat': <SparseRunVariable(name='respcat', source='events')>,
'parametric loss': <SparseRunVariable(name='parametric loss', source='events')>,
'gain': <SparseRunVariable(name='gain', source='events')>,
'PTval': <SparseRunVariable(name='PTval', source='events')>,
'respnum': <SparseRunVariable(name='respnum', source='events')>,
'parametric gain': <SparseRunVariable(name='parametric gain', source='events')>}
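Individual variable objects can also be dumped to long-format DataFrames on their own (the same .to_df() method is used later on transformation outputs); for example:
# Event-level view of a single sparse variable: amplitude, onset, duration
first_sub.variables['RT'].to_df(entities=False).head()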
Executing a node¶
We can “run” each node to produce design matrices for each unique combination of entities/inputs we want. This is determined by the grouping structure. When working with the API directly, this is indicated via the group_by argument to the node’s .run() method. In this case, we’re indicating that the design information should be set up separately for every unique combination of run and subject.
Note that we need to include subject even though this is a strictly run-level model, because if we only pass ['run'], there will only be 3 separate analyses conducted: one for run 1, one for run 2, and one for run 3. What we actually want is for the procedure to be carried out separately for each unique combination of the 3 runs and 16 subjects (i.e., 48 times). A quick sketch of the ['run']-only grouping appears below.
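As a sanity check (a sketch, not executed in this tutorial), grouping only by run should yield just 3 outputs, each pooling events across all 16 subjects:
# Hedged check: grouping by run alone pools all subjects into each design,
# so we'd expect len(run_only) == 3
run_only = root_node.run(group_by=['run'], force_dense=False)
len(run_only)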
# force_dense controls whether the output for run-level design matrices
# will be resampled to a uniform "dense" representation, or left alone
# as a sparse representation if possible.
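# transformation_history=True additionally stores the intermediate variable
# collections produced by each transformation (see the Transformation History
# section below).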
outputs = root_node.run(group_by=['run', 'subject'], force_dense=False, transformation_history=True)
Node outputs¶
The result is a list of objects of type BIDSStatsModelsNodeOutput. This is a lightweight container that stores the design information along with various other useful details. There should be one element in the list for each unique combination of the grouping variables (in this case, run and subject):
len(outputs)
48
# The first 10 `BIDSStatsModelsNodeOutput` objects
outputs[:10]
[<BIDSStatsModelsNodeOutput(name=run, entities={'run': 1, 'subject': '01'})>,
<BIDSStatsModelsNodeOutput(name=run, entities={'run': 2, 'subject': '01'})>,
<BIDSStatsModelsNodeOutput(name=run, entities={'run': 3, 'subject': '01'})>,
<BIDSStatsModelsNodeOutput(name=run, entities={'run': 1, 'subject': '02'})>,
<BIDSStatsModelsNodeOutput(name=run, entities={'run': 2, 'subject': '02'})>,
<BIDSStatsModelsNodeOutput(name=run, entities={'run': 3, 'subject': '02'})>,
<BIDSStatsModelsNodeOutput(name=run, entities={'run': 1, 'subject': '03'})>,
<BIDSStatsModelsNodeOutput(name=run, entities={'run': 2, 'subject': '03'})>,
<BIDSStatsModelsNodeOutput(name=run, entities={'run': 3, 'subject': '03'})>,
<BIDSStatsModelsNodeOutput(name=run, entities={'run': 1, 'subject': '04'})>]
Let’s take a closer look at the BIDSStatsModelsNodeOutput. First, we have an .entities attribute that tells us what levels of the grouping variables this output corresponds to:
outputs[0].entities
{'run': 1, 'subject': '01'}
Next, we can get the design matrix itself via the .X attribute:
outputs[0].X
|   | RT | gain | RT:gain |
|---|---|---|---|
| 0 | 0.000 | 1.0 | 0.000 |
| 1 | 1.793 | 1.0 | 1.793 |
| 2 | 1.637 | 1.0 | 1.637 |
| 3 | 1.316 | 1.0 | 1.316 |
| 4 | 1.670 | 1.0 | 1.670 |
| ... | ... | ... | ... |
| 81 | 1.785 | 1.0 | 1.785 |
| 82 | 1.280 | 1.0 | 1.280 |
| 83 | 1.394 | 1.0 | 1.394 |
| 84 | 1.249 | 1.0 | 1.249 |
| 85 | 1.266 | 1.0 | 1.266 |
86 rows × 3 columns
Here we have a column for each of the contrasts specified in the model (with the same name as the contrast). We can access the contrasts too, via (you guessed it) the .contrasts attribute:
outputs[0].contrasts
[ContrastInfo(name='RT:gain', conditions=['RT:gain'], weights=[1], test='t', entities={'run': 1, 'subject': '01', 'contrast': 'RT:gain'}),
ContrastInfo(name='gain', conditions=['gain'], weights=[1], test='t', entities={'run': 1, 'subject': '01', 'contrast': 'gain'}),
ContrastInfo(name='RT', conditions=['RT'], weights=[1], test='t', entities={'run': 1, 'subject': '01', 'contrast': 'RT'})]
Each ContrastInfo is a named tuple with fields that map directly onto the definition of contrasts in the BIDS-StatsModels specification. The only addition is an .entities field that stores a dictionary of the entities identifying what subset of our data the contrast applies to.
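For example, we can pull the individual fields out of the first ContrastInfo (using the field names visible in the output above):
# Named-tuple fields mirroring the BIDS-StatsModels contrast definition
ci = outputs[0].contrasts[0]
ci.name, ci.conditions, ci.weights, ci.test, ci.entities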
One thing you might be puzzled by, looking at the output of the .X call above, is the absence of any timing information. .X is supposed to give us a design matrix, but the output only contains the actual values for each column, with no information about event timing. How are we supposed to know what the onsets, durations, etc. of each event are?
The answer is that .X contains only the actual values that go into the design matrix, and not any metadata, no matter how important. Fortunately, that metadata is available to us: it’s conveniently stored in a .metadata attribute on the BIDSStatsModelsNodeOutput object.
outputs[0].metadata
|   | datatype | duration | onset | run | subject | suffix | task |
|---|---|---|---|---|---|---|---|
| 0 | func | 3 | 0 | 1 | 01 | bold | mixedgamblestask |
| 1 | func | 3 | 4 | 1 | 01 | bold | mixedgamblestask |
| 2 | func | 3 | 8 | 1 | 01 | bold | mixedgamblestask |
| 3 | func | 3 | 18 | 1 | 01 | bold | mixedgamblestask |
| 4 | func | 3 | 24 | 1 | 01 | bold | mixedgamblestask |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 81 | func | 3 | 458 | 1 | 01 | bold | mixedgamblestask |
| 82 | func | 3 | 462 | 1 | 01 | bold | mixedgamblestask |
| 83 | func | 3 | 466 | 1 | 01 | bold | mixedgamblestask |
| 84 | func | 3 | 470 | 1 | 01 | bold | mixedgamblestask |
| 85 | func | 3 | 474 | 1 | 01 | bold | mixedgamblestask |
86 rows × 7 columns
There’s a 1-to-1 mapping from rows in .X to rows in .metadata. This means you can, if you like, simply concatenate the two along the column axis to get one big DataFrame with everything (a sketch follows below). But by keeping them separate by default, it stays clear which columns are properly part of the design and which contain additional metadata.
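For example, a quick way to view the run-level design alongside its metadata (the same pattern is used again at the subject level later on):
# Rows are aligned 1-to-1, so a column-wise concat is safe
pd.concat([outputs[0].X, outputs[0].metadata], axis=1).head()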
Transformation History¶
To generate the final design matrix, pybids applies all transformations specified in the model for the given run. However, it’s often useful to look at the intermediate outputs of each transformation in order to perform sanity checks, such as previewing a variable prior to convolution.
Optionally, you can run the node with transformation_history=True, and the BIDSStatsModelsNodeOutput object will have a trans_hist attribute, which is a list of the intermediate outputs after every transformation:
ts = outputs[0].trans_hist
ts
[TransformationOutput(index=0, output=<BIDSRunVariableCollection['PTval', 'RT', 'distance from indifference', 'gain', 'loss', 'parametric gain', 'parametric loss', 'respcat', 'respnum', 'trial_type']>, transformation_name=None, transformation_kwargs=None, input_cols=None, level=None),
TransformationOutput(index=1, output=<BIDSRunVariableCollection['PTval', 'RT', 'distance from indifference', 'gain', 'loss', 'parametric gain', 'parametric loss', 'respcat', 'respnum', 'trial_type.parametric gain']>, transformation_name='Factor', transformation_kwargs={}, input_cols='trial_type', level='run'),
TransformationOutput(index=2, output=<BIDSRunVariableCollection['PTval', 'RT', 'distance from indifference', 'gain', 'loss', 'parametric gain', 'parametric loss', 'respcat', 'respnum']>, transformation_name='Rename', transformation_kwargs={'output': 'gain'}, input_cols='trial_type.parametric gain', level='run')]
The first item (index=0) is the original collection, before any transformations. We can access the BIDSRunVariableCollection using the .output attribute:
ts[0].output
<BIDSRunVariableCollection['PTval', 'RT', 'distance from indifference', 'gain', 'loss', 'parametric gain', 'parametric loss', 'respcat', 'respnum', 'trial_type']>
We can examine the full collection at each step using either a combined dataframe:
ts[1].output.to_df(entities=False)
|   | onset | duration | PTval | RT | distance from indifference | gain | loss | parametric gain | parametric loss | respcat | respnum | trial_type.parametric gain |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 5.15 | 0.000 | NaN | 20.0 | 15.0 | -0.139 | NaN | -1.0 | 0.0 | 1.0 |
| 1 | 4 | 3 | 6.12 | 1.793 | NaN | 18.0 | 12.0 | -0.189 | NaN | 1.0 | 2.0 | 1.0 |
| 2 | 8 | 3 | -4.85 | 1.637 | NaN | 10.0 | 15.0 | -0.389 | NaN | 0.0 | 3.0 | 1.0 |
| 3 | 18 | 3 | 18.16 | 1.316 | NaN | 34.0 | 16.0 | 0.211 | NaN | 1.0 | 1.0 | 1.0 |
| 4 | 24 | 3 | 13.05 | 1.670 | NaN | 18.0 | 5.0 | -0.189 | NaN | 1.0 | 1.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 81 | 458 | 3 | -1.82 | 1.785 | NaN | 16.0 | 18.0 | -0.239 | NaN | 0.0 | 4.0 | 1.0 |
| 82 | 462 | 3 | 24.12 | 1.280 | NaN | 36.0 | 12.0 | 0.261 | NaN | 1.0 | 2.0 | 1.0 |
| 83 | 466 | 3 | 17.09 | 1.394 | NaN | 26.0 | 9.0 | 0.011 | NaN | 1.0 | 1.0 | 1.0 |
| 84 | 470 | 3 | 8.12 | 1.249 | NaN | 20.0 | 12.0 | -0.139 | NaN | 1.0 | 2.0 | 1.0 |
| 85 | 474 | 3 | 10.14 | 1.266 | NaN | 24.0 | 14.0 | -0.039 | NaN | 1.0 | 2.0 | 1.0 |
86 rows × 12 columns
Or examine individual variables in their native representation (e.g., SparseRunVariable or DenseRunVariable):
ts[1].output.variables
{'loss': <SparseRunVariable(name='loss', source='events')>,
'RT': <SparseRunVariable(name='RT', source='events')>,
'distance from indifference': <SparseRunVariable(name='distance from indifference', source='events')>,
'respcat': <SparseRunVariable(name='respcat', source='events')>,
'parametric loss': <SparseRunVariable(name='parametric loss', source='events')>,
'gain': <SparseRunVariable(name='gain', source='events')>,
'PTval': <SparseRunVariable(name='PTval', source='events')>,
'respnum': <SparseRunVariable(name='respnum', source='events')>,
'parametric gain': <SparseRunVariable(name='parametric gain', source='events')>,
'trial_type.parametric gain': <SparseRunVariable(name='trial_type.parametric gain', source='events')>}
ts[1].output.variables['RT'].to_df(entities=False)
|   | amplitude | onset | duration | condition |
|---|---|---|---|---|
| 0 | 0.000 | 0 | 3 | RT |
| 1 | 1.793 | 4 | 3 | RT |
| 2 | 1.637 | 8 | 3 | RT |
| 3 | 1.316 | 18 | 3 | RT |
| 4 | 1.670 | 24 | 3 | RT |
| ... | ... | ... | ... | ... |
| 81 | 1.785 | 458 | 3 | RT |
| 82 | 1.280 | 462 | 3 | RT |
| 83 | 1.394 | 466 | 3 | RT |
| 84 | 1.249 | 470 | 3 | RT |
| 85 | 1.266 | 474 | 3 | RT |
86 rows × 4 columns
Traversing the graph¶
So far we’ve executed the root node, which by definition requires no inputs from any previous node. But in typical workflows, we’ll be passing the outputs of one node in as inputs to another. For example, we often want to take the run-level parameter estimates and pass them to a subject-level model that does nothing but average over runs within each subject. This requires us to somehow traverse the graph based on the edges specified in the BIDS-StatsModels document. We can do that by taking advantage of each node’s .children attribute, which contains a list of BIDSStatsModelsEdge named tuples, each specifying an edge between two nodes.
root_node.children
[BIDSStatsModelsEdge(source=<BIDSStatsModelsNode(level=run, name=run)>, destination=<BIDSStatsModelsNode(level=subject, name=participant)>, filter={})]
In this case the root node connects to only one other node. We can directly access that node by following the .destination property of the edge:
next_node = root_node.children[0].destination
next_node.level, next_node.name
('subject', 'participant')
We assign the first connected node to next_node, and print out its level and name for inspection (the level is subject and the name is participant).
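As an aside, since every node exposes .children and every edge exposes .destination, we can write a small helper (hypothetical, not part of pybids) that walks the whole graph from the root and prints each node:
# Hypothetical helper: recursively print every node reachable from the root
def walk(node, depth=0):
    print('  ' * depth + f"{node.name} (level={node.level})")
    for edge in node.children:
        walk(edge.destination, depth + 1)
walk(graph.root_node)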
Passing in inputs¶
Armed with that, we can run the subject-level node and proceed as before. However, there’s a twist: whereas the root node only needs to know about variables loaded directly from the BIDSLayout (which we achieved with that .load_collections() call earlier), the subject-level node can’t get the inputs it needs from the BIDSLayout, because there aren’t any (at least in this particular dataset). What we want to do at the subject level is average over our run-level estimates within each subject. But to do that, we need to actually pass in information about those runs!
The way we do this is to pass, as the first argument to .run(), a list of ContrastInfo objects informing our node about what inputs it should use to construct its design matrices. The typical use pattern is to pass one concatenated list containing all of the outputs from the previous level that we want to pass on. Note that we may not want to pass all of the outputs forward. For example, suppose that 2 out of our 48 run-level models failed during the estimation process. We might not want to keep passing information about those 2 runs forward, as we can’t compute them. So we can always filter the list of ContrastInfo objects we received from the previous node before we pass them on to the next one (a hypothetical filtering example appears a bit further below). We could also do other things, like renaming each ContrastInfo to use whatever naming scheme our software prefers, modifying the entities, and so on, but we won’t do any of that here.
Let’s drop the last 2 of the 48 outputs we got from the previous level and concatenate the contrasts from the rest, in preparation for passing them forward to our next_node:
from itertools import chain
contrasts = list(chain(*[s.contrasts for s in outputs[:-2]]))
len(contrasts)
138
Notice that we’re left with 138 individual ContrastInfo objects. Why 138? Because we have ((3 runs x 16 subjects) - 2) * 3 contrasts: recall that we’re dropping the last two of the 48 outputs from the previous level, and each remaining output contributes three ContrastInfo objects (one per contrast: RT, gain, and the RT:gain interaction). Hence, 138.
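As an aside, filtering such a list by name or entities is just a list comprehension. A hypothetical example (not used below): keeping only the RT contrasts should leave one ContrastInfo per remaining run-level output, i.e. 138 / 3 = 46.
# Hypothetical filter: keep only the 'RT' contrasts
rt_only = [c for c in contrasts if c.name == 'RT']
len(rt_only)  # we'd expect 46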
Now we can call .run() on our subject-level node, passing in the contrasts as inputs. We want the model specification (i.e., the part in the "Model" section of the node) to be applied separately to each unique combination of subject and contrast.
sess_outputs = next_node.run(contrasts, group_by=['subject', 'contrast'])
len(sess_outputs)
48
Again we get back a list of BIDSStatsModelsNodeOutput objects. And again we have 48 of them. It might seem odd that we have the same number of outputs from a subject-level node as we had from the run-level node, but there’s actually a difference. In the run-level case, our 48 results reflected 16 subjects x 3 runs. In the subject-level case, we have successfully aggregated over runs within each subject, but we now have 3 sets of contrasts producing outputs (i.e., 16 subjects x 3 contrasts).
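We can confirm this structure directly from the output entities (assuming .entities is populated by the grouping variables here, just as it was at the run level):
# Count outputs per contrast; we'd expect each contrast to appear 16 times
from collections import Counter
Counter(out.entities['contrast'] for out in sess_outputs)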
This becomes clearer if we inspect the same attributes we looked at earlier:
sess_outputs[0].contrasts
[ContrastInfo(name='RT:gain', conditions=['intercept'], weights=[1], test='FEMA', entities={'contrast': 'RT:gain', 'subject': '01'}),
ContrastInfo(name='RT:gain_neg', conditions=['intercept'], weights=[-1], test='FEMA', entities={'contrast': 'RT:gain_neg', 'subject': '01'})]
# Concatenate the X and metadata DFs for easy reading.
# Note that only the first column is actually part of the
# design matrix; the others are just metadata.
pd.concat([sess_outputs[0].X, sess_outputs[0].metadata], axis=1)
|   | intercept | contrast | run | subject |
|---|---|---|---|---|
| 0 | 1 | RT:gain | 1 | 01 |
| 1 | 1 | RT:gain | 2 | 01 |
| 2 | 1 | RT:gain | 3 | 01 |
Notice how the entities differ: the run-level node grouped on run and subject; the subject-level node groups on subject and contrast. The number of outputs is identical in both cases, but this is just an (un)happy accident, not a general principle. You can verify this for yourself by re-running the subject-level node with a different grouping (e.g., only ['subject']), as sketched below.
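For example (a sketch, not executed above), grouping only by subject should collapse the three contrasts into a single design per subject, yielding 16 outputs:
# Hedged check: one output per subject, with all three contrasts entering each design
by_subject = next_node.run(contrasts, group_by=['subject'])
len(by_subject)  # we'd expect 16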