---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.14.1
kernelspec:
  display_name: Python 3 (ipykernel)
  language: python
  name: python3
---

# Introduction to `pybids`

[`pybids`](https://github.com/bids-standard/pybids) is a tool to query, summarize and manipulate data using the BIDS standard.
In this tutorial we will use a `pybids` test dataset to illustrate some of the functionality of `pybids.layout`.

```{code-cell} ipython3
from bids import BIDSLayout
from bids.tests import get_test_data_path
import os
```

## The `BIDSLayout`

At the core of pybids is the `BIDSLayout` object. A `BIDSLayout` is a lightweight Python class that represents a BIDS project file tree and provides a variety of helpful methods for querying and manipulating BIDS files. While the `BIDSLayout` initializer has a large number of arguments you can use to control the way files are indexed and accessed, you will most commonly initialize a `BIDSLayout` by passing in the BIDS dataset root location as a single argument:

```{code-cell} ipython3
# Here we're using an example BIDS dataset that's bundled with the pybids tests
data_path = os.path.join(get_test_data_path(), '7t_trt')

# Initialize the layout
layout = BIDSLayout(data_path)

# Print some basic information about the layout
layout
```

### Querying the `BIDSLayout`

When we initialize a `BIDSLayout`, all of the files and metadata found under the specified root folder are indexed. This can take a few seconds (or, for very large datasets, a minute or two). Once initialization is complete, we can start querying the `BIDSLayout` in various ways. The workhorse method is [`.get()`](https://bids-standard.github.io/pybids/generated/bids.grabbids.BIDSLayout.html#bids.grabbids.BIDSLayout.get).
If we call `.get()` with no additional arguments, we get back a list of all the BIDS files in our dataset:

```{code-cell} ipython3
all_files = layout.get()
print("There are {} files in the layout.".format(len(all_files)))
print("\nThe first 10 files are:")
all_files[:10]
```

The returned object is a Python list. By default, each element in the list is a `BIDSFile` object. We discuss the `BIDSFile` object in much more detail below. For now, let's simplify things and work with just filenames:

```{code-cell} ipython3
layout.get(return_type='filename')[:10]
```

This time, we get back only the names of the files.

### Filtering files by entities

The utility of the `BIDSLayout` would be pretty limited if all we could do was retrieve a list of all files in the dataset. Fortunately, the `.get()` method accepts all kinds of arguments that allow us to filter the result set based on specified criteria. In fact, we can pass *any* BIDS-defined keywords (or, as they're called in PyBIDS, *entities*) as constraints. For example, here's how we would retrieve all BOLD runs with `.nii.gz` extensions for subject `'01'`:

```{code-cell} ipython3
# Retrieve filenames of all BOLD runs for subject 01
layout.get(subject='01', extension='nii.gz', suffix='bold', return_type='filename')
```

If you're wondering what entities you can pass in as filtering arguments, the answer is contained in the `.json` configuration files [housed here](https://github.com/bids-standard/pybids/tree/master/bids/layout/config). To save you the trouble, here are a few of the most common entities:

* `suffix`: The part of a BIDS filename just before the extension (e.g., `'bold'`, `'events'`, `'physio'`, etc.).
* `subject`: The subject label
* `session`: The session label
* `run`: The run index
* `task`: The task name

New entities are continually being defined as the spec grows, and in principle (though not always in practice), PyBIDS should be aware of all entities that are defined in the BIDS specification.
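
To make the idea of entities more concrete, here is a minimal sketch of how key-value pairs can be read off a BIDS-style filename using only the standard library. The function name `sketch_parse_entities` and the small `SHORT_TO_LONG` mapping are hypothetical and exist only for illustration; PyBIDS itself derives entities from the patterns in its JSON configuration files, not from this code.

```python
# Hypothetical mapping from filename keys to entity names (illustration only;
# it covers just a handful of the entities listed above).
SHORT_TO_LONG = {
    'sub': 'subject',
    'ses': 'session',
    'task': 'task',
    'run': 'run',
    'acq': 'acquisition',
}


def sketch_parse_entities(filename):
    """Split a BIDS-style filename into a dict of entities (simplified sketch)."""
    # Keep only the final path component
    name = filename.split('/')[-1]
    # Everything after the first dot is treated as the extension
    stem, _, ext = name.partition('.')
    parts = stem.split('_')

    entities = {}
    # All parts except the last are expected to look like "key-value"
    for part in parts[:-1]:
        key, sep, value = part.partition('-')
        if sep:
            # Values are left as strings in this sketch
            entities[SHORT_TO_LONG.get(key, key)] = value

    # The last underscore-separated part is the suffix
    entities['suffix'] = parts[-1]
    entities['extension'] = '.' + ext if ext else ''
    return entities


print(sketch_parse_entities('sub-01_ses-1_task-rest_run-1_bold.nii.gz'))
```

Each key in the resulting dictionary corresponds to a keyword that could be passed to `.get()` as a filter.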

### Filtering by metadata

All of the entities listed above are found in the names of BIDS files. But sometimes we want to search for files based not just on their names, but also based on metadata defined (per the BIDS spec) in JSON files. Fortunately for us, when we initialize a `BIDSLayout`, all metadata files associated with BIDS files are automatically indexed. This means we can pass any key that occurs in any JSON file in our project as an argument to `.get()`. We can combine these with any number of core BIDS entities (like `subject`, `run`, etc.).

For example, say we want to retrieve all files where (a) the value of `SamplingFrequency` (a metadata key) is `100`, (b) the `acquisition` type is `'prefrontal'`, and (c) the subject is `'01'` or `'02'`. Here's how we can do that:

```{code-cell} ipython3
# Retrieve all files where SamplingFrequency (a metadata key) = 100
# and acquisition = prefrontal, for the first two subjects
layout.get(subject=['01', '02'], SamplingFrequency=100, acquisition="prefrontal")
```

Notice that we passed a list in for `subject` rather than just a string. This principle applies to all filters: you can always pass in a list instead of a single value, and this will be interpreted as a logical disjunction (i.e., a file must match any one of the provided values).

### Other `return_type` values

While we'll typically want to work with either `BIDSFile` objects or filenames, we can also ask `get()` to return unique values (or ids) of particular entities. For example, say we want to know which subjects have at least one `T1w` file. We can request that information by setting `return_type='id'`. When using this option, we also need to specify a target entity (or metadata keyword) called `target`. This combination tells the `BIDSLayout` to return the unique values for the specified `target` entity.
In the next example, we ask for all of the unique subject IDs that have at least one file with a `T1w` suffix:

```{code-cell} ipython3
# Ask get() to return the ids of subjects that have T1w files
layout.get(return_type='id', target='subject', suffix='T1w')
```

If our `target` is a BIDS entity that corresponds to a particular directory in the BIDS spec (e.g., `subject` or `session`) we can also use `return_type='dir'` to get all matching subdirectories:

```{code-cell} ipython3
layout.get(return_type='dir', target='subject')
```

### Other `get()` options

The `.get()` method has a number of other useful arguments that control its behavior. We won't discuss these in detail here, but briefly, here are a couple worth knowing about:

* `regex_search`: If you set this to `True`, string filter argument values will be interpreted as regular expressions.
* `scope`: If your BIDS dataset contains BIDS-derivatives sub-datasets, you can specify the scope (e.g., `derivatives`, or a BIDS-Derivatives pipeline name) of the search space.

+++

## The `BIDSFile`

When you call `.get()` on a `BIDSLayout`, the default returned values are objects of class `BIDSFile`. A `BIDSFile` is a lightweight container for individual files in a BIDS dataset. It provides easy access to a variety of useful attributes and methods. Let's take a closer look. First, let's pick an arbitrary file from our existing `layout`.

```{code-cell} ipython3
# Pick the file at index 15 (i.e., the 16th file) in the dataset
bf = layout.get()[15]

# Print it
bf
```

Here are some of the attributes and methods available to us in a `BIDSFile` (note that some of these are only available for certain subclasses of `BIDSFile`; e.g., you can't call `get_image()` on a `BIDSFile` that doesn't correspond to an image file!):

* `.path`: The full path of the associated file
* `.filename`: The associated file's filename (without directory)
* `.dirname`: The directory containing the file
* `.get_entities()`: Returns information about entities associated with this `BIDSFile` (optionally including metadata)
* `.get_image()`: Returns the file contents as a nibabel image (only works for image files)
* `.get_df()`: Returns the file contents as a pandas DataFrame (only works for TSV files)
* `.get_metadata()`: Returns a dictionary of all metadata found in associated JSON files
* `.get_associations()`: Returns a list of all files associated with this one in some way

Let's see some of these in action.

```{code-cell} ipython3
# Print all the entities associated with this file, and their values
bf.get_entities()
```

```{code-cell} ipython3
# Print all the metadata associated with this file
bf.get_metadata()
```

```{code-cell} ipython3
# We can get the union of both of the above in one shot like this
bf.get_entities(metadata='all')
```

Here are all the files associated with our target file in some way. Notice how we get back both the JSON sidecar for our target file, and the BOLD run that our target file contains physiological recordings for.

```{code-cell} ipython3
bf.get_associations()
```

+++ {"slideshow": {"slide_type": "slide"}}

In cases where a file has a `.tsv.gz` or `.tsv` extension, it will automatically be created as a `BIDSDataFile`, and we can easily grab the contents as a pandas `DataFrame`:

```{code-cell} ipython3
# Use a different test dataset--one that contains physio recording files
data_path = os.path.join(get_test_data_path(), 'synthetic')
layout2 = BIDSLayout(data_path)

# Get the first physiological recording file
recfile = layout2.get(suffix='physio')[0]

# Get contents as a DataFrame and show the first few rows
df = recfile.get_df()
df.head()
```

While it would have been easy enough to read the contents of the file ourselves with pandas' `read_csv()` function, notice that in the above example, `get_df()` saved us the trouble of having to read the physiological recording file's metadata, pull out the column names and sampling rate, and add timing information.

Mind you, if we don't *want* the timing information, we can leave it out:

```{code-cell} ipython3
recfile.get_df(include_timing=False).head()
```

## Other utilities

### Filename parsing

Say you have a filename, and you want to manually extract BIDS entities from it. The `parse_file_entities` method provides this facility:

```{code-cell} ipython3
path = "/a/fake/path/to/a/BIDS/file/sub-01_run-1_T2w.nii.gz"
layout.parse_file_entities(path)
```

A version of this utility that is independent of a specific layout is available in `bids.layout` ([doc](https://bids-standard.github.io/pybids/generated/bids.layout.parse_file_entities.html#bids.layout.parse_file_entities)):

```{code-cell} ipython3
from bids.layout import parse_file_entities

path = "/a/fake/path/to/a/BIDS/file/sub-01_run-1_T2w.nii.gz"
parse_file_entities(path)
```

### Path construction

You may want to create valid BIDS filenames for new or hypothetical files that would sit within your BIDS project.
This is useful when you know what entity values you need to write out to, but don't want to deal with looking up the precise BIDS file-naming syntax. In the example below, imagine we've created a new file containing stimulus presentation information, and we want to save it to a `.tsv.gz` file, per the BIDS naming conventions. All we need to do is define a dictionary with the name components, and `build_path` takes care of the rest (including injecting sub-directories!):

```{code-cell} ipython3
entities = {
    'subject': '01',
    'run': 2,
    'task': 'nback',
    'suffix': 'bold'
}

layout.build_path(entities)
```

You can also use `build_path` in more sophisticated ways, for example by defining your own set of matching templates that cover cases not supported by BIDS out of the box. Suppose you want to create a template for naming a new z-stat file. You could do something like:

```{code-cell} ipython3
# Define the pattern to build out of the components passed in the dictionary
# (segments in square brackets are optional)
pattern = "sub-{subject}[_ses-{session}]_task-{task}[_acq-{acquisition}][_rec-{reconstruction}][_run-{run}][_echo-{echo}]_{suffix}.nii.gz"

entities = {
    'subject': '01',
    'run': 2,
    'task': 'nback',
    'suffix': 'z'
}

# Notice we pass the new pattern as the second argument
layout.build_path(entities, pattern, validate=False)
```

Note that in the above example, we set `validate=False` to ensure that the standard BIDS file validator doesn't run (because the pattern we defined isn't actually compliant with the BIDS specification).

+++

### Loading derivatives

By default, `BIDSLayout` objects are initialized without scanning contained `derivatives/` directories.
But you can easily ensure that all derivatives files are loaded and endowed with the extra structure specified in the [derivatives config file](https://github.com/bids-standard/pybids/blob/master/bids/layout/config/derivatives.json):

```{code-cell} ipython3
# Define the path to the dataset root; derivatives=True also indexes
# the derivatives/ folder found under it
root = os.path.join(get_test_data_path(), 'synthetic')
layout2 = BIDSLayout(root, derivatives=True)
layout2
```

The `scope` argument to `get()` specifies which part of the project to look in. By default, valid values are `'bids'` (for the "raw" BIDS project that excludes derivatives) and `'derivatives'` (for all BIDS-derivatives files). You can also pass the names of individual derivatives pipelines (e.g., passing `'fmriprep'` would search only in a `/derivatives/fmriprep` folder). Either a string or a list of strings can be passed.

The following call returns the filenames of all derivatives files.

```{code-cell} ipython3
# Get all files in derivatives
layout2.get(scope='derivatives', return_type='file')
```

### Exporting a `BIDSLayout` to a pandas `DataFrame`

If you want a summary of all the files in your `BIDSLayout`, but don't want to have to iterate over `BIDSFile` objects and extract their entities, you can get a nice bird's-eye view of your dataset using the `to_df()` method.

```{code-cell} ipython3
# Convert the layout to a pandas dataframe
df = layout.to_df()
df.head()
```

We can also include metadata in the result if we like (which may blow up our `DataFrame` if we have a large dataset). Note that in this case, most of our cells will have missing values.

```{code-cell} ipython3
layout.to_df(metadata=True).head()
```

## Retrieving BIDS variables

BIDS variables are stored in `.tsv` files at the run, session, subject, or dataset level. You can retrieve these variables with `layout.get_collections()`. The resulting objects can be converted to dataframes and merged with the layout to associate the variables with corresponding scans.
In the following example, we request all subject-level variable data available anywhere in the BIDS project, and merge the results into a single `DataFrame` (by default, we'll get back a single `BIDSVariableCollection` object for each subject).

```{code-cell} ipython3
# Get subject variables as a dataframe and merge them back in with the layout
subj_df = layout.get_collections(level='subject', merge=True).to_df()
subj_df.head()
```

## BIDSValidator

`pybids` implicitly imports a `BIDSValidator` class from the separate [`bids-validator`](https://github.com/bids-standard/bids-validator) package. You can use the `BIDSValidator` to determine whether a filepath is a valid BIDS filepath, as well as answering questions about what kind of data it represents. Note, however, that this implementation of the BIDS validator is *not* necessarily up-to-date with the JavaScript version available online. Moreover, the Python validator only tests individual files, and is currently unable to validate entire BIDS datasets. For that, you should use the [online BIDS validator](https://bids-standard.github.io/bids-validator/).

```{code-cell} ipython3
from bids import BIDSValidator

# Note that when using the BIDS validator, the filepath MUST be relative
# to the top-level BIDS directory
validator = BIDSValidator()
validator.is_bids('/sub-02/ses-01/anat/sub-02_ses-01_T2w.nii.gz')
```

```{code-cell} ipython3
# Check whether a filepath represents a file that is part of the specification
validator.is_file('/sub-02/ses-01/anat/sub-02_ses-01_T2w.json')
```

```{code-cell} ipython3
# Check whether a file is at the top level of the dataset
validator.is_top_level('/dataset_description.json')
```

```{code-cell} ipython3
# ...or at the subject (or session) level
validator.is_subject_level('/dataset_description.json')
```

```{code-cell} ipython3
validator.is_session_level('/sub-02/ses-01/sub-02_ses-01_scans.json')
```

```{code-cell} ipython3
# Check whether a filepath represents phenotypic data
validator.is_phenotypic('/sub-02/ses-01/anat/sub-02_ses-01_T2w.nii.gz')
```
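
Because the validator expects paths relative to the top-level BIDS directory (with a leading slash, as in the cells above), it can help to convert absolute paths before checking them. The helper below is a minimal sketch using only the standard library; the name `to_validator_path` and the `/data/ds001` root are hypothetical, chosen only for illustration, and POSIX-style paths are assumed.

```python
from pathlib import PurePosixPath


def to_validator_path(dataset_root, file_path):
    """Return file_path relative to dataset_root, with a leading slash.

    Hypothetical helper (not part of pybids or bids-validator).
    Raises ValueError if file_path is not located under dataset_root.
    """
    rel = PurePosixPath(file_path).relative_to(PurePosixPath(dataset_root))
    return '/' + rel.as_posix()


# Example with a made-up dataset root
rel_path = to_validator_path(
    '/data/ds001',
    '/data/ds001/sub-02/ses-01/anat/sub-02_ses-01_T2w.nii.gz',
)
print(rel_path)
```

The resulting string could then be passed to `validator.is_bids()` in the same way as the hard-coded paths above.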