Work with Next-Generation Sequencing Data

Overview

Many biological experiments produce huge data files that are difficult to access due to their size, which can cause memory issues when reading the file into the MATLAB® Workspace. You can construct a BioIndexedFile object to access the contents of a large text file containing nonuniform size entries, such as sequences, annotations, and cross-references to data sets. The BioIndexedFile object lets you quickly and efficiently access this data without loading the source file into memory.

You can use the BioIndexedFile object to access individual entries or a subset of entries when the source file is too big to fit into memory. You can access entries using indices or keys. You can read and parse one or more entries using provided interpreters or a custom interpreter function.

Use the BioIndexedFile object in conjunction with your large source file to:

Access a subset of the entries for validation or further analysis.
Parse entries using a custom interpreter function.

What Files Can You Access?

You can use the BioIndexedFile object to access large text files.

Your source file can have these application-specific formats:

FASTA
FASTQ
SAM

Your source file can also have these general formats:

Table — Tab-delimited table with multiple columns. Keys can be in any column. Rows with the same key are considered separate entries.
Multi-row Table — Tab-delimited table with multiple columns. Keys can be in any column. Contiguous rows with the same key are considered a single entry. Noncontiguous rows with the same key are considered separate entries.
Flat — Flat file with concatenated entries separated by a character vector, typically //. Within an entry, the key is separated from the rest of the entry by a white space.
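
The difference between the Table and Multi-row Table formats can be seen in a small sketch. It writes a hypothetical four-row, tab-delimited file (demo_genes.txt, with made-up gene names and GO identifiers) to a temporary folder and indexes it both ways, with the keys in column 1. Going by the format descriptions above, the Table index should treat every row as its own entry, while the Multi-row Table index should merge the two contiguous geneA rows into a single entry. The index file names are also illustrative.

```matlab
% Write a small tab-delimited file to a fresh temporary folder.
% File name, gene names, and GO identifiers are hypothetical.
workDir = tempname;
mkdir(workDir);
srcFile = fullfile(workDir, 'demo_genes.txt');
fid = fopen(srcFile, 'w');
fprintf(fid, 'geneA\tGO:0000001\n');
fprintf(fid, 'geneA\tGO:0000002\n');   % contiguous with the row above
fprintf(fid, 'geneB\tGO:0000003\n');
fprintf(fid, 'geneA\tGO:0000004\n');   % same key, but not contiguous
fclose(fid);

% Index the same file twice, using a separate index file for each format.
tableObj = BioIndexedFile('TABLE', srcFile, ...
                          fullfile(workDir, 'demo_table.idx'), 'KeyColumn', 1);
mrtabObj = BioIndexedFile('MRTAB', srcFile, ...
                          fullfile(workDir, 'demo_mrtab.idx'), 'KeyColumn', 1);

% Compare how many entries each format yields.
fprintf('Table entries:           %d\n', tableObj.NumEntries);
fprintf('Multi-row Table entries: %d\n', mrtabObj.NumEntries);
```

Using a separate index file per format keeps the two indexes from overwriting each other, since both are built from the same source file.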

Before You Begin

Before constructing a BioIndexedFile object, locate your source file on your hard drive or a local network.

When you construct a BioIndexedFile object from your source file for the first time, you also create an auxiliary index file, which by default is saved to the same location as your source file. However, if your source file is in a read-only location, you can specify a different location to save the index file.

Tip

If you construct a BioIndexedFile object from your source file on subsequent occasions, it takes advantage of the existing index file, which saves time. However, the index file must be in the same location or a location specified by the subsequent construction syntax.

Tip

If memory is not a constraint when accessing your source file, consider using an appropriate read function, such as genbankread, to import data from GenBank® files.

Additionally, several read functions such as fastaread, fastqread, samread, and sffread include a Blockread property, which lets you read a subset of entries from a file, thus saving memory.
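
As a minimal sketch of the Blockread option, the following reads only entries 5 through 10 of a FASTQ file. It assumes that a sample file named SRR005164_1_50.fastq is available on the MATLAB path, as is the case in typical Bioinformatics Toolbox installations; substitute your own FASTQ file otherwise.

```matlab
% Read entries 5 through 10 only, instead of the whole file.
% Assumes the sample file SRR005164_1_50.fastq is on the MATLAB path.
reads = fastqread('SRR005164_1_50.fastq', 'Blockread', [5 10]);

% reads is a structure array with one element per entry read.
numel(reads)
```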

Create a BioIndexedFile Object to Access Your Source File

To construct a BioIndexedFile object from a multi-row table file:

Create a variable containing the full absolute path of your source file. For your source file, use the yeastgenes.sgd file, which is included with the Bioinformatics Toolbox™ software.
```
sourcefile = which('yeastgenes.sgd');
```
Use the BioIndexedFile constructor function to construct a BioIndexedFile object from the yeastgenes.sgd source file, which is a multi-row table file. Save the index file in the Current Folder. Indicate that the source file keys are in column 3. Also, indicate that the header lines in the source file are prefaced with !, so the constructor ignores them.
```
gene2goObj = BioIndexedFile('mrtab', sourcefile, '.', ...
                            'KeyColumn', 3, 'HeaderPrefix','!')
```
The BioIndexedFile constructor function constructs gene2goObj, a BioIndexedFile object, and also creates an index file with the same name as the source file, but with an IDX extension. It stores this index file in the Current Folder because we specified this location. However, the default location for the index file is the same location as the source file.

Caution

Do not modify the index file. If you modify it, you can get invalid results. Also, the constructor function cannot use a modified index file to construct future objects from the associated source file.
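
To illustrate how a later construction reuses an existing index file, the sketch below builds the object twice with identical arguments and times each call. The index folder (tempdir) and the variable names are assumptions made for illustration; any writable folder works. The InputFile and IndexFile properties show which files the object refers to.

```matlab
sourcefile = which('yeastgenes.sgd');
indexDir = tempdir;   % any writable folder works; tempdir is an assumption here

% First construction: builds the index file in indexDir (unless one exists).
tic
obj1 = BioIndexedFile('mrtab', sourcefile, indexDir, ...
                      'KeyColumn', 3, 'HeaderPrefix', '!');
firstTime = toc;

% Second construction with the same arguments: finds the existing index file.
tic
obj2 = BioIndexedFile('mrtab', sourcefile, indexDir, ...
                      'KeyColumn', 3, 'HeaderPrefix', '!');
secondTime = toc;

% Both objects point at the same source and index files.
disp(obj2.InputFile)
disp(obj2.IndexFile)
fprintf('First: %.2f s, second: %.2f s\n', firstTime, secondTime);
```

The timings depend on your system and on whether an index file was already present in the folder before the first call.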

Determine the Number of Entries Indexed by a BioIndexedFile Object

To determine the number of entries indexed by a BioIndexedFile object, use the NumEntries property of the BioIndexedFile object. For example, for the gene2goObj object:

```
gene2goObj.NumEntries

ans =

        6476
```

Note

For a list and description of all properties of the object, see BioIndexedFile.

Retrieve Entries from Your Source File

Retrieve entries from your source file using either:

The index of the entry
The entry key

Retrieve Entries Using Indices

Use the getEntryByIndex method to retrieve a subset of entries from your source file that correspond to specified indices. For example, retrieve the first 12 entries from the yeastgenes.sgd source file:

```
subset_entries = getEntryByIndex(gene2goObj, 1:12);
```

Retrieve Entries Using Keys

Use the getEntryByKey method to retrieve a subset of entries from your source file that are associated with specified keys. For example, retrieve all entries with keys of AAC1 and AAD10 from the yeastgenes.sgd source file:

```
subset_entries = getEntryByKey(gene2goObj, {'AAC1', 'AAD10'});
```

The output subset_entries is a character vector of concatenated entries. Because the keys in the yeastgenes.sgd source file are not unique, this method returns all entries that have a key of AAC1 or AAD10.
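
Because the retrieved entries come back as one character vector, a common next step is to split that vector into rows and columns. The sketch below assumes that each row of an entry ends with a line break and that columns are tab-delimited, as in the multi-row table format used here. The index folder (tempdir) and the variable names are illustrative.

```matlab
sourcefile = which('yeastgenes.sgd');
gene2goObj = BioIndexedFile('mrtab', sourcefile, tempdir, ...
                            'KeyColumn', 3, 'HeaderPrefix', '!');

% Retrieve all entries with the key AAC1 as one character vector.
entryText = getEntryByKey(gene2goObj, 'AAC1');

% Split into non-empty rows (handles both LF and CRLF line endings).
rows = regexp(entryText, '[^\r\n]+', 'match');

% Split the first row into its tab-delimited columns.
% regexp with 'split' keeps empty fields, so column positions are preserved.
cols = regexp(rows{1}, '\t', 'split');

% Column 3 is the key column specified at construction.
disp(cols{3})
```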

Read Entries from Your Source File

The BioIndexedFile object includes a read method, which you can use to read and parse a subset of entries from your source file. The read method parses the entries using an interpreter function specified by the Interpreter property of the BioIndexedFile object.

Set the Interpreter Property

Before using the read method, make sure the Interpreter property of the BioIndexedFile object is set appropriately.

| If you constructed a BioIndexedFile object from ... | The Interpreter property ... |
| --- | --- |
| A source file with an application-specific format (FASTA, FASTQ, or SAM) | By default is a handle to a function appropriate for that file type and typically does not require you to change it. |
| A source file with a table, multi-row table, or flat format | By default is `[]`, which means the interpreter is an anonymous function in which the output is equivalent to the input. You can change this to a handle to a function that accepts a character vector of one or more concatenated entries and returns a structure or an array of structures containing the interpreted data. |
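
For a table-format source file, a custom interpreter of the kind described above might look like the following sketch. It writes a hypothetical two-column, tab-delimited file to a temporary folder and uses named tokens in a regular expression to turn each row into one element of a structure array. The file contents, the field names (Gene, GOTerm), and the name rowParser are all made up for illustration.

```matlab
% Write a small two-column, tab-delimited file (hypothetical contents).
workDir = tempname;
mkdir(workDir);
srcFile = fullfile(workDir, 'demo_genes.txt');
fid = fopen(srcFile, 'w');
fprintf(fid, 'geneA\tGO:0000001\n');
fprintf(fid, 'geneB\tGO:0000002\n');
fprintf(fid, 'geneA\tGO:0000003\n');
fclose(fid);

% Interpreter: named tokens turn each "key<TAB>value" row into one element
% of a structure array with the fields Gene and GOTerm.
rowParser = @(txt) regexp(txt, ...
    '(?<Gene>[^\t\r\n]+)\t(?<GOTerm>[^\t\r\n]+)', 'names');

tableObj = BioIndexedFile('TABLE', srcFile, workDir, ...
                          'KeyColumn', 1, 'Interpreter', rowParser);

% Read and parse every entry whose key is geneA.
geneA = read(tableObj, 'geneA');
disp({geneA.GOTerm})
```

A regular expression with named tokens keeps the interpreter a one-line anonymous function; for files with many columns, a separate function file that splits each row on tabs may be easier to maintain.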

There are two ways to set the Interpreter property of the BioIndexedFile object:

When constructing the BioIndexedFile object, use the Interpreter property name/property value pair
After constructing the BioIndexedFile object, set the Interpreter property
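
Both approaches can be sketched side by side. The example assumes the yeastgenes.sgd file used earlier and writes its index file to tempdir, an illustrative choice. The interpreter (here named goParser, a hypothetical name) extracts GO identifiers from the entry text.

```matlab
sourcefile = which('yeastgenes.sgd');
goParser = @(x) regexp(x, 'GO:\d+', 'match');   % returns the GO identifiers

% Way 1: pass the interpreter as a name-value pair to the constructor.
objA = BioIndexedFile('mrtab', sourcefile, tempdir, ...
                      'KeyColumn', 3, 'HeaderPrefix', '!', ...
                      'Interpreter', goParser);

% Way 2: construct first, then assign the Interpreter property.
objB = BioIndexedFile('mrtab', sourcefile, tempdir, ...
                      'KeyColumn', 3, 'HeaderPrefix', '!');
objB.Interpreter = goParser;

% Both objects should parse entries the same way.
isequal(read(objA, 'YAT2'), read(objB, 'YAT2'))
```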

Note

For more information on setting the Interpreter property of the object, see BioIndexedFile.

Read a Subset of Entries

The read method reads and parses a subset of entries that you specify using either entry indices or keys.

Example

Because the entry keys in this source file are gene names, you can quickly find all the gene ontology (GO) terms associated with a particular gene:

Set the Interpreter property of the gene2goObj BioIndexedFile object to a handle to a function that reads entries and returns only the GO terms. In this case, the interpreter is a handle to an anonymous function that accepts a character vector and returns every substring that matches GO: followed by one or more digits.
```
gene2goObj.Interpreter = @(x) regexp(x,'GO:\d+','match')
```

Read only the entries that have a key of YAT2, and return their GO terms.

```
GO_YAT2_entries = read(gene2goObj, 'YAT2')

GO_YAT2_entries = 

'GO:0004092' 'GO:0005737' 'GO:0006066' 'GO:0006066' 'GO:0009437'
```