FASTA Loader

By Malcolm McLean

This file contains functions to parse FASTA format files.

Any enhancements or bug fixes send to

Source files


A FASTA format file consists of one or more DNA, RNA or protein sequences. The format is deliberately simple: each record is a header line followed by the raw sequence data.

Example FASTA file

> 1abo.pdb | some funny protein or other |
ACDEFGHIKLMNPQRSTVWY
ASEQVENCE
; comment (rarely seen)
> Another header, this
ANQTHER_SEQVENCEHERE
--ALIGNED

Fasta format

The header is introduced with the greater than character. The convention is to separate fields with vertical bars.
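The field convention can be illustrated with a small sketch. The names split_header and trim are hypothetical and are not part of this loader; the code simply assumes the header is a NUL-terminated line that may be modified in place.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the loader's API.
 * Strip leading and trailing whitespace from s in place. */
static char *trim(char *s)
{
    char *end;

    while (isspace((unsigned char)*s))
        s++;
    end = s + strlen(s);
    while (end > s && isspace((unsigned char)end[-1]))
        end--;
    *end = '\0';
    return s;
}

/* Hypothetical helper, not part of the loader's API.
 * Split a header line in place into '|'-separated fields.
 * A leading '>' is skipped and each field is trimmed.
 * Returns the number of fields stored (at most maxfields). */
int split_header(char *line, char **fields, int maxfields)
{
    int n = 0;
    char *p;

    assert(line != NULL && fields != NULL);
    if (*line == '>')
        line++;
    p = line;
    while (p != NULL && n < maxfields) {
        char *bar = strchr(p, '|');

        if (bar != NULL)
            *bar = '\0';        /* terminate the current field */
        fields[n++] = trim(p);
        p = (bar != NULL) ? bar + 1 : NULL;
    }
    return n;
}
```

Note that a header ending in a vertical bar, like the first one in the example file, yields an empty final field under this scheme.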

Comments, introduced with a semicolon, are allowed and are handled by this loader, but they are rarely used in practice and may break other software.

The sequence data is in one-letter codes, with lower case translated to upper case. Whitespace is ignored, and lines should be shorter than eighty characters.
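As a rough sketch of the normalisation described above, the following function copies one line of sequence data, dropping whitespace and converting to upper case. The name normalize_line is hypothetical; the loader's own parsing code may well be organised differently.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not part of the loader's API.
 * Copy one line of sequence data into out, skipping whitespace
 * and translating lower case to upper case.
 * out must have room for strlen(line) + 1 characters.
 * Returns the number of codes written, excluding the NUL. */
size_t normalize_line(const char *line, char *out)
{
    size_t n = 0;

    assert(line != NULL && out != NULL);
    for (; *line != '\0'; line++) {
        unsigned char ch = (unsigned char)*line;

        if (isspace(ch))
            continue;           /* whitespace is ignored */
        out[n++] = (char)toupper(ch);
    }
    out[n] = '\0';
    return n;
}
```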

The FASTA object is intended to be semi-opaque. Most code should access it using the access functions; however, the fields of the structure are exported if you need them.

All functions will fail an assertion if passed bad indices. All indices are zero-based.

FASTA *loadfasta(char *fname, int *err);

Load a FASTA file
err - return for the error code (can be null)
0 - OK
-1 - out of memory
-2 - can't load file
-3 - parse error, file corrupted

The function will return NULL if the file fails to load.
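For reporting failures, the codes listed above could be mapped to text with something like the following. The function fasta_errstr is a hypothetical convenience and is not provided by the loader; only the four numeric codes come from this page.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the loader's API.
 * Translate an error code, as returned through the err
 * parameter, into a short message. */
const char *fasta_errstr(int err)
{
    switch (err) {
    case 0:
        return "OK";
    case -1:
        return "out of memory";
    case -2:
        return "can't load file";
    case -3:
        return "parse error, file corrupted";
    default:
        return "unknown error";   /* code not documented here */
    }
}
```

A caller would typically pass the address of an int as err and, if the load function returns NULL, print the message for that code.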

FASTA *floadfasta(FILE *fp, int *err);

Load a FASTA file from an already opened file. Behaves as the previous function; useful if using pipes.

void killfasta(FASTA *fa);

Destroy the object and free memory used.

int fasta_getNsequences(FASTA *fa);

Get the number of sequences in the file.

void fasta_getsequence(FASTA *fa, int index, char *out);

Retrieve the sequence. Gaps are closed. If the sequence has a trailing stop codon ('*'), it is silently suppressed.

The buffer must be large enough to contain the sequence as well as the terminating NUL.
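The gap closing and buffer sizing can be sketched as follows. These helpers are hypothetical and stand in for the loader's internals, which are not shown on this page; they assume the stop codon, if present, is the very last character of the stored sequence.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of the loader's API.
 * Length of seq once gaps ('-') and a single trailing
 * stop ('*') have been removed. */
size_t ungapped_length(const char *seq)
{
    size_t len, i, n = 0;

    assert(seq != NULL);
    len = strlen(seq);
    if (len > 0 && seq[len - 1] == '*')
        len--;                  /* suppress trailing stop */
    for (i = 0; i < len; i++)
        if (seq[i] != '-')
            n++;
    return n;
}

/* Hypothetical helper: copy seq to out with gaps closed.
 * out must hold ungapped_length(seq) + 1 characters. */
void close_gaps(const char *seq, char *out)
{
    size_t len, i, n = 0;

    assert(seq != NULL && out != NULL);
    len = strlen(seq);
    if (len > 0 && seq[len - 1] == '*')
        len--;
    for (i = 0; i < len; i++)
        if (seq[i] != '-')
            out[n++] = seq[i];
    out[n] = '\0';
}

/* Hypothetical helper: allocate a buffer of exactly the right
 * size, including the terminating NUL. Caller frees the result.
 * Returns NULL if out of memory. */
char *ungapped_copy(const char *seq)
{
    char *out = malloc(ungapped_length(seq) + 1);

    if (out != NULL)
        close_gaps(seq, out);
    return out;
}
```

With the loader itself, the corresponding pattern is presumably to ask for the length first and allocate one extra character for the NUL before retrieving the sequence.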

void fasta_getgappedsequence(FASTA *fa, int index, char *out);

Retrieve the sequence including gap characters, which are represented by hyphens ('-').

int fasta_getlength(FASTA *fa, int index);

Get the length of a sequence, excluding gaps.

int fasta_getgappedlength(FASTA *fa, int index);

Get the length of a sequence, including gaps.

int fasta_gettype(FASTA *fa, int index);

Get the type of a sequence

FASTA_UNKNOWN - can't work out type of data
FASTA_PROTEIN - canonical 20 amino acids
FASTA_XPROTEIN - extended protein

Some protein sequences may also be legal RNA or DNA sequences. In practice this is unlikely to be a real problem, and the algorithm will simply assign the most probable type.

The extended sequences contain codes for unknown or modified residues, and also cover protein sequences with embedded stop codons.
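One simple way to make such a guess is to test the sequence against progressively larger alphabets, trying the nucleic acid alphabets first because a sequence made up only of A, C, G and T is far more likely to be DNA than protein. The sketch below is an assumption about how this might be done, not the loader's actual algorithm, and the enum names are hypothetical rather than the loader's own constants.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical type codes, not the loader's FASTA_ constants. */
enum seq_type { SEQ_UNKNOWN, SEQ_DNA, SEQ_RNA, SEQ_PROTEIN, SEQ_XPROTEIN };

/* True if every character of seq appears in alphabet. */
static int only(const char *seq, const char *alphabet)
{
    return seq[strspn(seq, alphabet)] == '\0';
}

/* Hypothetical helper: guess the type of an upper-case sequence.
 * Gap characters ('-') are accepted in every alphabet. */
enum seq_type guess_type(const char *seq)
{
    assert(seq != NULL);
    if (only(seq, "-"))                     /* empty or all gaps */
        return SEQ_UNKNOWN;
    if (only(seq, "ACGTN-"))                /* most probable: DNA */
        return SEQ_DNA;
    if (only(seq, "ACGUN-"))
        return SEQ_RNA;
    if (only(seq, "ACDEFGHIKLMNPQRSTVWY-")) /* canonical 20 */
        return SEQ_PROTEIN;
    if (only(seq, "ABCDEFGHIJKLMNOPQRSTUVWXYZ*-"))
        return SEQ_XPROTEIN;                /* unknown residues, stops */
    return SEQ_UNKNOWN;
}
```

Under this scheme a short peptide such as "ACGT" would be reported as DNA, which is the kind of ambiguity described above.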