Description

This track displays conservation based on the Multiple Sequence Alignment (MSA) results of VAX004 HIV-1 gp120 protein sequences (translated from the corresponding nucleotide sequences).

Methods

Conservation

At each position of the MSA, the occurrences of each amino acid were counted and the frequency distribution was calculated. The frequency of the most common residue at each position was used as the conservation measure.
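The per-column tally described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the code used to build the track: the function name `column_conservation` is hypothetical, and excluding gap characters from the counts is an assumption, since the handling of gaps is not stated here.

```python
from collections import Counter


def column_conservation(alignment, gap_chars="-"):
    """For each alignment column, return (most frequent residue, its frequency).

    `alignment` is a list of equal-length aligned sequences (strings).
    Assumption: gap characters are excluded from the counts and the
    denominator; the track description does not say how gaps were treated.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    results = []
    for col in range(length):
        # Tally the residues observed in this column, skipping gaps.
        counts = Counter(
            seq[col] for seq in alignment if seq[col] not in gap_chars
        )
        total = sum(counts.values())
        if total == 0:
            # Column contains only gaps: no residue, zero conservation.
            results.append((None, 0.0))
            continue
        residue, count = counts.most_common(1)[0]
        results.append((residue, count / total))
    return results
```

For a toy alignment such as `["MKV-", "MRV-", "MKI-"]`, the first column is fully conserved (`M`, frequency 1.0), while the second and third columns each have a majority residue present in two of the three sequences.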

MSA method

The multiple alignments were built with the MAFFT program, using the modifications reported by Katoh et al. (2005). These modifications include two local (L-INS-i and E-INS-i) and one global (G-INS-i) iterative refinement algorithms. By default, MAFFT uses the BLOSUM62 scoring matrix (an empirical matrix calculated from large datasets representing extreme protein family diversity) and assumes a fixed frequency for each amino acid (AA) when building MSAs. However, the relative rates of change estimated in this matrix may be too general to fit datasets from specific gene families or with high substitution rates, such as gp120. We therefore ran multiple MAFFT alignments, combining either the BLOSUM62 or a gp120-specific scoring matrix with either fixed AA frequencies or variable AA frequencies estimated from the data at hand. Gene-specific empirical matrices and AA frequencies based on the VAX004 and LAGB-VAX004 datasets were generated with the program MATRIXGEN.
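As a rough illustration of how the three iterative refinement strategies map onto MAFFT command-line options, the sketch below builds an argument list for each one. The helper `mafft_command` and the file names are hypothetical; the flags shown (`--globalpair`, `--localpair`, `--genafpair`, `--maxiterate`, `--bl`, `--aamatrix`) are those documented for current MAFFT releases and may not match the exact invocation used for this track. The choice between fixed and variable AA frequencies is not covered here.

```python
def mafft_command(fasta_path, strategy="G-INS-i", matrix_file=None):
    """Build a MAFFT argument list for one iterative refinement strategy.

    Hypothetical helper for illustration only. `matrix_file`, if given,
    is the path to a user-defined amino acid scoring matrix; otherwise
    BLOSUM62 is requested explicitly.
    """
    strategy_flags = {
        "G-INS-i": "--globalpair",  # global pairwise alignment
        "L-INS-i": "--localpair",   # local pairwise alignment
        "E-INS-i": "--genafpair",   # local with generalized affine gaps
    }
    if strategy not in strategy_flags:
        raise ValueError(f"unknown strategy: {strategy!r}")

    cmd = ["mafft", strategy_flags[strategy], "--maxiterate", "1000"]
    if matrix_file is None:
        cmd += ["--bl", "62"]
    else:
        cmd += ["--aamatrix", matrix_file]
    cmd.append(fasta_path)
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, stdout=out_file, check=True)`, since MAFFT writes the alignment to standard output.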

The quality of the resulting MSAs was then compared by building maximum likelihood (ML) trees in PhyML (Guindon and Gascuel 2003) under the GTR+G+I model. Alignments whose ML trees had higher likelihood scores were considered better. The combination of G-INS-i, BLOSUM62, and AA-specific frequencies generated the trees with the best likelihood scores.

Uncertainty in our MSAs was assessed using Gblocks v0.91b (Castresana 2000). Ambiguous blocks were selected and removed according to the following settings:

These settings generated a conserved MSA of 1,401 sites.

Credits

The MSA of VAX004 HIV-1 sequences was performed by Keith A. Crandall at Genoma LLC.

References

Castresana J. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol Biol Evol. 2000;17(4):540-52.

Guindon S, Gascuel O. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol. 2003;52(5):696-704.

Katoh K, Kuma K, Toh H, Miyata T. MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 2005;33(2):511-18.