Description

This track displays Protein Multiple Sequence Alignment (MSA) results of HIV-1 gp120 protein sequences. The protein sequences shown are translated from the corresponding sequences in the DNA MSA track.

Display Conventions and Configuration

Pairwise alignments of the sequence from each subject to the $organism genome are displayed as a grayscale density plot (in pack mode) or as a wiggle (in full mode) that indicates alignment quality. In dense display mode, conservation is shown in grayscale using darker values to indicate higher levels of overall conservation.

To choose a set of subjects to view in the MSA display, visit the Table View. If no subjects are chosen, the MSA track displays a default set of subjects.

Checkboxes on the track configuration page select which sequences from the chosen subjects to include in the pairwise display. Configuration buttons are available to select all of the sequences (Set all) or deselect all of the sequences (Clear all). Note that excluding sequences from the pairwise display does not alter the conservation score display.

To view detailed information about the alignments at a specific position, click on the alignment. Note that the gap spacing shown reflects the aligned group as a whole and is not generated from any subset of sequences chosen for display.

Protein Level

When zoomed in to the protein-level display, the track shows the amino acid composition of each alignment, with amino acids in annotated repetitive elements displayed in lower case. The numbers and symbols on the Gaps line indicate the lengths of gaps in the $organism sequence at those alignment positions relative to the longest sequence shown in the display. If there is sufficient space in the display, the size of the gap is shown. If the space is insufficient and the gap would cause no frame shift, a "*" is displayed; otherwise a "+" is displayed. Zoom in further to see the gap size displayed as a number.
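The labelling rule for the Gaps line can be sketched in a few lines of Python. This is only an illustration of the rule as described above, not the browser's code: the function name and its parameters are hypothetical, and how the display measures available space or decides whether a gap causes a frame shift is not specified here, so both are passed in as inputs.

```python
def gap_label(gap_length: int, chars_available: int, causes_frame_shift: bool) -> str:
    """Return the text shown on the Gaps line for one gap (illustrative only)."""
    size_text = str(gap_length)
    # Enough room: show the gap size itself.
    if len(size_text) <= chars_available:
        return size_text
    # Not enough room: fall back to a one-character symbol.
    return "+" if causes_frame_shift else "*"


if __name__ == "__main__":
    print(gap_label(12, 2, False))  # room for the number
    print(gap_label(12, 1, False))  # no room, no frame shift
    print(gap_label(12, 1, True))   # no room, frame shift
```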

Methods

HIV-1 gp120 nucleotide sequences were translated into amino acids using AlignmentHelper1.2, and the resulting peptide sequences were then aligned in MAFFT v5.7 (Katoh et al. 2005).
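As a rough illustration of the translation step, the following Python sketch converts an in-frame nucleotide sequence to amino acids using the standard genetic code. It is not AlignmentHelper's implementation, whose internals are not described here. The handling of gap codons ("---" becomes "-"), ambiguous codons (reported as "X"), and trailing incomplete codons (ignored) are assumptions made for this example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides: str) -> str:
    """Translate an in-frame nucleotide sequence, reading frame 1."""
    seq = nucleotides.upper().replace("U", "T")
    protein = []
    # Step through complete codons only; a trailing partial codon is ignored.
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon == "---":
            protein.append("-")  # assumed: a whole-codon gap stays a gap
        else:
            protein.append(CODON_TABLE.get(codon, "X"))  # assumed: X for ambiguity
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGGCC"))
```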

We applied two local (L-INS-i and E-INS-i) and one global (G-INS-i) iterative refinement algorithms. MAFFT also incorporates a BLOSUM62 scoring matrix (an empirical matrix calculated from large datasets representing extreme protein family diversity) and assumes a fixed frequency for each amino acid (AA) when building MSAs. However, the estimated relative rates of change in this matrix may be too general to fit datasets of specific gene families or datasets with high substitution rates, such as gp120. Hence, we performed multiple MAFFT alignments using both the BLOSUM62 and gp120-specific scoring matrices and/or fixed and variable AA frequencies estimated for the data at hand. Gene-specific empirical matrices and AA frequencies based on the VAX004 and LAGB-VAX004 datasets were generated with the program MATRIXGEN.
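The combinations of refinement strategy and scoring matrix described above could be set up along the following lines. This is a sketch under stated assumptions, not the commands actually used: the flags shown (--localpair, --genafpair, --globalpair, --maxiterate, --bl, --aamatrix) are those documented for current MAFFT releases and may differ from v5.7, the file names are hypothetical, and the fixed versus variable AA frequency setting is left out because the option used for it is not stated here.

```python
from typing import List, Optional

# Flags per strategy as documented for current MAFFT releases
# (assumption: v5.7 may have used different spellings).
STRATEGY_FLAGS = {
    "L-INS-i": ["--localpair", "--maxiterate", "1000"],
    "E-INS-i": ["--genafpair", "--maxiterate", "1000"],
    "G-INS-i": ["--globalpair", "--maxiterate", "1000"],
}


def mafft_command(strategy: str, input_fasta: str,
                  custom_matrix: Optional[str] = None) -> List[str]:
    """Build a MAFFT argument list for one strategy/matrix combination."""
    cmd = ["mafft"] + STRATEGY_FLAGS[strategy]
    if custom_matrix is None:
        cmd += ["--bl", "62"]                 # BLOSUM62
    else:
        cmd += ["--aamatrix", custom_matrix]  # user-supplied matrix file
    cmd.append(input_fasta)
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for strategy in STRATEGY_FLAGS:
        for matrix in (None, "gp120_matrix.txt"):
            print(" ".join(mafft_command(strategy, "gp120_protein.fasta", matrix)))
```

MAFFT writes the alignment to standard output, so each argument list could be passed to subprocess.run with stdout redirected to a file.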

The quality of the resulting MSAs was then compared using the maximum likelihood (ML) tree-building approach in PhyML (Guindon and Gascuel 2003) under the GTR+G+I model. ML trees with the highest likelihood scores were taken to indicate better alignments. The combination of G-INS-i, BLOSUM62, and data-specific AA frequencies generated the trees with the best likelihood scores.

Uncertainty in our MSAs was assessed using GBlocks v0.91b (Castresana 2000). Ambiguous blocks were selected and removed according to the following settings:

These settings generated a conserved MSA of 1,401 sites.

Credits

The Multiple Sequence Alignment (MSA) of VAX004 HIV-1 sequences was performed by Keith A. Crandall at Genoma LLC.

References

Castresana J. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol Biol Evol. 2000;17(4):540-52.

Guindon S, Gascuel O. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol. 2003;52(5):696-704.

Katoh K, Kuma K, Toh H, Miyata T. MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 2005;33(2):511-18.