Description

This track is based on text-mining of full-text biomedical articles and includes two types of subtracks:

Sequences found in publications, grouped by article and searched in genomes with BLAT
Identifiers in publications that directly relate to chromosome locations (e.g., gene symbols and SNP identifiers)

Both sources of information are linked to the respective articles. Background information on how permission to use the full-text data was obtained can be found on the project website. See also text2genome.

Display Convention and Configuration

The sequence subtrack indicates the location of sequences in publications mapped back to the genome, annotated with the first author and the year of the publication. All matches of one article are grouped ("chained") together. Article titles are shown when you move the mouse cursor over the features. Thicker parts of the features (exons) represent matching sequences, connected by thin lines to matches from the same article within 30 kbp.

The subtrack "individual sequence matches" activates automatically when the user clicks a sequence match and follows the link "Show sequence matches individually" from the details page. Mouse-overs show flanking text around the sequence, and clicking features links to BLAT alignments.

All other subtracks (i.e. bands, genes, SNPs) show the number of matching articles as the feature description. Clicking on them shows the sentences and sections in articles where the identifiers were found.

The track configuration includes a keyword and year filter. Keywords are space-separated and are searched in the article's title, author list, and abstract.
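The filter described above can be pictured with a minimal sketch. This is not the browser's implementation: the Article class and the matches_filter function are hypothetical names, and two behaviours are assumptions, namely that every keyword must be found (case-insensitively) and that the year filter is a lower bound.

```python
from dataclasses import dataclass


@dataclass
class Article:
    """Hypothetical record holding the fields the filter searches."""
    title: str
    authors: str
    abstract: str
    year: int


def matches_filter(article, keywords="", min_year=None):
    """Return True if the article passes the keyword and year filter."""
    # Keywords are searched in the title, author list, and abstract.
    haystack = " ".join([article.title, article.authors, article.abstract]).lower()
    # Keywords are space-separated. Requiring all of them is an assumption.
    terms = keywords.lower().split()
    if not all(term in haystack for term in terms):
        return False
    # Treating the year filter as a lower bound is also an assumption.
    if min_year is not None and article.year < min_year:
        return False
    return True


example = Article("Hox gene regulation", "Smith J, Doe A", "We study enhancers.", 2009)
print(matches_filter(example, "hox smith"))
print(matches_filter(example, "hox", min_year=2010))
```

Under these assumptions, an empty keyword string lets every article through, so the year filter can be used on its own.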

Data

The track is based on text from biomedical research articles, obtained as part of the UCSC Genocoding Project.

The current dataset consists of about 600,000 files (main text and supplementary files) from PubMed Central (Open-Access set) and around 6 million text files (main text) from Elsevier (as part of the Sciverse Apps program).

Methods

All file types, including XML, raw ASCII, PDFs and various Microsoft Office formats (Excel, Word, PowerPoint), were converted to text. The resulting text was searched for groups of words that look like DNA/RNA sequences or like protein sequences. These were then mapped with BLAT to the human genome and these model organisms: mouse (mm9), rat (rn4), zebrafish (danRer6), Drosophila melanogaster (dm3), X. tropicalis (xenTro2), Medaka (oryLat2), C. intestinalis (ci2), C. elegans (ce6) and yeast (sacCer2). The pipeline roughly proceeds through these steps:

For sequences, the best match across all genomes is used if it is longer than 17 bp and matches at 90% identity or better. Two sets of BLAT parameters are used: the defaults for sequences longer than 25 bp, and very sensitive settings (stepSize=5) for shorter sequences.
Sequences are mapped to genomic DNA. Those that do not match are mapped to RefSeq cDNAs.
Hits from the same article that are closer than 30 kbp are joined into one feature (shown as exon-blocks on the browser).
All parts of a joined feature have to match at least 25 bp.
Non-unique hits are kept in the joined feature with the most members.
Joined features with identical members in two different genomes are kept in both genomes.
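The joining step in the list above can be illustrated with a short sketch. It is not the pipeline's code: the Hit tuple and the chain_hits function are hypothetical names, strand is ignored, and measuring the distance from the end of the current feature to the start of the next hit is an assumption.

```python
from collections import defaultdict
from typing import NamedTuple

MAX_GAP = 30_000  # hits closer than 30 kbp are joined into one feature


class Hit(NamedTuple):
    """Hypothetical record for one sequence match."""
    article_id: str
    chrom: str
    start: int
    end: int


def chain_hits(hits, max_gap=MAX_GAP):
    """Group hits from the same article and chromosome into joined features."""
    groups = defaultdict(list)
    for hit in hits:
        groups[(hit.article_id, hit.chrom)].append(hit)

    features = []
    for key in sorted(groups):
        current = []
        current_end = 0
        for hit in sorted(groups[key], key=lambda h: (h.start, h.end)):
            # Start a new feature when the gap reaches the limit.
            if current and hit.start - current_end >= max_gap:
                features.append(current)
                current = []
            current_end = max(current_end, hit.end) if current else hit.end
            current.append(hit)
        if current:
            features.append(current)
    return features


hits = [
    Hit("article1", "chr1", 100, 130),
    Hit("article1", "chr1", 20_000, 20_030),
    Hit("article1", "chr1", 60_000, 60_030),
    Hit("article2", "chr1", 150, 180),
]
for feature in chain_hits(hits):
    print(feature[0].article_id, [(h.start, h.end) for h in feature])
```

In this example the first two hits of article1 end up in one feature, while the third lies more than 30 kbp away and forms its own feature. Hits from article2 are never merged with those of article1.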

Note that due to the 90% identity filter, some sequences do not match anywhere in the genome. Examples include primers with added restriction sites, mutation primers, or any other sequence that joins or mixes two pieces of genomic DNA not part of RefSeq. Also note that some gene symbols are also common English words, which can lead to many false positives.
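The thresholds mentioned in the steps and in the note above can be sketched as follows. The function names are hypothetical, the dictionary returned by blat_settings is only an illustrative way of representing BLAT options, and computing identity as matching bases divided by aligned length is an assumption about how the filter is defined.

```python
MIN_LENGTH = 17            # matches must be longer than 17 bp
MIN_IDENTITY_PERCENT = 90  # and reach 90% identity
SHORT_SEQ_LIMIT = 25       # sequences up to this length get sensitive settings


def blat_settings(sequence):
    """Choose a parameter set by sequence length (hypothetical representation)."""
    if len(sequence) > SHORT_SEQ_LIMIT:
        return {}  # an empty dict stands for the BLAT defaults
    return {"stepSize": 5}  # the very sensitive setting named in the text


def passes_filter(aligned_length, mismatches):
    """Apply the length and identity thresholds to a single match."""
    if aligned_length <= MIN_LENGTH:
        return False
    matching = aligned_length - mismatches
    # Integer arithmetic avoids floating-point rounding at exactly 90%.
    return matching * 100 >= MIN_IDENTITY_PERCENT * aligned_length


print(blat_settings("ACGT" * 5))   # 20 bp sequence
print(passes_filter(20, 2))        # 18 of 20 bases match
print(passes_filter(20, 3))        # 17 of 20 bases match
```

Under this assumed definition, a 20 bp mutation primer with three altered bases reaches only 85% identity and is discarded, which is the kind of loss the note describes.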

Credits

Software and processing by Maximilian Haeussler. UCSC Track visualisation by Larry Meyer and Hiram Clawson. Elsevier support by Max Berenstein, Raphael Sidi, Judd Dunham, Scott Robbins and colleagues. Original version written at the Bergman Lab, University of Manchester, UK. Testing by Mary Mangan, OpenHelix Inc, and Greg Roe, UCSC.

Feedback

Please send ideas, comments or feedback on this track to max@soe.ucsc.edu. We are very interested in getting access to more articles from publishers for this dataset; see the project website.

References

Aerts S, Haeussler M, van Vooren S, Griffith OL, Hulpiau P, Jones SJ, Montgomery SB, Bergman CM, Open Regulatory Annotation Consortium. Text-mining assisted regulatory annotation. Genome Biol. 2008;9(2):R31.

Haeussler M, Gerner M, Bergman CM. Annotating genes and genomes with DNA sequences extracted from biomedical articles. Bioinformatics. 2011 Apr 1;27(7):980-6.

Van Noorden R. Trouble at the text mine. Nature. 2012 Mar 7;483(7388):134-5.