BIO650

Repository for the BIO650 web site at https://vprobon.github.io/BIO650. Includes information regarding schedule, assessment, teaching material etc.


BIO650 - Special Topics in Bioinformatics - Assignment 1 (Fall 2023)

Instructor: Associate Prof. Vasilis J Promponas. promponas.vasileios@ucy.ac.cy

Assignment 1 - Becoming sequence detectives

Background information

Biomolecular sequences are produced at ever-increasing rates, mainly due to the continuous reduction in sequencing costs and the availability of underlying high-throughput technologies. These sequences are routinely deposited in archival databases (e.g., GenBank). Due to our inability to perform “wet” experiments at such a large scale, gene/genome/protein sequences are often subjected to computational analyses to obtain meaningful annotations: such computational analyses aim (among others) to

The functional annotation process often relies on the assumption that homologous genes/proteins perform the same function. Homology can often be inferred when sufficient sequence similarity is detected between two macromolecular sequences; we can then “transfer” existing annotations from a gene/protein of known function to a newly characterized one. Consequently, if erroneous functional information is entered for an entry in a sequence database, the error can be propagated during annotation transfer and will persist in the database.

Several reports in the literature highlight the existence of database errors (and their different types), discuss their significance and study their mode of propagation within databases [1-11].

Open question

A paper published in 2015 [9] presents some particular types of annotation errors identified in a survey by the authors. A trivial case is a typo, the misspelled word “putaitve” (instead of the correct term “putative”), which appears across several entries in the database. The authors of [9] used all-against-all sequence comparisons, followed by sequence similarity-based clustering of the erroneously annotated sequences, in order to trace the errors back to their root, i.e., the initial sources of error which were later propagated in the database. While this specific error does not severely harm the database annotations (as the term “putative” weakens any annotation it accompanies), it can be used as a test-bed to understand how other typos (e.g., in the names of proteins) can be propagated in sequence databases.

Today, almost 10 years after this publication:

Practical Steps

Step 0: Preparatory work

We will use the Cytoscape application [12] for displaying protein networks based on their sequence similarities. You can freely download the software to your computer from https://cytoscape.org/.

Note 1: Computers running Windows/macOS/Linux are supported.

Note 2: Cytoscape is accompanied by a rich collection of extensions (Cytoscape Apps). Feel free to play with any of these useful tools. Nevertheless, for the purposes of this practical the basic Cytoscape functionality (i.e., no extra Apps) will suffice.

Step 1: Identification and retrieval of erroneously annotated protein sequences

Here we will collect our sequence dataset consisting of “putaitve” proteins, as described in [9].
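As an optional illustration of how such a text query could be scripted, the minimal sketch below only builds a search URL for the NCBI E-utilities ESearch service; it does not contact the server. The function name `build_esearch_url`, the default `retmax` value and the use of a plain, untagged search term are assumptions made for this example, and the assignment itself may expect you to run the search through the NCBI web interface instead.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term="putaitve", db="protein", retmax=500):
    """Return an ESearch URL listing record IDs in `db` that match `term`.

    `retmax` (maximum number of IDs returned) is an illustrative default,
    not a value prescribed by the assignment.
    """
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return EUTILS_ESEARCH + "?" + urlencode(params)


# Print the URL; opening it in a browser would return the matching IDs.
print(build_esearch_url())
```

The number of hits you obtain will depend on the date of the search, so it is worth recording the query and the date in your report.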

Step 2: Sequence comparison computation

We will perform sequence similarity-based clustering to identify groups of possibly homologous sequences. If we assume that proteins in the same cluster have “inherited” the typographical error from an initial mis-annotated protein, we could potentially identify the source of error (how?).
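If the all-against-all comparison is run with BLAST and saved in the standard 12-column tabular format (query ID, subject ID, ..., E-value, bit score), the hits can be turned into a list of protein pairs with a few lines of Python. The sketch below is only one possible way to do this: the function names, the E-value cut-off of 1e-5, the interaction label `sim` and the toy identifiers are all assumptions made for illustration. It also writes the pairs as SIF lines, a simple network format that Cytoscape can import.

```python
def parse_blast_tabular(lines, max_evalue=1e-5):
    """Return undirected edges (id_a, id_b) from BLAST tabular hits.

    Assumes the default 12-column tabular layout, where column 1 is the
    query ID, column 2 the subject ID and column 11 the E-value.
    The cut-off `max_evalue` is an illustrative choice.
    """
    edges = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if query == subject or evalue > max_evalue:
            continue  # ignore self-hits and weak similarities
        # Sort the pair so that A-B and B-A collapse into a single edge.
        edges.add(tuple(sorted((query, subject))))
    return edges


def to_sif(edges, interaction="sim"):
    """Format edges as SIF lines: source, interaction label, target."""
    return [f"{a}\t{interaction}\t{b}" for a, b in sorted(edges)]


# Toy hits with made-up identifiers (not real database entries).
hits = [
    "P1\tP2\t95.0\t200\t10\t0\t1\t200\t1\t200\t1e-100\t380",
    "P2\tP1\t95.0\t200\t10\t0\t1\t200\t1\t200\t1e-100\t380",
    "P1\tP1\t100.0\t200\t0\t0\t1\t200\t1\t200\t0.0\t410",
    "P3\tP4\t30.0\t50\t35\t2\t1\t50\t1\t50\t2.5\t25",
]
edges = parse_blast_tabular(hits)
print("\n".join(to_sif(edges)))
```

Proteins without any hit passing the cut-off do not appear in the edge list, so keep the full list of sequence identifiers separately if you want singletons to show up in the network.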

Step 3: Sequence similarity-based clustering

For simplicity, we will not perform “proper” clustering on our sequence similarity data. We will use the similarities detected using BLAST to generate a protein network: proteins will serve as the nodes (or vertices) of the network, and network edges will only be constructed for protein pairs with a detected sequence similarity. Instead of using specialized clustering algorithms for generating network partitions (aka subnetworks), we will use the notion of connected components: a connected component is a maximal set of nodes in which every pair of nodes is joined by a path.
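To make the idea of connected components concrete, here is a minimal sketch that finds them with a breadth-first search over an edge list. The function name and the toy node labels are hypothetical; in the practical itself, Cytoscape's own layout and selection tools can serve the same purpose.

```python
from collections import defaultdict, deque


def connected_components(nodes, edges):
    """Return the connected components of an undirected graph as sets.

    `nodes` lists all proteins (including those without any edge);
    `edges` is an iterable of (id_a, id_b) pairs.
    Components are returned largest first.
    """
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    components = []
    # Also visit nodes that occur only in the edge list.
    for start in list(nodes) + list(adjacency):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return sorted(components, key=len, reverse=True)


# Toy network: A-B-C form one component, D-E another, F is a singleton.
nodes = ["A", "B", "C", "D", "E", "F"]
edges = [("A", "B"), ("B", "C"), ("D", "E")]
components = connected_components(nodes, edges)
for component in components:
    print(sorted(component))
```

Note that a single spurious edge is enough to merge two otherwise unrelated groups into one component, which is one reason the choice of similarity threshold deserves a critical look.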

Note: Optionally, if you want to experiment with a proper clustering method (e.g., as performed in [9]), you can install the clusterMaker2 Cytoscape plugin that offers several options to choose from.

Step 4: Tracing the source(s) of errors

Using the protein identifiers for each connected component/cluster, you can revisit the protein database at the NCBI website and check their annotations.

Step 5: Prepare a report

Prepare a report in the form of a short research paper (max. 2500 words) describing your work for completing Steps 1-4. Your report/paper should contain the following sections (no numbering):

Note 1: You can use Figures, Tables or other visual elements you find appropriate. These elements (and their legends) will not count towards the word count.

Note 2: You are free to use named subsections under any sections.

Note 3: You should include data supplements with all files containing results of your analyses. A short text describing the content of each of these files should be included as an appendix to your paper (not counting towards the word count).

Note 4: You may choose any citation style you prefer as long as it is used uniformly throughout the text.

Note 5: Make sure that you address all questions posed above (even in the introduction).

Note 6: A critical approach when interpreting your results is expected.

Bonus

Any student interested in performing extra work on this assignment (e.g., repeating the work with other databases, or other types of annotation errors) may compete for a bonus mark after discussing with the instructor. Feel free to come up with ideas but let’s discuss before you start to perform any serious work for the bonus …

References

  1. Kyrpides NC, Ouzounis CA. Whole-genome sequence annotation: ‘Going wrong with confidence’. Mol Microbiol. 1999;32(4):886–7.

  2. Devos D, Valencia A. Intrinsic errors in genome annotation. Trends Genet. 2001;17(8):429–31.

  3. Gilks WR, Audit B, De Angelis D, Tsoka S, Ouzounis CA. Modeling the percolation of annotation errors in a database of protein sequences. Bioinformatics. 2002;18(12):1641–9.

  4. Ouzounis CA, Karp PD. The past, present and future of genome-wide re-annotation. Genome Biol. 2002;3(2):COMMENT2001.

  5. Green ML, Karp PD. Genome annotation errors in pathway databases due to semantic ambiguity in partial EC numbers. Nucleic Acids Res. 2005;33(13):4035–9.

  6. Schnoes AM, Brown SD, Dodevski I, Babbitt PC. Annotation error in public databases: misannotation of molecular function in enzyme superfamilies. PLoS Comput Biol. 2009;5(12):e1000605.

  7. Ben-Shitrit T, Yosef N, Shemesh K, Sharan R, Ruppin E, Kupiec M. Systematic identification of gene annotation errors in the widely used yeast mutation collections. Nat Methods. 2012;9(4):373–8.

  8. Percudani R, Carnevali D, Puggioni V. Ureidoglycolate hydrolase, amidohydrolase, lyase: how errors in biological databases are incorporated in scientific papers and vice versa. Database (Oxford). 2013;2013:bat071.

  9. Promponas VJ, Iliopoulos I, Ouzounis CA. Annotation inconsistencies beyond sequence similarity-based function prediction - phylogeny and genome structure. Stand Genomic Sci. 2015;10:108.

  10. Goudey B, Geard N, Verspoor K, Zobel J. Propagation, detection and correction of errors using the sequence database network. Brief Bioinform. 2022;23(6):bbac416.

  11. Kress A, Poch O, Lecompte O, Thompson JD. Real or fake? Measuring the impact of protein annotation errors on estimates of domain gain and loss events. Front Bioinform. 2023;3:1178926.

  12. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003;13(11):2498–504.

  13. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25(17):3389–402.

Suggested Reading

Bioinformatics Research Laboratory @UCY 2005-2023.