BIO650 - Special Topics in Bioinformatics - Assignment 1 (Fall 2023)

Instructor: Associate Prof. Vasilis J. Promponas, promponas.vasileios@ucy.ac.cy

Assignment 1 - Becoming sequence detectives

Background information

Biomolecular sequences are produced at ever-increasing rates, mainly due to the continuous reduction in sequencing costs and the availability of the underlying high-throughput technologies. These sequences are routinely deposited in archival databases (e.g., GenBank). Because “wet” experiments cannot be performed at such a large scale, gene/genome/protein sequences are often subjected to computational analyses to obtain meaningful annotations. Such computational analyses aim (among others) to:

Characterize the extent of coding regions.
Identify exon/intron boundaries.
Functionally characterize gene products (i.e., assign possible functions).

The functional annotation process often relies on the assumption that homologous genes/proteins perform the same function. Homology can often be inferred when sufficient sequence similarity is detected between two macromolecular sequences; we can then “transfer” existing annotations from a gene/protein of known function to a newly characterized one. Obviously, if erroneous functional information is entered for an entry in a sequence database, the error can be propagated during annotation transfer and thus persist in the database.

Several reports in the literature highlight the existence of database errors (and their different types), discuss their significance, and study their mode of propagation within databases [1-11].

Open question

A paper published in 2015 [9] presents some particular types of annotation errors identified in a survey by the authors. A trivial case is a typo, the misspelled word “putaitve” (instead of the correct term “putative”), which appears across several entries in the database. The authors of [9] used all-against-all sequence comparisons, followed by sequence similarity-based clustering of the erroneously annotated sequences, in order to trace the errors back to their root, i.e., the initial sources of error that were later propagated in the database. While this specific error does not severely harm the database annotations (as the term “putative” weakens any annotation it accompanies), it can be used as a test-bed to understand how other typos (e.g., in the names of proteins) can be propagated in sequence databases.

Today, almost 10 years after this publication:

Do you expect that this particular misspelling still exists in the database?
Can you predict whether the number of misspelled entries has changed (increased or decreased)?
Repeat the same sequence analysis steps, essentially becoming sequence detectives, in order to reproduce the results shown in Figure 1 of [9], based on the current database status.
Try and identify the possible source(s) of error.
Compare your results to the results in [9].
Discuss your findings, also placing them in the context of the available literature. Do you believe typographical errors are easy to detect (and eventually correct) in sequence databases?

Practical Steps

Step 0: Preparatory work

We will use the Cytoscape application [12] for displaying protein networks based on their sequence similarities. You can freely download the software to your computer from https://cytoscape.org/.

Note 1: Computers running Windows/MacOS/Linux are supported.

Note 2: Cytoscape is accompanied by a rich collection of extensions (Cytoscape Apps). Feel free to play with any of these useful tools. Nevertheless, for the purposes of this practical the basic Cytoscape functionality (i.e., no extra Apps) will suffice.

Step 1: Identification and retrieval of erroneously annotated protein sequences

Here we will collect our sequence dataset consisting of “putaitve” proteins, as described in [9].

Use your favorite web browser to navigate to the protein database at the NCBI website.
Enter the keyword putaitve in the text box and search the database.
- How many protein sequence entries are retrieved by this search?
- Can you identify where the typographical error is found?
- How does the number of entries you found compare to the one reported in [9]?
Download the erroneous database entries to a local file on your computer.
- Tip: At the top-right of the results page, select “Send to”, then “File”. Make sure to select the FASTA format, as we will use the sequences as input to software that recognizes this format.
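
Once the file is saved, a quick sanity check can confirm how many entries were downloaded and how many FASTA headers actually carry the misspelling. The following is only a minimal Python sketch, not part of the protocol in [9]; the helper names and the sample records are hypothetical placeholders.

```python
from typing import Dict, Iterable, List

TYPO = "putaitve"


def read_fasta_headers(lines: Iterable[str]) -> List[str]:
    """Return the header lines of a FASTA stream, without the leading '>'."""
    return [line[1:].strip() for line in lines if line.startswith(">")]


def summarize_headers(headers: List[str], typo: str = TYPO) -> Dict[str, int]:
    """Count all entries and those whose header contains the typo (case-insensitive)."""
    with_typo = sum(1 for header in headers if typo in header.lower())
    return {"entries": len(headers), "with_typo": with_typo}


# Invented records, used only to illustrate the expected input shape.
sample = [
    ">HYPO_0001.1 putaitve kinase [Example organism]",
    "MKTAYIAKQR",
    ">HYPO_0002.1 hypothetical protein [Example organism]",
    "MAAGLLPWQE",
]
print(summarize_headers(read_fasta_headers(sample)))
```

To apply this to your own download, pass an open file handle (e.g., `open("my_sequences.fasta")`, with whatever file name you chose) to `read_fasta_headers`. Entries counted without the typo in their header may carry it elsewhere in the full database record, which is worth checking by hand.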

Step 2: Sequence comparison computation

We will perform sequence similarity-based clustering to identify groups of possibly homologous sequences. If we assume that proteins in the same cluster have “inherited” the typographical error from an initially mis-annotated protein, we could potentially identify the source of error (how?).

Computation of all-versus-all protein sequence comparisons. For this purpose we will use the NCBI BLAST webserver [13].
- Use your browser to access the NCBI BLAST page.
- Take a minute or two to see what BLAST options are available. Then, select “Protein BLAST” (we deal with protein sequences after all!!).
- In the top panel (titled “Enter Query Sequence”) select the checkbox “Align two or more sequences”. Two similar areas for entering the sequences to be compared then become available.
- To enter the data for all-versus-all sequence comparison you may
  - open the FASTA file containing the sequences you downloaded in Step 1 and copy-paste all sequences to the two text boxes provided, or
  - directly upload your files (use the “Choose file” button).
- Scroll down to “Algorithm parameters” and set “Max target sequences” to the maximum possible value. Leave all other parameters at their default values.
- Press the BLAST button to execute the comparisons. Be patient; the results will appear in a few moments.
- Take a few moments to explore the comparison results in your browser. For each sequence you submitted you can see the results on a separate page.
- Save the results on your computer by clicking the “Download All” link.
- Tip: There are several available formats. Select “Hit table(csv)” and save the file on your computer.
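
Before moving on, it can help to inspect the downloaded hit table programmatically. The sketch below assumes the CSV follows the standard BLAST tabular column order (query, subject, % identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score); verify this against your own file, since the layout is an assumption here. The function names and sample rows are hypothetical.

```python
import csv
import io
from typing import Iterable, List, Set, Tuple

# Assumed 0-based column positions (standard BLAST tabular order).
QUERY_COL, SUBJECT_COL, EVALUE_COL, BITSCORE_COL = 0, 1, 10, 11

Hit = Tuple[str, str, float, float]


def parse_hits(lines: Iterable[str]) -> List[Hit]:
    """Parse hit-table rows into (query, subject, evalue, bit_score) tuples."""
    hits: List[Hit] = []
    for row in csv.reader(lines):
        if len(row) < 12 or row[0].startswith("#"):
            continue  # skip blank lines, comments, and short rows
        try:
            evalue = float(row[EVALUE_COL])
            bitscore = float(row[BITSCORE_COL])
        except ValueError:
            continue  # skip rows without numeric scores (e.g., a header line)
        hits.append((row[QUERY_COL], row[SUBJECT_COL], evalue, bitscore))
    return hits


def unique_pairs(hits: Iterable[Hit]) -> Set[Tuple[str, ...]]:
    """Collapse hits into unordered pairs of distinct proteins (self-hits dropped)."""
    return {tuple(sorted((q, s))) for q, s, _, _ in hits if q != s}


# Invented rows, used only to illustrate the expected input shape.
SAMPLE = """\
P1,P1,100.0,50,0,0,1,50,1,50,1e-30,105
P1,P2,80.0,50,10,0,1,50,1,50,1e-20,85.5
P2,P1,80.0,50,10,0,1,50,1,50,1e-20,85.5
P3,P3,100.0,40,0,0,1,40,1,40,1e-25,90
"""
hits = parse_hits(io.StringIO(SAMPLE))
print(len(hits), "hits;", len(unique_pairs(hits)), "distinct protein pair(s)")
```

In an all-versus-all comparison every sequence also matches itself, and each similar pair is typically reported in both directions; collapsing the hits as above makes it easier to see how many distinct pairs of proteins are actually similar.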

Step 3: Sequence similarity-based clustering

For simplicity, we will not perform “proper” clustering on our sequence similarity data. Instead, we will use the similarities detected by BLAST to generate a protein network: proteins will serve as the nodes (or vertices) of the network, and an edge will be drawn only between protein pairs with a detected sequence similarity. Rather than using specialized clustering algorithms to partition the network into subnetworks, we will rely on the notion of connected components: maximal sets of nodes in which every pair of nodes is linked by a path.
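
To make the notion concrete, the following minimal Python sketch computes connected components from a list of similar protein pairs using breadth-first search. It is an illustration only (Cytoscape will do this for you in the steps below); the function name and the example edges are hypothetical.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Set, Tuple


def connected_components(edges: Iterable[Tuple[str, str]]) -> List[Set[str]]:
    """Return the connected components of an undirected graph, largest first."""
    neighbours: Dict[str, Set[str]] = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)  # a self-hit (a == b) still registers the node

    seen: Set[str] = set()
    components: List[Set[str]] = []
    for start in neighbours:
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from `start`.
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    component.add(nxt)
                    queue.append(nxt)
        components.append(component)
    return sorted(components, key=len, reverse=True)


# Invented edges: P1-P2-P3 form a chain, P5-P6 a pair, P4 only matches itself.
example_edges = [("P1", "P2"), ("P2", "P3"), ("P4", "P4"), ("P5", "P6")]
for component in connected_components(example_edges):
    print(len(component), sorted(component))
```

Note that P1 and P3 end up in the same component although they are not directly similar: membership only requires a path of similarities, which is why connected components are a cruder grouping than a dedicated clustering algorithm.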

Construction and Visualization of the sequence similarity-based network
- Start Cytoscape on your computer (be patient, it takes some time to load).
- You are prompted to start a new session. Select “From Network File”.
- In the dialog box that appears navigate to the location where you have downloaded the BLAST output file.
- Cytoscape will import your data and construct a network based on your choice of parameters:
  - Under “Interaction definition” you need to make the following selections
    - “Source Interaction” -> Select “Column 1”
    - “Target Interaction” -> Select “Column 2”
    - “Interaction Type” -> Select “Column 12”
  - Under “Advanced” check “Show Text File Import Options”, then uncheck “Transfer first line …”
- Then press “OK”. Your network is now constructed.
- From the “Layout” menu, select “Prefuse Force Directed Layout”. Now your network displays in a more comprehensible way.
  - Can you now recognize the connected components of your network?
  - How many are there?
Computation of all connected components
- From the “Tools” menu, select “NetworkAnalyzer”->“Network Analysis”->“Analyze Network”, then choose “Treat the network as undirected” and press “OK”.
- Under the “Simple Parameters” tab you can inspect several parameters of your network.
  - Can you see the number of connected components computed? Compare to your previous observation.
  - Observe the “Number of nodes” and “Number of self-loops”. Can you comment on these values?
- From the “Tools” menu, select “NetworkAnalyzer”->“Subnetwork Creation”->“Extract Connected Components”.
- A new window appears, showing the individual connected components detected in your protein network. In parentheses, you have the respective number of nodes (i.e., proteins) in each connected component.
  - How many proteins exist in each connected component?
  - How can you interpret these figures?
- By selecting a connected component and pressing the “Extract” button you generate a new network based only on this particular connected component.
  - Extract all the connected components (“clusters”) and record which proteins participate in each one. Tip: You may want to use the export function and save the node table as a file for easier processing.

Note: Optionally, if you want to experiment with a proper clustering method (e.g., as performed in [9]), you can install the clusterMaker2 Cytoscape plugin that offers several options to choose from.

Step 4: Tracing the source(s) of errors

Using the protein identifiers for each connected component/cluster you can revisit the protein database at the NCBI website and check their annotations.

For each cluster, seek the source of error in the database.
Using any cluster (with >5 members) of your choice, present in detail your rationale and how you worked to identify the source of error.
When tracing errors for different protein clusters, do you observe any recurring features?
Can you comment on the time(s) when the erroneous sequence entries/annotations appeared in the database?
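
If you record a date for each entry while inspecting the database records, a small script can help organize the timeline per cluster. The sketch below uses one possible heuristic, namely treating the earliest-dated entry of a cluster as a candidate source; this is an assumption made for illustration, not the procedure of [9], and all accessions and dates shown are invented placeholders.

```python
from datetime import date
from typing import Dict, List, Tuple


def earliest_entries(
    clusters: Dict[str, List[str]],
    first_seen: Dict[str, date],
) -> Dict[str, Tuple[str, date]]:
    """For each cluster, return the member with the earliest recorded date."""
    result: Dict[str, Tuple[str, date]] = {}
    for name, members in clusters.items():
        # Members without a recorded date are ignored rather than guessed.
        dated = [(first_seen[m], m) for m in members if m in first_seen]
        if dated:
            when, accession = min(dated)
            result[name] = (accession, when)
    return result


# Hypothetical cluster membership and dates, for illustration only.
clusters = {"cluster_1": ["HYPO_0001", "HYPO_0002", "HYPO_0003"]}
first_seen = {
    "HYPO_0001": date.fromisoformat("2009-06-01"),
    "HYPO_0002": date.fromisoformat("2007-02-15"),
    "HYPO_0003": date.fromisoformat("2011-11-30"),
}
print(earliest_entries(clusters, first_seen))
```

The earliest date is only a starting point: an entry may have been updated after its first deposition, so the candidate should be confirmed by examining the records themselves.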

Step 5: Prepare a report

Prepare a report in the form of a short research paper (max. 2500 words) describing your work for completing Steps 1-4. Your report/paper should contain the following sections (no numbering):

Title page
Introduction
Data and Methods
Results
Discussion
References

Note 1: You can use Figures, Tables or other visual elements you find appropriate. These elements (and their legends) will not count towards the word count.

Note 2: You are free to use named subsections under any sections.

Note 3: You should include data supplements with all files containing results of your analyses. A short text describing the content of each of these files should be included as an appendix to your paper (not counting towards the word count).

Note 4: You may choose any citation style you prefer as long as it is used uniformly throughout the text.

Note 5: Make sure that you address all questions posed above (even in the introduction).

Note 6: A critical approach when interpreting your results is expected.

Bonus

Any student interested in performing extra work on this assignment (e.g., repeating the work with other databases, or other types of annotation errors) may compete for a bonus mark after discussing with the instructor. Feel free to come up with ideas but let’s discuss before you start to perform any serious work for the bonus …

References

[1] Kyrpides NC, Ouzounis CA. Whole-genome sequence annotation: ‘Going wrong with confidence’. Mol Microbiol. 1999;32(4):886–7.
[2] Devos D, Valencia A. Intrinsic errors in genome annotation. Trends Genet. 2001;17(8):429–31.
[3] Gilks WR, Audit B, De Angelis D, Tsoka S, Ouzounis CA. Modeling the percolation of annotation errors in a database of protein sequences. Bioinformatics. 2002;18(12):1641–9.
[4] Ouzounis CA, Karp PD. The past, present and future of genome-wide re-annotation. Genome Biol. 2002;3(2):COMMENT2001.
[5] Green ML, Karp PD. Genome annotation errors in pathway databases due to semantic ambiguity in partial EC numbers. Nucleic Acids Res. 2005;33(13):4035–9.
[6] Schnoes AM, Brown SD, Dodevski I, Babbitt PC. Annotation error in public databases: misannotation of molecular function in enzyme superfamilies. PLoS Comput Biol. 2009;5(12):e1000605.
[7] Ben-Shitrit T, Yosef N, Shemesh K, Sharan R, Ruppin E, Kupiec M. Systematic identification of gene annotation errors in the widely used yeast mutation collections. Nat Methods. 2012;9(4):373–8.
[8] Percudani R, Carnevali D, Puggioni V. Ureidoglycolate hydrolase, amidohydrolase, lyase: how errors in biological databases are incorporated in scientific papers and vice versa. Database (Oxford). 2013;2013:bat071.
[9] Promponas VJ, Iliopoulos I, Ouzounis CA. Annotation inconsistencies beyond sequence similarity-based function prediction - phylogeny and genome structure. Stand Genomic Sci. 2015;10:108.
[10] Goudey B, Geard N, Verspoor K, Zobel J. Propagation, detection and correction of errors using the sequence database network. Brief Bioinform. 2022;23(6):bbac416.
[11] Kress A, Poch O, Lecompte O, Thompson JD. Real or fake? Measuring the impact of protein annotation errors on estimates of domain gain and loss events. Front Bioinform. 2023;3:1178926.
[12] Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003;13(11):2498–504.
[13] Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25(17):3389–402.