This project begins with you going to the NCBI site and downloading the prokaryote DNA sequence for Mycoplasma genitalium G37, complete genome. This is one of the smallest life forms on earth, with 580076 bp in its DNA, and because it is a prokaryote, the majority of its DNA is coding DNA without the intron complications found in eukaryotes. It is a bacterium that resides in the implied location on humans. Here are the steps you need to go through to print out all ORFs that are greater than 300 codons long. Note, as we discussed in class, you will need to make three passes on the given strand and three passes on the complementary strand going in the opposite direction. We will also analyze the pattern of start and stop codons in this sequence.
- Download the genome in FASTA format.
- Open this file in Python and read it in, constructing a single DNA string 580076 characters long. Be sure to remove the header information in the file. An easy way to do this is to read in every line after the header, remove the white space, and concatenate the resulting strings together. Print out the strand and its length to make sure this works.
- Part A of this project is the analysis of the pattern of start codons and stop codons in the above sequence. Note that stop codons always stop translation, while start codons do not always start it. How many start and stop codons are there? Are there more starts than stops, or vice versa? Write a script that counts them. Remember there are three different stops. How many times is the distance from one stop to the next stop greater than 600? Are there multiple starts along these long stretches with no stops? What thoughts do you have about this? Have your program print out the six frames (three forward and three complementary) as follows. Large stop blocks are stretches of more than 600 characters without any intervening stops. For each frame print a line like this:
Frame # : startct= # , stopct= # , large stop blocks= #.
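The reading and counting steps above could be sketched roughly as follows. This is only one possible approach, not the required solution. It assumes ATG is the only start codon, that the three stops are TAA, TAG and TGA, and that the stretch before the first stop in a frame is not counted as a block. The names `read_fasta`, `reverse_complement`, `frame_stats` and `report`, and the file name `genome.fasta`, are made up for illustration.

```python
STOPS = {"TAA", "TAG", "TGA"}   # assumption: the three standard stop codons
START = "ATG"                   # assumption: ATG is the only start codon
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def read_fasta(lines):
    """Join the sequence lines of a single-record FASTA file, skipping the '>' header."""
    return "".join(line.strip() for line in lines if not line.startswith(">")).upper()


def reverse_complement(dna):
    """Return the complementary strand, read in the opposite direction."""
    return dna.translate(COMPLEMENT)[::-1]


def frame_stats(strand, offset, min_block=600):
    """Count starts, stops, and stop-to-stop gaps longer than min_block in one frame."""
    starts = stops = large_blocks = 0
    last_stop = None  # position of the previous stop in this frame
    for i in range(offset, len(strand) - 2, 3):
        codon = strand[i:i + 3]
        if codon == START:
            starts += 1
        elif codon in STOPS:
            stops += 1
            if last_stop is not None and i - last_stop > min_block:
                large_blocks += 1
            last_stop = i
    return starts, stops, large_blocks


def report(dna):
    """Print one summary line per frame: three forward, then three complementary."""
    frame = 1
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            s, t, b = frame_stats(strand, offset)
            print(f"Frame {frame} : startct= {s} , stopct= {t} , large stop blocks= {b}.")
            frame += 1


# Hypothetical usage; "genome.fasta" stands for whatever you named your download:
# with open("genome.fasta") as fh:
#     dna = read_fasta(fh)
# print(len(dna))
# report(dna)
```

STOPS is kept as a module-level set so it is easy to change if you want to experiment with a different set of stop codons.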
- As we discussed in class, build a dict() that associates each stop with all the start codons just prior to it. The stop is the key and the list of starts is the value.
- We will restrict our analysis to these large stop to stop blocks.
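A minimal sketch of such a dict for a single frame is given here, under the same assumptions as before (ATG as the only start, TAA/TAG/TGA as the stops). "Just prior to a stop" is read here as "since the previous stop in the same frame", which is an interpretation rather than something fixed by the assignment. The function name `starts_by_stop` is hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}   # assumption: the three standard stop codons
START = "ATG"                   # assumption: ATG is the only start codon


def starts_by_stop(strand, offset, min_block=600):
    """Map each stop position to the start positions seen since the previous stop.

    Only stop-to-stop blocks longer than min_block characters that contain at
    least one start are kept. The stretch before the first stop is skipped.
    """
    table = {}
    pending = []       # starts seen since the last stop
    last_stop = None   # position of the previous stop in this frame
    for i in range(offset, len(strand) - 2, 3):
        codon = strand[i:i + 3]
        if codon == START:
            pending.append(i)
        elif codon in STOPS:
            if last_stop is not None and i - last_stop > min_block and pending:
                table[i] = pending
            pending = []
            last_stop = i
    return table
```

With this layout, the longest candidate ORF ending at a given stop runs from the first start in its list to that stop.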
- Part B: But before we do this, let’s look at the GenBank file for this little guy. It contains the actual genes that the original researchers annotated. Normally, when the sequencing is first performed, this information is not known. The researchers have to look at every ORF and either convert it to a protein sequence and check whether this protein is known, or at least look at the sequence statistically and see whether its pattern resembles known proteins. In this file you will notice CDS entries. The coding sequence (CDS) is the actual region of DNA that is supposedly translated to form a protein. Some are hypothetical in the sense that the protein was not observed at the time of the annotation.
While an ORF may contain introns in eukaryotes, the ORF and the CDS are the same in prokaryotes. Since this is a long file, write a program that extracts the gene information using regular expressions. Put each gene in a list (or some other data structure, e.g. a dict()), with the start location as its first value and the stop as its second. Print out the smallest gene, the largest gene in length, and the number of genes. We can use this list to check whether any of the ORFs we find in the FASTA file are among the annotated genes. I will discuss regular expressions on Monday.
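As a starting point, the extraction might look something like the sketch below. It assumes the CDS lines in the feature table have the simple forms `CDS  start..stop` or `CDS  complement(start..stop)`. Anything more complicated, such as a `join(...)` location, is skipped on purpose. The names `CDS_RE`, `parse_cds` and `summarize` are invented for this example.

```python
import re

# Assumption: a feature line starts with five spaces, then the key "CDS",
# then a simple location, optionally wrapped in complement(...).
CDS_RE = re.compile(
    r"^ {5}CDS\s+(complement\()?<?(\d+)\.\.>?(\d+)\)?[ \t]*$",
    re.MULTILINE,
)


def parse_cds(text):
    """Return a list of (start, stop, on_complement) tuples for simple CDS locations."""
    genes = []
    for m in CDS_RE.finditer(text):
        genes.append((int(m.group(2)), int(m.group(3)), m.group(1) is not None))
    return genes


def gene_length(gene):
    """Length in bases of a (start, stop, on_complement) tuple, ends included."""
    return gene[1] - gene[0] + 1


def summarize(genes):
    """Return the smallest gene, the largest gene, and the number of genes."""
    return min(genes, key=gene_length), max(genes, key=gene_length), len(genes)
```

Keeping the strand flag as the third value of each tuple means the start and stop stay in the first two positions, as the assignment asks, while still recording which strand the gene is on.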
- Part C: The final stage of this program is to determine which of the large ORFs that you find in the FASTA file are actual genes in the gb file. Just go through either the FASTA or the GenBank data and see if each gene or ORF is in the other. Print out the number that you find and the largest 5 genes. For each, just print its start and stop value and whether or not it is on the complementary strand. Also print out the number of large ORFs that are not found in the gb file.
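One way the comparison could be set up is sketched below; treat it as a suggestion, not the expected answer. It assumes ORFs and genes are both stored as (low, high, on_complement) tuples in 1-based GenBank coordinates, that the annotated location includes the stop codon, and that an ORF "is" a gene when the strand and the stop end agree. Matching on the stop end rather than on both ends is a design choice made here because several starts can precede one stop. All function names are hypothetical.

```python
def to_genbank_coords(start, stop, n, on_complement):
    """Convert 0-based codon positions on the scanned strand to 1-based (low, high).

    start and stop are the indices of the first base of the start and stop
    codons on the strand that was scanned; n is the genome length.
    """
    if on_complement:
        return n - stop - 2, n - start
    return start + 1, stop + 3


def stop_end(low, high, on_complement):
    """Coordinate of the last base of the stop codon (the 3' end of the gene)."""
    return low if on_complement else high


def compare(orfs, genes):
    """Split ORFs into those matching an annotated gene and those that do not."""
    annotated = {(stop_end(lo, hi, comp), comp) for lo, hi, comp in genes}
    found = [o for o in orfs if (stop_end(*o), o[2]) in annotated]
    missing = [o for o in orfs if (stop_end(*o), o[2]) not in annotated]
    return found, missing


def largest(orfs, k=5):
    """Return the k longest entries, longest first."""
    return sorted(orfs, key=lambda o: o[1] - o[0], reverse=True)[:k]
```

The lengths of `found` and `missing` give the two counts asked for, and `largest(found)` gives the five entries to print with their start, stop, and strand.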