This project begins with you going to the NCBI site and downloading the prokaryote DNA sequence for Mycoplasma genitalium G37, complete genome. This is one of the smallest life forms on earth, with 580076 bp in its DNA, and because it is a prokaryote the majority of its DNA is coding DNA, without the intron complications found in eukaryotes. It is a bacterium that resides in the implied location on humans. Here are the steps you need to go through to print out all ORFs that are greater than 300 codons long. Note, as we discussed in class, you will need to make three passes on the given strand and three passes on the complementary strand going in the opposite direction. We will also analyze the pattern of start and stop codons in this sequence.
- Download the genome in FASTA format.
- Open this file in Python and read it in, constructing a single DNA string 580076 characters long. Be sure to remove the header line (the one starting with ">") at the top of the file. An easy way to do this is to read in every remaining line, strip the white space, and concatenate the resulting strings together. Print out the strand and its length to make sure this works.
- Before we search for ORFs, let's first analyze the pattern of start codons and stop codons in the above sequence. It should be noted that stop codons always stop the translation while start codons (ATG) do not always start a translation. How many start and stop codons are there? Are there more starts than stops or vice versa? Check this out: write a script that counts them. Remember there are three different stops (TAA, TAG, and TGA). How many times is the distance from one stop to the next stop greater than 600 characters? Are there multiple starts found along these long stretches of no stops? What thoughts do you have about this? Have your program print a summary for each of the six frames (three forward and three on the complementary strand) as follows. Large stop blocks are stretches more than 600 characters long without any intervening stops. For each frame print a line like this:
Frame # : startct= # , stopct= # , large stop blocks= #.
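
The steps so far can be sketched roughly as follows. This is only one possible layout, not a required solution. The function names (`read_fasta`, `reverse_complement`, `frame_stats`, `six_frame_report`) are made up for illustration, frames 4 to 6 are assumed to be the three frames of the reverse complement, and the stop-to-stop distance is assumed to be measured between the first bases of consecutive stop codons in the same frame. The stop set follows the standard genetic code as stated above; GenBank annotates this genome with translation table 4, in which TGA codes for tryptophan, so the set is kept as an easily changed constant.

```python
STARTS = {"ATG"}
STOPS = {"TAA", "TAG", "TGA"}  # standard code; drop "TGA" to try table 4

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def read_fasta(path):
    """Return the sequence of a single-record FASTA file as one string."""
    parts = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            # Skip blank lines and the ">" header line.
            if line and not line.startswith(">"):
                parts.append(line.upper())
    return "".join(parts)


def reverse_complement(dna):
    """Complement every base, then reverse the string."""
    return dna.translate(COMPLEMENT)[::-1]


def frame_stats(dna, offset, min_block=600):
    """Count starts, stops, and large stop-to-stop blocks in one frame."""
    start_ct = stop_ct = large_blocks = 0
    last_stop = None  # position of the previous stop codon in this frame
    for i in range(offset, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if codon in STARTS:
            start_ct += 1
        elif codon in STOPS:
            stop_ct += 1
            if last_stop is not None and i - last_stop > min_block:
                large_blocks += 1
            last_stop = i
    return start_ct, stop_ct, large_blocks


def six_frame_report(dna):
    """Build one summary line per frame: 1-3 forward, 4-6 reverse complement."""
    lines = []
    for strand_no, strand in enumerate((dna, reverse_complement(dna))):
        for offset in range(3):
            starts, stops, blocks = frame_stats(strand, offset)
            frame = strand_no * 3 + offset + 1
            lines.append(
                f"Frame {frame} : startct= {starts} , stopct= {stops} , "
                f"large stop blocks= {blocks}."
            )
    return lines


# Toy sequence so the sketch runs without the genome file. For the real data:
#   dna = read_fasta("your_downloaded_file.fasta")  # hypothetical filename
for line in six_frame_report("ATGAAATAA"):
    print(line)
```

Keeping the per-frame counting in its own function means the same loop serves all six frames; only the strand and the offset change.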
- We will restrict our analysis to these large stop-to-stop blocks. But before we do this, let's look at the GenBank file for this little guy. It contains the actual genes that the original researchers annotated. Normally, when the sequencing is first performed, this information is not known. The researchers have to look at every ORF and either convert it to a protein sequence and check whether that protein is known, or at least look at the sequence statistically and see whether its pattern resembles that of known proteins. In this file you will notice CDS entries. The Coding Sequence (CDS) is the actual region of DNA that is supposedly translated to form a protein (genes for tRNA and rRNA are not translated and appear under their own feature entries). Some are hypothetical in the sense that the protein was not observed at the time of the annotation.
While an ORF in a eukaryote may contain introns, the ORF and the CDS are the same in prokaryotes. Since this is a long file, write a program that extracts the gene information using regular expressions. Put each gene in a dictionary, with its start location as the key and its length as the value. What is the smallest gene? What is the largest gene? We can use this dictionary to check whether any of the ORFs we find in the FASTA file are in the dictionary. I will discuss regular expressions on Monday.
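
As a starting point, here is a minimal sketch of the extraction. It assumes the usual GenBank flat-file layout, in which a feature line begins with five spaces, then the feature key, then a location such as `686..1828` or `complement(686..1828)`. The name `extract_features` and the sample text are invented for illustration, and the coordinates in the sample are not taken from the real file. Locations written with `join(...)` are simply skipped, and for complement features the key is the lower coordinate as written in the file; both are simplifying assumptions you may want to revisit.

```python
import re

# Five spaces, feature key, spaces, optional "complement(", then start..end.
# "<" and ">" mark partial features in GenBank files and are tolerated here.
FEATURE_RE = re.compile(
    r"^ {5}(CDS|gene) +(complement\()?<?(\d+)\.\.>?(\d+)\)?[ \t]*$",
    re.MULTILINE,
)


def extract_features(genbank_text, feature="CDS"):
    """Map start location -> length for every simple feature of one kind."""
    genes = {}
    for match in FEATURE_RE.finditer(genbank_text):
        if match.group(1) != feature:
            continue
        start, end = int(match.group(3)), int(match.group(4))
        # GenBank coordinates are 1-based and inclusive.
        genes[start] = end - start + 1
    return genes


# Invented sample in GenBank layout (toy coordinates, not real annotations).
SAMPLE = "\n".join([
    "     gene            10..99",
    '                     /note="toy entry"',
    "     CDS             10..99",
    '                     /note="toy entry"',
    "     CDS             complement(200..349)",
    "     CDS             join(400..450,500..560)",
])

genes = extract_features(SAMPLE)
smallest = min(genes, key=genes.get)  # start location of the shortest gene
largest = max(genes, key=genes.get)   # start location of the longest gene
print(genes)
print("smallest starts at", smallest, "largest starts at", largest)
```

For the real file, read its whole contents into a string with `open(...).read()` and pass that string to `extract_features`.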
- To be continued