Simple DNA processing with suffix arrays

In this homework concerning suffix arrays you are to perform the following operations on a string using Python.  This is due on test day next week.

1. Download read in a DNA strand.  Go to NCBI and download the FASTA file for Homo sapiens neanderthalensis D-loop, partial sequence; mitochondrial.  Its GenBank number is JQ670672.1.  Use  the send to tab on the upper right to save it to a file. It should look like the following after downloaded.  Set the database to Nucleotide to search for the above.

>gi|398637208|gb|JQ670672.1| Homo sapiens neanderthalensis D-loop,...

Your little program should do the following.

  1. Your initial job is to open the above file and read in the text.  You are to strip the first line and collect the actual DNA into a single variable called dna.  This should be one long string without any white space, just A,C,T,and G’s.  Now concat a ‘$’ onto the end of the string.  Print out the length with appropriate comment.
  2.  We will define suffix i of the string s as s[i:].  For example if s=’ABCDE’ then suffix 0 is the entire string, suffix 1 is ‘BCDE’, suffix 2 is ‘CDE’ and so on.  You are to now create a list of all possible suffixes, called SuffixList, of the dna string from suffix 0 to suffix n where n=len(dna).  Print out the total number of characters in this list with appropriate comment.  What would the total size be if there are N characters in the above string?
  3. Now create a list of integers called SA, (ie this is the suffix array) ,such that the numbers in the lists represent the indices of the suffixes in sorted order. In other words SuffixList[SA[i]] from i=0 to n is the list of the suffixes in sorted order.  To do this just sort the SA[i] array using an index sort. The indexes are with respect to the above string.  What is the complexity of your sort.
  4. print out SuffixList[SA[5]] together with its length.
  5.  Search the Suffix array in log n time (bsearch) and print out the range of suffixes that are prefixed with “CATG”, ie give the low subscript and high subscript.  Print the suffix at the low subscript.

Comments are closed.