Implementation of the classic Dynamic Programming problem using the Needleman–Wunsch algorithm which requires quadratic space & time complexity.
Given 2 sequences, find the minimum cost of aligning the 2 sequences
Gaps can be inserted to 1 sequence or the other, but not at the same time
2
= Gap Penalty (δ)
If 2 characters are aligned with each other, there may be a mismatch penalty (αi j)
0
= aligning identical letters1
= aligning a vowel with a vowel, or a consonant with a consonant3
= aligning a vowel with a consonant
Minimum cost = sum of mismatch & gap penalties (the optimal alignment)
O(M*N)
Storage: also O(M*N) which requires a large 2D array
- Letters only (no spaces, numbers or special characters)
_
(Underscore character) is reserved to represent- View results in a fixed-width font for the 2 sequences to be lined up
- Leading & trailing whitespace is trimmed
- Uppercase / Lowercase is ignored (all converted to lowercase)
predecessorIndexes
is calculated when creatingmemoTable
but not used to find the actual alignment, only to show where the values inmemoTable
came from- For example: if
predecessorIndexes[4][4]
contained the array[4, 3]
, it means the value ofmemoTable[4][4]
(which is6
) came frommemoTable[4][3]
(i.e.case3
,seq2
with a gap so it came from the left) predecessorIndexes[0][0]
contains the array[-1, -1]
because the upper left corner has no predecessor, it was the initial base case
- Provide 2 strings (edit
testSequences
2D array) - Optionally change the
GAP_PENALTY
and the mismatch penalties
CurrentlyVOWEL_VOWEL_PENALTY
andCONSONANT_CONSONANT_PENALTY
are the same, but are arbitrary
findAlignment()
should only be called from insidecalcOptimalAlignment()
because it usesmemoTable
& the 2 sequences to display the actual alignmentseq1
&seq2
have a leading space added (this is after trimming any trailing/leading whitespace)
This is so thatseq1.charAt(i)
&seq2.charAt(j)
work in the loops & the string indexes match up
It causes some adjustments for the array sizes.memoTable = new int[seq1.length()][seq2.length()]
uses the exact value oflength()
which includes the space- Loops like
for(int i=0; i<seq1.length(); i++)
use<seq1.length()
to not go out of bounds of the string indexes - In
findAlignment()
,int i = seq1.length()-1;
&int j = seq2.length()-1;
since the 2 sequences have leading spaces & arrays are indexes from0
findAlignment()
basically retraces each calculation in thememoTable
to see where it came from
There may be multiple optimal paths, but this only finds1
of them
- String Alignment using Dynamic Programming - Gina M. Cannarozzi to retrace the memo table & find the alignment