-
-
Save slowkow/06c6dba9180d013dfd82bec217d22eb5 to your computer and use it in GitHub Desktop.
| #!/usr/bin/env python | |
| """ | |
| The Needleman-Wunsch Algorithm | |
| ============================== | |
| This is a dynamic programming algorithm for finding the optimal alignment of | |
| two strings. | |
| Example | |
| ------- | |
| >>> x = "GATTACA" | |
| >>> y = "GCATGCU" | |
| >>> print(nw(x, y)) | |
| G-ATTACA | |
| GCA-TGCU | |
| LICENSE | |
| This is free and unencumbered software released into the public domain. | |
| Anyone is free to copy, modify, publish, use, compile, sell, or | |
| distribute this software, either in source code form or as a compiled | |
| binary, for any purpose, commercial or non-commercial, and by any | |
| means. | |
| In jurisdictions that recognize copyright laws, the author or authors | |
| of this software dedicate any and all copyright interest in the | |
| software to the public domain. We make this dedication for the benefit | |
| of the public at large and to the detriment of our heirs and | |
| successors. We intend this dedication to be an overt act of | |
| relinquishment in perpetuity of all present and future rights to this | |
| software under copyright law. | |
| THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, | |
| EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF | |
| MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. | |
| IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR | |
| OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, | |
| ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR | |
| OTHER DEALINGS IN THE SOFTWARE. | |
| For more information, please refer to <http://unlicense.org/> | |
| """ | |
| import numpy as np | |
| def nw(x, y, match = 1, mismatch = 1, gap = 1): | |
| nx = len(x) | |
| ny = len(y) | |
| # Optimal score at each possible pair of characters. | |
| F = np.zeros((nx + 1, ny + 1)) | |
| F[:,0] = np.linspace(0, -nx * gap, nx + 1) | |
| F[0,:] = np.linspace(0, -ny * gap, ny + 1) | |
| # Pointers to trace through an optimal aligment. | |
| P = np.zeros((nx + 1, ny + 1)) | |
| P[:,0] = 3 | |
| P[0,:] = 4 | |
| # Temporary scores. | |
| t = np.zeros(3) | |
| for i in range(nx): | |
| for j in range(ny): | |
| if x[i] == y[j]: | |
| t[0] = F[i,j] + match | |
| else: | |
| t[0] = F[i,j] - mismatch | |
| t[1] = F[i,j+1] - gap | |
| t[2] = F[i+1,j] - gap | |
| tmax = np.max(t) | |
| F[i+1,j+1] = tmax | |
| if t[0] == tmax: | |
| P[i+1,j+1] += 2 | |
| if t[1] == tmax: | |
| P[i+1,j+1] += 3 | |
| if t[2] == tmax: | |
| P[i+1,j+1] += 4 | |
| # Trace through an optimal alignment. | |
| i = nx | |
| j = ny | |
| rx = [] | |
| ry = [] | |
| while i > 0 or j > 0: | |
| if P[i,j] in [2, 5, 6, 9]: | |
| rx.append(x[i-1]) | |
| ry.append(y[j-1]) | |
| i -= 1 | |
| j -= 1 | |
| elif P[i,j] in [3, 5, 7, 9]: | |
| rx.append(x[i-1]) | |
| ry.append('-') | |
| i -= 1 | |
| elif P[i,j] in [4, 6, 7, 9]: | |
| rx.append('-') | |
| ry.append(y[j-1]) | |
| j -= 1 | |
| # Reverse the strings. | |
| rx = ''.join(rx)[::-1] | |
| ry = ''.join(ry)[::-1] | |
| return '\n'.join([rx, ry]) | |
| x = "GATTACA" | |
| y = "GCATGCU" | |
| print(nw(x, y)) | |
| # G-ATTACA | |
| # GCA-TGCU | |
| np.random.seed(42) | |
| x = np.random.choice(['A', 'T', 'G', 'C'], 50) | |
| y = np.random.choice(['A', 'T', 'G', 'C'], 50) | |
| print(nw(x, y, gap = 0)) | |
| # ----G-C--AGGCAAGTGGGGCACCCGTATCCT-T-T-C-C-AACTTACAAGGGT-C-CC-----CGT-T | |
| # GTGCGCCAGAGG-AAGT----CA--C-T-T--TATATCCGCG--C--AC---GGTACTCCTTTTTC-TA- | |
| print(nw(x, y, gap = 1)) | |
| # GCAG-GCAAGTGG--GGCAC-CCGTATCCTTTC-CAAC-TTACAAGGGTCC-CCGT-T- | |
| # G-TGCGCCAGAGGAAGTCACTTTATATCC--GCGC-ACGGTAC-----TCCTTTTTCTA | |
| print(nw(x, y, gap = 2)) | |
| # GCAGGCAAGTGG--GGCAC-CCGTATCCTTTCCAACTTACAAGGGTCCCCGTT | |
| # GTGCGCCAGAGGAAGTCACTTTATATCC-GCGCACGGTAC-TCCTTTTTC-TA | |
Mantenemos la puntuación por coincidencia, no coincidencia y gap. La elección en cada posición depende de la mejor puntuación posible para la alineación global.
Podemos aumentar la puntuación no coincidente para evitar (A - T):
In [2]: print(nw(x, y, gap = 1, mismatch = 1))
GCAG-GCAAGTGG--GGCAC-CCGTATCCTTTC-CAAC-TTACAAGGGTCC-CCGT-T-
G-TGCGCCAGAGGAAGTCACTTTATATCC--GCGC-ACGGTAC-----TCCTTTTTCTA
In [3]: print(nw(x, y, gap = 1, mismatch = 2))
--GCAGGCA-AGTG-GGGCACCCGTATCCT-T-TCCAACTTACAAGGGT-C-CC-----CGTT
GTGC-GCCAGAG-GAAGTCA--C-T-T--TATATCC-GC--GC-ACGGTACTCCTTTTTC-TA
In [4]: print(nw(x, y, gap = 1, mismatch = 3))
----G-C--AGGCAAGTGGGGCACCCGTATCCT-T-T-C-C-AACTTACAAGGGT-C-CC-----CGT-T
GTGCGCCAGAGG-AAGT----CA--C-T-T--TATATCCGCG--C--AC---GGTACTCCTTTTTC-TA-
In [5]: print(nw(x, y, gap = 1, mismatch = 4))
----G-C--AGGCAAGTGGGGCACCCGTATCCT-T-T-C-C-AACTTACAAGGGT-C-CC-----CGT-T
GTGCGCCAGAGG-AAGT----CA--C-T-T--TATATCCGCG--C--AC---GGTACTCCTTTTTC-TA-
This R code is slow, but it might be useful for visualizing some examples of sequence alignments: https://gist.github.com/slowkow/508393
The score matrix F is represented with numbers.
The pointer matrix P is represented with arrows pointing up, left, or up-left to indicate gaps (up or left) or match/mismatch (up-left).
can you please tell me the changes i need to do in this code for finding the most optimal alignment cosidering Smith Watterman algorithm for local alignment ??
@ayush-mourya There are lot of resources online that you might want to look at. Don't give up, keep reading!
Here is one resource that seems relevant to your question: https://open.oregonstate.education/appliedbioinformatics/chapter/chapter-3/
(University websites are always a great starting point, try searching for queries like smith-waterman site:edu to get results from universities.)
In the Needleman-Wunsch (global alignment) algorithm, we start from the bottom-right corner of the matrix, and we move upward and to the left, stopping in the top-left corner of the matrix. This ensures that we globally align all of the bases from the two sequences.
In the Smith-Waterman (local alignment) algorithm, we do not always start from the bottom-right corner of the matrix. Instead, we choose the maximum value from the bottom row or the right-most column. From that position, we proceed upward and to the left, but we can stop before we reach the top-left corner. This means we are interested in the local alignment of a subset of the first and second sequences, not the global alignment of the entirety of the two sequences.
I hope that helps! Good luck with your learning.

Adenina con timina (A - T), ¿eso es correcto como alineación optima resultante? en la linea 116 y 117 hay un resultado que alinea adenina con timina, ¿me podrian explicar si eso se toma en cuenta como una alineación? ¿no debería haber un gap? o deleción ¿?