๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ
๐Ÿงฌ Biology/๋ฐ”์ด์˜ค ์ฝ”๋”ฉ ๋ฌธ์ œ

[ROSALIND] (k, d)-motif ์ฐพ๊ธฐ

by HelloRabbit 2023. 7. 2.
728x90

๋ฌธ์ œ ์„ค๋ช…

Motif๋ž€ ๋น„์Šทํ•œ ์„œ์—ด์„ ๊ฐ€์ง์œผ๋กœ์„œ ๋น„์Šทํ•œ ๊ธฐ๋Šฅ์„ ํ•˜๋Š” ๊ฒƒ์œผ๋กœ ์•Œ๋ ค์ง„ DNA ๋‚ด์˜ ์งง์€ ์„œ์—ด๋“ค์ด๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, A๋ผ๋Š” ๋‹จ๋ฐฑ์งˆ์ด DNA์— ๊ฒฐํ•ฉํ•˜๋Š” ์œ„์น˜๋Š” DNA ์ƒ์—์„œ ์—ฌ๋Ÿฌ ์œ„์น˜์— ์กด์žฌํ•˜์ง€๋งŒ ๋ชจ๋‘ ๋น„์Šทํ•œ ์„œ์—ด์„ ๊ฐ€์ ธ์•ผ A๊ฐ€ ๊ฒฐํ•ฉํ•  ์ˆ˜ ์žˆ์„ ๊ฒƒ์ด๋‹ค. ํ•˜์ง€๋งŒ motif๋Š” 100% ๋˜‘๊ฐ™์€ ์„œ์—ด์ด ์•„๋‹ ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ์ด๋Ÿฌํ•œ motif๋ฅผ ์ฐพ๊ธฐ ์œ„ํ•ด์„œ๋Š” ์–ด๋Š ์ •๋„ ์—ผ๊ธฐ์„œ์—ด์˜ ์ฐจ์ด๋ฅผ ๊ณ ๋ คํ•ด์•ผ ํ•œ๋‹ค.

์ด ๋ฌธ์ œ์—์„œ๋Š” (k, d)-motif๋ฅผ ์ฐพ๋Š”๋ฐ ์—ฌ๊ธฐ์„œ k๋Š” k ๊ธธ์ด๋ฅผ ๊ฐ€์ง„ motif๋ฅผ ๋œปํ•˜๊ณ , d๋Š” ์ตœ๋Œ€ d ๊ฐœ์ˆ˜๋งŒํผ ์—ผ๊ธฐ์„œ์—ด ์ฐจ์ด๊ฐ€ ์žˆ์„ ์ˆ˜ ์žˆ๋‹ค๋Š”๊ฑธ ๋œปํ•œ๋‹ค. ์ด๋Ÿฌํ•œ (k, d)-motif๋Š” ์„œ์—ด์— ์ง์ ‘ ์กด์žฌํ•˜์ง€ ์•Š์„ ์ˆ˜๋„ ์žˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, ์šฐ๋ฆฌ๊ฐ€ ์ฐพ์€ 15 bp ๊ธธ์ด๋ฅผ ๊ฐ€์ง„ (k, d)-motif๊ฐ€ AAAAAAAAGGGGGGG ์ผ ๋•Œ, ์œ ์‚ฌํ•œ motif๋“ค์€ AgAAgAAAGGttGGG์™€ cAAtAAAAcGGGGcG ์ผ ์ˆ˜ ์žˆ๋‹ค. ๋‘ ์„œ์—ด์€ (k, d)-motif์—์„œ 4๊ฐœ์˜ ์—ผ๊ธฐ์„œ์—ด ๋ฐ–์— ์ฐจ์ด๊ฐ€ ๋‚˜์ง€ ์•Š์ง€๋งŒ ์„œ๋กœ๋ฅผ ๋น„๊ตํ–ˆ์„ ๋•Œ๋Š” 8๊ฐœ์˜ ์—ผ๊ธฐ์„œ์—ด์˜ ์ฐจ์ด๊ฐ€ ์žˆ๋‹ค.

์—ผ๊ธฐ์„œ์—ด ์ฐจ์ด(d)๊ฐ€ 8์ด๋‹ค

์ด๋ ‡๊ฒŒ (k, d)-motif๋ฅผ ์ฐพ๋Š” ๋ฌธ์ œ๋Š” ์‹ค์ œ ์„œ์—ด์—๋Š” ์—†์„ ์ˆ˜๋„ ์žˆ๋Š” motif ์„œ์—ด์„ ์ฐพ๋Š” ๊ฒƒ์ด๊ธฐ ๋•Œ๋ฌธ์— motif๋ฅผ ์ฐพ๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๋งŒ๋“œ๋Š” ๊ฒƒ์ด ์ข€ ๋” ๊นŒ๋‹ค๋กœ์šธ ์ˆ˜ ์žˆ๋‹ค.

๋ฌธ์ œ (ํ’€์–ด๋ณด๊ธฐ)

์ฒซ ์ค„์— k์™€ d ๊ฐ’์ด ์ฃผ์–ด์ง€๊ณ , ๋‚˜๋จธ์ง€ ์ค„์—๋Š” ์—ฌ๋Ÿฌ DNA ์„œ์—ด๋“ค์ด ์ฃผ์–ด์กŒ์„ ๋•Œ, ๋ชจ๋“  ์„œ์—ด์ด ๊ณตํ†ต์ ์œผ๋กœ ๊ฐ€์ง„ (k, d)-motif๋ฅผ ๋ชจ๋‘ ์ฐพ์œผ์‹œ์˜ค. 

์˜ˆ์‹œ

3 1
ATTTGGC
TGCCTTA
CGGTATC
GAAAATT

์˜ˆ์ƒ ๊ฒฐ๊ณผ

ATA ATT GTT TTT

 

ํ•ด๊ฒฐ

# a์™€ b ์„œ์—ด์—์„œ ์ฐจ์ด๋‚˜๋Š” ์—ผ๊ธฐ์„œ์—ด ๊ฐœ์ˆ˜๋ฅผ ๋ฐ˜ํ™˜ํ•˜๋Š” ํ•จ์ˆ˜
def hamming_dist(a, b):
    return sum(map(lambda x: x[0] != x[1], zip(a, b)))

# d ์ฐจ์ด๋‚˜๋Š” ๋ชจ๋“  k-mer ์ฐพ๋Š” ํ•จ์ˆ˜
def neighbors(pattern, d):
    if d == 0:
        return [pattern]
    if len(pattern) == 1:
        return ['A', 'C', 'T', 'G']

    neighborhood = []
    suffix_neighbors = neighbors(pattern[1:], d)
    
    for suffix in suffix_neighbors:
        
        # d ๋ณด๋‹ค ์ž‘์œผ๋ฉด ์•„์ง ์ตœ์†Œ 1๊ฐœ๋Š” ๋” ๋‹ฌ๋ผ๋„ ๋œ๋‹ค๋Š” ๋œป์ด๋‹ˆ๊นŒ ์•ž์— A,C,T,G๋ฅผ ๋‹ค ๋ถ™์—ฌ๋ด๋„ ๋จ
        if hamming_dist(pattern[1:], suffix) < d:
            for nuc in ['A', 'C', 'T', 'G']:
                neighborhood.append(nuc + suffix)
        
        # d๋ž‘ ๊ฐ™๋‹ค๋ฉด ๋” ์ด์ƒ mismatch๊ฐ€ ์žˆ์œผ๋ฉด ์•ˆ๋˜๊ธฐ ๋•Œ๋ฌธ์— ์›๋ž˜ pattern์˜ ์•ž๊ธ€์ž๋งŒ ๋”ํ•ด์คŒ
        else:
            neighborhood.append(pattern[0] + suffix)

    return neighborhood

# ๋ชจ๋“  k-mer์˜ ๋ฆฌ์ŠคํŠธ๋ฅผ ๋ฐ˜ํ™˜ํ•˜๋Š” ํ•จ์ˆ˜
def get_kmers(k, dna):
    kmers = []
    for i in range(len(dna) - k + 1):
        kmers.append(dna[i:i+k])
    
    return kmers

# (k, d)-motif ์ฐพ๋Š” ํ•จ์ˆ˜
def motif_enumeration(k, d, dnas):
    kmers = get_kmers(k, dnas[0])    # ์ฒซ๋ฒˆ์งธ ์„œ์—ด์˜ ๋ชจ๋“  kmer ์ฐพ๊ธฐ

    kd_motifs = []
    for kmer in kmers:
        neighborhood = neighbors(kmer, d)
        for neighbor in neighborhood:
            available = [1] + [0 for i in range(len(dnas) - 1)]
            for i in range(1, len(dnas)):
                if neighbor in ' '.join(neighbors(dnas[i], d)):
                    available[i] = 1
            if sum(available) == len(dnas):
                kd_motifs.append(neighbor)
    
    return set(kd_motifs)

# ์ž…๋ ฅ๊ฐ’ ๊ฐ€์ ธ์˜ค๊ธฐ
given = []
with open('rosalind_ba2a.txt', 'r') as f:
    for line in f.readlines():
        given.append(line.strip())

# ์ฝ”๋“œ ์‹คํ–‰ํ•˜๊ธฐ
k, d = map(int, given[0].split())
motifs = motif_enumeration(k, d, given[1:])
print(' '.join(motifs))

 

์ฐธ๊ณ 

https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-2

 

 

 

๋Œ“๊ธ€