前言
CRISPR因为具有NHEJ效应,在修复过程中会造成一定长度的突变,这种突变包括小片段插入(insertion),小片段缺失(deletion),单碱基突变(substitution)。
为了有效准确的分析突变情况,我们需要对测序序列和参考序列进行全局比对,以期望发现不同种类的突变情况
为了加速双序列比对,我们采用cython模块来加速,这里就直接采用别人写好的开源程序Github: brentp / align
git clone https://github.com/brentp/align.gitcd alignpython3 setup.py buildpython3 setup.py install
这个模块的用法也很简单,但要注意该模块包含python和cython两种比对,当然cython的速度会更快
#for python alignfrom align.align import aligner as pyaligner#for cython alignfrom align.calign import aligner as calignerfrom align.matrix import DNAFULL
序列比对有三种方法,分别是global,glogal_cfe,glocal分别针对全局比对,全局长序列比对以及局部比对。
下面直接就是主函数
from align.calign import alignerfrom align.matrix import DNAFULLimport refrom collections import Counterimport argparseparser = argparse.ArgumentParser(description='This script was used to calculate indel of 37bp trap library')parser.add_argument('-r', '--ref', help='Please input your reference file')parser.add_argument('-f', '--fasta', help='Please input your filter sequence.')parser.add_argument('-e', '--edits', help='Please input your edits style. eg: insertion;deletion;sub')parser.add_argument('-m', '--methods', help='Please input your select method. eg: size;pos;ss')args = parser.parse_args()##seq1 is refdef find_insertion(str1, str2):indel = []j = 0for i in re.finditer("-+", str1):pos = i.start() + 1 - jsize = i.end() - i.start()ss = str2[i.start(): i.end()]j = j + sizeindel.append((pos, size, ss))return indeldef find_deletion(str1, str2):indel = []new_s1 = str1.replace("-", "")new_s2 = "".join([str2[i] for i in range(len(str1)) if str1[i] != "-"])for i in re.finditer("-+", new_s2):pos = i.start() + 1size = i.end() - i.start()ss = new_s1[i.start(): i.end()]indel.append((pos, size, ss))return indeldef find_sub(str1, str2):sub = []new_s1 = str1.replace("-", "")new_s2 = "".join([str2[i] for i in range(len(str1)) if str1[i] != "-"])for i in range(len(new_s1)):if new_s1[i] != new_s2[i] and new_s2[i] != "-":pos = i + 1size = 1ss = new_s2[i]sub.append((pos, size, ss))else:continuereturn subif args.ref:lb_lst = []rf_lst = []with open(args.ref, "r") as f:for line in f:label, ref_seq = line.split(",")lb_lst.append(label.strip())rf_lst.append(ref_seq.strip("\n"))else:print("No ref input!")exit(0)if args.fasta:fasta = {}with open(args.fasta, 'r') as f:for line in f:if line.startswith(">"):ss_lst = []name = line.strip("\n").replace(">", "")elif line != "\n":ss_lst.append(line.strip("\n"))fasta[name] = ss_lstelse:continueelse:print("No sequence input!")exit(0)haha_lst = []# calculate insertion for every sitefor lb, rf in zip(lb_lst, rf_lst):for val in fasta[lb]:ag = aligner(val, rf, method="global_cfe", matrix=DNAFULL)bidui = ag.pop()ctrl = bidui.seq2.decode()case = bidui.seq1.decode()if args.edits == "insertion":haha_lst.extend(find_insertion(ctrl, case))elif args.edits == "deletion":haha_lst.extend(find_deletion(ctrl, case))elif args.edits == "sub":haha_lst.extend(find_sub(ctrl, case))else:print("please input correct call methods!eg:insertion;deletion;sub")if len(haha_lst) > 0:if args.methods == "pos":new_lst = [haha[0] for haha in haha_lst if len(haha) == 3]fin = Counter(new_lst)print("{} in every site is:".format(args.edits))for i in range(37):if fin.get((i+1), 0):print("{}\t{}".format((i + 1), fin[i + 1]))else:print("{}\t{}".format((i+1), 0))elif args.methods == "size":new_lst = [haha[1] for haha in haha_lst if len(haha) == 3]fin = Counter(new_lst)print("{} in every size is:".format(args.edits))for i in range(1, 10):if fin.get((i), 0):print("{}\t{}".format((i), fin[i]))else:print("{}\t{}".format((i), 0))elif args.methods == "ss":new_lst = [haha[2] for haha in haha_lst if len(haha) == 3]fin = Counter(new_lst)print("{} in every sequence is:".format(args.edits))for key, vals in fin.items():print("{}\t{}".format(key, vals))else:print("No {}".format(args.edits))
用法
如果想知道fasta序列中相对于ref序列,分别在1-37bp上的插入是多少,可以键入一下参数
python3 cal_indel.py -r ref.txt -f fasta.fa -e insertion -m pos
#cat ref.txt1,AGAAGTGGAGCTGCAGCTGCAGGCAGCTCCCGGATCC2,AGTGGCTGAGTATGATCAGTGTCCAGTGTCTGGCCCA
参考序列是两列文件,第一列代表序号,第二列代表参考序列
#cat fasta.fa>1AGAAGTGGAGCTGCAGCTGCAGGCAGCCTCCCGGATCCAGAAGTGGAGCTGCAGATCCAGAAGTGGAGCTGCAGATCCAGAAGTGGAGCTAGAAGTGGAGCTGCAGCTGCAGGCAGCCTCCCGGATCCAGAAGTGGAGCTGCAGCTGCAGGCAGAGCTCCTGGATCCAGAAGTGGAGCTGCAGCTGCAGGCAGAGCTCCTGGATCCAGAAGTGGAGCTAGAAGTGGAGCTGCAGCTGCAGGCTGGCCCGGATCC>2AGTGGCTGAGTATGATCAGTGTCCAGTCTGGCCCAAGTGGCTGAGTATGATCAGTGTCTGGCCCAAGTGGCTGAGTATGATCAGCCCAAGTGGCTGAGTATGATCAGTGTCCAAGTGGCTGAGTATGATCAGTGTCCAGTGTCTGGCCCAAGTGGCTGAGTATGATCAGTGTGTCTGGCCCAAGTGGCTGAATATGATCAGTGTCCAGTTGTCTGGCCCAAGTGGCTGAGTATGATCAGTGTCCAGAGTGGCTGAGTATGATCAGTGTGTCTGGCCCAAATGGCTGAGTATGATCAGTGTCCAGATCTGGCCCAAGTGGCTGAATAGTCTGGCCCA
输入序列为序列名称为参考序列对应的编号,而后面跟着以换行符分隔的小片段序列
-e参数可以分别输入insertion,deletion,sub,分别对应插入,缺失以及单碱基替换
-m参数可以分别输入size,pos,ss分别对应按照大小,位置和序列进行统计
注意:如果报global_cfe的错误,就是因为比对序列太多了,更换比对参数为global即可!
