立即注册找回密码

QQ登录

只需一步,快速开始

微信登录

微信扫一扫,快速登录

手机动态码快速登录

手机号快速注册登录

搜索

图文播报

查看: 232|回复: 0

[分享] 全网第一篇免疫组库分析教程[MiXCR+VDJtools+Python+R]

[复制链接]
发表于 2025-3-12 15:31 | 显示全部楼层 |阅读模式

登陆有奖并可浏览互动!

您需要 登录 才可以下载或查看,没有账号?立即注册 微信登录 手机动态码快速登录

×
本人已委托维权骑士(http://www.rightknights.com)进行原创维权。
思想高屋建瓴,知识原创共享。
转载请引用链接。
<hr/>免疫组库(Immune Repertoire,IR)是指某个个体在任何特定时间点其循环系统中所有功能多样性B淋巴细胞和T淋巴细胞的总和。免疫组库分析是指通过高通量测序技术,对机体免疫组库的多样性及每种 T、B细胞克隆的独特性序列组成和变化进行分析,从而全面评估机体的免疫状态,明确疾病与T、B细胞克隆组成及变化之间的关系,是免疫学和医学领域常用的分析方法。


<hr/>

  • 准备工作:
1.1 MiXCR and VDJtools:
Releases · milaboratory/mixcrRelease 1.2.1 · mikessh/vdjtools下载MiXCR和VDJtools的压缩包并解压。
1.2 Java Runtime Environment (JRE, at least v1.8)
https://java.com/en/download/ 下载Java组件并按照提示安装,因为MiXCR和VDJtools都基于java编写,因此需要java运行。
1.3 R packages VDJtools needs:
VDJTools的可视化依赖于R的一些可视化包如我们使用过的circlize等,因此需要安装这些R包,清单如下:

  • ape
  • circlize
  • FField
  • ggplot2
  • gplots
  • grid
  • gridExtra
  • MASS
  • plotrix
  • RColorBrewer
  • reshape
  • reshape2
  • scales
  • VennDiagram
可在R中手动安装,也可以在命令行通过以下代码安装:
java -jar /path/to/vdjtools/vdjtools.jar Rinstall
#/path/to/vdjtools/为vdjtools.jar的所在路径1.4 免疫组库数据:
一方面,从理论上来说,只要测序覆盖了T细胞CDR或B细胞IG的编码区,就可以从测序数据中提取出免疫组信息。另一方面,根据MiXCR官网的描述,除了靶向建库的免疫组库测序数据,也可以使用RNA-Seq或其他非靶向但覆盖有免疫组信息的测序数据作为输入。
1.5 免疫组库数据对应的样本信息:
包含有所有样本的分组信息,以便后续的分组分析,文件格式参考如下:


<hr/>2. 分析流程:
2.1 通过fastqc、multiqc、trimmomatic进行质量过滤:
和其他基因测序数据一样,拿到数据后首先要通过fastqc、multiqc、trimmomatic进行质量过滤,详细步骤可见我之前的文章 生科院的老张:RNAseq转录组差异表达分析教程 .
2.1.1 md5检查数据完整性:
md5sum *gz > md5.txt && md5sum -c md5.txt
# *gz 任意以gz结尾的文件
# > 运行结果保存至
# && 连接符,前一命令完成继续执行下一命令2.1.2 fastqc质量控制与multiqc合并质控报告:
fastqc *gz && multiqc ./
# 对所有gz结尾文件进行质控并对当前目录下所有质控报告进行合并2.1.3 trimmomatic质量过滤:
trimmomatic
PE #双端测序,单端测序为SE
-threads 4 #指定线程数为4
~/WES_WGS_WGRS/data/sample1/sample1_1.fq.gz #输入序列
~/WES_WGS_WGRS/data/sample1/sample1_2.fq.gz
~/WES_WGS_WGRS/data/sample1/sample1_paired_clean_1.fq.gz #输出配对序列和非配对序列
~/WES_WGS_WGRS/data/sample1/sample1_unpair_clean_1.fq.gz
~/WES_WGS_WGRS/data/sample1/sample1_paired_clean_2.fq.gz
~/WES_WGS_WGRS/data/sample1/sample1_unpair_clean_2.fq.gz
ILLUMINACLIP:/data1/guest/yinlei/miniconda3/pkgs/trimmomatic-0.39-1/share/trimmomatic/adapters/TruSeq3-PE-2.fa:2:30:10:1:true
#去除ILLUMINA接头,根据质控报告选择trimmomatic文件夹adapters路径下的接头文件
LEADING:3 #从reads开头切除质量低于阈值3的碱基
TRAILING:3 #从reads末尾切除质量低于阈值3的碱基
SLIDINGWINDOW:4:20 #从reads 5‘端开始进行长度为4的滑窗过滤,切除碱基质量低于阈值20的碱基
MINLEN:50 #丢弃剪切后长度低于阈值50的reads
TOPHRED33 #将reads的碱基质量值体系转为phred-33,若为phred-64则为TOPHRED64<hr/>2.2 MiXCR提取免疫组信息:
#for targeted TCR/IG libraries
java -jar /path/to/mixcr/mixcr.jar
    analyze amplicon
    -s hsa --starting-material dna  --5-end v-primers --3-end j-primers --adapters adapters-present
    sample1_1.fastq sample1_2.fastq /path/mixcr_result/sample1

#for non-enriched data
java -jar /path/to/mixcr/mixcr.jar
    analyze shotgun
    -s hsa --starting-material dna
    sample1_1.fastq sample1_2.fastq /path/mixcr_result/sample1以下为结果文件:



MiXCR结果文件

包含了IG和CDR的信息,本文以TCR betta 链的分析为例,如下为TRB文件的格式:



.clonotypes.TRB.txt文件格式

包含有clonotype,氨基酸序列,核苷酸序列,reads计数,V(D)J基因组成等信息。
<hr/>2.3 格式转换:
将MiXCR输出格式转换为VDJtools支持的输入格式,通过以下代码实现:
java -jar /path/to/vdjtools/vdjtools-1.2.1.jar
    Convert -S mixcr path/to/mixcr_result/*.TRB.txt /path/to/vdjtools_result/输出文件格式如下:


<hr/>2.4 基础分析:
对格式转换后的文件进行后续分析。
java -jar /path/to/vdjtools/vdjtools-1.2.1.jar
    CalcBasicStats sample1.txt sample2.txt sample3.txt... /path/to/vdjtools_result/输出文件包含有各样本的基础信息,格式如下:


<hr/>2.5 CDR3 Clonotypes Richness:
import seaborn as sns, matplotlib.pyplot as plt
########################################################################################################################
input_path=&#39;/Users/ZYP/Downloads/imm_repertoire/vdjtools_result/&#39;#输入路径
output_path=&#39;/Users/ZYP/Downloads/imm_repertoire/vdjtools_result/&#39;#输出路径
sample_info=&#39;sample_info.txt&#39;#记录有样本分组的样本信息
########################################################################################################################
list_pos=[]
for line in open(input_path+sample_info,&#39;r&#39;):
    info=line[:-1].split(&#39;\t&#39;)
    if info[0]!=&#39;ID&#39;:
        if str(info[1])==&#39;Class1&#39;:
            list_pos.append(info[0])

list_class=[]
list_sample=[]
list_richness=[]
for line in open(output_path+&#39;basicstats.txt&#39;,&#39;r&#39;):
    info=line[:-1].split(&#39;\t&#39;)
    if info[0]!=&#39;sample_id&#39;:
        if info[0].split(&#39;.&#39;)[0] in list_pos:
            list_class.append(&#39;Class1&#39;)
        else:
            list_class.append(&#39;Class2&#39;)
        list_sample.append(info[0].split(&#39;.&#39;)[0])
        list_richness.append(float(info[3]))

ax1 = sns.barplot(x=list_class,y=list_richness,capsize=.2,palette=(sns.xkcd_rgb[&#34;dark red&#34;],sns.xkcd_rgb[&#34;marine blue&#34;]))
ax1_1 = sns.stripplot(x=list_class,y=list_richness,size=5,palette=(sns.xkcd_rgb[&#34;dark red&#34;],sns.xkcd_rgb[&#34;marine blue&#34;]))
ax1.spines[&#39;top&#39;].set_visible(False)
ax1.spines[&#39;right&#39;].set_visible(False)
plt.savefig(output_path+&#39;Richness_barplot_Class.pdf&#39;)
plt.show()
ax2 = sns.barplot(x=list_sample,y=list_richness)
ax2.set_xticklabels(ax2.get_xticklabels(), rotation=-90)
ax2.spines[&#39;top&#39;].set_visible(False)
ax2.spines[&#39;right&#39;].set_visible(False)
plt.savefig(output_path+&#39;Richness_barplot_Sample.pdf&#39;)
plt.show()结果如下图所示:



CDR3 Clonotypes Richness

<hr/>2.6 CDR3 Clonotypes Length:
import seaborn as sns, matplotlib.pyplot as plt
########################################################################################################################
input_path=&#39;/Users/ZYP/Downloads/imm_repertoire/vdjtools_input/&#39;#输入路径
output_path=&#39;/Users/ZYP/Downloads/imm_repertoire/vdjtools_result/&#39;#输出路径
sample_info=&#39;sample_info.txt&#39;#记录有样本分组的样本信息
########################################################################################################################
dict_sample_LengthCount={}
list_pos=[]
for line in open(input_path+sample_info,&#39;r&#39;):
    info=line[:-1].split(&#39;\t&#39;)
    id=info[0]
    if id!=&#39;ID&#39;:
        if str(info[1])==&#39;Class1&#39;:
            list_pos.append(id)
        read_file=open(input_path+id+&#39;.clonotypes.TRB.txt&#39;,&#39;r&#39;)
        dict_length_count={}
        for line in read_file:
            if line[0]!=&#39;c&#39;:
                length=len(line[:-1].split(&#39;\t&#39;)[3])
                if length < 30:
                    if length not in dict_length_count.keys():
                        dict_length_count[length]=1
                    else:
                        dict_length_count[length]+=1
        dict_sample_LengthCount[id]=dict_length_count
        read_file.close()

list_length=[]
list_class=[]
list_sample=[]
list_count=[]
for sample in dict_sample_LengthCount.keys():
    dict_length_count=dict_sample_LengthCount[sample]
    for length in dict_length_count.keys():
        list_length.append(length)
        list_sample.append(sample)
        if sample in list_pos:
            list_class.append(&#39;Class1&#39;)
        else:
            list_class.append(&#39;Class2&#39;)
        list_count.append(dict_length_count[length])

ax1 = sns.barplot(x=list_length,y=list_count,hue=list_class,errwidth=0.5,capsize=0.5,palette=(sns.xkcd_rgb[&#34;dark red&#34;],sns.xkcd_rgb[&#34;marine blue&#34;]))
ax1_1 = sns.stripplot(x=list_length,y=list_count,hue=list_class,size=2,dodge=True,palette=(sns.xkcd_rgb[&#34;dark red&#34;],sns.xkcd_rgb[&#34;marine blue&#34;]))
ax1.set_xticklabels(ax1.get_xticklabels(), rotation=-90,fontsize=6)
ax1.spines[&#39;top&#39;].set_visible(False)
ax1.spines[&#39;right&#39;].set_visible(False)
plt.savefig(output_path+&#39;Length_barplot_Class.pdf&#39;)
plt.show()
ax2 = sns.barplot(x=list_length,y=list_count,hue=list_sample,errwidth=0.5,capsize=0.5)
ax2.set_xticklabels(ax2.get_xticklabels(), rotation=-90,fontsize=6)
ax2.spines[&#39;top&#39;].set_visible(False)
ax2.spines[&#39;right&#39;].set_visible(False)
plt.savefig(output_path+&#39;Length_barplot_Sample.pdf&#39;)
plt.show()结果如下图所示:



CDR3 Clonotypes Length

<hr/>2.7 CDR3 Clonotypes Abundance Proportion:
import numpy as np, seaborn as sns, matplotlib.pyplot as plt
########################################################################################################################
input_path=&#39;/Users/ZYP/Downloads/imm_repertoire/vdjtools_input/&#39;#输入路径
output_path=&#39;/Users/ZYP/Downloads/imm_repertoire/vdjtools_result/&#39;输出路径
sample_info=&#39;sample_info.txt&#39;#记录有样本分组的样本信息
########################################################################################################################
dict_sample_ReadsCounts={}
list_pos=[]
for line in open(input_path+sample_info,&#39;r&#39;):
    info = line[:-1].split(&#39;\t&#39;)
    id = info[0]
    if id != &#39;ID&#39;:
        if str(info[1]) == &#39;Class1&#39;:
            list_pos.append(id)
        read_file = open(input_path + id + &#39;.clonotypes.TRB.txt&#39;, &#39;r&#39;)
        dict_read_count={&#39;1&#39;:0,&#39;2-3&#39;:0,&#39;4-10&#39;:0,&#39;11-30&#39;:0,&#39;31-100&#39;:0,&#39;101-MAX&#39;:0}
        number_clonotypes=0
        for line in read_file:
            if line[0] != &#39;c&#39;:
                number_clonotypes+=1
                info=line[:-1].split(&#39;\t&#39;)
                count=info[0]
                if int(count) == 1:
                    dict_read_count[&#39;1&#39;]+=1
                elif int(count) in [2,3]:
                    dict_read_count[&#39;2-3&#39;]+=1
                elif 4 <= int(count) <= 10:
                    dict_read_count[&#39;4-10&#39;]+=1
                elif 11 <= int(count) <= 30:
                    dict_read_count[&#39;11-30&#39;]+=1
                elif 31 <= int(count) <= 100:
                    dict_read_count[&#39;31-100&#39;]+=1
                elif int(count) >= 101:
                    dict_read_count[&#39;101-MAX&#39;]+=1
        for key in dict_read_count:
            dict_read_count[key]=dict_read_count[key]/number_clonotypes*100
        dict_sample_ReadsCounts[id]=dict_read_count
        read_file.close()

list_sample=[]
list_class=[]
list_read=[]
list_count=[]
dict_read_SampleCount={&#39;1&#39;: {},&#39;2-3&#39;: {},&#39;4-10&#39;: {},&#39;11-30&#39;: {},&#39;31-100&#39;: {},&#39;101-MAX&#39;: {}}
print(dict_sample_ReadsCounts)
for sample in dict_sample_ReadsCounts.keys():
    dict_read_count=dict_sample_ReadsCounts[sample]
    for read in dict_read_count.keys():
        list_sample.append(sample)
        if sample in list_pos:
            list_class.append(&#39;Class1&#39;)
        else:
            list_class.append(&#39;Class2&#39;)
        list_read.append(read)
        count=float(dict_read_count[read])
        list_count.append(count)
        dict_sample_count=dict_read_SampleCount[read]
        dict_sample_count[sample]=count
        dict_read_SampleCount[read]=dict_sample_count

ax1 = sns.barplot(x=list_read,y=list_count,hue=list_class,errwidth=1,capsize=0.2,palette=(sns.xkcd_rgb[&#34;dark red&#34;],sns.xkcd_rgb[&#34;marine blue&#34;]))
ax1_1 = sns.stripplot(x=list_read,y=list_count,hue=list_class,dodge=True,palette=(sns.xkcd_rgb[&#34;dark red&#34;],sns.xkcd_rgb[&#34;marine blue&#34;]))
ax1.set_xticklabels(ax1.get_xticklabels(), rotation=-90,fontsize=6)
ax1.spines[&#39;top&#39;].set_visible(False)
ax1.spines[&#39;right&#39;].set_visible(False)
plt.savefig(output_path+&#39;CountProportion_lappedplot_Class.pdf&#39;)
plt.show()

fig,ax = plt.subplots()
list_1=[]
list_2_3=[]
list_4_10=[]
list_11_30=[]
list_31_100=[]
list_101_MAX=[]
for read in dict_read_SampleCount.keys():
    dict_sample_count=dict_read_SampleCount[read]
    list_sample=[]
    for sample in dict_sample_count.keys():
        list_sample.append(sample)
        if read==&#39;1&#39;:
            list_1.append(dict_sample_count[sample])
        elif read==&#39;2-3&#39;:
            list_2_3.append(dict_sample_count[sample])
        elif read==&#39;4-10&#39;:
            list_4_10.append(dict_sample_count[sample])
        elif read==&#39;11-30&#39;:
            list_11_30.append(dict_sample_count[sample])
        elif read==&#39;31-100&#39;:
            list_31_100.append(dict_sample_count[sample])
        elif read==&#39;101-MAX&#39;:
            list_101_MAX.append(dict_sample_count[sample])
list_1=np.array(list_1)
list_2_3=np.array(list_2_3)
list_4_10=np.array(list_4_10)
list_11_30=np.array(list_11_30)
list_31_100=np.array(list_31_100)
list_101_MAX=np.array(list_101_MAX)
width = 0.35
print(list_1)
ax.bar(list_sample,list_1,width, label=&#39;1&#39;)
ax.bar(list_sample,list_2_3,width,label=&#39;2_3&#39;,bottom=list_1)
ax.bar(list_sample,list_4_10,width,label=&#39;4-10&#39;,bottom=list_1+list_2_3)
ax.bar(list_sample,list_11_30,width,label=&#39;11-30&#39;,bottom=list_1+list_2_3+list_4_10)
ax.bar(list_sample,list_31_100,width,label=&#39;31-100&#39;,bottom=list_1+list_2_3+list_4_10+list_11_30)
ax.bar(list_sample,list_101_MAX,width,label=&#39;101-MAX&#39;,bottom=list_1+list_2_3+list_4_10+list_11_30+list_31_100)
ax.spines[&#39;top&#39;].set_visible(False)
ax.spines[&#39;right&#39;].set_visible(False)
ax.legend()
ax.set_xticklabels(list_sample, rotation=90, fontsize=10)
plt.savefig(output_path+&#39;CountProportion_barplot_sample.pdf&#39;)
plt.show()结果如下图所示:



CDR3 Clonotypes Abundance Proportion:

<hr/>2.8 The relationship between CDR3 Abundance and CDR3 Clonotypes Richness:
import seaborn as sns, matplotlib.pyplot as plt
########################################################################################################################
input_path=&#39;/Users/ZYP/Downloads/imm_repertoire/vdjtools_input/&#39;#输入路径
output_path=&#39;/Users/ZYP/Downloads/imm_repertoire/vdjtools_result/&#39;输出路径
sample_info=&#39;sample_info.txt&#39;#记录有样本分组的样本信息
########################################################################################################################
dict_sample_AbundanceRichness={}
list_pos=[]
for line in open(input_path+sample_info,&#39;r&#39;):
    info = line[:-1].split(&#39;\t&#39;)
    id = info[0]
    if id != &#39;ID&#39;:
        if str(info[1]) == &#39;Class1&#39;:
            list_pos.append(id)
        dict_abundance_richness={}
        read_file = open(input_path + id + &#39;.clonotypes.TRB.txt&#39;, &#39;r&#39;)
        for line in read_file:
            if line[0]!=&#39;c&#39;:
                info=line[:-1].split(&#39;\t&#39;)
                abundance=info[0]
                if abundance not in dict_abundance_richness.keys():
                    dict_abundance_richness[abundance]=1
                else:
                    dict_abundance_richness[abundance]+=1
        dict_sample_AbundanceRichness[id]=dict_abundance_richness
        read_file.close()

list_abundance=[]
list_richniess=[]
list_sample=[]
list_class=[]
for sample in dict_sample_AbundanceRichness.keys():
    dict_abundance_richness=dict_sample_AbundanceRichness[sample]
    for abundance in dict_abundance_richness.keys():
        list_sample.append(sample)
        if sample in list_pos:
            list_class.append(&#39;Class1&#39;)
        else:
            list_class.append(&#39;Class2&#39;)
        list_abundance.append(int(abundance))
        list_richniess.append(int(dict_abundance_richness[abundance]))

ax = sns.lineplot(x=list_abundance,y=list_richniess,hue=list_sample,style=list_class)
ax.spines[&#39;top&#39;].set_visible(False)
ax.spines[&#39;right&#39;].set_visible(False)
plt.xscale(&#39;log&#39;)
plt.yscale(&#39;log&#39;)
plt.savefig(output_path+&#39;Abundance_Proportion_relatplot.pdf&#39;)
plt.show()结果如下图所示:



The relationship between CDR3 Abundance and CDR3 Clonotypes Richness

<hr/>2.9 CDR3 Overlap:
2.9.1 计算样本间CDR3重叠部分并写入本地文件overlapped_CDR3.txt:
########################################################################################################################
input_file=open(&#39;/Users/ZYP/Downloads/imm_repertoire/vdjtools_result/intersect.batch.aa.txt&#39;,&#39;r&#39;)#请将带有绝对路径的CalcPairwiseDistances的输出文件写在此处的&#39;&#39;内
save_file=open(&#39;/Users/ZYP/Downloads/imm_repertoire/vdjtools_result/overlapped_CDR3.txt&#39;,&#39;w&#39;)#请将带有绝对路径的输出文件写在此处的&#39;&#39;内
########################################################################################################################
list_samples=[]
diction={}
row=1
for line in input_file:
    if row!=1:
        info=line[:-1].split(&#39;\t&#39;)
        for sample in [info[0],info[1]]:
            if sample not in list_samples:
                list_samples.append(sample)
        key=info[0]+&#39;and&#39;+info[1]
        value=info[4]
        diction[key]=value
    row+=1
input_file.close()
word=&#39;&#39;
for sample in list_samples:
    sample=sample.split(&#39;.&#39;)[0]
    word+=sample+&#39;\t&#39;
save_file.write(word[:-1]+&#39;\n&#39;)
for sample_row in list_samples:
    word=&#39;&#39;
    for sample_column in list_samples:
        if sample_row+&#39;and&#39;+sample_column in diction.keys():
            word+=str(diction[sample_row+&#39;and&#39;+sample_column])+&#39;\t&#39;
        else:
            if sample_column+&#39;and&#39;+sample_row in diction.keys():
                word+=str(diction[sample_column+&#39;and&#39;+sample_row])+&#39;\t&#39;
            else:
                word+=&#39;0&#39;+&#39;\t&#39;
    word=word[:-1]+&#39;\n&#39;
    save_file.write(word)
save_file.close()2.9.2 CDR3 Overlap-heatmap
import seaborn as sns, pandas as pd, matplotlib.pyplot as plt
########################################################################################################################
input_file=&#39;/Users/ZYP/Downloads/imm_repertoire/vdjtools_result/overlapped_CDR3.txt&#39;#请将带有绝对路径的输入文件写在此处的&#39;&#39;内
output_file=&#39;/Users/ZYP/Downloads/imm_repertoire/vdjtools_result/overlapped_CDR3.pdf&#39;#请将带有绝对路径的输出文件写在此处的&#39;&#39;内
########################################################################################################################
list_samples=[]
list_value_list=[]
number_row=1
for line in open(input_file):
    if number_row!=1:
        list_value=[]
        index=0
        for value in line[:-1].split(&#39;\t&#39;):
            list_value.append(float(value))
            index+=1
        list_value_list.append(list_value)
    else:
        for sample in line[:-1].split(&#39;\t&#39;):
            list_samples.append(sample)
    number_row+=1
data=pd.DataFrame(data=list_value_list,index=list_samples,columns=list_samples)
ax=sns.clustermap(data,standard_scale=1,figsize=(19,19))
plt.savefig(output_file)
plt.show()结果如下图所示:



CDR3 Overlap-heatmap

2.9.3 CDR3 Overlap-circosplot-Vennplot:
#circosplot
java -jar /path/to/vdjtools/vdjtools.jar
    CalcPairwiseDistances -p --low-mem
    sample1.txt sample2.txt sample3.txt... ./vdjtools_result/
#vennplot
java -jar /path/to/vdjtools/vdjtools.jar
    JoinSamples -p sample1.txt sample2.txt sample3.txt... ./vdjtools_result/结果如下图所示:



CDR3 Overlap-circosplot-Vennplot

<hr/>2.10 CDR3 diversity / evenness / clonality:
2.10.1 计算diversity / evenness / clonality 并可视化:
import math,seaborn as sns,matplotlib.pyplot as plt
########################################################################################################################
input_path=&#39;/Users/ZYP/Downloads/imm_repertoire/vdjtools_input/&#39;#输入路径
output_path=&#39;/Users/ZYP/Downloads/imm_repertoire/vdjtools_result/&#39;输出路径
sample_info=&#39;sample_info.txt&#39;#记录有样本分组的样本信息
########################################################################################################################
def Shannon_entropy(list_frequency):
    sum=0
    for frequency in list_frequency:
        sum+=frequency*math.log(frequency)
    H=-sum
    return H

def Pielou_evenness(H,richness):
    E=H/math.log(richness)
    return E

def Clonality(E):
    C=1-E
    return C

list_pos=[]
list_characteristic=[]
list_value=[]
list_sample=[]
list_class=[]
for line in open(input_path+sample_info,&#39;r&#39;):
    info = line[:-1].split(&#39;\t&#39;)
    id = info[0]
    if id != &#39;ID&#39;:
        if info[1] == &#39;Class1&#39;:
            list_pos.append(id)
        read_file = open(input_path + id + &#39;.clonotypes.TRB.txt&#39;, &#39;r&#39;)
        richness=0
        TotalCount=0
        for line in read_file:
            if line[0] != &#39;c&#39;:
                count = int(line[:-1].split(&#39;\t&#39;)[0])
                richness+=1
                TotalCount+=count
        read_file.close()
        read_file = open(input_path + id + &#39;.clonotypes.TRB.txt&#39;, &#39;r&#39;)
        list_frequency = []
        for line in read_file:
            if line[0] != &#39;c&#39;:
                count = int(line[:-1].split(&#39;\t&#39;)[0])
                list_frequency.append(count/TotalCount)
        diversity=Shannon_entropy(list_frequency)
        evenness=Pielou_evenness(diversity,richness)
        clonality=Clonality(evenness)
        list_characteristic+=[&#39;diversity&#39;,&#39;evenness&#39;,&#39;clonality&#39;]
        list_value+=[diversity,evenness,clonality]
        for time in range(0,3):
            list_sample.append(id)
            if id in list_pos:
                list_class.append(&#39;Class1&#39;)
            else:
                list_class.append(&#39;Class2&#39;)
        read_file.close()

print(list_class)
ax1 = sns.barplot(x=list_characteristic,y=list_value,hue=list_class,errwidth=0.5,capsize=0.3,palette=(sns.xkcd_rgb[&#34;dark red&#34;],sns.xkcd_rgb[&#34;marine blue&#34;]))
ax1_1 = sns.stripplot(x=list_characteristic,y=list_value,size=2,hue=list_class,dodge=True,palette=(sns.xkcd_rgb[&#34;dark red&#34;],sns.xkcd_rgb[&#34;marine blue&#34;]))
ax1.spines[&#39;top&#39;].set_visible(False)
ax1.spines[&#39;right&#39;].set_visible(False)
plt.savefig(output_path+&#39;diversity_evenness_clonality_barplot_class.pdf&#39;)
plt.show()

ax2 = sns.violinplot(x=list_characteristic,y=list_value,hue=list_class,palette=(&#39;white&#39;,&#39;white&#39;))
ax2_1 = sns.stripplot(x=list_characteristic,y=list_value,size=4,hue=list_class,dodge=True,palette=(sns.xkcd_rgb[&#34;dark red&#34;],sns.xkcd_rgb[&#34;marine blue&#34;]))
ax2.spines[&#39;top&#39;].set_visible(False)
ax2.spines[&#39;right&#39;].set_visible(False)
plt.savefig(output_path+&#39;diversity_evenness_clonality_violinplot_class.pdf&#39;)
plt.show()

ax3 = sns.barplot(x=list_characteristic,y=list_value,hue=list_sample,errwidth=0.5,capsize=0.5)
ax3.spines[&#39;top&#39;].set_visible(False)
ax3.spines[&#39;right&#39;].set_visible(False)
plt.savefig(output_path+&#39;diversity_evenness_clonality_sample.pdf&#39;)
plt.show()结果如下图所示:



CDR3 diversity / evenness / clonality

2.10.2 The relationship between CDR3 Abundance and CDR3 diversity:
java -jar /path/to/vdjtools/vdjtools.jar
    RarefactionPlot sample1.txt sample2.txt sample3.txt... ./vdjtools_result/结果如下图所示:



The relationship between CDR3 Abundance and CDR3 diversity

2.10.3 pieplot for 1 sample:
java -jar /path/to/vdjtools/vdjtools.jar
    PlotQuantileStats -t 5 sample1.txt ./vdjtools_result/结果如下图所示:



pieplot for 1 sample

<hr/>2.11 DR3 length distribution for 1 sample:
2.10.1 CDR3 length distribution among CDRs:
java -jar /path/to/vdjtools/vdjtools.jar
    PlotFancySpectratype -t 20 sample1.txt /path/to/vdjtools_result/结果如下图所示:



CDR3 length distribution among CDRs

2.11.2 CDR3 length distribution among genes based on clonotype / frequency:
#CDR3 length distribution based on clonotype number among Vgenes
java -jar /path/to/vdjtools/vdjtools.jar
    PlotSpectratypeV -u sample1.txt /path/to/vdjtools_result/
#CDR3 length distribution based on read frequency among Vgenes
java -jar /path/to/vdjtools/vdjtools.jar
    PlotSpectratypeV sample1.txt /path/to/vdjtools_result/结果如下图所示:



CDR3 length distribution among genes based on clonotype / frequency

<hr/>2.12 sample similarity based on CDR3:
java -jar /path/to/vdjtools/vdjtools.jar
    ClusterSamples -p -l /path/to/vdjtools_result/ /path/to/vdjtools_result/结果如下图所示:



sample similarity based on CDR3

<hr/>2.13 gene usage:
java -jar /path/to/vdjtools/vdjtools.jar
    CalcSegmentUsage -p sample1.txt sample2.txt sample3.txt... /path/to/vdjtools_result/结果如下图所示:



gene usage

<hr/>2.14 gene pairing for 1 sample:
java -jar /path/to/vdjtools/vdjtools.jar
    PlotFancyVJUsage sample1.txt /path/to/vdjtools_result/结果如下图所示:



gene pairing for 1 sample

<hr/>2.15 gene diff-expression:
2.15.1 计算样本基因数:
path=&#39;/Users/ZYP/Downloads/imm_repertoire/&#39;

sample_info=open(path+&#39;vdjtools_result/sample_info.txt&#39;,&#39;w&#39;)
sample_info.write(&#39;ID\tlabel\tbatch\n&#39;)
list_files=[]
for line in open(path+&#39;vdjtools_input/metadata.txt&#39;,&#39;r&#39;):
    if line[0]!=&#39;f&#39;:
        file=line.split(&#39;\t&#39;)[1]
        sample=file.split(&#39;.&#39;)[0]
        list_files.append(file+&#39;.txt&#39;)
        if int(sample[-3:])<100:
            sample_info.write(sample+&#39;\t&#39;+&#39;Class1&#39;+&#39;\t&#39;+&#39;1&#39;+&#39;\n&#39;)
        else:
            sample_info.write(sample+&#39;\t&#39;+&#39;Class2&#39;+&#39;\t&#39;+&#39;1&#39;+&#39;\n&#39;)
sample_info.close()
list_sample=[]
list_V=[]
list_J=[]
list_dict_V=[]
list_dict_J=[]
for file in list_files:
    read_file=open(path+&#39;vdjtools_input/&#39;+file,&#39;r&#39;)
    sample=file.split(&#39;.&#39;)[0]
    list_sample.append(sample)
    dict_CDR={}
    dict_V={}
    dict_J={}
    for line in read_file:
        if line[0]!=&#39;c&#39;:
            info=line[:-1].split(&#39;\t&#39;)
            name_V=info[4]
            name_J=info[6]
            if name_V not in list_V:
                list_V.append(name_V)
            if name_J not in list_J:
                list_J.append(name_J)
            if name_V not in dict_V.keys():
                dict_V[name_V]=1
            else:
                dict_V[name_V]+=1
            if name_J not in dict_J.keys():
                dict_J[name_J]=1
            else:
                dict_J[name_J]+=1
    list_dict_V.append(dict_V)
    list_dict_J.append(dict_J)
    read_file.close()

word = &#39;item&#39;
for sample in list_sample:
    word+=&#39;\t&#39;+sample
word+=&#39;\n&#39;
V_file=open(&#39;/Users/ZYP/Downloads/imm_repertoire/vdjtools_result/matrix_V.txt&#39;,&#39;w&#39;)
V_file.write(word)
for V in list_V:
    check = []
    info=str(V)
    for dict_V in list_dict_V:
        if V in dict_V.keys():
            info+=&#39;\t&#39;+str(dict_V[V])
            check.append(float(dict_V[V]))
        else:
            info+=&#39;\t&#39;+&#39;0&#39;
            check.append(0)
    info+=&#39;\n&#39;
    if max(check)!=0:
        V_file.write(info)
V_file.close()
print(&#39;V gene expression matrix have been build!&#39;)
J_file=open(&#39;/Users/ZYP/Downloads/imm_repertoire/vdjtools_result/matrix_J.txt&#39;,&#39;w&#39;)
J_file.write(word)
for J in list_J:
    check = []
    info=str(J)
    for dict_J in list_dict_J:
        if J in dict_J.keys():
            info+=&#39;\t&#39;+str(dict_J[J])
            check.append(float(dict_J[J]))
        else:
            info+=&#39;\t&#39;+&#39;0&#39;
            check.append(0)
    info+=&#39;\n&#39;
    if max(check)!=0:
        J_file.write(info)
J_file.close()
print(&#39;J gene expression matrix have been build!&#39;)
print(&#39;expression matrix have been build!&#39;)输出文件格式如下:



matrix_V.txt

2.15.2 R包DESeq2差异分析:
library(DESeq2)
################################################################################################################################################
input_matrix=&#39;/Users/ZYP/Downloads/imm_repertoire/vdjtools_result/matrix_V.txt&#39;#将前面得到的表达信息文件matrix.txt的绝对路径及文件写在此处的&#39;&#39;内
input_info=&#39;/Users/ZYP/Downloads/imm_repertoire/vdjtools_result/sample_info.txt&#39;#将预先整理的样本信息文件sample_info.txt的绝对路径及文件写在此处的&#39;&#39;内
output_file=&#39;/Users/ZYP/Downloads/imm_repertoire/vdjtools_result/diffexp_result_V.txt&#39;#将保存输出文件的绝对路径写在此处的&#39;&#39;内
label#将样本信息文件sample_info.txt中用于分组进行差异分析的标签名写在此处
batch#将样本信息文件sample_info.txt中用作差异分析因素的标签名写在此处
################################################################################################################################################
input_data <- read.table(input_matrix,header = TRUE,row.names = 1)
input_data <- round(input_data,digits=0)#取整
input_data <- as.matrix(input_data)#转换为矩阵
input_data
info_data <- read.table(input_info,header = TRUE,row.names = 1)
info_data <- as.matrix(info_data)
info_data

dds <- DESeqDataSetFromMatrix(countData = input_data,colData = info_data,design = ~label)#若batch为空则design=~label,否则为design=~batch+label
dds <- DESeq(dds)
result <- results(dds,alpha = 0.1)
summary(result)
result <- result[order(result$padj),]
result_data <- merge(as.data.frame(result),as.data.frame(counts(dds,normalized=TRUE)),by=&#39;row.names&#39;,sort=FALSE)
names(result_data)[1] <- &#39;Gene&#39;
write.table(result_data,output_file,sep = &#39;\t&#39;,quote=F,row.names = F)输出文件格式如下:



diffexp_result_V.txt

2.15.3 Volcanoplot
import seaborn as sns, matplotlib.pyplot as plt, pandas as pd, numpy as np
########################################################################################################################
input_file=&#39;/Users/ZYP/Downloads/imm_repertoire/vdjtools_result/diffexp_result_J.txt&#39;#请将带有绝对路径的输入文件写在此处的&#39;&#39;内
output_file=&#39;/Users/ZYP/Downloads/imm_repertoire/vdjtools_result/diffexp_VolcanoPlot_J.pdf&#39;#请将带有绝对路径的输出文件写在此处的&#39;&#39;内
########################################################################################################################
list_log2FC=[]
list_padj=[]
index_log2FC=2
index_padj=6
number_row=1
for line in open(input_file):
    if number_row==1:
        index_log2FC=line[:-1].split(&#39;\t&#39;).index(&#39;log2FoldChange&#39;)
        index_padj=line[:-1].split(&#39;\t&#39;).index(&#39;padj&#39;)
    else:
        if line.split(&#39;\t&#39;)[index_log2FC]!=&#39;NA&#39; and line.split(&#39;\t&#39;)[index_padj]!=&#39;NA&#39;:
            list_log2FC.append(float(line.split(&#39;\t&#39;)[index_log2FC]))
            list_padj.append(float(line.split(&#39;\t&#39;)[index_padj]))
    number_row+=1
foldchange=np.asarray(list_log2FC)
pvalue=np.asarray(list_padj)

result = pd.DataFrame({&#39;pvalue&#39;: pvalue, &#39;FoldChange&#39;: foldchange})
result[&#39;log(pvalue)&#39;] = -np.log10(result[&#39;pvalue&#39;])
result[&#39;sig&#39;] = &#39;normal&#39;
result[&#39;size&#39;] = np.abs(result[&#39;FoldChange&#39;]) / 10

result.loc[(result.FoldChange > 1) & (result.pvalue < 0.05), &#39;sig&#39;] = &#39;up&#39;
result.loc[(result.FoldChange < -1) & (result.pvalue < 0.05), &#39;sig&#39;] = &#39;down&#39;

ax = sns.scatterplot(x=&#34;FoldChange&#34;, y=&#34;log(pvalue)&#34;,hue=&#39;sig&#39;,hue_order = (&#39;up&#39;,&#39;down&#39;,&#39;normal&#39;),palette=(&#34;#E41A1C&#34;,&#34;#377EB8&#34;,&#34;grey&#34;),data=result)
ax.set_ylabel(&#39;-log10Padj&#39;,fontweight=&#39;bold&#39;)
ax.set_xlabel(&#39;log2FoldChange&#39;,fontweight=&#39;bold&#39;)
ax.spines[&#39;top&#39;].set_visible(False)
ax.spines[&#39;right&#39;].set_visible(False)
plt.savefig(output_file)
plt.show()结果如下图所示:



Volcanoplot of V / J genes

2.15.4 Heatmap:
import seaborn as sns
import pandas as pd
from matplotlib import pyplot as plt
########################################################################################################################
input_file=&#39;/Users/ZYP/Downloads/imm_repertoire/vdjtools_result/padj<0.05&log2FC>1or<-1_V.txt&#39;#请将带有绝对路径的输入文件写在此处的&#39;&#39;内
output_file=&#39;/Users/ZYP/Downloads/imm_repertoire/vdjtools_result/diffexp_heatmap_V.pdf&#39;#请将带有绝对路径的输出文件写在此处的&#39;&#39;内
sample_info=&#39;/Users/ZYP/Downloads/imm_repertoire/vdjtools_result/sample_info.txt&#39;#请将带有绝对路径的样本分组文件写在此处的&#39;&#39;内
########################################################################################################################
list_pos=[]
for line in open(sample_info,&#39;r&#39;):
    info=line[:-1].split(&#39;\t&#39;)
    if info[0]!=&#39;ID&#39;:
        if str(info[1])==&#39;Class1&#39;:
            list_pos.append(info[0])

list_ids=[]
list_samples=[]
list_value_list=[]
list_sample_colors=[]
list_gene_colors=[]
begin_samples=7
index_log2FC=2
index_p=5
number_row=1
for line in open(input_file):
    if number_row!=1:
        list_ids.append(line[:-1].split(&#39;\t&#39;)[0])
        list_value=[]
        for value in line[:-1].split(&#39;\t&#39;)[begin_samples:]:
            list_value.append(float(value))
        list_value_list.append(list_value)
        log2FC=float(line[:-1].split(&#39;\t&#39;)[index_log2FC])
        p=float(line[:-1].split(&#39;\t&#39;)[index_p])
        if p<0.05 and log2FC>1:
            list_gene_colors.append(sns.xkcd_rgb[&#34;dark red&#34;])
        if p<0.05 and log2FC<-1:
            list_gene_colors.append(sns.xkcd_rgb[&#34;marine blue&#34;])

    else:
        begin_samples = line[:-1].split(&#39;\t&#39;).index(&#39;padj&#39;) + 1
        index_log2FC = line[:-1].split(&#39;\t&#39;).index(&#39;log2FoldChange&#39;)
        index_padj = line[:-1].split(&#39;\t&#39;).index(&#39;padj&#39;)
        for sample in line[:-1].split(&#39;\t&#39;)[begin_samples:]:
            list_samples.append(sample)
            if sample in list_pos:
                list_sample_colors.append(sns.xkcd_rgb[&#34;dark red&#34;])
            else:
                list_sample_colors.append(sns.xkcd_rgb[&#34;marine blue&#34;])
    number_row+=1
print(len(list_sample_colors))
print(len(list_gene_colors))
data=pd.DataFrame(data=list_value_list,index=list_ids,columns=list_samples)
print(data)
ax=sns.clustermap(data,standard_scale=1,row_colors=list_gene_colors,col_colors=list_sample_colors)
plt.savefig(output_file)
plt.show()结果如下图所示:



Heatmap of diff-expressed V genes (there were no diff-expressed J genes)

<hr/>作者的话:

  • 如果你喜欢我的科研创作,欢迎将它分享给更多的研友。一起向着创造和传播知识的目标继续进发吧!
  • 学术交流欢迎关注、私信。

原文地址:https://zhuanlan.zhihu.com/p/383280340
楼主热帖
回复

使用道具 举报

发表回复

您需要登录后才可以回帖 登录 | 立即注册 微信登录 手机动态码快速登录

本版积分规则

关闭

官方推荐 上一条 /3 下一条

快速回复 返回列表 客服中心 搜索 官方QQ群 洽谈合作
快速回复返回顶部 返回列表