DeephageTP: A Convolutional Neural Network Framework for Identifying Phage-specific Proteins from metagenomic sequencing data
Running title: an alignment-free deep learning framework for identifying phage-specific proteins

Reference: original paper & GitHub https://github.com/chuym726/DeephageTP

Goal

Three proteins specific to tailed phages:

Portal (portal protein)
TerL (large terminase subunit protein)
TerS (small terminase subunit protein)

Develop a CNN-based framework to identify these three proteins in metagenomic data

Datasets

train/test set

(Location in the paper: lines 106-127)

Source: UniProt (www.uniprot.org)

Preprocessing steps:

Retrieve the protein sequences of each category from the database

according to the taxonomy in the UniProt database, all proteins in archaea, bacteria and viruses were obtained from the database;

The example_data.fa file looks like this:

>UniRef100_A0A017QK57 PBSX family phage terminase large subunit n=1 Tax=Glaesserella parasuis str. Nagasaki TaxID=1117322 RepID=A0A017QK57_HAEPR 1
MKIQLNLPPKLIPVFTQQNVRYRGAYGGRGSAKTRTFAKMTAVVAYQRAMQGESGVILCGREFMNSLEDSSLEEIKQAIQSEPWLTDFFEVGEKYVRTKCGRISYIFTGLRHNLDSIKSKARILLAWIDEAESVSEMAWRKLLPTVRENGSEIWLTWNPEKKGSATDLRFRQHQDESMAIVEMNYSDNPWFPDVLEQERLRDKARLDDATYRWIWEGAYLEQSEAQIFRDKFQELEFKPNGDFSGPYFGLDFGFAQDPTAAVKCWVFKDELYIEYEAGKVGLELDDTATFLQKGIVGIEQYVIRADSARPESISYLKRHGLPRIDGVSKWKGSVEDGIAHIKSYKKIYIHPRCQQTLNEFRLYSYKTDRLSGDILPVVLDENNHYIDALRYALEPLMKGRQSWFG
>UniRef100_A0A017QKX8 PBSX family phage terminase large subunit n=1 Tax=Glaesserella parasuis str. Nagasaki TaxID=1117322 RepID=A0A017QKX8_HAEPR 1
MKIQLNLPPKLIPVFTQQNVRYRGAYGGRGSAKTRTFAKMTAVVAYQRAMQGESGVILCGREFMNSLEDSSLEEIKQAIQSEPWLANFFDVGEKYVHTKCGRISYIFTGLRHNLDSIKSKARILLAWIDEAESVSEMAWRKLLPTVRESGSEIWLTWNPEKKGSATDLRFRQYQDESMAIVEMNYNDNPWFPDVLKQERLRDKARLDDATYRWIWEGDYLEESEAQVFRGKYQELEFKPLPDFEGPYHGLDFGFAQDPTAAIKCWVFKDELYIEYEAGKVGLELDDTATFLQKGIVGIEQYVIRADSARPESISYLKRHGLPRIDGVSKWKGSVEDGIAHIKSYKKIYIHPRCQQTLNEFRLYSYKTDRLSGDVLPTLVDAHNHYIDALRYALNPRIQRKGDFSQNPLKLY
...
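
A minimal sketch of reading a FASTA file in this layout, using only the standard library. The function name `read_fasta` and the inline sample records are illustrative assumptions; they are not taken from the DeephageTP repository.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

# Tiny made-up sample in the same layout as example_data.fa.
sample = io.StringIO(
    ">UniRef100_X1 phage terminase large subunit n=1\n"
    "MKIQLN\n"
    "LPPKL\n"
    ">UniRef100_X2 portal protein n=1\n"
    "MSTAV\n"
)
records = list(read_fasta(sample))
for header, seq in records:
    print(header.split()[0], len(seq))
```

Sequences that wrap over several lines are joined into a single string, so the same function works for wrapped and unwrapped files.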

Remove irrelevant data

the protein sequences were searched by the keywords (i.e., portal, large terminase subunit, and small terminase subunit), and the noise sequences with the uncertain keywords (e.g., hypothetical, possible, like, predicted) were removed to ensure that the selected protein sequences in the three categories are veracious;
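
The keyword step above could be sketched roughly as follows. This is an illustration, not the authors' script: the alternative word order ("terminase large subunit", as seen in the example headers), the plain substring matching, and the rule of discarding headers that match more than one category are all assumptions.

```python
# (keyword, category) pairs; the reversed word orders are an assumption
# based on the headers shown in example_data.fa.
KEYWORDS = [
    ("portal", "Portal"),
    ("large terminase subunit", "TerL"),
    ("terminase large subunit", "TerL"),
    ("small terminase subunit", "TerS"),
    ("terminase small subunit", "TerS"),
]
UNCERTAIN = ("hypothetical", "possible", "like", "predicted")

def label_header(header):
    """Return a category, 'others', or None when the record is dropped as noise."""
    text = header.lower()
    hits = {label for keyword, label in KEYWORDS if keyword in text}
    if not hits:
        return "others"
    # Uncertain wording or more than one category: treat as noise.
    if len(hits) > 1 or any(word in text for word in UNCERTAIN):
        return None
    return next(iter(hits))

for h in ["Portal protein n=1",
          "PBSX family phage terminase large subunit n=1",
          "Putative portal-like protein",
          "DNA polymerase III subunit alpha"]:
    print(h, "->", label_header(h))
```

Substring matching is deliberately crude: "like" also matches "portal-like", which is the behaviour the uncertain-keyword filter is meant to have.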
The 'others' class is too large, so only a random subset is kept

the remaining sequences without the keywords of interest (portal, large terminase subunit and, small terminase subunit) were labeled as the ‘others’ category. However, the size of the ‘others’ category is more than 75 times larger than that of the three categories. To relieve the class-imbalance problem brought by this situation, we randomly selected 20000 protein sequences from the remaining sequences and labeled as the ‘others’ category;
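
A small sketch of the random down-sampling described above. The function name `subsample_others` and the fixed seed are assumptions added for reproducibility; the paper only states that 20000 sequences were selected at random.

```python
import random

def subsample_others(others, k=20000, seed=0):
    """Return at most k randomly chosen records, reproducibly."""
    if len(others) <= k:
        return list(others)
    # A private Random instance keeps the global random state untouched.
    return random.Random(seed).sample(others, k)

# Placeholder identifiers standing in for the real 'others' sequences.
pool = [f"seq_{i}" for i in range(100000)]
kept = subsample_others(pool)
print(len(kept))
```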
Remove sequences with abnormal length

to further guarantee that the sequences from the database with the three categories are veracious, we calculated length distribution of these sequences (see Fig. S1), then manually checked the sequences with the abnormal length (<5% and > 95%) using Blastp (https://blast.ncbi.nlm.nih.gov/Blast.cgi) against NCBI nr database, and the sequences without hitting to the targeted references were filtered out (almost all the sequences with abnormal length) and labeled as the ‘others’ category.
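
The length check can be sketched as below. It only flags the sequences outside the 5th to 95th length percentiles; the Blastp verification against nr is a manual step in the paper and is not reproduced here. The function name and the choice of the "inclusive" percentile method are assumptions.

```python
import statistics

def flag_abnormal_lengths(sequences, low_pct=5, high_pct=95):
    """Split sequences into (typical, abnormal) by length percentile."""
    lengths = [len(s) for s in sequences]
    # quantiles(n=100) returns the 99 cut points P1..P99.
    cuts = statistics.quantiles(lengths, n=100, method="inclusive")
    low, high = cuts[low_pct - 1], cuts[high_pct - 1]
    typical = [s for s in sequences if low <= len(s) <= high]
    abnormal = [s for s in sequences if len(s) < low or len(s) > high]
    return typical, abnormal

# Toy input: one sequence of each length from 1 to 100.
toy = ["A" * n for n in range(1, 101)]
typical, abnormal = flag_abnormal_lengths(toy)
print(len(typical), len(abnormal))
```

The `abnormal` list is what would then be checked by hand before deciding whether a sequence stays in its category or moves to 'others'.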

Final result:

mimic metagenomic dataset

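(Location in the paper: lines 128-140)

Source: UniRef100 (https://www.uniprot.org/uniref/)

The processing is similar to the above.

In addition, the two databases overlap, and the overlapping entries were removed manually.

This dataset was used to test how the model performs at different dataset sizes.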

virome dataset

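(Location in the paper: lines 141-147)

Used to test the model on real data

This set is raw data, so the protein sequences have to be obtained from it separately

Purpose of the three datasets

(Location in the paper: lines 193-197 + Fig. 1A)

train/test set: train the model and analyze feasibility
mimic dataset: determine the cutoff values
virome dataset: test real-world performance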

In summary, as shown in Fig. 1A, in this study, the proposed DeephageTP framework was firstly implemented on the training dataset for feasibility analysis, and then the trained model was applied on the mimic dataset for test and the cutoff value of each category of interest was determined according to the responding loss values distributions; finally, we applied the trained model on the real metagenomic datasets for examining the performance of our framework.
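
One plausible reading of the cutoff step, sketched under loud assumptions: the per-sequence loss is taken to be the negative log of the probability of the predicted class, and a prediction for a category of interest is accepted only if that loss does not exceed the category's cutoff. The cutoff numbers, the class order, and the function name below are hypothetical; the paper derives its own cutoffs from the loss distributions on the mimic dataset.

```python
import math

# Assumed class order of the softmax output.
CLASSES = ["Portal", "TerL", "TerS", "others"]
# Hypothetical cutoffs, for illustration only.
CUTOFFS = {"Portal": 0.01, "TerL": 0.01, "TerS": 0.01}

def call_with_cutoff(probs, cutoffs=CUTOFFS):
    """Return the accepted category, or None if the call is rejected."""
    best = max(range(len(probs)), key=probs.__getitem__)
    label = CLASSES[best]
    if label == "others":
        return None
    loss = -math.log(probs[best])  # assumed per-sequence loss
    return label if loss <= cutoffs[label] else None

print(call_with_cutoff([0.999, 0.0005, 0.0003, 0.0002]))  # confident call
print(call_with_cutoff([0.6, 0.2, 0.1, 0.1]))             # low confidence
```

Under this reading, a smaller cutoff trades recall for precision: fewer sequences are accepted, but the accepted ones carry higher predicted probabilities.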

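Protein sequence encoding

(Location in the paper: lines 148-161)

Method: one-hot

Each of the 20 amino acids is one-hot encoded, so one amino acid takes 20 positions

A protein sequence of L amino acids is encoded as an L * 20 matrix

The input size must be fixed, so the maximum length is set to 900: longer sequences are truncated and shorter ones are padded with zeros

In the end, every protein sequence is encoded as a 900 * 20 matrix, which serves as the input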
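
A minimal sketch of this encoding in plain Python. The column order of the 20 amino acids and the handling of non-standard residues (left as all-zero rows) are assumptions; the notes do not specify either.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # column order is an assumption
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
MAX_LEN = 900

def one_hot_encode(sequence, max_len=MAX_LEN):
    """Return a max_len x 20 matrix (list of rows) for a protein sequence."""
    matrix = [[0] * len(AMINO_ACIDS) for _ in range(max_len)]
    # zip stops at the shorter input: long sequences are truncated,
    # short ones leave the remaining rows as zero padding.
    for row, aa in zip(matrix, sequence.upper()):
        col = INDEX.get(aa)
        if col is not None:  # unknown residues such as X stay all-zero
            row[col] = 1
    return matrix

encoded = one_hot_encode("MKIQLNLPPKL")
print(len(encoded), len(encoded[0]))
```

The nested list can be converted to an array (for example with `numpy.array(encoded)`) before being passed to a deep learning library.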

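Model

Architecture

(Location in the paper: lines 163-170)

Input layer (ReLU)
Convolutional layer
Max-pooling layer (dropout)
Fully connected layer 1 (ReLU) (dropout)
Fully connected layer 2
Output layer (SoftMax)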

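Hyperparameter selection

(Location in the paper: lines 171-189)

The number of fully connected units, the dropout rate, the learning rate, etc. mostly use common default values; some were determined by 5-fold cross-validation

The 20 amino acids are divided into 7 groups (based on dipole moments and side chains): {A,G,V}, {I,L,F,P}, {Y,M,T,S}, {H,N,Q,W}, {R,K}, {D,E} and {C} -> the filter size of the convolutional layer is set to 7 * 1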
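
For reference, the seven groups listed above can be written as a lookup table. This sketch only shows the grouping itself; how, or whether, DeephageTP uses the group indices beyond motivating the filter size is not stated in these notes, and the helper name `to_group_ids` is hypothetical.

```python
# The seven amino-acid groups exactly as listed in the notes.
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
GROUP_OF = {aa: i for i, members in enumerate(GROUPS) for aa in members}

def to_group_ids(sequence):
    """Map each standard residue to its group index (0-6), skipping others."""
    return [GROUP_OF[aa] for aa in sequence.upper() if aa in GROUP_OF]

print(len(GROUP_OF))          # number of amino acids covered by the groups
print(to_group_ids("MKIQ"))
```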

Model evaluation metrics

(Location in the paper: lines 198-213)

Accuracy
Precision
Recall
F1-score
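
A short sketch of how these four metrics are commonly computed for one category in a one-vs-rest fashion. The function name and the toy labels are illustrative; the paper's exact averaging across categories is not described in these notes.

```python
def one_vs_rest_metrics(y_true, y_pred, positive):
    """Accuracy, precision, recall and F1 for one category against the rest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Toy labels, made up for illustration.
y_true = ["Portal", "Portal", "TerL", "others", "others", "TerS"]
y_pred = ["Portal", "others", "TerL", "Portal", "others", "TerS"]
scores = one_vs_rest_metrics(y_true, y_pred, positive="Portal")
print(scores)
```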

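Loss calculation

(Location in the paper: lines 214-234)

yk: the true label of the sequence in dimension k

pk: the predicted label of the sequence in dimension k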
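
The formula itself is missing from these notes. Assuming it is the categorical cross-entropy that is usually paired with a SoftMax output, loss = -sum over k of yk * log(pk), a minimal sketch looks like this. The `eps` clamp is an added safeguard against log(0), not something taken from the paper.

```python
import math

def cross_entropy(y_true, y_prob, eps=1e-12):
    """Categorical cross-entropy between a one-hot label and predicted probabilities."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_true, y_prob))

# One-hot true label (class 2 of 4) and a made-up softmax output.
y = [0, 1, 0, 0]
p = [0.05, 0.90, 0.03, 0.02]
loss = cross_entropy(y, p)
print(round(loss, 4))
```

With a one-hot label only the term of the true class survives, so the loss reduces to -log of the probability assigned to the true class: the more confident and correct the prediction, the closer the loss is to zero.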

Implementation

(Location in the paper: lines 190-192)

Python 3.6

Keras

https://github.com/chuym726/DeephageTP

Open questions

What is the cutoff value for, and how is it computed?
How is the loss computed?