Source Code Reading Notes: DeephageTP
coconutnut

https://github.com/chuym726/DeephageTP

Environment Setup & Data Preparation

(Already installed: conda 4.9.2)

  1. Download https://github.com/chuym726/DeephageTP

  2. Create a new conda environment, deephageTP:

conda create --name deephageTP python=3.6 numpy theano keras scikit-learn
  3. Create a new project folder DeephageTP-test (copy the needed files over from the downloaded DeephageTP-master)

  4. Open DeephageTP-test in PyCharm and activate the environment:

conda activate deephageTP

Also switch the Python Interpreter to deephageTP.
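A quick import check (my own snippet, not part of the repo) confirms the environment is wired up before touching the scripts:

# Run inside the deephageTP environment; importing keras 2.x
# announces its backend, as seen in the run logs below
import numpy, sklearn, keras  # prints "Using TensorFlow backend."
print(numpy.__version__, sklearn.__version__, keras.__version__)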

Testing example_data.fa

The code reads training_data.faa directly, so X and Y should come from one and the same file,

but the provided training_data.faa.X.npy.tar.gz and training_data.faa.Y.npy.tar.gz are two separate files.

So let's try example_data.fa first (hit bug #1 here; see the Debug section for the fix).

(deephageTP) coconutnut@x86_64-apple-darwin13 DeephageTP-test % python DeephageTP_model_training.py
Using TensorFlow backend.
only amino acid code.
ok! there is the same number (193) of labels and sequences.
193
193
193
ok! windows is 900.
ok! raw data has been saved as a npy file example_data.fa.X/Y
now runing is DL_Train.
nb_filters: 50
kernel_s: 3
n_batch: 10
n_echos: 20
dropout1: 0.1
dropout2: 0.1
ok! the npy file example_data.fa.X/Y.npy are loaded!
ok! all labels are in 4 kinds.
193
now training for all, Be noted here no test part !!!
[[1. 0. 0. 0.]
[1. 0. 0. 0.]
[1. 0. 0. 0.]
......
[0. 0. 0. 1.]
[0. 0. 0. 1.]
[0. 0. 0. 1.]
[0. 0. 0. 1.]]
2020-11-19 17:30:03.895070: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
2020-11-19 17:30:03.895316: I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 8. Tune using inter_op_parallelism_threads for best performance.
Epoch 1/20
193/193 [==============================] - 12s 64ms/step - loss: 1.1962 - accuracy: 0.4767
Epoch 2/20
193/193 [==============================] - 11s 55ms/step - loss: 0.4503 - accuracy: 0.8446
Epoch 3/20
193/193 [==============================] - 10s 53ms/step - loss: 0.1844 - accuracy: 0.9534
Epoch 4/20
193/193 [==============================] - 10s 52ms/step - loss: 0.1079 - accuracy: 0.9845
Epoch 5/20
193/193 [==============================] - 11s 55ms/step - loss: 0.0383 - accuracy: 1.0000
Epoch 6/20
193/193 [==============================] - 10s 50ms/step - loss: 0.0203 - accuracy: 1.0000
Epoch 7/20
193/193 [==============================] - 12s 61ms/step - loss: 0.0105 - accuracy: 1.0000
Epoch 8/20
193/193 [==============================] - 10s 51ms/step - loss: 0.0063 - accuracy: 1.0000
Epoch 9/20
193/193 [==============================] - 10s 51ms/step - loss: 0.0058 - accuracy: 1.0000
Epoch 10/20
193/193 [==============================] - 10s 50ms/step - loss: 0.0104 - accuracy: 0.9948
Epoch 11/20
193/193 [==============================] - 10s 51ms/step - loss: 0.0066 - accuracy: 1.0000
Epoch 12/20
193/193 [==============================] - 10s 51ms/step - loss: 0.0041 - accuracy: 1.0000
Epoch 13/20
193/193 [==============================] - 10s 52ms/step - loss: 0.0017 - accuracy: 1.0000
Epoch 14/20
193/193 [==============================] - 10s 50ms/step - loss: 0.0026 - accuracy: 1.0000
Epoch 15/20
193/193 [==============================] - 10s 51ms/step - loss: 0.0017 - accuracy: 1.0000
Epoch 16/20
193/193 [==============================] - 10s 50ms/step - loss: 0.0013 - accuracy: 1.0000
Epoch 17/20
193/193 [==============================] - 10s 52ms/step - loss: 0.0011 - accuracy: 1.0000
Epoch 18/20
193/193 [==============================] - 9s 48ms/step - loss: 8.4943e-04 - accuracy: 1.0000
Epoch 19/20
193/193 [==============================] - 9s 49ms/step - loss: 8.9927e-04 - accuracy: 1.0000
Epoch 20/20
193/193 [==============================] - 9s 49ms/step - loss: 0.0011 - accuracy: 1.0000
the model (example_data.fa.all.h5) has saved!

Three files were produced:

example_data.fa.all.h5

example_data.fa.X.npy

example_data.fa.Y.npy
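A quick way to sanity-check the generated arrays (my own snippet; the expected shapes are inferred from the log above, which reports 193 sequences and a 900-wide window):

import numpy as np

X = np.load("example_data.fa.X.npy")
Y = np.load("example_data.fa.Y.npy")
print(X.shape, Y.shape)  # expecting (193, 900, 20) and (193,)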

So were the provided training_data.faa.X.npy.tar.gz and training_data.faa.Y.npy.tar.gz generated the same way? The data folder contains just one empty file named a — did they forget to upload training_data.faa? 😧

Checked the code, and yes, that's exactly it 🙄

aa_ref2npy() does the format conversion: it one-hot encodes the sequences and saves X and Y as separate files.

DL_Train() does the training.

In that case, we can just extract training_data.faa.X.npy.tar.gz and training_data.faa.Y.npy.tar.gz, skip aa_ref2npy(), and train directly.

Testing training_data.faa

Extract training_data.faa.X.npy.tar.gz and training_data.faa.Y.npy.tar.gz.

(25.3 MB expands to 3.97 GB — that is a lot of zeros; one-hot encoding living up to its name)
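The size is easy to verify: the run below reports 27585 training sequences, and each becomes a (900, 20) matrix. A back-of-the-envelope check, assuming the arrays are stored as float64:

# 27585 sequences x 900 positions x 20 channels x 8 bytes (float64 assumed)
n_seq, len_w, matrix_size, nbytes = 27585, 900, 20, 8
print(n_seq * len_w * matrix_size * nbytes / 1e9)  # ~3.97 GB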

Two changes to the code:

Line 8:

from sklearn.cross_validation import train_test_split

becomes

from sklearn.model_selection import train_test_split

And this block at the end:

if 1:
    aa_ref2npy(ref_Data=ref_Data, len_w=len_w)

gets commented out, since the .npy files already exist.

Run:

(deephageTP) coconutnut@x86_64-apple-darwin13 DeephageTP-test % python DeephageTP_model_training.py
Using TensorFlow backend.
only amino acid code.
now runing is DL_Train.
nb_filters: 50
kernel_s: 3
n_batch: 10
n_echos: 20
dropout1: 0.1
dropout2: 0.1
ok! the npy file training_data.faa.X/Y.npy are loaded!
ok! all labels are in 4 kinds.
27585
now training for all, Be noted here no test part !!!
[[1. 0. 0. 0.]
[1. 0. 0. 0.]
[1. 0. 0. 0.]
...
[0. 0. 0. 1.]
[0. 0. 0. 1.]
[0. 0. 0. 1.]]
2020-11-19 17:52:30.350217: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
2020-11-19 17:52:30.350529: I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 8. Tune using inter_op_parallelism_threads for best performance.
Epoch 1/20
1250/27585 [>.............................] - ETA: 22:33 - loss: 0.5788 - accuracy: 0.8016

Too big (over 22 minutes per epoch); not running it to completion.

Code Analysis

Training script: DeephageTP_model_training.py

aa_ref2npy()

Performs the one-hot encoding.

It takes a protein sequence like this ↓

>UniRef100_A0A017QK57 PBSX family phage terminase large subunit n=1 Tax=Glaesserella parasuis str. Nagasaki TaxID=1117322 RepID=A0A017QK57_HAEPR 1
MKIQLNLPPKLIPVFTQQNVRYRGAYGGRGSAKTRTFAKMTAVVAYQRAMQGESGVILCGREFMNSLEDSSLEEIKQAIQSEPWLTDFFEVGEKYVRTKCGRISYIFTGLRHNLDSIKSKARILLAWIDEAESVSEMAWRKLLPTVRENGSEIWLTWNPEKKGSATDLRFRQHQDESMAIVEMNYSDNPWFPDVLEQERLRDKARLDDATYRWIWEGAYLEQSEAQIFRDKFQELEFKPNGDFSGPYFGLDFGFAQDPTAAVKCWVFKDELYIEYEAGKVGLELDDTATFLQKGIVGIEQYVIRADSARPESISYLKRHGLPRIDGVSKWKGSVEDGIAHIKSYKKIYIHPRCQQTLNEFRLYSYKTDRLSGDILPVVLDENNHYIDALRYALEPLMKGRQSWFG

one-hot encodes it as an np.array, and saves the result (X and Y stored separately).

Printing X to take a look:

[[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],   # corresponds to M
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],   # corresponds to K
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],   # corresponds to I
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
 ...

It's simply the one-hot encoding in one-to-one correspondence with the protein sequence MKI… (with the residue-to-index mapping hard-coded by hand, no less).

The shape is (900, 20), matching the description in the paper.
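The encoding itself is simple enough to sketch. A minimal version of the idea follows; note that the repo hard-codes its own residue order (the printout above maps M→8, K→14, I→11), so the alphabetical AA string below is only illustrative, and I am assuming sequences are truncated or zero-padded to the 900-residue window:

import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"  # illustrative order; the repo's hand-written mapping differs
AA_INDEX = {aa: i for i, aa in enumerate(AA)}

def one_hot(seq, len_w=900, matrix_size=20):
    # One (len_w, matrix_size) matrix per sequence; positions past the
    # sequence end stay all-zero (assumed padding behavior)
    X = np.zeros((len_w, matrix_size))
    for pos, aa in enumerate(seq[:len_w]):  # truncate anything past len_w
        if aa in AA_INDEX:                  # unknown residues left all-zero
            X[pos, AA_INDEX[aa]] = 1
    return X

print(one_hot("MKIQLNLPPKL").shape)  # (900, 20)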

Here Y is 0, taken from the header line of each protein:

>UniRef100_A0A017QK57 PBSX family phage terminase large subunit n=1 Tax=Glaesserella parasuis str. Nagasaki TaxID=1117322 RepID=A0A017QK57_HAEPR 1

The trailing number, minus 1, is the class label.

For example, the protein below belongs to class 2:

>UniRef100_A0A072NPV4 Phage terminase, small subunit n=1 Tax=Bacillus azotoformans MEV2011 TaxID=1348973 RepID=A0A072NPV4_BACAZ 3
MAKDGTNRGGARVGAGAKKKPLTDKIAEGNPGGRKLTVMEFKDTADLKGLEMPEPNKMLEAIQKDGKALVAGEIYRNTWAWLNERGCAALVSPQLLERYAMSVARWIQCEEAVTEYGFLAKHPTTGNAIQSPYVAMGQNYMNQTNRLWMEIFQIVKENCTGEYSGINPQDDVMERLLTARRGK
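So producing Y reduces to pulling that trailing integer off each header. A sketch of what aa_ref2npy is evidently doing (the function name parse_label is mine):

def parse_label(header):
    # Class label = trailing integer of the FASTA header, minus 1
    return int(header.strip().split()[-1]) - 1

header = (">UniRef100_A0A072NPV4 Phage terminase, small subunit n=1 "
          "Tax=Bacillus azotoformans MEV2011 TaxID=1348973 "
          "RepID=A0A072NPV4_BACAZ 3")
print(parse_label(header))  # 2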

DL_Train()

def DL_Train(ref_Data, len_w):
    # Hyperparameter settings
    nb_filters = 50
    kernel_s = 3
    n_batch = 10
    n_echos = 20
    dropout1 = 0.10
    dropout2 = 0.10
    print("now runing is DL_Train.")
    # end_lossrint("all is ended.")
    print("nb_filters: ", nb_filters)
    print("kernel_s: ", kernel_s)
    print("n_batch: ", n_batch)
    print("n_echos: ", n_echos)
    print("dropout1: ", dropout1)
    print("dropout2: ", dropout2)

    # Load and prepare the data
    X2 = np.load(ref_Data + ".X.npy")
    Y2 = np.load(ref_Data + ".Y.npy")
    print("ok! the npy file " + ref_Data + ".X/Y.npy are loaded!")
    n_classes = 4  # len(np.unique(Y))
    print("ok! all labels are in " + str(n_classes) + " kinds.")
    YY_t = []
    for i in Y2:
        ll = np.zeros(n_classes)
        ll[i] = 1
        YY_t.append(ll)
    YY_t = np.array(YY_t)
    print(len(YY_t))
    print("now training for all, Be noted here no test part !!!")
    X_train = X2.reshape(-1, 1, len_w, matrix_size)
    Y_train = YY_t
    print(Y_train)

    # Build the model
    model = Sequential()
    model.add(Conv2D(filters=nb_filters, kernel_size=(7, 1), padding='same',
                     input_shape=(1, len_w, matrix_size), data_format='channels_first'))
    model.add(Activation('relu'))
    model.add(MaxPooling2D(pool_size=(3, 3)))
    model.add(Dropout(dropout1))
    model.add(Flatten())
    model.add(Dense(100, activation='relu'))
    model.add(Dropout(dropout2))
    model.add(Dense(n_classes, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer=Adam(), metrics=['accuracy'])

    # Train the model
    model.fit(X_train, Y_train, batch_size=n_batch, epochs=n_echos, verbose=1)

    # Save the model
    model.save(ref_Data + '.all.h5')
    print("the model (" + ref_Data + ".all.h5) has saved!")

Debug

#1

Bug hit the first time running DeephageTP_model_training.py:

ModuleNotFoundError: No module named 'sklearn.cross_validation'

Fix:

https://blog.csdn.net/qq_35962520/article/details/85295228

# from sklearn.cross_validation import train_test_split  # cross_validation is deprecated; moved to model_selection
from sklearn.model_selection import train_test_split