
Gene fusions classification through Recurrent Neural Network


d-michele/gene-fusions-classifier


Bioinformatics, Academic Year 2018/2019 - Report

Development of a set of LSTM-based neural networks for discriminating oncogenic sequences in gene fusions.

Architectures & Results

To evaluate the results of every model in a general and comparable way, a three-phase approach was adopted:

  • 1 - Holdout - selection of the best parameters;

  • 2 - Training - fit of the model with the chosen parameters;

    • 2.a* algorithm 7.2, via the option: --early_stopping_epoch
    • 2.b* algorithm 7.3, via the option: --early_stopping_on_loss
  • 3 - Testing - prediction on data the model has never seen.

Note that the results below were produced exclusively with algorithm 2.b, as it gave better performance.

* (Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. The MIT Press, pp. 246-250.)
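The two training variants correspond to algorithms 7.2 and 7.3 in the reference above: 7.2 retrains on the full data for the number of epochs found during holdout, while 7.3 keeps training until the loss falls below the value recorded at the early-stopping point. Below is a minimal sketch of the idea behind variant 2.b; the function name and the Keras-style `fit` interface are illustrative assumptions, not the repository's actual code.

```python
def fit_until_loss(model, x_full, y_full, target_loss,
                   batch_size=16, max_epochs=200):
    """Sketch of early stopping 'on loss' (Goodfellow et al., alg. 7.3).

    Keep training on the full (training + validation) data, one epoch at a
    time, until the epoch loss drops below the loss value recorded at the
    early-stopping point during holdout (`target_loss`).
    """
    for _ in range(max_epochs):
        history = model.fit(x_full, y_full, batch_size=batch_size,
                            epochs=1, verbose=0)
        if history.history["loss"][0] <= target_loss:
            break
    return model
```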

Model #1 - Unidirectional LSTM on DNA sequences

Network characteristics:

  • One-hot encoding
  • Batch size: 16
  • Number of units in the FC layer: 32
  • Dropout rate: 0.4
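A minimal tf.keras sketch of how such a network could be assembled from the characteristics above, taking the LSTM size and learning rate from the best holdout configuration (#14); the layer ordering, alphabet size, and optimizer are assumptions, not the repository's actual implementation.

```python
import tensorflow as tf

ALPHABET_SIZE = 5  # one-hot over A, C, G, T + unknown symbol (assumption)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, ALPHABET_SIZE)),  # variable-length DNA sequences
    tf.keras.layers.LSTM(16),                            # 16 units, best holdout (#14)
    tf.keras.layers.Dropout(0.4),                        # dropout rate listed above
    tf.keras.layers.Dense(32, activation="relu"),        # FC layer with 32 units
    tf.keras.layers.Dense(1, activation="sigmoid"),      # binary oncogenic / non-oncogenic output
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, batch_size=16, ...)
```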

Results:

| Holdout | Learning rate | LSTM units | Dropout | ES_epoch | Loss | Accuracy | F1 | Precision | Recall |
|---------|---------------|------------|---------|----------|------|----------|-----|-----------|--------|
| 1 | 1e-3 | 32 | 0.1 | 18 | 0.589 | 0.738 | 0.689 | 0.838 | 0.627 |
| 2 | 1e-3 | 32 | 0.3 | 31 | 0.609 | 0.753 | 0.707 | 0.862 | 0.629 |
| 3 | 1e-3 | 32 | 0.5 | 34 | 0.586 | 0.738 | 0.708 | 0.812 | 0.658 |
| 4 | 1e-4 | 48 | 0.1 | 5 | 0.609 | 0.671 | 0.572 | 0.799 | 0.477 |
| 5 | 1e-3 | 48 | 0.1 | 26 | 0.715 | 0.563 | 0.367 | 0.589 | 0.289 |
| 6 | 1e-4 | 48 | 0.3 | 23 | 0.605 | 0.740 | 0.696 | 0.835 | 0.629 |
| 7 | 1e-4 | 48 | 0.5 | 56 | 0.709 | 0.587 | 0.421 | 0.663 | 0.331 |
| 8 | 1e-3 | 64 | 0.1 | 18 | 0.581 | 0.740 | 0.698 | 0.814 | 0.646 |
| 9 | 1e-4 | 64 | 0.1 | 47 | 0.639 | 0.704 | 0.619 | 0.778 | 0.535 |
| 10 | 5e-4 | 64 | 0.3 | 12 | 0.602 | 0.717 | 0.614 | 0.852 | 0.512 |
| 11 | 1e-3 | 64 | 0.3 | 24 | 0.596 | 0.758 | 0.723 | 0.839 | 0.659 |
| 12 | 1e-3 | 64 | 0.5 | 6 | 0.622 | 0.686 | 0.604 | 0.799 | 0.525 |
| 13 | 1e-4 | 64 | 0.5 | 45 | 0.628 | 0.693 | 0.616 | 0.734 | 0.549 |
| 14 | 1e-4 | 16 | 0.1 | 24 | 0.539 | 0.799 | 0.799 | 0.820 | 0.802 |
| 15 | 1e-4 | 16 | 0.3 | 24 | 0.559 | 0.717 | 0.701 | 0.746 | 0.691 |
| 16 | 1e-4 | 16 | 0.5 | 24 | 0.559 | 0.755 | 0.720 | 0.825 | 0.659 |
| Test | Model | Loss | Acc | F1-Score | Precision | Recall | AP | TN | FP | FN | TP |
|------|-------|------|-----|----------|-----------|--------|-----|----|----|----|----|
| - | 14 | 0.386 | 0.871 | 0.865 | 0.919 | 0.828 | 0.930 | 211 | 19 | 40 | 190 |

Model #2 - Bidirectional LSTM on DNA sequences

Network characteristics:

  • One-hot encoding
  • Max pooling size: 2
  • Batch size: 20
  • Learning rate: 5e-4
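A hedged tf.keras sketch of the bidirectional variant, using the best holdout configuration (#9: kernel size 3, 25 convolutional filters, 16 LSTM units, dropout 0.3); the placement of the four dropout values and the convolution-before-LSTM ordering are assumptions.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, 5)),                        # one-hot DNA (alphabet size assumed)
    tf.keras.layers.Conv1D(25, kernel_size=3, activation="relu"),  # conv_num_filters=25, kernel_size=3
    tf.keras.layers.MaxPooling1D(pool_size=2),                     # max pooling size 2
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(16)),       # 16 LSTM units per direction
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=5e-4),
              loss="binary_crossentropy", metrics=["accuracy"])
```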

Results:

| Holdout | Dropout | lstm_units | kernel_size | conv_num_filters | ES_epoch | Loss | Accuracy | F1 | Precision | Recall |
|---------|---------|------------|-------------|------------------|----------|------|----------|-----|-----------|--------|
| 1 | [0.1;0.1;0.1;0.1] | 16 | 3 | 50 | 7 | 0.57295 | 0.73866 | 0.71711 | 0.77771 | 0.68804 |
| 2 | [0.1;0.1;0.1;0.1] | 16 | 5 | 50 | 5 | 0.64208 | 0.69978 | 0.62190 | 0.77794 | 0.54552 |
| 3 | [0.1;0.1;0.1;0.1] | 16 | 10 | 50 | 3 | 0.62138 | 0.71490 | 0.61748 | 0.87659 | 0.50720 |
| 4 | [0.3;0.3;0.3;0.3] | 16 | 3 | 50 | 42 | 0.60901 | 0.69330 | 0.68989 | 0.71246 | 0.71297 |
| 5 | [0.3;0.3;0.3;0.3] | 16 | 5 | 50 | 17 | 0.50554 | 0.76674 | 0.69009 | 0.89745 | 0.60612 |
| 6 | [0.3;0.3;0.3;0.3] | 16 | 10 | 50 | 8 | 0.62940 | 0.69114 | 0.70468 | 0.69709 | 0.76915 |
| 7 | [0.3;0.3;0.3;0.3] | 32 | 5 | 50 | 15 | 0.49978 | 0.76026 | 0.71858 | 0.85039 | 0.65371 |
| 8 | [0.3;0.3;0.3;0.3] | 32 | 10 | 50 | 6 | 0.62290 | 0.68035 | 0.61435 | 0.80636 | 0.52854 |
| 9 | [0.3;0.3;0.3;0.3] | 16 | 3 | 25 | 48 | 0.48879 | 0.79914 | 0.77643 | 0.86375 | 0.72218 |
| 10 | [0.3;0.3;0.3;0.3] | 16 | 5 | 25 | 14 | 0.55411 | 0.78186 | 0.74186 | 0.85337 | 0.69907 |
| 11 | [0.3;0.3;0.3;0.3] | 16 | 10 | 25 | 9 | 0.64211 | 0.63715 | 0.62165 | 0.68523 | 0.60172 |
| Test | Model | Loss | Accuracy | F1 | Precision | Recall | AP | TN | FP | FN | TP |
|------|-------|------|----------|-----|-----------|--------|-----|----|----|----|----|
| - | 9 | 0.706 | 0.711 | 0.677 | 0.750 | 0.633 | 0.799 | 180 | 50 | 83 | 147 |

Model #3 - Unidirectional LSTM on protein sequences

Network characteristics:

  • One-hot encoding
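The layer sizes in the results below follow the [LSTM; LSTM; FC] convention noted under the table. A hedged tf.keras sketch with the best holdout configuration (#18: layers [16; 16; 32], dropout 0.3, no L2 regularization, learning rate 1e-4); the amino-acid alphabet size and where each dropout value is applied are assumptions.

```python
import tensorflow as tf

AA_ALPHABET = 20  # one-hot over the canonical amino acids (assumption)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, AA_ALPHABET)),
    tf.keras.layers.LSTM(16, return_sequences=True, dropout=0.3),  # first LSTM layer
    tf.keras.layers.LSTM(16, dropout=0.3),                         # second LSTM layer
    tf.keras.layers.Dense(32, activation="relu"),                  # FC layer, no L2 penalty here
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="binary_crossentropy", metrics=["accuracy"])
```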

Results:

| Holdout | Layer size* | Dropout* | L2 Regularization* | Learning rate | Batch size | ES_epoch | Loss | Accuracy | F1 | Precision | Recall |
|---------|-------------|----------|--------------------|---------------|------------|----------|------|----------|-----|-----------|--------|
| 1 | [128;128;128] | [0.5;0.5;0.5;0.5] | [0;0;0;0] | 1e-4 | 32 | 11 | 0.51546 | 0.73002 | 0.63162 | 0.94690 | 0.50496 |
| 2 | [128;128;128] | [0.5;0.5;0.5;0.5] | [1e-3;1e-3;1e-3;1e-3] | 1e-4 | 32 | 10 | 0.73175 | 0.73218 | 0.64949 | 0.88817 | 0.54384 |
| 3 | [128;128;128] | [0.5;0.5;0.5;0.5] | [1e-4;1e-4;1e-4;1e-4] | 1e-4 | 32 | 11 | 0.54805 | 0.73002 | 0.63080 | 0.94690 | 0.50474 |
| 4 | [128;128;128] | [0.5;0.5;0.7;0.7] | [0;0;0;0] | 1e-4 | 32 | 11 | 0.52107 | 0.71490 | 0.62148 | 0.89304 | 0.51461 |
| 5 | [128;128;128] | [0.3;0.3;0.5;0.5] | [0;0;0;0] | 1e-4 | 32 | 11 | 0.53654 | 0.74082 | 0.66415 | 0.87473 | 0.57688 |
| 6 | [128;128;128] | [0.1;0.1;0.5;0.5] | [0;0;0;0] | 1e-4 | 32 | 10 | 0.52756 | 0.72570 | 0.62877 | 0.92785 | 0.50865 |
| 7 | [128;128;128] | [0.4;0.4;0.2;0.2] | [0;0;0;0] | 5e-4 | 32 | 11 | 0.51169 | 0.72786 | 0.68798 | 0.85348 | 0.60160 |
| 8 | [128;128;128] | [0.4;0.4;0.5;0.5] | [0;0;0;0] | 5e-4 | 32 | 11 | 0.51184 | 0.73650 | 0.64521 | 0.92785 | 0.52804 |
| 9 | [128;128;128] | [0.4;0.4;0.8;0.8] | [0;0;0;0] | 5e-4 | 32 | 10 | 0.53321 | 0.70842 | 0.63742 | 0.81173 | 0.55339 |
| 10 | [128;128;128] | [0.4;0.4;0.8;0.8] | [0;0;0;0] | 1e-3 | 32 | 6 | 0.51737 | 0.73002 | 0.69366 | 0.79503 | 0.63938 |
| 11 | [128;128;64] | [0.5;0.5;0.5;0.5] | [0;0;0;0] | 1e-4 | 32 | 10 | 0.52251 | 0.73002 | 0.66371 | 0.87113 | 0.57291 |
| 12 | [128;128;64] | [0.5;0.5;0.5;0.5] | [0;0;0;0] | 1e-4 | 16 | 7 | 0.53382 | 0.74730 | 0.69236 | 0.87422 | 0.60221 |
| 13 | [64;64;64] | [0.5;0.5;0.5;0.5] | [0;0;0;0] | 1e-4 | 32 | 24 | 0.52053 | 0.74082 | 0.64615 | 0.91960 | 0.54510 |
| 14 | [64;64;32] | [0.3;0.3;0.3;0.3] | [0;0;0;0] | 1e-4 | 16 | 17 | 0.52438 | 0.74730 | 0.65596 | 0.82823 | 0.57388 |
| 15 | [32;32;32] | [0.5;0.5;0.5;0.5] | [0;0;0;0] | 1e-4 | 32 | 39 | 0.53357 | 0.74730 | 0.70721 | 0.87628 | 0.63022 |
| 16 | [32;32;32] | [0.5;0.5;0.5;0.5] | [0;0;0;0] | 1e-4 | 16 | 25 | 0.52169 | 0.76890 | 0.72858 | 0.88953 | 0.64982 |
| 17 | [32;32;32] | [0.3;0.3;0.5;0.5] | [0;0;0;0] | 1e-4 | 32 | 28 | 0.53589 | 0.75162 | 0.70089 | 0.87739 | 0.62547 |
| 18 | [16;16;32] | [0.3;0.3;0.3;0.3] | [0;0;0;0] | 1e-4 | 16 | 27 | 0.53081 | 0.75378 | 0.76873 | 0.77319 | 0.79635 |

* [LSTM; LSTM; FC]

| Test | Model | Loss | Accuracy | F1 | Precision | Recall | AP | TN | FP | FN | TP |
|------|-------|------|----------|-----|-----------|--------|-----|----|----|----|----|
| - | 18 | 0.708 | 0.652 | 0.643 | 0.680 | 0.634 | 0.697 | 153 | 77 | 83 | 147 |

Model #4 - Bidirectional LSTM on protein sequences

Model A

Network characteristics:

  • Embedding encoding
  • Batch size: 32
  • Initial learning rate: 5e-4
  • Dropout rate: [0.2, 0.4, 0.6]
  • Number of LSTM units: 0.5 * emb_size
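A hedged tf.keras sketch of Model A with the best holdout configuration (A10: emb_size 32, hence 16 LSTM units, dropout [0.2, 0.2, 0.1]); the vocabulary size, masking, and dropout placement are assumptions.

```python
import tensorflow as tf

EMB_SIZE = 32                 # best holdout (A10)
LSTM_UNITS = EMB_SIZE // 2    # number of LSTM units = 0.5 * emb_size
VOCAB_SIZE = 26               # integer-encoded amino-acid vocabulary (assumption)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None,)),                            # integer-encoded protein sequence
    tf.keras.layers.Embedding(VOCAB_SIZE, EMB_SIZE, mask_zero=True), # learned embedding encoding
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(LSTM_UNITS)),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=5e-4),
              loss="binary_crossentropy", metrics=["accuracy"])
```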

Results:

| Holdout | Dropout | emb_size | ES_epoch | Loss | Accuracy | F1 | Precision | Recall |
|---------|---------|----------|----------|------|----------|-----|-----------|--------|
| A1 | [0.2,0.2,0.1] | 16 | 20 | 0.674 | 0.736 | 0.723 | 0.766 | 0.699 |
| A2 | [0.2,0.2,0.3] | 16 | 61 | 0.605 | 0.711 | 0.680 | 0.797 | 0.620 |
| A3 | [0.2,0.2,0.5] | 16 | 45 | 0.628 | 0.739 | 0.727 | 0.755 | 0.725 |
| A4 | [0.4,0.4,0.1] | 16 | 45 | 0.634 | 0.745 | 0.737 | 0.769 | 0.729 |
| A5 | [0.4,0.4,0.3] | 16 | 19 | 0.670 | 0.711 | 0.718 | 0.726 | 0.731 |
| A6 | [0.4,0.4,0.5] | 16 | 37 | 0.644 | 0.719 | 0.729 | 0.717 | 0.765 |
| A7 | [0.6,0.6,0.1] | 16 | 35 | 0.635 | 0.730 | 0.729 | 0.752 | 0.726 |
| A8 | [0.6,0.6,0.3] | 16 | 24 | 0.663 | 0.752 | 0.728 | 0.791 | 0.687 |
| A9 | [0.6,0.6,0.5] | 16 | 47 | 0.624 | 0.732 | 0.743 | 0.721 | 0.787 |
| A10 | [0.2,0.2,0.1] | 32 | 35 | 0.673 | 0.767 | 0.743 | 0.834 | 0.685 |
| A11 | [0.2,0.2,0.3] | 32 | 47 | 0.643 | 0.732 | 0.666 | 0.881 | 0.559 |
| A12 | [0.2,0.2,0.5] | 32 | 45 | 0.692 | 0.719 | 0.726 | 0.720 | 0.752 |
| A13 | [0.4,0.4,0.1] | 32 | 31 | 0.731 | 0.741 | 0.717 | 0.797 | 0.668 |
| A14 | [0.4,0.4,0.3] | 32 | 32 | 0.667 | 0.754 | 0.730 | 0.781 | 0.705 |
| A15 | [0.4,0.4,0.5] | 32 | 35 | 0.740 | 0.708 | 0.713 | 0.752 | 0.707 |
| A16 | [0.6,0.6,0.1] | 32 | 26 | 0.717 | 0.713 | 0.724 | 0.719 | 0.748 |
| A17 | [0.6,0.6,0.3] | 32 | 19 | 0.764 | 0.706 | 0.699 | 0.754 | 0.670 |
| A18 | [0.6,0.6,0.5] | 32 | 26 | 0.696 | 0.717 | 0.724 | 0.711 | 0.751 |

Model B

Network characteristics:

  • One-hot encoding
  • Batch size: 20
  • Initial learning rate: 5e-4
  • Dropout rate: [0.5, 0.6]
  • Number of LSTM units: [6, 10]
  • Convolutional kernel size: 3
  • Max pooling size: 2
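Model B reuses the convolution + max-pooling + bidirectional-LSTM stack of Model #2, applied to one-hot encoded protein sequences. For completeness, a minimal sketch of such a one-hot encoder; the 20-letter alphabet, padding, and handling of unknown residues are assumptions, not the repository's actual preprocessing.

```python
import numpy as np

# Canonical 20 amino acids; the repository's actual alphabet and padding
# conventions may differ (assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_protein(seq: str, max_len: int) -> np.ndarray:
    """One-hot encode a protein sequence, zero-padded/truncated to max_len."""
    encoded = np.zeros((max_len, len(AMINO_ACIDS)), dtype=np.float32)
    for pos, residue in enumerate(seq.upper()[:max_len]):
        idx = AA_INDEX.get(residue)
        if idx is not None:              # unknown residues stay all-zero
            encoded[pos, idx] = 1.0
    return encoded

# Example: x = one_hot_protein("MKTAYIAKQR", max_len=512)
```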

Results:

| Holdout | Dropout | lstm_units | conv_num_filters | ES_epoch | Loss | Accuracy | F1 | Precision | Recall |
|---------|---------|------------|------------------|----------|------|----------|-----|-----------|--------|
| B1 | [0.5,0.6,0.5] | 6 | 50 | 10 | 0.550 | 0.721 | 0.721 | 0.758 | 0.738 |
| B2 | [0.5,0.5,0.6] | 6 | 50 | 12 | 0.545 | 0.717 | 0.717 | 0.740 | 0.741 |
| B3 | [0.6,0.5,0.5] | 6 | 50 | 12 | 0.554 | 0.726 | 0.714 | 0.763 | 0.711 |
| B4 | [0.5,0.6,0.6] | 6 | 50 | 10 | 0.541 | 0.715 | 0.722 | 0.756 | 0.727 |
| B5 | [0.6,0.5,0.6] | 6 | 50 | 12 | 0.544 | 0.728 | 0.716 | 0.763 | 0.715 |
| B6 | [0.6,0.6,0.5] | 6 | 50 | 12 | 0.546 | 0.724 | 0.707 | 0.786 | 0.675 |
| B7 | [0.6,0.6,0.6] | 6 | 50 | 10 | 0.541 | 0.715 | 0.722 | 0.756 | 0.727 |
| B8 | [0.5,0.5,0.5] | 10 | 50 | 12 | 0.544 | 0.706 | 0.680 | 0.779 | 0.639 |
| Test | Model | Loss | Accuracy | F1 | Precision | Recall | AP | TN | FP | FN | TP |
|------|-------|------|----------|-----|-----------|--------|-----|----|----|----|----|
| - | A10 | 0.729 | 0.652 | 0.610 | 0.709 | 0.551 | 0.760 | 173 | 57 | 103 | 127 |
| - | B7 | 0.708 | 0.667 | 0.615 | 0.722 | 0.584 | 0.772 | 172 | 58 | 95 | 135 |
