Detection of Potential Viral Sequence from Next Generation Sequencing Data Using Convolutional Neural Network

Authors

  • Xin Ying Lim, Faculty of Computing, Universiti Teknologi Malaysia, 81310 UTM Johor Bahru, Johor, Malaysia
  • Jia Yee Lim, Faculty of Computing, Universiti Teknologi Malaysia, 81310 UTM Johor Bahru, Johor, Malaysia
  • Weng Howe Chan, Faculty of Computing, Universiti Teknologi Malaysia, 81310 UTM Johor Bahru, Johor, Malaysia (https://orcid.org/0000-0003-0612-3661)
  • Hui Wen Nies, Faculty of Computing, Universiti Teknologi Malaysia, 81310 UTM Johor Bahru, Johor, Malaysia

DOI:

https://doi.org/10.11113/ijic.v13n1.382

Keywords:

Next generation sequencing, viral sequence detection, convolutional neural network, bioinformatics

Abstract

Next Generation Sequencing (NGS) is a modern sequencing technology that determines RNA and DNA sequences faster and at lower cost than earlier sequencing methods. The availability of NGS data has sparked numerous efforts in bioinformatics, especially in the study of genetic variation and viral sequence detection. Viral sequence detection is an important step in studying virus-induced diseases. Common methods for detecting viral sequences align each sequence against existing databases, an approach that remains limited because these databases may be incomplete and highly divergent viruses are difficult to detect by alignment. Machine learning and deep learning have therefore been applied to uncover the patterns that distinguish viral sequences by learning from NGS data. This study focuses on viral sequence detection using a convolutional neural network (CNN). It investigates how a CNN model can be used to analyze NGS data and develops a CNN model for detecting potential viral sequences from NGS data. The CNN architecture is based on an existing design that is divided into two branches, a pattern branch and a frequency branch, which extract different aspects of information from the data and are then combined into a full model. This study further implements a slightly modified architecture that includes an additional convolution layer and pooling layer. Parameter tuning is then performed to identify near-optimal parameters for the CNN and to assess their impact on performance. The optimized CNN model is evaluated on a dataset of 18,445 DNA sequences. The results show that the CNN model in this study outperforms the existing model in terms of area under the receiver operating characteristic curve (AUROC) for the full model (+0.1434).
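
The two-branch idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the branches differ, so this sketch assumes that the pattern branch keeps the strongest filter response (global max pooling) and the frequency branch keeps the average response (global average pooling). The function names, the single hand-built filter, and the toy sequence are hypothetical and are not taken from the paper; a trained model would learn many filters per branch and pass the combined features to further layers.

```python
# Hypothetical sketch of a two-branch feature extractor for DNA sequences.
# The pooling choices below are assumptions, not the paper's stated method.

ALPHABET = "ACGT"


def one_hot(seq):
    """Encode a DNA string as rows of 4 values; unknown bases (e.g. N) become all zeros."""
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET]
            for base in seq.upper()]


def conv1d_relu(encoded, kernel):
    """Slide one filter over the encoded sequence (no padding) and apply ReLU."""
    width = len(kernel)
    activations = []
    for start in range(len(encoded) - width + 1):
        total = 0.0
        for offset in range(width):
            for channel in range(len(ALPHABET)):
                total += encoded[start + offset][channel] * kernel[offset][channel]
        activations.append(max(total, 0.0))
    return activations


def pattern_feature(activations):
    """Assumed pattern branch: global max pooling (is the motif present anywhere?)."""
    return max(activations)


def frequency_feature(activations):
    """Assumed frequency branch: global average pooling (how often does it occur?)."""
    return sum(activations) / len(activations)


def combined_features(seq, kernel):
    """Concatenate both branch outputs, as a stand-in for the full model's merge step."""
    activations = conv1d_relu(one_hot(seq), kernel)
    return [pattern_feature(activations), frequency_feature(activations)]


# Toy filter that responds to the motif "ACG" (a learned filter would have real-valued weights).
motif_kernel = one_hot("ACG")
features = combined_features("ACGTACGA", motif_kernel)
print(features)
```

In this toy case the motif occurs twice, so the max-pooled value reflects a perfect match while the averaged value reflects how much of the sequence matches. Keeping both values is one plausible reading of why a combined model could carry more information than either branch alone.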

Published

2023-05-30

How to Cite

Lim, X. Y., Lim, J. Y., Chan, W. H., & Nies, H. W. (2023). Detection of Potential Viral Sequence from Next Generation Sequencing Data Using Convolutional Neural Network. International Journal of Innovative Computing, 13(1), 13–19. https://doi.org/10.11113/ijic.v13n1.382

Section

Computer Science