Performance Comparative Study on Zero Day Malware Detection Using XGBoost and Random Forest Classifiers

Ahmad Faris Aiman Arizal; Marina Md-Arshad; Adlina Abdul-Samad; Maheyzah Md Sirat; Siti Hajar Othman

doi:10.11113/ijic.v14n2.449

Authors

Ahmad Faris Aiman Arizal, Faculty of Computing, Universiti Teknologi Malaysia, 81310 UTM Johor Bahru, Johor, Malaysia
Marina Md-Arshad, Faculty of Computing, Universiti Teknologi Malaysia, 81310 UTM Johor Bahru, Johor, Malaysia
Adlina Abdul-Samad, Faculty of Computing, Universiti Teknologi Malaysia, 81310 UTM Johor Bahru, Johor, Malaysia
Maheyzah Md Sirat, Faculty of Computing, Universiti Teknologi Malaysia, 81310 UTM Johor Bahru, Johor, Malaysia
Siti Hajar Othman, Faculty of Computing, Universiti Teknologi Malaysia, 81310 UTM Johor Bahru, Johor, Malaysia

DOI:

https://doi.org/10.11113/ijic.v14n2.449

Keywords:

Machine learning, zero-day malware, hyperparameter optimisation, ensemble machine learning algorithms

Abstract

Zero-day malware is a significant threat to cybersecurity because it is unknown to antivirus systems and can cause substantial damage before it is detected. Traditional malware detection methods rely on signatures and patterns specific to known malware, so they are ineffective against zero-day malware that has not been encountered before. Machine learning has shown promise in detecting and classifying unknown threats, including zero-day malware. In this research, we propose using machine learning classifiers to detect and classify zero-day malware. The selected classifiers are Random Forest and XGBoost, both well known and widely used in machine learning. To evaluate the effectiveness of our approach, we first collect and pre-process a dataset of known malware. The dataset, Meraz'18, is used to train and test the selected classifiers. It contains PE header data obtained through static analysis, including the entropy calculated for each section. The dataset contains a representation of benign and malicious files that has been used in previous studies for zero-day malware detection, namely during the payload injection phase. To guard against overfitting, 10-fold cross-validation is used. The performance metrics of these classifiers, namely F1-score, accuracy, Cohen’s kappa, precision and recall, are analyzed on the known malware dataset to evaluate their ability to detect and classify zero-day malware. Hyperparameter tuning is applied to each model to obtain its best performance. The results show that the proposed classifiers perform extremely well, both achieving up to almost 98% accuracy. Using machine learning classifiers for zero-day malware detection and classification can significantly improve cybersecurity by providing a way to detect and protect against unknown threats. This work is an essential step towards the development of more robust cybersecurity systems that can effectively protect against unknown threats.
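
The abstract mentions two computations that are easy to illustrate: the entropy of a PE section and the evaluation metrics reported for each classifier. The following is a minimal sketch using only the Python standard library, not the authors' implementation. The function names `shannon_entropy` and `binary_metrics` are hypothetical, the entropy is assumed to be byte-level Shannon entropy, and the positive class is assumed to be "malicious".

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0).

    Assumption: the per-section entropy feature is the byte-level
    Shannon entropy of the section's raw contents.
    """
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall, F1 and Cohen's kappa from confusion counts.

    Assumption: the positive class is "malicious", so tp counts malicious
    files flagged as malicious and fp counts benign files flagged as malicious.
    """
    n = tp + fp + fn + tn
    if n == 0:
        raise ValueError("confusion counts must not all be zero")
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Agreement expected by chance, computed from the marginal totals.
    p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (accuracy - p_e) / (1 - p_e) if p_e != 1 else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "kappa": kappa,
    }
```

High section entropy (close to 8 bits per byte) is commonly associated with packed or encrypted content, which is one reason it is used as a static feature. Cohen's kappa complements accuracy by discounting the agreement a classifier would reach by chance given the class balance.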

IJIC