Diabetic Retinopathy Image Classification Using Shift Window Transformer
DOI:
https://doi.org/10.11113/ijic.v13n1-2.415
Keywords:
Diabetic retinopathy, Swin Transformers, Image classification
Abstract
Diabetic retinopathy is one of the most dangerous complications of diabetes and leads to blindness if it is not diagnosed early. Early diagnosis, however, makes it possible to control the disease and prevent its progression to blindness. Transformers, which do not use convolutional layers, are considered state-of-the-art models in natural language processing. In transformers, multi-head attention mechanisms capture long-range contextual relations between pixels. CNNs currently dominate deep learning solutions for grading diabetic retinopathy. However, the benefits of transformers have led us to propose a transformer-based method for recognizing diabetic retinopathy grades. A major objective of this research is to demonstrate that a pure attention mechanism can be used to diagnose diabetic retinopathy and that transformers can replace standard CNNs in identifying its grades. In this study, a Swin Transformer-based technique for diagnosing diabetic retinopathy is presented. Fundus images are divided into non-overlapping patches, which are flattened, and positional information is maintained using a linear and positional embedding procedure. The resulting sequence is fed into several multi-head attention layers to construct the final representation. In the classification step, the initial token sequence is passed to the softmax classification layer, which produces the recognition output. This work evaluates the Swin Transformer on the APTOS 2019 Kaggle dataset, training and testing on fundus images at different resolutions and patch sizes. For an image size of 160×160, the test accuracy, test loss, and test top-2 accuracy were 69.44%, 1.13, and 78.33%, respectively, with patch size 2 and embedding dimension C=64, and 68.85%, 1.12, and 79.96% with patch size 4 and embedding dimension C=96.
For an image size of 224×224, the test accuracy, test loss, and test top-2 accuracy were 72.5%, 1.07, and 83.7%, respectively, with patch size 2 and embedding dimension C=64, and 74.51%, 1.02, and 85.3% with patch size 4 and embedding dimension C=96. The results showed that the Swin Transformer can achieve flexible memory savings. The proposed method highlights that an attention mechanism based on the Swin Transformer model is promising for the diabetic retinopathy grade recognition task.
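
To make the patch partitioning step and the top-2 metric in the abstract concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes three-channel (RGB) fundus images, and the function names `partition_patches` and `top_k_accuracy` are hypothetical. The linear projection to the embedding dimension C and the windowed attention layers are omitted.

```python
import numpy as np


def partition_patches(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C).
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


def top_k_accuracy(probs: np.ndarray, labels: np.ndarray, k: int = 2) -> float:
    """Fraction of samples whose true label is among the k highest scores."""
    top_k = np.argsort(probs, axis=1)[:, -k:]
    hits = (top_k == labels[:, None]).any(axis=1)
    return float(hits.mean())


if __name__ == "__main__":
    # Token counts for the four image/patch configurations in the abstract,
    # assuming 3 input channels (an assumption, not stated in the source).
    for side, patch in [(160, 2), (160, 4), (224, 2), (224, 4)]:
        tokens = partition_patches(np.zeros((side, side, 3)), patch)
        print(side, patch, tokens.shape)
```

Under these assumptions, a 224×224 image with patch size 4 yields 56 × 56 = 3136 tokens of length 48, while patch size 2 yields 112 × 112 = 12544 tokens of length 12. The patch size therefore directly controls the length of the sequence that the attention layers process.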