Enhancing Vision Transformer with Vision language Model Text Embeddings for Robust Air Pollution Classification

Authors

  • Mohamed Mokhtar Ouardi, Faculty of Computing, Faculty of Engineering, Universiti Teknologi Malaysia, 81310 UTM Johor Bahru, Johor, Malaysia
  • Dayang N. A. Jawawi, Faculty of Computing, Faculty of Engineering, Universiti Teknologi Malaysia, 81310 UTM Johor Bahru, Johor, Malaysia
  • Farhan Mohamed, Faculty of Computing, Faculty of Engineering, Universiti Teknologi Malaysia, 81310 UTM Johor Bahru, Johor, Malaysia

DOI:

https://doi.org/10.11113/ijic.v16n1.678

Keywords:

Vision Language Models, Multimodal Learning, Computer Vision, Environmental Monitoring

Abstract

The importance of ecological awareness needs little argument today, but the pressing global environmental challenges facing society still deserve emphasis. Air pollution is only one area of concern, yet it is among the most critical factors influencing human health and environmental sustainability, which makes air quality monitoring necessary. While the Air Quality Index (AQI) has mostly been measured using Internet of Things (IoT) sensors, detecting visible air pollution has garnered interest due to its accessibility. However, existing vision-only methods (CNNs, ViTs) have shown limitations in capturing the generalized correlations that a robust air pollution detection system requires. This work investigates whether semantic information can broaden the scope of features learned by the model. A Vision Language Model (VLM) based text encoder is introduced with the objective of providing knowledge anchors that hold across datasets: the model generates language tokens that guide a Vision Transformer. The proposal also investigates tuning mechanisms for the VLM and image filtering for the input data. The key innovation is a cross-modal integration of a Vision Transformer with a Vision Language Model to create a few-shot model for air pollution classification. The aim is to produce a model that integrates data flexibly and leverages both visual and semantic correlations. The research demonstrated improved generalization across a broad range of dataset variance, and the model outperforms a baseline CNN in accuracy in cross-dataset evaluation.
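The idea of text embeddings guiding a Vision Transformer can be illustrated with a toy sketch. It is not the architecture from the paper, whose details are not given in this abstract. It assumes a single cross-attention step in which image patch tokens attend over text anchor embeddings, followed by mean pooling and a nearest-anchor decision. The function names, the two class labels, and the 3-dimensional vectors are all hypothetical; real embeddings would come from a ViT and a VLM text encoder.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(patch_tokens, text_anchors):
    """Each patch token (query) attends over the text anchors (keys and values);
    the attended text context is added back to the token as a residual."""
    scale = 1.0 / math.sqrt(len(text_anchors[0]))
    guided = []
    for token in patch_tokens:
        weights = softmax([dot(token, anchor) * scale for anchor in text_anchors])
        context = [sum(w * anchor[i] for w, anchor in zip(weights, text_anchors))
                   for i in range(len(token))]
        guided.append([t + c for t, c in zip(token, context)])
    return guided

def classify(patch_tokens, text_anchors, labels):
    """Mean-pool the text-guided tokens and pick the most similar anchor."""
    guided = cross_attend(patch_tokens, text_anchors)
    dim = len(guided[0])
    pooled = [sum(tok[i] for tok in guided) / len(guided) for i in range(dim)]
    sims = [dot(pooled, a) / (math.sqrt(dot(pooled, pooled)) * math.sqrt(dot(a, a)))
            for a in text_anchors]
    return labels[sims.index(max(sims))]

# Hypothetical toy data: two class prompts and two image patch tokens.
labels = ["clean air", "polluted air"]
text_anchors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
patch_tokens = [[0.1, 0.9, 0.2], [0.0, 0.8, 0.1]]
prediction = classify(patch_tokens, text_anchors, labels)
print(prediction)
```

Because the anchors are text embeddings rather than dataset-specific weights, the same anchors could in principle be reused across datasets, which is the intuition behind the "knowledge anchors" described in the abstract.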

Published

2026-06-10

How to Cite

Ouardi, M. M., Jawawi, D. N. A., & Mohamed, F. (2026). Enhancing Vision Transformer with Vision language Model Text Embeddings for Robust Air Pollution Classification. International Journal of Innovative Computing, 16(1), 97–108. https://doi.org/10.11113/ijic.v16n1.678

Section

Article