Applying Multimodal Large Language Models for Visual Question Answering: Toward Vietnamese Educational Reasoning Systems

Authors

  • Xinh Le: ¹Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology (HCMUT), 268 Ly Thuong Kiet Street, Dien Hong Ward, Ho Chi Minh City, Vietnam; ²Vietnam National University Ho Chi Minh City, Linh Xuan Ward, Ho Chi Minh City, Vietnam
  • Tho Quan: ¹Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology (HCMUT), 268 Ly Thuong Kiet Street, Dien Hong Ward, Ho Chi Minh City, Vietnam; ²Vietnam National University Ho Chi Minh City, Linh Xuan Ward, Ho Chi Minh City, Vietnam
  • Tien-Thinh Nguyen: ³Faculty of Information Technology, ⁴Industrial University of Ho Chi Minh City, Ho Chi Minh City, Vietnam

DOI:

https://doi.org/10.11113/ijic.v16n1.680

Keywords:

Visual Question Answering (VQA), Multimodal Large Language Models (MLLMs), STEM Education, Visual Reasoning, Chain-of-Thought Reasoning

Abstract

Visual Question Answering (VQA) is advancing rapidly thanks to Multimodal Large Language Models (MLLMs), which demonstrate strong capabilities for complex reasoning. However, this progress is predominantly centered on English and general-domain contexts. Domain-specific fields, such as STEM education, and low-resource languages, such as Vietnamese, remain significantly underserved, lacking both standardized datasets and specialized reasoning models. This research addresses this gap by investigating how MLLMs can be adapted for Vietnamese educational settings. As a foundational step, the ViPPS (Vietnamese Physics Problem Solving) dataset, the first multimodal dataset for physics problem solving in Vietnamese, has been constructed and publicly released. Initial experiments on ViPPS show that current MLLMs achieve promising results but still struggle with domain-specific reasoning and numerical accuracy. Based on these observations, the next stage of this research will focus on expanding the ViPPS dataset in both scale and scope, developing and evaluating advanced text-image reasoning and calculation mechanisms, and extending these capabilities to VideoQA and model explainability. This research aims to contribute critical resources and methods that advance the field of educational VQA for low-resource languages.

Published

2026-06-10

How to Cite

Le, X., Quan, T., & Nguyen, T.-T. (2026). Applying Multimodal Large Language Models for Visual Question Answering: Toward Vietnamese Educational Reasoning Systems. International Journal of Innovative Computing, 16(1), 109–114. https://doi.org/10.11113/ijic.v16n1.680

Section

Article