Research on Deep Learning-Based Methods for Matching Traffic Sign Images with Textual Captions
Abstract
Intelligent transportation systems face challenges in matching traffic sign images with natural language descriptions, particularly modal heterogeneity and fine-grained semantic alignment. Accurate image-text matching is crucial for understanding the traffic environment and for safety decision-making in autonomous driving, and therefore carries significant application value. Most existing methods rely on classification or template matching and lack deep semantic modelling between images and texts, making them difficult to adapt to complex real-world scenarios. To address this, this paper proposes a deep learning-based image-text matching method that automatically parses directory structures to generate fine-grained labels, and introduces an InfoNCE contrastive loss over in-batch negative samples to achieve cross-modal learning. Pre-trained ResNeXt50_32x4d and DistilBERT models are employed as the image and text encoders, respectively, and their outputs are mapped into a shared embedding space. Experimental results demonstrate that the proposed method outperforms existing methods on Recall@1, mean Average Precision (mAP), and Mean Reciprocal Rank (MRR), showing stronger semantic alignment capability and application potential.
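The InfoNCE objective with in-batch negatives described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function name, batch size, and temperature value are assumptions, and in the paper the two embedding matrices would come from the ResNeXt50_32x4d and DistilBERT encoders projected into the shared space.

```python
import numpy as np

def info_nce(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    Row i of img_emb and row i of txt_emb form the matched (positive) pair;
    every other row in the batch serves as a negative sample.
    """
    # L2-normalize so the dot product is cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (B, B) similarity matrix
    labels = np.arange(logits.shape[0])  # matched pairs lie on the diagonal

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_prob[labels, labels].mean()

    # Average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Minimizing this loss pulls each matched image-text pair together in the shared embedding space while pushing it away from the other pairs in the same batch.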
Article Details

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
References
Y. Li and J. Qu, “Intelligent Road Tracking and Real-time Acceleration-Deceleration for Autonomous Driving Using Modified Convolutional Neural Networks,” Current Applied Science and Technology, vol. 22, no. 1, pp. 1–10, 2022.
R. Hu and A. Singh, “UniT: Multimodal multitask learning with a unified transformer,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 1439–1449, 2021.
L. Yang, P. Luo, C. C. Loy, and X. Tang, “Fine-grained traffic sign recognition with hierarchical attention and localization,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 6, pp. 5967–5979, 2022.
B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning to generate fine-grained image labels with minimal supervision,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 3, pp. 3576–3584, 2022.
K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9729–9738, 2020.
A. van den Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
Y. Xie and J. Qu, “A study on bilingual deep learning PIS neural network model based on graph text modal fusion,” ECTI Transactions on Computer and Information Technology, vol. 19, no. 1, pp. 13–24, 2025.
F. Zheng and J. Qu, “TIDCB: Text image dangerous-scene convolutional baseline,” ECTI Transactions on Computer and Information Technology, vol. 18, no. 3, pp. 45–54, 2024.
Y. Xie and J. Qu, “A study on Chinese language cross-modal pedestrian image information retrieval,” Songklanakarin Journal of Science and Technology, vol. 46, no. 5, pp. 466–475, 2024.
Z. Liu et al., “Traffic sign captioning: Generating contextual descriptions for autonomous driving,” IEEE Transactions on Intelligent Vehicles, vol. 8, no. 2, pp. 1567–1581, 2023.
Y. Wang et al., “AutoLabel: Weakly supervised learning for efficient traffic sign annotation,” Pattern Recognition, vol. 135, pp. 109–123, 2023.
X. Chen et al., “Fine-grained traffic sign recognition with attribute-guided attention,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10245–10254, 2022.
R. Zhang et al., “Contrastive learning for cross-modal traffic sign retrieval,” Engineering Applications of Artificial Intelligence, vol. 129, pp. 107–120, 2024.
N. Wang and J. Qu, “Explainable image captioning for autonomous driving: A traffic sign recognition task,” in Proceedings of the IEEE International Conference on Business and Industrial Research (ICBIR), to appear in IEEE Xplore, 2025.
Z. Zhao et al., “Masking-based cross-modal remote sensing image–text retrieval via dynamic contrastive learning,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–15, 2024.