Research on Deep Learning-Based Methods for Matching Traffic Sign Images with Textual Captions
Abstract
Intelligent transportation systems face challenges in matching traffic sign images with natural language descriptions, particularly modal heterogeneity and fine-grained semantic alignment. Accurate image-text matching is crucial for understanding the traffic environment and for safety decision-making in autonomous driving, and therefore carries significant application value. Most existing methods rely on classification or template matching and lack deep semantic modelling between images and texts, making them difficult to adapt to complex real-world scenarios. To address this, this paper proposes a deep learning-based image-text matching method that automatically parses directory structures to generate fine-grained labels, and introduces an InfoNCE contrastive loss over in-batch negative samples to achieve cross-modal learning. Pre-trained ResNeXt50_32x4d and DistilBERT models are employed as the image and text encoders, respectively, and their outputs are mapped into a shared embedding space. Experimental results demonstrate that the proposed method outperforms existing methods on Recall@1, mean Average Precision (mAP), and Mean Reciprocal Rank (MRR), showing stronger semantic alignment capability and application potential.
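The InfoNCE objective with in-batch negatives described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function name, batch size, and temperature value are assumptions, and in the paper the two embedding matrices would come from the ResNeXt50_32x4d and DistilBERT encoders projected into the shared space.

```python
import numpy as np

def info_nce(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    Row i of img_emb and row i of txt_emb form the matched (positive) pair;
    every other row in the batch serves as a negative sample.
    """
    # L2-normalize so the dot product is cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (B, B) similarity matrix
    labels = np.arange(logits.shape[0])  # matched pairs lie on the diagonal

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_prob[labels, labels].mean()

    # Average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Minimizing this loss pulls each matched image-text pair together in the shared embedding space while pushing it away from the other pairs in the same batch.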
Article Details

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
References
Y. Li and J. Qu, “Intelligent Road Tracking and Real-time Acceleration-Deceleration for Autonomous Driving Using Modified Convolutional Neural Networks,” Current Applied Science and Technology, vol. 22, no. 1, pp. 1–10, 2022.
R. Hu and A. Singh, “UniT: Multimodal multitask learning with a unified transformer,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 1439–1449, 2021.
L. Yang, P. Luo, C. C. Loy, and X. Tang, “Fine-grained traffic sign recognition with hierarchical attention and localization,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 6, pp. 5967–5979, 2022.
B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning to generate fine-grained image labels with minimal supervision,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 3, pp. 3576–3584, 2022.
K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9729–9738, 2020.
A. van den Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
Y. Xie and J. Qu, “A study on bilingual deep learning PIS neural network model based on graph text modal fusion,” ECTI Transactions on Computer and Information Technology, vol. 19, no. 1, pp. 13–24, 2025.
F. Zheng and J. Qu, “TIDCB: Text image dangerous-scene convolutional baseline,” ECTI Transactions on Computer and Information Technology, vol. 18, no. 3, pp. 45–54, 2024.
Y. Xie and J. Qu, “A study on Chinese language cross-modal pedestrian image information retrieval,” Songklanakarin Journal of Science and Technology, vol. 46, no. 5, pp. 466–475, 2024.
Z. Liu et al., “Traffic sign captioning: Generating contextual descriptions for autonomous driving,” IEEE Transactions on Intelligent Vehicles, vol. 8, no. 2, pp. 1567–1581, 2023.
Y. Wang et al., “AutoLabel: Weakly supervised learning for efficient traffic sign annotation,” Pattern Recognition, vol. 135, pp. 109–123, 2023.
X. Chen et al., “Fine-grained traffic sign recognition with attribute-guided attention,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10245–10254, 2022.
R. Zhang et al., “Contrastive learning for cross-modal traffic sign retrieval,” Engineering Applications of Artificial Intelligence, vol. 129, pp. 107–120, 2024.
N. Wang and J. Qu, “Explainable image captioning for autonomous driving: A traffic sign recognition task,” in Proceedings of the IEEE International Conference on Business and Industrial Research (ICBIR), to appear in IEEE Xplore, 2025.
Z. Zhao et al., “Masking-based cross-modal remote sensing image–text retrieval via dynamic contrastive learning,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–15, 2024.