Boosting Multi-Object Tracking Performance with Advanced Attention Mechanism based on Transformer
Abstract
The Transformer architecture has been highly successful across natural language processing tasks and has recently gained popularity in computer vision applications such as medical image analysis, traffic-light monitoring, surveillance, and object tracking. Its strength comes from the self-attention operation, which enables global interactions between image patches and yields powerful modeling capabilities. However, self-attention's time and memory costs grow quadratically with the number of patches, which makes high-resolution inputs expensive to process. To address this issue, we propose an advanced attention mechanism based on the Transformer architecture for multi-object tracking, consisting of a Transposed Self-Attention (TSA) module and a Cross Patch Interaction (CPI) module. The TSA module replaces conventional self-attention with a transposed variant that attends across feature channels rather than image patches, giving it linear computational complexity in the number of patches and thus lower computational cost. Since channel-wise attention alone provides no explicit spatial mixing, the CPI module follows the TSA module to restore communication across patches, improving the model's learning efficiency. The resulting approach processes high-resolution images efficiently and achieves state-of-the-art multi-object tracking performance, with a MOTA of 72.8% on MOT17.
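The abstract above is the only technical description available here, so the following is a minimal sketch of the two modules, not the authors' implementation. It assumes the TSA module behaves like XCiT-style cross-covariance attention (a C x C attention map over feature channels, hence linear cost in the number of patches N) and realises the CPI module as depthwise convolutions over the patch grid; the class names, the learnable temperature, and all hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransposedSelfAttention(nn.Module):
    """Channel-wise ("transposed") self-attention sketch.

    The attention map is C x C over feature channels instead of N x N over
    patches, so time and memory scale linearly with the number of patches N.
    """
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        # Learnable temperature, as in cross-covariance attention (assumed here).
        self.temperature = nn.Parameter(torch.ones(num_heads, 1, 1))
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                        # x: (B, N, C)
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 4, 1)     # each: (B, heads, c, N)
        q = F.normalize(q, dim=-1)               # L2-normalise along the patch axis
        k = F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.temperature    # (B, heads, c, c)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).permute(0, 3, 1, 2).reshape(B, N, C)  # back to (B, N, C)
        return self.proj(out)

class CrossPatchInteraction(nn.Module):
    """Explicit patch-to-patch communication; depthwise 3x3 convs are an assumed design."""
    def __init__(self, dim):
        super().__init__()
        self.conv1 = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.act = nn.GELU()
        self.conv2 = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x, h, w):                  # x: (B, N, C) with N == h * w
        B, N, C = x.shape
        x = x.transpose(1, 2).reshape(B, C, h, w)   # restore the 2-D patch grid
        x = self.conv2(self.act(self.conv1(x)))
        return x.reshape(B, C, N).transpose(1, 2)

# Shape check: 4,096 patches (a 64 x 64 grid) with 256 channels.
tsa = TransposedSelfAttention(dim=256)
cpi = CrossPatchInteraction(dim=256)
tokens = torch.randn(2, 64 * 64, 256)
out = cpi(tsa(tokens), 64, 64)                   # (2, 4096, 256)
```

Because the attention map here is C x C, doubling the image resolution quadruples N but only linearly increases the attention cost, which is what allows efficient processing of high-resolution images; the convolutional CPI block then supplies the spatial mixing that channel-wise attention lacks.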