HalluCVE: A Multi-Signal Benchmark for Hallucination Detection in LLM-Generated Cyber Threat Intelligence

Thuan Dao Duy

doi:10.55003/ETH.430206

Authors

Thuan Dao Duy, University of Information Technology, Vietnam National University Ho Chi Minh City, Vietnam

Keywords:

Cyber Threat Intelligence, Hallucination Detection, Large Language Models, Benchmarking

Abstract

Large Language Models (LLMs) are increasingly used for automated Cyber Threat Intelligence (CTI) tasks such as vulnerability analysis and security advisory generation. However, LLMs are prone to hallucination, the generation of plausible yet factually incorrect content, which poses significant risks in security-critical contexts. Despite growing concern, no dedicated benchmark currently exists for systematically evaluating hallucination in LLM-generated CTI. This study introduces HalluCVE, a multi-signal benchmark designed to detect hallucinations in LLM-generated Common Vulnerabilities and Exposures (CVE) content. HalluCVE incorporates four complementary detection components: 1) Natural Language Inference (NLI)-based entailment scoring, 2) lexical factual alignment, 3) LLM-as-a-Judge self-reflection, and 4) cross-model consensus divergence. Five state-of-the-art LLMs are evaluated on a dataset of 1,000 CVE entries spanning 2022 to 2026, covering both known (pre-training-cutoff) and unknown (post-cutoff) vulnerabilities. The results indicate pervasive hallucination across all models, with mean Hallucination Index values ranging from 0.480 to 0.820. Notably, models confabulate almost universally, in up to 100% of cases, when queried about post-cutoff vulnerabilities, and they frequently respond with high confidence rather than refusing appropriately. HalluCVE establishes a rigorous evaluation framework for assessing LLM reliability in security-sensitive CTI applications and provides insights into potential mitigation strategies.
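As a rough illustration of how several detection signals might be combined into a single score, the Python sketch below computes a simple lexical alignment measure and folds four per-signal scores into one index. The abstract does not specify how the Hallucination Index is computed, so everything here is an assumption made for illustration: the names `Signals`, `lexical_alignment`, and `hallucination_index` are hypothetical, the token-overlap measure is only one possible reading of "lexical factual alignment", and the equal-weight average is a placeholder rather than the benchmark's actual formula.

```python
import re
from dataclasses import dataclass


@dataclass
class Signals:
    """Hypothetical per-response scores, each in [0, 1], where 1.0 means 'faithful'."""
    nli_entailment: float     # how strongly the reference entails the response
    lexical_alignment: float  # share of reference tokens found in the response
    judge_score: float        # faithfulness rating from an LLM judge
    consensus: float          # agreement with other models' responses


def _tokens(text: str) -> set[str]:
    # Lowercase alphanumeric tokens; "CVE-2024-1234" becomes {"cve", "2024", "1234"}.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def lexical_alignment(generated: str, reference: str) -> float:
    """Fraction of distinct reference tokens that also appear in the generated text."""
    ref = _tokens(reference)
    if not ref:
        return 0.0
    return len(ref & _tokens(generated)) / len(ref)


def hallucination_index(s: Signals) -> float:
    """Assumed aggregation: 1 minus the equal-weight mean of the four signals."""
    parts = [s.nli_entailment, s.lexical_alignment, s.judge_score, s.consensus]
    if any(not 0.0 <= p <= 1.0 for p in parts):
        raise ValueError("all signals must lie in [0, 1]")
    return 1.0 - sum(parts) / len(parts)
```

Under these assumptions, an index near 0 means all four signals agree that the response is grounded in the reference CVE entry, while an index near 1 means all four flag it as unsupported. A real implementation would likely weight the signals differently and calibrate them against labeled data.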
