Comparative Evaluation of Log Reduction Techniques Using Vector on Public Security Datasets
Abstract
Efficient log reduction is critical for Security Operations Centers (SOCs) and Managed Security Service Providers (MSSPs), which must store, analyze, and retain massive volumes of event data while satisfying compliance requirements and controlling operational costs. Traditional pipelines often retain redundant or low-value records, leading to excessive storage overhead and slower analytics. This study evaluates five Vector-based log-reduction methods: filter-based selection, field pruning, event sampling, template hashing, and a combined pruning + sampling profile. The evaluation uses more than 3 million log records from two well-known public intrusion datasets, CIC-IDS2017 and UNSW-NB15, to measure efficiency, throughput, and attack coverage under the same experimental setup. Compared with a baseline Filebeat pipeline, the proposed Vector-based approach improved throughput by 45%, reduced outbound traffic by 80%, and maintained 98% attack coverage. The results show that a substantial proportion of raw logs is redundant and can be trimmed without compromising essential evidence or analytic clarity. Template hashing preserved fidelity with moderate CPU cost; although it required slightly more processing than filtering or pruning, it still consumed fewer resources than the baseline. We repeated each test three times to ensure consistent results and validated the findings through ClickHouse queries at the sink layer. We also release the scripts and benchmark data to support reproduction and extension. Overall, the benchmark demonstrates how log-reduction design can improve operational efficiency while preserving analytic fidelity.
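The four basic reduction ideas named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' Vector configuration or their released scripts. The field names (`level`, `src_ip`, `message`), the keep-list, the sampling rate, and the masking regular expression are all hypothetical choices made only to show the general shape of each technique.

```python
import hashlib
import re

# Hypothetical keep-list; a real profile would be derived from the dataset schema.
KEEP_FIELDS = {"timestamp", "src_ip", "dst_ip", "label", "message"}

# Masks integers and dotted numeric tokens such as IPv4 addresses (illustrative only).
_VARIABLE = re.compile(r"\b\d+(?:\.\d+)*\b")


def keep_event(event):
    """Filter-based selection: drop events marked as debug-level noise."""
    return event.get("level", "").lower() != "debug"


def prune_fields(event, keep=KEEP_FIELDS):
    """Field pruning: retain only the fields on the keep-list."""
    return {k: v for k, v in event.items() if k in keep}


def sample_event(event, rate=10, key="src_ip"):
    """Event sampling: deterministic 1-in-`rate` selection keyed on a field hash."""
    digest = hashlib.sha256(str(event.get(key, "")).encode()).digest()
    return int.from_bytes(digest[:8], "big") % rate == 0


def template_hash(message):
    """Template hashing: mask variable tokens, then hash the remaining template."""
    template = _VARIABLE.sub("<*>", message)
    return hashlib.sha1(template.encode()).hexdigest()[:12]


def collapse_by_template(events):
    """Keep one representative event per template and count the repeats."""
    seen = {}
    for event in events:
        h = template_hash(event["message"])
        if h in seen:
            seen[h]["count"] += 1
        else:
            seen[h] = {**event, "template": h, "count": 1}
    return list(seen.values())


def combined_profile(events, rate=10):
    """Combined profile: prune fields first, then sample the pruned events."""
    pruned = (prune_fields(e) for e in events)
    return [e for e in pruned if sample_event(e, rate=rate)]
```

In this sketch, two messages that differ only in numeric tokens (for example, a source address and a port) map to the same template hash, so `collapse_by_template` emits a single record with a repeat count. Hash-keyed sampling is used here instead of random sampling so that repeated runs select the same events, which makes results easier to compare across trials.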
Article Details

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.