Transforming Unstructured Data in IT Project: A Comparative Study of Zero-Shot and Generative AI Text Classification

Main Article Content

Cai Tung-lersloy
Worapat Paireekreng
Nantika Prinyapol

Abstract

Organizations today generate large volumes of unstructured data from sources such as comments, interviews, and images. This is especially true of IT projects, which often produce more information than can be managed manually. This study examines how such unstructured data can be transformed into quantitative measures and actionable insights using two text classification approaches: Zero-Shot Text Classification, which assigns categories without requiring labeled training examples, and Generative AI Text Classification, which uses a generative model to produce and interpret categories for the data. We asked 42 participants with experience in working with unstructured data to answer a set of questions, then applied both methods to analyze their responses. We found that Zero-Shot classification performs better on information with clear patterns, while Generative AI handles complex or ambiguous information more effectively: Zero-Shot was about 15% more accurate on well-structured information, whereas Generative AI was about 20% better on complex, messy data. These results show that choosing the right method makes a substantial difference in how well unstructured data can be understood and used. This research helps companies and researchers select the most suitable approach for making sense of their data, especially in IT projects where the volume of information exceeds what can be handled manually.
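The core idea behind the zero-shot approach described above is that a category can be assigned using only a textual description of each label, with no labeled training examples. The following stdlib-only Python sketch illustrates that workflow with a naive bag-of-words similarity; it is a toy illustration only, not the NLI-based zero-shot models or generative classifiers the study evaluates, and all label names and example texts here are hypothetical.

```python
# Toy illustration of zero-shot classification: pick the label whose
# description is most similar to the input text. No training data is used.
# Label names and descriptions below are hypothetical examples.
import math
from collections import Counter

LABEL_DESCRIPTIONS = {
    "requirements": "requirement scope feature user need specification",
    "infrastructure": "server network deployment hardware cloud outage",
    "communication": "meeting email stakeholder feedback discussion",
}

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def zero_shot_label(text: str) -> str:
    """Assign the label whose description best matches the text."""
    words = Counter(text.lower().split())
    return max(
        LABEL_DESCRIPTIONS,
        key=lambda lab: cosine(words, Counter(LABEL_DESCRIPTIONS[lab].split())),
    )

print(zero_shot_label("the server outage delayed our cloud deployment"))
```

In practice, production zero-shot systems replace the word-overlap similarity with a pretrained natural language inference model, but the interface stays the same: free text in, label out, with no task-specific training set.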

Article Details

Section
Research Article

References

H.-J. Kong, “Managing unstructured big data in healthcare system,” Healthc. Inform. Res., vol. 25, no. 1, pp. 1–2, 2019.

H. Abburi, M. Suesserman, N. Pudota, B. Veeramani, E. Bowen, and S. Bhattacharya, “Generative AI text classification using ensemble LLM approaches,” in Proc. Iberian Lang. Eval. Forum, Jaén, Spain, Sep. 2023. [Online]. Available: https://ceur-ws.org/Vol-3496/autextification-paper14.pdf

C. Han, H. Pei, X. Du, and H. Ji, “Zero-shot classification by logical reasoning on natural language explanations,” in Proc. 61st Annu. Meeting Assoc. Comput. Linguistics, Toronto, Canada, Jul. 2023, pp. 8967–8981.

J. Zhang, P. Lertvittayakumjorn, and Y. Guo, “Integrating semantic knowledge to tackle zero-shot text classification,” in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Human Lang. Technologies, Minneapolis, MN, USA, Jun. 2019, pp. 1031–1040.

Forbes Thailand. “Unstructured Data: A treasure trove waiting to be unlocked.” (in Thai), FORBESTHAILAND.com. https://forbesthailand.com/commentaries/Insights/unstructured-data-ขุมทรัพย์ที่รอการปลด (accessed Aug. 1, 2024).

Q. Ye et al., “Zero-shot text classification via reinforced self-training,” in Proc. 58th Annu. Meeting Assoc. Comput. Linguistics, Seattle, WA, USA, Jul. 2020, pp. 3014–3024.

W. Yin, J. Hay, and D. Roth, “Benchmarking zero-shot text classification: Datasets, evaluation and entailment approach,” in Proc. Conf. Empirical Methods in Natural Lang. Process. and 9th Int. Joint Conf. Natural Lang. Process. (EMNLP-IJCNLP), Hong Kong, China, Nov. 2019, pp. 3914–3923.

T. Brown et al., “Language models are few-shot learners,” in Proc. 34th Conf. Neural Inf. Process. Syst. (NeurIPS), Vancouver, Canada, Dec. 2020, pp. 1–25.

K. J. Srnka and S. T. Koeszegi, “From words to numbers: How to transform qualitative data into meaningful quantitative results,” Schmalenbach Bus. Rev., vol. 59, pp. 29–57, 2007.

H. Abi Akl, “A ML-LLM pairing for better code comment classification,” in Proc. 15th Meeting Forum Inf. Retrieval Eval. (FIRE), Panjim, India, Dec. 2023.

T.-l. Chasupa and W. Paireekreng, “The framework of extracting unstructured usage for big data platform,” in Proc. 2nd Int. Conf. Big Data Analytics and Pract. (IBDAP), Bangkok, Thailand, Aug. 2021, pp. 90–94.

Talance. “Let's take a look! What are the responsibilities of IT positions?” (in Thai), TALANCE.tech. https://www.talance.tech/blog/it-job-responsibility/ (accessed Aug. 1, 2024).

G. Jaimovitch-López, C. Ferri, J. Hernández Orallo, F. Martínez-Plumed, and M. J. Ramírez-Quintana, “Can language models automate data wrangling?,” Mach. Learn., vol. 112, pp. 2053–2082, 2023.

H. Ahmad and H. Halim, “Determining sample size for research activities: The case of organizational research,” Selangor Bus. Rev., vol. 2, no. 1, pp. 20–34, 2017.

R. Anantha, T. Bethi, D. Vodianik, and S. Chappidi, “Context tuning for retrieval augmented generation,” in Proc. 1st Workshop on Uncertainty-Aware NLP (UncertaiNLP), St. Julian’s, Malta, Mar. 2024, pp. 15–22.