Application of Generative Artificial Intelligence in Data Cleaning and Preparation: A Case Study of Recycled Polypropylene Composite Mixed with Tea Residue

Authors

  • Anuchit Khongrit, Faculty of Engineering, Vongchavalitkul University
  • Cheevin Limsiri, Faculty of Engineering, Vongchavalitkul University
  • Sureeporn Meehom, Faculty of Engineering, Vongchavalitkul University

Keywords:

Generative Artificial Intelligence, Data Cleaning, Data Preparation

Abstract

     Objectives: This study explores the capabilities of three online generative artificial intelligence (AI) systems (Claude.ai, ChatGPT 3.5, and Gemini) for data cleaning and data preparation. The dataset consists of mechanical property test data for a recycled polypropylene composite mixed with tea residue.

     Method: The study begins by writing conversational prompts that ask each generative AI system for the principles and methods of data cleaning and data preparation. The dataset is then cleaned and prepared by following the steps the systems recommend, and the results from each AI platform are compared and discussed.

     Results: All three generative AI systems provided useful information on the data cleaning and data preparation processes, and that information was accurate and consistent with the data science literature. However, the AI platforms could not perform the actual data cleaning themselves because of system limitations. Consequently, Python libraries were used to carry out the data cleaning, with the generative AI systems taking on a persona-based role to generate the Python scripts for data cleaning and preparation. The cleaned and prepared dataset could be used in data analysis programs, statistical analysis, and online platforms without issues or errors caused by incomplete or unclean data.
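
     The abstract does not reproduce the generated scripts or name the Python libraries used, so the following is only a minimal sketch of what such a cleaning script might look like, assuming pandas. The column names (sample_id, tea_residue_wt_pct, tensile_strength_mpa), the sample values, and the function name clean_mechanical_data are all hypothetical and are not taken from the study's dataset. The sketch removes duplicate rows, converts text entries to numbers, flags outliers with the 1.5 × IQR rule, and fills missing values with the mean of the non-outlier values.

```python
import pandas as pd


def clean_mechanical_data(df: pd.DataFrame, id_col: str = "sample_id") -> pd.DataFrame:
    """Return a cleaned copy of df (hypothetical helper, not the authors' script)."""
    # Drop exact duplicate rows and renumber the index.
    cleaned = df.drop_duplicates().reset_index(drop=True)

    # Treat every column except the identifier as a numeric measurement.
    numeric_cols = [c for c in cleaned.columns if c != id_col]

    for col in numeric_cols:
        # Coerce text such as "n/a" to NaN so the column becomes numeric.
        values = pd.to_numeric(cleaned[col], errors="coerce")

        # Flag outliers with the 1.5 * IQR rule (NaN is never flagged).
        q1 = values.quantile(0.25)
        q3 = values.quantile(0.75)
        iqr = q3 - q1
        is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

        # Impute missing values with the mean of the non-outlier values,
        # so that an extreme reading does not distort the fill value.
        fill_value = values[~is_outlier].mean()
        cleaned[col] = values.fillna(fill_value)
        cleaned[f"{col}_outlier"] = is_outlier

    return cleaned


# Invented example data: one duplicated row, one missing value, one extreme value.
raw = pd.DataFrame(
    {
        "sample_id": ["S1", "S2", "S3", "S3", "S4", "S5", "S6"],
        "tea_residue_wt_pct": [0, 5, 10, 10, 15, 20, 25],
        "tensile_strength_mpa": ["24.1", "23.5", "n/a", "n/a", "22.8", "95.0", "22.1"],
    }
)

result = clean_mechanical_data(raw)
print(result)
```

     Flagging outliers instead of deleting them keeps the decision about suspect measurements with the analyst; whether mean imputation is appropriate depends on the dataset and is an assumption of this sketch.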

Published

2024-06-28

How to Cite

Khongrit, A., Limsiri, C., & Meehom, S. (2024). Application of Generative Artificial Intelligence in Data Cleaning and Preparation: A Case Study of Recycled Polypropylene Composite Mixed with Tea Residue. Journal of Vongchavalitkul University, 37(1), 112–140. Retrieved from https://ph01.tci-thaijo.org/index.php/vujournal/article/view/257269