BERTopic Analysis of Indonesian Biodiversity Policy on Social Media
Abstract
Indonesia, known for its rich biodiversity, faces critical challenges such as habitat degradation and species loss. This study examines public opinion on Indonesian government biodiversity policies by analyzing text data from the X social media platform. Using BERTopic, an advanced topic modeling technique, we uncover nuanced biodiversity-related topics within tweets. Our contribution is to explore diverse combinations of BERTopic parameters on Indonesian text and to assess their efficacy through coherence values and manual content evaluation. Our findings identify the best-performing combination of sentence embedding, cluster model, and dimension reduction parameters, with Model 5 achieving the highest coherence score of 0.7733. We also describe the impact of outlier reduction techniques when applying BERTopic in an Indonesian context. The study serves as a foundational model for categorizing Indonesian-language topics using BERTopic and shows the significance of tailored text processing techniques. We further find that while standard preprocessing methods enhance clustering outcomes, certain dataset characteristics, such as the inclusion of hashtags and mentions, can influence coherence differently across models. This work provides insights into public perceptions of biodiversity policies and offers methodological guidance for text analysis in similar contexts.
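To make the idea of comparing models by coherence concrete, the following is a minimal sketch of one common coherence measure, normalized pointwise mutual information (NPMI), computed from document-level co-occurrence. The abstract does not state which coherence measure produced the reported scores, so NPMI is an assumption here, and the toy documents and topic word lists are invented purely for illustration; they are not taken from the study's dataset.

```python
import math
from itertools import combinations


def npmi_coherence(topic_words, documents):
    """Average NPMI over all word pairs of a topic.

    Probabilities are estimated from document-level co-occurrence.
    This is an illustrative measure, not necessarily the one used in the study.
    """
    if len(topic_words) < 2:
        raise ValueError("need at least two topic words")
    doc_sets = [set(doc.lower().split()) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain every word in `words`.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # the pair never co-occurs: minimum NPMI
        elif p12 == 1:
            scores.append(1.0)   # the pair appears in every document
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)


# Hypothetical toy corpus (already tokenized by whitespace).
toy_docs = [
    "hutan satwa konservasi",
    "hutan satwa lindung",
    "konservasi hutan satwa",
    "harga pasar ekonomi",
    "ekonomi pasar saham",
]

# Hypothetical top words from two candidate models.
coherent_topic = ["hutan", "satwa"]           # words that always co-occur
mixed_topic = ["hutan", "pasar", "ekonomi"]   # words drawn from two themes

coherent_score = npmi_coherence(coherent_topic, toy_docs)
mixed_score = npmi_coherence(mixed_topic, toy_docs)
print(coherent_score > mixed_score)
```

In this sketch, a topic whose top words tend to appear in the same documents scores higher than one that mixes unrelated themes, which is the intuition behind ranking parameter combinations by coherence and then checking the winners by manual content evaluation.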
Article Details
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.