Hierarchical text classification using Relative Inverse Document Frequency

Boonthida Chiraratanasopha
Thanaruk Theeramunkong
Salin Boonbrahm

Abstract

Automatic hierarchical text classification is a challenging and much-needed task, as hierarchical taxonomies multiply with the growth of knowledge organization. A hierarchical structure captures the dependence relationships between categories, in which general and specific concepts may overlap within the tree. This paper uses the frequency with which a term occurs in related categories of the hierarchical tree to aid document classification. Four extended term-weighting factors of Relative Inverse Document Frequency (IDFr), computed with respect to a term's own category, its parent category, its sibling categories, and its child categories, are used to build a classifier model with a centroid-based technique. In experiments on hierarchical text classification of Thai documents, IDFr achieved its best accuracy and F-measure, 53.65% and 50.80%, with the Top-n feature set under family-based evaluation; these are 2.35% and 1.15% higher, respectively, than TF-IDF in the same settings.
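
As a rough illustration of the centroid-based technique mentioned in the abstract, the following Python sketch builds one centroid per category from term-weighted document vectors and assigns a new document to the category with the most similar centroid. It is only a sketch under stated assumptions: the abstract does not give the IDFr formulas, so standard IDF, log(N / df), is used as a stand-in, and the classifier is flat over category labels rather than hierarchy-aware. All function names (`idf`, `vectorize`, `train_centroids`, `classify`) are hypothetical and not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def idf(docs):
    """Standard IDF, log(N / df), over tokenized documents (a stand-in for IDFr)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}


def normalize(vec):
    """Scale a sparse vector (dict) to unit L2 length; return {} for a zero vector."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def vectorize(doc, idf_weights):
    """TF x IDF vector for one document; terms unseen in training are dropped."""
    tf = Counter(t for t in doc if t in idf_weights)
    return normalize({t: count * idf_weights[t] for t, count in tf.items()})


def train_centroids(docs, labels):
    """Sum the unit-length document vectors of each category, then normalize."""
    idf_weights = idf(docs)
    sums = defaultdict(lambda: defaultdict(float))
    for doc, label in zip(docs, labels):
        total = sums[label]  # touch the label so every category gets a centroid
        for term, weight in vectorize(doc, idf_weights).items():
            total[term] += weight
    centroids = {label: normalize(vec) for label, vec in sums.items()}
    return idf_weights, centroids


def classify(doc, idf_weights, centroids):
    """Return the category whose centroid has the highest cosine similarity."""
    vec = vectorize(doc, idf_weights)

    def cosine(centroid):
        # Both vectors are unit length, so the dot product is the cosine.
        return sum(w * centroid.get(t, 0.0) for t, w in vec.items())

    return max(centroids, key=lambda label: cosine(centroids[label]))
```

To move from this stand-in toward the approach the abstract describes, the `idf` function would be replaced by weights that also account for a term's frequency in the parent, sibling, and child categories; the exact form of those factors is defined in the paper, not here.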

Article Details

How to Cite
B. Chiraratanasopha, T. Theeramunkong, and S. Boonbrahm, “Hierarchical text classification using Relative Inverse Document Frequency,” ECTI-CIT, vol. 15, no. 2, pp. 166–176, Apr. 2021.
Section
Research Article
