Exploiting a knowledge base for intelligent decision tree construction to enhance classification power

Sirichanya Chanmee
Kraisak Kesorn

Abstract

Decision trees are a common approach for classifying unseen data into predefined classes. Information gain is usually applied as the splitting criterion when selecting nodes during tree construction. However, this criterion is biased toward attributes with many distinct values, a major limitation that leads to unsatisfactory classification performance. To address this problem, a new decision tree algorithm called the “Knowledge-Based Decision Tree (KDT)” is proposed, which exploits the knowledge in an ontology to assist decision tree construction. The novelty of the study is that an ontology is used to determine attribute importance values by means of the PageRank algorithm. These values are used to modify the information gain so that appropriate attributes are selected as nodes in the decision tree. Four datasets (Soybean, Heart disease, Dengue fever, and COVID-19) were employed to evaluate the proposed approach. The experimental results show that the proposed method outperforms other decision tree algorithms, such as the traditional ID3 and the Mutual Information Decision Tree (MIDT), and also performs better than a non-decision-tree algorithm, for example k-Nearest Neighbors.
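The idea of adjusting information gain with ontology-derived importance values can be illustrated with a minimal Python sketch. It is not the authors' KDT implementation: the abstract does not state how the PageRank values modify the gain, so multiplying the two is an assumption made here. The toy records, the ontology edge list, and the function names (`information_gain`, `pagerank`, `weighted_gain`) are likewise hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attr, target):
    """Information gain of splitting `rows` on `attr` (the ID3 criterion)."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a directed edge list (source, target)."""
    nodes = sorted({n for edge in edges for n in edge})
    out = {n: [t for s, t in edges if s == n] for n in nodes}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            targets = out[n] or nodes  # dangling nodes spread their rank evenly
            share = damping * rank[n] / len(targets)
            for t in targets:
                new[t] += share
        rank = new
    return rank


def weighted_gain(rows, attrs, target, importance):
    """ASSUMPTION: gain is multiplied by importance; the abstract does not say how."""
    return {a: information_gain(rows, a, target) * importance.get(a, 0.0) for a in attrs}


# Hypothetical toy records: `patient_id` is unique per row, `fever` is informative but imperfect.
rows = [
    {"patient_id": "p1", "fever": "yes", "disease": "dengue"},
    {"patient_id": "p2", "fever": "yes", "disease": "dengue"},
    {"patient_id": "p3", "fever": "no", "disease": "healthy"},
    {"patient_id": "p4", "fever": "yes", "disease": "healthy"},
]
attrs = ["patient_id", "fever"]

# Hypothetical ontology links: several disease concepts point to the `fever` concept.
ontology_edges = [
    ("dengue", "fever"),
    ("influenza", "fever"),
    ("covid19", "fever"),
    ("malaria", "fever"),
    ("patient_id", "record"),
]

importance = pagerank(ontology_edges)
plain = {a: information_gain(rows, a, "disease") for a in attrs}
weighted = weighted_gain(rows, attrs, "disease", importance)

print("plain choice:   ", max(plain, key=plain.get))
print("weighted choice:", max(weighted, key=weighted.get))
```

On this toy data, plain information gain favors the unique identifier because every one-row partition is pure, whereas the importance-weighted score favors the attribute that the hypothetical ontology links to most heavily. Other ways of combining the two quantities are equally plausible under the abstract's wording.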

How to Cite
Chanmee, S., & Kesorn, K. (2022). Exploiting a knowledge base for intelligent decision tree construction to enhance classification power. Engineering and Applied Science Research, 49(4), 545–561. Retrieved from https://ph01.tci-thaijo.org/index.php/easr/article/view/246563
Section
ORIGINAL RESEARCH
