Enhancing the performance of association rule models by filtering instances in colorectal cancer patients
Abstract
This paper analyzes colorectal cancer data from the SEER program with the aim of using filtering techniques to improve the performance of association rule models. We propose improving the quality of the dataset by removing its outliers with the Hidden Naïve Bayes (HNB), Naïve Bayes Tree (NBTree), and Reduced Error Pruning Decision Tree (REPTree) algorithms. The Apriori and HotSpot algorithms are then applied to mine association rules between the 13 selected attributes and average survival. Experimental results show that the HNB algorithm can improve the accuracy of the Apriori algorithm up to 100% and the support threshold up to 45%. It can also improve the accuracy of the HotSpot algorithm up to 93.38% and the support threshold up to 80%. The HotSpot rules with a minimum support of 80% are therefore selected for explanation. These rules show that colorectal cancer patients who died from colon cancer and did not receive radiation therapy were associated with survival of less than 22 months. Our study shows that filtering in the preprocessing stage is a useful way to enhance the quality of a dataset. This finding could help researchers build models with better predictive performance. Although the analysis is heuristic, it can be very useful for identifying the factors that affect survival. It can also help medical practitioners explain to patients the risks involved in a particular treatment procedure.
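The filtering idea can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the paper: a plain categorical Naïve Bayes classifier stands in for HNB, an instance is treated as an outlier when the classifier trained on the full dataset misclassifies it, and the two attributes and ten records are invented toy data, not SEER records. The sketch only shows how removing such instances can change the support and confidence of a rule.

```python
from collections import Counter, defaultdict
import math

def train_naive_bayes(rows, labels):
    """Fit a categorical Naive Bayes model (counts only; smoothing is applied at prediction)."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value frequencies
    domains = defaultdict(set)           # attribute index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            domains[i].add(value)
    return class_counts, value_counts, domains

def predict(model, row):
    """Return the class with the highest Laplace-smoothed log posterior."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)
        for i, value in enumerate(row):
            numerator = value_counts[(i, label)][value] + 1
            denominator = count + len(domains[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def filter_instances(rows, labels):
    """Keep only the instances that the classifier labels correctly (assumed outlier criterion)."""
    model = train_naive_bayes(rows, labels)
    kept = [(r, y) for r, y in zip(rows, labels) if predict(model, r) == y]
    return [r for r, _ in kept], [y for _, y in kept]

def rule_stats(rows, labels, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent.

    antecedent maps an attribute index to the value it must take.
    """
    matched = [y for r, y in zip(rows, labels)
               if all(r[i] == v for i, v in antecedent.items())]
    hits = sum(1 for y in matched if y == consequent)
    support = hits / len(rows)
    confidence = hits / len(matched) if matched else 0.0
    return support, confidence

# Hypothetical toy data: (cause of death, radiation therapy) -> survival class.
rows = [("colon", "no")] * 5 + [("other", "yes")] * 5
labels = ["short"] * 4 + ["long"] + ["long"] * 4 + ["short"]

rule = {0: "colon", 1: "no"}  # died from colon cancer, no radiation therapy
before = rule_stats(rows, labels, rule, "short")
clean_rows, clean_labels = filter_instances(rows, labels)
after = rule_stats(clean_rows, clean_labels, rule, "short")

print(before)  # (0.4, 0.8)
print(after)   # (0.5, 1.0)
```

In this toy case the filter drops the two records that contradict the majority pattern, so the rule's confidence rises from 0.8 to 1.0 and its support from 0.4 to 0.5. Results on real data depend on the classifier and on how outliers are defined.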
Article Details
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.