Enabling Efficient Personally Identifiable Information Detection with Automatic Consent Discovery

Somchart Fugkeaw
Pattavee Sanchol

Abstract

Preventing personal data leakage has become a critical issue for data management and sharing in many industries. Several data privacy regulations, such as the General Data Protection Regulation (GDPR), the Health Insurance Portability and Accountability Act (HIPAA), the California Consumer Privacy Act (CCPA), and Thailand's Personal Data Protection Act (PDPA), require organizations to collect, process, and transfer personally identifiable information (PII) securely. In this paper, we propose the design and development of PII RapidDiscover, an efficient Thai and English PII discovery system featuring automatic consent discovery. At the core of the proposed system is a PII scanning algorithm based on the Presidio library and a natural language processing (NLP) technique that improves the scan results for PII written in Thai and English. Finally, we conducted experiments to demonstrate the efficiency of the proposed system.
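
To illustrate the kind of rule-based scanning the abstract refers to, the following is a minimal sketch and not the PII RapidDiscover algorithm itself. It uses only the Python standard library rather than Presidio, and every name in it (Finding, scan, thai_id_checksum_ok, the entity labels, and the two patterns) is a hypothetical choice made for illustration. It assumes that a candidate Thai national ID is a run of 13 ASCII digits whose last digit is a mod-11 check digit, and that a simple regular expression is sufficient to spot e-mail addresses in mixed Thai and English text.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Finding:
    """One detected PII span (hypothetical structure for this sketch)."""
    entity_type: str
    start: int
    end: int
    text: str


# Illustrative patterns only; a production scanner would need broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
THAI_ID_RE = re.compile(r"(?<![0-9])[0-9]{13}(?![0-9])")


def thai_id_checksum_ok(digits: str) -> bool:
    """Check the 13th digit of a Thai national ID using the mod-11 rule."""
    # Weights run from 13 down to 2 over the first twelve digits.
    total = sum(int(d) * (13 - i) for i, d in enumerate(digits[:12]))
    return (11 - total % 11) % 10 == int(digits[12])


def scan(text: str) -> List[Finding]:
    """Return e-mail and Thai national ID findings ordered by position."""
    findings: List[Finding] = []
    for m in EMAIL_RE.finditer(text):
        findings.append(Finding("EMAIL_ADDRESS", m.start(), m.end(), m.group()))
    for m in THAI_ID_RE.finditer(text):
        # Keep only 13-digit runs that pass the checksum, to cut false positives.
        if thai_id_checksum_ok(m.group()):
            findings.append(Finding("THAI_NATIONAL_ID", m.start(), m.end(), m.group()))
    return sorted(findings, key=lambda f: f.start)


if __name__ == "__main__":
    sample = "ติดต่อ somchai@example.com เลขบัตร 1234567890121"
    for finding in scan(sample):
        print(finding)
```

A pattern-plus-checksum recognizer of this kind covers only structured identifiers. Names, addresses, and other free-text PII in Thai or English would require a named entity recognition model, which this sketch does not attempt.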

Article Details

How to Cite
S. Fugkeaw and P. Sanchol, “Enabling Efficient Personally Identifiable Information Detection with Automatic Consent Discovery”, ECTI-CIT Transactions, vol. 17, no. 2, pp. 245–254, Jun. 2023.
Section
Research Article
