Information Extraction Tasks based on BERT and SpaCy on Tourism Domain
Abstract
In this paper, we present two methodologies for extracting specific information from the full text returned by a search engine, to assist users. The approaches cover three tasks: named entity recognition (NER), text classification, and text summarization. The first step is building and cleansing the training data. We consider the tourism domain, using data sets on restaurants, hotels, shopping, and tourism crawled from websites. First, the tourism data are gathered and the vocabularies are built. Several minor steps follow, including sentence extraction and relation and named entity extraction for tagging purposes. These steps are needed to create proper training data. Then, a recognition model for a given entity type can be built. In the experiments, given review texts, we demonstrate how to build models that extract the desired entities (i.e., name, location, and facility) as well as relation types, classify the reviews, or summarize the reviews. Two tools, SpaCy and BERT, are used, and their performance on these tasks is compared.
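The training-data preparation outlined in the abstract (vocabulary building, sentence extraction, and entity tagging) might look roughly like the following minimal sketch. It is not the authors' pipeline: the vocabulary, the labels, the sample review, and the helper names (`split_sentences`, `tag_entities`, `build_training_data`) are all hypothetical, and simple dictionary matching stands in for whatever tagging procedure the paper actually uses. The output uses the `(text, {"entities": [(start, end, label)]})` character-offset layout that spaCy has accepted for NER training examples.

```python
import re

# Hypothetical vocabulary (gazetteer); terms and labels are illustrative only.
VOCAB = {
    "NAME": ["Grand Palace Hotel"],
    "LOCATION": ["Bangkok"],
    "FACILITY": ["swimming pool", "free Wi-Fi"],
}


def split_sentences(text):
    """Naive sentence extraction: split after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tag_entities(sentence, vocab=VOCAB):
    """Tag vocabulary terms in a sentence with character offsets and labels."""
    spans = []
    for label, terms in vocab.items():
        for term in terms:
            for match in re.finditer(re.escape(term), sentence, flags=re.IGNORECASE):
                spans.append((match.start(), match.end(), label))
    spans.sort()  # order by start offset
    return sentence, {"entities": spans}


def build_training_data(text, vocab=VOCAB):
    """Keep only the sentences that contain at least one tagged entity."""
    tagged = [tag_entities(sentence, vocab) for sentence in split_sentences(text)]
    return [item for item in tagged if item[1]["entities"]]


# Made-up review text, used only to exercise the functions above.
review = (
    "Grand Palace Hotel is in Bangkok. "
    "It has a swimming pool and free Wi-Fi. "
    "We enjoyed our stay!"
)
TRAIN_DATA = build_training_data(review)

for text, annotation in TRAIN_DATA:
    print(text, annotation)
```

Plain dictionary matching can yield overlapping spans when one vocabulary term contains another; a real pipeline would need to resolve such overlaps before training an NER model.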