An Integration Model on Data Lake for Solving Structured Data Silo Problems
Abstract
Data silos are a major data management challenge in both public and business organizations. Because work functions are divided among numerous departments, each department has distinct responsibilities and tends to build its own software, applications, and data systems to support its local needs. As a result, similar data are stored in multiple silos or databases, and their schemas, such as field names and meanings, usually differ from one another. Users are then unsure how to use data that come from different application silos. This research applies the data lake concept to the data silo problem, with a scope limited to structured data silos. Its objective is to design a data lake architecture and an internal working framework that uses Hive and Spark to integrate data within the data lake, and to write functional test programs in Java on Spark. In tests based on the detailed framework, integrating data silos on the data lake reduced data heterogeneity and data inconsistency by 100% and reduced the redundancy of the test data by 78.6% across a total of 13 separate data cases.
Article Details
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.