Comparison of Storages Performance in Big Data Platform
Abstract
The continuous application of information technology across many fields has caused the amount of data in information systems to grow rapidly. Data from each source is stored in scattered external systems before being imported and linked into a big data platform, which is designed to store and process large volumes of data of all types efficiently. The data is then put to use in various ways, such as data analysis, data services and sharing, and reporting, so that executives can use the data and reports to analyze, plan, and drive the organization with data.
However, data imported into the big data platform comes from different external systems and therefore arrives in a variety of storage formats. The formats differ in structure: row-based or column-based layout, binary or text encoding, and whether data compression is supported. Because each format has both advantages and disadvantages, we studied and compared the efficiency of data file formats stored on the big data platform in order to find the most suitable format for each use case. The experimental results indicated that column-based data structures are better suited to importing and storing data taken from external sources, while row-based data structures are suitable for querying or analyzing the data with more complex commands.
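The difference between a row-based and a column-based layout can be illustrated with a minimal sketch. It is not the experimental setup of this study: the sample records, the field names, and the helper functions are all hypothetical, and plain JSON text with gzip stands in for real formats such as Avro, ORC, or Parquet.

```python
import gzip
import json

# Hypothetical sample records (assumption: not taken from the study's dataset).
records = [
    {"id": i, "type": "Movie" if i % 2 else "TV Show", "year": 2000 + i % 20}
    for i in range(1000)
]


def to_row_layout(rows):
    """Row-based: the fields of each record are stored together, one record per line."""
    return "\n".join(json.dumps(r) for r in rows).encode("utf-8")


def to_column_layout(rows):
    """Column-based: all values of one field are stored together."""
    columns = {key: [r[key] for r in rows] for key in rows[0]}
    return json.dumps(columns).encode("utf-8")


def read_column_from_rows(blob, name):
    """Every record must be parsed to extract a single field."""
    return [json.loads(line)[name] for line in blob.decode("utf-8").splitlines()]


def read_column_from_columns(blob, name):
    """The field's values sit in one contiguous list.

    This sketch still parses the whole blob for simplicity; a real columnar
    format can seek directly to the bytes of the requested column.
    """
    return json.loads(blob)[name]


row_blob = to_row_layout(records)
col_blob = to_column_layout(records)

# Compare the size of each layout as plain text and after gzip compression.
sizes = {
    "row_text": len(row_blob),
    "column_text": len(col_blob),
    "row_gzip": len(gzip.compress(row_blob)),
    "column_gzip": len(gzip.compress(col_blob)),
}

for label, size in sizes.items():
    print(f"{label}: {size} bytes")
```

Both layouts hold the same information; they differ only in which values are adjacent on disk, which in turn affects file size, compression, and the cost of reading whole records versus single fields.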
Article Details

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Copyright of all articles published is owned by CRMA Journal.