Codepickle: Portable Python Grid Computing
Abstract
We present Codepickle, a solution designed to overcome Python version constraints and improve portability, particularly in distributed Python grids and volunteer computing environments. Existing serialization libraries such as Cloudpickle and Dill rely on Python bytecode and are therefore prone to version conflicts; Codepickle offers a robust alternative. Our methodology includes adjustments for function serialization and shared variable management. Experiments reveal remaining challenges, such as code sourcing and nonlocal variable handling. Performance benchmarks show Codepickle's advantages over Cloudpickle, including better portability and smaller messages: Codepickle's messages are 84% of the size of those produced by Cloudpickle, especially for small code segments, with comparable execution performance. Proposed enhancements target open issues such as lambda functions and cross-version compatibility. This study demonstrates Codepickle's potential and underscores the continuing need for more advanced serialization techniques in Python distributed computing.
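The abstract does not spell out Codepickle's mechanism, so the following is only a sketch of the general contrast it draws, not Codepickle's API. It assumes that a task travels as source text rather than as bytecode; `SOURCE`, `rebuild_from_source`, and the `add` function are hypothetical names introduced for illustration. Only standard-library calls are used: `marshal` stands in for a bytecode payload, whose format is not guaranteed to be stable across Python versions, and `importlib.util.MAGIC_NUMBER` is the interpreter-specific marker stamped on compiled bytecode files.

```python
import importlib.util
import marshal

# Hypothetical task, held as plain source text (an assumption for this sketch).
SOURCE = "def add(a, b):\n    return a + b\n"


def rebuild_from_source(source, name):
    """Recreate a function on the receiving side by executing its source text."""
    namespace = {}
    # exec runs arbitrary code, so a real grid would need to trust or sandbox the sender.
    exec(source, namespace)
    return namespace[name]


# Bytecode route: the payload is a marshalled code object, tied to this interpreter's format.
code_obj = compile(SOURCE, "<task>", "exec")
bytecode_payload = marshal.dumps(code_obj)

# Source route: the payload is just UTF-8 text, which another Python 3 version can recompile.
source_payload = SOURCE.encode("utf-8")

add = rebuild_from_source(source_payload.decode("utf-8"), "add")
result = add(2, 3)

print("result:", result)
print("source payload bytes:", len(source_payload))
print("bytecode payload bytes:", len(bytecode_payload))
print("bytecode magic number of this interpreter:", importlib.util.MAGIC_NUMBER)
```

The printed sizes only describe this toy function on the local interpreter and say nothing about the 84% figure reported above. A source-text approach also has to locate the source of each function and capture the variables it closes over, which is consistent with the code sourcing and nonlocal variable challenges the abstract mentions.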
Article Details
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.