Container Use In Software Preservation

From SIS Wiki
Revision as of 10:40, 5 December 2020 by Ge7546 (Talk | contribs)


Container Use in Software Preservation: Approaches, Issues, and Examples

Annotations by Mark Suhovecky

Definition of Project


A container is a lightweight alternative to a virtual machine: it bundles one or more software executables with the dependencies and data they require to run, all combined into a single runtime package. The software can be a single application, a framework, or even an operating system or emulator to be used for running other software. Smaller in size and consuming fewer computing resources than the traditional virtual machine, containers are widely used in software reproducibility and preservation efforts. The articles in this bibliography introduce the technology behind containers, discuss many of the issues surrounding the preservation of software using containers, outline some of the high-level approaches that container-based preservation may take, and provide several in-depth descriptions of specific container-based implementations. The articles were selected from the bibliographies of a number of iPRES and IEEE Computing conferences, a Container Workshop held at the University of Notre Dame, and from a bibliography maintained by the Software Preservation Network. Some articles are secondary references from papers in those collections.

Annotations


Boettiger, C. (2015). An introduction to Docker for reproducible research. ACM SIGOPS Operating Systems Review, 49(1), 71–79. https://doi.org/10.1145/2723872.2723882

For those unfamiliar with containers, Boettiger provides a gentle introduction, focused on software reproducibility and geared toward researchers, using Docker, the reigning king of container technologies. Along the way, he also manages to introduce the reader to “dependency hell” and “code rot”, two problems any software preservationist will need to confront. Enough technical detail is given for the reader to understand what a Dockerfile is, the metadata it contains, and how it is used to build a container. There is also a valuable discussion of how a researcher might incorporate Docker into an existing workflow to preserve software as it is being developed. Code examples use the R statistical language.
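To make the idea concrete, a minimal Dockerfile in the spirit Boettiger describes might look like the sketch below. The base-image tag, package names, and script path are illustrative assumptions, not taken from the article (the `rocker` images are versioned R base images maintained by the Rocker project):

```dockerfile
# Pin a versioned R base image so the environment is fixed (illustrative tag)
FROM rocker/r-ver:3.2.0

# Install the R packages the analysis depends on (hypothetical package list)
RUN R -e "install.packages(c('ggplot2', 'dplyr'), repos = 'https://cran.r-project.org')"

# Copy the depositor's analysis script into the image
COPY analysis.R /home/analysis.R

# Running the container reruns the analysis
CMD ["Rscript", "/home/analysis.R"]
```

Building the image (`docker build -t my-analysis .`) captures the whole computational environment, and anyone with Docker can later rerun it with `docker run my-analysis`.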


Brown, G., & Hsu, A. (2011). Dependency analysis of legacy digital materials to support emulation based preservation. International Journal of Digital Curation, 6(1), 99-110. https://doi.org/10.2218/ijdc.v6i1.175

A container at its simplest is an application executable bundled with the software dependencies and data it needs to run. Brown and Hsu take a deep look at the difficulties of determining these software dependencies, using early Windows executables and libraries (DLLs) as examples. They then perform this analysis systematically for 2,700 software CD-ROM images in the Federal Depository Library Program (FDLP) using the LibArchive tool set. This article is primarily of value for its deep, nuts-and-bolts view of software dependency analysis, which may be new territory for some readers.
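The core of such dependency analysis, computing the transitive closure of a "depends on" relation, can be sketched abstractly. The dependency table below is a made-up illustration, not data from the paper:

```python
def dependency_closure(root, depends_on):
    """Return the set of all direct and indirect dependencies of `root`.

    `depends_on` maps each module name to the modules it uses directly,
    the way a PE header's import table names the DLLs an executable loads.
    """
    seen = set()
    stack = [root]
    while stack:
        module = stack.pop()
        for dep in depends_on.get(module, []):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

# Hypothetical import table for an early Windows executable
deps = {
    "viewer.exe": ["GDI.DLL", "KERNEL.DLL"],
    "GDI.DLL": ["KERNEL.DLL"],
    "KERNEL.DLL": [],
}
print(sorted(dependency_closure("viewer.exe", deps)))  # → ['GDI.DLL', 'KERNEL.DLL']
```

Real-world analysis is much harder than this sketch suggests, which is Brown and Hsu's point: import tables can be incomplete, dependencies can be loaded dynamically at run time, and version conflicts lurk everywhere.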


Clyburne-Sherin, A., Fei, X., & Green, S. A. (2019). Computational reproducibility via containers in psychology. Meta-Psychology, 3, 1-9. https://doi.org/10.15626/MP.2018.892

One popular container strategy, encapsulation, bundles code, data, results, metadata, and a computational environment together for later reproduction. Clyburne-Sherin and colleagues introduce Code Ocean, a popular cloud-based container implementation that makes use of such ‘compute capsules’. Based on Docker, Code Ocean provides a number of customizable, predefined base container computation environments for the depositor to use in their preservation efforts. The depositor uploads their code and data and creates a specification detailing how the software should be executed. In return, the depositor gets a DOI link to a cloud-based sandbox where the code can be rerun at the click of a button. This paper is of value for its in-depth exposure to a cloud-based software preservation service, and for introducing how software encapsulation is used in preservation.


Cochrane, E. (2019). Towards a Universal Virtual Interactor (UVI). In Proceedings of the 16th International Conference on Digital Preservation iPRES 2019, September 16-20, 2019, Amsterdam, The Netherlands, (pp. 1-10). https://doi.org/10.17605/OSF.IO/AZEWJ

The preservation of legacy computer software has its own special set of challenges. Cochrane walks us through the arduous process one must usually follow when starting with an orphaned legacy digital object: figuring out what file format the object is in, what software was likely used to render it originally, what operating system and version were used to run the software, and what hardware (and hardware configuration) the OS ran on. The Universal Virtual Interactor, given the legacy file via a browser upload, attempts to open it in a container running its “original” software environment, making its best guess at what that environment is based on the object’s characteristics, file classification, and a lengthy taxonomy algorithm. This is the Holy Grail of software preservation: rendering a digital object in its original environment with little or no upfront work by the depositor or curator. The UVI is a work in progress (as of 2020) on the container-based EaaSI platform. Section II is a useful template for anyone creating environments meant to run legacy software; Section III, on the UVI, is a glimpse into the state of the art.
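The flavor of the UVI's guess can be suggested with a toy stand-in for its far richer taxonomy. The file signatures below are real format magic numbers, but the mapping to environments and the function itself are invented for illustration and bear no relation to the UVI's actual algorithm:

```python
def guess_environment(filename, magic_bytes):
    """Map a legacy file to a plausible rendering environment (toy example)."""
    # (extensions, leading magic bytes) -> candidate software environment
    taxonomy = [
        ((".wpd",), b"\xffWPC", "WordPerfect 5.1 on MS-DOS 6.22"),
        ((".doc",), b"\xd0\xcf\x11\xe0", "Word 97 on Windows 95"),
        ((".wk1",), b"\x00\x00\x02\x00", "Lotus 1-2-3 on MS-DOS"),
    ]
    for extensions, magic, environment in taxonomy:
        if filename.lower().endswith(extensions) and magic_bytes.startswith(magic):
            return environment
    return "unknown: fall back to manual curation"

print(guess_environment("report.doc", b"\xd0\xcf\x11\xe0\xa1\xb1"))
# → Word 97 on Windows 95
```

The real difficulty, as Cochrane makes clear, is that format identification alone is not enough; the same format may have been produced and rendered by many different software stacks.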


Liebetraut, T., Rechert, K., Valizada, I., Meier, K., & Von Suchodoletz, D. (2014). Emulation-as-a-Service - The past in the cloud. In 2014 IEEE 7th International Conference on Cloud Computing, Anchorage, AK, (pp. 906–913). https://doi.org/10.1109/CLOUD.2014.124

There are many advantages to using a cloud-based container system for software preservation, and Liebetraut and colleagues cover them in this article. A cloud solution such as Emulation as a Service gives an organization without extensive financial or technical resources relatively inexpensive infrastructure for software preservation. As a cloud service, it minimizes duplication of effort and makes efficient use of the large storage and computing resources needed to run and maintain a container farm. It also allows for collaboration with other organizations: setting up legacy computing environments is time consuming, and cloud solutions make it easy to share such setups. Useful to any organization starting a software preservation initiative and interested in a collaborative, cloud-based solution.


Piccolo, S. R., & Frampton, M. B. (2016). Tools and techniques for computational reproducibility. GigaScience, 5(1). https://doi.org/10.1186/s13742-016-0135-4

Piccolo and Frampton discuss the pros and cons of seven different levels of software preservation, with code documentation at the simple end and containers at the other. The discussion of how containers resemble and differ from virtual machines is terrific, and the diagrams illustrating it really help. There is a good explanation of how a container takes up less space than a virtual machine, thanks to individual dependencies being layered over one another. The discussion of literate programming (Jupyter Notebook, knitr), while not specific to containers, is a good introduction to some of the other container implementations in this bibliography that build “clean” code specifications pre-preservation for later reproducibility.
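The space savings from layering can be illustrated with back-of-the-envelope arithmetic. All sizes below are invented for illustration, not figures from the paper:

```python
# Hypothetical image sizes in megabytes
base_os_layer = 200  # base OS layer, stored once and shared by all containers
app_layers = {"analysis-a": 50, "analysis-b": 80, "analysis-c": 30}

# Containers share a single copy of the base layer and add only their own deltas
container_total = base_os_layer + sum(app_layers.values())

# Each full virtual machine image must carry its own complete copy of the OS
vm_total = sum(base_os_layer + size for size in app_layers.values())

print(container_total, vm_total)  # → 360 760
```

The gap widens as more environments share the same base layers, which is exactly the situation a preservation repository holding many similar software packages finds itself in.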


Rechert, K. (2017). Preserving containers. In Kratzke, J., & Heuveline, V. (Eds.) E-Science-Tage 2017: Forschungsdaten managen, 140–151. https://books.ub.uni-heidelberg.de/heibooks/reader/download/285/285-4-79972-1-10-20171219.pdf.

Containers are widely used in software preservation, but, as Klaus Rechert points out, a container itself can be viewed as a complex digital object with some built-in preservation risks. Popular container implementations (Docker, Singularity, Shifter) use different runtime elements and require different metadata specifications. Rechert introduces the Open Container Initiative (OCI), which seeks to provide a common archival representation of a container, and suggests a strategy of ingest tools that convert container flavors into OCI specifications, coupled with an OCI emulator. A good read for understanding the container as a complex digital object with many parts, and for learning what those parts do. Also of value as a look into cutting-edge digital preservation research.
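For a sense of what such a common representation looks like, here is a skeletal OCI image manifest: a JSON document that points, by content digest, at a configuration blob and an ordered list of filesystem layers. The digests and sizes are placeholders:

```json
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:<config-digest>",
    "size": 1024
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:<layer-digest>",
      "size": 2048
    }
  ]
}
```

Because every component is content-addressed, an archived OCI image can be verified for integrity long after ingest, which is part of what makes the format attractive as an archival target.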


Steeves, V., Rampin, R., & Chirigati, F. (2018). Using ReproZip for reproducibility and library services. IASSIST Quarterly, 42(1), 14. https://doi.org/10.29173/iq18

ReproZip is a container-based software preservation tool that determines software dependencies using system-call analysis and packs them all into a ReproZipped (.rpz) file. An unpacking tool, ReproUnzip, uses a saved .rpz file to rerun the program in a Docker container environment. Steeves, a data management librarian, devotes an entire section to how ReproZip can be integrated into various library services, from digital libraries and repositories to subject librarians. Useful to any librarian interested in software preservation, or to anyone who would like good examples of container-based preservation with dependency analysis via system calls.
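The workflow the article describes boils down to a few commands, shown here following ReproZip's standard usage; `./experiment.sh` is a stand-in for the depositor's own program:

```shell
# Trace the program's system calls to discover what it reads and loads
reprozip trace ./experiment.sh

# Pack the traced dependencies, data, and metadata into one .rpz bundle
reprozip pack experiment.rpz

# Later, on another machine: unpack the bundle and rerun it in a Docker container
reprounzip docker setup experiment.rpz ./experiment-dir
reprounzip docker run ./experiment-dir
```

The trace step is what sets ReproZip apart: because dependencies are discovered from what the program actually touched at run time, the depositor does not have to enumerate them by hand.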


Thain, D., Ivie, P., & Meng, H. (2015). Techniques for preserving scientific software executions: Preserve the mess or encourage cleanliness? In The 12th International Conference on Digital Preservation iPRES 2015, November 2-6, 2015, Chapel Hill, North Carolina, USA. https://phaidra.univie.ac.at/o:429560

Douglas Thain provides a high-level analysis of the basic issues in software preservation that will serve the preservationist well when evaluating different container implementations. “The essence of the software preservation problem”, as he sees it, “is that it is extremely difficult for the end user to understand the set of objects upon which an execution depends”. He splits preservation frameworks into two groups. “Preserve the mess” strategies can be applied at ingest time (or even later), do automatic dependency checking, and require less depositor effort, but produce artifacts good for little more than verification (container examples are Umbrella and ReproZip). Strategies that “encourage cleanliness” put much of the preservation work on the depositor upfront, sometimes before the software has even been written, but result in objects that are more flexible and extendable (Code Ocean, for example). Thain’s own container-based solution, used in the NSF-funded DASPOS project, draws on both strategies, as does the EaaS project in this bibliography. It’s all in here. Highly recommended.


Willis, C., Lambert, M., McHenry, K., & Kirkpatrick, C. (2017). Container-based analysis environments for low-barrier access to research data. In Proceedings of the Practice and Experience in Advanced Research Computing 2017 on Sustainability, Success and Impact, 1–4. https://doi.org/10.1145/3093338.3104164

The authors describe how many computationally produced scientific datasets are far too large to be stored in conventional data preservation systems and instead rely on high-performance computing infrastructure for access, infrastructure that is not available to most organizations. They introduce two platforms, DataDNS and the National Data Service Labs Workbench, that maintain private Docker registries of containers with research computing software stacks designed specifically for reproducing runs of software that require these big-data frameworks. Worth a read if you are interested in preserving software that needs in-place access to large research datasets, or in software reproducibility in high-performance computing environments.