Uniformity in Scientific Research
Uniformity? Scientific Data Curation Best Practices in Research since 2009
Annotations by Kellie Madis
Project Definition
This annotated bibliography examines the literature published since 2009 on data collection, data storage, and data access within the scientific disciplines. Within the scope of data collection, what data was collected? Within the scope of data storage, how was the data stored? Within the scope of data access, was the data accessible after storage? Across all three measures, the uniformity of scientific data curation was studied.
Annotations
In the Bennet lab, five doctoral students and their department head, Dr. Bennet, used a variety of data collection tools to record and store data from their semiconductor study of stress placed on different surfaces, yet their data remained unshareable. To find out what made the data so difficult to share, Bennet and her colleagues had outsiders, the authors, examine their practices using the interview method. The paper's authors suggest that shared data collection and uniform naming practices might help alleviate the struggles faced by the researchers. Furthermore, the authors recommend a single uniform collection method to improve the reuse and sharing of the research data collected within the semiconductor research lab.
Austin et al. examine scientific data publishing methods in their paper: community conversation, workflows, and quality assurance practices as used by data journals. The authors used shared Google documents and shared flow-chart workflows, though researchers usually tend not to share their work but to hoard it until publication or acceptance to a repository. When scientific data is hoarded this way, it remains unusable to others in the field and no one is able to build on it. The authors recommend a uniform system that everyone can use: open access. This way, data can later be built upon by anyone and still remain uniform.
Dearborn's authorial team writes of a HUBzero project built around ISO 16363, a direct descendant of the Center for Research Libraries' Trustworthy Repositories Audit & Certification (TRAC) checklist, whose protocol it follows in order to create a model for uniform storage. The ISO 16363 project team first developed two PURR documents, the Preservation Strategic Plan and the Preservation Strategies. They then followed PURR's workflow, which is mapped to the OAIS Reference Model. For metadata they followed the METS, DCMI, MODS, and PREMIS standards. Data remained accessible later because it comes up in the HUBzero team's queue for review every ten years. According to the authors, data was uniform to the systems used but not to one all-encompassing curation system, because all data is unique and will have different requirements. The model used here could indeed be a model for uniform scientific research data.
With the goal of open-access scientific data, the authors used a personal code of best practices, saying: "the aim is to make data accessible and usable to anyone, anytime, anywhere, and for any purpose" (p. 62). Data was collected from different cultures, non-scientists, and remote locations, and included oceanographic research and ecological reuse patterns. Data was stored in a myriad of ways, such as on servers and in notebooks; it went unshared for long periods, and communication between scientists was not available for reuse. These are the problems the authors set out to remedy. When stored properly, data was accessible, but this was not always the case. The scientific data curation studied was not uniform to any curation system, hence its problematic storage. The authors recommend that a common system be born out of this problem while recognizing that one is not available yet. Further, they recommend the use of trained intermediaries such as archivists to collaborate on data storage from its creation point.
Examining earth science data in terms of its source, its location, and the distance between collection site and repository, the authors found that scientists were often long distances away from storage repositories for long periods of time. This made it difficult to deposit collected data both frequently and promptly. The authors hope to provide an example of best practice for data uniformity. Following the Open Archival Information System (OAIS) Reference Model, the authors apply the Service Oriented Architecture (SOA) model to the OAIS repository to describe how repositories can be implemented and extended through the use of internal or external services. When data was collected using the OAIS Reference Model with the added SOA stipulations, it was later accessible through the combination of OAIS, SOA, and THREDDS (Thematic Real-time Environmental Distributed Data Services), which is meant to assure uniformity and later reuse.
Inspired by both the Dublin Core and the Semantic Web, the authors have developed the Dryad repository metadata practice, a best practice they hope to make uniform in the scientific research community. Regarding the data collected, the system was designed for the "preservation, access, and reuse of scientific data objects underlying published research in the field of evolutionary biology, ecology, and related disciplines" (p. 199), and Dryad aims to be simple, interoperable, and Semantic Web-compatible. Data is stored using a Dublin Core Application Profile (DCAP) compliant with the Singapore Framework, as well as Dryad itself. The authors' 2009 data was later accessible through the new Dryad system; the authors aim to make Dryad both uniform and accessible to all scientists.
Here, McLure's authorial team investigates how scientists at Colorado State University recorded their research and what their resulting needs were; they then made recommendations for data curation best practices. Small to large data sets were previously stored in a variety of file formats (CSV, PDF, SQL, JPEG, TIFF, MP3) as well as discipline-specific formats such as Statistical Package for the Social Sciences (SPSS) files and geographic information system (GIS) files. Scientific data creation was rarely uniform to other curation systems, and access became problematic when researchers searched for their own and others' data. Based on the differing information styles, McLure's authorial team developed a Data Management Plan (DMP) to be followed by all researchers in the school; it consisted of six steps: plan, create, keep, produce, transfer, and share data. Using the DMP, data was later accessible; however, ACNS (Application and Content Networking System) storage, such as Cisco's, is recommended instead for both flexibility and security in long-term data curation.
In her article, Pinnick writes of two best practices: a past one, that of the British Geological Survey (BGS), and a future one, the European Union (EU)'s INSPIRE (Infrastructure for Spatial Information in the European Community). The past BGS practice was flawed, and the EU's INSPIRE is yet to be proven. Data was previously stored for 10 years or more in physical form; this data was not uniformly stored digitally at the time, but the EU's INSPIRE will be in place by 2019 to store it in the future. Previously, digital BGS information was lost. The author therefore recommends a combination of training and awareness for scientists and a technology watch list or priority list. Data curation here was not uniform to any given system except for the future EU INSPIRE, which is based on the surveyed needs of scientists.
This study was an exercise in human error versus standardized recording, in which two scientists discovered the same gene but named it very differently. The data collected in this lab were variants of a cancer gene, held in two Hi-C samples: T47D_rep2, named by the scientist, and b1913e6c1_51720e9cf, a computer-generated name. Data for the former sample was stored mainly in a notebook, and data for the latter on a shared computer with a shared backup Google document. When a scientist left the lab with the workup of the first sample in his personal notebook, that data had to be recreated; the shared data for the second Hi-C sample remained accessible and replicable. Data curation was uniform only to the FAIR Principles; however, the author group recognizes the need for a system that better contains metadata. That is why they have developed the FAIR DATA (Documentation, Automation, Traceability, and Autonomy) System for future use.
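To illustrate the contrast between a hand-chosen sample name and a computer-generated one, the following is a minimal Python sketch. It assumes the generated name is a short hash of the sample's metadata; the annotated source does not say how b1913e6c1_51720e9cf was produced, so the hashing scheme, the function name sample_id, and the metadata fields are all hypothetical.

```python
import hashlib
import json


def sample_id(metadata: dict, length: int = 9) -> str:
    """Return a short, deterministic identifier derived from sample metadata.

    Assumption: the scheme (SHA-1 over canonical JSON, truncated) is
    illustrative only and is not taken from the annotated study.
    """
    # Serialize with sorted keys so that field order cannot change the ID.
    canonical = json.dumps(metadata, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha1(canonical.encode("utf-8")).hexdigest()
    return digest[:length]


# Hypothetical metadata fields for one sample, entered in two different orders.
id_a = sample_id({"cell_line": "T47D", "replicate": 2, "assay": "Hi-C"})
id_b = sample_id({"assay": "Hi-C", "replicate": 2, "cell_line": "T47D"})

# A different replicate yields a different identifier.
id_c = sample_id({"cell_line": "T47D", "replicate": 3, "assay": "Hi-C"})
```

Under this assumption, two scientists who record the same metadata obtain the same identifier regardless of personal naming habits, which is the kind of uniformity the annotation describes.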
Witt's article provides guidelines for how data can be shared between institutions while remaining accessible to others and maintaining the standards of the profession. The sharing of data between science users and authors under the Purdue library's data curation policies for archiving is studied here. Early on, data storage for PURR included the Joomla content management system, the Rappture toolkit, a Web browser, TeraGrid, Open Science Grid, and the Handle System in conjunction with HUB; later, it drew on OAIS and DCC data curation for HUBzero, both of which are open-access. The resulting approach is now intended to be a model for other colleges and universities. Data curation was uniform to DCC, OAIS, and Dublin Core standards.