A key goal of the ROBIN Initiative is the collection and sharing of uniquely rich, fully annotated, and computable longitudinal digital data from an unprecedented range of biological assays, imaging studies, treatment planning dosimetry, and clinical variables. This goal will be accomplished through a well-designed data collection process together with continuous curation and collation of the complete de-identified dataset into a flexible, cloud-based back-end data store (the NCI-sponsored Terra/FireCloud). The resulting linked and computable dataset will be easily accessible to collaborators within FireCloud, who will have access to our complete computing workflows and analyses using standard open-source informatics tools such as Python, R, and Jupyter notebooks. All published analyses will have corresponding re-runnable FireCloud Jupyter notebooks. Following the end of the project, applicants will receive full access to the linked dataset for any legitimate use, including downloads, analyses, and re-analysis using our stored computing workflows. We will provide a full range of “cheat sheet” examples and tutorial guides (videos, wikis, and example notebooks) to make data access and analysis re-use as painless as possible.
Data flow schema: Figure 1 shows the overall data flow. The MCT will collect a broad range of biological, imaging, and clinical data acquired longitudinally. A comprehensive data management approach will be taken, ensuring consistent labeling of data samples, supported by specialized and open-source informatics resources previously developed and deployed at Cornell (Tracker, REDCap) and MSK (XNAT, CERR). Data accrual, completeness, and consistency will be monitored by dedicated personnel (Tang, LoCastro), who will work closely with investigators and the DSIA Core leadership (Deasy and Sadanandam) via recurring Zoom reviews of imaging and biological data characteristics. Data will be uploaded to FireCloud, annotated using standard terms where possible, and linked together into computable machine-readable/AI-ready longitudinal subject profiles.
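As a minimal illustration of what a linked longitudinal subject profile could look like, the following Python sketch groups de-identified records by subject and orders each subject's records by acquisition date. The `Record` fields, the `build_profiles` function, and the sample values are hypothetical and made up for illustration; they do not describe the actual Terra/FireCloud data model or any project data.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Record:
    """One de-identified observation (all field names are hypothetical)."""
    subject_id: str   # study-assigned identifier, not a patient identifier
    modality: str     # e.g. "imaging", "assay", "dosimetry", "clinical"
    acquired: date    # acquisition date, used for longitudinal ordering
    payload: dict = field(default_factory=dict)  # annotated values


def build_profiles(records):
    """Group records by subject, then sort each subject's records by date."""
    profiles = defaultdict(list)
    for rec in records:
        profiles[rec.subject_id].append(rec)
    for timeline in profiles.values():
        # Ties on the same date are broken by modality name for stable output.
        timeline.sort(key=lambda r: (r.acquired, r.modality))
    return dict(profiles)


# Made-up example records, for illustration only.
records = [
    Record("S001", "imaging", date(2023, 3, 1), {"series": "CT"}),
    Record("S002", "clinical", date(2023, 1, 15), {"visit": "baseline"}),
    Record("S001", "assay", date(2023, 1, 10), {"panel": "example"}),
]

profiles = build_profiles(records)
for subject_id, timeline in sorted(profiles.items()):
    print(subject_id, [(r.acquired.isoformat(), r.modality) for r in timeline])
```

Keying every record on a single study-assigned subject identifier is what allows data from different sources to be joined into one timeline per subject without exposing identifying information.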
Figure 1. Overall data flow schema of this ROBIN Center. Patient registration and clinical trial number assignment will be performed by the Cornell group; existing Cornell Profiler/Tracker tools will be used to track and monitor tissue and data acquisition. Imaging and treatment planning data will be tracked and uploaded to the MSK group. Next, all data will be uploaded to Terra/FireCloud, where they will be further curated, including any needed harmonizations such as adjustments for batch effects and image harmonization. Data will be annotated using standard terms and linked together into computable machine-readable/AI-ready longitudinal subject profiles.
Dataset descriptor publications: Detailed descriptions of the disseminated datasets and of the methods for computable cloud access will be the focus of one or more peer-reviewed publications, submitted either to the journal Scientific Data or to the journal Medical Physics, using the new Medical Physics Dataset Article type.
Licensing: Upon publication, we will use a fully open license without restrictions of any kind, such as the Creative Commons license used by the journal Scientific Data. Under that license, users are free to share, copy, distribute, adapt, transform, and build upon the data or software.
Rapid Access to Published Material: To accelerate access to results, data, and tools, our standard publication approach will be to submit preprints to either the medRxiv or bioRxiv servers, coincident with submission for peer-reviewed publication. In accordance with the NIH Public Data Sharing policy, we agree to share all research resources and relevant data generated by this project with the scientific community via publications, presentations, and abstracts in a timely manner. Publications and presentations using data generated under the proposed project will acknowledge the NIH funding source and will include information on where the data are available. As specified by the NIH Public Access Policy, we will ensure submission of final peer-reviewed journal manuscripts that arise from NIH funds to PubMed Central. Should the NIH release new guidelines for data sharing, we will update this plan to comply with those recommendations.
Support for collaborations: Our proposed data processing workflow, together with the proposed cloud access architecture, will not only facilitate within-Center transcontinental collaborations but will also enable inter-ROBIN Network projects (conducted in Years 2-5) that leverage either our computational tools or our unique datasets. More broadly, this data flow infrastructure, and the full integration of our tools with Terra/FireCloud, provides a powerful basis for trans-NCI projects as envisioned in the ROBIN RFA. Furthermore, our approach ensures that data collected under this ROBIN Center will be FAIR (Findable, Accessible, Interoperable, and Reusable). A key point is that Terra/FireCloud enables the workflows and corresponding data to coexist, making data truly reusable.