Managing Big Data in a High-Throughput Genomics Pipeline
Confirmed Presenter: Grace Pigeau, Ontario Institute for Cancer Research, Canada
Room: 525
Format: In Person
Moderator(s): Alberto Riva
Authors List:
- Grace Pigeau, Ontario Institute for Cancer Research, Canada
- Heather Armstrong, Ontario Institute for Cancer Research, Canada
- Michael Laszloffy, Ontario Institute for Cancer Research, Canada
- Dillan Cooke, Ontario Institute for Cancer Research, Canada
- Alexander Fortuna, Ontario Institute for Cancer Research, Canada
- Alexis Varsava, Ontario Institute for Cancer Research, Canada
- Ally Wu, Ontario Institute for Cancer Research, Canada
- Beatriz Lujan Toro, Ontario Institute for Cancer Research, Canada
- Bernard Lam, Ontario Institute for Cancer Research, Canada
- Jessica Miller, Ontario Institute for Cancer Research, Canada
- Xuemei Luo, Ontario Institute for Cancer Research, Canada
- Ryan Falkenberg, Ontario Institute for Cancer Research, Canada
- Morgan Taschuk, Ontario Institute for Cancer Research, Canada
- Lawrence Heisler, Ontario Institute for Cancer Research, Canada
Presentation Overview:
The Genome Sequence Informatics (GSI) team at OICR analyzed and processed over 1.2 petabytes of data in 2023. The resources required to store such large amounts of data are expensive and difficult to manage. Additionally, data processing demands will increase significantly in 2024, with the onboarding of two new sequencers and migration to larger-capacity flow cells. To manage this increased data output, we are implementing changes to data tracking and more aggressive data removal.
Assays available through the genomics core operate on a distinct set of samples (defined as a case) that are analyzed together, producing a variety of deliverable files and reports. Once all work on a case is complete, the associated data can be scheduled for deletion. However, data from cases that use our clinical reporting assays must be retained for two years. To accommodate this, data is automatically backed up over multiple stages to cloud storage before being deleted. First, the cases that are complete and ready for archiving are identified by an automated pipeline operations system. The raw sequence data and any files that are directly used by the clinical report are then encrypted and automatically backed up to a file storage web service. Archive status and metadata are tracked in a local database. If needed, archive retrieval is straightforward to initiate, and the files are recalled for reloading into the production pipeline. This allows the team to meet accreditation requirements and ensure data integrity without requiring continually increasing storage capacity.
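The status tracking described above can be illustrated with a minimal sketch. It is not the GSI team's implementation: the table name, column names, status values, and helper functions are all hypothetical, and the two-year retention period is assumed to start at case completion, which the abstract does not state. It only shows how a local database could gate deletion so that clinical cases are removed locally only after they have been archived.

```python
import sqlite3
from datetime import date


def open_db(path=":memory:"):
    """Create the (hypothetical) tracking table if it does not exist."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS case_archive (
               case_id      TEXT PRIMARY KEY,
               is_clinical  INTEGER NOT NULL,
               completed_on TEXT NOT NULL,
               status       TEXT NOT NULL DEFAULT 'complete'
           )"""
    )
    return conn


def register_case(conn, case_id, is_clinical, completed_on):
    """Record a case whose work is complete."""
    conn.execute(
        "INSERT INTO case_archive (case_id, is_clinical, completed_on) "
        "VALUES (?, ?, ?)",
        (case_id, int(is_clinical), completed_on.isoformat()),
    )


def mark_archived(conn, case_id):
    """Record that the encrypted backup of a case has finished."""
    conn.execute(
        "UPDATE case_archive SET status = 'archived' WHERE case_id = ?",
        (case_id,),
    )


def deletable_local(conn):
    """Cases whose local files may be removed.

    Non-clinical cases qualify once complete; clinical cases qualify
    only after their backup has been recorded as archived.
    """
    rows = conn.execute(
        "SELECT case_id FROM case_archive "
        "WHERE (is_clinical = 0 AND status = 'complete') "
        "   OR (is_clinical = 1 AND status = 'archived') "
        "ORDER BY case_id"
    )
    return [row[0] for row in rows]


def retention_end(completed_on):
    """Date two years after completion (assumed start of retention)."""
    try:
        return completed_on.replace(year=completed_on.year + 2)
    except ValueError:
        # 29 February has no counterpart in a non-leap year.
        return completed_on.replace(year=completed_on.year + 2, day=28)
```

In this sketch a clinical case stays out of the deletable list until `mark_archived` has been called for it, which mirrors the "back up, then delete" ordering in the abstract.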
Novel Linux-style code helps us all down the road
Confirmed Presenter: George Bell, Whitehead Institute, United States
Room: 525
Format: In Person
Moderator(s): Alberto Riva
Authors List:
- George Bell, Whitehead Institute, United States
- Bingbing Yuan, Whitehead Institute, United States
- Troy Whitfield, Whitehead Institute, United States
- M Inmaculada Barrasa, Whitehead Institute, United States
- Xinlei Gao, Whitehead Institute, United States
Presentation Overview:
Python, MATLAB, and especially R --
all have code bases that can help you go far.
But for biologists who can't program,
asking them to try can lead to, "No way, ma'am!"
In contrast, when sending a biologist to the command line,
they typically respond, "Sure -- that'd be fine!"
As a result, getting R and Python packages into scripts
doesn't cause any coding conflicts.
Typing the command provides the syntax,
so then you'll know all the practical facts.
We can recommend libraries like edgeR and DESeq2,
and give everyone great analytic methods to pursue.
And specialized figures like UpSet, Sankey, and waterfall,
can be easily created, even in places like Montreal.
We provide input, output and sample commands
which are accessible by all -- no one misunderstands.
So try out the scripts on our web site,
they can increase your efficiency to a new height.
https://github.com/whitehead/barc
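As one possible reading of "typing the command provides the syntax", here is a minimal sketch of a command-line script that prints its own usage when run without arguments. It is not taken from the repository above: the script name `count_lines.py`, its single argument, and its line-counting task are hypothetical stand-ins for a real analysis step.

```python
import argparse
import sys


def build_parser():
    """Describe the command so that the help text doubles as documentation."""
    parser = argparse.ArgumentParser(
        prog="count_lines.py",  # hypothetical script name
        description="Count the lines in a text file.",
        epilog="Example: count_lines.py input.txt",
    )
    parser.add_argument("input", help="path to the input text file")
    return parser


def main(argv=None):
    """Print usage when called with no arguments; otherwise do the work."""
    argv = sys.argv[1:] if argv is None else argv
    parser = build_parser()
    if not argv:
        # No arguments: show the syntax and a sample command, then stop.
        parser.print_help()
        return 1
    args = parser.parse_args(argv)
    with open(args.input) as handle:
        print(sum(1 for _ in handle))
    return 0
```

Calling `main([])` prints the description, the argument list, and the sample command; calling `main(["input.txt"])` prints the number of lines in that file.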