The Data Lake – we’re going to need a bigger boat (and better tools)

The Bioprocessing Summit, Boston, 2018   

A thought-provoking week at The Bioprocessing Summit 2018 in Boston was summarised nicely by a comment from one presenter: “it’s all about data, that’s where we are now”. The data conversation continued at our own stall (the first time we have exhibited at The Bioprocessing Summit), with delegates surprised by how easy our software makes it for even first-time automation users to design, simulate and run complex, data-generating experiments such as a design of experiments for media optimisation. The same theme ran across the wider conference: fascinating discussions were everywhere, with delegates, speakers and exhibitors from all walks of the biotechnology sector – biologics, formulation, cell and gene therapies – all with the same word on their lips: data.

It has long been recognised that data is a critical part of the bioprocess development journey at every stage. But as technologies and modalities become more advanced, as the number and variety of products in development grow, as high-throughput technologies are increasingly utilised in development programs, and as on-/at-line analytical techniques improve and increase the density of data streams, the tools we use at present are simply not fit for purpose.

As Mariano Nicolas Cruz of the Technical University of Berlin pointed out, the amount of data we’re generating is growing exponentially, and Jerry Murry of Amgen noted that 80% of the world’s data has been generated in the last two years. Yet we still use handwritten batch records and Excel spreadsheets to track and analyse our experiments. Jeremy Spignall of MedImmune noted that a single chromatography method development study can easily generate thousands of sets of data per day.

But it’s not just the volume of data that’s a problem; it’s the type, complexity and disparate formats of the data we’re generating. As I discussed with Florian Dziopa of MeiraGTx, gone are the days of end-point data providing a sufficient level of understanding. We’re now in the age of advanced process control (APC), as presented by Anne Tolstrup of BPTC; hybrid modelling, as discussed by Michael Sokolov of DataHow; and machine learning, as addressed by Wei-Chien Hung of Alexion.

Then there is the critical point that experiments are being designed in a fundamentally more complex way than traditionally considered – for example, using DoE methodologies with multivariate responses in mind, as referenced by Marc-Olivier Baradez of the Cell and Gene Therapy Catapult. Try doing that in an Excel spreadsheet.
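
To make that concrete, here is a minimal sketch in Python (the media factors, levels and response names are hypothetical examples, not taken from any of the talks) of how a full-factorial design with several responses per run might be generated and tracked in code rather than by hand in a spreadsheet:

# Minimal sketch: full-factorial DoE for a media optimisation,
# with placeholder columns for multiple responses per run.
# All factor and response names below are hypothetical examples.
import itertools
import pandas as pd

factors = {
    "glucose_g_per_L": [2, 4, 6],
    "glutamine_mM": [2, 4],
    "feed_rate_pct": [3, 5],
}

# Every combination of factor levels becomes one run in the design.
design = pd.DataFrame(
    list(itertools.product(*factors.values())),
    columns=list(factors.keys()),
)
design.insert(0, "run", range(1, len(design) + 1))

# Multivariate responses recorded against each run after execution.
for response in ["titre_g_per_L", "viability_pct", "aggregate_pct"]:
    design[response] = pd.NA

print(design)

With the design held as a table like this, the multivariate responses stay tied to the factor settings that produced them, which is exactly the bookkeeping that becomes error-prone in a spreadsheet.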

Namit Mehta of Strategy& talked through examples from other industries – chemicals, wind energy and even snack foods – where strategies such as AI and big data analytics are already being applied, and how we could bring those strategies to the biotech industry.

Nevertheless, the overall message was unmistakable: we’re not doing enough with the data that we’re generating, and the scale and complexity of that data raise a lot of questions. What should we do about concerns over data security? How do we handle, organise and process volumes of data orders of magnitude greater than we’re used to dealing with? How do we ensure the quality of our data? Are techniques like machine learning even applicable, or useful, in the field of bioprocessing? And what is the impact of something like GDPR on my data?

But the consensus was clear: we have a lot of data, and far more questions than answers.