Lessons learnt from working on a big data analytics project: the view from the T2D case

Authors: Nada Philip (Kingston University London) & John Chang (Croydon Hospital)


First, let me tell you the story of why I got involved with the Aegle project.

I think we have all been there. The data are there, but how do you work out what they actually mean? We collect medical data all the time, but rarely get the chance to harness it. What is the point of doing all this work collecting data if it is never used?

Supermarkets have worked it out: using club cards with reward points, they gather billions of data points. They are often aware of who is having an affair or who is depressed before the shopper knows!

Can we utilise our data (big data) to do the same?

This, though, is just the start. Medical data is, by its very nature, confidential: under the current GDPR you need the patient’s explicit consent. This is a major obstacle, especially where data has been collected over a number of years, predating the regulations. The latest alleged data breach is again going to affect the reputation of well-meaning clinicians.

Given this background, getting centres to open up and share data is now an even bigger problem. They simply do not want to share data where consent or governance issues rear their heads. That is what we are finding within this project for the diabetic data. It comes on top of the issues we already have to tackle, namely competition for the first publication or discovery!

Fixing it is not easy. We are working closely with our own data to develop the analytics. Once that is fully bedded down, we can demonstrate it to the true stakeholders: the patients. Type 2 diabetes is a growing global problem for all. Yet despite this, developed countries like the UK still do not have a formal diabetic registry. The Scandinavian countries, being younger and more nimble, have both diabetic registries and consented registries that are becoming the envy of the world of diabetes, yet they hold less data than the UK.

So, as a starting point for the many challenges such projects face, let me take you through how we began our journey to utilise type 2 diabetes big data.

First, we secured agreement from the data custodians that the data was available, and that we could use it.

Naturally, obtaining formal consent from patients for retrospective data was not practical, given the period over which these databases had been built up. Formal ethical approval was therefore sought and granted, enabling this data to be utilised and a ‘research database’ to be generated.

One side effect of this is that the research database could be made available to other interested parties if they apply to use it.

Having secured the approvals, we were in a position to look at the data. That was when the sinking moment arrived. These databases were clinically driven, which means busy clinical teams filled them in when they saw the patient. The data fields were not mandatory. Over the period, the data units also changed, and other data fields were added or removed. This culminated in significant gaps in various data fields.

Sometimes we have to deal with different databases that are formatted differently, which is a nightmare when you are trying to merge them. Do you merge everything, or limit yourself to a workable dataset? We chose the latter. The downside is that, in avoiding gaps, you may lose the data fields of interest. That is something to bear in mind if we do this again.
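To make that trade-off concrete, here is a minimal Python sketch of the kind of "harmonise, merge, then trim to a workable dataset" step described above. It is not the code used in the project: the two toy databases, the field names, the glucose unit conversion and the 70% completeness threshold are all assumptions chosen purely for illustration.

```python
# Illustrative sketch only: every field name, record and threshold is hypothetical.

MGDL_PER_MMOLL = 18.016  # approximate glucose conversion (mg/dL per mmol/L)


def harmonise(record, field_map, glucose_unit):
    """Rename source-specific fields to a shared schema; convert glucose to mmol/L."""
    out = {field_map[k]: v for k, v in record.items() if k in field_map}
    glucose = out.get("glucose_mmol_l")
    if glucose is not None and glucose_unit == "mg/dL":
        out["glucose_mmol_l"] = round(glucose / MGDL_PER_MMOLL, 1)
    return out


def completeness(records, fields):
    """Fraction of records holding a non-missing value for each field."""
    total = len(records)
    if total == 0:
        return {f: 0.0 for f in fields}
    return {f: sum(r.get(f) is not None for r in records) / total for f in fields}


def workable_fields(records, fields, threshold):
    """Keep only the fields whose completeness reaches the threshold."""
    comp = completeness(records, fields)
    return [f for f in fields if comp[f] >= threshold]


# Two toy "clinical" databases with different field names and glucose units.
db_a = [
    {"pid": 1, "gluc": 126.0, "bmi": 31.2},   # glucose recorded in mg/dL
    {"pid": 2, "gluc": None, "bmi": None},    # optional fields left empty
]
db_b = [
    {"patient": 3, "glucose": 7.0},           # glucose in mmol/L, no BMI field
    {"patient": 4, "glucose": 8.1},
]

map_a = {"pid": "patient_id", "gluc": "glucose_mmol_l", "bmi": "bmi"}
map_b = {"patient": "patient_id", "glucose": "glucose_mmol_l"}

merged = [harmonise(r, map_a, "mg/dL") for r in db_a] + [
    harmonise(r, map_b, "mmol/L") for r in db_b
]

FIELDS = ["patient_id", "glucose_mmol_l", "bmi"]
print(completeness(merged, FIELDS))
print(workable_fields(merged, FIELDS, threshold=0.7))
```

In this toy example the BMI field is dropped because only one of the four merged records carries it. That is exactly the downside noted above: trimming to a gap-free dataset can cost you a field you actually wanted.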

We also need the clinical teams to work with the technical teams so that the questions of ‘what do you want to see?’ and ‘what do you want to do with the data?’ can be thrashed out. The clinicians should have a picture in their minds, and the tech teams hold the paintbrush and colours to paint what they think the clinical team wants.

Yes, you have guessed it – is that the way we should have done it?

What about the patient’s perspective? What about the scientist? How does pharma fit into the picture?

As I said before, this is the start of the journey. Each journey needs some safe steps, and these are ours. Once we have the system up, we are walking. Once we are walking, we can set the pace and get the others on board. Until then, the journey is only starting. Help me take these steps along our Aegle journey.

Let me go through our initial analytics journey. Deciding on the clinical scenarios for the T2D case, and on the related analytics and visualization, is an important step in such projects. This step depends on several questions. Is data available to perform the analytics? Is there value, from a business perspective, in such a scenario and its analytics? Can we classify these analytics as big data analytics? Selecting the scenarios and related analytics goes hand in hand with the availability of data and the validity of the business model for such a scenario in the health sector.

One of the recommendations and lessons learnt from working on big data analytics projects like AEGLE is to start with one or more pre-projects that cover the initial steps: obtaining the data and the ethical clearance to use it, defining the big data scenarios and analytics, and validating the business model. Only then should we spend resources and time on the remaining steps: building the big data platform infrastructure and developing the system modules that provide the functionality of a typical big data analytics platform, e.g. storing the data securely; creating, running and managing workflows and analytics; and visualizations. Developing such a complex system requires resources in terms of cost, talented staff and time.