On Plc3bo...12 Apr 2013
Plc3bo is a simple webapp for exploring the prescriptions made by the NHS between 2009 and 2012. The idea came to me after trawling through data.gov.uk and finding the dataset Prescription Cost Analysis, England, which looked pretty interesting. Being from a medical family (Doctor Dad, Nurse Mum and Med Student Brother), I’ve been impregnated with a significant interest in medicine (thankfully I lack the will to become a doctor, which is why I’m studying Computer Science and not Medicine).
The first thing I had to do with the dataset was to get it into a usable format. Like most things on data.gov.uk it was a combination of CSVs and Excel spreadsheets. I had to convert the Excel spreadsheets into CSVs (I’m lucky in that I have access to Excel, doubly so because the spreadsheet happily opened in Libreoffice as well), and after a little bit of column rearranging so that all 4 years were in the same format, I was ready to load the data into a database.
I used Python and SQLAlchemy (a python database abstraction package) to load the data into a MariaDB database (a drop-in MySQL replacement), and a Whoosh database for fulltext searching (Whoosh is a pure Python full text indexing/searching package, unfortunately SQLAlchemy lacks the wrappers for fulltext searching on MariaDB). Unfortunately, halfway through importing the first year, I decided I was going to change my table structures, so I had to start all over again. I set out my tables as Category, Active Ingredient, Drug and Statistic, and the resulting requirement for each of the 90,000 CSV rows to check whether the relevant Category, Active Ingredient and Drug existed or not made the import process very slow, taking about 4 hours. I still have 2 more years to import!
Whilst the import was running, I made the webapp for exploring the data. This was built on Flask, a simple Python web framework. It included a front page listing all the categories, a search function, pages for each category, active ingredient and drug. The drugs page showed a table of statistics, which also displayed a graph when a column heading was clicked. Unfortunately it was at this point that I ran out of time, having spent too much time battling with the raw data, and with d3.js, the library used for the graphs.
I would have liked to add graphs and data comparisons in other places on the site, particularly to compare generic versions of drugs with non-generics. I’d also like to neaten things up a bit, unfortunately the dataset had coded names for the majority of drugs, such as Paracet for Paracetamol. To go through and manually fix all of these would be an incredibly time-consuming task, so I would have to find another way of fixing this. There are also a lot of components coded into the name such as dosage and format that I only partially parsed, I think it would be useful to parse these fully so as to make the information more human readable and understandable.