My experience of working with data that isn’t ‘mine’

This blog post was originally written for and published by the Health Services Research Association of Australia and New Zealand (HSRAANZ) Emerging Researcher Group (ERGO) section of the August 2012 Newsletter.

 

During my PhD I was lucky enough to be offered access to a large dataset for analysis.  This was a fantastic opportunity, which has strengthened my PhD and my data management and analysis skills.  Logistically however, it was not always easy.  I learnt a number of lessons that I thought other early career researchers may find useful.

There were four main issues I encountered:

Physically accessing the data

The data was held at another institution, which had received ethics approval and access to the data on the condition that it was kept confidential and did not leave their secure building.  I therefore needed to go to their site to conduct the analysis.  While this was not foreseen as an issue, there was a huge amount of red tape involved in getting access to the university building, a desk, a computer, a log-in, etc., because I was neither a staff member nor a student at that university.

To avoid these issues, start planning logistics early, and be realistic about what you need.  ‘Hot desking’ is not as easy as it sounds, so if you need your own computer, or a larger than average hard drive, be specific.  Make sure you ask very specific questions very early on about how you will access buildings and resources such as stationery, software, etc.

Working with data that wasn’t ‘mine’

It takes extra time to get to know your data when you haven’t been involved in collecting it.  A data dictionary can be extremely helpful in these situations, and it is worth continuing to ask for one if it is not provided with your data.  The other issue with working with data that isn’t yours is that you may end up waiting for other people to prepare or clean datasets before you can use them.  Obviously this impacts timelines, so build a generous buffer into your project plans.

Working with a very large dataset

My working data file was over 60 GB, and analysis code often took days to run.  The computer system at the university was not really configured for work being done overnight, so my programs would often be interrupted by virus scans and automated backups.  I ended up using a local drive and doing my own backups to avoid the issue, but in future I would try to sort this out before starting.
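One practical way to soften the cost of interrupted overnight runs is to process the data in chunks and checkpoint progress after each one, so a job that dies partway can resume rather than restart from scratch.  The sketch below is illustrative only (the file name, chunk logic, and helper names are my own assumptions, not from the original analysis):

```python
import json
import os

CHECKPOINT = "progress.json"  # hypothetical checkpoint file name


def load_checkpoint():
    # Resume from saved state if a previous run was interrupted.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            state = json.load(f)
        return state["next_chunk"], state["results"]
    return 0, []


def save_checkpoint(next_chunk, results):
    # Persist both the position and the partial results after each chunk.
    with open(CHECKPOINT, "w") as f:
        json.dump({"next_chunk": next_chunk, "results": results}, f)


def process_in_chunks(data, chunk_size, worker):
    # Apply `worker` to successive chunks, checkpointing as we go,
    # so an overnight interruption costs at most one chunk of work.
    chunk, results = load_checkpoint()
    for i in range(chunk * chunk_size, len(data), chunk_size):
        results.append(worker(data[i : i + chunk_size]))
        save_checkpoint(i // chunk_size + 1, results)
    return results
```

For example, `process_in_chunks(list(range(10)), 3, sum)` sums each chunk of three values; if the run is killed mid-loop, rerunning the same call picks up from the last completed chunk.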

Working off site

Finally, I have already covered some of the access issues of working off site, but the other issue this raised was that it was quite isolating.  There was no one there who was really responsible for my work or who understood my project and methods, and I couldn’t sit down and show my data to anyone at my PhD office.  It was also difficult to fit time in two offices into my daily schedule.  Meetings and events often prevented me from going for days at a time, by which time I had forgotten what I was working on!  The solution was to get as organized as possible, to keep detailed notes of what I did each day, and to use tools such as Dropbox (where allowed) to keep track of things.