Data Governance for Human-Related Dynamic Digital Data: A General Guide for Researchers

As human lives become more digitised than ever, new forms of data, especially social media data and streaming data, are increasingly incorporated into research. While standard practices exist for handling traditional research data, no such standards exist for these human-related dynamic digital data. This calls for a new data governance framework to guide researchers and anyone else working with digital data.

What is data governance and why is it important?

DAMA International’s Body of Knowledge defines data governance as "the exercise of authority and control (planning, monitoring, and enforcement) over the management of data assets". The purpose of data governance is to ensure that data are managed properly and in accordance with policies and best practices.

Because dynamic digital data are human-related, using them for research inevitably entails technical, ethical, and legal challenges. This data governance framework therefore aims to give researchers guiding principles for navigating these challenges.

The framework outlines the general data considerations relevant to each stage of the research data lifecycle (i.e., collection, storage, pre-processing, analysis, sharing, publishing, and archiving). In doing so, we attempt to describe the current best practices, underpinned by the FAIR and CARE principles, that are aligned with the research community’s goals and emerging directions.

FAIR and CARE principles

The FAIR principles state that data resources, tools, vocabularies, and infrastructures should be Findable, Accessible, Interoperable, and Reusable. They may be adopted in any combination and incrementally; FAIR is not an 'all or nothing' framework.

The CARE principles address how research data should be used to foster the wellbeing of people. They focus on the purpose of data and comprise Collective benefit, Authority to control, Responsibility, and Ethics.

A challenge for researchers working with human-related dynamic digital data is navigating the tensions inherent in this type of data.

Legal and ethical landscape

In addition to data governance considerations, the use of human data in Australia has legal and ethical implications. For example, mishandling personal data might breach the Privacy Act 1988. Depending on where the subjects of your research are located, you might also have to comply with data regulations in other regions, such as the European Union's General Data Protection Regulation (GDPR). As a researcher, it is advisable to be aware of this ethical and legal landscape, including the different legislation and ethical guidelines that might be relevant to Australian researchers.

Data Considerations Throughout the Research Data Lifecycle

Good data governance requires considering data implications throughout the entire research data lifecycle: collecting, storing, analysing, sharing, and archiving data. Below are some concerns which researchers should take into account at each phase of the data lifecycle.

Collection

  • Is the data collection documented with methodology and metadata that enable its future reusability and reproducibility, in accordance with the FAIR principles?
  • Is data collection justified in terms of its research merits?
  • Have privacy, consent, the platform's terms of service, copyright, and the ephemerality (short life-span) of the data been considered?
  • What procedures are in place to deal with the rights of data subjects to withdrawal and erasure?
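
One way to act on the documentation and justification questions above is to keep a small machine-readable record alongside each collection run. The following is a minimal sketch in Python, not a prescribed schema: the `CollectionRecord` class, its field names, and all example values are assumptions made for illustration only.

```python
import json
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone


@dataclass
class CollectionRecord:
    """Hypothetical metadata describing one data collection run."""
    project: str
    platform: str
    collection_method: str          # e.g. "official API" or "data donation"
    query: str                      # search terms or account list used
    justification: str              # research merit of collecting this data
    ethics_approval_id: str         # placeholder; use your institution's format
    terms_of_service_checked: bool
    # Timestamp in UTC, filled in automatically when the record is created.
    collected_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_json(record: CollectionRecord) -> str:
    """Serialise the record so it can be stored next to the dataset."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


# Illustrative values only; none of these refer to a real project.
record = CollectionRecord(
    project="example-project",
    platform="example-platform",
    collection_method="official API",
    query="#examplehashtag",
    justification="Pilot study of public discussion on topic X",
    ethics_approval_id="PLACEHOLDER-0000",
    terms_of_service_checked=True,
)
print(to_json(record))
```

Storing such a record with the data gives later users the methodology and context they need to judge whether, and how, the data can be reused.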

Storage

  • Who has access to the data?
  • Who is the data owner?
  • Who is the data steward?
  • How is access secured?
  • What and where is the data management plan?
  • Where are the data physically stored?
  • What are the legal implications of the geography of storage?
  • What is the risk of compromise and data breach?
  • How long will the data be stored, and in what format?
  • How is anonymisation of data enabled? How is the risk of re-identification being mitigated?
  • Will the data be encrypted? What encryption method will be used?
  • What are the other data attributes to be considered (richness, complexity, type [e.g., text, image, link, video], size)?
  • What is the process for deletion of data?
  • How will data that cannot be characterised be handled?
  • What are the costs of storing data?
  • How will the sustainability of storage be ensured?
  • How will sensitive and objectionable content be handled?
  • How will data about marginalised and sensitive populations be handled?
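
To make the anonymisation and re-identification questions above more concrete, the sketch below replaces account identifiers with keyed hashes before storage. It is one possible approach rather than a recommended standard: the technique (HMAC-SHA256 pseudonymisation) is our illustrative choice, and the function names and sample records are hypothetical. It uses only the Python standard library.

```python
import hashlib
import hmac
import secrets


def new_key() -> bytes:
    """Generate a random secret key; store it separately from the data."""
    return secrets.token_bytes(32)


def pseudonymise(user_id: str, key: bytes) -> str:
    """Replace an identifier with a keyed SHA-256 digest (HMAC).

    The same ID and key always give the same pseudonym, so records can
    still be linked, but the pseudonym cannot be recomputed without the key.
    """
    return hmac.new(key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


key = new_key()

# Made-up sample records for illustration.
records = [
    {"user_id": "alice_01", "text": "first post"},
    {"user_id": "bob_02", "text": "a reply"},
    {"user_id": "alice_01", "text": "second post"},
]

# Keep the text, but store a pseudonym instead of the original identifier.
pseudonymised = [
    {"user": pseudonymise(r["user_id"], key), "text": r["text"]}
    for r in records
]
```

Note that this is pseudonymisation rather than full anonymisation: the content of a post may still identify its author, so the re-identification question still has to be answered for the remaining fields.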

Analysis

The research design and planning should specify the types of analysis to be performed on the data. When planning these analyses, consider the following:

  • What is the purpose of processing/analysis?
  • What are the privacy implications of processing personal and relational data?
  • How will the results of the analyses be reported? Will they be publicly or privately disseminated?
  • If public (e.g., academic publication), how will consent from the data subjects be obtained? If a waiver of consent is sought, on what basis?
  • How will the data subjects’ privacy be respected and their contributions honoured?
  • How is the risk of harm to data subjects being mitigated?
  • How is the risk of re-identification being mitigated?
  • How will duty of care to the data subjects be exercised?
  • How will researchers be protected, especially those working with potentially emotionally taxing topics such as hate speech or suicide?
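
As a rough illustration of how re-identification risk might be checked before results are reported, the sketch below counts how many records share each combination of indirectly identifying attributes and flags combinations that occur fewer than k times. This k-anonymity-style check is our illustrative choice, not a method this guide prescribes, and the attribute names, threshold, and sample rows are assumptions.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, str]


def group_sizes(rows: List[Row], quasi_identifiers: Sequence[str]) -> Counter:
    """Count how many rows share each combination of attribute values."""
    return Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)


def rare_groups(
    rows: List[Row], quasi_identifiers: Sequence[str], k: int
) -> List[Tuple[str, ...]]:
    """Return the combinations that appear in fewer than k rows."""
    sizes = group_sizes(rows, quasi_identifiers)
    return sorted(combo for combo, n in sizes.items() if n < k)


# Made-up sample rows; "age_band" and "region" are hypothetical attributes.
rows = [
    {"age_band": "18-24", "region": "NSW"},
    {"age_band": "18-24", "region": "NSW"},
    {"age_band": "25-34", "region": "VIC"},
    {"age_band": "25-34", "region": "VIC"},
    {"age_band": "65+", "region": "TAS"},
]

# With k=2, any combination held by a single person is flagged.
flagged = rare_groups(rows, ["age_band", "region"], k=2)
print(flagged)
```

Flagged combinations could then be generalised (for example, by widening the age band) or suppressed before publication. Passing such a check reduces the risk of re-identification but does not rule it out.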

Sharing

Sharing research data is in line with the FAIR (Findable, Accessible, Interoperable, and Reusable) principles and is often required by research funding agencies and publishers. However, sharing dynamic digital data such as social media data is fraught with challenges. Accepted best practice is to share data only in ways that are legal and ethical. Some of the key legal and ethical challenges are:

  • Privacy constraints (consent and anonymity)
  • Copyright (platform users own the copyright to their posts)
  • Terms of service restrictions
  • Research integrity (i.e., research reproducibility, transparency, provenance)

Other considerations are:

  • What data are being shared (e.g., data as collected, data as transformed, or data as analysed), and what are the potential consequences?
  • Data provenance (documentation of data origin and methodology)
  • How are the data being shared so as to ensure methodological transparency?
  • What is the cost of not sharing? What is the total cost of duplicated effort across multiple groups? How much will the shared data cost to maintain?
  • With whom will the data be shared? Are there accessibility or equity concerns?

Archiving

Archiving research data for future use and preservation is the domain of data archives. However, for the duration of the project, the following aspects should be considered:

  • Archive duration
  • Potential transfer to designated national data archives
  • Integrity of data
  • Context
  • Provenance
  • Preservation
  • Dynamism
  • Transparency
  • Limited data
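
The integrity item above can be supported in practice by recording a checksum for every archived file and re-checking it later. The sketch below is one minimal way to do this with the Python standard library. The function names and the sample file are hypothetical, and a real archive may well provide its own fixity tooling.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder: Path) -> Dict[str, str]:
    """Map each file's relative path to its checksum."""
    return {
        p.relative_to(folder).as_posix(): sha256_of(p)
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def changed_files(folder: Path, manifest: Dict[str, str]) -> List[str]:
    """List files that were added, removed, or altered since the manifest."""
    current = build_manifest(folder)
    names = manifest.keys() | current.keys()
    return sorted(n for n in names if manifest.get(n) != current.get(n))


# Demonstration on a temporary folder with a made-up file.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "posts.csv").write_text("id,text\n1,hello\n", encoding="utf-8")
    manifest = build_manifest(folder)
    unchanged = changed_files(folder, manifest)   # nothing has changed yet
    (folder / "posts.csv").write_text("id,text\n1,HELLO\n", encoding="utf-8")
    modified = changed_files(folder, manifest)    # the edit is detected
```

In this sketch the manifest is held in memory; in practice it would be saved outside the archived folder so that it is not itself included in later checks.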

Resources: Platform Factsheets

These factsheets contain high-level information about a number of platforms, including their APIs*, usage policies, and terms of service, as well as some useful open-source tools. They are intended as a starting point for researchers interested in working with these platforms.

*API stands for Application Programming Interface, an interface through which software can request data or services from a platform.