The ADO Ecosystem

Introduction
What is it for?
Example workflows
- Twitter conversation analysis
- Multi-platform content analysis
What other resources are there for researchers?
The future of the ADO ecosystem

Introduction

The Australian Digital Observatory (ADO) is curating an ecosystem of resources for researchers working with digital human data from the internet. This ecosystem is intended to be useful to a wide range of disciplines, particularly humanities and social sciences, data science, public health, business, law, and many others.

Research projects are, by definition, all unique: the very purpose of research is to make new contributions. Therefore, there is no one-size-fits-all system for obtaining and pre-processing research data, especially in the humanities and social sciences. However, there are many overlapping methods and skills that are part of the research data lifecycle including (but not limited to) collecting, tidying, analysing, and publishing data.

The ADO ecosystem aims to equip researchers with a set of modular, open source, interoperable methods and processes to assist with these tasks. Researchers can engage with and benefit from the ADO ecosystem by accessing tools, training, and project support services.

This modular ecosystem approach gives flexibility, allowing researchers to pick and choose resources that are relevant and suitable for them, without being locked in to a specific stack (e.g., proprietary formats/tools, specific cloud infrastructure). The modular approach also recognises different entry levels in terms of researchers’ skills and needs. Furthermore, smaller tools are easier to maintain, and focusing on modularity and interoperability allows making use of and supporting existing tools and resources already in the community rather than reinventing the wheel.

In designing and curating the ADO ecosystem, we take inspiration from the open source software community and the Unix philosophy of designing interoperable tools that each do one thing well, and from foundational tooling ecosystems such as the R Tidyverse, which has well demonstrated the suitability of such a structure to the academic research environment. We hope to learn from these communities, and also to contribute back to them.

What is it for?

The ADO ecosystem aims to support researchers to make use of dynamic digital human data on the internet. This includes social media data, review websites, blogs, shared knowledge bases such as Wikipedia, and forums such as Reddit. Data collected from these sources can include the written words, metadata, platform structures and affordances, user contributions/changes - basically any and all of the facets of data that make up our digital internet activities.

Working with dynamic digital human data on the internet poses a number of challenges, especially for researchers from disciplines that traditionally do not require computational methods. These data-centric challenges can be framed in terms of the research lifecycle activities: explore; collect; tidy and model; store and organise; analyse; and publish. The ADO ecosystem provides resources to help researchers address these challenges in the form of:

Datasets and data sources to assist with finding and collecting data;
Software and technical utilities/systems to assist with processing and analysing data;
Guides to guide researchers on best practice, workflows, and technical methods;
Services for more specialised support (e.g., data-related consulting, bespoke software development, training).

Example workflows

Here are some examples of research workflows that benefit from our ecosystem approach.

Twitter conversation analysis

A researcher is interested in the Twitter conversation around Australia’s federal election. The workflow for answering this problem would include a preliminary feasibility check (to ensure there are sufficient data for analysis), followed by data collection, processing, and analysis. To implement this workflow, we use the following tools:

ADOReD: high-level social media analysis via an interactive dashboard. It can be used to derive a list of tweet IDs relevant to the research question.
twarc hydrate: open-source command-line tool for extracting data from Twitter API. It can be used to hydrate full tweet content from tweet IDs.
tidy_tweet: in-development open-source Python library for processing raw responses from Twitter API into an SQLite database.
tweet_exploR: in-development R package providing descriptive statistics and visualisation of Twitter data within an SQLite database produced by tidy_tweet.

The chart below illustrates the research workflow, as well as the specific tools used in each phase.

DO example workflow for Twitter analysis

Multi-platform content analysis

Many research projects require data from multiple sources. For example, a researcher is interested in the dynamics and development of content and networks related to the Critical Race Theory in light of the Black Lives Matter movement. Analyses would require data from Twitter, Youtube, Reddit, Wikipedia, as well as scholarly citation networks. A set of different bespoke tools are developed and used to support this workflow:

twarc hydrate: open-source command-line tool for extracting data from Twitter API. It can be used to hydrate full tweet content from tweet IDs.
tidy_tweet: in-development open-source Python library for processing raw responses from Twitter API into an SQLite database.
youtube_collector: custom script developed by QUT Digital Observatory to assist with collecting and processing YouTube data (which is also currently being developed into an open-source library).
reddit_collector: custom script developed by QUT Digital Observatory to assist with collecting and processing Reddit data via the open Pushshift data platform.
wikipedia_collector: custom script developed by QUT Digital Observatory to assist with collecting and processing Wikipedia data.
citation_collector: custom script developed by QUT Digital Observatory to assist with collecting and processing scholarly citation data from OpenAlex database.

What other resources are there for researchers?

The ADO Ecosystem is a part of a broader community and ecosystem of open source or openly licensed resources and methods. We’d like to share a list of projects whose work we make use of in our own resources, and which we also frequently recommend to researchers to use either with our workflows or in their own right.

Data collection resources:

twarc: Command-line utility and Python library for collecting Twitter data
Wayback Machine: The Internet Archive and its historical collection of webpages
Webrecorder: Utility for creating and curating targeted datasets of webpage archives
Pushshift API: Reddit data archive
GLAM Workbench: A broad group of utilities and instructional examples for collecting and utilising data from galleries, libraries, archives, and museums

Data analysis resources:

Language Technology and Data Analysis Laboratory (LADAL): Collection of tutorials and examples for computational linguistics and text analytics, primarily written in R
Australian Text Analytics Platform (ATAP): Tools and training for analysing, processing, and exploring text

Data storage and organisation tools:

DBeaver: Multiplatform database client
SQLite: File-based database system
Datasette: Web-based interface for SQLite databases
ClickHouse: Fast analytical column-store database

Computational skills for research resources:

The Carpentries: Community training organisation for teaching researchers computational and data science skills
The Programming Historian: A broad collection of computational skills tutorials specifically targeted at humanities and social science researchers
ResBaz: A worldwide collection of gatherings centred around digital skills for research
Hacky Hours (Queensland): Peer support sessions for technical skills for research

Foundational open source projects and communities:

The R language, and the R Tidyverse (the ADO 💙s RLadies)
The Python language and community (the ADO 💙s PyConAU and PyLadies)
The Unix, Linux, and Ubuntu operating systems and communities
The broader open source and open data communities (the ADO 💙s linux.conf.au)

The future of the ADO ecosystem

The current state of the ADO ecosystem is just the start of our journey! We plan to continue creating more tutorials, tools, documented workflows, and more, as well as refining and developing the resources already published so that they stay relevant and useful to researchers’ needs. We always love to hear from researchers and the rest of the community so that we can share, discuss, and work together to solve problems and enable research. Reach out to us anytime, and watch this space!