Dr. Neal Reeves - King’s College London
Professor Elena Simperl - King’s College London
Dr. Laura Koesten - University of Vienna
Aneta Swianiewicz - King’s College London
Andrei Balcau - King’s College London
The past year and a half has seen an unprecedented increase in public exposure to charts and data visualisations, with governmental, research and media organisations providing frequent updates on the progress, spread and effects of the Covid-19 pandemic. Terminology such as 'flatten the curve', and discussions of projections and the possibility of a peak in cases, have become frequent topics of political and news campaigns. These visualisations have played a crucial role in guiding public behaviour and stressing the importance of preventative measures (UK Government, 2020). At the same time, we have seen a growing prevalence of anti-science views (e.g., anti-vaccination sentiment and conspiracy theories). Moreover, there are significant divergences in opinions and in understanding of what data such as case numbers are showing us – this can currently be seen in the UK in the responses from diverse groups to the proposals to end restrictions in light of the current case numbers. These issues pose substantial risks, both during the Covid-19 pandemic and in future crises and challenges (Full Fact, 2021).
We wish to understand the role that artificial intelligence can play in supporting participants' understanding and critical evaluation of visualisations. We have gathered thousands of judgements of the readability and trustworthiness of seven types of data visualisation, with types and parameters derived from existing state-of-the-art datasets (Vougiouklis et al., 2020) and informed by best practice in chart design (Cairo, 2019). We have then trained a series of machine learning models to predict how members of the public might evaluate different charts. We aim to show these machine judgements to crowdworkers and assess how they affect workers' own judgements, in comparison with exposing workers to human judgements and with a prior baseline of no intervention.
We would compare results between conditions and with our existing baseline to analyse how human and machine judgements influence opinions, and when such judgements are best introduced to support trust and readability. We aim to use these findings to generate recommendations for when and how to present such judgements.
We will generate scores from our models for 224 charts (32 for each of the seven chart types), varying these charts according to the trend in the data (positive or negative), the strength of the trend (strong or weak) and visual features including the presence of a chart title, the presence of a source, and whether the chart is in colour or black and white. These are the same features as in the charts used to generate our baseline and build our model.
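As a sketch, the full chart set can be enumerated as the Cartesian product of the seven chart types and the five binary parameters. The type and parameter names below are illustrative placeholders, not the study's actual labels:

```python
from itertools import product

# Illustrative names only; the real chart types and parameters follow the study design.
CHART_TYPES = ["bar", "line", "pie", "scatter", "area", "stacked_bar", "bubble"]
BINARY_PARAMS = ["positive_trend", "strong_trend", "has_title", "has_source", "in_colour"]

# Each configuration pairs a chart type with a True/False setting for every parameter.
configurations = [
    {"type": t, **dict(zip(BINARY_PARAMS, flags))}
    for t, flags in product(CHART_TYPES, product([True, False], repeat=len(BINARY_PARAMS)))
]

print(len(configurations))  # 7 types x 2^5 parameter settings = 224 charts
```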
We will then build a task that shows a worker a particular image and asks the worker to provide a score out of 4 for how easy they find the chart to read, and a score out of 4 for how much they trust the chart to be an accurate visualisation. Workers will then be shown a new screen detailing the judgement of the algorithm or the crowd, depending on the experimental condition, and asked whether they wish to change their mind, being prompted once again to give scores out of 4 for readability and trust.
We would run the experiment four times, with four experimental conditions. One pair of conditions would use human judgements from our existing baseline, and the other pair machine judgements from our machine learning model. In condition 1, the human judgement will be shown up-front and workers will then be asked to give their scores. In condition 2, the human judgement will be shown only after the worker has given their score, and the worker will be prompted to give their score again. Conditions 3 and 4 repeat conditions 1 and 2, but using machine judgements. We will compare scores between conditions and against the baseline to understand how the source of the provided score (human or machine) and the timing (immediate or delayed) might influence a worker's trust in and ability to understand a chart.
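The four conditions form a 2×2 factorial design over judgement source and timing; a minimal sketch (labels are illustrative):

```python
from itertools import product

# 2x2 design: source of the shown judgement x when it is disclosed to the worker.
sources = ["human", "machine"]          # conditions 1-2 vs conditions 3-4
timings = ["before_rating", "after_rating"]

conditions = {
    i + 1: {"source": s, "timing": t}
    for i, (s, t) in enumerate(product(sources, timings))
}

for num, cond in conditions.items():
    print(num, cond)
```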
Our sample size is largely driven by two concerns. Firstly, we have built our model using seven distinct types of chart and five binary parameters, so each type of chart can appear in 32 configurations. Each chart would be used four times (once per condition), giving us a total of 896 charts. Secondly, because our aim is to understand how the public view and interact with charts, we aim for a representative sample of the UK population in terms of (at a minimum) age, ethnicity and gender. To ensure a sufficient number of responses from diverse groups and to provide redundancy for aggregating responses, we would want each chart to be judged 5 times in each experiment. This gives us a total of 4480 workers.
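The sample-size arithmetic above can be made explicit as:

```python
# Sample-size arithmetic from the study design.
chart_types = 7
configs_per_type = 2 ** 5      # five binary parameters -> 32 configurations per type
conditions = 4                 # (human vs machine judgement) x (before vs after rating)
judgements_per_chart = 5       # redundancy for aggregation and demographic coverage

unique_charts = chart_types * configs_per_type            # 224
chart_instances = unique_charts * conditions              # 896
total_workers = chart_instances * judgements_per_chart    # 4480

print(unique_charts, chart_instances, total_workers)
```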
Based on our previous data collection, we estimate an average task completion time of approximately 2 minutes. To ensure workers are paid fairly and in excess of the UK minimum wage, we would aim for a reward of $0.30 per submission. For 4480 workers, the total cost is £2,016.00.
Ideally we would use the “representative sample” option, leading to a total cost of £6,010.44 (£1,502.61 for each experimental condition). However, we could run the study without this cost by launching each condition in stages (e.g., recruiting a quarter of respondents at a time), analysing the demographics represented in the responses, and specifically targeting under-represented groups in subsequent recruitment rounds.
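The staged alternative amounts to a quota check after each recruitment wave. A hypothetical sketch, where the demographic groups, target shares and tolerance are all illustrative placeholders rather than actual UK census figures:

```python
# Hypothetical staged-recruitment check: compare observed demographic shares in the
# responses collected so far against population targets, and flag groups to recruit next.
UK_TARGETS = {"18-34": 0.28, "35-54": 0.33, "55+": 0.39}  # illustrative shares only


def under_represented(observed_counts, targets, tolerance=0.05):
    """Return the groups whose observed share falls short of target minus tolerance."""
    total = sum(observed_counts.values())
    return [
        group for group, share in targets.items()
        if observed_counts.get(group, 0) / total < share - tolerance
    ]


# Example wave: older respondents are under-represented, so target them next.
print(under_represented({"18-34": 60, "35-54": 30, "55+": 10}, UK_TARGETS))
```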
We aim to publish our findings in a peer-reviewed journal. We would make any paper at a minimum green open access, uploading the final version to an institutional repository and, prior to this, posting a pre-print on arXiv or a similar service. Our study materials will all be made available in a GitHub repository. We will publish the analysis code and dataset using Zenodo or a similar service, ensuring that a link is included within the paper and pre-print.