[Proposal] Evaluation of bias in commercial facial recognition algorithms


Consumer research is the part of market research that aims to understand consumers’ needs and preferences regarding products and services. Most consumer research is still manual and conducted in person. A major risk is poor data caused by human bias and error, which can lead to wasted marketing spend, inaccurate targeting, and lost customers [Forrester Research 2019].

Our company, Physalis, is disrupting consumer research through automation with AI algorithms. While automation and digitisation were intended to eliminate human bias, commercial APIs for facial recognition seem to amplify it.

Commercial API providers claim high accuracy for their facial recognition algorithms. However, these algorithms tend to have been developed using biased data. Research has shown that the accuracy of some of these algorithms is significantly lower for certain demographics (El Khiyari & Wechsler, 2016). Grother, Quinn & Phillips (2011) reported that the accuracy of some recognition algorithms depends on the biographic data of individuals, e.g., younger persons are more difficult to recognize than older persons for certain algorithms.

We want to use the research grant to study this bias in greater detail across several ethnicities, age bands, and genders. We will collect visual impressions of people exposed to a stimulus. Participants will self-report their sentiments, and we will double-check these reports visually. We will use this dataset to benchmark commercial APIs from Google, IBM, and Microsoft. With this research we will understand, for each API, what bias it has and which ethnicities, age bands, and genders are affected.

Research methodology

Data Collection

The data for this research will be collected as video. We will take an action research design approach in which participants are shown a stimulus, such as an offensive video or sentence, and their reactions are recorded as they view it. This will allow us to analyse the sentiments and emotions that the commercial APIs detect.

The collected data will enable us to determine the accuracy of the detected emotions across different groups, in particular by gender, ethnicity and age group. To analyse whether the performance of the APIs differs significantly across demographics, participants will be selected based on age, gender and ethnicity.

Gender will be split into 2 categories (male and female). Age will be split into 3 categories (18-30, 30-50 and 50-70). Ethnicity will be split into 5 categories (White, Black, Asian, Mixed and Others), following the list of ethnic groups recommended by the UK government. We will ensure that the number of participants is balanced across demographic groups, so that each group makes up a similar proportion of the sample. For example, the 30-50 age group will have a similar ethnicity distribution to the 50-70 age group. Having a balanced dataset will mitigate bias in our research process.
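The balancing described above can be illustrated with a minimal sketch. It assumes that "balanced" means an even split across every combination of gender, age band and ethnicity, and it uses the 10,000 research subjects stated in the Sample size section. The function and variable names are hypothetical and not part of our pipeline.

```python
from itertools import product

# Category labels as listed in the proposal.
GENDERS = ["male", "female"]
AGE_BANDS = ["18-30", "30-50", "50-70"]
ETHNICITIES = ["White", "Black", "Asian", "Mixed", "Others"]


def balanced_quota(total_participants):
    """Split participants as evenly as possible across all demographic cells."""
    cells = list(product(GENDERS, AGE_BANDS, ETHNICITIES))
    base, remainder = divmod(total_participants, len(cells))
    # The first `remainder` cells receive one extra participant so that
    # the quotas add up to the requested total.
    return {
        cell: base + (1 if index < remainder else 0)
        for index, cell in enumerate(cells)
    }


quota = balanced_quota(10_000)
print("Number of demographic cells:", len(quota))
print("Smallest and largest cell:", min(quota.values()), max(quota.values()))
```

With 2 x 3 x 5 = 30 cells, an even split of 10,000 subjects gives 333 or 334 subjects per three-class cell.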


Once the videos have been captured, we will run them through our speech recognition algorithm to obtain the transcripts needed for sentiment analysis. The videos will also be passed through our data labelling algorithm, which labels each one as positive, neutral or negative for sentiments and emotions. Finally, the text analytics and facial emotion recognition algorithms of the commercial APIs will be applied to the videos to obtain the sentiments and emotions identified for each participant.

A commercial API's prediction for a video will be counted as correct if it matches the data label described above. Similar to El Khiyari & Wechsler (2016), our analysis will be split into 3 parts. Part I will analyse the algorithms on one-class demographic groups such as White, 18-30 or male. Part II will be based on 2-class demographic groups such as 18-30 males or White females. Part III will be based on 3-class demographic groups such as 18-30 Black males or 50-70 Asian females. This analysis will enable us to compare the performance of each algorithm across multiple demographics and provide evidence that commercial facial recognition algorithms are biased and should be used with care.
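The three-part analysis can be sketched as follows. This is a minimal illustration, assuming each video is stored as a record with hypothetical field names (ethnicity, age_band, gender, label, prediction). The four records are made-up toy data, not study results.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical field names for the three demographic attributes.
ATTRIBUTES = ("ethnicity", "age_band", "gender")


def accuracy_by_group(records, n_classes):
    """Accuracy per demographic group built from n_classes attributes at a time."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record in records:
        correct = record["prediction"] == record["label"]
        for attrs in combinations(ATTRIBUTES, n_classes):
            # A group is identified by its (attribute, value) pairs.
            group = tuple((attr, record[attr]) for attr in attrs)
            totals[group] += 1
            hits[group] += int(correct)
    return {group: hits[group] / totals[group] for group in totals}


# Toy records for illustration only.
records = [
    {"ethnicity": "White", "age_band": "18-30", "gender": "male",
     "label": "positive", "prediction": "positive"},
    {"ethnicity": "White", "age_band": "18-30", "gender": "female",
     "label": "negative", "prediction": "neutral"},
    {"ethnicity": "Black", "age_band": "50-70", "gender": "male",
     "label": "neutral", "prediction": "neutral"},
    {"ethnicity": "Black", "age_band": "18-30", "gender": "male",
     "label": "negative", "prediction": "positive"},
]

part_1 = accuracy_by_group(records, 1)  # e.g. White, 18-30, male
part_2 = accuracy_by_group(records, 2)  # e.g. 18-30 males
part_3 = accuracy_by_group(records, 3)  # e.g. 50-70 Black males

for group, accuracy in sorted(part_1.items()):
    print(group, round(accuracy, 2))
```

Running the same function once per API would give one accuracy table per part, which can then be compared across demographic groups.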

Sample size

Grother, Quinn & Phillips (2011) evaluated facial recognition algorithms from seven commercial providers based on a sample size of 4 million, while El Khiyari & Wechsler (2016) evaluated the accuracy of automatic facial verification based on a sample size of approximately 3,000 research objects. Based on our literature review and cost limitations, we have opted for a sample size of 10,000 research subjects, which falls within the range of the two studies mentioned above. Each research subject will be exposed to 4 stimuli, so we will collect 40,000 data points overall.

Study costs

In a survey, we assume that it takes up to 1 minute to collect each data point. This has been confirmed in pre-tests, where the average time for collecting a data point was less than a minute. Note that each Prolific respondent will contribute 4 data points.

Given these assumptions, we need 40,000 x 1 minute of participant time on Prolific at an hourly rate of £10.20. Overall, the costs for Prolific will be £9,520, including Prolific’s fees and VAT. Any other costs for commercial APIs, hosting, data collection etc. will be handled separately by Physalis.
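The cost figures above can be checked with a short calculation. The sketch below uses only the numbers stated in this section. The overhead percentage is derived from them and is not an official Prolific rate, and the split between service fee and VAT is not modelled.

```python
# Figures stated in the proposal.
DATA_POINTS = 40_000            # 10,000 subjects x 4 stimuli
MINUTES_PER_DATA_POINT = 1
HOURLY_RATE_GBP = 10.20
STATED_TOTAL_GBP = 9_520        # includes Prolific's fees and VAT

participant_hours = DATA_POINTS * MINUTES_PER_DATA_POINT / 60
participant_pay = participant_hours * HOURLY_RATE_GBP

# Overhead implied by the stated total (derived here, not a quoted rate).
implied_overhead = STATED_TOTAL_GBP / participant_pay - 1

print(f"Participant hours: {participant_hours:.1f}")
print(f"Participant payments: GBP {participant_pay:,.2f}")
print(f"Implied fees and VAT on top of payments: {implied_overhead:.0%}")
```

Participant payments come to about £6,800, so the stated total of £9,520 corresponds to roughly 40% on top for fees and VAT.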


Our study is pre-registered on the Open Science Framework (OSF).

Open Science

We intend to submit a paper about this study to an open-access peer-reviewed journal, and to reuse the findings on our blog and website. We will make the code we write for this study openly available. We will make any data used in this research openly available, applying the FAIR principles to make data findable, accessible, interoperable, and reusable. Our data repository will be osf.io.


El Khiyari H., Wechsler H. (2016) Face Verification Subject to Varying (Age, Ethnicity, and Gender) Demographics Using Deep Learning. Journal of Biometrics & Biostatistics. https://www.hilarispublisher.com/open-access/face-verification-subject-to-varying-age-ethnicity-and-genderdemographics-using-deep-learning-2155-6180-1000323.pdf

Grother P. J., Quinn G. W., Phillips P. J. (2011) Report on the Evaluation of 2D Still-Image Face Recognition Algorithms. NIST Interagency Report 7709. National Institute of Standards and Technology. https://nvlpubs.nist.gov/nistpubs/Legacy/IR/nistir7709.pdf

List of ethnic groups. https://www.ethnicity-facts-figures.service.gov.uk/style-guide/ethnic-groups

Forrester Research, Inc. (2019) Why Marketers Can’t Ignore Data Quality. July 2019.

Lunter J. (2020) Beating the bias in facial recognition technology. Biometric Technology Today. 2020 Oct; 2020(9): 5–7.