Analysis of response times to Likert scale questions

Hi all,

We have just run a really small survey (3 pages, takes me 3 minutes to complete, 240 or so participants). Page 1 is consent + Prolific ID, page 2 is demographics, and page 3 is the actual questions of interest: 10 simple Likert questions.

On page 3 we are collecting timing data on all mouse clicks. Answer order does not matter, and when participants change previously selected answers the times get added together for that question, so clicking quickly and then thinking does not skew these results. Have a look at these statistics:

Time taken after loading the page to respond to the first question:

Mean 8.12 s, median 5.89 s, fastest 10%: 3.6 s

Median time for each participant to respond to subsequent Likert-style questions:

Mean 4.12 s, median 3.88 s, fastest 10%: 2.34 s

Overall time to respond to items 2-10 (after removing participants who were more than 3 SD slower than the mean):

Mean 62.47 s, median 44.31 s, fastest 10%: 27.05 s

Drilling into the fastest ~10% there:

Lastly, looking at the overall distribution of inter-item response times (i.e. the duration between clicks 2 and 3, 3 and 4, and so forth), cropping the very long tail at 1 SD:

Mean 6.12 s, median 3.76 s, fastest 10%: 1.75 s

And if we focus on the fastest 50% of responses:
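
For anyone curious, the per-question times and the percentile summaries above could be reproduced along these lines. This is a minimal sketch: the long-format click log and its column names (participant_id, question, seconds) are assumptions for illustration, not our actual export.

```python
import pandas as pd

# Hypothetical long-format click log: one row per click, with the time (in
# seconds) attributed to the question that was clicked. Column names are
# assumptions for illustration, not the real export schema.
clicks = pd.read_csv("click_log.csv")  # columns: participant_id, question, seconds

# When a participant changes a previously selected answer, the extra time is
# attributed to that question, so we sum per participant and per question.
per_question = (
    clicks.groupby(["participant_id", "question"])["seconds"].sum().reset_index()
)

def summarise(series: pd.Series) -> str:
    """Mean, median and the 10th percentile (the 'fastest 10%' cut-off)."""
    return (
        f"mean {series.mean():.2f} s, median {series.median():.2f} s, "
        f"fastest 10%: {series.quantile(0.10):.2f} s"
    )

# Time after loading the page to respond to the first question.
print(summarise(per_question.loc[per_question.question == 1, "seconds"]))

# Median per-participant time for the subsequent items.
later = per_question[per_question.question > 1]
print(summarise(later.groupby("participant_id")["seconds"].median()))

# Overall time for items 2-10, dropping participants more than 3 SD slower.
totals = later.groupby("participant_id")["seconds"].sum()
totals = totals[totals <= totals.mean() + 3 * totals.std()]
print(summarise(totals))

# Inter-item response times pooled across participants, cropping the long
# tail at 1 SD above the mean.
inter = later["seconds"]
print(summarise(inter[inter <= inter.mean() + inter.std()]))
```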

The three of us on the team have also tried to answer the questions as fast as possible, while still reading the questions and thinking briefly about each answer. Our fastest attempt (out of 9) was 27 seconds for items 1-9, with a median time between responses of 2.25 seconds. But that attempt basically didn’t involve much thinking at all.

Clearly a lot of participants are significantly faster than we are. But how fast is too fast? The under-2-second (median) response times do make me worried about the quality of our data.

What do you think?


Dear Scientific

I think I may have taken your survey; I wrote to the researchers advising the use of attention checks but was told that the objective of the survey was to analyse data quality. I did not press the point, but I think the use of attention checks affects the care with which participants approach the study (but it also affects the data).

Did you include:

  • Attention checks?
  • Nonsense attention checks?

It seems to me that a certain proportion of participants, and not always the same ones, depending on how busy they are and how under-privileged they are feeling at any given time, have a tendency to answer largely at random, perhaps looking out for attention checks. If there are checks in the survey at various levels of obviousness, even these participants, or some of them, will be encouraged to take the survey more seriously.

Did you allow the use of mobile devices (smartphones and tablets)? If so, then some respondents will, I guess, have been answering on buses and the like.

What proportion of respondents averaged less than 2 seconds per question (your suggested cut-off line)? The fastest 10 percent were at 1.75 seconds, so I am guessing that fewer than 15% averaged less than 2 seconds. This to me is surprisingly good, especially if your survey had no attention checks :open_mouth:

Looking forward to your response, and more research of this type.

Tim


hi @scientific, I’m not sure what to make of this, but I would hesitate to think that actual speed could be a good indicator of “how fast is too fast”.

If you’re thinking through all the possibilities, some other questions - in addition to @timtak 's insights above - might be:

  1. What are the questions that participants respond to?
  2. Are they measuring a single construct or are they of diverse topics?
  3. Do they require much thought in general or can people use their intuition?
  4. Are all or some reverse-scored?
  5. Can they see all the questions from the beginning?
  6. Is it possible that they are similar to questions that participants have seen before?

I might check test-retest reliability on the responses to get a better sense of whether the questions are being responded to in a consistent manner.
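
If it helps, a per-item test-retest check could look roughly like this sketch. The wide-format frames wave1/wave2 (indexed by participant, one column per item) are an assumed layout, not a prescription.

```python
import pandas as pd

def test_retest(wave1: pd.DataFrame, wave2: pd.DataFrame) -> pd.Series:
    """Spearman correlation between the two waves, item by item.

    Both frames are assumed to be indexed by participant ID and to share
    the same item columns (e.g. q1 ... q10).
    """
    common = wave1.index.intersection(wave2.index)  # people who did both waves
    w1, w2 = wave1.loc[common], wave2.loc[common]
    return pd.Series(
        {item: w1[item].corr(w2[item], method="spearman") for item in w1.columns}
    )

# Hypothetical usage:
# retest = test_retest(pd.read_csv("wave1.csv", index_col="pid"),
#                      pd.read_csv("wave2.csv", index_col="pid"))
# print(retest.sort_values())
```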

Back to the idea of quick responses - the first time that I administered an online survey to undergraduates I thought there might be a problem with data quality because lots of undergraduates were way faster at responding to survey questions than people on the research team.

I think I have a tendency to overestimate the amount of time/thought/effort that people put into Likert-style responses. I also think that there is a big, broad question about how and when it’s valid to use Likert-style responses in general, especially when averaged to create “construct scores”, but it doesn’t seem like that’s the question at hand here. Any chance you have data from a demographically similar sample completing the questions in-person? That would be neat, but it would still open up the possibility that participants on online platforms have more experience answering questions of this style (perhaps mentally regarding question content or heuristics, and literally clicking through them).


Hi Tim and P.P

Thank you for your insights and questions. Let me elaborate: the survey was meant to be fast, a pre-study to scope out whether there are enough individuals with relevant experiences on Prolific to warrant a full study. We used a lot of pre-screeners, which limited our participant pool to about 3,000, and we designed the survey to be quick to complete.
To answer your questions:

a) Timtak, you did not take our survey; we have had no messages from participants at all, actually!

b) We didn’t use attention checks. Given there were only 10 questions, they would stick out like sore thumbs, and in my experience from previous studies no one falls for them, especially not the random clickers or straight-liners. They make sense in longer surveys, but in my opinion not for something that takes <2 minutes.

c) What are nonsense attention checks?

d) We did allow mobile devices. Qualtrics shows the matrix questions in a different style on mobile devices, which I think will actually slow participants down, but I haven’t done the analysis.

e) Here are the statistics for the fastest bunch:
total time for 9 items under 25 seconds: 16 (7.1%)
at least two responses under 1 second each: 10 (4.4%)
average response time under 2.5 seconds: 23 (10.2%)
median response time under 2 seconds: 16 (7.1%)

1 person is in only one of the above four categories,
10 are in two,
8 are in three, and
5 are in all four.

That's 24 in total.
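
In case it is useful to see how those categories combine, here is a rough sketch of the flagging logic (the per-item timing table and its column names are assumptions, not our actual analysis code):

```python
import pandas as pd

# Assumed table: one row per participant and item, with the summed response
# time in seconds (same hypothetical layout as the sketch in the opening post).
per_item = pd.read_csv("per_item_times.csv")  # participant_id, question, seconds
later = per_item[per_item.question > 1]       # the nine later items
g = later.groupby("participant_id")["seconds"]

flags = pd.DataFrame({
    "total_under_25s": g.sum() < 25,                         # 9 items in < 25 s
    "two_subsecond": g.apply(lambda s: (s < 1).sum() >= 2),  # >= 2 answers < 1 s each
    "mean_under_2_5s": g.mean() < 2.5,
    "median_under_2s": g.median() < 2,
})

n_flags = flags.sum(axis=1)                 # how many categories each person hits
print(n_flags.value_counts().sort_index())  # distribution over 1, 2, 3 or 4 flags
fast_ids = n_flags[n_flags > 0].index       # the participants to retest
```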

1-3) The survey was on a single topic (experiences around data protection), but with fairly diverse points, asking participants to consider experiences from the last 4 years. From our pre-Prolific tests we found that participants had no problem comprehending the questions, but it took them some time to think of past experiences. Other questions were about benefits and drawbacks. In the follow-up study we would ask participants to write those down, but here we just wanted to find out whether Prolific participants had sufficient experience with data protection to be able to give us useful answers.

  4. There are some items where we would expect responses to be on the opposite side of the spectrum, but we are not using constructs. Those questions were distributed differently, but I haven’t done an analysis of deviation from the mean answer by response time (which would be interesting… perhaps cluster the answers first and then see how response times correlate with distance from the clusters? The assumption here is that participants who answer randomly don’t follow the trend of other participants). A rough sketch of that idea is below, after this list.

  5. Online yes; on mobile, no. We are using Qualtrics with a matrix question.

  6. This is extremely unlikely. The questions are topic-specific, and the topic has not received any academic attention as far as we know (and we did a reasonable literature review beforehand).

  7. Test-retest is what we are thinking of. I don’t think Prolific is keen on rejecting participants who fail test-retest questions, but our primary worry is data quality; if we have to throw away 25% of responses, that would still be OK. How do we best do a test-retest, though? Another survey with specifically those questions? Should we be upfront with participants that this is a retest and that everyone will be paid (but that the next stage of the study requires consistent answers), or should we be more covert?

  8. Do you mean comparing it to responses on paper or in a controlled lab, potentially with eye-tracking? That would be an interesting study indeed. But no, I have no such data currently.
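
Here is the rough sketch of the clustering idea from 4) above. All names are hypothetical: responses would be a participants × items matrix of Likert answers and median_rt a per-participant median response time.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from sklearn.cluster import KMeans

# Hypothetical inputs, aligned on the same participant index.
responses = pd.read_csv("responses.csv", index_col="participant_id")
median_rt = pd.read_csv("median_rt.csv", index_col="participant_id")["median_rt"]
median_rt = median_rt.loc[responses.index]

# Cluster the answer patterns, then measure each participant's distance to the
# nearest cluster centre; random answerers should sit far from all of them.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(responses.values)
distance_to_nearest = np.min(km.transform(responses.values), axis=1)

# Does distance from the "typical" answer patterns correlate with speed?
rho, p = spearmanr(distance_to_nearest, median_rt.values)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```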

In principle I agree that speed is a poor measure of attention. Prolific’s criterion of 3 standard deviations for outliers would require participants to answer in negative time (for a right-skewed timing distribution like ours, the mean minus three standard deviations falls below zero), but clearly someone answering in <1 second per item (not to mention those who answer some items in <0.5 seconds) is unlikely to have had enough time to read the question and think about it.
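
To unpack that with made-up numbers (a quick simulation, not our data): for a right-skewed, roughly log-normal timing distribution, the mean minus three standard deviations is negative, so a 3 SD rule can never flag a fast responder.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up right-skewed inter-item times (seconds) with a median of ~4 s.
times = rng.lognormal(mean=np.log(4), sigma=0.8, size=10_000)

lower = times.mean() - 3 * times.std()
print(f"mean={times.mean():.2f} s, sd={times.std():.2f} s, "
      f"3-SD lower bound={lower:.2f} s")
# The lower bound comes out negative, so no fast click is ever an "outlier".
```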

Perhaps a future survey should automatically retest participants later on in the survey if they responded very quickly early on? :thinking:

Based on the numbers in e) above, we want to retest those 24 participants (10.7%). I will report back once I have had a chance to analyse the results, but please keep the feedback coming.

Dear Scientific

Thank you for your detailed response. Thanks to Paul for the interesting workarounds.

I agree that standard Instruction Manipulation Checks (IMCs), e.g. "Click 'Strongly disagree' for this question," stand out like a sore thumb.

And in days past, when we were only allowed to use IMCs, I confess I felt a little disgruntled with the level of leniency at that time. However, now that we are also allowed to use "nonsense checks," I think that IMCs, even if they stand out, serve a function: they let participants know that their attention is being monitored.

The really cool checks, however, are the nonsense checks.

For example, in the Rosenberg Self-Esteem Scale, alongside the genuine item "I feel that I have a number of good qualities", one might include:

"I feel that I have numerous venomous toenails"
"I am aware of my gender of pumice dependency"
"I am valuable to the extent of my underhand gorilla"
"I have a tendency to past participation goat soup"

One can vary the level of obviousness, so as to make participants aware that they are being checked and must pay attention. Some respondents will just blast through at random, hoping that you don’t reject them, but the majority of respondents will be given pause.

I am not sure of the effect of attention checks on overall response patterns. Sometimes it is important to include even the most non-prosocial participants.

Tim

Hi @scientific, this all sounds good. I think I understand the situation better now. I don’t think you’ll have substantive problems, especially if you’re okay with excluding some portion of responses from analysis. I don’t think I would exclude based on the super-quick response times, but whatever you decide, the most important thing is to be transparent and justify your decisions.

One pattern that could be a red flag to look out for is #4: when one response leads you to expect the opposite response on a reverse-scored question and instead they are similar (e.g., "all 1s").

What’s interesting to me about this now is how much participants (in this case, specifically, Prolific participants) know/remember about data protection. I suspect that people have a sort of schema in memory about what happens with their information when they complete surveys, but I would be surprised if it’s a sort of rich mental representation. I’m very curious as to whether online study participants have very strong opinions about data protection. There’s also some information about data protection that gets provided to participants when they start using Prolific.

For the test-retest, create a second study with an inclusion list of the Prolific PIDs that completed the first study (you probably won’t get all of them back, but some). I’d probably say something direct like: "You are being invited to participate in this study because you participated in a very similar survey previously. We are interested in whether survey responses to these questions are consistent over time. It’s okay if your responses are the same as the first time, and it’s also okay if they are different."

I was just thinking about people completing the exact same survey in the controlled lab setting (same web-based survey questions etc), which might provide some context about speed in an unfamiliar/controlled setting compared to the potentially familiar/uncontrolled setting. It would also get rid of variability between response devices. But, now that you mention it, eye-tracking would be a fascinating way to really dig into how participants complete survey questions.

Thank you, Tim, for that explanation of nonsense checks; I will add them to future surveys.

@paul, you do raise a good point about studying data protection on Prolific. Presumably participants here are a little more sensitised to it than the general public: they handle their anonymous IDs all the time and get to read consent forms that explain their rights etc. on many, many studies (in our study the median participant spends 11 seconds on the consent form, so they certainly are reading parts of it). I guess that's something for the limitations section.

I have had some time to analyse the responses from the 24 participants whom we invited for a second round. We altered the survey ever so slightly, changing 5 of the 11 Likert-scale items to be reverse-coded. The rest remained identical, including 5 demographics/background questions. Of the 24 participants, 15 responded.

Of those 15,

  • 10 got all 5 demographics questions correct (i.e. the same answer as before),
  • 4 got 4 correct, and
  • 1 got all of them incorrect.

The options for the demographics questions were displayed in random order, and each question had between 4 and 8 options.

Turning to the Likert questions, I calculated the deviation from the previous answer for each item, after converting the 7-point Likert scale into −3, …, +3 and inverting the reverse-coded items.
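
Concretely, the recoding and error calculation look roughly like this (a sketch; file, column and item names are placeholders):

```python
import pandas as pd

# Assumed inputs: two participants x items frames of raw 1-7 answers, plus the
# list of items that were reverse-coded in round 2. Names are hypothetical.
round1 = pd.read_csv("round1.csv", index_col="participant_id")
round2 = pd.read_csv("round2.csv", index_col="participant_id")
reversed_items = ["q2", "q5", "q7", "q9", "q11"]

def recode(df: pd.DataFrame, flip: list[str]) -> pd.DataFrame:
    """Map 1..7 onto -3..+3 and invert the reverse-coded items."""
    centred = df - 4
    if flip:
        centred[flip] = -centred[flip]
    return centred

common = round1.index.intersection(round2.index)  # the 15 who responded again
errors = (recode(round1.loc[common], [])
          - recode(round2.loc[common], reversed_items)).abs()

print(errors.stack().mean())  # overall mean error (0.90 reported below)
print(errors.mean(axis=1))    # mean error per participant
```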

The overall distribution has a mean error of 0.90, i.e. on average each participant managed to reproduce their original responses to within 0.9 ‘steps’ on the 7-point Likert scale:
[histogram of per-item errors]

Surprisingly, the error is only slightly higher for the newly reverse-coded items (0.99) than for the exact copies (0.83).

If we average the errors by participant, the mean is also 0.90, with a range of [0, 2.27].
[histogram of per-participant mean errors]

We are actually surprised by how good this is: an average test-retest change of less than one level on the Likert scale seems good to us, especially considering that here we are analysing the participants who responded extremely quickly in the first round. It seems that professional Prolific participants can indeed respond in a meaningful manner in <2 seconds per item.


Scientific

Wow. That is pretty amazing. There I was thinking that the really fast respondents were clicking at random, but in fact it seems that they are just fast!

Thank you very much.

Tim