The focal article by Landers and Behrend (2015) makes the case that samples collected on microtask websites like Amazon's Mechanical Turk (MTurk) are inherently no better or worse than traditional samples of convenience drawn from university students or organizations. We wholeheartedly agree. However, having successfully used MTurk and other online sources for data collection, we feel that the focal article gave insufficient attention to the caution required in identifying inattentive respondents and to the problems that can arise if such individuals are not removed from the dataset. Although we focus on MTurk, similar issues arise for most “low-stakes” assessments, including student samples, which seem increasingly to be collected online.
Our Experiences Using MTurk and Online Samples
Recently, we collected data on ability tests meant ultimately for selection purposes. Participants (i.e., MTurkers) were warned that data validity checks were embedded in the tests and that only attentive participants would be paid. We embedded validity items analogous to the bogus items recommended in the focal article (and by Meade & Craig, 2012). For two different tests, the responses of about 432 individuals had to be discarded in order to reach samples of 300 attentive respondents; that is, almost 42% of the respondents were inattentive. Although this figure may be atypically high, the MTurk samples we have collected or handled usually contain substantial proportions of respondents who appear to be inattentive. Clearly, great caution must be taken to remove such inattentive responders, and with proper care, equivalence can be found between MTurk and student samples (Fleischer & Mead, 2015; Fleischer, Mead, & Neuhengen, 2014).
Another problem that we have observed with low-stakes, unproctored Internet data collection is low scores on ability tests. This issue may be exacerbated by the scoring algorithm: in proctored, standardized ability testing, it is common to score omitted items (items the examinee reached but did not answer) as incorrect, because omission usually occurs when examinees are unsure of their answer. In unproctored Internet administrations, however, it can be harder to determine which items were actually attempted and whether examinees were attentive in their responses. Kahneman's (1973) conceptualization of attention and effort suggests that examinees might earnestly attempt items until boredom and/or item difficulty erode their ability to concentrate. Beyond that point, which is easy to conceptualize but difficult to pinpoint in empirical data, the participants' responses are actually more damaging than helpful. Assuming that most examinees can answer a few items attentively before they lose focus, traditional screens for chance responding will not catch these examinees. For example, on a four-option, 30-item exam, if a participant can attentively answer 10 items with a mean difficulty of 0.75, then the participant's expected score is 12.5 (0.75 × 10 + 0.25 × 20), whereas the threshold for chance responding would be seven or eight correct. Such an examinee completed two-thirds of the exam randomly but is unlikely to be removed if all participants with eight or fewer correct responses are deleted.
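To make this arithmetic concrete, the short Python sketch below (using the hypothetical values from the example above; the variable names are ours) computes the expected score of such a partially attentive examinee and compares it with the chance-responding threshold:

```python
# Expected score for an examinee who answers some items attentively and
# guesses on the rest, compared with the usual chance-responding threshold.
# The values mirror the hypothetical example in the text.

n_items = 30          # total items on the exam
n_options = 4         # response options per item
n_attentive = 10      # items answered attentively before attention lapses
p_correct = 0.75      # mean probability of answering an attempted item correctly

guess_rate = 1 / n_options                                  # 0.25
expected_total = (p_correct * n_attentive                   # 7.5 from attentive items
                  + guess_rate * (n_items - n_attentive))   # 5.0 from guessing
chance_threshold = guess_rate * n_items                     # 7.5, i.e., "seven or eight right"

print(f"Expected score: {expected_total:.1f}")              # 12.5
print(f"Chance-responding threshold: {chance_threshold:.1f}")
print("Caught by a chance-level screen?", expected_total <= chance_threshold)  # False
```

Because the expected score sits well above the chance threshold, a screen that deletes only examinees scoring at or below chance would retain this participant even though two-thirds of the responses were random.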
Inattentive respondents are a serious problem for any “low-stakes” assessment (Montgomery, 2007; Wright, 2005). Below, we outline some of the resulting issues and then describe MTurk features that may help mitigate them.
Issues With Inattention
Respondents complete self-report surveys with varying degrees of attentiveness depending on the situation. Studies that rely on self-report surveys and have incomplete data or data collected from inattentive respondents face major analytic problems (Schmitt & Stults, 1985). For instance, if respondents to an attitude survey fail to see its importance, they will respond carelessly, yielding useless data (Rogelberg, Fisher, Maynard, Hakel, & Horvath, 2001; Rogelberg, Luong, Sederburg, & Cristol, 2000). In such cases, the researcher needs to demonstrate the importance of the survey to the respondents; otherwise, they will not respond attentively. When respondents answer inattentively and decisions are based on the resulting data, serious problems, such as false conclusions, can follow.
Reliability Issues
One such issue concerns the reliability of a measure, which is directly affected by respondents’ scores and attentiveness (Thompson, 1994). Reliability indexes how closely participants’ observed scores correspond to their true scores (Allen & Yen, 1979; Marquis, Marquis, & Polich, 1986). If respondents are inattentive, their answers will not accurately reflect their “attentive true score” and will generally increase the total error (McGrath, Mitchell, Kim, & Hough, 2010; Thompson, 1994).
For ability tests, however, samples that contain both attentive and inattentive participants can inflate reliability estimates. We routinely remove examinees responding at or below chance level from short (25- to 35-item) cognitive ability tests, and we generally observe a drop in coefficient alpha in the range of 0.05 to 0.10 after removing these low scores. It is possible that we actually have a disproportionate number of individuals with profoundly low cognitive ability, but the low scores seem much more likely to reflect inattention. Unfortunately for our reliability estimates, the apparent true-score variance of the sample is smaller once these individuals are deleted, and the estimated reliability shrinks.
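The following simulation sketch illustrates this inflation with artificial data (not our actual test data; the sample sizes, item parameters, and guessing rate are arbitrary assumptions). Mixing guessing-level responders into an attentive sample raises coefficient alpha even though the added responses are pure noise:

```python
import numpy as np

def cronbach_alpha(scores):
    """Coefficient alpha for an examinees-by-items matrix of item scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(0)
n_items, n_attentive, n_careless = 30, 300, 60   # 60 careless = ~17% of the pool

# Attentive examinees: probability correct depends on a latent ability.
ability = rng.normal(size=(n_attentive, 1))
difficulty = rng.normal(size=(1, n_items))
p_correct = 1 / (1 + np.exp(-(ability - difficulty)))
attentive = rng.binomial(1, p_correct)

# Careless examinees: pure guessing on a four-option test.
careless = rng.binomial(1, 0.25, size=(n_careless, n_items))

mixed = np.vstack([attentive, careless])
print("alpha, mixed sample:   ", round(cronbach_alpha(mixed), 3))
print("alpha, attentive only: ", round(cronbach_alpha(attentive), 3))
```

Because the careless group scores systematically lower on every item, it adds shared variance across items, so the mixed-sample alpha exceeds the attentive-only alpha, just as removing low scorers from our real data lowered the estimate.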
Validity Issues
Another issue that arises from inattention is incorrect factor analysis solutions (Meade & Craig, 2012). Research has found that when as few as 10% of respondents engage in careless or inattentive responding, a clearly definable artifactual factor emerges, composed of the negatively worded items (Schmitt & Stults, 1985). Woods (2006) demonstrated that in samples containing 10% or more careless respondents, the fit of the correct one-factor solution was substantially lower than that of a two-factor solution associated with item wording. This is concerning because most measures include negatively scored items and assume that each domain being measured is bipolar (e.g., extraverted respondents score high and introverted respondents score low); in samples contaminated by inattentive respondents, researchers might therefore incorrectly adopt two-factor structures during test development (Schmitt & Stults, 1985).
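A small simulation in the spirit of these findings (our own illustrative sketch rather than a reproduction of Schmitt and Stults or Woods; the item counts, scaling constants, and contamination rate are arbitrary assumptions) shows how respondents who ignore reversed wording can open up a second, wording-based dimension:

```python
import numpy as np

rng = np.random.default_rng(1)
n_items = 10                 # 5 positively worded + 5 negatively worded items
pos, neg = slice(0, 5), slice(5, 10)

def likert(x):
    """Map continuous responses onto a 1-5 Likert scale."""
    return np.clip(np.round(3 + 1.2 * x), 1, 5)

def simulate(n, careless=False):
    theta = rng.normal(size=(n, 1))                  # latent trait
    noise = rng.normal(scale=0.6, size=(n, n_items))
    resp = np.empty((n, n_items))
    resp[:, pos] = likert(theta + noise[:, pos])
    if careless:
        # Careless respondents ignore the reversal and answer negatively
        # worded items as if they were positively worded.
        resp[:, neg] = likert(theta + noise[:, neg])
    else:
        resp[:, neg] = likert(-theta + noise[:, neg])
    resp[:, neg] = 6 - resp[:, neg]                  # reverse-score negative items
    return resp

attentive = simulate(450)
contaminated = np.vstack([attentive, simulate(50, careless=True)])  # ~10% careless
for label, data in [("attentive only", attentive), ("10% careless", contaminated)]:
    eigvals = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]
    print(f"{label:15s} first two eigenvalues: {eigvals[0]:.2f}, {eigvals[1]:.2f}")
```

In the contaminated sample, correlations between positively worded and reverse-scored negatively worded items are attenuated, so the second eigenvalue of the item correlation matrix should rise noticeably relative to the clean sample, mimicking the artifactual wording factor described above.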
Item Quality
Inattentive samples might also lead to lower quality items. We conducted a meta-analysis of item characteristics for positively and negatively worded items in personality measures (Huang, 2015). One crucial finding was that sample motivation moderated item quality: the quality (IRT item discrimination or CFA factor loading) of both positively and negatively worded items was similarly acceptable in motivated samples, whereas a “negatively worded item effect” (Sliter & Zickar, 2014) appeared in low-motivation samples. Our interpretation is that highly motivated samples included fewer careless responses.
MTurk Addresses Inattention
Using MTurk, researchers can compensate only those participants who appear to have answered the survey attentively, provided that attentiveness can be measured. Researchers must request the MTurkers’ unique IDs and then instruct MTurk not to pay those who are identified as inattentive. In MTurk terminology, the worker is not paid because their work on the human intelligence task (HIT) was inferior, and this results in a negative mark for the MTurker.
Negative marks count against an MTurker’s rating, which is important for the MTurker to get further work. An additional way that MTurk can be used to ensure quality data, then, is to restrict participation to respondents who have achieved a certain approval rate on their previous HITs. This is done when setting up a request for participants: researchers can stipulate different levels of skill, proven reliability, and previous HIT experience.
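For researchers automating this workflow, the sketch below illustrates one way it might look using the AWS boto3 MTurk client. It is a minimal sketch, not a definitive recipe: the qualification type ID is the system ID commonly documented for workers’ lifetime approval rate, the reward, file name, and thresholds are placeholders, and all parameter names and IDs should be verified against the current MTurk API documentation before use.

```python
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

# Restrict the HIT to workers whose lifetime approval rate is at least 95%.
approval_rate_requirement = {
    "QualificationTypeId": "000000000000000000L0",  # PercentAssignmentsApproved (verify)
    "Comparator": "GreaterThanOrEqualTo",
    "IntegerValues": [95],
    "ActionsGuarded": "Accept",
}

hit = mturk.create_hit(
    Title="Survey with embedded attention checks",
    Description="Only attentive work will be approved; see the HIT instructions.",
    Reward="1.00",                                   # placeholder amount
    AssignmentDurationInSeconds=3600,
    LifetimeInSeconds=86400,
    MaxAssignments=300,
    Question=open("survey_question.xml").read(),     # hypothetical ExternalQuestion XML
    QualificationRequirements=[approval_rate_requirement],
)

# Later, after scoring the embedded validity items for each assignment:
# mturk.approve_assignment(AssignmentId=assignment_id)
# mturk.reject_assignment(AssignmentId=assignment_id,
#                         RequesterFeedback="Failed embedded attention checks.")
```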
It is beyond the scope of this short commentary to fully explore the ethical issues of withholding payment from seemingly inattentive workers. An institutional review board (IRB) might construe this as contrary to the “without penalty” portion of participants’ right to withdraw participation at any time without penalty (Basic HHS Policy for Protection of Human Research Subjects, 2009). However, we also see ethical issues with rewarding attentive and inattentive participants equally, particularly when the time and effort of the inattentive respondents appear to have been much less than those of the attentive respondents. Our IRB has been concerned with ensuring that participants are fully informed of these issues; even so, some faculty have felt that this is not sufficiently protective of a vulnerable population.
One argument that it is ethical to withhold pay from inattentive MTurkers is to conceptualize their research participation as a “microjob” that they execute well or poorly. We imagine that few industrial–organizational psychologists would advocate rewarding poor job performance, particularly if these “terms” are clearly communicated to the MTurker before the HIT is accepted. Similarly, we see a moral imperative to contribute accurate information to the MTurk ecosystem, even if that means giving a poorly performing worker a negative mark. In his review, the editor noted that this issue represents a clear departure from traditional data collection, where measures might be discovered to be unusable only long after credit for study participation had been awarded. We may view this both as a positive improvement in our research methodology and as a shift in power toward experimenters, one that places a greater burden on researchers to protect the interests of participants.
Protecting the interests of participants includes using accurate detection methods and avoiding cut scores that shift the burden of errors onto participants (i.e., that minimize false negatives by inflating false positives). Research on the effectiveness of detection methods (Huang, Curran, Keeney, Poposki, & DeShon, 2012; Lee & Chen, 2011; Meade & Craig, 2012) has identified bogus items (e.g., “I have never eaten a meal” or “Please select ‘Strongly Agree’”) and response time as the two most effective approaches. There is also a sizeable literature on appropriateness measurement (Armstrong, Stoumbos, Kung, & Shi, 2007). Such methods are helpful but imperfect and may exacerbate IRB concerns about withholding payment from participants. Logistically, such measures must also be reasonably automated so that payment decisions can be made quickly; any detection method that depends on a post hoc analysis of the entire dataset cannot be used for this purpose. In our research using directed-response bogus items, we typically find that around 15%–20% of participants are identified as inattentive.
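As a concrete illustration of an automatable screen of this kind (a sketch only; the column names, bogus-item keys, and time cutoff are hypothetical and would need to be tailored and validated for a given survey), one could flag respondents who miss any directed-response bogus item or who finish implausibly quickly:

```python
import pandas as pd

def flag_inattentive(df, bogus_keys, min_seconds):
    """Return a boolean Series marking respondents flagged as inattentive."""
    failed_bogus = pd.Series(False, index=df.index)
    for item, keyed_response in bogus_keys.items():
        failed_bogus |= df[item] != keyed_response   # missed a directed item
    too_fast = df["seconds"] < min_seconds           # implausibly short completion time
    return failed_bogus | too_fast

# Hypothetical file and column names: 'bogus_1'/'bogus_2' hold the directed
# items, 'seconds' is total completion time, 'worker_id' is the MTurk worker ID.
responses = pd.read_csv("mturk_responses.csv")
flags = flag_inattentive(
    responses,
    bogus_keys={"bogus_1": "Strongly Agree", "bogus_2": "Never"},
    min_seconds=180,   # a plausible floor; calibrate to the actual survey length
)
print("Proportion flagged as inattentive:", round(flags.mean(), 3))
workers_to_withhold = responses.loc[flags, "worker_id"].tolist()
```

Because each flag can be computed as soon as a single response arrives, payment decisions do not have to wait for a post hoc analysis of the full dataset.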
Another way that MTurk can often be used to improve data quality is by collecting smaller amounts of data from each person (i.e., fewer responses per participant), which should reduce the emotional labor required of each respondent and increase the chances of attentive responding (Baer, Ballenger, Berry, & Wetter, 1997; Berry et al., 1992). However, this approach applies only to research protocols that can be broken into short segments (i.e., research robust to planned missing-completely-at-random data). For example, normative ability test data can be collected in small blocks if linking items are included in each block.
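For instance, such a block design might look like the following sketch (illustrative only; the test length, block size, and number of linking items are arbitrary assumptions):

```python
import random

# Split a hypothetical 60-item ability test into short blocks that all share
# a common set of linking items, so the blocks can later be placed on a
# common scale (e.g., via IRT linking).
random.seed(0)
items = [f"item_{i:02d}" for i in range(1, 61)]
linking_items = items[:6]          # administered in every block
unique_items = items[6:]           # 54 items spread across blocks
random.shuffle(unique_items)

blocks = [
    linking_items + unique_items[i:i + 9]   # 6 linking + 9 unique = 15 items per block
    for i in range(0, len(unique_items), 9)
]
for b, block in enumerate(blocks, start=1):
    print(f"Block {b}: {len(block)} items ({len(linking_items)} linking)")
```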
In conclusion, inattention is a serious concern not fully addressed by the focal article. Features of the MTurk system may allow researchers to mitigate this issue, in which case MTurk may be the preferred online source of participants. However, there are ethical questions that need to be locally addressed.