Improving the reliability of cognitive task measures: A narrative review

Published: February 24, 2023


Cognitive tasks can give researchers crucial insights into the relationship between cognitive processing and psychiatric phenomena. However, many recent studies have found that task measures exhibit poor reliability, which limits their usefulness for individual-differences research. Here we provide a narrative review of approaches to improving the reliability of cognitive task measures. Specifically, we introduce a taxonomy of experiment design and analysis strategies for improving task reliability. Where appropriate, we highlight studies that are exemplary in improving the reliability of specific task measures. We hope that this article can serve as a helpful guide for experimenters who wish to design a new task, or improve an existing one, to achieve sufficient reliability for use in individual-differences research.


