From Classical Methods to Generative Models: Tackling the Unreliability of Neuroscientific Measures in Mental Health Research

Published: January 11, 2023


Advances in computational statistics and corresponding shifts in funding initiatives over the past few decades have led to a proliferation of neuroscientific measures being developed in the context of mental health research. Although such measures have undoubtedly deepened our understanding of neural mechanisms underlying cognitive, affective, and behavioral processes associated with various mental health conditions, the clinical utility of such measures remains underwhelming. Recent commentaries point toward the poor reliability of neuroscientific measures as a partial explanation for this lack of clinical translation. Here, we 1) provide a concise theoretical overview of how unreliability impedes clinical translation of neuroscientific measures; 2) discuss how various modeling principles, including those from hierarchical and structural equation modeling frameworks, can help to improve reliability; and 3) demonstrate how to combine principles of hierarchical and structural modeling within the generative modeling framework to achieve more reliable, generalizable measures of brain-behavior relationships for use in mental health research.
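A central idea behind the hierarchical modeling principles the article draws on (and behind the cited Kelley and Stein estimators) is partial pooling: shrinking noisy individual estimates toward the group mean in proportion to their unreliability. Below is a minimal empirical-Bayes sketch of that idea on simulated data; the task structure, sample sizes, and variance values are illustrative assumptions, not the article's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate noisy per-subject task scores: true subject-level effects
# plus trial-level measurement noise (the classic unreliability setup).
n_subj, n_trials = 30, 20
true_effects = rng.normal(0.5, 0.2, size=n_subj)
trials = rng.normal(true_effects[:, None], 1.0, size=(n_subj, n_trials))

# "No pooling": each subject's score is their own trial mean.
no_pool = trials.mean(axis=1)

# Empirical-Bayes partial pooling: shrink each subject's mean toward
# the grand mean according to the ratio of estimated true-score
# variance to total variance (a reliability-like weight).
grand_mean = no_pool.mean()
sem2 = trials.var(axis=1, ddof=1).mean() / n_trials   # noise variance of a mean
between = max(no_pool.var(ddof=1) - sem2, 1e-9)       # estimated true-score variance
shrink = between / (between + sem2)
partial_pool = grand_mean + shrink * (no_pool - grand_mean)

# Compare recovery of the true subject effects.
mse_no_pool = float(np.mean((no_pool - true_effects) ** 2))
mse_partial = float(np.mean((partial_pool - true_effects) ** 2))
print(f"MSE no pooling: {mse_no_pool:.4f}, partial pooling: {mse_partial:.4f}")
```

With these settings the shrunken estimates typically recover the true subject effects with lower mean squared error than the raw per-subject means, which is the sense in which partial pooling yields more reliable individual-difference measures.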




Insel TR, Quirion R (2005): Psychiatry as a clinical neuroscience discipline. JAMA 294: 2221–2224.
U.S. Department of Health and Human Services, National Institutes of Health, National Institute of Mental Health (n.d.): NIMH Strategic Plan for Research. Retrieved from

U.K. Research and Innovation (n.d.): U.K. Research and Innovation. Retrieved from

Winter NR, Leenings R, Ernsting J, Sarink K, Fisch L, Emden D, et al. (2022): Quantifying Deviations of Brain Structure and Function in Major Depressive Disorder Across Neuroimaging Modalities. JAMA Psychiatry.
Infantolino ZP, Luking KR, Sauder CL, Curtin JJ, Hajcak G (2018): Robust is not necessarily reliable: From within-subjects fMRI contrasts to between-subjects comparisons. NeuroImage 173: 146–152.
Hedge C, Bompas A, Sumner P (2020): Task Reliability Considerations in Computational Psychiatry. Biol Psychiatry Cogn Neurosci Neuroimaging 5: 837–839.
Blair RJR, Mathur A, Haines N, Bajaj S (2022): Future directions for cognitive neuroscience in psychiatry: recommendations for biomarker design based on recent test re-test reliability work. Curr Opin Behav Sci 44: 101102.
Hitchcock PF, Fried EI, Frank MJ (2022): Computational Psychiatry Needs Time and Context. Annu Rev Psychol 73: 243–270.
Haines N, Kvam PD, Irving LH, Smith C, Beauchaine TP, Pitt MA, et al. (2020): Theoretically Informed Generative Models Can Advance the Psychological and Brain Sciences: Lessons from the Reliability Paradox. PsyArXiv.

Botvinik-Nezer R, Holzmeister F, Camerer CF, Dreber A, Huber J, Johannesson M, et al. (2020): Variability in the analysis of a single neuroimaging dataset by many teams. Nature 582: 84–88.
Della-Maggiore V, Chau W, Peres-Neto PR, McIntosh AR (2002): An empirical comparison of SPM preprocessing parameters to the analysis of fMRI data. NeuroImage 17: 19–28.
Poline J-B, Strother SC, Dehaene-Lambertz G, Egan GF, Lancaster JL (2006): Motivation and synthesis of the FIAC experiment: Reproducibility of fMRI results across expert analyses. 27.

Fournier JC, Chase HW, Almeida J, Phillips ML (2014): Model Specification and the Reliability of fMRI Results: Implications for Longitudinal Neuroimaging Studies in Psychiatry. PLOS ONE 9: e105169.
Gorgolewski KJ, Storkey AJ, Bastin ME, Whittle I, Pernet C (2013): Single subject fMRI test–retest reliability metrics and confounding factors. NeuroImage 69: 231–243.
Korucuoglu O, Harms MP, Astafiev SV, Golosheykin S, Kennedy JT, Barch DM, Anokhin AP (2021): Test-Retest Reliability of Neural Correlates of Response Inhibition and Error Monitoring: An fMRI Study of a Stop-Signal Task. Front Neurosci 15. Retrieved December 11, 2022, from

Elliott ML, Knodt AR, Ireland D, Morris ML, Poulton R, Ramrakha S, et al. (2020): What Is the Test-Retest Reliability of Common Task-Functional MRI Measures? New Empirical Evidence and a Meta-Analysis. Psychol Sci 31: 792–806.
Noble S, Spann MN, Tokoglu F, Shen X, Constable RT, Scheinost D (2017): Influences on the Test–Retest Reliability of Functional Connectivity MRI and its Relationship with Behavioral Utility. Cereb Cortex 27: 5415–5429.
Tang L, Yu Q, Homayouni R, Canada KL, Yin Q, Damoiseaux JS, Ofen N (2021): Reliability of subsequent memory effects in children and adults: The good, the bad, and the hopeful. Dev Cogn Neurosci 52: 101037.
Noble S, Scheinost D, Constable RT (2021): A guide to the measurement and interpretation of fMRI test-retest reliability. Curr Opin Behav Sci 40: 27–32.
Hedge C, Powell G, Sumner P (2018): The reliability paradox: Why robust cognitive tasks do not produce reliable individual differences. Behav Res Methods 50: 1166–1186.
Enkavi AZ, Eisenberg IW, Bissett PG, Mazza GL, MacKinnon DP, Marsch LA, Poldrack RA (2019): Large-scale analysis of test–retest reliabilities of self-regulation measures. Proc Natl Acad Sci 116: 5472–5477.
Gawronski B, Morrison M, Phills CE, Galdi S (2017): Temporal Stability of Implicit and Explicit Measures: A Longitudinal Analysis. Pers Soc Psychol Bull 43: 300–312.
Klein C (2020, June 3): Confidence Intervals on Implicit Association Test Scores Are Really Rather Large. PsyArXiv.

Chen B, Xu T, Zhou C, Wang L, Yang N, Wang Z, et al. (2015): Individual Variability and Test-Retest Reliability Revealed by Ten Repeated Resting-State Brain Scans over One Month. PLOS ONE 10: e0144963.
Noble S, Scheinost D, Constable RT (2019): A decade of test-retest reliability of functional connectivity: A systematic review and meta-analysis. NeuroImage 203: 116157.
Baranger DAA, Lindenmuth M, Nance M, Guyer AE, Keenan K, Hipwell AE, et al. (2021): The longitudinal stability of fMRI activation during reward processing in adolescents and young adults. NeuroImage 232: 117872.
Dang J, King KM, Inzlicht M (2020): Why Are Self-Report and Behavioral Measures Weakly Correlated? Trends Cogn Sci 24: 267–269.
Schimmack U (2021): The Implicit Association Test: A Method in Search of a Construct. Perspect Psychol Sci 16: 396–414.
Wennerhold L, Friese M (2020): Why self-report measures of self-control and inhibition tasks do not substantially correlate. Collabra Psychol 6.
Gelman A, Carlin J (2014): Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors. Perspect Psychol Sci 9: 641–651.
Marek S, Tervo-Clemmens B, Calabro FJ, Montez DF, Kay BP, Hatoum AS, et al. (2022): Reproducible brain-wide association studies require thousands of individuals. Nature 603: 654–660.
Kragel PA, Han X, Kraynak TE, Gianaros PJ, Wager TD (2021): Functional MRI Can Be Highly Reliable, but It Depends on What You Measure: A Commentary on Elliott et al. (2020). Psychol Sci 32: 622–626.

Brown VM, Chen J, Gillan CM, Price RB (2020): Improving the reliability of computational analyses: Model-based planning and its relationship with compulsivity. Biol Psychiatry Cogn Neurosci Neuroimaging 5: 601–609.
Chen G, Pine DS, Brotman MA, Smith AR, Cox RW, Haller SP (2021): Trial and error: A hierarchical modeling approach to test-retest reliability. NeuroImage 245: 118647.
Rouder JN, Haaf JM (2019): A psychometrics of individual differences in experimental tasks. Psychon Bull Rev 26: 452–467.
Lord FM, Novick MR, Birnbaum A (1968): Statistical Theories of Mental Test Scores. Oxford, England: Addison-Wesley.

Kelley TL (1947): Fundamentals of Statistics. Oxford, England: Harvard U. Press, pp xvi, 755.

Kelley TL (1927): Interpretation of Educational Measurements. Yonkers-on-Hudson, N.Y.: World Book Company.

Efron B, Morris C (1973): Stein's Estimation Rule and Its Competitors: An Empirical Bayes Approach. J Am Stat Assoc 68: 117–130.
Stein C (1956): Inadmissibility of the Usual Estimator for the Mean of a Multivariate Normal Distribution. Stanford University, Stanford, United States. Retrieved August 23, 2022, from

James W, Stein C (1961): Estimation with Quadratic Loss. Proc Fourth Berkeley Symp Math Stat Probab, Vol 1, Contrib Theory Stat: 361–380.

Efron B, Morris C (1977): Stein's Paradox in Statistics. Sci Am 236: 119–127.
McGraw KO, Wong SP (1996): Forming inferences about some intraclass correlation coefficients. Psychol Methods 1: 30–46.
Shrout PE, Fleiss JL (1979): Intraclass correlations: Uses in assessing rater reliability. Psychol Bull 86: 420–428.
Shieh G (2016): Choosing the best index for the average score intraclass correlation coefficient. Behav Res Methods 48: 994–1003.
Curran PJ (2003): Have Multilevel Models Been Structural Equation Models All Along? Multivar Behav Res 38: 529–569.
Chow S-M, Ho MR, Hamaker EL, Dolan CV (2010): Equivalence and Differences Between Structural Equation Modeling and State-Space Modeling Techniques. Struct Equ Model Multidiscip J 17: 303–332.
Olsen JA, Kenny DA (2006): Structural equation modeling with interchangeable dyads. Psychol Methods 11: 127–141.
Gelman A, Pardoe I (2006): Bayesian Measures of Explained Variance and Pooling in Multilevel (Hierarchical) Models. Technometrics 48: 241–251.
Williams DR, Martin SR, DeBolt M, Oakes L, Rast P (2020, November 20): A Fine-Tooth Comb for Measurement Reliability: Predicting True Score and Error Variance in Hierarchical Models. PsyArXiv.

Turner BM, Forstmann BU, Wagenmakers E-J, Brown SD, Sederberg PB, Steyvers M (2013): A Bayesian framework for simultaneously modeling neural and behavioral data. NeuroImage 72: 193–206.
Ahn W-Y, Krawitz A, Kim W, Busemeyer JR, Brown JW (2011): A model-based fMRI analysis with hierarchical Bayesian parameter estimation. J Neurosci Psychol Econ 4: 95–110.
Katahira K (2016): How hierarchical models improve point estimates of model parameters at the individual level. J Math Psychol 73: 37–58.
Valton V, Wise T, Robinson OJ (2020): Recommendations for Bayesian hierarchical model specifications for case-control studies in mental health.

Rouder JN, Lu J (2005): An introduction to Bayesian hierarchical models with an application in the theory of signal detection. Psychon Bull Rev 12: 573–604.
Lee MD, Bock JR, Cushman I, Shankle WR (2020): An application of multinomial processing tree models and Bayesian methods to understanding memory impairment. J Math Psychol 95.
Huys QJM, Browning M, Paulus MP, Frank MJ (2021): Advances in the computational understanding of mental illness. Neuropsychopharmacology 46: 3–19.
Pike AC, Robinson OJ (2022): Reinforcement Learning in Patients With Mood and Anxiety Disorders vs Control Individuals: A Systematic Review and Meta-analysis. JAMA Psychiatry 79: 313–322.
Eckstein M, Wilbrecht L, Collins A (2021, May 4): What do Reinforcement Learning Models Measure? Interpreting Model Parameters in Cognition and Neuroscience. PsyArXiv.

Lockwood PL, Klein-Flügge MC (2021): Computational modelling of social cognition and behaviour: a reinforcement learning primer. Soc Cogn Affect Neurosci 16: 761–771.

Zhang L, Lengersdorff L, Mikus N, Gläscher J, Lamm C (2020): Using reinforcement learning models in social neuroscience: frameworks, pitfalls and suggestions of best practices. Soc Cogn Affect Neurosci 15: 695–707.

Palestro JJ, Bahg G, Sederberg PB, Lu Z-L, Steyvers M, Turner BM (2018): A tutorial on joint models of neural and behavioral measures of cognition. J Math Psychol 84: 20–48.
SPM12 Software - Statistical Parametric Mapping (n.d.): Retrieved August 23, 2022, from

Turner BM, Forstmann BU, Love BC, Palmeri TJ, Van Maanen L (2017): Approaches to Analysis in Model-based Cognitive Neuroscience. J Math Psychol 76: 65–79.
Wilson RC, Niv Y (2015): Is Model Fitting Necessary for Model-Based fMRI? PLOS Comput Biol 11: e1004237.
Lebreton M, Bavard S, Daunizeau J, Palminteri S (2019): Assessing inter-individual differences with task-related functional neuroimaging. Nat Hum Behav 3: 897–905.
Haines N, Vassileva J, Ahn W-Y (2018): The Outcome-Representation Learning Model: A Novel Reinforcement Learning Model of the Iowa Gambling Task. Cogn Sci 42: 2534–2561.
Månsson KNT, Waschke L, Manzouri A, Furmark T, Fischer H, Garrett DD (2022): Moment-to-Moment Brain Signal Variability Reliably Predicts Psychiatric Treatment Outcome. Biol Psychiatry 91: 658–666.

Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB (2015): Bayesian Data Analysis, 3rd ed. New York: Chapman and Hall/CRC.

Farrell S, Lewandowsky S (2018): Computational Modeling of Cognition and Behavior. Cambridge University Press.

Bürkner P-C (2017): brms: An R Package for Bayesian Multilevel Models Using Stan. J Stat Softw 80: 1–28.

Carpenter B, Gelman A, Hoffman MD, Lee D, Goodrich B, Betancourt M, et al. (2017): Stan: A Probabilistic Programming Language. J Stat Softw 76: 1–32.
McElreath R (2022, August 23): Statistical Rethinking (2022 Edition) [R]. Retrieved August 23, 2022, from

Zhang L (2022, August 17): BayesCog [R]. Retrieved August 23, 2022, from