## Abstract

Canonical correlation analysis (CCA) and partial least squares (PLS) are powerful multivariate methods for capturing associations across 2 modalities of data (e.g., brain and behavior). However, when the sample size is similar to or smaller than the number of variables in the data, standard CCA and PLS models may overfit, i.e., find spurious associations that generalize poorly to new data. Dimensionality reduction and regularized extensions of CCA and PLS have been proposed to address this problem, yet most studies using these approaches have some limitations. This work gives a theoretical and practical introduction into the most common CCA/PLS models and their regularized variants. We examine the limitations of standard CCA and PLS when the sample size is similar to or smaller than the number of variables. We discuss how dimensionality reduction and regularization techniques address this problem and explain their main advantages and disadvantages. We highlight crucial aspects of the CCA/PLS analysis framework, including optimizing the hyperparameters of the model and testing the identified associations for statistical significance. We apply the described CCA/PLS models to simulated data and real data from the Human Connectome Project and Alzheimer's Disease Neuroimaging Initiative (both of *n* > 500). We use both low- and high-dimensionality versions of these data (i.e., ratios between sample size and variables in the range of ∼1–10 and ∼0.1–0.01, respectively) to demonstrate the impact of data dimensionality on the models. Finally, we summarize the key lessons of the tutorial.

Neuroimaging datasets with sample sizes of *n* > 1000 (e.g., UK Biobank, Human Connectome Project [HCP], Alzheimer's Disease Neuroimaging Initiative [ADNI]) represent a unique opportunity to advance population neuroscience and mental health (1,2,3). These datasets comprise multiple data modalities (e.g., structural magnetic resonance imaging, resting-state functional magnetic resonance imaging, mental health, cognition, environmental factors, and genetics), several of which can be high-dimensional, meaning that there are hundreds or thousands of variables per subject. Understanding the links across these different modalities is fundamental for enabling new discoveries; however, analyzing multimodal datasets with more variables than samples poses technical challenges.

The most established methods to find associations across multiple modalities of multivariate data are canonical correlation analysis (CCA) (4) and partial least squares (PLS) (5). CCA and PLS have recently become very popular, with numerous applications linking brain imaging to behavior or genetics [e.g., (6–26)]. However, when the variables in at least one modality (e.g., brain) outnumber the sample size, standard CCA and PLS models may overfit, i.e., they are more likely to find spurious associations that generalize poorly to independent samples [e.g., (26,27,28)]. Moreover, there is no unique standard CCA solution when the number of variables exceeds the sample size. Two approaches have been proposed to address this problem: 1) reducing the dimensionality of the data with principal component analysis (PCA) (9,10,12,22,24,26) and 2) using regularized extensions of CCA and PLS (11,20,23,26,27). However, most studies using these approaches have potential limitations. For instance, 1) they usually do not optimize the hyperparameters (e.g., the number of principal components [PCs] or amount of regularization) (9,10,12,15,22,24,26); 2) many studies do not test the significance of the associations using hold-out data (e.g., out-of-sample correlation) (7,9,10,11,22,26); and 3) they often do not assess the stability of the CCA/PLS model (7,9,18,21,22,23,24,25). Finally, few studies compare different CCA/PLS models and analytic frameworks across different datasets with different dimensionalities [e.g., (25,26,27)].

Several tutorial papers were recently published on CCA and PLS (29,30,31,32). Here, we complement these tutorials by discussing some important conceptual and practical aspects of these methods. These comprise 1) the advantages and disadvantages of the various CCA/PLS models, 2) the impact of PCA and regularization on these models (e.g., on overfitting and stability), and 3) the importance of the analytic framework in optimizing the models' hyperparameters and performing statistical inference.

In Part 1, we present the theoretical background of these models and discuss the most common strategies to mitigate the problems caused when the ratio between sample size and number of variables is small (e.g., around ∼0.1–0.01). We also examine the most prevalent analytic frameworks used with CCA/PLS models. In Part 2, we apply the models introduced in Part 1 to simulated data and real data from the HCP and ADNI (*n* > 500 in all). We illustrate how the different CCA/PLS models perform with data dimensionalities often used in practice (i.e., ratios between sample size and number of variables in the ranges of ∼1–10 or ∼0.1–0.01). Moreover, we show that regularization can be helpful even when the number of variables in both data modalities is smaller than the sample size. Mathematical details of the CCA/PLS models and their connections are provided in the Supplement.

## Part 1: Technical Background of CCA and PLS

### CCA/PLS Optimization and Nomenclature

CCA (4) and PLS (5) are multivariate latent variable models that capture associations across 2 modalities of data (e.g., brain and behavior). For example (Figure 1), $\mathbf{X}$ contains voxel-level brain variables and $\mathbf{Y}$ contains behavioral variables from item-level self-report questionnaires (both are matrices with rows and columns representing subjects and variables, respectively). Standard CCA/PLS models find pairs of brain and behavioral weights ${\mathbf{w}}_{x}$ and ${\mathbf{w}}_{y}$ (column vectors) such that the linear combination (weighted sum) of the brain and behavioral variables maximizes the correlation (CCA) or covariance (PLS) between the resulting latent variables, i.e., between $\mathbf{\xi}={\mathbf{Xw}}_{x}$ and $\mathbf{\omega}={\mathbf{Yw}}_{y}$, respectively.

In the PLS literature, the weights are often referred to as saliences and the latent variables as scores. In the CCA literature, the weights are often referred to as canonical vectors, the latent variables as canonical variates, and the correlations between the latent variables as canonical correlations. The brain and behavioral weights have the same dimensionality as their respective data modality (e.g., number of brain/behavioral variables) and quantify each brain and behavioral variable's contribution to the identified association. Sometimes, Pearson's correlations between the brain and behavioral variables and their respective latent variable are presented instead of the model's weights and are called structure correlations (CCA) (33) or loadings (PLS) (34) (for details, see the Supplement). The latent variables (one latent variable score per data modality and subject) quantify how the associative effect is expressed across the sample. Table 1 summarizes the different nomenclatures used in the CCA and PLS literature.

Table 1. Different Nomenclatures in CCA and PLS Literature and Summary of the Corresponding Terms

Model | Relationship | Model Weights | Latent Variable | Correlation Between Original Variables and Latent Variable
---|---|---|---|---
CCA | Mode/association | Canonical vector/coefficient | Canonical variable/variate | Structure correlation
PLS | Association | Salience | Score | Loading

CCA, canonical correlation analysis; PLS, partial least squares.

While standard CCA refers to a single method, standard PLS refers to a family of methods with different modeling aims (e.g., assuming a symmetric or asymmetric relationship between the 2 data modalities; for details, see the Supplement). Both standard CCA and PLS can be solved by iterative [e.g., alternating least squares (35), nonlinear iterative PLS (36)] and noniterative [e.g., eigenvalue problem (29,34)] methods. In the case of iterative methods, once a pair of weights is obtained, the corresponding associative effect is removed from the data (by a process called deflation) and new associations are sought.

Because standard CCA maximizes the correlation between the latent variables, it is more sensitive to the direction of the relationships across modalities, and it is not driven by within-modality variances. On the other hand, standard PLS, which maximizes covariance, is less sensitive to the direction of the across-modality relationships, as it is also driven by within-modality variances. Formally, we can see this from the optimization of these models. Standard CCA optimizes correlation across modalities:

${\mathrm{max}}_{{\mathbf{w}}_{x},{\mathbf{w}}_{y}}corr\left({\mathbf{Xw}}_{x},{\mathbf{Yw}}_{y}\right)$

(1)

Standard PLS optimizes covariance across modalities—the product of correlation and standard deviations (i.e., square root of variance):

${\mathrm{max}}_{{\mathbf{w}}_{x},{\mathbf{w}}_{y}}\phantom{\rule{0.05em}{0ex}}cov\left({\mathbf{Xw}}_{x},{\mathbf{Yw}}_{y}\right)=corr\left({\mathbf{Xw}}_{x},{\mathbf{Yw}}_{y}\right)\phantom{\rule{-0.15em}{0ex}}\sqrt{var\left({\mathbf{Xw}}_{x}\right)}\sqrt{var\left({\mathbf{Yw}}_{y}\right)}$

(2)

This also means that standard CCA and PLS are equivalent optimization problems when $var\left({\mathbf{Xw}}_{x}\right)=var\left({\mathbf{Yw}}_{y}\right)=1$, which is true when the within-modality variances are identity matrices, i.e., ${\mathbf{X}}^{T}\mathbf{X}={\mathbf{Y}}^{T}\mathbf{Y}=\mathbf{I}$.
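The decomposition of covariance used in Equation 2 is easy to verify numerically; the snippet below (with illustrative data) checks that the covariance of two latent variables equals their correlation times the product of their standard deviations:

```python
import numpy as np

rng = np.random.default_rng(1)
xi = rng.normal(size=1000)                  # latent variable of modality X
omega = 0.5 * xi + rng.normal(size=1000)    # correlated latent variable of Y

cov = np.cov(xi, omega, ddof=1)[0, 1]
corr = np.corrcoef(xi, omega)[0, 1]
sd_x = np.std(xi, ddof=1)                   # square root of the variance
sd_y = np.std(omega, ddof=1)

# cov = corr * sd_x * sd_y, the identity underlying Equation 2
assert np.isclose(cov, corr * sd_x * sd_y)
```

This is why PLS reduces to CCA when both latent variables have unit variance: the two standard deviation factors drop out and only the correlation term remains.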

### Limitations of Standard CCA/PLS Models

When the ratio between the sample size and the number of variables is similar to or smaller than 1, standard CCA/PLS models present limitations. These limitations can exist irrespective of sample size if the number of variables is large or if the variables are highly correlated. In the case of standard CCA, the key limitations are that 1) the optimization is ill-posed (i.e., there is no unique solution) when the number of variables in at least one of the modalities exceeds the sample size and 2) the CCA weights ${\mathbf{w}}_{x}$ and ${\mathbf{w}}_{y}$ are unstable when the variables within one or both modalities are highly correlated, known as the multicollinearity problem (37). Not surprisingly, these limitations might sound familiar, as standard CCA can be viewed as a multivariate extension of the univariate general linear model (38,39). The standard PLS optimization is never ill-posed and copes with multicollinearity [i.e., standard PLS weights are stable (36)]; however, standard PLS and CCA cannot perform feature selection (i.e., setting the weights of some variables to zero) and may therefore have low performance in cases in which the effects are sparse.

These limitations can be addressed by dimensionality reduction (i.e., PCA) or regularization. Regularization adds further constraints to the optimization to solve an ill-posed problem or prevent overfitting. For CCA/PLS models, the most common forms of regularization are L1-norm (lasso) (40), L2-norm (ridge) (41), and combinations of L1-norm and L2-norm regularization (elastic-net) (42).

### CCA With PCA Dimensionality Reduction

PCA transforms one modality of multivariate data into uncorrelated PCs (it is also related to whitening; see Effects of Prewhitening on CCA/PLS Models). PCA is often used as a naïve dimensionality reduction technique: PCs explaining little variance are assumed to be noise and discarded, and the remaining PCs are entered into standard CCA. However, when applied before CCA (PCA-CCA), PCA can also be seen as a technique similar to regularization: it makes the CCA model well posed and addresses the multicollinearity problem.

The number of retained PCs can be selected based on their explained variance, e.g., 99% of total variance [e.g., (9,10,22,24)]. Sometimes, the same proportion of explained variance (rather than the number of PCs) is used for both data modalities [e.g., (12,26)]. In PCA-CCA applications, often the same number of PCs is chosen for both data modalities, based on the lower-dimensional data, usually behavior [e.g., (26)]. One problem with discarding PCs with low variance is that there is no guarantee that PCs with high variance in either modality are best to link the different data modalities, while some discarded PCs might contain useful information. To address this problem, we can use a data-driven approach by selecting the number of PCs that maximizes the correlation across modalities (see CCA With PCA Dimensionality Reduction Versus RCCA in High-Dimensional Data in Part 2).

### Regularized CCA

L2-norm regularization is a popular form of regularization for ill-posed problems or for mitigating the effects of multicollinearity, originally used in ridge regression (41), which forces the weights to be small but does not make them zero. In L2-norm regularization, the added constraint corresponds to the sum of squares of all weight values. L2-norm regularization has been proposed for CCA (43), commonly referred to as regularized CCA (RCCA) (34,44,45,46). Interestingly, in RCCA, the regularization terms added to the CCA problem lead to a mixture of the standard CCA and standard PLS optimizations. We can see this from the RCCA optimization problem:

${\mathrm{max}}_{{\mathbf{w}}_{x},{\mathbf{w}}_{y}}\phantom{\rule{0.25em}{0ex}}\frac{corr\left({\mathbf{Xw}}_{x},{\mathbf{Yw}}_{y}\right)\sqrt{var\left({\mathbf{Xw}}_{x}\right)}\sqrt{var\left({\mathbf{Yw}}_{y}\right)}}{\sqrt{\left(1-{c}_{x}\right)var\left({\mathbf{Xw}}_{x}\right)+{c}_{x}}\sqrt{\left(1-{c}_{y}\right)var\left({\mathbf{Yw}}_{y}\right)+{c}_{y}}}$

(3)

where the 2 hyperparameters (${c}_{x}$, ${c}_{y}$) control the amount of regularization and provide a smooth transition between standard CCA (${c}_{x}={c}_{y}=0$, not regularized) and standard PLS (${c}_{x}={c}_{y}=1$, most regularized) (34,44). Importantly, as L2-norm regularization mitigates multicollinearity, it increases the stability of the RCCA weights. However, it also means that, similar to standard PLS, RCCA can be driven by within-modality variances. For additional connections between standard CCA, RCCA, and standard PLS, and how they are related to PCA-CCA, see the Supplement.

### Sparse PLS

L1-norm regularization was originally proposed in lasso regression (40), which sets some of the weight values to zero, resulting in variable selection and promoting sparsity. In L1-norm regularization, the added constraint corresponds to the absolute sum of weight values. Sparse solutions facilitate the interpretability of the model and may improve performance when only a subset of variables is relevant (40). However, sparsity can also introduce instability to the model if different sets of variables provide similar performance. Elastic-net regularization is a mixture of L1-norm and L2-norm regularization that combines the properties of both forms of regularization and can mitigate the instability of L1-norm regularization (42). In one popular algorithm (17), which we will refer to as sparse PLS (SPLS), hyperparameters control the amount of L1-norm regularization or sparsity. Because standard PLS can be seen as CCA with maximal L2-norm regularization (see previous section), SPLS can also be viewed as an elastic-net regularized CCA (for details, see the Supplement).

### Effects of Prewhitening on CCA/PLS Models

In machine learning, data are often whitened as a preprocessing step. Whitening transforms the original variables into new, uncorrelated features, which are normalized to have unit length (i.e., the L2-norm of each feature equals 1). Whitening is not a unique transformation; the most commonly used forms are PCA, Mahalanobis, and Cholesky whitening (47). The critical difference between PCA and PCA whitening is that PCA retains the variance of the original data, i.e., the PCs are not normalized to have unit length.

Whitening as a preprocessing step has a major drawback in CCA/PLS models: the beneficial effects of L1-norm and L2-norm regularization on the original variables can no longer be achieved, as the whitened data are the new inputs of the model. In the case of SPLS, L1-norm regularization will result in sparsity on the whitened variables (instead of on the original variables); thus, the interpretability of the results will not be facilitated. In the case of RCCA, L2-norm regularization is not active on whitened data, which means that standard CCA, RCCA, and standard PLS will yield the same results. For additional details on whitening, see the Supplement.
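To make the CCA–RCCA–PLS continuum of Equation 3 concrete, below is a small NumPy sketch (not the toolbox used in this tutorial) that solves the RCCA problem through a regularized whitening followed by a singular value decomposition; the data are illustrative:

```python
import numpy as np

def rcca(X, Y, cx, cy):
    """First pair of RCCA weights for centered X, Y (Equation 3).

    cx = cy = 0 gives standard CCA; cx = cy = 1 gives standard PLS.
    """
    n = X.shape[0]
    Cxx = X.T @ X / (n - 1)
    Cyy = Y.T @ Y / (n - 1)
    Cxy = X.T @ Y / (n - 1)
    # regularized within-modality covariances from the denominator of Eq. 3
    Mx = (1 - cx) * Cxx + cx * np.eye(X.shape[1])
    My = (1 - cy) * Cyy + cy * np.eye(Y.shape[1])

    def inv_sqrt(M):  # inverse matrix square root via eigendecomposition
        vals, vecs = np.linalg.eigh(M)
        return vecs @ np.diag(1 / np.sqrt(vals)) @ vecs.T

    Mxi, Myi = inv_sqrt(Mx), inv_sqrt(My)
    U, s, Vt = np.linalg.svd(Mxi @ Cxy @ Myi)
    return Mxi @ U[:, 0], Myi @ Vt[0]

rng = np.random.default_rng(3)
n = 300
latent = rng.normal(size=n)
X = np.outer(latent, rng.normal(size=15)) + rng.normal(size=(n, 15))
Y = np.outer(latent, rng.normal(size=12)) + rng.normal(size=(n, 12))
X -= X.mean(0); Y -= Y.mean(0)

def in_sample_corr(wx, wy):
    return abs(np.corrcoef(X @ wx, Y @ wy)[0, 1])

r_cca = in_sample_corr(*rcca(X, Y, 0.0, 0.0))  # standard CCA
r_mid = in_sample_corr(*rcca(X, Y, 0.5, 0.5))  # regularized CCA
r_pls = in_sample_corr(*rcca(X, Y, 1.0, 1.0))  # standard PLS
print(r_cca, r_mid, r_pls)
```

Because standard CCA maximizes in-sample correlation over all possible weights, `r_cca` is the largest of the three. Note also that on prewhitened data ${\mathbf{X}}^{T}\mathbf{X}={\mathbf{Y}}^{T}\mathbf{Y}=\mathbf{I}$, so `Mx` and `My` reduce to the identity for every choice of `cx` and `cy`, which is exactly why standard CCA, RCCA, and standard PLS coincide after whitening.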

### Analytic Frameworks for CCA/PLS Models

The statistical significance of the CCA/PLS model (i.e., the number of significant associative effects) can be evaluated using either a descriptive or a predictive (also referred to as a machine learning) framework. The 2 frameworks have distinct goals: the aim of the descriptive framework is to detect above-chance associations in the current dataset, whereas the aim of the predictive framework is to test whether such associations generalize to new data (47,48,49,50,51).

In the descriptive framework (Figure 2A), the CCA/PLS model is fitted on the entire sample; thus, the statistical inference is based on in-sample correlation. In this framework, there is usually no hyperparameter optimization (i.e., the number of PCs or the regularization parameter is fixed a priori). In the predictive framework (Figure 2B), the CCA/PLS model is fitted on a training/optimization set and evaluated on a test/holdout set; thus, the statistical inference is based on out-of-sample correlation. This procedure assesses the generalizability of the model, i.e., how well the association found in the training set generalizes to an independent test set. In the predictive framework, the hyperparameters are usually optimized; therefore, the training/optimization set is further divided into a training set and a validation set, and the best hyperparameters are selected based on out-of-sample correlation in the validation set. In both descriptive and predictive frameworks, permutation inference (based on in-sample or out-of-sample correlation) is often used to assess the number of significant associative effects (51,52).

Last, an important component of any CCA/PLS framework is testing the stability of the model. Usually, a bootstrapping procedure is applied to provide confidence intervals on the model's weights (51). Recently, stability selection (19,20,53,54,55) has been proposed with the aim of selecting the most stable CCA/PLS model in the first place, rather than evaluating the stability of the model post hoc. Alternatively, the stability of CCA/PLS models can be measured as the average similarity of weights across different splits of the training data, which avoids the additional computational cost of the previous 2 approaches (27). For more details on analytic frameworks, see, for example (22,27,51,56).

## Part 2: Demonstrations of CCA and PLS Analyses

### Description of Experiments

To demonstrate the properties of different CCA and PLS approaches, we applied the models introduced in Part 1 to real and simulated datasets with different dimensionalities and sample sizes. Table 2 gives an overview of all experiments.

Table 2. Summary of CCA/PLS Models on High- and Low-Dimensional Real and Simulated Data

Model | Analytical Framework | Hyperparameter Optimization | Model Hyperparameter
---|---|---|---
**High-Dimensional Data** | | |
PCA-CCA | Descriptive | None (fixed) | Number of PCs
PCA-CCA | Predictive | None (fixed) | Number of PCs
PCA-CCA | Predictive | Data-driven | Number of PCs
RCCA | Predictive | Data-driven | Amount of L2-norm regularization
Standard PLS | Predictive | None | None
SPLS | Predictive | Data-driven | Amount of L1-norm regularization
**Low-Dimensional Data** | | |
Standard CCA | Predictive | None | None
RCCA | Predictive | Data-driven | Amount of L2-norm regularization
Standard PLS | Predictive | None | None
SPLS | Predictive | Data-driven | Amount of L1-norm regularization

CCA, canonical correlation analysis; PC, principal component; PCA, principal component analysis; PLS, partial least squares; RCCA, regularized canonical correlation analysis; SPLS, sparse partial least squares.

We chose the HCP and the ADNI datasets based on 2 recent landmark studies (22,56). In the HCP dataset, we used resting-state functional magnetic resonance imaging connectivity data (19,900 and 300 brain variables in the high- and low-dimensional data, respectively) and 145 nonimaging subject measures (e.g., behavioral, demographic, lifestyle measures) of 1003 healthy subjects. In the ADNI dataset, we used whole-brain gray matter volumes (168,130 and 120 brain variables in the high- and low-dimensional data, respectively) and 31 item-level measures of the Mini-Mental State Examination of 592 elderly subjects. We generated the simulated data with a sparse signal (i.e., 10% of the variables in each modality were relevant to capture the association across modalities) and properties similar to the HCP dataset (in terms of sample size, dimensionality, and correlation between latent variables). Table 3 displays the characteristics of the real and simulated datasets. For further details of the datasets and the simulated data generation, see the Supplement.

Table 3. Characteristics of Real and Simulated Datasets

Data | HCP Low Dimensional | HCP High Dimensional | ADNI Low Dimensional | ADNI High Dimensional | Simulation Low Dimensional | Simulation High Dimensional
---|---|---|---|---|---|---
Subjects | Healthy (n = 1001) | Healthy (n = 1001) | Healthy + clinical (n = 592) | Healthy + clinical (n = 592) | Not applicable (n = 1000) | Not applicable (n = 1000)
Brain Variables | Connectivity of 25 ICA components^a (d = 300) | Connectivity of 200 ICA components^a (d = 19,900) | ROI-wise gray matter volume^b (d = 120) | Voxelwise gray matter volume (d = 168,130) | Not applicable (d = 100) | Not applicable (d = 20,000)
Behavioral Variables | Behavior, psychometrics, demographics (d = 145) | Behavior, psychometrics, demographics (d = 145) | Items of MMSE^c questionnaire (d = 31) | Items of MMSE^c questionnaire (d = 31) | Not applicable (d = 100) | Not applicable (d = 100)

ADNI, Alzheimer's Disease Neuroimaging Initiative; d, number of variables; HCP, Human Connectome Project; ICA, independent component analysis; MMSE, Mini-Mental State Examination; ROI, region of interest.

^a Data-driven brain parcellation.

^b Brain parcellation using the Automated Anatomical Labeling 2 atlas (62).

^c Screening questionnaire for dementia (63).

The PCA-CCA model was used both with fixed numbers of PCs within a descriptive framework and with an optimized number of PCs within a predictive framework. All the other CCA/PLS models were used within a predictive framework. The predictive framework was based on Monteiro *et al.* (48), who used multiple test/holdout sets to assess the generalizability and robustness of the CCA/PLS models (detailed in the Supplement). In both frameworks, permutation testing was used to assess the number of statistically significant associative effects based on in-sample and out-of-sample correlations between the latent variables, respectively. Importantly, the family structure of the HCP dataset was respected during the different data splits (training, validation, test/holdout sets) and permutations (57). We used iterative methods to solve the CCA/PLS models and applied mode-A deflation for standard PLS and SPLS and generalized deflation for standard CCA, PCA-CCA, and RCCA (for details, see the Supplement). For simplicity, we present the results for the first associative effect in most CCA/PLS experiments (for a summary of all associative effects, see Table S1). Throughout the article, we present the weights (canonical vectors for CCA models, saliences for PLS models) and latent variables obtained by the models.

We used linear mixed-effects models to compare the different CCA/PLS models on the following measures across the outer training or test sets: 1) in-sample correlation, 2) out-of-sample correlation, 3) similarity of the model weights (measured by Pearson's correlation), and 4) variance explained by the model. In addition, we compared the number of PCs between PCA-CCA models with fixed versus data-driven numbers of PCs. We report significance at *p* < .005 in all linear mixed-effects models. For further details of the linear mixed-effects analyses, see the Supplement. We also quantified the rank similarity of the weights (measured by Spearman's correlation) across the different CCA/PLS models in the real datasets.

### In-Sample Versus Out-of-Sample Correlation in High-Dimensional Data

Figure 3 and Table 4 display the in-sample and out-of-sample correlations for all experiments using all 3 high-dimensional datasets. On average, the out-of-sample correlations are lower than the in-sample correlations (*t*_{14} = 4.51, *p* = .0005). In real datasets, CCA/PLS models with dimensionality reduction or regularization provide high out-of-sample correlations in most cases, underlining that these models generalize well to unseen data. The only notable exceptions are standard PLS and SPLS, which presented significantly lower out-of-sample correlations in the HCP dataset (*F*_{2,56} = 289.30, *p* < .0001) (Figure 3B). This can be attributed to the different properties of the HCP dataset (e.g., higher noise level and nonsparse associative effect) and the fact that standard PLS and SPLS are especially dominated by within-modality variance in this dataset (Table 4).

Table 4. Main Characteristics of the First Associative Effects in the High-Dimensional Datasets Obtained With the Different CCA/PLS Models Using the Predictive Framework

Model | Brain: Stability of Weights^a | Brain: Explained Variance^b | Behavior: Stability of Weights^a | Behavior: Explained Variance^b | In-Sample Correlation^c | Out-of-Sample Correlation^d
---|---|---|---|---|---|---
**ADNI Dataset** | | | | | |
PCA-CCA (Fixed PCs) | 0.86 ± 0.00 | 8.47 ± 0.16 | 0.85 ± 0.01 | 14.91 ± 0.23 | 0.70 ± 0.00 | 0.55 ± 0.01
PCA-CCA (Data-Driven PCs) | 0.70 ± 0.01 | 5.26 ± 0.25 | 0.93 ± 0.00 | 15.73 ± 0.13 | 0.83 ± 0.01 | 0.65 ± 0.01
RCCA (L2-Reg. Opt.) | 0.82 ± 0.00 | 5.47 ± 0.06 | 0.94 ± 0.00 | 16.63 ± 0.26 | 0.98 ± 0.00 | 0.66 ± 0.01
Standard PLS | 0.96 ± 0.00 | 21.54 ± 0.16 | 0.94 ± 0.00 | 18.64 ± 0.21 | 0.44 ± 0.00 | 0.43 ± 0.01
SPLS (L1-Reg. Opt.) | 0.83 ± 0.02 | 14.05 ± 0.13 | 0.96 ± 0.01 | 15.86 ± 0.42 | 0.60 ± 0.00 | 0.61 ± 0.01
**HCP Dataset** | | | | | |
PCA-CCA (Fixed PCs) | 0.72 ± 0.01 | 0.42 ± 0.01 | 0.78 ± 0.01 | 2.67 ± 0.10 | 0.76 ± 0.00 | 0.47 ± 0.02
PCA-CCA (Data-Driven PCs) | 0.56 ± 0.02 | 0.35 ± 0.03 | 0.53 ± 0.04 | 3.73 ± 0.39 | 0.76 ± 0.01 | 0.45 ± 0.03
RCCA (L2-Reg. Opt.) | 0.78 ± 0.01 | 0.29 ± 0.01 | 0.88 ± 0.01 | 4.39 ± 0.18 | 1.00 ± 0.00 | 0.52 ± 0.02
Standard PLS-2 | 0.52 ± 0.04 | 0.50 ± 0.05 | 0.62 ± 0.05 | 8.07 ± 0.30 | 0.79 ± 0.02 | 0.21 ± 0.02
SPLS-2 (L1-Reg. Opt.) | 0.25 ± 0.04 | 0.48 ± 0.07 | 0.51 ± 0.05 | 7.23 ± 0.37 | 0.64 ± 0.04 | 0.25 ± 0.03
**Simulated Dataset** | | | | | |
PCA-CCA (Fixed PCs) | 0.74 ± 0.01 | 0.76 ± 0.01 | 0.90 ± 0.00 | 1.82 ± 0.01 | 0.80 ± 0.00 | 0.67 ± 0.01
PCA-CCA (Data-Driven PCs) | 0.96 ± 0.00 | 0.85 ± 0.00 | 0.91 ± 0.00 | 1.95 ± 0.02 | 0.73 ± 0.01 | 0.70 ± 0.01
RCCA (L2-Reg. Opt.) | 0.93 ± 0.00 | 0.77 ± 0.00 | 0.97 ± 0.00 | 1.99 ± 0.01 | 0.83 ± 0.01 | 0.71 ± 0.01
Standard PLS | 0.94 ± 0.00 | 0.84 ± 0.00 | 0.97 ± 0.00 | 2.07 ± 0.01 | 0.81 ± 0.00 | 0.71 ± 0.01
SPLS (L1-Reg. Opt.) | 0.78 ± 0.03 | 0.84 ± 0.00 | 1.00 ± 0.00 | 1.94 ± 0.01 | 0.79 ± 0.01 | 0.73 ± 0.01

Values are mean ± SEM. Note that we display the second associative effect for standard PLS (PLS-2) and SPLS (SPLS-2) in the HCP dataset, as it is the most similar to the first associative effects identified by the other models.

ADNI, Alzheimer's Disease Neuroimaging Initiative; CCA, canonical correlation analysis; HCP, Human Connectome Project; L1-reg., L1-norm regularization; L2-reg., L2-norm regularization; opt., optimized; PC, principal component; PCA, principal component analysis; PLS, partial least squares; RCCA, regularized canonical correlation analysis; SPLS, sparse partial least squares.

a Similarity of model weights measured by Pearson’s correlation between each pair of training sets of the outer data splits.

b Percent variance explained by the model relative to all within-modality variance in the training sets of the outer data splits.

c Correlation between the latent variables in the training sets of the outer data splits.

d Correlation between the latent variables in the test sets of the outer data splits.

In conclusion, we recommend embedding all models in a predictive framework that splits the data into training and test sets to assess the model’s out-of-sample generalizability.

### CCA With PCA Dimensionality Reduction Versus RCCA in High-Dimensional Data

In this section, we present the results of applying PCA-CCA and RCCA to all 3 high-dimensional datasets. We focus on experiments using the predictive framework, compare PCA-CCA with fixed versus data-driven numbers of PCs, and compare both of these models with RCCA.
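As background for this comparison, RCCA can be viewed as a canonical ridge: it shrinks each within-modality covariance matrix toward the identity, interpolating between CCA and a PLS-like, covariance-driven solution. A minimal sketch of that formulation follows (our own implementation and variable names; the regularization strengths `c_x` and `c_y` are fixed here for illustration, whereas in the experiments they are optimized):

```python
import numpy as np

def rcca_first_pair(X, Y, c_x=0.5, c_y=0.5):
    """First weight pair of regularized CCA (canonical ridge).
    c = 0 recovers standard CCA; c = 1 yields a PLS-like solution."""
    n = len(X)
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    Cxx, Cyy, Cxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n
    Rxx = (1 - c_x) * Cxx + c_x * np.eye(X.shape[1])   # shrink toward identity
    Ryy = (1 - c_y) * Cyy + c_y * np.eye(Y.shape[1])
    M = np.linalg.solve(Rxx, Cxy) @ np.linalg.solve(Ryy, Cxy.T)
    vals, vecs = np.linalg.eig(M)
    wx = np.real(vecs[:, np.argmax(np.real(vals))])    # leading eigenvector
    wy = np.linalg.solve(Ryy, Cxy.T @ wx)
    wy /= np.linalg.norm(wy)
    return wx, wy

rng = np.random.default_rng(3)
n = 50                                   # fewer samples than brain-like variables
z = rng.standard_normal((n, 1))
X = z + rng.standard_normal((n, 100))    # shared factor z plus noise
Y = z + rng.standard_normal((n, 10))
wx, wy = rcca_first_pair(X, Y, c_x=0.9, c_y=0.1)
r = np.corrcoef(X @ wx, Y @ wy)[0, 1]    # in-sample correlation of latent variables
```

Note that with n = 50 samples and 100 variables, the unregularized covariance matrix is singular; the ridge term is what makes the eigenproblem well posed.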

Figures 4A–C and 5A–C display the brain and behavioral weights and corresponding latent variables for the 3 models (note that for the HCP dataset, the brain weights were transformed into brain connection strength increases/decreases). Figure 6 compares the brain and behavioral weights using rank similarity across the models, which indicates that although the weights are similar across all 3 models, data-driven PCA-CCA and RCCA are most similar to each other. The model weights and latent variables for the simulated dataset can be found in Figures 7A–C, which suggest that all 3 models sufficiently recovered the true weights of the generative model. Nevertheless, the nonsparse models attributed nonzero weights to many nonrelevant variables (for details, see Table S2).

To further investigate the characteristics of the 3 models, Table 4 shows the stability of the weights and the variance explained by the models. The stability of the weights varied significantly across the brain and behavioral modalities (*F*_{1,804} = 84.51, *p* < .0001) and models (*F*_{2,804} = 91.63, *p* < .0001). Notably, the stability of the RCCA weights was consistently high. The explained variance varied significantly only across modalities (*F*_{1,174} = 241.55, *p* < .0001) but not across models (*F*_{2,174} = 0.31, *p* = .7303).

Next, we examined the number of PCs in the 2 PCA-CCA models. We found a significant interaction between the effects of data modality and model on the number of PCs (*F*_{1,114} = 22.63, *p* < .0001). Data-driven PCA-CCA yielded more brain PCs and fewer behavioral PCs than PCA-CCA with a fixed number of PCs (Table S3). These results confirm that lower-ranked brain PCs might also carry information that links brain and behavior and should not necessarily be discarded. Moreover, fixing the same number of PCs for both modalities might not be a good choice.

Based on these results, and as the optimal numbers of PCs can vary even across different brain–behavior associations in the same dataset, we recommend data-driven PCA-CCA over PCA-CCA with fixed numbers of PCs. Furthermore, we found that data-driven PCA-CCA and RCCA gave similar results, both having a similar regularizing effect on the CCA model.

### Sparse Versus Nonsparse CCA/PLS Models in High-Dimensional Data

In this section, we show how SPLS found associations between subsets of features in all 3 high-dimensional datasets, and we compare the SPLS results with standard PLS and RCCA.

Figures 4C–E and 5C–E display the models’ weights and latent variables (note that for the HCP dataset, the brain weights were transformed into brain connection strength increases/decreases). The first associative effect found by standard PLS and SPLS was similar to the first found by RCCA in both the ADNI and simulated datasets, but in the HCP dataset, the first associative effect identified by RCCA was more similar to the second effect found by standard PLS and SPLS (Figure 6). This is likely because the within-modality covariance matrices of the HCP dataset differ substantially from the identity matrix, so the difference between the objectives of the CCA and PLS models is more pronounced (see equations 1 and 2). The brain and behavioral weights were similar across the 3 models in both real datasets, especially for the top-ranked variables (i.e., the variables with the highest weights). Similar to RCCA, standard PLS and SPLS sufficiently recovered the true weights of the generative model; however, the SPLS model assigned fewer nonzero weights to nonrelevant variables (Figure 7C–E). These results demonstrate that when the signal is sparse, SPLS can achieve high true positive and true negative rates of weight recovery (Table S2). Table S4 shows the sparsity of the associative effects identified by SPLS.
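To illustrate how exact zeros in the weights arise, the sketch below implements an L1-penalized (soft-thresholded) power iteration on the cross-covariance matrix, in the spirit of penalized matrix decomposition approaches to sparse PLS (17). The thresholds `lam_x` and `lam_y` are illustrative fixed values, not optimized hyperparameters, and this is not the toolkit's implementation:

```python
import numpy as np

def soft_threshold(w, lam):
    """Elementwise soft-thresholding: shrinks toward zero, exact zeros below lam."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def sparse_pls(X, Y, lam_x=0.2, lam_y=0.2, n_iter=100):
    """First sparse PLS weight pair via L1-penalized power iteration on X'Y."""
    C = X.T @ Y / len(X)                           # cross-covariance matrix
    v = np.full(Y.shape[1], 1.0 / np.sqrt(Y.shape[1]))
    u = np.zeros(X.shape[1])
    for _ in range(n_iter):
        u = soft_threshold(C @ v, lam_x)
        u /= np.linalg.norm(u) + 1e-12
        v = soft_threshold(C.T @ u, lam_y)
        v /= np.linalg.norm(v) + 1e-12
    return u, v

rng = np.random.default_rng(2)
n = 200
z = rng.standard_normal(n)
X = rng.standard_normal((n, 50))
Y = rng.standard_normal((n, 30))
X[:, :5] += z[:, None]                             # only 5 relevant "brain" variables
Y[:, :3] += z[:, None]                             # only 3 relevant "behavior" variables
u, v = sparse_pls(X - X.mean(0), Y - Y.mean(0))
print("nonzero X weights:", np.count_nonzero(u), "of", len(u))
```

Because soft-thresholding sets small cross-covariance contributions exactly to zero, the recovered weight vectors select subsets of variables, which is the variable-selection behavior discussed above.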

The stability of the weights differed significantly between the brain and behavioral modalities (*F*_{1,804} = 75.26, *p* < .0001) and across the 3 models (*F*_{2,804} = 61.77, *p* < .0001) (Table 4). The stability of the SPLS weights was lowest in the HCP dataset, likely due to the model’s sparsity and the fact that different sets of variables can provide similar performance. The instability of SPLS could be mitigated by stability selection (20) or by a stability criterion during hyperparameter optimization (27). The explained variance varied significantly across modalities (*F*_{1,174} = 80.00, *p* < .0001) and across the 3 models (*F*_{2,174} = 28.60, *p* < .0001).

In summary, while RCCA is likely to yield similar or higher out-of-sample correlations than standard PLS and SPLS, SPLS can perform variable selection and may improve the interpretability of the results; however, it can also present instabilities. In practice, the 3 models often provide similar weights for the top-ranked variables.

### Standard Versus Regularized Extension of CCA/PLS Models in Low-Dimensional Data

To investigate the effects of regularization in all 3 low-dimensional datasets, we compared standard CCA, RCCA, standard PLS, and SPLS. The regularized models (RCCA, SPLS) were more stable (*F*_{3,1075} = 80.54, *p* < .0001) (Table S5) and showed a trend toward higher out-of-sample correlations (*F*_{1,10} = 3.35, *p* = .0972) (Figure S1) than their nonregularized variants (standard CCA and standard PLS). The stability of the standard PLS and RCCA weights was consistently high, the stability of SPLS varied across datasets, and standard CCA was rather unstable (Table S5). SPLS provided sparse results, similar to the high-dimensional datasets (Table S4). As expected, RCCA and standard PLS explained progressively more within-modality variance than standard CCA. For a detailed description of these results, see the Supplement. Taken together, these results suggest that RCCA/SPLS models should be preferred even for low-dimensional data.

### Conclusions

This tutorial compared standard and regularized extensions of CCA and PLS models and highlighted the benefits of regularization. Here, we outline the key lessons.

First, we showed that regularized extensions of CCA/PLS models give similar out-of-sample correlations in large datasets (with the exception of standard PLS and SPLS in the high-dimensional HCP dataset) when the sample size is similar to or much smaller than the number of variables (i.e., when the ratio between examples and variables is ∼1–10 or ∼0.1–0.01). Importantly, RCCA and SPLS outperformed standard CCA and PLS even when the ratio between examples and variables was ∼1–10. Second, we emphasized the importance of using a predictive framework, as high in-sample correlations do not necessarily imply generalizability to unseen data.

Going beyond model performance, we demonstrated both in theory and in practice that standard CCA is prone to instability (Table S3). L2-norm regularization improves stability, at the cost of the models (RCCA, standard PLS, SPLS) being driven by within-modality variances. PCA-CCA with data-driven selection of PCs improves on a priori selection and has a regularizing effect comparable to that of RCCA. Sparsity (i.e., L1-norm regularization) can facilitate the interpretability and generalizability of the models, but it can also introduce instability; it is most useful when the associative effect itself is sparse (e.g., in the ADNI and simulated datasets). Data-driven PCA-CCA, RCCA, and SPLS yielded similar model weights and accounted for similar variances.

We hope that this work, together with recent efforts [e.g., (26,27,30,31,52)] and critical exchanges [e.g., (28,58, 59, 60, 61)], illuminates these complex methods and facilitates their application to the brain and its disorders.

## Acknowledgments and Disclosures

AM was funded by the Wellcome Trust (Grant No. WT102845/Z/13/Z) and by MQ: Transforming Mental Health (Grant No. MQF17_24). JC was supported by the Engineering and Physical Sciences Research Council–funded University College London Centre for Doctoral Training in Intelligent, Integrated Imaging in Healthcare (Grant No. EP/S021930/1) and the Department of Health’s National Institute for Health and Care Research–funded Biomedical Research Centre at University College London Hospitals. RAA was supported by a Medical Research Council Skills Development Fellowship (Grant No. MR/S007806/1). NRW was supported by grants from the German Research Foundation (Grant Nos. HA7070/2-2, HA7070/3, and HA7070/4). FSF was funded by a PhD scholarship awarded by Fundação para a Ciência e a Tecnologia (Grant No. SFRH/BD/120640/2016). JM-M was funded by the Wellcome Trust (Grant No. WT102845/Z/13/Z).

A complete listing of Alzheimer’s Disease Neuroimaging Initiative (ADNI) investigators can be found at http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf.

Some data used in preparation of this article were obtained from the ADNI database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in the analysis or writing of this report. The code used for the different CCA/PLS analyses is implemented in a CCA/PLS toolkit, available at http://www.mlnl.cs.ucl.ac.uk/resources/cca_pls_toolkit together with a demo showing how to use the toolkit to generate the SPLS results for the low-dimensional simulated dataset. The CCA/PLS toolkit is also available on Zenodo (https://doi.org/10.5281/zenodo.7153571) (64).

The authors report no biomedical financial interests or potential conflicts of interest.

## Supplementary Material

- Supplement

## References

1. Statistical challenges in “big data” human neuroimaging. *Neuron.* 2018; 97: 263-268
2. Inference in the age of big data: Future perspectives on neuroscience. *Neuroimage.* 2017; 155: 549-564
3. Towards algorithmic analytics for large-scale datasets. *Nat Mach Intell.* 2019; 1: 296-306
4. Relations between two sets of variates. *Biometrika.* 1936; 28: 321-377
5. Partial least squares. In: Kotz S, Johnson N, eds. Encyclopedia of Statistical Sciences. New York: Wiley; 1985: 581-591
6. Somatosensory-motor dysconnectivity spans multiple transdiagnostic dimensions of psychopathology. *Biol Psychiatry.* 2019; 86: 779-791
7. Resting-state connectivity biomarkers define neurophysiological subtypes of depression. *Nat Med.* 2017; 23: 28-38
8. Multivariate associations among behavioral, clinical, and multimodal imaging phenotypes in patients with psychosis. *JAMA Psychiatry.* 2018; 75: 386-395
9. Topography and behavioral relevance of the global signal in the human brain. *Sci Rep.* 2019; 9: 14286
10. The relationship between spatial configuration and functional connectivity of brain regions. *Elife.* 2018; 7: e32992
11. Linked dimensions of psychopathology and connectivity in functional brain networks. *Nat Commun.* 2018; 9: 3003
12. Multivariate patterns of brain-behavior-environment associations in the Adolescent Brain and Cognitive Development Study. *Biol Psychiatry.* 2021; 89: 510-520
13. Sparse canonical correlation analysis relates network-level atrophy to multivariate cognitive measures in a neurodegenerative population. *Neuroimage.* 2014; 84: 698-711
14. Partial least squares correlation of multivariate cognitive abilities and local brain structure in children and adolescents. *Neuroimage.* 2013; 82: 284-294
15. Neurobehavioural characterisation and stratification of reinforcement-related behaviour. *Nat Hum Behav.* 2020; 4: 544-558
16. Significant correlation between a set of genetic polymorphisms and a functional brain network revealed by feature selection and sparse partial least squares. *Neuroimage.* 2012; 63: 11-24
17. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. *Biostatistics.* 2009; 10: 515-534
18. Functional corticostriatal connection topographies predict goal-directed behaviour in humans. *Nat Hum Behav.* 2017; 1: 0146
19. Correspondence between fMRI and SNP data by group sparse canonical correlation analysis. *Med Image Anal.* 2014; 18: 891-902
20. Identification of neurobehavioural symptom groups based on shared brain mechanisms. *Nat Hum Behav.* 2019; 3: 1306-1318
21. Patterns of thought: Population variation in the associations between large-scale network organisation and self-reported experiences at rest. *Neuroimage.* 2018; 176: 518-527
22. A positive-negative mode of population covariation links brain connectivity, demographics and behavior. *Nat Neurosci.* 2015; 18: 1565-1567
23. Traces of trauma: A multivariate pattern analysis of childhood trauma, brain structure, and clinical phenotypes. *Biol Psychiatry.* 2020; 88: 829-842
24. Patterns of sociocognitive stratification and perinatal risk in the child brain. *Proc Natl Acad Sci U S A.* 2020; 117: 12419-12427
25. Brain-behaviour modes of covariation in healthy and clinically depressed young people. *Sci Rep.* 2019; 9: 11536
26. On stability of canonical correlation analysis and partial least squares with application to brain-behavior associations. *bioRxiv.* 2020; https://doi.org/10.1101/2020.08.25.265546
27. Multiple holdouts with stability: Improving the generalizability of machine learning analyses of brain–behavior relationships. *Biol Psychiatry.* 2020; 87: 368-376
28. Evaluating the evidence for biotypes of depression: Methodological replication and extension of. *Neuroimage Clin.* 2019; 22: 101796
29. A tutorial on canonical correlation methods. *ACM Comput Surv.* 2017; 50: 1-33
30. A technical review of canonical correlation analysis for neuroscience applications. *Hum Brain Mapp.* 2020; 41: 3807-3833
31. Finding the needle in a high-dimensional haystack: Canonical correlation analysis for neuroscientists. *Neuroimage.* 2020; 216: 116745
32. Partial least squares (PLS) methods for neuroimaging: A tutorial and review. *Neuroimage.* 2011; 56: 455-475
33. Canonical correlations with fallible data. *Psychometrika.* 1964; 29: 55-65
34. Overview and recent advances in partial least squares. In: Saunders C, Grobelnik M, Gunn S, Shawe-Taylor J, eds. Subspace, Latent Structure and Feature Selection. Berlin: Springer; 2006: 34-51
35. Perturbation analysis of the canonical correlations of matrix pairs. *Linear Algebra Appl.* 1994; 210: 3-28
36. A survey on partial least squares (PLS) methods, with emphasis on the two-block case. Technical Report No. 371. https://stat.uw.edu/sites/default/files/files/reports/2000/tr371.pdf. Accessed November 23, 2017
37. Discovering genetic associations with high-dimensional neuroimaging phenotypes: A sparse reduced-rank regression approach. *Neuroimage.* 2010; 53: 1147-1159
38. Canonical correlation analysis: A general parametric significance-testing system. *Psychol Bull.* 1978; 85: 410-416
39. Reduced-rank regression for the multivariate linear model. *J Multivar Anal.* 1975; 5: 248-264
40. Regression shrinkage and selection via the Lasso. *J R Stat Soc Ser B.* 1996; 58: 267-288
41. Ridge regression: Applications to nonorthogonal problems. *Technometrics.* 1970; 12: 69-82
42. Regularization and variable selection via the elastic net. *J R Stat Soc Ser B Stat Methodol.* 2005; 67: 301-320
43. Canonical ridge and econometrics of joint production. *J Econom.* 1976; 4: 147-166
44. Canonical correlation analysis: An overview with application to learning methods. *Neural Comput.* 2004; 16: 2639-2664
45. Regularized generalized canonical correlation analysis. *Psychometrika.* 2011; 76: 257-284
46. Canonical correlation analysis in high dimensions with structured regularization. *arXiv.* 2020; https://doi.org/10.48550/arXiv.2011.01650
47. Optimal whitening and decorrelation. *Am Stat.* 2018; 72: 309-314
48. To explain or to predict? *Stat Sci.* 2010; 25: 289-310
49. Inference and prediction diverge in biomedicine. *Patterns.* 2020; 1: 100119
50. Single subject prediction of brain disorders in neuroimaging: Promises and pitfalls. *Neuroimage.* 2017; 145: 137-165
51. Partial least squares regression and projection on latent structure regression (PLS regression). *Wiley Interdiscip Rev Comput Stat.* 2010; 2: 97-106
52. Permutation inference for canonical correlation analysis. *Neuroimage.* 2020; 220: 117065
53. Sparse PLS discriminant analysis: Biologically relevant feature selection and graphical displays for multiclass problems. *BMC Bioinformatics.* 2011; 12: 253
54. Multivariate morphological brain signatures predict patients with chronic abdominal pain from healthy control subjects. *Pain.* 2015; 156: 1545-1554
55. A variant of sparse partial least squares for variable selection and data exploration. *Front Neuroinform.* 2014; 8: 18
56. A multiple hold-out framework for sparse partial least squares. *J Neurosci Methods.* 2016; 271: 182-194
57. Multi-level block permutation. *Neuroimage.* 2015; 123: 253-268
58. Canonical correlation analysis for identifying biotypes of depression. *Biol Psychiatry Cogn Neurosci Neuroimaging.* 2020; 5: 478-480
59. Functional and optogenetic approaches to discovering stable subtype-specific circuit mechanisms in depression. *Biol Psychiatry Cogn Neurosci Neuroimaging.* 2019; 4: 554-566
60. Reply to: A closer look at depression biotypes: Correspondence relating to Grosenick et al. (2019). *Biol Psychiatry Cogn Neurosci Neuroimaging.* 2020; 5: 556
61. A closer look at depression biotypes: Correspondence relating to Grosenick et al. (2019). *Biol Psychiatry Cogn Neurosci Neuroimaging.* 2020; 5: 554-555
62. Implementation of a new parcellation of the orbitofrontal cortex in the automated anatomical labeling atlas. *Neuroimage.* 2015; 122: 1-5
63. “Mini-mental state”. A practical method for grading the cognitive state of patients for the clinician. *J Psychiatr Res.* 1975; 12: 189-198
64. MLNL/cca_pls_toolkit: CCA/PLS Toolkit (v1.0.0). Zenodo.

## Article info

### Publication history

Published online: August 08, 2022

Accepted: July 22, 2022

Received in revised form: June 30, 2022

Received: June 17, 2021


### Copyright

© 2022 Society of Biological Psychiatry. Published by Elsevier Inc.

### User license

Creative Commons Attribution (CC BY 4.0)