The mcvis package provides functions for detecting
multi-collinearity (also known as collinearity) in linear regression. In
simple terms, the mcvis method identifies, in a graphical manner, the
variables with strong influences on collinearity.
Consider a simple scenario in which one predictor, X1, is almost an exact linear combination of two other predictors, X2 and X3, so that X1 is strongly correlated with both of them. This near-linear dependence among the three variables is sufficient to cause collinearity, which shows up as large variances of the estimated model parameters in a linear regression. We illustrate this with a simple example:
## Simulating some data
set.seed(1)
p = 6
n = 100
X = matrix(rnorm(n*p), ncol = p)
X[,1] = X[,2] + X[,3] + rnorm(n, 0, 0.01)
y = rnorm(n)
summary(lm(y ~ X))
#>
#> Call:
#> lm(formula = y ~ X)
#>
#> Residuals:
#>      Min       1Q   Median       3Q      Max
#> -2.56042 -0.73579 -0.05585  0.86967  2.20334
#>
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)
#> (Intercept)   0.02084    0.11157   0.187    0.852
#> X1           10.14768   10.34285   0.981    0.329
#> X2          -10.08175   10.33068  -0.976    0.332
#> X3          -10.30688   10.34038  -0.997    0.321
#> X4            0.04175    0.11321   0.369    0.713
#> X5            0.07191    0.09563   0.752    0.454
#> X6           -0.16951    0.11482  -1.476    0.143
#>
#> Residual standard error: 1.094 on 93 degrees of freedom
#> Multiple R-squared: 0.06683, Adjusted R-squared: 0.006628
#> F-statistic: 1.11 on 6 and 93 DF, p-value: 0.3625
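The inflated standard errors of X1, X2 and X3 in the output above (about 10.3, against about 0.1 for X4 to X6) can be quantified with variance inflation factors (VIFs). The following is a minimal sketch in base R and is not part of mcvis: it re-creates the simulated X and computes the VIF of each predictor as 1 / (1 - R^2), where R^2 comes from regressing that predictor on the remaining ones. The names vif_by_regression and vif_by_inverse are introduced here for illustration only.

```r
## Re-create the simulated design matrix from the example above
set.seed(1)
p = 6
n = 100
X = matrix(rnorm(n * p), ncol = p)
X[, 1] = X[, 2] + X[, 3] + rnorm(n, 0, 0.01)
colnames(X) = paste0("X", 1:p)

## VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R-squared from
## regressing column j on all the other columns
vif_by_regression = sapply(1:p, function(j) {
  r2 = summary(lm(X[, j] ~ X[, -j]))$r.squared
  1 / (1 - r2)
})
names(vif_by_regression) = colnames(X)

## Equivalent shortcut: the diagonal of the inverse correlation matrix
vif_by_inverse = diag(solve(cor(X)))

round(vif_by_regression)
```

Under this simulation we would expect very large VIFs for X1, X2 and X3 and values close to 1 for X4, X5 and X6. A VIF only flags which coefficients are affected; it does not by itself show which variables jointly drive the collinearity.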
The mcvis method highlights the major collinearity-causing variables on a
bipartite graph. There are three major components of this graph:
+ the top row renders the “tau” statistics. By default, only one tau statistic is shown (τp, where p is the number of predictors). This tau statistic measures the extent of collinearity in the data and relates to the eigenvalues of the correlation matrix of X.
+ the bottom row renders all original predictors.
+ the two rows are linked through the MC-indices that we have developed, which are represented as lines of different shades and thickness. Darker lines indicate larger values of the MC-index, meaning that the predictor contributes more to causing collinearity.
For details on how the tau statistics and the resampling-based MC-index are calculated, see our paper: Lin, C., Wang, K. Y. X., & Mueller, S. (2020). mcvis: A new framework for collinearity discovery, diagnostic and visualization. Journal of Computational and Graphical Statistics.
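As a rough illustration of why the eigenvalues of the correlation matrix carry information about collinearity, the base R sketch below inspects them directly for the simulated X. It is not the package's computation of the tau statistics or of the MC-index (the paper describes those); the names eig, smallest_eigenvalue and weights are ours, introduced only for this illustration.

```r
## Re-create the simulated design matrix from the first example
set.seed(1)
p = 6
n = 100
X = matrix(rnorm(n * p), ncol = p)
X[, 1] = X[, 2] + X[, 3] + rnorm(n, 0, 0.01)
colnames(X) = paste0("X", 1:p)

## Eigen-decomposition of the correlation matrix of X
## (eigen() returns the eigenvalues in decreasing order)
eig = eigen(cor(X), symmetric = TRUE)
smallest_eigenvalue = eig$values[p]

## Squared loadings of the eigenvector paired with the smallest
## eigenvalue: non-negative weights that sum to one
weights = eig$vectors[, p]^2
names(weights) = colnames(X)

signif(eig$values, 3)
round(weights, 2)
```

A smallest eigenvalue close to zero signals a near-exact linear dependence among the columns. In this simulation, the eigenvector paired with it places almost all of its weight on X1, X2 and X3, the three variables involved in the dependence.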
library(mcvis)
mcvis_result = mcvis(X = X)
mcvis_result
#>        X1   X2   X3 X4 X5 X6
#> tau6 0.51 0.25 0.24  0  0  0
This matrix of MC-indices is the main numeric output of mcvis
and our visualisation techniques are focused on
visualising this matrix. Below is a bipartite graph visualising the last
row of this matrix, which is the main visualisation plot of mcvis.
We also provide an igraph version of the mcvis bipartite graph.
In practice, high correlation between variables is not a necessary
criterion for collinearity. In the mplot package (Tarr et al. 2018), a
simulated data set was created in which many of the columns are linear
combinations of other columns plus noise. In this case, the cause
of the collinearity is not clear from the correlation matrix.
The mcvis visualisation plot identifies the 8th variable (x8) as the
main cause of collinearity in this data: in the output below, x8 has the
largest MC-index (0.32), with x6 (0.31) and x3 (0.29) close behind.
Consulting the data-generating code for this simulation, we see that x8
is a linear combination of all other predictor variables (plus noise).
This knowledge can provide important guidance for the statistical
interpretation of model selection results.
## Simulation taken from the `mplot` package.
## Generating data with multi-collinearity.
n=50
set.seed(8) # a seed of 2 also works
x1 = rnorm(n,0.22,2)
x7 = 0.5*x1 + rnorm(n,0,sd=2)
x6 = -0.75*x1 + rnorm(n,0,3)
x3 = -0.5-0.5*x6 + rnorm(n,0,2)
x9 = rnorm(n,0.6,3.5)
x4 = 0.5*x9 + rnorm(n,0,sd=3)
x2 = -0.5 + 0.5*x9 + rnorm(n,0,sd=2)
x5 = -0.5*x2+0.5*x3+0.5*x6-0.5*x9+rnorm(n,0,1.5)
x8 = x1 + x2 -2*x3 - 0.3*x4 + x5 - 1.6*x6 - 1*x7 + x9 +rnorm(n,0,0.5)
y = 0.6*x8 + rnorm(n,0,2)
artificialeg = round(data.frame(x1,x2,x3,x4,x5,x6,x7,x8,x9,y),1)
X = artificialeg[,1:9]
round(cor(X), 2)
#>       x1    x2    x3    x4    x5    x6    x7    x8    x9
#> x1  1.00  0.00  0.14 -0.07 -0.02 -0.37  0.46  0.36 -0.22
#> x2  0.00  1.00  0.31  0.30 -0.60  0.00 -0.29  0.24  0.53
#> x3  0.14  0.31  1.00  0.04 -0.28 -0.66 -0.08 -0.01  0.13
#> x4 -0.07  0.30  0.04  1.00 -0.48  0.01  0.02 -0.07  0.62
#> x5 -0.02 -0.60 -0.28 -0.48  1.00  0.38  0.17 -0.30 -0.75
#> x6 -0.37  0.00 -0.66  0.01  0.38  1.00  0.02 -0.50 -0.08
#> x7  0.46 -0.29 -0.08  0.02  0.17  0.02  1.00 -0.43 -0.29
#> x8  0.36  0.24 -0.01 -0.07 -0.30 -0.50 -0.43  1.00  0.27
#> x9 -0.22  0.53  0.13  0.62 -0.75 -0.08 -0.29  0.27  1.00
mcvis_result = mcvis(X)
mcvis_result
#>        x1   x2   x3 x4   x5   x6   x7   x8   x9
#> tau9 0.01 0.01 0.29  0 0.03 0.31 0.02 0.32 0.02
plot(mcvis_result)
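To see why the correlation matrix alone does not reveal the collinearity here, one can compare the largest pairwise correlation with how well each predictor is explained by all the others together. The following is a minimal base R sketch, not part of mcvis; it re-creates the simulated predictors (y is omitted because it is generated last and does not affect them), and the names cor_X, max_abs_cor and r2 are introduced only for this illustration.

```r
## Re-create the simulated predictors from the example above
n = 50
set.seed(8)
x1 = rnorm(n, 0.22, 2)
x7 = 0.5 * x1 + rnorm(n, 0, sd = 2)
x6 = -0.75 * x1 + rnorm(n, 0, 3)
x3 = -0.5 - 0.5 * x6 + rnorm(n, 0, 2)
x9 = rnorm(n, 0.6, 3.5)
x4 = 0.5 * x9 + rnorm(n, 0, sd = 3)
x2 = -0.5 + 0.5 * x9 + rnorm(n, 0, sd = 2)
x5 = -0.5 * x2 + 0.5 * x3 + 0.5 * x6 - 0.5 * x9 + rnorm(n, 0, 1.5)
x8 = x1 + x2 - 2 * x3 - 0.3 * x4 + x5 - 1.6 * x6 - 1 * x7 + x9 + rnorm(n, 0, 0.5)
X = round(data.frame(x1, x2, x3, x4, x5, x6, x7, x8, x9), 1)

## Largest absolute correlation between two distinct predictors
cor_X = cor(X)
max_abs_cor = max(abs(cor_X[upper.tri(cor_X)]))

## R-squared from regressing each predictor on all the others
r2 = sapply(names(X), function(v) {
  fit = lm(reformulate(setdiff(names(X), v), response = v), data = X)
  summary(fit)$r.squared
})

round(max_abs_cor, 2)
round(r2, 2)
```

No pair of predictors has an absolute correlation above 0.75 (as in the correlation matrix above), yet x8 is explained almost perfectly by the other eight predictors taken together. This is the kind of joint dependence that pairwise correlations cannot show.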
We also offer a Shiny app implementation of mcvis in our
package. Suppose that we have an mcvis_result object stored
in the memory of R. You can simply call the function shiny_mcvis
to load up a Shiny app.
Lin, C., Wang, K. Y. X., & Mueller, S. (2020). mcvis: A new framework for collinearity discovery, diagnostic and visualization. Journal of Computational and Graphical Statistics, In Press. URL: https://doi.org/10.1080/10618600.2020.1779729
Tarr, G., Mueller, S., & Welsh, A. H. (2018). mplot: An R Package for Graphical Model Stability and Variable Selection Procedures. Journal of Statistical Software, 83(9), 1-28. URL: https://doi.org/10.18637/jss.v083.i09
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] mcvis_1.0.8 rmarkdown_2.29
#>
#> loaded via a namespace (and not attached):
#> [1] sass_0.4.9 generics_0.1.3 stringi_1.8.4 lattice_0.22-6
#> [5] digest_0.6.37 magrittr_2.0.3 RColorBrewer_1.1-3 evaluate_1.0.3
#> [9] grid_4.4.2 fastmap_1.2.0 plyr_1.8.9 jsonlite_1.8.9
#> [13] promises_1.3.2 purrr_1.0.4 scales_1.3.0 jquerylib_0.1.4
#> [17] mnormt_2.1.1 cli_3.6.3 shiny_1.10.0 rlang_1.1.5
#> [21] munsell_0.5.1 withr_3.0.2 cachem_1.1.0 yaml_2.3.10
#> [25] tools_4.4.2 parallel_4.4.2 reshape2_1.4.4 dplyr_1.1.4
#> [29] colorspace_2.1-1 ggplot2_3.5.1 httpuv_1.6.15 assertthat_0.2.1
#> [33] buildtools_1.0.0 vctrs_0.6.5 R6_2.5.1 mime_0.12
#> [37] lifecycle_1.0.4 stringr_1.5.1 psych_2.4.12 pkgconfig_2.0.3
#> [41] pillar_1.10.1 bslib_0.9.0 later_1.4.1 gtable_0.3.6
#> [45] glue_1.8.0 Rcpp_1.0.14 xfun_0.50 tibble_3.2.1
#> [49] tidyselect_1.2.1 sys_3.4.3 knitr_1.49 farver_2.1.2
#> [53] xtable_1.8-4 htmltools_0.5.8.1 nlme_3.1-167 igraph_2.1.4
#> [57] labeling_0.4.3 maketools_1.3.1 compiler_4.4.2