Title: | Semi-Parametric Gene-Environment Interaction via Bayesian Variable Selection |
---|---|
Description: | Many complex diseases are known to be affected by the interactions between genetic variants and environmental exposures beyond the main genetic and environmental effects. Existing Bayesian methods for gene-environment (G×E) interaction studies are challenged by the high-dimensional nature of the study and the complexity of environmental influences. We have developed a novel and powerful semi-parametric Bayesian variable selection method that can accommodate linear and nonlinear G×E interactions simultaneously (Ren et al. (2020) <doi:10.1002/sim.8434>). Furthermore, the proposed method can conduct structural identification by distinguishing nonlinear interactions from main effects only case within Bayesian framework. Spike-and-slab priors are incorporated on both individual and group level to shrink coefficients corresponding to irrelevant main and interaction effects to zero exactly. The Markov chain Monte Carlo algorithms of the proposed and alternative methods are efficiently implemented in C++. |
Authors: | Jie Ren, Fei Zhou, Xiaoxi Li, Cen Wu, Yu Jiang |
Maintainer: | Jie Ren <[email protected]> |
License: | GPL-2 |
Version: | 0.2.1 |
Built: | 2024-11-07 02:56:29 UTC |
Source: | https://github.com/jrhub/spinbayes |
Many complex diseases are known to be affected by the interactions between genetic variants and environmental exposures beyond the main genetic and environmental effects. Existing Bayesian methods for gene-environment (G×E) interaction studies are challenged by the high-dimensional nature of the study and the complexity of environmental influences. We have developed a novel and powerful semi-parametric Bayesian variable selection method that can accommodate linear and nonlinear G×E interactions simultaneously (Ren et al. (2020) doi:10.1002/sim.8434). Furthermore, the proposed method can conduct structural identification by distinguishing nonlinear interactions from main effects only case within Bayesian framework. Spike-and-slab priors are incorporated on both individual and group level to shrink coefficients corresponding to irrelevant main and interaction effects to zero exactly. The Markov chain Monte Carlo algorithms of the proposed and alternative methods are efficiently implemented in C++.
Within the Bayesian framework, we propose a partially linear varying coefficient model (PLVC) for G×E interactions. The varying coefficient functions capture the possible non-linear G×E interaction, and the linear component models the G×E interactions with linear assumptions. The changing of basis with B splines is adopted to separate the coefficient functions with varying, non-zero constant and zero forms, corresponding to cases of nonlinear interaction, main effect only (no interaction) and no genetic interaction at all.
The user friendly, integrated interface BVCfit() allows users to flexibly choose the fitting methods they prefer. There are three arguments in BVCfit() that control the fitting method
sparse: | whether to use the spike-and-slab priors to achieve sparsity. |
VC: | whether to separate the coefficient functions with varying effects |
and non-zero constant (main) effects. | |
structural: | whether to use varying coefficient functions for modeling |
non-linear GxE interactions. |
BVCfit() returns a BVCfit object that contains the posterior estimates of each coefficients. S3 generic functions BVSelection(), predict(), plot() and print() are implemented for BVCfit objects. BVSelection() takes a BVCfit object and returns the variable selection results. predict() takes a BVCfit object and returns the predicted values for new observations.
Ren, J., Zhou, F., Li, X., Chen, Q., Zhang, H., Ma, S., Jiang, Y., Wu, C. (2020). Semiparametric Bayesian variable selection for gene-environment interactions. Statistics in Medicine, 39(5): 617–638. doi:10.1002/sim.8434.
Zhou, F., Ren, J., Lu, X., Ma, S., and Wu, C. (2021). Gene-Environment Interaction: A Variable Selection Perspective. Methods in Molecular Biology, 2212:191-223. doi:10.1007/978-1-0716-0947-7_13. PMID: 33733358.
Wu, C., Jiang, Y., Ren, J., Cui, Y., Ma, S. (2018). Dissecting gene-environment interactions: A penalized robust approach accounting for hierarchical structures. Statistics in Medicine, 37:437–456. doi:10.1002/sim.7518.
Wu, C., Zhong, P.-S., and Cui, Y. (2018). Additive varying–coefficient model for nonlinear gene–environment interactions. Statistical Applications in Genetics and Molecular Biology, 17(2). doi:10.1515/sagmb-2017-0008.
Jiang, Y., Huang, Y., Du, Y., Zhao, Y., Ren, J., Ma, S., Wu, C. (2017). Identification of prognostic genes and pathways in lung adenocarcinoma using a Bayesian Approach. Cancer Informatics, 1(7).
Wu, C., Shi, X., Cui, Y., and Ma, S. (2015). A penalized robust semiparametric approach for gene-environment interactions. Statistics in Medicine, 34(30): 4016–4030. doi:10.1002/sim.6609.
Wu, C., and Ma, S. (2015). A selective review of robust variable selection with applications in bioinformatics. Briefings in Bioinformatics, 16(5), 873–883. doi:10.1093/bib/bbu046.
Wu, C., Cui, Y., and Ma, S. (2014). Integrative analysis of gene–environment interactions under a multi–response partially linear varying coefficient model. Statistics in Medicine, 33(28), 4988–4998. doi:10.1002/sim.6287.
Wu, C., and Cui, Y. (2013). Boosting signals in gene–based association studies via efficient SNP selection. Briefings in Bioinformatics, 15(2):279–291. doi:10.1093/bib/bbs087.
Wu, C., and Cui, Y. (2013). A novel method for identifying nonlinear gene–environment interactions in case–control association studies. Human Genetics, 132(12):1413–1425. doi:10.1007/s00439-013-1350-z.
Wu, C., Zhong, P.S., and Cui, Y. (2013). High dimensional variable selection for gene-environment interactions. Technical Report. Michigan State University.
Wu, C., Li, S., and Cui, Y. (2012). Genetic Association Studies: An Information Content Perspective. Current Genomics, 13(7), 566–573. doi:10.2174/138920212803251382.
Useful links:
fit a Bayesian semi-parametric model for both linear and non-linear G×E interactions. Users can also specify all the interactions as linear and fit a Bayesian LASSO type of model.
BVCfit( X, Y, Z, E = NULL, clin = NULL, iterations = 10000, burn.in = NULL, sparse = TRUE, structural = TRUE, VC = TRUE, kn = 2, degree = 2, hyper = NULL, debugging = FALSE )
BVCfit( X, Y, Z, E = NULL, clin = NULL, iterations = 10000, burn.in = NULL, sparse = TRUE, structural = TRUE, VC = TRUE, kn = 2, degree = 2, hyper = NULL, debugging = FALSE )
X |
the matrix of predictors (genetic factors) without intercept. Each row should be an observation vector. A column of 1 will be added to the X matrix as the intercept. |
Y |
the response variable. The current version of BVCfit only supports continuous response. |
Z |
a vector of environmental factor for non-linear G×E interactions. |
E |
a vector of environmental factor for linear G×E interactions. |
clin |
a matrix of clinical variables. Clinical variables are not subject to penalty. |
iterations |
the number of MCMC iterations. |
burn.in |
the number of iterations for burn-in. |
sparse |
logical flag. If TRUE, spike-and-slab priors will be used to shrink coefficients of irrelevant covariates to zero exactly. 'sparse' has effect only when VC=TRUE. |
structural |
logical flag. If TRUE, the coefficient functions with varying effects and constant effects will be penalized separately. 'structural' has effect only when VC=TRUE. |
VC |
logical flag. If TRUE, varying coefficient functions will be used for modeling the interactions between Z and X. If FALSE, interactions between Z and X will be modeled as linear interactions. |
kn |
the number of interior knots for B-spline. |
degree |
the degree of B spline basis. |
hyper |
a named list of hyperparameters. |
debugging |
logical flag. If TRUE, progress will be output to the console and extra information will be returned. |
By default, varying coefficient functions are used for modeling the nonlinear interactions between Z and X. Assuming both E and clin are NULL, the model can be expressed as
The basis expansion and changing of basis with B splines will be done automatically:
where represents B spline basis.
and
correspond to the constant and varying parts of the coefficient functional, respectively.
q=kn+degree+1 is the number of basis functions. By default, kn=degree=2. User can change the values of kn and degree to any other positive integers.
If E is provided, the linear interactions between E and X will be added modeled as pairwise-products:
If clin is provided, clinical variables will be added to the model.
If VC=FALSE, all interactions are treated as linear and a Bayesian LASSO model will be used. With non-null values of E and clin, the full linear model is:
Please check the references for more details about the model.
Users can modify the hyper-parameters by providing a named list of hyper-parameters via the argument 'hyper'. The list can have the following named components
a.c, a.v, a.e: shape parameters of the Gamma priors on ,
and
, respectively.
b.c, b.v, b.e: rate parameters of the Gamma priors on ,
and
, respectively.
r.c, r.v, r.e: shape parameters of the Beta priors () on
,
and
, respectively.
w.c, w.v, w.e: shape parameters of the Beta priors on ,
and
, respectively.
s: shape parameters of the Inverse-gamma prior on .
h: scale parameters of the Inverse-gamma prior on .
Please check the references for more details about the prior distributions.
an object of class "BVCfit" is returned, which is a list with components:
posterior: posterior samples from the MCMC
coefficients: a list of posterior estimates of coefficients
burn.in: the number of iterations for burn-in
iterations: the number of MCMC iterations.
Ren, J., Zhou, F., Li, X., Chen, Q., Zhang, H., Ma, S., Jiang, Y., Wu, C. (2020) Semiparametric Bayesian variable selection for gene-environment interactions. Statistics in Medicine, 39(5): 617– 638 doi:10.1002/sim.8434
data(gExp) ## default method spbayes=BVCfit(X, Y, Z, E, clin) spbayes ## non-structural structural=FALSE spbayes=BVCfit(X, Y, Z, E, clin, structural=structural) spbayes ## non-sparse sparse=FALSE spbayes=BVCfit(X, Y, Z, E, clin, sparse=sparse) spbayes
data(gExp) ## default method spbayes=BVCfit(X, Y, Z, E, clin) spbayes ## non-structural structural=FALSE spbayes=BVCfit(X, Y, Z, E, clin, structural=structural) spbayes ## non-sparse sparse=FALSE spbayes=BVCfit(X, Y, Z, E, clin, sparse=sparse) spbayes
Variable selection for a BVCfit object
BVSelection(obj, ...) ## S3 method for class 'BVCNonSparse' BVSelection(obj, burn.in = obj$burn.in, prob = 0.95, ...) ## S3 method for class 'BVCSparse' BVSelection(obj, burn.in = obj$burn.in, ...)
BVSelection(obj, ...) ## S3 method for class 'BVCNonSparse' BVSelection(obj, burn.in = obj$burn.in, prob = 0.95, ...) ## S3 method for class 'BVCSparse' BVSelection(obj, burn.in = obj$burn.in, ...)
obj |
BVCfit object. |
... |
other BVSelection arguments |
burn.in |
MCMC burn-in. |
prob |
probability for credible interval, between 0 and 1. e.g. prob=0.95 leads to 95% credible interval |
For class 'BVCSparse', the median probability model (MPM) (Barbieri and Berger 2004) is used to identify predictors that are significantly associated with the response variable. For class 'BVCNonSparse', variable selection is based on 95% credible interval. Please check the references for more details about the variable selection.
an object of class "BVSelection" is returned, which is a list with components:
method: method used for identifying important effects
indices: a list of indices and names of selected variables
summary: a summary of selected variables
Ren, J., Zhou, F., Li, X., Chen, Q., Zhang, H., Ma, S., Jiang, Y., Wu, C. (2020) Semiparametric Bayesian variable selection for gene-environment interactions. Statistics in Medicine, 39(5): 617– 638 doi:10.1002/sim.8434
Barbieri, M.M. and Berger, J.O. (2004). Optimal predictive model selection Ann. Statist, 32(3):870–897
data(gExp) ## sparse spbayes=BVCfit(X, Y, Z, E, clin) spbayes selected = BVSelection(spbayes) selected$indices ## non-sparse spbayes=BVCfit(X, Y, Z, E, clin, sparse=FALSE) spbayes selected = BVSelection(spbayes) selected
data(gExp) ## sparse spbayes=BVCfit(X, Y, Z, E, clin) spbayes selected = BVSelection(spbayes) selected$indices ## non-sparse spbayes=BVCfit(X, Y, Z, E, clin, sparse=FALSE) spbayes selected = BVSelection(spbayes) selected
Simulated gene expression data for demonstrating the features of BVCfit.
data("gExp") data("gExp.new") data("gExp.L")
data("gExp") data("gExp.new") data("gExp.L")
gExp consists of five components: X, Y, Z, E and clin. gExp.new contains the data of new observations (X.new, Y.new, Z.new, E.new and clin.new) which can be used for evaluating the prediction performance.
gExp.L contains larger datasets: X2, Y2, Z2, E2 and clin2
the same true model is used for generating Y, Y.new and Y2
where ,
,
and
data(gExp) dim(X) data(gExp.L) dim(X)
data(gExp) dim(X) data(gExp.L) dim(X)
plot the identified varying effects
## S3 method for class 'BVCfit' plot(x, prob=0.95, ...)
## S3 method for class 'BVCfit' plot(x, prob=0.95, ...)
x |
BVCfit object. |
prob |
probability for credible interval, between 0 and 1. e.g. prob=0.95 leads to 95% credible interval |
... |
other plot arguments |
data(gExp) spbayes=BVCfit(X, Y, Z, E, clin) plot(spbayes)
data(gExp) spbayes=BVCfit(X, Y, Z, E, clin) plot(spbayes)
make predictions from a BVCfit object
## S3 method for class 'BVCfit' predict(object, X.new, Z.new, E.new = NULL, clin.new = NULL, Y.new = NULL, ...) ## S3 method for class 'VarLin' predict(object, X.new, Z.new, E.new, clin.new = NULL, Y.new = NULL, ...) ## S3 method for class 'VarOnly' predict(object, X.new, Z.new, clin.new = NULL, Y.new = NULL, ...) ## S3 method for class 'LinOnly' predict(object, X.new, Z.new, E.new = NULL, clin.new = NULL, Y.new = NULL, ...)
## S3 method for class 'BVCfit' predict(object, X.new, Z.new, E.new = NULL, clin.new = NULL, Y.new = NULL, ...) ## S3 method for class 'VarLin' predict(object, X.new, Z.new, E.new, clin.new = NULL, Y.new = NULL, ...) ## S3 method for class 'VarOnly' predict(object, X.new, Z.new, clin.new = NULL, Y.new = NULL, ...) ## S3 method for class 'LinOnly' predict(object, X.new, Z.new, E.new = NULL, clin.new = NULL, Y.new = NULL, ...)
object |
BVCfit object. |
X.new |
a matrix of new values for X at which predictions are to be made. |
Z.new |
a vector of new values for Z at which predictions are to be made. |
E.new |
a vector of new values for E at which predictions are to be made. |
clin.new |
a vector or matrix of new values for clin at which predictions are to be made. |
Y.new |
a vector of the response of new observations. If provided, the prediction mean squared error (PMSE) will be computed based on Y.new. |
... |
other predict arguments |
X.new (clin.new) must have the same number of columns as X (clin) used for fitting the model. If E and clin are provided when fit the model, E.new and clin.new must not be NULL, and vice versa. The predictions are made based on the posterior estimates of coefficients in the BVCfit object. Note that the main effects of environmental exposures Z and E are not subject to selection.
an object of class "BVCfit.pred" is returned, which is a list with components:
pmse |
predictions mean squared error. pmse is NULL is Y.new=NULL. |
y.pred |
predicted values of the new observations. |
data(gExp) spbayes=BVCfit(X, Y, Z, E, clin) spbayes data(gExp.new) pred = predict(spbayes, X.new, Z.new, E.new, clin.new, Y.new) pred$pmse # pred$y.pred
data(gExp) spbayes=BVCfit(X, Y, Z, E, clin) spbayes data(gExp.new) pred = predict(spbayes, X.new, Z.new, E.new, clin.new, Y.new) pred$pmse # pred$y.pred
Print a summary of a BVCfit object
## S3 method for class 'BVCfit' print(x, digits = max(3, getOption("digits") - 3), ...)
## S3 method for class 'BVCfit' print(x, digits = max(3, getOption("digits") - 3), ...)
x |
BVCfit object. |
digits |
significant digits in printout. |
... |
other print arguments |
Print a summary of a BVCfit.pred object
## S3 method for class 'BVCfit.pred' print(x, digits = max(3, getOption("digits") - 3), ...)
## S3 method for class 'BVCfit.pred' print(x, digits = max(3, getOption("digits") - 3), ...)
x |
BVCfit object. |
digits |
significant digits in printout. |
... |
other print arguments |
Print a summary of a BVSelection object
## S3 method for class 'BVSelection' print(x, digits = max(3, getOption("digits") - 3), ...)
## S3 method for class 'BVSelection' print(x, digits = max(3, getOption("digits") - 3), ...)
x |
BVSelection object. |
digits |
significant digits in printout. |
... |
other print arguments |