Package 'roben' reference manual

Title:	Robust Bayesian Variable Selection for Gene-Environment Interactions
Description:	Gene-environment (G×E) interactions have important implications to elucidate the etiology of complex diseases beyond the main genetic and environmental effects. Outliers and data contamination in disease phenotypes of G×E studies have been commonly encountered, leading to the development of a broad spectrum of robust penalization methods. Nevertheless, within the Bayesian framework, the issue has not been taken care of in existing studies. We develop a robust Bayesian variable selection method for G×E interaction studies. The proposed Bayesian method can effectively accommodate heavy-tailed errors and outliers in the response variable while conducting variable selection by accounting for structural sparsity. In particular, the spike-and-slab priors have been imposed on both individual and group levels to identify important main and interaction effects. An efficient Gibbs sampler has been developed to facilitate fast computation. The Markov chain Monte Carlo algorithms of the proposed and alternative methods are efficiently implemented in C++.
Authors:	Jie Ren [aut, cre], Fei Zhou [aut], Xiaoxi Li [aut], Cen Wu [aut]
Maintainer:	Jie Ren <[email protected]>
License:	GPL-2
Version:	0.1.2
Built:	2025-03-11 06:17:31 UTC
Source:	https://github.com/jrhub/roben

roben: Robust Bayesian Variable Selection for Gene-Environment Interactions

Description

Gene-environment (G×E) interactions have important implications to elucidate the etiology of complex diseases beyond the main genetic and environmental effects. Outliers and data contamination in disease phenotypes of G×E studies have been commonly encountered, leading to the development of a broad spectrum of robust penalization methods. Nevertheless, within the Bayesian framework, the issue has not been taken care of in existing studies. We develop a robust Bayesian variable selection method for G×E interaction studies. The proposed Bayesian method can effectively accommodate heavy-tailed errors and outliers in the response variable while conducting variable selection by accounting for structural sparsity. In particular, the spike-and-slab priors have been imposed on both individual and group levels to identify important main and interaction effects. An efficient Gibbs sampler has been developed to facilitate fast computation. The Markov chain Monte Carlo algorithms of the proposed and alternative methods are efficiently implemented in C++.

In this package, we provide a set of robust Bayesian variable selection methods tailored for interaction analysis. A Bayesian formulation of the least absolute deviation (LAD) regression has been adopted to accommodate data contamination and long-tailed distributions in the response/ phenotype. The default method (the proposed method) conducts variable selection by accounting for structural sparsity. In particular, the spike–and–slab priors are imposed on both individual and group levels to identify important main and interaction effects (bi-level/ sparse-group selection).

In addition to the default method, users can also choose different selection structures (group-level-only or individual-level-only), methods without spike–and–slab priors and non-robust methods. In total, roben provides 12 different methods (6 robust and 6 non-robust). Among them, robust methods with spike–and–slab priors and the robust method for bi-level selection have been developed for the first time. Please read the Details below for how to configure the method used.

Details

The user friendly, integrated interface roben() allows users to flexibly choose the fitting methods they prefer. There are three arguments in roben() that control the fitting method:

robust:	whether to use robust methods.

sparse:	whether to use the spike-and-slab priors to create sparsity.

structure:	structural identification. Three choices are available:
	"sparsegroup", "group" and “individual”.

The function roben() returns a roben object that contains the posterior estimates of each coefficients. S3 generic functions GxESelection(), predict() and print() are implemented for roben objects. GxESelection() takes a roben object and returns the variable selection results. predict() takes a roben object and returns the predicted values for new observations.

Author(s)

Maintainer: Jie Ren [email protected]

Authors:

Fei Zhou
Xiaoxi Li
Cen Wu

References

Ren, J., Zhou, F., Li, X., Ma, S., Jiang, Y., and Wu, C. (2023). Robust Bayesian variable selection for gene-environment interactions. Biometrics, 79(2):684-694. doi:10.1111/biom.13670. PMID: 35394058.

Wu, C., and Ma, S. (2015). A selective review of robust variable selection with applications in bioinformatics. Briefings in Bioinformatics, 16(5), 873–883. doi:10.1093/bib/bbu046.

Zhou, F., Ren, J., Lu, X., Ma, S., and Wu, C. (2021). Gene-Environment Interaction: A Variable Selection Perspective. Methods in Molecular Biology, 2212:191-223. doi:10.1007/978-1-0716-0947-7_13. PMID: 33733358.

Ren, J., Zhou, F., Li, X., Chen, Q., Zhang, H., Ma, S., Jiang, Y., and Wu, C. (2020). Semi-parametric Bayesian variable selection for gene-environment interactions. Statistics in Medicine, 39: 617–638. doi:10.1002/sim.8434.

Ren, J., Zhou, F., Li, X., Wu, C., and Jiang, Y. (2019). spinBayes: Semi-Parametric Gene-Environment Interaction via Bayesian Variable Selection. R package version 0.1.0. https://CRAN.R-project.org/package=spinBayes.

Wu, C., Jiang, Y., Ren, J., Cui, Y., and Ma, S. (2018). Dissecting gene-environment interactions: A penalized robust approach accounting for hierarchical structures. Statistics in Medicine, 37:437–456. doi:10.1002/sim.7518.

Wu, C., Shi, X., Cui, Y., and Ma, S. (2015). A penalized robust semiparametric approach for gene-environment interactions. Statistics in Medicine, 34(30): 4016–4030. doi:10.1002/sim.6609.

Wu, C., Cui, Y., and Ma, S. (2014). Integrative analysis of gene–environment interactions under a multi–response partially linear varying coefficient model. Statistics in Medicine, 33(28): 4988–4998. doi:10.1002/sim.6287.

Wu, C., Zhong, P.S., and Cui, Y. (2018). Additive varying–coefficient model for nonlinear gene–environment interactions. Statistical Applications in Genetics and Molecular Biology, 17(2). doi:10.1515/sagmb-2017-0008.

Wu, C., Zhong, P.S., and Cui, Y. (2013). High dimensional variable selection for gene-environment interactions. Technical Report. Michigan State University.

simulated data for demonstrating the features of roben

Description

Simulated gene expression data for demonstrating the features of roben.

Usage

data("GxE_small")
data("GxE_large")
data("GxE_small")
data("GxE_large")

Format

GxE_small consists of five components: X, Y, E, clin and coeff. coeff contains the true values of parameters used for generating Y.

GxE_large contains larger datasets: X2, Y2, E2 and clin2

Details

The data model for generating Y

Use subscript $i$ to denote the $i$ th subject. Let $(X_{i}, Y_{i}, E_{i}, Clin_{i})$ , ( $i=1,\ldots,n$ ) be independent and identically distributed random vectors. $Y_{i}$ is a continuous response variable representing the disease phenotype. $X_{i}$ is the $p$ –dimensional vector of G factors. The environmental factors and clinical covariates are denoted as the $k$ -dimensional vector $E_{i}$ and the $q$ -dimensional vector $Clin_{i}$ , respectively. The $\epsilon$ follows some heavy-tailed distribution. Considering the following model:

$Y_{i} = \alpha_{0} + \sum_{t=1}^{q}\alpha_{t}Clin_{it} + \sum_{m=1}^{k}\theta_{m}E_{im} + \sum_{j=1}^{p}\gamma_{j}X_{ij} + \sum_{j=1}^{p}\sum_{m=1}^{k}\zeta_{jm}E_{im}X_{ij} +\epsilon_{i},$

where $\alpha_{0}$ is the intercept; $\alpha_{t}$ 's, $\theta_{m}$ 's, $\gamma_{j}$ 's and $\zeta_{jm}$ 's are the regression coefficients for the clinical covariates, environmental factors, genetic factors and G $\times$ E interactions, respectively.

Define $\beta_{j}=(\gamma_{j}, \zeta_{j1},\ldots,\zeta_{jk})^\top \equiv (\beta_{j1},\ldots,\beta_{jL})^\top$ and $U_{ij}=(X_{ij},X_{ij}E_{i1}\ldots,X_{ij}E_{ik})^\top \equiv (U_{ij1},\dots,U_{ijL})^\top$ , where $L=k+1$ . The model can be written as

$Y_{i} = \alpha_{0} + \sum_{t=1}^{q}\alpha_{t}Clin_{it} + \sum_{m=1}^{k}\theta_{m}E_{im} + \sum_{j=1}^{p} \big(U_{ij}^\top\beta_{j}\big) +\epsilon_{i},$

where the coefficient vector $\beta_{j}$ represents all the main and interaction effects corresponding to the $j$ th genetic measurement.

The object coeff in GxE_small is a list of four components, corresponding to $\alpha_{0}$ , $\alpha_{t}$ 's, $\theta_{m}$ 's and $\beta_{j}$ 's.

Examples

data(GxE_small)
dim(X)
print(coeff)

data(GxE_large)
dim(X)
print(coeff)

data(GxE_small)
dim(X)
print(coeff)

data(GxE_large)
dim(X)
print(coeff)

Variable selection for a roben object

Description

Variable selection for a roben object

Usage

GxESelection(obj, ...)

## S3 method for class 'Sparse'
GxESelection(obj, burn.in = obj$burn.in, ...)

## S3 method for class 'NonSparse'
GxESelection(obj, burn.in = obj$burn.in, prob = 0.95, ...)
GxESelection(obj, ...)

## S3 method for class 'Sparse'
GxESelection(obj, burn.in = obj$burn.in, ...)

## S3 method for class 'NonSparse'
GxESelection(obj, burn.in = obj$burn.in, prob = 0.95, ...)

Arguments

`obj`	roben object.
`...`	other GxESelection arguments.
`burn.in`	MCMC burn-in.
`prob`	probability for credible interval, between 0 and 1. e.g. prob=0.95 leads to 95% credible interval.

Details

For class ‘Sparse’, the median probability model (MPM) (Barbieri and Berger, 2004) is used to identify predictors that are significantly associated with the response variable. For class ‘NonSparse’, variable selection is based on 95% credible interval. Please check the references for more details about the variable selection.

Value

an object of class ‘GxESelection’ is returned, which is a list with components:

method: method used for identifying important effects.
effects: a list of names of selected effects.
summary: a summary of selected effects.
indicator: a matrix of indicators of selected effects.

References

Ren, J., Zhou, F., Li, X., Ma, S., Jiang, Y. and Wu, C. (2023) Robust Bayesian variable selection for gene-environment interactions. Biometrics, 79(2):684-694. doi:10.1111/biom.13670. PMID: 35394058.

Barbieri, M.M. and Berger, J.O. (2004). Optimal predictive model selection. Ann. Statist, 32(3):870–897

Examples

data(GxE_small)
iter=5000
## sparse
fit=roben(X, Y, E, clin, iterations=iter)
selected=GxESelection(fit)
selected


## non-sparse
fit=roben(X, Y, E, clin, iterations=iter, sparse=FALSE)
selected=GxESelection(fit)
selected


data(GxE_small)
iter=5000
## sparse
fit=roben(X, Y, E, clin, iterations=iter)
selected=GxESelection(fit)
selected


## non-sparse
fit=roben(X, Y, E, clin, iterations=iter, sparse=FALSE)
selected=GxESelection(fit)
selected

make predictions from a roben object

Description

make predictions from a roben object

Usage

## S3 method for class 'roben'
predict(object, X.new, E.new, clin.new = NULL, Y.new = NULL, ...)
## S3 method for class 'roben'
predict(object, X.new, E.new, clin.new = NULL, Y.new = NULL, ...)

Arguments

`object`	roben object.
`X.new`	a matrix of new values for X at which predictions are to be made.
`E.new`	a vector of new values for E at which predictions are to be made.
`clin.new`	a vector or matrix of new values for clin at which predictions are to be made.
`Y.new`	a vector of the response of new observations. If provided, the prediction error will be computed based on Y.new.
`...`	other predict arguments

Details

X.new (E.new) must have the same number of columns as X (E) used for fitting the model. If clin was provided when fit the model, clin.new must not be NULL, and vice versa. The predictions are made based on the posterior estimates of coefficients in the roben object. Note that the main effects of environmental exposures E are not subject to selection.

If Y.new is provided, the prediction error will be computed. For robust methods, the prediction mean absolute deviations (PMAD) will be computed. For non-robust methods, the prediction mean squared error (PMSE) will be computed.

Value

an object of class ‘roben.pred’ is returned, which is a list with components:

`error`	prediction error. error is NULL is Y.new=NULL.
`y.pred`	predicted values of the new observations.

Examples

data(GxE_small)
test=sample((1:nrow(X)), floor(nrow(X)/5))
fit=roben(X[-test,], Y[-test,], E[-test,], clin[-test,], iterations=5000)
predict(fit, X[test,], E[test,], clin[test,], Y[test,])

data(GxE_small)
test=sample((1:nrow(X)), floor(nrow(X)/5))
fit=roben(X[-test,], Y[-test,], E[-test,], clin[-test,], iterations=5000)
predict(fit, X[test,], E[test,], clin[test,], Y[test,])

print a GxESelection object

Description

Print a summary of a GxESelection object

Usage

## S3 method for class 'GxESelection'
print(x, digits = max(3, getOption("digits") - 3), ...)
## S3 method for class 'GxESelection'
print(x, digits = max(3, getOption("digits") - 3), ...)

Arguments

`x`	GxESelection object.
`digits`	significant digits in printout.
`...`	other print arguments

print a roben object

Description

Print a summary of a roben object

Usage

## S3 method for class 'roben'
print(x, digits = max(3, getOption("digits") - 3), ...)
## S3 method for class 'roben'
print(x, digits = max(3, getOption("digits") - 3), ...)

Arguments

`x`	roben object.
`digits`	significant digits in printout.
`...`	other print arguments

print a roben.pred object

Description

Print a summary of a roben.pred object

Usage

## S3 method for class 'roben.pred'
print(x, digits = max(3, getOption("digits") - 3), ...)
## S3 method for class 'roben.pred'
print(x, digits = max(3, getOption("digits") - 3), ...)

Arguments

`x`	roben.pred object.
`digits`	significant digits in printout.
`...`	other print arguments

fit a robust Bayesian variable selection

Description

fit a robust Bayesian variable selection model for G×E interactions.

Usage

roben(
  X,
  Y,
  E,
  clin = NULL,
  iterations = 10000,
  burn.in = NULL,
  robust = TRUE,
  sparse = TRUE,
  structure = c("sparsegroup", "group", "individual"),
  hyper = NULL,
  debugging = FALSE
)
roben(
  X,
  Y,
  E,
  clin = NULL,
  iterations = 10000,
  burn.in = NULL,
  robust = TRUE,
  sparse = TRUE,
  structure = c("sparsegroup", "group", "individual"),
  hyper = NULL,
  debugging = FALSE
)

Arguments

`X`	the matrix of predictors (genetic factors) without intercept. Each row should be an observation vector. X will be centered and a column of 1 will be added to the X matrix as the intercept.
`Y`	the response variable. The current version of roben only supports continuous response.
`E`	a matrix of environmental factors. E will be centered. The interaction terms between X (G factors) and E will be automatically created and included in the model.
`clin`	a matrix of clinical variables. Clinical variables are not subject to penalty.
`iterations`	the number of MCMC iterations.
`burn.in`	the number of iterations for burn-in.
`robust`	logical flag. If TRUE, robust methods will be used.
`sparse`	logical flag. If TRUE, spike-and-slab priors will be used to shrink coefficients of irrelevant covariates to zero exactly.
`structure`	three choices are available. "sparsegroup" for sparse-group selection, which is a bi-level selection on both group-level and individual-level. "group" for selection on group-level only. "individual" for selection on individual-level only.
`hyper`	a named list of hyperparameters.
`debugging`	logical flag. If TRUE, progress will be output to the console and extra information will be returned.

Details

Consider the data model described in "data":

$Y_{i} = \alpha_{0} + \sum_{t=1}^{q}\alpha_{t}Clin_{it} + \sum_{m=1}^{k}\theta_{m}E_{im} + \sum_{j=1}^{p} \big(U_{ij}^\top\beta_{j}\big) +\epsilon_{i},$

where the main and interaction effects of the $j$ th genetic variant is corresponding to the coefficient vector $\beta_{j}=(\beta_{j1}, \beta_{j2},\ldots,\beta_{jL})^\top$ .

When structure="sparsegroup" (default setting), selection will be conducted on both individual and group levels (bi-level selection):

Group-level selection: by determining whether $||\beta_{j}||_{2}=0$ , we can know if the $j$ th genetic variant has any effect at all.
Individual-level selection: investigate whether the $j$ th genetic variant has main effect, G $\times$ E interaction or both, by determining which components in $\beta_{j}$ has non-zero values.

If structure="group", only group-level selection will be conducted on $||\beta_{j}||_{2}$ . If structure="individual", only individual-level selection will be conducted on each $\beta_{jl}$ , ( $l=1,\ldots,L$ ).

When sparse=TRUE (default), spike–and–slab priors are imposed on individual and/or group levels to identify important main and interaction effects. Otherwise, Laplacian shrinkage will be used.

When robust=TRUE (default), the distribution of $\epsilon_{i}$ is defined as a Laplace distribution with density $f(\epsilon_{i}|\nu) = \frac{\nu}{2}\exp\left\{-\nu |\epsilon_{i}|\right\}$ , ( $i=1,\dots,n$ ), which leads to a Bayesian formulation of LAD regression. If robust=FALSE, $\epsilon_{i}$ follows a normal distribution.

Both $X$ and $E$ will be centered before the generation of interaction terms, in order to prevent the multicollinearity between main effects and interaction terms.

Users can modify the hyper-parameters by providing a named list of hyper-parameters via the argument ‘hyper’. The list can have the following named components

a0, b0: shape parameters of the Beta priors ( $\pi^{a_{0}-1}(1-\pi)^{b_{0}-1}$ ) on $\pi_{0}$ .
a1, b1: shape parameters of the Beta priors ( $\pi^{a_{1}-1}(1-\pi)^{b_{1}-1}$ ) on $\pi_{1}$ .
c1, c2: the shape parameter and the rate parameter of the Gamma prior on $\nu$ .
d1, d2: the shape parameter and the rate parameter of the Gamma priors on $\eta$ .

Please check the references for more details about the prior distributions.

Examples

data(GxE_small)

## default method
iter=5000
fit=roben(X, Y, E, clin, iterations = iter)
fit$coefficient

## Ture values of parameters of main G effects and interactions
coeff$GE

## Compute TP and FP
sel=GxESelection(fit)
pos=which(sel$indicator != 0)
tp=length(intersect(which(coeff$GE != 0), pos))
fp=length(pos)-tp
list(tp=tp, fp=fp)


## alternative: robust group selection
fit=roben(X, Y, E, clin, iterations=iter, structure="g")
fit$coefficient

## alternative: non-robust sparse group selection
fit=roben(X, Y, E, clin, iterations=iter, robust=FALSE)
fit$coefficient


data(GxE_small)

## default method
iter=5000
fit=roben(X, Y, E, clin, iterations = iter)
fit$coefficient

## Ture values of parameters of main G effects and interactions
coeff$GE

## Compute TP and FP
sel=GxESelection(fit)
pos=which(sel$indicator != 0)
tp=length(intersect(which(coeff$GE != 0), pos))
fp=length(pos)-tp
list(tp=tp, fp=fp)


## alternative: robust group selection
fit=roben(X, Y, E, clin, iterations=iter, structure="g")
fit$coefficient

## alternative: non-robust sparse group selection
fit=roben(X, Y, E, clin, iterations=iter, robust=FALSE)
fit$coefficient

Package 'roben'

Help Index

roben: Robust Bayesian Variable Selection for Gene-Environment Interactions

Description

Details

Author(s)

References

See Also

simulated data for demonstrating the features of roben

Description

Usage

Format

Details

See Also

Examples

Variable selection for a roben object

Description

Usage

Arguments

Details

Value

References

See Also

Examples

make predictions from a roben object

Description

Usage

Arguments

Details

Value

See Also

Examples

print a GxESelection object

Description

Usage

Arguments

See Also

print a roben object

Description

Usage

Arguments

See Also

print a roben.pred object

Description

Usage

Arguments

See Also

fit a robust Bayesian variable selection

Description

Usage

Arguments

Details

See Also

Examples