Installing & Loading Package
We can use the devtools
package to install the
development version of MethylSurroGetR
from GitHub.
# install.packages("devtools")
devtools::install_github("jagoode27/MethylSurroGetR")
Once the package is installed, we can load the package.
Loading Sample Data
At a minimum, MethylSurroGetR
requires two input
objects: a methylation matrix and a named vector, matrix, or data frame
with surrogate weights.
Methylation Matrix
We’ll start by loading our matrix of beta values.
data("beta_matrix_miss")
print(beta_matrix_miss)
#> samp1 samp2 samp3 samp4 samp5
#> cg01 0.1028646 0.32037324 0.48290240 NA 0.36948887
#> cg02 0.2875775 NA NA 0.8830174 NA
#> cg04 0.4348927 0.18769112 0.89035022 0.1422943 0.98421920
#> cg05 0.9849570 0.78229430 0.91443819 0.5492847 0.15420230
#> cg07 0.8998250 0.24608773 NA 0.3279207 0.95450365
#> cg08 0.8895393 0.69280341 0.64050681 0.9942698 0.65570580
#> cg09 NA 0.09359499 0.60873498 0.9540912 NA
#> cg10 0.8864691 NA NA 0.5854834 0.14190691
#> cg12 0.1750527 NA 0.14709469 NA 0.69000710
#> cg13 0.9630242 0.90229905 0.69070528 0.7954674 0.02461368
#> cg14 0.1306957 NA 0.93529980 0.6478935 NA
#> cg16 0.6531019 0.33282354 0.30122890 0.3198206 0.89139412
#> cg17 0.1428000 0.41454634 0.41372433 0.3688455 0.15244475
#> cg19 0.3435165 0.48861303 0.06072057 0.3077200 NA
#> cg20 0.6567581 0.95447383 0.94772694 0.2197676 NA
Notice that probe IDs are in row names and sample IDs are in column
names. This is the required orientation in
MethylSurroGetR
.
Building mehtyl_surro
Data Object
In order to generate predicted values from surrogate weights, we need
to construct a mehtyl_surro
object. To do this, we will use
the surro_set()
function and specify the methylation matrix
with the methyl
and the weights object with the
weights
option. In addition, we need to specify the
character string for the name of the intercept in the weights object (if
applicable) with the intercept
option.
my_surrogate <- surro_set(methyl = beta_matrix_miss,
weights = wts_vec_lin,
intercept = "Intercept")
print(my_surrogate)
#> $methyl
#> samp1 samp2 samp3 samp4 samp5
#> cg02 0.2875775 NA NA 0.8830174 NA
#> cg07 0.8998250 0.2460877 NA 0.3279207 0.95450365
#> cg08 0.8895393 0.6928034 0.6405068 0.9942698 0.65570580
#> cg13 0.9630242 0.9022990 0.6907053 0.7954674 0.02461368
#> cg17 0.1428000 0.4145463 0.4137243 0.3688455 0.15244475
#> cg03 NA NA NA NA NA
#> cg06 NA NA NA NA NA
#> cg11 NA NA NA NA NA
#> cg15 NA NA NA NA NA
#> cg18 NA NA NA NA NA
#>
#> $weights
#> cg02 cg03 cg06 cg07 cg08 cg11
#> -0.009083377 -0.001155999 0.005978497 -0.007562015 0.001218960 -0.005869372
#> cg13 cg15 cg17 cg18
#> -0.007449367 0.005066157 0.007900907 -0.002510744
#>
#> $intercept
#> Intercept
#> 1.211
#>
#> attr(,"class")
#> [1] "methyl_surro"
We can see that the resulting methyl_surro
object has
three elements that correspond to the inputs: methyl
,
weights
, and intercept
. In addition to placing
each of the inputs into their own object element, the function has also
subset the methylation matrix to include all of the probes included in
the weights (and nothing more). Although it makes little difference with
this toy example, methylation matrices in the real world contain hundeds
of thousands of probes, while many surrogates are constructed using tens
of thousands of probes.
Handling Missing Values
Notice that there are a number of missing values in the
methyl
element of my_surrogate
. Our next step
is to address these.
Two Types of Missing Values
MethylSurroGetR
distinguishes between two types of
missing values:
-
Missing Observatons: These are
probes that were measured in the target data, with missing values for
some samples.
- Notice that
cg01
was missing in our methylation matrix forsamp2
,samp3
, andsamp5
. Similarly,cg07
was missing forsamp5
.
- Notice that
-
Missing Probes: These are probes
that are not measured in the target data. Therefore, they are missing
for all samples in the methylation matrix. This might have occurred
because those probes was removed during QC or because the array did not
include the probes.
- Notice that
cg03
,cg06
,cg11
,cg15
, andcg18
were included among the surrogate weights, but were not in our methylation data.
- Notice that
samp1 | samp2 | samp3 | samp4 | samp5 | |
---|---|---|---|---|---|
cg02 | 0.288 | NA | NA | 0.883 | NA |
cg07 | 0.900 | 0.246 | NA | 0.328 | 0.955 |
cg08 | 0.890 | 0.693 | 0.641 | 0.994 | 0.656 |
cg13 | 0.963 | 0.902 | 0.691 | 0.795 | 0.025 |
cg17 | 0.143 | 0.415 | 0.414 | 0.369 | 0.152 |
cg03 | NA | NA | NA | NA | NA |
cg06 | NA | NA | NA | NA | NA |
cg11 | NA | NA | NA | NA | NA |
cg15 | NA | NA | NA | NA | NA |
cg18 | NA | NA | NA | NA | NA |
Checking for Missing Values
Next, we can check our new methylation matrix for missing values. This seems silly with our small example, but can be much more helpful when working with real data.
missing <- methyl_miss(methyl_surro = my_surrogate)
print(missing)
#> $missing_obs
#> cg02 cg07
#> 0.6 0.2
#>
#> $missing_probes
#> [1] "cg03" "cg06" "cg11" "cg15" "cg18"
Notice that the object created by methyl_miss()
contains
an element for each type of missing data:
- For missing observations (
missing_obs
), the function returns a named vector of the proportion of samples that are missing each of the probes. - For missing probes (
missing_probes
), the function returns a character vector of probes that are missing for all samples.
Imputing Missing Observations
Now that we know about our missing values, we’ll begin addressing
them by imputing missing observations with impute_obs()
.
Required inputs to this function are the methyl_surro
object, a character string to denote the method (currently available
options are “mean” and “median”). The optional
min_nonmiss_prop
option can be specified to impute
observations for only those probes that are above the specified
proportion of non-missing values.
my_surrogate <- impute_obs(methyl_surro = my_surrogate,
method = "mean")
print(my_surrogate)
#> $methyl
#> samp1 samp2 samp3 samp4 samp5
#> cg02 0.2875775 0.5852975 0.5852975 0.8830174 0.58529746
#> cg07 0.8998250 0.2460877 0.6070843 0.3279207 0.95450365
#> cg08 0.8895393 0.6928034 0.6405068 0.9942698 0.65570580
#> cg13 0.9630242 0.9022990 0.6907053 0.7954674 0.02461368
#> cg17 0.1428000 0.4145463 0.4137243 0.3688455 0.15244475
#> cg03 NA NA NA NA NA
#> cg06 NA NA NA NA NA
#> cg11 NA NA NA NA NA
#> cg15 NA NA NA NA NA
#> cg18 NA NA NA NA NA
#>
#> $weights
#> cg02 cg03 cg06 cg07 cg08 cg11
#> -0.009083377 -0.001155999 0.005978497 -0.007562015 0.001218960 -0.005869372
#> cg13 cg15 cg17 cg18
#> -0.007449367 0.005066157 0.007900907 -0.002510744
#>
#> $intercept
#> Intercept
#> 1.211
#>
#> attr(,"class")
#> [1] "methyl_surro"
Filling Missing Probes from Reference Data
Now that we’ve imputed missing observations, we need to account for
missing probes (i.e., those that are missing for all samples). For this,
we’ll need to draw on a reference panel (ref_vec_mean
) to
fill in values with the reference_fill()
function.
data("ref_vec_mean")
print(ref_vec_mean)
#> cg01 cg02 cg03 cg04 cg05 cg06 cg07 cg08
#> 0.3992451 0.6616689 0.4948262 0.5278895 0.6770353 0.5526592 0.4940793 0.7745650
#> cg09 cg10 cg11 cg12 cg13 cg14 cg15 cg16
#> 0.5281033 0.4982656 0.4566024 0.3856340 0.6752219 0.5866269 0.4004940 0.4996738
#> cg17 cg18 cg19 cg20
#> 0.2984722 0.3923206 0.3747138 0.7031609
We’ll need to specify our methyl_surro
object, the
reference
object (requires named vector, matrix, or data
frame), and the type
of values to fill. Since we’ve already
accounted for missing observations, we’ll use
type = "probes"
to fill missing probes. Alternatively, the
function can be used to fill missing observations
(type = "obs"
), or missing observations and probes
simultaneously (type = "all"
).
my_surrogate <- reference_fill(methyl_surro = my_surrogate,
reference = ref_vec_mean,
type = "probes")
print(my_surrogate)
#> $methyl
#> samp1 samp2 samp3 samp4 samp5
#> cg02 0.2875775 0.5852975 0.5852975 0.8830174 0.58529746
#> cg07 0.8998250 0.2460877 0.6070843 0.3279207 0.95450365
#> cg08 0.8895393 0.6928034 0.6405068 0.9942698 0.65570580
#> cg13 0.9630242 0.9022990 0.6907053 0.7954674 0.02461368
#> cg17 0.1428000 0.4145463 0.4137243 0.3688455 0.15244475
#> cg03 0.4948262 0.4948262 0.4948262 0.4948262 0.49482616
#> cg06 0.5526592 0.5526592 0.5526592 0.5526592 0.55265924
#> cg11 0.4566024 0.4566024 0.4566024 0.4566024 0.45660238
#> cg15 0.4004940 0.4004940 0.4004940 0.4004940 0.40049405
#> cg18 0.3923206 0.3923206 0.3923206 0.3923206 0.39232059
#>
#> $weights
#> cg02 cg03 cg06 cg07 cg08 cg11
#> -0.009083377 -0.001155999 0.005978497 -0.007562015 0.001218960 -0.005869372
#> cg13 cg15 cg17 cg18
#> -0.007449367 0.005066157 0.007900907 -0.002510744
#>
#> $intercept
#> Intercept
#> 1.211
#>
#> attr(,"class")
#> [1] "methyl_surro"
Estimating Predicted Surrogate Values
We’re finally ready to generate predicted values! To do this, we’ll
use the surro_calc()
function. We once again need to
specify methyl_surro
object, as well as the
transform
method. In this case, our surrogate weights were
created using an elastic net regression model with the Gaussian link
function, so we’ll use transform = "linear"
. For surrogate
weights developed using the logit link function, predicted probabilities
can be estimated with transform = "probability"
. For those
developed using the Poisson link function, predicted counts can be
generated with transform = "count"
.
estimates <- surro_calc(methyl_surro = my_surrogate,
transform = "linear")
print(estimates)
#> samp1 samp2 samp3 samp4 samp5
#> 1.197718 1.202317 1.201093 1.199796 1.201382
Piping Commands Together
To simplify the process of estimating predicted surrogate values, we can pipe the commands in a single string to clean up our code.
estimates <- beta_matrix_miss |>
surro_set(weights = wts_vec_lin, intercept = "Intercept") |>
impute_obs(method = "mean") |>
reference_fill(reference = ref_vec_mean, type = "probes") |>
surro_calc(transform = "linear")
print(estimates)
#> samp1 samp2 samp3 samp4 samp5
#> 1.197718 1.202317 1.201093 1.199796 1.201382