Skip to contents

Installing & Loading Package

We can use the devtools package to install the development version of MethylSurroGetR from GitHub.

# install.packages("devtools")
devtools::install_github("jagoode27/MethylSurroGetR")

Once the package is installed, we can load the package.

Loading Sample Data

At a minimum, MethylSurroGetR requires two input objects: a methylation matrix and a named vector, matrix, or data frame with surrogate weights.

Methylation Matrix

We’ll start by loading our matrix of beta values.

data("beta_matrix_miss")
print(beta_matrix_miss)
#>          samp1      samp2      samp3     samp4      samp5
#> cg01 0.1028646 0.32037324 0.48290240        NA 0.36948887
#> cg02 0.2875775         NA         NA 0.8830174         NA
#> cg04 0.4348927 0.18769112 0.89035022 0.1422943 0.98421920
#> cg05 0.9849570 0.78229430 0.91443819 0.5492847 0.15420230
#> cg07 0.8998250 0.24608773         NA 0.3279207 0.95450365
#> cg08 0.8895393 0.69280341 0.64050681 0.9942698 0.65570580
#> cg09        NA 0.09359499 0.60873498 0.9540912         NA
#> cg10 0.8864691         NA         NA 0.5854834 0.14190691
#> cg12 0.1750527         NA 0.14709469        NA 0.69000710
#> cg13 0.9630242 0.90229905 0.69070528 0.7954674 0.02461368
#> cg14 0.1306957         NA 0.93529980 0.6478935         NA
#> cg16 0.6531019 0.33282354 0.30122890 0.3198206 0.89139412
#> cg17 0.1428000 0.41454634 0.41372433 0.3688455 0.15244475
#> cg19 0.3435165 0.48861303 0.06072057 0.3077200         NA
#> cg20 0.6567581 0.95447383 0.94772694 0.2197676         NA

Notice that probe IDs are in row names and sample IDs are in column names. This is the required orientation in MethylSurroGetR.

Surrogate Weights

Next, we’ll load a named vector of weights to construct the surrogate.

data("wts_vec_lin")
print(wts_vec_lin)
#>         cg02         cg03         cg06         cg07         cg08         cg11 
#> -0.009083377 -0.001155999  0.005978497 -0.007562015  0.001218960 -0.005869372 
#>         cg13         cg15         cg17         cg18    Intercept 
#> -0.007449367  0.005066157  0.007900907 -0.002510744  1.211000000

Building mehtyl_surro Data Object

In order to generate predicted values from surrogate weights, we need to construct a mehtyl_surro object. To do this, we will use the surro_set() function and specify the methylation matrix with the methyl and the weights object with the weights option. In addition, we need to specify the character string for the name of the intercept in the weights object (if applicable) with the intercept option.

my_surrogate <- surro_set(methyl = beta_matrix_miss,
                          weights = wts_vec_lin,
                          intercept = "Intercept")
print(my_surrogate)
#> $methyl
#>          samp1     samp2     samp3     samp4      samp5
#> cg02 0.2875775        NA        NA 0.8830174         NA
#> cg07 0.8998250 0.2460877        NA 0.3279207 0.95450365
#> cg08 0.8895393 0.6928034 0.6405068 0.9942698 0.65570580
#> cg13 0.9630242 0.9022990 0.6907053 0.7954674 0.02461368
#> cg17 0.1428000 0.4145463 0.4137243 0.3688455 0.15244475
#> cg03        NA        NA        NA        NA         NA
#> cg06        NA        NA        NA        NA         NA
#> cg11        NA        NA        NA        NA         NA
#> cg15        NA        NA        NA        NA         NA
#> cg18        NA        NA        NA        NA         NA
#> 
#> $weights
#>         cg02         cg03         cg06         cg07         cg08         cg11 
#> -0.009083377 -0.001155999  0.005978497 -0.007562015  0.001218960 -0.005869372 
#>         cg13         cg15         cg17         cg18 
#> -0.007449367  0.005066157  0.007900907 -0.002510744 
#> 
#> $intercept
#> Intercept 
#>     1.211 
#> 
#> attr(,"class")
#> [1] "methyl_surro"

We can see that the resulting methyl_surro object has three elements that correspond to the inputs: methyl, weights, and intercept. In addition to placing each of the inputs into their own object element, the function has also subset the methylation matrix to include all of the probes included in the weights (and nothing more). Although it makes little difference with this toy example, methylation matrices in the real world contain hundeds of thousands of probes, while many surrogates are constructed using tens of thousands of probes.

Handling Missing Values

Notice that there are a number of missing values in the methyl element of my_surrogate. Our next step is to address these.

Two Types of Missing Values

MethylSurroGetR distinguishes between two types of missing values:

  • Missing Observatons: These are probes that were measured in the target data, with missing values for some samples.
    • Notice that cg01 was missing in our methylation matrix for samp2, samp3, and samp5. Similarly, cg07 was missing for samp5.
  • Missing Probes: These are probes that are not measured in the target data. Therefore, they are missing for all samples in the methylation matrix. This might have occurred because those probes was removed during QC or because the array did not include the probes.
    • Notice that cg03, cg06, cg11, cg15, and cg18 were included among the surrogate weights, but were not in our methylation data.
samp1 samp2 samp3 samp4 samp5
cg02 0.288 NA NA 0.883 NA
cg07 0.900 0.246 NA 0.328 0.955
cg08 0.890 0.693 0.641 0.994 0.656
cg13 0.963 0.902 0.691 0.795 0.025
cg17 0.143 0.415 0.414 0.369 0.152
cg03 NA NA NA NA NA
cg06 NA NA NA NA NA
cg11 NA NA NA NA NA
cg15 NA NA NA NA NA
cg18 NA NA NA NA NA

Checking for Missing Values

Next, we can check our new methylation matrix for missing values. This seems silly with our small example, but can be much more helpful when working with real data.

missing <- methyl_miss(methyl_surro = my_surrogate)
print(missing)
#> $missing_obs
#> cg02 cg07 
#>  0.6  0.2 
#> 
#> $missing_probes
#> [1] "cg03" "cg06" "cg11" "cg15" "cg18"

Notice that the object created by methyl_miss() contains an element for each type of missing data:

  • For missing observations (missing_obs), the function returns a named vector of the proportion of samples that are missing each of the probes.
  • For missing probes (missing_probes), the function returns a character vector of probes that are missing for all samples.

Imputing Missing Observations

Now that we know about our missing values, we’ll begin addressing them by imputing missing observations with impute_obs(). Required inputs to this function are the methyl_surro object, a character string to denote the method (currently available options are “mean” and “median”). The optional min_nonmiss_prop option can be specified to impute observations for only those probes that are above the specified proportion of non-missing values.

my_surrogate <- impute_obs(methyl_surro = my_surrogate,
                           method = "mean")
print(my_surrogate)
#> $methyl
#>          samp1     samp2     samp3     samp4      samp5
#> cg02 0.2875775 0.5852975 0.5852975 0.8830174 0.58529746
#> cg07 0.8998250 0.2460877 0.6070843 0.3279207 0.95450365
#> cg08 0.8895393 0.6928034 0.6405068 0.9942698 0.65570580
#> cg13 0.9630242 0.9022990 0.6907053 0.7954674 0.02461368
#> cg17 0.1428000 0.4145463 0.4137243 0.3688455 0.15244475
#> cg03        NA        NA        NA        NA         NA
#> cg06        NA        NA        NA        NA         NA
#> cg11        NA        NA        NA        NA         NA
#> cg15        NA        NA        NA        NA         NA
#> cg18        NA        NA        NA        NA         NA
#> 
#> $weights
#>         cg02         cg03         cg06         cg07         cg08         cg11 
#> -0.009083377 -0.001155999  0.005978497 -0.007562015  0.001218960 -0.005869372 
#>         cg13         cg15         cg17         cg18 
#> -0.007449367  0.005066157  0.007900907 -0.002510744 
#> 
#> $intercept
#> Intercept 
#>     1.211 
#> 
#> attr(,"class")
#> [1] "methyl_surro"

Filling Missing Probes from Reference Data

Now that we’ve imputed missing observations, we need to account for missing probes (i.e., those that are missing for all samples). For this, we’ll need to draw on a reference panel (ref_vec_mean) to fill in values with the reference_fill() function.

data("ref_vec_mean")
print(ref_vec_mean)
#>      cg01      cg02      cg03      cg04      cg05      cg06      cg07      cg08 
#> 0.3992451 0.6616689 0.4948262 0.5278895 0.6770353 0.5526592 0.4940793 0.7745650 
#>      cg09      cg10      cg11      cg12      cg13      cg14      cg15      cg16 
#> 0.5281033 0.4982656 0.4566024 0.3856340 0.6752219 0.5866269 0.4004940 0.4996738 
#>      cg17      cg18      cg19      cg20 
#> 0.2984722 0.3923206 0.3747138 0.7031609

We’ll need to specify our methyl_surro object, the reference object (requires named vector, matrix, or data frame), and the type of values to fill. Since we’ve already accounted for missing observations, we’ll use type = "probes" to fill missing probes. Alternatively, the function can be used to fill missing observations (type = "obs"), or missing observations and probes simultaneously (type = "all").

my_surrogate <- reference_fill(methyl_surro = my_surrogate,
                               reference = ref_vec_mean,
                               type = "probes")
print(my_surrogate)
#> $methyl
#>          samp1     samp2     samp3     samp4      samp5
#> cg02 0.2875775 0.5852975 0.5852975 0.8830174 0.58529746
#> cg07 0.8998250 0.2460877 0.6070843 0.3279207 0.95450365
#> cg08 0.8895393 0.6928034 0.6405068 0.9942698 0.65570580
#> cg13 0.9630242 0.9022990 0.6907053 0.7954674 0.02461368
#> cg17 0.1428000 0.4145463 0.4137243 0.3688455 0.15244475
#> cg03 0.4948262 0.4948262 0.4948262 0.4948262 0.49482616
#> cg06 0.5526592 0.5526592 0.5526592 0.5526592 0.55265924
#> cg11 0.4566024 0.4566024 0.4566024 0.4566024 0.45660238
#> cg15 0.4004940 0.4004940 0.4004940 0.4004940 0.40049405
#> cg18 0.3923206 0.3923206 0.3923206 0.3923206 0.39232059
#> 
#> $weights
#>         cg02         cg03         cg06         cg07         cg08         cg11 
#> -0.009083377 -0.001155999  0.005978497 -0.007562015  0.001218960 -0.005869372 
#>         cg13         cg15         cg17         cg18 
#> -0.007449367  0.005066157  0.007900907 -0.002510744 
#> 
#> $intercept
#> Intercept 
#>     1.211 
#> 
#> attr(,"class")
#> [1] "methyl_surro"

Estimating Predicted Surrogate Values

We’re finally ready to generate predicted values! To do this, we’ll use the surro_calc() function. We once again need to specify methyl_surro object, as well as the transform method. In this case, our surrogate weights were created using an elastic net regression model with the Gaussian link function, so we’ll use transform = "linear". For surrogate weights developed using the logit link function, predicted probabilities can be estimated with transform = "probability". For those developed using the Poisson link function, predicted counts can be generated with transform = "count".

estimates <- surro_calc(methyl_surro = my_surrogate,
                        transform = "linear")
print(estimates)
#>    samp1    samp2    samp3    samp4    samp5 
#> 1.197718 1.202317 1.201093 1.199796 1.201382

Piping Commands Together

To simplify the process of estimating predicted surrogate values, we can pipe the commands in a single string to clean up our code.

estimates <- beta_matrix_miss |>
  surro_set(weights = wts_vec_lin, intercept = "Intercept") |>
  impute_obs(method = "mean") |>
  reference_fill(reference = ref_vec_mean, type = "probes") |>
  surro_calc(transform = "linear")
print(estimates)
#>    samp1    samp2    samp3    samp4    samp5 
#> 1.197718 1.202317 1.201093 1.199796 1.201382