regcal_example
regcal_example.Rmd
Introduction
The RegCalReliab package provides a unified framework for regression calibration under both external and internal reliability study designs. External reliability studies collect replicate measurements on a separate sample, while internal reliability studies collect replicates within the main study cohort. In both settings, regression calibration replaces the error-prone exposures with their estimated conditional expectations, thereby correcting bias and improving confidence interval coverage.
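Concretely, for a single exposure with replicate measurements under the classical additive error model, the calibration step takes a familiar form (a standard textbook formulation shown here for intuition; the package handles the multivariate case with error-free covariates):

$$Z_{ij} = X_i + U_{ij}, \qquad U_{ij} \sim N(0, \sigma_u^2), \qquad j = 1, \dots, k_i,$$

$$\widehat{E}[X_i \mid \bar{Z}_i] = \hat{\mu}_x + \frac{\hat{\sigma}_x^2}{\hat{\sigma}_x^2 + \hat{\sigma}_u^2 / k_i}\left(\bar{Z}_i - \hat{\mu}_x\right),$$

where $\bar{Z}_i$ is the mean of subject $i$'s $k_i$ replicates and the variance components are estimated from the reliability study; the outcome model is then fit with $\widehat{E}[X_i \mid \bar{Z}_i]$ in place of the error-prone measurement.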
This document demonstrates the use of RegCalReliab through simulation studies under a logistic regression model. We generate data with two error-prone exposures (z1, z2) and two error-free covariates (W1, W2), with a true odds ratio of 1.5 for each exposure. Results are compared between naïve analyses (ignoring measurement error) and regression calibration, highlighting the bias correction and improved inference provided by the method.
1. Setup
In this section we load packages, fix the seed, and define helper functions. We set the true slope = log(1.5), corresponding to a true odds ratio (OR) of 1.5 for each error-prone exposure.
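The setup chunk itself is not displayed here; a minimal sketch consistent with the code below follows. The seed value and the exact form of add_err() are assumptions, with the error variance 0.4 inferred from the sqrt(0.4) noise added to the main-study measurements.

library(RegCalReliab)
library(mgcv)   # provides rmvn() for drawing correlated true exposures

set.seed(1)     # assumed seed; the actual value is not shown

beta    = log(1.5)  # true log-OR for each error-prone exposure
OR_true = 1.5       # true OR used in the coverage check
B       = 100       # number of simulation replicates

# Classical additive measurement error with variance 0.4 (assumed to match
# the sqrt(0.4) noise added to the main-study measurements)
add_err = function(x) x + rnorm(length(x), 0, sqrt(0.4))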
2. One simulation replicate for the external logistic design
A single call to simulate_once() generates one dataset and applies regression calibration.
The main study has 1500 subjects.
The external reliability study has 1500 subjects with 2, 3, or 4 replicates per subject (padded with NA).
Error-free covariates W1 (continuous) and W2 (binary) are included.
The outcome Y is generated under a logistic model with slope log(1.5) for both exposures.
We call RC_ExReliab() with link = "logistic" to obtain both naïve and corrected estimates.
simulate_once = function() {
  # ---- True covariates ----
  x = mgcv::rmvn(3000, c(0, 0), matrix(c(1, 0.3, 0.3, 1), 2))
  # Error-free covariates (W1 = continuous, W2 = binary)
  W1_main = rnorm(1500)
  W2_main = rbinom(1500, 1, 0.5)
  W1_rep  = rnorm(1500)
  W2_rep  = rbinom(1500, 1, 0.5)
  # ---- Main study error-prone Z ----
  z.main = x[1:1500, 1:2] + matrix(rnorm(1500 * 2, 0, sqrt(0.4)), 1500, 2)
  colnames(z.main) = c("z1", "z2")
  # ---- External replicates for Z ----
  z1_rep = rbind(
    cbind(add_err(x[1501:2000, 1]), add_err(x[1501:2000, 1]), NA, NA),
    cbind(add_err(x[2001:2400, 1]), add_err(x[2001:2400, 1]), add_err(x[2001:2400, 1]), NA),
    cbind(add_err(x[2401:3000, 1]), add_err(x[2401:3000, 1]), add_err(x[2401:3000, 1]), add_err(x[2401:3000, 1]))
  )
  colnames(z1_rep) = paste0("z1_", 1:4)
  z2_rep = rbind(
    cbind(add_err(x[1501:2000, 2]), add_err(x[1501:2000, 2]), NA, NA),
    cbind(add_err(x[2001:2400, 2]), add_err(x[2001:2400, 2]), add_err(x[2001:2400, 2]), NA),
    cbind(add_err(x[2401:3000, 2]), add_err(x[2401:3000, 2]), add_err(x[2401:3000, 2]), add_err(x[2401:3000, 2]))
  )
  colnames(z2_rep) = paste0("z2_", 1:4)
  # ---- Outcome ----
  p = plogis(-2.3 + beta * rowSums(x[1:1500, ]) + 0.5 * W1_main - 0.3 * W2_main)
  Y = rbinom(1500, 1, p)
  main_data = data.frame(
    Y  = Y,
    z1 = z.main[, "z1"],
    z2 = z.main[, "z2"],
    W1 = W1_main,
    W2 = W2_main
  )
  rep_data = data.frame(z1_rep, z2_rep, W1 = W1_rep, W2 = W2_rep, check.names = FALSE)
  # ---- Regression Calibration ----
  res = RC_ExReliab(
    formula   = Y ~ z1(z1_1, z1_2, z1_3, z1_4) + z2(z2_1, z2_2, z2_3, z2_4) + W1 + W2,
    main_data = main_data,
    rep_data  = rep_data,
    link      = "logistic"
  )
  return(res)
}
3. 100 simulation replicates
We repeat the entire simulation 100 times. Each run returns a results object containing two tables:
uncorrected = naïve logistic regression ignoring measurement error.
corrected = regression calibration estimates adjusted for error.
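The replication chunk itself is not shown; a minimal sketch follows, assuming each results object exposes the two tables as uncorrected and corrected elements (the names used above).

results        = replicate(B, simulate_once(), simplify = FALSE)
naive_tabs     = lapply(results, function(res) res$uncorrected)  # naïve tables
corrected_tabs = lapply(results, function(res) res$corrected)    # RC tables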
4. Average estimates across simulations
We compute the Monte Carlo average of the parameter estimates across the 100 runs.
avg_naive = Reduce("+", naive_tabs) / B
avg_corrected = Reduce("+", corrected_tabs) / B
cat("\nAverage Naive Logistic Estimates (B = ", B, "):\n", sep = "")
#>
#> Average Naive Logistic Estimates (B = 100):
print(round(avg_naive, 5))
#> Estimate Std. Error z value Pr(>|z|) OR CI.low CI.high
#> (Intercept) -2.42815 0.10346 -23.48130 0.00000 0.08867 0.07247 0.10850
#> z1 0.31217 0.07817 3.98303 0.00788 1.37202 1.17689 1.59956
#> z2 0.30522 0.07817 3.90866 0.00644 1.36085 1.16754 1.58622
#> W1 0.48754 0.09118 5.34146 0.00003 1.63443 1.36664 1.95482
#> W2 -0.29440 0.17911 -1.63803 0.20679 0.75665 0.53294 1.07444
cat("\nAverage Corrected Logistic Estimates (B = ", B, "):\n", sep = "")
#>
#> Average Corrected Logistic Estimates (B = 100):
print(round(avg_corrected, 5))
#> Estimate Std. Error z value Pr(>|z|) OR CI.low CI.high
#> (Intercept) -2.42815 0.10358 -23.46051 2.00000 0.08867 0.07245 0.10853
#> z1 0.40940 0.11609 3.52319 0.02041 1.51921 1.20951 1.90864
#> z2 0.39907 0.11638 3.43776 0.01553 1.50019 1.19418 1.88504
#> W1 0.48707 0.09112 5.35650 0.00003 1.63355 1.36619 1.95351
#> W2 -0.29554 0.17953 -1.64100 1.76609 0.75584 0.53191 1.07419
5. Coverage-rate (CR) calculation
Finally, we evaluate whether the true OR = 1.5 lies within the 95% confidence intervals.
Coverage is computed as the percentage of simulation replicates whose CI covers the true OR.
Good coverage (~95%) indicates that the variance estimator is accurate.
Naïve estimates typically show attenuation bias and poor coverage, while corrected estimates should recover the true slope with coverage close to the nominal level.
inside_ci = function(tab, i, truth = OR_true) {
  ci = tab[i, c("CI.low", "CI.high")]
  ci[1] <= truth && truth <= ci[2]
}
row_z1 = which(rownames(avg_naive) == "z1")
row_z2 = which(rownames(avg_naive) == "z2")
coverage = function(tab_list, row) {
  mean(sapply(tab_list, function(tab) inside_ci(tab, row))) * 100
}
cov_z1_naive = coverage(naive_tabs, row_z1)
cov_z1_corr = coverage(corrected_tabs, row_z1)
cov_z2_naive = coverage(naive_tabs, row_z2)
cov_z2_corr = coverage(corrected_tabs, row_z2)
cat("\nCoverage of TRUE OR = 1.5 for error-prone exposure z1:\n")
#>
#> Coverage of TRUE OR = 1.5 for error-prone exposure z1:
cat(sprintf(" • Naive : %5.1f %%\n", cov_z1_naive))
#> • Naive : 77.0 %
cat(sprintf(" • Regression Calibration: %5.1f %%\n", cov_z1_corr))
#> • Regression Calibration: 95.0 %
cat("\nCoverage of TRUE OR = 1.5 for error-prone exposure z2:\n")
#>
#> Coverage of TRUE OR = 1.5 for error-prone exposure z2:
cat(sprintf(" • Naive : %5.1f %%\n", cov_z2_naive))
#> • Naive : 76.0 %
cat(sprintf(" • Regression Calibration: %5.1f %%\n", cov_z2_corr))
#> • Regression Calibration: 97.0 %
1. Setup
The setup mirrors the external example: we load packages, fix the seed, and define the same helper functions, with the true slope again set to log(1.5), corresponding to a true odds ratio (OR) of 1.5 for each error-prone exposure.
2. One simulation replicate for the internal logistic design
Here we generate one dataset with both main study and reliability information. Replicates are included in the same main_data frame.
x1 and x2 are the true exposures, each observed through 1–4 error-prone replicates per subject (columns z1_1–z1_4 and z2_1–z2_4)
W1 is a continuous error-free covariate
W2 is a binary error-free covariate whose success probability depends on x1
Y is generated from a logistic model with slope log(1.5) for each exposure
simulate_once = function() {
  # ---- True covariates ----
  x = mgcv::rmvn(3000, c(0, 0, 0),
                 matrix(c(1,   0.3, 0.2,
                          0.3, 1,   0.5,
                          0.2, 0.5, 1), nrow = 3))
  # Binary W2 depends on x1
  w2 = sapply(x[, 1], function(t) {
    if (t > median(x[, 1])) rbinom(1, 1, 0.5) else rbinom(1, 1, 0.3)
  })
  # Error-free covariates
  W = cbind(x[, 3], w2)
  colnames(W) = c("W1", "W2")
  # ---- Replicate design ----
  # Replicates per subject (1, 2, 3, or 4); documents the design, not used below
  r = c(rep(1, 1500), rep(2, 500), rep(3, 400), rep(4, 600))
  # Replicates for z1
  z1 = rbind(
    cbind(add_err(x[1:1500, 1]), NA, NA, NA),
    cbind(add_err(x[1501:2000, 1]), add_err(x[1501:2000, 1]), NA, NA),
    cbind(add_err(x[2001:2400, 1]), add_err(x[2001:2400, 1]), add_err(x[2001:2400, 1]), NA),
    cbind(add_err(x[2401:3000, 1]), add_err(x[2401:3000, 1]), add_err(x[2401:3000, 1]), add_err(x[2401:3000, 1]))
  )
  colnames(z1) = paste0("z1_", 1:4)
  # Replicates for z2
  z2 = rbind(
    cbind(add_err(x[1:1500, 2]), NA, NA, NA),
    cbind(add_err(x[1501:2000, 2]), add_err(x[1501:2000, 2]), NA, NA),
    cbind(add_err(x[2001:2400, 2]), add_err(x[2001:2400, 2]), add_err(x[2001:2400, 2]), NA),
    cbind(add_err(x[2401:3000, 2]), add_err(x[2401:3000, 2]), add_err(x[2401:3000, 2]), add_err(x[2401:3000, 2]))
  )
  colnames(z2) = paste0("z2_", 1:4)
  # ---- Outcome ----
  p = plogis(-2.65 + beta * rowSums(x[, 1:3]) + beta * w2)
  Y = rbinom(3000, 1, p)
  # ---- Main data with outcome, replicates, covariates ----
  main_data = data.frame(Y, z1, z2, W)
  # ---- Regression calibration ----
  res = RC_InReliab(
    formula = Y ~ myz1(z1_1, z1_2, z1_3, z1_4) +
                  myz2(z2_1, z2_2, z2_3, z2_4) +
                  W1 + W2,
    main_data = main_data,
    link = "logistic"
  )
  return(res)
}
3. 100 simulation replicates
We repeat the entire simulation 100 times, collecting naive_tabs and corrected_tabs exactly as in the external example. Each run returns a results object containing two tables:
uncorrected = naïve logistic regression ignoring measurement error.
corrected = regression calibration estimates adjusted for error.
4. Average estimates across simulations
We compute the Monte Carlo average of the parameter estimates across the 100 runs.
avg_naive = Reduce("+", naive_tabs) / B
avg_corrected = Reduce("+", corrected_tabs) / B
cat("\nAverage Naive Logistic Estimates (B = ", B, "):\n", sep = "")
#>
#> Average Naive Logistic Estimates (B = 100):
print(round(avg_naive, 5))
#> Estimate Std. Error z value Pr(>|z|) OR CI.low CI.high
#> (Intercept) -2.46347 0.07564 -32.57857 0.00000 0.08536 0.07362 0.09896
#> myz1 0.32896 0.05891 5.58205 0.00001 1.39137 1.23960 1.56175
#> myz2 0.31140 0.06348 4.90541 0.00038 1.36817 1.20808 1.54949
#> W1 0.45731 0.07129 6.41244 0.00000 1.58342 1.37684 1.82102
#> W2 0.44009 0.12598 3.49568 0.01636 1.56768 1.22472 2.00674
cat("\nAverage Corrected Logistic Estimates (B = ", B, "):\n", sep = "")
#>
#> Average Corrected Logistic Estimates (B = 100):
print(round(avg_corrected, 5))
#> Estimate Std. Error z value Pr(>|z|) OR CI.low CI.high
#> (Intercept) -2.46506 0.07559 -32.62875 2.00000 0.08522 0.07352 0.09880
#> myz1 0.40418 0.07632 5.29511 0.00003 1.50144 1.29271 1.74396
#> myz2 0.39760 0.08593 4.62977 0.00093 1.49393 1.26225 1.76824
#> W1 0.40044 0.07559 5.29969 0.00005 1.49629 1.29017 1.73539
#> W2 0.41099 0.12687 3.24107 0.02646 1.52313 1.18781 1.95316
5. Coverage-rate (CR) calculation
Finally, we evaluate whether the true OR = 1.5 lies within the 95% confidence intervals.
Coverage is computed as the percentage of simulation replicates whose CI covers the true OR.
Good coverage (~95%) indicates that the variance estimator is accurate.
Naïve estimates typically show attenuation bias and poor coverage, while corrected estimates should recover the true slope with coverage close to the nominal level.
inside_ci = function(tab, i, truth = OR_true) {
  ci = tab[i, c("CI.low", "CI.high")]
  ci[1] <= truth && truth <= ci[2]
}
row_z1 = which(rownames(avg_naive) == "myz1")
row_z2 = which(rownames(avg_naive) == "myz2")
coverage = function(tab_list, row) {
  mean(sapply(tab_list, function(tab) inside_ci(tab, row))) * 100
}
cov_z1_naive = coverage(naive_tabs, row_z1)
cov_z1_corr = coverage(corrected_tabs, row_z1)
cov_z2_naive = coverage(naive_tabs, row_z2)
cov_z2_corr = coverage(corrected_tabs, row_z2)
cat("\nCoverage of TRUE OR = 1.5 for error-prone exposure z1:\n")
#>
#> Coverage of TRUE OR = 1.5 for error-prone exposure z1:
cat(sprintf(" • Naive : %5.1f %%\n", cov_z1_naive))
#> • Naive : 73.0 %
cat(sprintf(" • Regression Calibration: %5.1f %%\n", cov_z1_corr))
#> • Regression Calibration: 96.0 %
cat("\nCoverage of TRUE OR = 1.5 for error-prone exposure z2:\n")
#>
#> Coverage of TRUE OR = 1.5 for error-prone exposure z2:
cat(sprintf(" • Naive : %5.1f %%\n", cov_z2_naive))
#> • Naive : 70.0 %
cat(sprintf(" • Regression Calibration: %5.1f %%\n", cov_z2_corr))
#> • Regression Calibration: 92.0 %
Summary
This document illustrates the use of the RegCalReliab package for regression calibration under measurement error in logistic regression models. Two study designs are demonstrated:
External reliability studies, where replicate measurements are collected on a separate sample.
Internal reliability studies, where replicates are collected within the main study.
In both cases, we simulate data with two error-prone exposures (z1, z2) and two error-free covariates (W1, W2). The true odds ratio for each exposure is set to 1.5.
Key findings from the simulations:
Naïve logistic regression, which ignores measurement error, shows attenuation bias (estimates biased toward the null) and poor confidence interval coverage.
Regression calibration corrects for the bias, yielding average estimates closer to the true odds ratio.
Coverage probabilities of the 95% confidence intervals for regression calibration are close to the nominal level (~ 95%), indicating that the method recovers valid inference under both external and internal reliability designs.
Overall, the document demonstrates that regression calibration is an effective and practical method for addressing measurement error in regression models, and the RegCalReliab package provides a unified framework for implementation in both external and internal reliability studies.