A function to simulate multilevel data with an interaction term.
My code is available for use and alteration, but please give credit with a citation & link to the page.
Sharing code for simulating multilevel data, with an optional interaction term.
This function lets you specify the sample sizes, two main effects (main_x & main_z), an interaction, and other features of the data. If you set interaction = 0, you get multilevel data with no interaction term. Of course, the code could be extended to include as many effects as you need. The function also provides a variable (w) that is not associated with the outcome (outcome).
simulate_interaction_w_2_levels <-
  function(
    n,                    ## number of level 1 units per level 2 unit
    j,                    ## number of level 2 units
    intercept_lv1,        ## intercept at level 1
    main_x,               ## main effect of x
    main_z,               ## main effect of z
    interaction,          ## interaction of x and z
    residual_var_sd_lv1,  ## standard deviation of residuals at level 1
    random_int_mean_lv2,  ## mean of random intercept at level 2
    random_int_sd_lv2,    ## standard deviation of random intercept at level 2
    start_seed = 123
  ){
    ## placeholder for level 2 outcomes
    outcomes_j <- vector("list", j)
    ## for each variable, make a list as long as level 2;
    ## each element will hold a vector of level 1 values
    x <- vector("list", j)
    z <- vector("list", j)
    w <- vector("list", j)
    ## school distribution (random intercept at level 2)
    set.seed(start_seed)
    a_j <- rnorm(j, random_int_mean_lv2, random_int_sd_lv2)
    for(i in 1:j) {
      ## make the level 1 predictor variables:
      ## use a different seed for each variable and each iteration
      ## so no two draws are identical
      set.seed(start_seed + i)
      x[[i]] <- rnorm(n)
      set.seed(start_seed - i)
      z[[i]] <- rnorm(n)
      set.seed(-start_seed + i)
      outcomes_j[[i]] <-
        rnorm(
          n,
          intercept_lv1 +
            interaction*z[[i]]*x[[i]] +
            main_x*x[[i]] +
            main_z*z[[i]] +
            a_j[i],
          ## standard deviation of the residual variance
          residual_var_sd_lv1
        )
      set.seed(start_seed*197 + 197*i)
      w[[i]] <- rnorm(n)
    }
    outcomes_df <-
      data.frame(
        id = rep(1:j, each = n),
        x = unlist(x),
        z = unlist(z),
        outcome = unlist(outcomes_j),
        w = unlist(w)
      )
    return(outcomes_df)
  }
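As noted above, setting interaction = 0 gives multilevel data with main effects only. Here is a minimal sketch; the parameter values are arbitrary and chosen just for illustration.
## simulate data with main effects only (interaction = 0);
## these parameter values are arbitrary, used only for illustration
no_interaction_dat <-
  simulate_interaction_w_2_levels(
    n = 10,
    j = 30,
    intercept_lv1 = 2.00,
    main_x = 0.50,
    main_z = 1.00,
    interaction = 0,            ## removes the x:z term from the outcome
    residual_var_sd_lv1 = 1.00,
    random_int_mean_lv2 = 0,
    random_int_sd_lv2 = 0.50,
    start_seed = 456
  )

str(no_interaction_dat)         ## 300 rows: id, x, z, outcome, w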
# baseline theme for plots here
## load the tidyverse (ggplot2, dplyr, tidyr, tibble) used throughout this post
library(tidyverse)

project_theme <-
  theme(
    axis.line = element_line(size = 3, lineend = 'round'),
    legend.position = 'none',
    plot.title.position = 'plot',
    panel.grid.minor = element_blank(),
    axis.title.x = element_text(size = 18),
    axis.text = element_text(size = 16),
    axis.title.y = element_text(size = 20),
    plot.title = element_text(size = 25),
    strip.text = element_text(size = 25)
  )
We have 15 level 1 units within each of 71 level 2 units. The level 1 intercept of the outcome is 4.25, with a main effect of 1.25 for x, 2.00 for z, and 0.75 for their interaction. The function also makes a nuisance variable w, which does not contribute to outcome at all.
sim_dat <-
  simulate_interaction_w_2_levels(
    n = 15,                      ## number of level 1 units per level 2 unit
    j = 71,                      ## number of level 2 units
    intercept_lv1 = 4.25,        ## intercept at level 1
    main_x = 1.25,               ## main effect of x
    main_z = 2.00,               ## main effect of z
    interaction = 0.75,          ## interaction of x and z
    residual_var_sd_lv1 = 2.00,  ## standard deviation of residuals at level 1
    random_int_mean_lv2 = 0,     ## mean of random intercept at level 2
    random_int_sd_lv2 = 1.00,    ## standard deviation of random intercept at level 2
    start_seed = 123             ## ensure you can reproduce
  ) %>%
  tibble()
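A quick structural check (my addition; glimpse() ships with {dplyr}/{tibble}) confirms one row per level 1 unit:
glimpse(sim_dat)  ## 1,065 rows x 5 columns: id, x, z, outcome, w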
Put formally, the simulation is approximating:
\[ \begin{aligned} \operatorname{outcome}_{i} &\sim N \left(\mu, \sigma^2 \right) \\ \mu &= \beta_{0} + \alpha_{j[i]} + \beta_{1}(\operatorname{x}) + \beta_{2}(\operatorname{z}) + \beta_{3}(\operatorname{w}) + \beta_{4}(\operatorname{x} \times \operatorname{z}) \\ \alpha_{j} &\sim N \left(\mu_{\alpha_{j}}, \sigma^2_{\alpha_{j}} \right) \text{, for id } j = 1, \dots, J \end{aligned} \]
where
\[\beta_{0} = 4.25\] \[\sigma = 2.00\] \[\beta_{1} = 1.25\] \[\beta_{2} = 2.00\] \[\beta_{3} = 0\] \[\beta_{4} = 0.75\] \[J = 71\] \[\mu_{\alpha_{j}} = 0\] \[\sigma_{\alpha_{j}} = 1.00\]
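As a quick sanity check on these values (my addition, assuming x, z, and the random intercepts are independent standard normals, so the x*z term also has variance 1), the implied marginal variance of the outcome is just the sum of the component variances:
## implied marginal variance of the outcome under the parameters above
## (independence assumption; Var(x*z) = 1 for independent standard normals)
implied_var <- 1.25^2 + 2.00^2 + 0.75^2 * 1 + 1.00^2 + 2.00^2
sqrt(implied_var)  ## ~3.34, close to the simulated outcome SD reported below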
Here are the overall means and standard deviations for the 1065 simulated cases (71 level 2 units with 15 level 1 units each).
# A tibble: 1 x 8
x_mean x_sd z_mean z_sd outcome_mean outcome_sd w_mean w_sd
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.0900 0.946 -0.0197 1.00 4.44 3.38 -0.0208 0.993
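The code for that summary isn't shown above; one way to produce a table like it with {dplyr} (a sketch, assuming sim_dat as created earlier) is:
## overall means and SDs of the simulated variables
sim_dat %>%
  summarise(
    across(
      c(x, z, outcome, w),
      list(mean = mean, sd = sd),
      .names = "{.col}_{.fn}"
    )
  )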
figure_1 <-
  sim_dat %>%
  pivot_longer(
    cols = -id,
    names_to = 'variable',
    values_to = 'statistic'
  ) %>%
  group_by(variable) %>%
  mutate(average_value = mean(statistic)) %>%
  ungroup() %>%
  ggplot(
    aes(
      x = statistic,
      fill = variable
    )
  ) +
  colorblindr::scale_fill_OkabeIto() +
  colorblindr::scale_color_OkabeIto() +
  geom_vline(aes(xintercept = average_value), size = 1.25, linetype = 2) +
  geom_density(alpha = 0.7, color = 'black', size = 2) +
  facet_wrap(~variable, scales = 'free') +
  labs(
    y = element_blank(),
    x = 'Distribution of Simulated Values',
    title = 'Figure 1. Distribution of Simulated Variables',
    caption = 'Note the scales are free on each plot\nDotted Lines = Average Values'
  ) +
  theme_minimal() +
  project_theme +
  theme(axis.text.y = element_blank())

figure_1
When you group by id, the group-level averages are roughly normally distributed around the overall values (Figure 2).
figure_2 <-
  sim_dat %>%
  group_by(id) %>%
  summarize(
    x = mean(x),
    z = mean(z),
    outcome = mean(outcome),
    w = mean(w)
  ) %>%
  pivot_longer(
    cols = -id,
    names_to = 'variable',
    values_to = 'statistic'
  ) %>%
  group_by(variable) %>%
  mutate(average_value = mean(statistic)) %>%
  ungroup() %>%
  ggplot(
    aes(
      x = statistic,
      fill = variable
    )
  ) +
  colorblindr::scale_fill_OkabeIto() +
  colorblindr::scale_color_OkabeIto() +
  geom_vline(aes(xintercept = average_value), size = 1.25, linetype = 2) +
  geom_density(alpha = 0.7, color = 'black', size = 2) +
  facet_wrap(~variable, scales = 'free') +
  labs(
    y = element_blank(),
    x = 'Distribution of Simulated Values (group averages)',
    title = 'Figure 2. Distribution of Group Averages for\nSimulated Variables',
    caption = 'Note the scales are free on each plot'
  ) +
  theme_minimal() +
  project_theme +
  theme(axis.text.y = element_blank())

figure_2
figure_3a <-
  sim_dat %>%
  ggplot(aes(x = x, y = outcome)) +
  geom_point() +
  geom_smooth(method = 'lm') +
  theme_minimal() +
  project_theme +
  labs(
    title = 'Figure 3. Association of Outcome & Each Covariate',
    subtitle = '3a. Association of Outcome & x'
  ) +
  theme(plot.subtitle = element_text(size = 15))

figure_3b <-
  sim_dat %>%
  ggplot(aes(x = z, y = outcome)) +
  geom_point() +
  geom_smooth(method = 'lm') +
  theme_minimal() +
  project_theme +
  labs(subtitle = '3b. Association of Outcome & z') +
  theme(plot.subtitle = element_text(size = 15))

figure_3c <-
  sim_dat %>%
  ggplot(aes(x = x*z, y = outcome)) +
  geom_point() +
  geom_smooth(method = 'lm') +
  theme_minimal() +
  project_theme +
  labs(subtitle = '3c. Association of Outcome & Interaction') +
  theme(plot.subtitle = element_text(size = 15))

ggpubr::ggarrange(figure_3a, figure_3b, figure_3c, ncol = 1)
The simulation looks to have worked as planned. With larger samples the estimates get closer to the values we specified, but that check is omitted for brevity. Let's fit the data with an {lme4} model to be sure.
library(lme4)

confirmation_lmer <-
  lmer(
    data = sim_dat,
    formula = outcome ~ x * z + w +  ## x * z is equivalent to x + z + x:z
      (1 | id)
  )

summary(confirmation_lmer)
Linear mixed model fit by REML ['lmerMod']
Formula: outcome ~ x * z + w + (1 | id)
Data: sim_dat
REML criterion at convergence: 4729.8
Scaled residuals:
Min 1Q Median 3Q Max
-3.4943 -0.6663 -0.0260 0.6799 2.9690
Random effects:
Groups Name Variance Std.Dev.
id (Intercept) 0.7374 0.8587
Residual 4.5271 2.1277
Number of obs: 1065, groups: id, 71
Fixed effects:
Estimate Std. Error t value
(Intercept) 4.37559 0.12117 36.111
x 1.09797 0.07135 15.389
z 1.95442 0.06737 29.011
w 0.06962 0.06750 1.031
x:z 0.80677 0.07160 11.268
Correlation of Fixed Effects:
(Intr) x z w
x -0.053
z 0.012 -0.014
w 0.013 -0.024 -0.027
x:z -0.005 -0.049 -0.141 -0.003
The MLM estimates recover the simulated effects closely, with small standard errors and large t-values. As we'd hope, w is not meaningfully associated with the outcome.
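If you want the estimates and the simulated values side by side, a small sketch like this works (fixef() is from {lme4}; the true_value column simply restates the parameters chosen above):
## compare fixed-effect estimates to the values used in the simulation
data.frame(
  term = names(fixef(confirmation_lmer)),
  estimate = round(unname(fixef(confirmation_lmer)), 2),
  true_value = c(4.25, 1.25, 2.00, 0, 0.75)  ## intercept, x, z, w, x:z
)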