Simulating 2-level data for my APA 2021 Supplemental Material post, so others can follow along.
This post is meant for two audiences: readers who want to follow along with my APA 2021 Supplemental Material post, and anyone who wants to simulate their own two-level data. My code is available for use and alteration, but please give credit via citation. If you're looking for more detail regarding the simulated data, or would like to simulate your own data with a simple linear interaction, see this prior post, which looks more deeply at distributions of simulated data. Anyway, let's get into it!
First, let's simulate the data used in my APA 2021 Supplemental Material post.
By modifying the interaction in the previous data simulation function with an if_else() statement, we can make a more complex interaction term that would not be easily detected by traditional multilevel models (MLMs). I am coding the interaction's influence as conditional on its value; specifically, the interaction's effect is if_else(z[[i]]*x[[i]] < 0, interaction1*z[[i]]*x[[i]], interaction2*z[[i]]*x[[i]]). This is one type of circumstance where MLMs may struggle without substantial theoretical understanding to guide exploration of the data.
So that is a simple simulation with a complex interaction term. We have 15 level 1 units for each level 2 unit and 71 level 2 units (labeled with id). The level 1 intercept of outcome is 4.25, with a main effect of 1.25 for x and 2.00 for z. The function also makes a nuisance variable w, which does not contribute to outcome at all.
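For readers who do not want to click back to the prior post, here is a minimal sketch of what a function like simulate_nonlinear_intx() could look like. The actual function lives in the prior post; the distributional choices below (standard-normal x, z, and w) and the vectorized if_else() are assumptions for illustration, not the exact implementation.
library(tidyverse)  ## dplyr, purrr, tibble, ggplot2

simulate_nonlinear_intx <- function(n, j,
                                    intercept_lv1, main_x, main_z,
                                    interaction1, interaction2,
                                    residual_var_sd_lv1,
                                    random_int_mean_lv2, random_int_sd_lv2,
                                    start_seed) {
  set.seed(start_seed)
  map_dfr(seq_len(j), function(i) {
    ## level 2: one random intercept per id
    random_int <- rnorm(1, random_int_mean_lv2, random_int_sd_lv2)
    ## level 1: n observations nested within this id (assumed standard normal)
    x <- rnorm(n)
    z <- rnorm(n)
    w <- rnorm(n)  ## nuisance variable with no effect on outcome
    ## the interaction's effect depends on the sign of the x*z product
    intx_effect <- if_else(z * x < 0, interaction1 * z * x, interaction2 * z * x)
    tibble(
      id = i,
      x = x, z = z, w = w,
      outcome = intercept_lv1 + random_int +
        main_x * x + main_z * z + intx_effect +
        rnorm(n, 0, residual_var_sd_lv1)
    )
  })
}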
nonlinear_dat <-
  simulate_nonlinear_intx(
    n = 15,                      ## number of level 1 units at each level 2 unit
    j = 71,                      ## number of level 2 units
    intercept_lv1 = 4.25,        ## intercept at level 1
    main_x = 1.25,               ## main effect of x
    main_z = 2.00,               ## main effect of z
    interaction1 = -4,           ## interaction of x and z when product < 0
    interaction2 = 4,            ## interaction of x and z when product >= 0
    residual_var_sd_lv1 = 2.00,  ## standard deviation of residuals at level 1
    random_int_mean_lv2 = 0,     ## mean of random intercept at level 2
    random_int_sd_lv2 = 1.00,    ## standard deviation of random intercept at level 2
    start_seed = 123             ## ensure you can reproduce
  )
figure_1a <-
  nonlinear_dat %>%
  ggplot(aes(x = x, y = outcome)) +
  geom_point() +
  geom_smooth(method = 'lm') +
  theme_minimal() +
  project_theme +  ## custom theme object defined earlier in the post
  labs(
    title = 'Figure 1. Association of Outcome & Each Covariate (non-linear intx)',
    subtitle = '1a. Association of Outcome & x'
  ) +
  theme(plot.subtitle = element_text(size = 15))

figure_1b <-
  nonlinear_dat %>%
  ggplot(aes(x = z, y = outcome)) +
  geom_point() +
  geom_smooth(method = 'lm') +
  theme_minimal() +
  project_theme +
  labs(subtitle = '1b. Association of Outcome & z') +
  theme(plot.subtitle = element_text(size = 15))

figure_1c <-
  nonlinear_dat %>%
  ggplot(aes(x = x * z, y = outcome)) +
  geom_point() +
  geom_smooth(method = 'lm', aes(group = x * z < 0)) +  ## separate fits on each side of 0
  theme_minimal() +
  project_theme +
  labs(subtitle = '1c. Association of Outcome & Interaction') +
  theme(plot.subtitle = element_text(size = 15))

ggpubr::ggarrange(figure_1a, figure_1b, figure_1c, ncol = 1)
When we fit a standard MLM to the data, it cannot detect the interaction: you can see this from the small t-value for the x:z term in the results below, even though the main effects of x and z are recovered.
library(lme4)

misspecified_lmer <-
  lmer(
    data = nonlinear_dat,
    formula = outcome ~ x * z + w + (1 | id)
  )
summary(misspecified_lmer)
Linear mixed model fit by REML ['lmerMod']
Formula: outcome ~ x * z + w + (1 | id)
   Data: nonlinear_dat

REML criterion at convergence: 5730

Scaled residuals:
    Min      1Q  Median      3Q     Max
-2.7929 -0.6616 -0.1518  0.4770  4.5718

Random effects:
 Groups   Name        Variance Std.Dev.
 id       (Intercept)  0.7794  0.8828
 Residual             12.0680  3.4739
Number of obs: 1065, groups:  id, 71

Fixed effects:
            Estimate Std. Error t value
(Intercept)   6.7859     0.1498  45.312
x             1.3789     0.1153  11.961
z             2.0853     0.1093  19.087
w             0.1155     0.1093   1.057
x:z           0.1555     0.1160   1.341

Correlation of Fixed Effects:
    (Intr) x      z      w
x   -0.069
z    0.016 -0.011
w    0.016 -0.023 -0.031
x:z -0.006 -0.049 -0.140 -0.004
If you had a highly experienced theorist or a whiz analyst, they could perhaps figure out how to correctly specify an MLM, but it would not be quick. The true effect is essentially a three-way interaction, and the extra two-way terms would eat away statistical power (or even produce a Type I error). That three-way specification would use the formula outcome ~ x + z + x * z * dummy_code + w + (1 | id), which estimates 4 unnecessary effects (the dummy_code main effect plus the x:z, x:dummy_code, and z:dummy_code interactions). In a smaller sample, the potential issues with this are amplified.
Below is the correctly specified model, which technically provides a statistical test that the x*z slope within each dummy-coded group, formed with the condition x * z < 0, is different from 0. This would be a very difficult model to simply happen upon.
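Here is a sketch of how that model could be fit. The dummy_code construction (a 0/1 factor based on the x * z < 0 condition) and the object name correct_lmer are my guesses from the output below rather than the exact code.
## assumed construction of dummy_code: 0 when the x*z product is negative, 1 otherwise
nonlinear_dat <-
  nonlinear_dat %>%
  mutate(dummy_code = factor(if_else(x * z < 0, 0, 1)))

correct_lmer <-
  lmer(
    data = nonlinear_dat,
    formula = outcome ~ x + z + x:z:dummy_code + w + (1 | id)
  )
summary(correct_lmer)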
Linear mixed model fit by REML ['lmerMod']
Formula: outcome ~ x + z + x:z:dummy_code + w + (1 | id)
   Data: nonlinear_dat

REML criterion at convergence: 4727.3

Scaled residuals:
    Min      1Q  Median      3Q     Max
-3.5431 -0.6796 -0.0225  0.6703  2.9410

Random effects:
 Groups   Name        Variance Std.Dev.
 id       (Intercept) 0.7424   0.8616
 Residual             4.5122   2.1242
Number of obs: 1065, groups:  id, 71

Fixed effects:
                Estimate Std. Error t value
(Intercept)      4.48729    0.13363  33.579
x                1.11054    0.07152  15.528
z                1.96038    0.06733  29.117
w                0.07117    0.06740   1.056
x:z:dummy_code0 -3.75376    0.11871 -31.621
x:z:dummy_code1  3.87560    0.11535  33.598

Correlation of Fixed Effects:
            (Intr) x      z      w      x:z:_0
x           -0.011
z            0.029 -0.010
w            0.016 -0.023 -0.027
x:z:dmmy_c0  0.331  0.041 -0.049  0.008
x:z:dmmy_c1 -0.331 -0.100 -0.122 -0.011 -0.253
In my next post, let's see how GLMM trees from the {glmertree} package handle these data.