Model Fitting Exercise

This report explores the Mavoglurant pharmacokinetic dataset. The goal is to clean the data, perform exploratory data analysis, and fit predictive models using the tidymodels framework.

library(nlmixr2data)
library(ggplot2)
library(dplyr)

library(tidyverse)
library(tidymodels)
mavoglurant <- read.csv("Mavoglurant_A2121_nmpk.csv")

Data Exploration

ggplot(mavoglurant, aes(x = TIME, y = DV, group = ID, color = factor(DOSE))) +
  geom_line(alpha = 0.6) +
  geom_point(size = 1) +
  labs(
    title = "DV Over Time by Dose",
    x = "Time",
    y = "DV",
    color = "Dose"
  ) +
  theme_minimal()

We keep only the first occasion (OCC = 1) and drop OCC = 2 records, since some individuals received the drug on more than one occasion and we restrict the analysis to the first.

mavoglurant_occ1 <- mavoglurant %>%
  filter(OCC == 1)

ggplot(mavoglurant_occ1,
       aes(x = TIME,
           y = DV,
           group = ID,
           color = factor(DOSE))) +
  geom_line(alpha = 0.6) +
  geom_point(size = 1) +
  labs(
    title = "DV Over Time by Dose (OCC = 1)",
    x = "Time",
    y = "DV",
    color = "Dose"
  ) +
  theme_minimal()

Data Cleaning

We compute total drug exposure (Y) by summing DV values for each individual after removing TIME = 0.

# Exclude TIME == 0 and compute sum of DV per individual
dv_sum <- mavoglurant %>%
  filter(OCC == 1, TIME != 0) %>%
  group_by(ID) %>%
  summarize(Y = sum(DV, na.rm = TRUE)) %>%
  ungroup()

# Check dimensions
dim(dv_sum)
[1] 120   2
dose_data <- mavoglurant %>%
  filter(OCC == 1, TIME == 0)

dim(dose_data)
[1] 120  17
final_data <- dose_data %>%
  left_join(dv_sum, by = "ID")

dim(final_data)
[1] 120  18

We convert categorical variables to factors and keep only the relevant predictors and outcome.

clean_data <- final_data %>%
  mutate(
    RACE = factor(RACE),
    SEX  = factor(SEX)
  ) %>%
  select(Y, DOSE, AGE, SEX, RACE, WT, HT)

str(clean_data)
'data.frame':   120 obs. of  7 variables:
 $ Y   : num  2691 2639 2150 1789 3126 ...
 $ DOSE: num  25 25 25 25 25 25 25 25 25 25 ...
 $ AGE : int  42 24 31 46 41 27 23 20 23 28 ...
 $ SEX : Factor w/ 2 levels "1","2": 1 1 1 2 2 1 1 1 1 1 ...
 $ RACE: Factor w/ 4 levels "1","2","7","88": 2 2 1 1 2 2 1 4 2 1 ...
 $ WT  : num  94.3 80.4 71.8 77.4 64.3 ...
 $ HT  : num  1.77 1.76 1.81 1.65 1.56 ...
dim(clean_data)
[1] 120   7

EDA

We begin by examining general summary statistics to understand the scale and variability of the variables.

summary(clean_data)
       Y               DOSE            AGE        SEX     RACE   
 Min.   : 826.4   Min.   :25.00   Min.   :18.00   1:104   1 :74  
 1st Qu.:1700.5   1st Qu.:25.00   1st Qu.:26.00   2: 16   2 :36  
 Median :2349.1   Median :37.50   Median :31.00           7 : 2  
 Mean   :2445.4   Mean   :36.46   Mean   :33.00           88: 8  
 3rd Qu.:3050.2   3rd Qu.:50.00   3rd Qu.:40.25                  
 Max.   :5606.6   Max.   :50.00   Max.   :50.00                  
       WT               HT       
 Min.   : 56.60   Min.   :1.520  
 1st Qu.: 73.17   1st Qu.:1.700  
 Median : 82.10   Median :1.770  
 Mean   : 82.55   Mean   :1.759  
 3rd Qu.: 90.10   3rd Qu.:1.813  
 Max.   :115.30   Max.   :1.930  
clean_data %>%
  summarise(
    n = n(),
    mean_Y = mean(Y),
    sd_Y   = sd(Y),
    min_Y  = min(Y),
    max_Y  = max(Y)
  )
    n   mean_Y     sd_Y  min_Y   max_Y
1 120 2445.407 961.6351 826.43 5606.58

By Dose:

Since dose is expected to strongly affect total drug exposure, we compute summary statistics of Y stratified by DOSE.

clean_data %>%
  group_by(DOSE) %>%
  summarise(
    n = n(),
    mean_Y = mean(Y),
    sd_Y = sd(Y),
    median_Y = median(Y)
  )
# A tibble: 3 × 5
   DOSE     n mean_Y  sd_Y median_Y
  <dbl> <int>  <dbl> <dbl>    <dbl>
1  25      59  1783.  601.    1666.
2  37.5    12  2464.  488.    2388.
3  50      49  3239.  787.    3194.

Total drug exposure (Y) increases with dose, indicating a strong dose–exposure relationship.

By Sex:

clean_data %>%
  group_by(SEX) %>%
  summarise(
    n = n(),
    mean_Y = mean(Y),
    sd_Y = sd(Y)
  )
# A tibble: 2 × 4
  SEX       n mean_Y  sd_Y
  <fct> <int>  <dbl> <dbl>
1 1       104  2478.  959.
2 2        16  2236.  983.

Total drug exposure (Y) appears somewhat higher on average in individuals coded as sex = 1 than in those coded as sex = 2.

Y vs. Age

ggplot(clean_data, aes(x = AGE, y = Y)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  labs(title = "Total Drug (Y) vs Age")

The scatterplot of total drug exposure (Y) versus age shows substantial variability across all ages, with no strong visible linear trend.

Y vs. Dose

ggplot(clean_data, aes(x = factor(DOSE), y = Y)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Total Drug by Dose", x = "Dose")

Total drug exposure (Y) increases with dose, as expected.

Y vs. Sex

ggplot(clean_data, aes(x = SEX, y = Y)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Total Drug by Sex")

The boxplot comparing total drug exposure (Y) across sex groups shows that individuals coded as sex = 1 have a slightly higher median exposure than those coded as sex = 2.

Distribution of Y

ggplot(clean_data, aes(x = Y)) +
  geom_histogram(bins = 20) +
  theme_minimal() +
  labs(title = "Distribution of Total Drug (Y)")

The distribution of Y appears right-skewed.

Correlation Plots

numeric_vars <- clean_data %>%
  select(Y, AGE, WT, HT)

cor(numeric_vars)
              Y         AGE         WT         HT
Y    1.00000000  0.01256372 -0.2128719 -0.1583297
AGE  0.01256372  1.00000000  0.1196740 -0.3518581
WT  -0.21287194  0.11967399  1.0000000  0.5997505
HT  -0.15832972 -0.35185806  0.5997505  1.0000000
pairs(numeric_vars)

The outcome variable Y shows only weak correlations with demographic variables, suggesting that dose may be the primary driver of exposure rather than age, weight, or height.

Weight and height are strongly positively correlated (r = 0.60), which reflects expected physiological relationships. Age is moderately negatively correlated with height (r = -0.35), indicating that older individuals in this sample tend to be shorter. Overall, no strong unexpected relationships were observed.

Modeling

Linear regression models predicting Y using Dose only:

lm_spec <- linear_reg() %>% set_engine("lm")

recipe_dose <- recipe(Y ~ DOSE, data = clean_data)

workflow_dose <- workflow() %>%
  add_recipe(recipe_dose) %>%
  add_model(lm_spec)

fit_dose <- fit(workflow_dose, data = clean_data)

pred_dose <- predict(fit_dose, clean_data) %>%
  bind_cols(clean_data)

metrics(pred_dose, truth = Y, estimate = .pred)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard     666.   
2 rsq     standard       0.516
3 mae     standard     517.   

Linear regression models predicting Y using all predictors:

recipe_all <- recipe(Y ~ ., data = clean_data)

workflow_all <- workflow() %>%
  add_recipe(recipe_all) %>%
  add_model(lm_spec)

fit_all <- fit(workflow_all, data = clean_data)

pred_all <- predict(fit_all, clean_data) %>%
  bind_cols(clean_data)

metrics(pred_all, truth = Y, estimate = .pred)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard     591.   
2 rsq     standard       0.619
3 mae     standard     444.   

The regression analysis shows that dose is the primary determinant of total drug exposure (Y). A linear model using dose alone explained approximately 52% of the variability in exposure (R-squared = 0.516), with an RMSE of 666 and MAE of 517, indicating moderate prediction error. When all predictors (dose, age, sex, race, weight, and height) were included, model performance improved modestly (R-squared = 0.619, RMSE = 591, MAE = 444). This suggests that demographic variables contribute some additional explanatory value, but dose remains the dominant driver of exposure.
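To see which terms actually drive the full model, the fitted coefficients can be pulled out of the workflow. A quick sketch using broom's tidy(), assuming the fit_all object created above:

```r
# extract_fit_parsnip() retrieves the underlying lm fit from the
# workflow; tidy() returns a tibble of coefficient estimates,
# standard errors, and p-values
fit_all %>%
  extract_fit_parsnip() %>%
  tidy()
```

If dose is truly dominant, its estimate should be large relative to its standard error compared with the demographic terms.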

Relationships between Y and other predictors such as age, sex, weight, and height were comparatively weak and showed substantial overlap across groups.

We now treat SEX as the outcome and fit logistic regression models.

Split the data into a training set and a test set.

set.seed(123)

data_split <- initial_split(clean_data, prop = 0.8, strata = SEX)

train_data <- training(data_split)
test_data  <- testing(data_split)

SEX ~ DOSE

log_spec <- logistic_reg() %>%
  set_engine("glm")

recipe_sex_dose <- recipe(SEX ~ DOSE, data = train_data)

workflow_sex_dose <- workflow() %>%
  add_recipe(recipe_sex_dose) %>%
  add_model(log_spec)

fit_sex_dose <- fit(workflow_sex_dose, data = train_data)
pred_sex_dose <- predict(fit_sex_dose, test_data, type = "prob") %>%
  bind_cols(predict(fit_sex_dose, test_data)) %>%
  bind_cols(test_data)

accuracy(pred_sex_dose, truth = SEX, estimate = .pred_class)
# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary          0.84
roc_auc(pred_sex_dose, truth = SEX, .pred_2)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.506

SEX ~ All Predictors

recipe_sex_all <- recipe(SEX ~ ., data = train_data)

workflow_sex_all <- workflow() %>%
  add_recipe(recipe_sex_all) %>%
  add_model(log_spec)

fit_sex_all <- fit(workflow_sex_all, data = train_data)
pred_sex_all <- predict(fit_sex_all, test_data, type = "prob") %>%
  bind_cols(predict(fit_sex_all, test_data)) %>%
  bind_cols(test_data)

accuracy(pred_sex_all, truth = SEX, estimate = .pred_class)
# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary             1
roc_auc(pred_sex_all, truth = SEX, .pred_2)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary             0

Logistic regression models were then used to predict sex from dose alone and from all available predictors. For the dose-only model, the accuracy of 0.84 largely reflects class imbalance (most individuals are coded sex = 1), and the ROC AUC of 0.506 confirms essentially no discriminatory ability beyond chance. For the all-predictors model, perfect accuracy paired with an AUC of 0 is contradictory on its face and points to how the ROC was computed rather than to a genuine lack of fit.
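The AUC of 0 above arises because yardstick's roc_auc() treats the first factor level as the event by default, so scoring the second level's probability column (.pred_2) without changing the event level reverses the ranking. A minimal sketch of the corrected call, assuming the pred_sex_all object from above:

```r
# Tell roc_auc() that the second factor level ("2") is the event,
# matching the .pred_2 probability column; since a reversed ranking
# gave AUC = 0, the corrected AUC is 1 - 0 = 1
roc_auc(pred_sex_all, truth = SEX, .pred_2, event_level = "second")
```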

Overall, the analysis demonstrates that dose strongly determines drug exposure in this dataset, demographic variables play a comparatively minor role, and sex cannot be reliably predicted from dose alone; the apparently perfect performance of the all-predictor sex model should be read cautiously given the small, imbalanced test set.

Module 10 Exercise

rngseed <- 1234
set.seed(rngseed)
clean_data2 <- subset(clean_data, select = -RACE)
data_split <- initial_split(clean_data2)
train_data <- training(data_split)
test_data  <- testing(data_split)

Model 1 - uses only DOSE as a predictor.

lm_model1 <- linear_reg() %>%
  set_engine("lm")

recipe1 <- recipe(Y ~ DOSE, data = train_data)

workflow1 <- workflow() %>%
  add_model(lm_model1) %>%
  add_recipe(recipe1)

fit_model1 <- workflow1 %>%
  fit(data = train_data)

Model 2 - uses all predictors except RACE, which was dropped above.

lm_model2 <- linear_reg() %>%
  set_engine("lm")

recipe2 <- recipe(Y ~ DOSE + AGE + SEX + WT + HT, data = train_data)

workflow2 <- workflow() %>%
  add_model(lm_model2) %>%
  add_recipe(recipe2)

fit_model2 <- workflow2 %>%
  fit(data = train_data)

Model performance assessment 1

RMSE for Training Data

Model 1

pred1 <- predict(fit_model1, train_data) %>%
  bind_cols(train_data)

rmse_model1 <- rmse(pred1, truth = Y, estimate = .pred)
rmse_model1
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard        703.

Model 2

pred2 <- predict(fit_model2, train_data) %>%
  bind_cols(train_data)

rmse_model2 <- rmse(pred2, truth = Y, estimate = .pred)
rmse_model2
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard        627.

Null Model RMSE

mean_y <- mean(train_data$Y)

null_pred <- train_data %>%
  mutate(.pred = mean_y)

rmse_null <- rmse(null_pred, truth = Y, estimate = .pred)
rmse_null
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard        948.
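As a cross-check, the same null-model RMSE can be obtained with parsnip's built-in null_model(), which for regression always predicts the training-set mean. A sketch assuming the train_data object from the split above:

```r
# null_model() with the "parsnip" engine predicts mean(Y) for every
# row, so its training RMSE should match the manual computation above
null_spec <- null_model() %>%
  set_engine("parsnip") %>%
  set_mode("regression")

null_fit <- workflow() %>%
  add_recipe(recipe(Y ~ ., data = train_data)) %>%
  add_model(null_spec) %>%
  fit(data = train_data)

predict(null_fit, train_data) %>%
  bind_cols(train_data) %>%
  rmse(truth = Y, estimate = .pred)
```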

Model performance assessment 2

set.seed(1234)
folds <- vfold_cv(train_data, v = 10)

Cross-Validate Model 1

cv_model1 <- fit_resamples(
  workflow1,
  resamples = folds,
  metrics = metric_set(rmse)
)

collect_metrics(cv_model1)
# A tibble: 1 × 6
  .metric .estimator  mean     n std_err .config        
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>          
1 rmse    standard    691.    10    67.5 pre0_mod0_post0

Cross-Validate Model 2

cv_model2 <- fit_resamples(
  workflow2,
  resamples = folds,
  metrics = metric_set(rmse)
)

collect_metrics(cv_model2)
# A tibble: 1 × 6
  .metric .estimator  mean     n std_err .config        
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>          
1 rmse    standard    646.    10    64.8 pre0_mod0_post0
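For a side-by-side view, both models' resampling metrics can be collected into one table; a small sketch assuming the cv_model1 and cv_model2 objects from above:

```r
# Label each model's CV metrics and stack them for comparison;
# std_err shows the fold-to-fold variability of the RMSE estimate
bind_rows(
  collect_metrics(cv_model1) %>% mutate(model = "Model 1 (dose only)"),
  collect_metrics(cv_model2) %>% mutate(model = "Model 2 (all predictors)")
) %>%
  select(model, .metric, mean, std_err)
```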

What changed?

Cross-validation gives a more realistic estimate of performance on unseen data, since each model is evaluated on folds it was not fit to. Model 2's RMSE rose from about 627 on the training data to a mean of 646 across folds, as expected. Model 1's mean CV RMSE (691) is actually slightly below its training RMSE (703), but both estimates are well within one standard error (about 68) of each other, so they are effectively indistinguishable. What did not change is the overall pattern of model performance: Model 2 still performs best, followed by Model 1, while both continue to outperform the null model.

This section added by Esther Palmer

Predicted vs. Observed Values

p1 <- pred1 %>%
  ggplot(aes(x = Y, y = .pred)) +
  geom_point() +
  ggtitle("Dose as predictor")
p1

p2 <- pred2 %>%
  ggplot(aes(x = Y, y = .pred)) +
  geom_point() +
  ggtitle("All predictors")
p2

# merge() matches rows on Y and re-sorts the result by the key;
# this assumes the Y values are unique across individuals
df1 <- merge(pred1, pred2, by = "Y")
df2 <- merge(df1, null_pred, by = "Y")
df3 <- df2 %>% select(Y, .pred.x, .pred.y, .pred)

df3 <- df3 %>% rename(
  pred1 = .pred.x,
  pred2 = .pred.y,
  nullpred = .pred
)

p3 <- df3 %>%
  ggplot(aes(x = Y)) +
  geom_point(aes(y = pred1, color = "dose only")) +
  geom_point(aes(y = pred2, color = "all predictors")) +
  geom_point(aes(y = nullpred, color = "null"))
p3

Bootstraps

set.seed(rngseed)

# Matrix to hold predictions: one row per bootstrap sample,
# one column per training-set observation (90 rows in train_data)
pred_bs <- matrix(NA, nrow = 100, ncol = 90)

bs_data <- train_data %>% bootstraps(times = 100)

for (i in 1:100) {
  dat_sample <- rsample::analysis(bs_data$splits[[i]])
  fit_bs <- lm(Y ~ DOSE + AGE + SEX + WT + HT, data = dat_sample)
  pred_bs[i, ] <- predict(fit_bs, newdata = train_data)
}

Confidence intervals and figure: observed values (Y) on the x axis, Model 2 point predictions (pred2) on the y axis with bootstrap confidence intervals.

# 89% interval: 5.5th, 50th, and 94.5th percentiles across the
# 100 bootstrap fits for each training observation
preds <- pred_bs |>
  apply(2, quantile, c(0.055, 0.5, 0.945)) |>
  t()

df4 <- data.frame(X = df3$Y, Y = df3$pred2,
                  low = preds[, 1], med = preds[, 2], upp = preds[, 3])

final_plot <- df4 %>%
  ggplot(aes(x = X, y = Y)) +
  geom_point() +
  geom_errorbar(aes(ymin = low, ymax = upp)) +
  geom_abline(slope = 1)
final_plot

I’m not entirely sure this went correctly: the error bars don’t overlap the points, although the values do roughly congregate around the identity line.
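One likely cause of the misaligned error bars: the columns of pred_bs follow the row order of train_data, while df3 was re-sorted by Y during the merge() calls, so each point is paired with another observation's interval. A sketch that keeps everything in train_data row order, assuming the fit_model2 and preds objects from above:

```r
# Build the plotting frame directly from train_data so observed
# values, point predictions, and bootstrap quantiles all share the
# same row order (merge() re-sorts rows by the key, breaking this)
point_pred <- predict(fit_model2, train_data)$.pred

df_aligned <- data.frame(
  Y    = train_data$Y,
  pred = point_pred,
  low  = preds[, 1],
  upp  = preds[, 3]
)

ggplot(df_aligned, aes(x = Y, y = pred)) +
  geom_point() +
  geom_errorbar(aes(ymin = low, ymax = upp)) +
  geom_abline(slope = 1, intercept = 0)
```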

Part 3

pred_test <- predict(fit_model2, test_data) %>%
  bind_cols(test_data) %>%
  mutate(dataset = "Test")

pred_train <- pred2 %>%
  mutate(dataset = "Train")

# Combine train and test predictions
pred_all <- bind_rows(pred_train, pred_test)

# Plot predicted vs observed for both datasets
ggplot(pred_all, aes(x = Y, y = .pred, color = dataset, shape = dataset)) +
  geom_point(size = 2, alpha = 0.7) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(
    title = "Predicted vs Observed Total Drug (Y)",
    x = "Observed Y",
    y = "Predicted Y",
    color = "Dataset",
    shape = "Dataset"
  ) +
  theme_minimal() +
  scale_color_manual(values = c("Train" = "blue", "Test" = "red")) +
  scale_shape_manual(values = c("Train" = 16, "Test" = 17))

  1. Does any model perform better than the null model? Yes, both Model 1 (dose only) and Model 2 (all predictors) perform better than the null model. The null model simply predicts the mean total drug exposure (Y) for all individuals, resulting in a high RMSE (948) and no ability to explain variability. By comparison, including dose alone or dose with demographic predictors reduces RMSE substantially and increases R-squared, demonstrating that both models capture variation in the data and outperform the null baseline.

  2. Does Model 1 with only dose improve results over the null model? Model 1 clearly improves performance over the null model. The training RMSE drops to 703, and R-squared reaches about 0.52, indicating that dose alone explains roughly half of the variability in total drug exposure. While the model captures the main signal in the data, it ignores demographic differences and residual variability, so it would provide reasonable but not highly precise predictions for individual patients. It is “usable” for rough estimates or exploratory analyses but limited for detailed dosing decisions.

  3. Does Model 2 with all predictors further improve results? Model 2 incorporates dose, age, sex, weight, and height, resulting in modest further improvement over Model 1. RMSE decreases to 627, and R-squared increases to about 0.62, indicating that demographic factors explain some additional variability in Y, though dose remains dominant. The results make sense given that the EDA showed weak correlations between Y and demographics. The model generalizes well to the test data, as predictions for unseen individuals overlap with the training data and cluster around the 45-degree line, suggesting minimal overfitting. Model 2 is more precise than Model 1 and can be considered usable for research-level predictions, though residual variability remains, limiting its precision for patient-level dosing.