library(PredTest)
# Cognitive data set from the paclage
data("group_cog_data")
# Fitness data set from the package
data("pre_post_fit")
library(kableExtra)
library(ggplot2)
This vignette explains how to use the ‘pred_results’, ‘pred_weights’ and ‘pred_test’ functions in the PredTest package to carry out the prediction test (Montgomery and Mahnken, 2020).
The prediction test is a recently proposed global hypothesis test. The idea of the test is to assess whether a researcher’s priori beliefs (predictions) about what will happen during a study are borne out by the data. The null hypothesis of the prediction test concerns the parameter ϕ, the “predictive ability” across the set of endpoints, specifically, H0 : ϕ = ϕ0. If ϕ0 = 0.50 then the null hypothesis is that the researcher’s predictions, or theory, is no better at predicting the outcomes than by chance. This type of hypothesis is especially useful in early-stage studies such as pilots when many outcomes are collected and there is a desire to know if the overall hypothesis is reasonable (Montgomery, Ptomey and Mahnken, 2022).
For example, a researcher may believe that a set of twenty endpoints will increase in the treatment group relative to the control group. If at the end of the study all of them have, then this provides some evidence that the researcher’s theory accurately predicts what will occur. In practice when many endpoints are collected there is often a high degree of multicollinearity; the prediction test adjusts for how similar endpoints are through weights, wi defined as the squared inverse row sums of the correlation matrix. This ensures that endpoints that are more unique have greater weight toward the overall test statistic which is defined as $\sum_{=1}^{m} p_{i}w_{i}$ where pi is the dichotomous result for endpoint i and wi is the corresponding weight for that endpoint.
The test has an exact distribution and has an asymptotically normal approximation with good performance for ≥ 20 endpoints.
As an example we use data from a study on blood brain barrier (BBB) permeability in patients with End Stage Kidney Disease (ESKD). In a recent pilot study (Gupta et al., 2023) explored the feasibility of measuring BBB permeability using a novel method (standardized uptake values: SUV) due to contraindications with traditional MRI based methods in patients with ESKD. The research team hypothesized that SUV would be higher in ESKD patients relative to control, and a results on the National Alzheimer’s Coordinating Center Uniform Data Set telephone cognitive battery (T-cog) tests would be worse in ESKD patients relative to control.
The T-cog consists of the following tests: MoCA, Immediate and Delayed recall, Digit span forward and backward, Verbal fluency, Trailmaking A and B, Category fluency animals and vegetables and Verbal naming. All T-cog scores were expected to be worse in ESKD patients than control, with the specific hypothesized effects shown in Table 1. For example, the average Oral trail making score was expected to be higher (increase) in ESKD relative to Control.
Variable | Hypothesized result |
---|---|
mean_suv | increase |
blind_moca_uncorrected | decrease |
craft_verbatim | decrease |
craft_delay_verbatim | decrease |
number_span_forward | decrease |
number_span_backward | decrease |
fluency_f_words_correct | decrease |
oral_trail_part_a | increase |
oral_trail_part_b | increase |
fluency_animals | decrease |
fluency_vegetables | decrease |
verbal_naming_no_cue | decrease |
To calculate the prediction test we need to know the results of the hypothesized effects for each endpoint and the associated weights. The PredTest package has several helper functions to calculate these required statistics.
The pred_results function takes a set of variables from a data set and compares whether the observed results align with the predictions that were made. Here we specify the predictions (hyps), the corresponding variables (endpoints), the dataset, type of study (group), the grouping/time variable (gtvar) and the names of the group categories with (grp_a) being the reference category. We then pass these arguments to the pred_results function and get the required output.
grp_endpoints is a vector of the variable names we have a prediction for and cog_hyps are the associated predictions. We could also specify a prediction of a difference (see papers for more details) or if all endpoints are expected to move in the same direction, we could specify cog_hyps = “increase” or cog_hyps = “decrease” to avoid specifying the same value for each variable name.
# Endpoints for the cognitive example
grp_endpts <- c(
"mean_suv","blind_moca_uncorrected","craft_verbatim","craft_delay_verbatim",
"number_span_forward","number_span_backward","fluency_f_words_correct",
"oral_trail_part_a","oral_trail_part_b","fluency_animals","fluency_vegetables",
"verbal_naming_no_cue"
)
# Specifying predictions for the cognitive data example
cog_hyps <- c("increase", "decrease", "decrease", "decrease",
"decrease", "decrease", "decrease", "increase", "increase",
"decrease", "decrease", "decrease")
# To get the results we pass the appropriate function values
group_results <- pred_results(dataset=group_cog_data, hypothesis=cog_hyps,
vars=grp_endpts, type="group", gtvar="group.factor",
grp_a="Control", grp_b ="ESKD", location="mean")
group_results
#> $results
#> [1] 1 1 1 1 0 1 1 0 1 1 1 1
#>
#> $differences
#> [1] 0.1024164 -0.6000000 -2.6000000 -2.9000000 1.5000000 -1.0000000
#> [7] -3.1000000 0.0000000 8.8000000 -3.5000000 -3.2000000 -0.1000000
#>
#> $variables
#> [1] "mean_suv" "blind_moca_uncorrected"
#> [3] "craft_verbatim" "craft_delay_verbatim"
#> [5] "number_span_forward" "number_span_backward"
#> [7] "fluency_f_words_correct" "oral_trail_part_a"
#> [9] "oral_trail_part_b" "fluency_animals"
#> [11] "fluency_vegetables" "verbal_naming_no_cue"
We can see that 10 of the 12 endpoints were correctly predicted (results$results).
We also need to calculate the associated weights for each endpoint. This adjusts for the fact that many of these endpoints are highly correlated, for example the correlation between craft verbatim and craft verbatim delayed is 0.88. We want to down weight variables that are highly correlated so that correctly predicting them is worth less than correctly predicting independent variables.
group_weights <- pred_weights(dataset=group_cog_data,
vars=grp_endpts,gtvar="group.factor",
type="group",corr_method="pearson")
group_weights
#> mean_suv blind_moca_uncorrected craft_verbatim
#> 0.4802467 0.4635745 0.4508275
#> craft_delay_verbatim number_span_forward number_span_backward
#> 0.4186522 0.4922581 0.6102238
#> fluency_f_words_correct oral_trail_part_a oral_trail_part_b
#> 0.3570943 0.4526987 0.4391324
#> fluency_animals fluency_vegetables verbal_naming_no_cue
#> 0.3192010 0.3948201 0.3343613
Number span backwards has the highest estimated weight (0.61) while fluency (animals) has the lowest (0.32), this is because the fluency variable was highly correlated (> |0.50|) with four other variables in the data set. The fluency variable and the other correlated variables are likely all addressing some underlying latent cognitive trait, therefore, giving full credit for correctly predicting that set wouldn’t be appropriate.
To better understand the weighting, consider the extreme example where all endpoints were perfectly correlated (e.g., weight measured in pounds, kilograms, and ounces). In this case each of the three endpoints would receive a weight of 1/3, so that correctly predicting all three endpoints would be given the weight of correctly predicting one endpoint.
Given the weights and results we can now calculate the prediction test. The pred_test function requires a vector of the weights and results and the specification of how the p-value should be calculated (exact, normal approximation, or parametric bootstrap). We can use the output from the pred_results and pred_weights functions to populate these arguments.
pred_test(weights_vector = group_weights,
results_vector = group_results$results,
test_type = "exact",
phi_0 = 0.5)
#> $num_correctly_predicted
#> [1] 10
#>
#> $p_value
#> [1] 1
#>
#> $test_stat
#> [,1]
#> [1,] 4.268134
#>
#> $p0
#> [,1]
#> [1,] 0.7414407
#>
#> $ci
#> [1] 0.5365908 0.9462905
For this data set we reject the null hypothesis in favor of the alternative. We can conclude that the researcher’s predictions were more in line with the results than would be expected by guessing (our null hypothesis). This provides some justification that the theory driving these predictions is sound enough to follow into a future trial.
We can visualize these results using a barplot
end <- grp_endpts
diff <- group_results$differences
outcome <- group_results$results
forplot <- as.data.frame( cbind(end, diff , outcome ) )
forplot$diff <- as.numeric(as.character(forplot$diff))
forplot$outcome <- as.numeric(as.character(forplot$outcome))
ymax <- max(1.25*max(forplot$diff), 0+.5*sd(forplot$diff))
ymin <- min(1.25*min(forplot$diff), 0-.5*sd(forplot$diff))
ggplot(data = forplot, aes(x=reorder(end, -diff), y=diff, fill = factor(outcome) ) ) +
geom_bar(stat='identity') +
scale_y_continuous(limits=c(ymin,ymax)) +
geom_bar(forplot, mapping = aes(end) ,alpha=0, size=1, color="black", stat='identity')+
theme_classic() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
labs( x = "", y = "Difference", fill = "Prediction\nresults") +
scale_fill_manual(values = c('0' = "White",'1'= "Black"),
labels = c("Incorrect", "Correct"), drop = FALSE) +
theme(plot.title = element_text(hjust = 0.5))
We can also analyze data that comes from pre-post studies, e.g. a single group assessed before and after an intervention (for two groups with pre-post data we could simply calculate the change scores and proceed with the between group test).
As an example, we consider a small data set consisting of response pre- and post-intervention to a set of physical fitness outcomes from 10 adults with disabilities. It was hypothesized that all endpoints would improve after the intervention, which corresponds to an increasing score on all endpoints except A2 work capacity. In this data set we have one subject with missing grip strength data, to calculate the differences and weights, we remove those rows.
As before we first calculate the results. The pred_results function call is slightly changed. We need to specify and ID variable (id), change type to “prepost” and update grp_a and grp_b to the respective values indicating the pre and post scores in the “Time” variable, in this case pre was a 0 and post was a 1.
# Endpoints for the fitness example
prepost_endpts <- c(
"COPM_p", "COPM_s", "A1_work", "A2_work", "Grip_dom",
"Grip_ndom", "Flex_right", "Flex_left"
)
# Specifying predictions for the fitness data set
fit_cog_hyps<- c("increase", "increase", "increase",
"decrease", "increase", "increase", "increase", "increase")
pre_post_fit <- pre_post_fit[complete.cases(pre_post_fit),]
pre_post_results <- pred_results(dataset = pre_post_fit,
id = "ID", hypothesis = fit_cog_hyps,
vars = prepost_endpts, type = "prepost",
gtvar = "Time", grp_a = 0,
grp_b = 1, location = "mean")
pre_post_results
#> $results
#> [1] 1 1 1 1 1 1 1 1
#>
#> $differences
#> [1] 1.4222222 1.9666667 23.3777778 -307.3666667 0.4555556
#> [6] 0.5444444 0.4333333 1.4222222
#>
#> $variables
#> [1] "COPM_p" "COPM_s" "A1_work" "A2_work" "Grip_dom"
#> [6] "Grip_ndom" "Flex_right" "Flex_left"
All endpoints moved in the hypothesized direction. We know we will reject the null at this point (correctly predicting all endpoints will result in the smallest p-value, and with more than 4 endpoints the p-value will be < 0.05); nevertheless, we calculate the weights.
pre_post_weights <- pred_weights(dataset = pre_post_fit,
id = "ID",
vars = prepost_endpts,
gtvar = "Time",
type = "prepost",
pre = 0,
post = 1,
corr_method = "pearson")
pre_post_weights
#> COPM_p COPM_s A1_work A2_work Grip_dom Grip_ndom Flex_right
#> 0.4140419 0.4006737 0.5409661 0.7258783 0.3803730 0.3558321 0.9271993
#> Flex_left
#> 0.7580178
Then we can pass the weights and results to the testing function.
pred_test(weights_vector = pre_post_weights,
results_vector = pre_post_results$results,
test_type = "exact")
#> $num_correctly_predicted
#> [1] 8
#>
#> $p_value
#> [1] 1
#>
#> $test_stat
#> [,1]
#> [1,] 4.502982
#>
#> $p0
#> [,1]
#> [1,] 0.8377922
#>
#> $ci
#> [1] 0.6755844 1.0000000
For this data set all 8 endpoints moved in the hypothesized direction, so the test statistic was the most extreme possible. Using the exact test there are 28 = 256 possible tests statistics, so the p-value is 1/256 = 0.004 and we would reject the null hypothesis as expected.
Adjusting for the effect of covariates can be important, especially when we know demographic factors influence the response. With the JASN data set (grouped data) we have data on the age of participants, and we know that age is associated with results on many cognitive tests. We also seem to have imbalance between the control and ESKD groups on age with ESKD patients younger on average than controls. We might expect the difference between groups to be even stronger when we adjust for the effect of age.
To adjust for covariates, we use the framework of multivariate multiple regression models (MMR). A MMR model is of the form Y = Xβ + ϵ where Y is an n x m matrix of responses with each tow corresponding to a single participant and X is an n x (q + 1) design matrix where q is the number of covariates. The unadjusted prediction test can be specified where X is an n x 2 matrix, with the first column of 1’s and the second column a dummy variable indicating group membership (0 or 1). Then the β̂s are simply the differences in means for Group 1 and Group 0 which we can then dichotomize into the results based on the predictions. It can also be shown that the sample correlation matrix estimated from the unadjusted MMR model is equivalent to the sample correlation matrix of the raw data, thus the unadjusted prediction test can be calculated from the estimated Betas and correlation matrix from the MMR model. To adjust for covariates such as age, we simply include them in the MMR model and use the model estimated Beta’s and correlation matrix to calculate an adjusted test.
The pred_adjusted function calculates the adjusted results and weights which can then be passed to the pred_results function. The function is only available for between group studies and cannot incorporate a prediction of a ‘difference’ on an endpoint.
The function provides both the results and weights which can be passed to the pred_test function. The function call is similar to the pred_results function call with the adition of the covariates argument which is how we specify what variables to adjust for and the ref argument, which is the reference category for the group variable, in our example it’s the control group.
Now, we re-analyze the cognitive data when adjusting for participant’s age.
adjusted <- pred_adjusted(dataset = group_cog_data,
hypothesis = cog_hyps,
vars = grp_endpts,
covariates = c("age"),
group = "group.factor", ref = "Control")
pred_test(results_vector = adjusted$results,
weights_vector = adjusted$weights,
test_type = "exact")
#> $num_correctly_predicted
#> [1] 11
#>
#> $p_value
#> [1] 0.001708984
#>
#> $test_stat
#> [,1]
#> [1,] 4.808919
#>
#> $p0
#> [,1]
#> [1,] 0.816298
#>
#> $ci
#> [1] 0.6471897 0.9854064
When adjusting for age we see stronger results than before. We’ve now correctly predicted 11 of the 12 endpoints which leads to a smaller p-value (roughly 0.002 vs 0.02). oral_trail_part_a was incorrectly predicted when using the raw data; it was hypothesized that the score would be higher in ESKD vs Control, but they had the same average score. However, when adjusting for the effect of age, we get a positive estimated effect for ESKD relative to group for oral_trail_part_a and this is recorded as a correct prediction.
The data sets used in these examples and provided with the PredTest package are not the exact same as that analyzed in Gupta et al. or from the fitness data set. The real data set cannot be released except by request, and thus the supplied data set was simulated to maintain the key properties of the real data.
Gupta A, Bansal A, Young, KY, Gautam A, Donald J, Comfort B, Montgomery RN. Blood–Brain Barrier Permeability in ESKD—A Proof-of-Concept Study. Journal of the American Society of Nephrology 34(9):p 1508-1511, September 2023. | DOI: 10.1681/ASN.0000000000000167
Montgomery RN, Mahnken JD. A prediction-based test for multiple endpoints. Statistics in Medicine. 2020; 39: 4267–4280. https://doi.org/10.1002/sim.8724
Montgomery RN, Ptomey LT, & Mahnken JD. A flexible test for early-stage studies with multiple endpoints. Journal of Applied Statistics. 2022, 15: 3048–3061. https://doi.org/10.1080/02664763.2022.2097204