Title: | Preparing Data For, and Calculating the Prediction Test |
---|---|
Description: | Global hypothesis tests combine information across multiple endpoints to test a single hypothesis. The prediction test is a recently proposed global hypothesis test with good performance for small sample sizes and many endpoints of interest. The test is also flexible in the types and combinations of expected results across the individual endpoints. This package provides functions for data processing and calculation of the prediction test. |
Authors: | Richard Vargas [aut], Neal Montgomery [aut, cre] |
Maintainer: | Neal Montgomery <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.0 |
Built: | 2024-11-16 05:51:56 UTC |
Source: | https://github.com/cran/PredTest |
A simulated dataset demonstrating group differences in three variables. This dataset includes two groups and one covariate, sex.
adjusted_example
A data frame with 20 rows and 5 variables:
A binary factor indicating group membership: 0 for control and 1 for treatment.
A binary factor indicating sex: 0 for male and 1 for female.
A numeric vector representing the first variable.
A numeric vector representing the second variable.
A numeric vector representing the third variable.
data(adjusted_example)
head(adjusted_example)
This function calculates the difference between two groups of data based on a specified location measure (median or mean).
create_difference_vector(grp_1_data, grp_2_data, location = "median")
grp_1_data |
A data frame where all columns are numeric, representing the first group of data. |
grp_2_data |
A data frame where all columns are numeric, representing the second group of data. |
location |
A string specifying the location measure to use for calculating differences. Must be either 'median' or 'mean'. |
The function checks if the specified location measure is valid ('median' or 'mean'). It also checks if both groups of data are numeric and if they have the same size and column variables. Based on the location measure, it calculates the differences and returns them as a numeric vector.
A numeric vector representing the differences between the second group's location measure and the first group's location measure for each column.
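The calculation described above can be approximated in base R. This is a minimal sketch rather than the package source: `location_difference` is a hypothetical name, and the sketch assumes both data frames have identical column names.

```r
# Hypothetical base-R sketch; not the PredTest implementation.
location_difference <- function(grp_1_data, grp_2_data, location = "median") {
  # Reject anything other than the two supported location measures.
  location <- match.arg(location, c("median", "mean"))
  # Assumption: both groups carry the same columns in the same order.
  stopifnot(identical(names(grp_1_data), names(grp_2_data)))
  center <- if (location == "median") stats::median else mean
  # Second group's location minus the first group's, column by column.
  sapply(grp_2_data, center) - sapply(grp_1_data, center)
}

df_1 <- data.frame(v1 = c(1, 2, 100), v2 = c(4, 5, 6))
df_2 <- data.frame(v1 = c(7, 6, 5), v2 = c(4, 3, 2))
median_diff <- location_difference(df_2, df_1, "median")
mean_diff <- location_difference(df_2, df_1, "mean")
```

With these inputs the median difference for `v1` is 2 - 6 = -4, while the mean difference for `v1` is pulled upward by the outlier 100, which illustrates why the median is the default.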
df_1 <- data.frame(v1 = c(1, 2, 100), v2 = c(4, 5, 6))
df_2 <- data.frame(v1 = c(7, 6, 5), v2 = c(4, 3, 2))
create_difference_vector(df_2, df_1, 'median')
create_difference_vector(df_2, df_1, 'mean')
This function takes a data frame with two groups and splits them by a group identifier and specified column variables.
filter_by_group_var(df, grp_var, grp_1, grp_2, vars)
df |
A data frame which must have a column to identify two different groups. |
grp_var |
The column with the two groups, e.g., 'Treatment'. |
grp_1 |
The first group identifier in the |
grp_2 |
The second group identifier in the |
vars |
The column variables the researcher is interested in. The researcher can subset the columns instead of using all potential column variables. |
This function checks if the input data frame, group variable, and column variables are valid. It ensures that the specified groups exist within the group variable column. The function then filters the data for each group and returns a list containing the filtered data frames.
A list of two data frames that are subsets of the original data frame, separated by their group status.
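The filtering step can be pictured with plain data frame subsetting. The sketch below is an illustration under assumptions: `split_by_group` is a hypothetical name, and the list element names `group_1` and `group_2` are taken from how the result is accessed in the package example.

```r
# Hypothetical base-R sketch; not the PredTest implementation.
split_by_group <- function(df, grp_var, grp_1, grp_2, vars) {
  # Basic input checks mirroring the ones described in the text.
  stopifnot(is.data.frame(df), grp_var %in% names(df), all(vars %in% names(df)))
  stopifnot(all(c(grp_1, grp_2) %in% df[[grp_var]]))
  # Keep only the requested columns for each group.
  list(
    group_1 = df[df[[grp_var]] == grp_1, vars, drop = FALSE],
    group_2 = df[df[[grp_var]] == grp_2, vars, drop = FALSE]
  )
}

toy <- data.frame(
  group = c("placebo", "drug", "placebo", "drug"),
  v1 = c(1, 2, 3, 4),
  v2 = c(10, 20, 30, 40)
)
parts <- split_by_group(toy, "group", "placebo", "drug", c("v1", "v2"))
```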
# Load example data
data("group_data_example")
# Use the function to filter by group
result <- filter_by_group_var(df = group_data_example, grp_var = "group",
                              grp_1 = 'placebo', grp_2 = 'drug',
                              vars = c("v1", "v2"))
print(result$group_1)
print(result$group_2)
This function creates two groups based on their ID and separates them by a time variable. Each ID will be present in each subset exactly once. The subsets will contain the same variables specified by the user.
filter_by_time_var(df, id, time_var, pre, post, vars)
df |
A data frame which must have a column to identify two different groups by ID. |
id |
The column variable that will contain all of the IDs that show up twice in the data set. |
time_var |
The column variable that designates the time of the data, e.g., it may be a 0 if data was collected on the first day of treatment and a 6 if data is collected six days later. |
pre |
The value in the |
post |
The value in the |
vars |
The column variables the researcher is interested in. The researcher can subset the columns instead of using all potential column variables. |
This function checks if the input data frame, ID, time variable, and column variables are valid. It ensures that the specified pre and post values exist within the time variable column. The function then filters the data for each time point, orders the IDs, and returns a list containing the filtered data frames.
A list of two data frames that are subsets of the original data frame, separated by their time_var status. The data frames will have the same number of rows.
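Ordering by ID matters because it lets row i of the pre subset be compared with row i of the post subset. The following base-R sketch shows the idea; `split_by_time` is a hypothetical name, and returning only the `vars` columns is an assumption of this sketch.

```r
# Hypothetical base-R sketch; not the PredTest implementation.
split_by_time <- function(df, id, time_var, pre, post, vars) {
  stopifnot(all(c(id, time_var, vars) %in% names(df)))
  stopifnot(all(c(pre, post) %in% df[[time_var]]))
  pick <- function(value) {
    # Rows recorded at the requested time point, sorted by subject ID.
    rows <- df[df[[time_var]] == value, , drop = FALSE]
    rows <- rows[order(rows[[id]]), , drop = FALSE]
    rows[, vars, drop = FALSE]
  }
  list(pre = pick(pre), post = pick(post))
}

toy <- data.frame(
  ID = c(2, 1, 1, 2),
  time = c(0, 12, 0, 12),
  v1 = c(5, 7, 3, 9)
)
parts <- split_by_time(toy, "ID", "time", pre = 0, post = 12, vars = "v1")
# Because both subsets are sorted by ID, rows line up subject by subject.
paired_change <- parts$post$v1 - parts$pre$v1
```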
# Load example data
data("pre_post_data_example")
# Use the function to filter by time variable
result <- filter_by_time_var(pre_post_data_example, id = "ID", time_var = "time",
                             pre = 0, post = 12, vars = c("v1", "v2"))
print(result$pre)
print(result$post)
This function receives user input on the hypothesis of the experiment results and informs the user if the hypotheses were correct. It can handle hypotheses of 'increase', 'decrease', or 'different', and performs appropriate statistical tests.
get_results_vector(hypothesis, differences, diff_method = "wilcoxon", grp_a = NULL, grp_b = NULL, phi_0 = 0.5)
hypothesis |
A string or a vector of strings where the string or all elements of the vector are in the set of {'decrease', 'increase', 'different'}. If it's a string, it will be converted to a vector of the same length as |
differences |
A numeric vector representing the differences between two groups. |
diff_method |
A string specifying the method to use for testing 'different' hypotheses. Valid options are 'wilcoxon' or 't'. Defaults to 'wilcoxon'. |
grp_a |
A data frame representing the first group for testing 'different' hypotheses. This argument is only needed if 'different' is part of the hypothesis. |
grp_b |
A data frame representing the second group for testing 'different' hypotheses. This argument is only needed if 'different' is part of the hypothesis. |
phi_0 |
A numeric value on the interval (0, 1) representing the decision rule threshold for the p-value. Defaults to 0.5. |
The function checks if the input hypotheses are valid, and performs the necessary statistical tests to determine if the hypotheses were correct. It handles 'increase' and 'decrease' hypotheses by comparing differences, and 'different' hypotheses by performing either a Wilcoxon signed-rank test or a paired t-test.
A numeric vector of 0s and 1s. If the hypothesis was incorrect, a 0 is returned. If the hypothesis was correct, a 1 is returned.
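For the directional hypotheses, the comparison amounts to checking the sign of each difference. The sketch below covers only 'increase' and 'decrease'; `direction_results` is a hypothetical name, and treating a difference of exactly zero as incorrect is an assumption of this sketch.

```r
# Hypothetical base-R sketch; covers directional hypotheses only.
direction_results <- function(hypothesis, differences) {
  # A single string is recycled to one hypothesis per difference.
  hypothesis <- rep_len(hypothesis, length(differences))
  stopifnot(all(hypothesis %in% c("increase", "decrease")))
  # 1 when the sign of the difference matches the predicted direction.
  as.numeric(ifelse(hypothesis == "increase", differences > 0, differences < 0))
}

differences <- c(-5, 3)
single <- direction_results("increase", differences)
mixed <- direction_results(c("decrease", "increase"), differences)
```

Here `single` is `c(0, 1)` because only the second difference is positive, and `mixed` is `c(1, 1)` because each difference has the predicted sign.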
df_1 <- data.frame(v1 = c(1, 2, -100), v2 = c(40, 5, 6))
df_2 <- data.frame(v1 = c(7, 6, 5), v2 = c(4, 3, 2))
differences <- create_difference_vector(df_2, df_1)
# using singular increase
get_results_vector(hypothesis = 'increase', differences = differences)
# using 'different' hypothesis and a pre post scenario
get_results_vector(hypothesis = c('increase', 'different'), differences = differences,
                   grp_a = df_1, grp_b = df_2, phi_0 = 0.05)
A dataset representing cognitive scores for control and treatment groups, with various cognitive and demographic variables.
group_cog_data
A data frame with 20 rows and 20 variables:
A factor indicating group membership: Control or ESKD (End-Stage Kidney Disease).
A numeric vector representing the mean SUV (Standard Uptake Value).
A numeric vector representing uncorrected MOCA (Montreal Cognitive Assessment) scores.
A numeric vector representing scores on the Craft Verbatim memory test.
A numeric vector representing delayed scores on the Craft Verbatim memory test.
A numeric vector representing forward number span scores.
A numeric vector representing backward number span scores.
A numeric vector representing the number of correct F words in a verbal fluency test.
A numeric vector representing scores on the oral trail making test part A.
A numeric vector representing scores on the oral trail making test part B.
A numeric vector representing the number of animal names listed in a verbal fluency test.
A numeric vector representing the number of vegetable names listed in a verbal fluency test.
A numeric vector representing scores on a verbal naming test without cues.
A numeric vector representing the age of each subject.
data(group_cog_data)
head(group_cog_data)
A simulated dataset showing group differences across four variables. The dataset is divided into two groups: placebo and drug.
group_data_example
A data frame with 30 rows and 5 variables:
A factor indicating group membership: placebo or drug.
A numeric vector representing the first variable.
A numeric vector representing the second variable.
A numeric vector representing the third variable.
A numeric vector representing the fourth variable.
data(group_data_example)
head(group_data_example)
A simulated dataset showing measurements before and after an intervention. Each subject has multiple measurements over time.
pre_post_data_example
A data frame with 30 rows and 6 variables:
A unique identifier for each subject.
A numeric variable indicating time points, 0 for pre-intervention and 12 for post-intervention.
A numeric vector representing the first variable.
A numeric vector representing the second variable.
A numeric vector representing the third variable.
A numeric vector representing the fourth variable.
data(pre_post_data_example)
head(pre_post_data_example)
A dataset showing physical and cognitive performance measures before and after a fitness intervention.
pre_post_fit
A data frame with 20 rows and 12 variables:
A unique identifier for each subject.
A numeric variable indicating time points, 0 for pre-intervention and 1 for post-intervention.
A numeric variable indicating sex: 0 for male and 1 for female.
A numeric variable representing the age of each subject.
A numeric vector representing performance scores on the Canadian Occupational Performance Measure (COPM).
A numeric vector representing satisfaction scores on the COPM.
A numeric vector representing work capacity scores at time A1.
A numeric vector representing work capacity scores at time A2.
A numeric vector representing grip strength of the dominant hand.
A numeric vector representing grip strength of the non-dominant hand.
A numeric vector representing right arm flexibility.
A numeric vector representing left arm flexibility.
data(pre_post_fit)
head(pre_post_fit)
This function calculates adjusted predictions for variables of interest, taking into account covariates and group comparisons. It then returns whether the results align with the hypothesized direction of effects.
pred_adjusted(dataset, hypothesis, vars, covariates, group, ref)
dataset |
A data frame containing the data to be analyzed. |
hypothesis |
A string or vector of strings containing either 'increase' or 'decrease', indicating the expected direction of the effect. |
vars |
A vector of variable names in the dataset that are the outcomes of interest. These must be numeric columns. |
covariates |
A vector of covariates to include in the model. These must be numeric columns in the dataset. |
group |
The name of the grouping variable in the dataset. This must be a column in the dataset and should not overlap with |
ref |
The reference category within the group variable. This must be a value present in the group column. |
A list with two elements:
A vector indicating whether each hypothesis was correct (1 for correct, 0 for incorrect).
A vector of weights corresponding to each variable in vars, calculated from the correlation matrix.
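One common way to compare groups while accounting for a covariate is a linear model with the group indicator and the covariate as predictors. Whether `pred_adjusted` works this way internally is not stated here, so the sketch below is only an illustration of the general idea; the toy data and all object names are hypothetical.

```r
# Illustrative sketch of a covariate-adjusted group comparison.
# Assumption: a linear model is used; this is not confirmed for PredTest.
toy <- data.frame(
  group = factor(rep(c(0, 1), each = 4)),
  sex = rep(c(0, 1), times = 4),
  v1 = c(5, 6, 5, 7, 3, 4, 2, 4)
)
# Make the control level (0) the reference category.
toy$group <- relevel(toy$group, ref = "0")
fit <- lm(v1 ~ group + sex, data = toy)
# Adjusted difference between group 1 and the reference group.
group_effect <- coef(fit)[["group1"]]
# A 'decrease' hypothesis is scored as correct when the effect is negative.
decrease_correct <- as.numeric(group_effect < 0)
```

Because sex is balanced across the two groups in this toy data, the adjusted effect equals the raw difference in group means, 3.25 - 5.75 = -2.5.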
data("group_cog_data")
data("adjusted_example")
# simple example
pred_adjusted(adjusted_example, c("decrease", "increase"), c('v1', 'v2'), 'sex', "group", 0)
# simulated example
pred_adjusted(dataset = group_cog_data, hypothesis = "decrease",
              vars = c('craft_verbatim', 'fluency_f_words_correct'),
              covariates = c('number_span_forward', 'number_span_backward'),
              group = "group.factor", ref = "Control")
This function is a wrapper that conditionally handles filtering by group or time, calculates the difference vector, and evaluates hypotheses to return a list of results.
pred_results(dataset, id = NULL, vars, type = "group", hypothesis, gtvar, grp_a, grp_b, location = "median", diff_method = "wilcoxon", phi_0 = 0.5)
dataset |
A data frame for research. |
id |
The column that identifies unique subjects. This should be |
vars |
The column variables of interest. |
type |
The type of study. Valid values are 'group' for group-based data and 'prepost' for pre-post data. Defaults to 'group'. |
hypothesis |
A vector or string of valid hypotheses: 'increase', 'decrease', or 'different'. |
gtvar |
The column of interest to divide the groups (e.g., time or treatment). |
grp_a |
The first subset of interest within the |
grp_b |
The second subset of interest within the |
location |
The measure of central tendency to use for the difference calculation. Valid options are 'median' or 'mean'. Defaults to 'median'. |
diff_method |
The method to use for testing 'different' hypotheses. Valid options are 'wilcoxon' or 't'. Defaults to 'wilcoxon'. |
phi_0 |
The decision rule threshold for the p-value. If the p-value is less than phi_0, a 'different' hypothesis is counted as a success. Defaults to 0.5. |
This function performs error handling to ensure appropriate input values and types. It then filters the data based on the study type, calculates the difference vector, and evaluates the hypotheses using the specified method.
A list containing:
A vector of 0s and 1s indicating whether each hypothesis was correct.
A vector of the differences between groups.
The column variables used in the analysis.
data("group_data_example")
data("group_cog_data")
data("pre_post_data_example")
data("pre_post_fit")
# simple group analysis
pred_results(dataset = group_data_example, vars = c('v1', 'v2'),
             hypothesis = c("increase", "different"),
             gtvar = "group", grp_a = "placebo", grp_b = "drug")
# simple prepost analysis
pred_results(dataset = pre_post_data_example, id = "ID", vars = c('v1', 'v2', 'v3'),
             type = "prepost", hypothesis = "increase",
             gtvar = "time", grp_a = 0, grp_b = 12)
# simulated group analysis
pred_results(dataset = group_cog_data, vars = c('blind_moca_uncorrected', 'craft_verbatim'),
             type = "group", hypothesis = "decrease",
             gtvar = "group.factor", grp_a = "Control", grp_b = "ESKD")
# simulated prepost analysis
pred_results(dataset = pre_post_fit, id = "ID", vars = c('Flex_right', 'Flex_left'),
             type = "prepost", hypothesis = "increase",
             gtvar = "Time", grp_a = 0, grp_b = 1)
This function performs statistical tests to determine the predictive power of a results set weighted by a corresponding vector of weights. It offers various methods to conduct the test, allowing flexibility depending on the data characteristics and analysis requirements.
pred_test(weights_vector, results_vector, test_type = "exact", phi_0 = 0.5, sims = 5000)
weights_vector |
A numeric vector where each element represents the weight for a corresponding result in the results vector. Each value must be on the interval |
results_vector |
A numeric vector of test results where each element is in the set {0, 1}, representing the binary outcome of each prediction. |
test_type |
A character string specifying the type of statistical test to perform. The valid options are 'exact', 'approx', or 'bootstrap'. |
phi_0 |
A numeric value on the interval (0, 1) representing the null hypothesis value against which the test results are compared. |
sims |
A natural number that specifies the number of simulations to perform when the bootstrap method is chosen. This parameter allows control over the robustness of the bootstrap approximation. |
This function performs error handling to ensure appropriate input values and types. It then calculates the test statistic and evaluates the p-value based on the specified test type.
A list containing:
The number of results correctly predicted as per the specified criteria.
The p-value resulting from the test, indicating the probability of observing the test results under the null hypothesis.
The test statistic calculated based on the weights and results.
The estimated proportion derived from the weights and results.
A confidence interval for the estimated proportion derived from the weights and results using the Wilson score method.
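To make the idea of an exact test concrete, the sketch below enumerates every possible 0/1 results vector and sums the null probabilities of those whose weighted total is at least as large as the observed one. It rests on assumptions not confirmed by the text above: that the statistic is the weighted sum of correct predictions, and that under the null each prediction is an independent success with probability `phi_0`. `exact_weighted_p_value` is a hypothetical name.

```r
# Hypothetical sketch of an exact weighted test; assumptions noted above.
exact_weighted_p_value <- function(weights, results, phi_0 = 0.5) {
  m <- length(weights)
  stopifnot(length(results) == m, all(results %in% c(0, 1)))
  observed <- sum(weights * results)
  # All 2^m possible result vectors, one per row.
  outcomes <- as.matrix(expand.grid(rep(list(0:1), m)))
  stat <- as.vector(outcomes %*% weights)
  successes <- rowSums(outcomes)
  # Null probability of each outcome under independent Bernoulli(phi_0).
  prob <- phi_0^successes * (1 - phi_0)^(m - successes)
  # Small tolerance guards against floating-point ties.
  sum(prob[stat >= observed - 1e-12])
}

weights_vector <- c(1/3, 0.5, 1)
results_vector <- c(0, 1, 1)
p_exact <- exact_weighted_p_value(weights_vector, results_vector)
```

For these inputs the observed weighted sum is 1.5, and only two of the eight equally likely outcomes reach it, so the sketch gives 0.25. Enumeration grows as 2^m, which is one reason approximate and bootstrap alternatives are useful when there are many endpoints.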
# Example weights and results vectors
weights_vector <- c(1/3, 0.5, 1)
results_vector <- c(0, 1, 1)
# Exact test
result_exact <- pred_test(weights_vector, results_vector, test_type = 'exact')
result_exact
# Approximate test
result_approx <- pred_test(weights_vector, results_vector, test_type = 'approx')
result_approx
# Bootstrap test
result_bootstrap <- pred_test(weights_vector, results_vector, test_type = 'bootstrap')
result_bootstrap
This function calculates predictive weights by computing the inverse square sum of a correlation matrix derived from the specified variables. In 'group' analysis, it directly uses the variables for correlation. In 'prepost' analysis, it calculates the difference between two time points before correlation.
pred_weights(dataset, vars, gtvar, type = "group", id = NULL, pre = NULL, post = NULL, corr_method = "pearson")
dataset |
A data frame containing the dataset to be analyzed. |
vars |
A vector of strings specifying the names of the variables to be used in the correlation analysis. |
gtvar |
The name of the categorical variable used to identify groups in 'prepost' type analysis. |
type |
The type of analysis. Valid values are 'group' for group-based correlation analysis or 'prepost' for pre-post analysis. Defaults to 'group'. |
id |
The variable in the dataset that uniquely identifies subjects in a 'prepost' analysis. This should not be |
pre |
Specifies the baseline time point for 'prepost' analysis. |
post |
Specifies the follow-up time point for 'prepost' analysis. |
corr_method |
The method of correlation. Valid options are 'pearson', 'kendall', or 'spearman'. Defaults to 'pearson'. |
This function performs error handling to ensure appropriate input values and types. It calculates the correlation matrix for the specified variables and then computes the predictive weights as the inverse square sum of the correlation matrix.
A numeric vector of predictive weights for each variable analyzed.
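One reading of "inverse square sum" is that each variable's weight is the reciprocal of the sum of its squared correlations with all variables, itself included. The sketch below follows that reading, which is an assumption rather than a statement of the package's code; `inverse_square_sum_weights` is a hypothetical name.

```r
# Hypothetical sketch; assumes weight_i = 1 / sum_j(cor_ij^2).
inverse_square_sum_weights <- function(data, method = "pearson") {
  corr <- cor(data, method = method)
  # Each row sum includes the diagonal entry 1, so weights never exceed 1.
  1 / rowSums(corr^2)
}

x <- c(1, 2, 3, 4, 5)
toy <- data.frame(
  a = x,
  b = 2 * x + 1,            # perfectly correlated with a
  c = c(1, -2, 2, -2, 1)    # uncorrelated with a and b
)
toy_weights <- inverse_square_sum_weights(toy)
```

Under this reading, the two perfectly correlated variables each receive weight 0.5, so together they count as one endpoint, while the uncorrelated variable keeps weight 1.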
data("group_data_example")
data("group_cog_data")
data("pre_post_data_example")
data("pre_post_fit")
# end points for variables
grp_endpts <- c(
  "mean_suv", "blind_moca_uncorrected", "craft_verbatim", "craft_delay_verbatim",
  "number_span_forward", "number_span_backward", "fluency_f_words_correct",
  "oral_trail_part_a", "oral_trail_part_b", "fluency_animals", "fluency_vegetables",
  "verbal_naming_no_cue"
)
prepost_endpts <- c(
  "COPM_p", "COPM_s", "A1_work", "A2_work",
  "Grip_dom", "Grip_ndom", "Flex_right", "Flex_left"
)
# simple group
pred_weights(dataset = group_data_example, vars = c('v1', 'v2'), gtvar = 'group')
# simple prepost
pred_weights(dataset = pre_post_data_example, vars = c('v1', 'v2', 'v3'),
             gtvar = 'time', id = 'ID', pre = 0, post = 12)
# simulated group
pred_weights(dataset = group_cog_data, vars = grp_endpts, gtvar = "group.factor",
             type = "group", corr_method = "pearson")
# simulated prepost
pred_weights(dataset = pre_post_fit, id = "ID", vars = prepost_endpts, gtvar = "Time",
             type = "prepost", pre = 0, post = 1, corr_method = "pearson")
This function calculates the adjusted proportion estimate (p0) and the confidence interval for a given proportion estimate (p_hat) and sample size (n) using the score method.
solve_p0_score_ci(p_hat, n, z = 1.96)
p_hat |
Numeric value. The proportion estimate. Must be between 0 and 1. |
n |
Numeric value. The sample size. Must be a positive integer. |
z |
Numeric value. The z-score for the desired confidence level. Default is 1.96 (approximately 95% confidence interval). |
A list with two elements:
p0 |
The adjusted proportion estimate. |
confidence_interval |
A numeric vector of length 2 containing the lower and upper bounds of the confidence interval. |
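The standard Wilson score interval gives a concrete picture of this kind of calculation. The sketch below assumes that the "score method" here is the Wilson interval and that `p0` corresponds to its midpoint; `wilson_score` is a hypothetical name and not the package function.

```r
# Hypothetical sketch of the Wilson score interval.
wilson_score <- function(p_hat, n, z = 1.96) {
  stopifnot(p_hat >= 0, p_hat <= 1, n > 0)
  denom <- 1 + z^2 / n
  # Midpoint of the interval: p_hat shrunk toward 0.5.
  center <- (p_hat + z^2 / (2 * n)) / denom
  half_width <- (z / denom) * sqrt(p_hat * (1 - p_hat) / n + z^2 / (4 * n^2))
  list(
    p0 = center,
    confidence_interval = c(center - half_width, center + half_width)
  )
}

res <- wilson_score(p_hat = 9 / 10, n = 10)
```

Unlike the simple Wald interval, the Wilson interval is centered on a value pulled toward 0.5 and stays inside [0, 1], which helps when n is small or p_hat is near a boundary.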
solve_p0_score_ci(p_hat = 9/10, n = 10)