Survey data often suffer from dropout: participants fail to complete sections of the survey because of interruptions or omissions. Handling dropout properly is crucial for accurate data analysis and interpretation. The dropout package addresses this by offering insights into participant behavior during the survey process.
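The dropout package helps you extract insights from your dataset, such as where participants drop out, which sections they skip, and how many values are missing for each item. With these insights you can, for example, filter out early dropouts or compare dropout behavior across demographic groups.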
In this vignette, we will provide an in-depth overview of the dropout package’s features and their practical use. We will use a sample dataset named “flying” to illustrate these concepts. This is a modified version of the Flying Etiquette Survey data behind the story “41 percent of flyers say it’s rude to recline your seat on an airplane”. You can load this preinstalled dataset into your environment using the command data(flying).
While the dropout package can function independently, integrating it with the tidyverse ecosystem (especially dplyr) can significantly streamline your workflow. However, every method used in this vignette can also be written in base R alone.
drop_summary
Let’s start with the drop_summary function, the main tool for examining dropout patterns within your dataset. To use drop_summary effectively, specify the last column in your dataset that corresponds to the survey items. If you encounter a warning message while using this function, it could be attributed to either of the following reasons:
For example, in the “flying” dataset, the final survey-related item is stored in the “location_census_region” column. The “survey_type” column that follows it contains supplementary survey information. Many datasets include similar non-survey data, so it is important to account for it. If the last column is left unspecified, the drop_summary function will assume that only survey-related items are present.
To gain a comprehensive overview of the dropout patterns within your dataset, consider the following code snippet:
drop_summary(flying, "location_census_region")
#> # A tibble: 27 × 8
#>    column_name           dropout drop_rate drop_na section_na single_na missing
#>    <chr>                   <int>     <dbl>   <dbl>      <dbl>     <dbl>   <dbl>
#>  1 respondent_id               0         0       0          0         0       0
#>  2 travel_frequency            0         0       0          0         0       0
#>  3 seat_recline               18      0.02      18        164         0     182
#>  4 height                      0      0.02      18        164        12     194
#>  5 children_under_18           1      0.02      19        164         6     189
#>  6 two_armrests                1      0.02      20        164         0     184
#>  7 middle_armrest              0      0.02      20        164         0     184
#>  8 window_shade                0      0.02      20        164         0     184
#>  9 moving_to_unsold_seat       1      0.02      21        164         0     185
#> 10 talking_to_seatmate         0      0.02      21        164         0     185
#> # ℹ 17 more rows
#> # ℹ 1 more variable: completion_rate <dbl>
Now, let’s look more closely at the drop_summary function and the insights it provides in a structured format.
drop_summary
When you use the drop_summary function, the output you receive is a compact yet informative summary, packaged as either a dataframe or a tibble. This summary consists of multiple columns, each of which describes a different aspect of dropout within your dataset.
column_name: Lists the names of the columns from your dataset that have been analyzed for dropouts.
dropout: Contains the number of dropouts that occur at each listed column, allowing you to see where dropout is most pronounced.
drop_rate: Shows the dropout rate at each column as a proportion (0.02 corresponds to 2 percent). In the output above it is cumulative: it stays at 0.02 in columns where no new dropout occurs. This is useful for understanding the relative impact of dropouts in various parts of your dataset.
drop_na: Gives the number of missing values in each column that can be attributed specifically to dropouts (18, 19, 20, ... in the output above). This offers insights into the nature of missing data.
section_na: Counts missing values that are part of a run spanning at least n consecutive columns (n defaults to 3). You can adjust this threshold with the section_min argument. This column is particularly useful for identifying participants who might skip entire sections of a survey without dropping out completely.
single_na: Gives the number of single-instance missing values in each column, which are not associated with systematic dropouts or section skips.
missing: Gives the total number of missing values in each column. In the output above it equals the sum of drop_na, section_na, and single_na.
completion_rate: Denotes the overall data completion rate for each analyzed column, enabling you to gauge the integrity and reliability of your dataset.
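The section_min threshold described above can be changed in the call itself. The following is a minimal sketch, assuming that section_min is passed as a named argument after the last survey column; the value 5 is an arbitrary choice for illustration.

```r
library(dropout)
data(flying)

# Count a run of missing values as a skipped section only if it spans
# at least 5 consecutive columns (the default threshold is 3).
# Assumption: section_min is supplied as a named argument.
summary_wide <- drop_summary(flying, "location_census_region", section_min = 5)
print(summary_wide)
```

With a higher threshold, shorter runs of missing values should no longer be counted in section_na.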
Using drop_detect to Identify Dropouts
One of the core tools in the dropout package is the drop_detect function. It isolates individual participants’ dropout behavior: it reveals whether a participant has left the survey prematurely and pinpoints the exact point at which the dropout took place.
The structure and usage of drop_detect deliberately mirror those of the drop_summary function, ensuring consistency and ease of adoption.
The output contains three columns:
dropout: A Boolean column (TRUE or FALSE). A TRUE value signifies that the respective participant exited the survey prematurely.
dropout_column: For participants marked as TRUE in the dropout column, this field gives the name of the column (question) at which the dropout occurred.
dropout_index: Gives the index (position) of that column in the dataset, which makes it easy to filter by where in the survey the dropout occurred.
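To make the three output columns concrete, here is a toy base R sketch of the underlying idea: a row counts as a dropout when it ends in an unbroken run of missing values. This is an illustration under stated assumptions, not the package’s implementation. The toy data frame and the helper trailing_na_start are hypothetical, and dropout_index is taken here to be a column position.

```r
# Toy survey: rows are participants, columns are items in survey order.
toy <- data.frame(
  q1 = c(1, 2, 3, 4),
  q2 = c(1, NA, 3, NA),
  q3 = c(1, NA, NA, 4),
  q4 = c(1, NA, 3, 4)
)

# Hypothetical helper: position at which a row's trailing run of NAs starts,
# or NA if the last item was answered (no dropout).
trailing_na_start <- function(is_missing) {
  answered <- which(!is_missing)
  last_answered <- if (length(answered) > 0) max(answered) else 0L
  if (last_answered == length(is_missing)) NA_integer_ else last_answered + 1L
}

idx <- unname(apply(is.na(toy), 1, trailing_na_start))

result <- data.frame(
  dropout = !is.na(idx),            # TRUE if the row ends in a run of NAs
  dropout_column = names(toy)[idx], # name of the first unanswered trailing item
  dropout_index = idx               # its column position
)
print(result)
```

Only the second participant is flagged, with q2 as the dropout column. The isolated gaps in rows 3 and 4 are followed by answers, so they are not treated as dropout in this sketch.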
For practical insight, consider applying drop_detect to the ‘flying’ dataset.
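A minimal sketch of such a call, assuming the same last survey column as in the drop_summary example (the object name dropouts is chosen for illustration):

```r
library(dropout)
data(flying)

# One row of dropout information per participant in flying
dropouts <- drop_detect(flying, "location_census_region")
head(dropouts)
```

Each row of the result corresponds to the participant in the same row of flying.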
Moreover, if you wish to append the extracted dropout details back to the original dataset, you can use the bind_cols function from the dplyr package.
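The following sketch uses the same pipe pattern that appears later in this vignette. The object name flying_with_dropout is an assumption made for illustration.

```r
library(dropout)
library(dplyr)
data(flying)

# Append the three drop_detect columns to the right of the original data
flying_with_dropout <- flying %>%
  drop_detect("location_census_region") %>%
  bind_cols(flying, .)

names(flying_with_dropout)
```

In base R, cbind(flying, drop_detect(flying, "location_census_region")) should give a comparable result.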
Integrating the dropout details into the primary dataset can act as a preliminary step for more nuanced analyses, such as zooming in on specific dropout triggers, assessing commonalities among dropouts, or other exploratory work.
Subsequent sections will delve into exemplified applications of this integrated approach.
The drop_detect function can be useful for identifying and filtering out early dropouts, i.e., participants who stopped answering the survey at a specific column. For example, you can use dropout_index to keep only participants who did not drop out early, or who had a ‘late’ dropout in the demographic part of the questions.
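One way to express such a filter is sketched below. It rests on three assumptions that are not stated in this vignette: that dropout_index holds a column position, that it is NA for participants who did not drop out, and that gender is the first demographic column in flying. The names demographics_start and kept are hypothetical.

```r
library(dropout)
library(dplyr)
data(flying)

# Assumption: "gender" is the first demographic column in flying.
demographics_start <- match("gender", names(flying))

kept <- flying %>%
  drop_detect("location_census_region") %>%
  bind_cols(flying, .) %>%
  # Keep participants without dropout, or whose dropout began
  # in the demographic block
  filter(!dropout | dropout_index >= demographics_start)

nrow(kept)
```

Because TRUE | NA evaluates to TRUE in R, participants without dropout are kept even if their dropout_index is missing.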
If you’re interested in a specific section of questions and want to filter for dropouts and section_na, you have two approaches:
In this approach, we create a subset of the data containing only the first 22 columns. We then apply the drop_detect function to this subset to identify dropouts and other relevant indicators.
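The subsetting step can be sketched as follows. The object names flying_section and section_dropouts are assumptions for illustration. The last column is left unspecified on the assumption that drop_detect, like drop_summary, then treats every column of the subset as a survey item.

```r
library(dropout)
data(flying)

# Keep only the first 22 columns, i.e. the section of interest
flying_section <- flying[, 1:22]

# No last column is given: all columns of the subset are
# assumed to be survey items
section_dropouts <- drop_detect(flying_section)
head(section_dropouts)
```

A participant who answered later questions but left the rest of this section empty would appear here as a dropout within the subset.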
One practical application is to compare the demographics (e.g., age and gender) between those who left out a section and those who did not. The following code generates a bar graph showing the number of participants with and without dropout, broken down by age group and gender.
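library(dropout)
library(dplyr)
library(ggplot2)
# Detect dropout within the columns up to smoking_violation and append the result
flying %>%
  drop_detect("smoking_violation") %>%
  bind_cols(flying, .) %>%
  filter(!is.na(gender)) %>%
  # Order the age groups from youngest to oldest on the x-axis
  mutate(age = factor(age, levels = c("18-29", "30-44", "45-60", "> 60"))) %>%
  ggplot(aes(x = age, fill = dropout)) +
  geom_bar(position = "dodge") +
  facet_grid(gender ~ .) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))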
By visualizing the data, you can more easily discern patterns and disparities among different demographic groups with respect to dropout rates.
Another interesting avenue is to explore whether there’s a relationship between dropout behavior (or in this case leaving out a section) and demographic variables like age and gender:
# Build the modelling data: dropout indicator plus age and gender
test <- flying %>%
  drop_detect("smoking_violation") %>%
  bind_cols(flying, .) %>%
  filter(!is.na(gender)) %>%
  select(dropout, age, gender) %>%
  mutate(dropout = as.integer(dropout))  # TRUE/FALSE becomes 1/0
# Logistic regression of dropout on gender and age
glm_model <- glm(dropout ~ gender + age, data = test, family = binomial)
print(summary(glm_model))
This can be particularly useful for hypothesis testing and can aid in uncovering patterns in your data.