Scans variable labels in a survey design object or labelled data frame for
groups of variables sharing a common preface (via separator or longest
common prefix). Detected prefaces are written to question_preface in the
metadata and the shared text is trimmed from each variable label, leaving
only the unique suffix.
Usage
infer_question_prefaces(
  x,
  sep = c(" - ", "- ", " – ", ": ", " | "),
  min_vars = 2L,
  lcp_min = 20L,
  overwrite = FALSE,
  verbose = TRUE
)
Arguments
- x
A survey design object (survey_taylor, survey_replicate, etc.) or a data frame with haven-style "label" attributes.
- sep
Character vector of literal separator strings to try, in priority order. Default: c(" - ", "- ", " \u2013 ", ": ", " | ").
- min_vars
Minimum number of variables that must share a candidate preface to trigger extraction. Default 2L.
- lcp_min
Minimum character length (after trimming to a word boundary) for an LCP-derived preface to be accepted. Default 20L.
- overwrite
If FALSE (default), variables that already have a question_preface are skipped and a warning is emitted. Set TRUE to replace existing prefaces without warning.
- verbose
If TRUE (default), emits a cli summary for each detected group.
Details
Detection algorithm (two passes):
1. Separator pass: for each separator in sep (tried in order):
- Variables whose label contains the separator are grouped by their candidate preface (the text before the first occurrence of the separator, trimmed).
- Any group with at least min_vars members is recorded; those variables are excluded from all subsequent passes.
2. LCP pass: for the remaining labelled variables (at least 2 are required):
- The character-level longest common prefix (LCP) of all remaining labels is computed and trimmed to the last word boundary.
- If the trimmed LCP is at least lcp_min characters long, the group is recorded.
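The two passes above can be approximated in base R. This is a minimal sketch, not the package's implementation: candidate_prefaces() and lcp_word_boundary() are hypothetical helper names, the labels are made-up sample data, and the word-boundary rule (drop everything after the last whitespace) is an assumption about how the trimming works.

```r
# Hypothetical helper: group variables by the text before the first `sep`.
candidate_prefaces <- function(labels, sep, min_vars = 2L) {
  pos <- as.integer(regexpr(sep, labels, fixed = TRUE))  # -1 when sep is absent
  has_sep <- pos > 1L                                    # require a non-empty preface
  preface <- trimws(substr(labels[has_sep], 1L, pos[has_sep] - 1L))
  groups <- split(names(labels)[has_sep], preface)
  groups[lengths(groups) >= min_vars]                    # keep groups that are large enough
}

# Hypothetical helper: character-level LCP, cut back to the last whitespace.
lcp_word_boundary <- function(labels) {
  chars <- strsplit(labels, "")
  n <- min(lengths(chars))
  k <- 0L
  while (k < n &&
         length(unique(vapply(chars, `[`, character(1), k + 1L))) == 1L) {
    k <- k + 1L
  }
  lcp <- substr(labels[[1L]], 1L, k)
  # Assumption: drop the trailing partial word (or trailing whitespace).
  sub("\\s+\\S*$", "", lcp)
}

labels <- c(
  discrim_a = "Please rate discrimination - Muslims",
  discrim_b = "Please rate discrimination - Jews",
  sat_a     = "How satisfied are you with your manager",
  sat_b     = "How satisfied are you with your mentor"
)

# Pass 1: separator grouping; matched variables leave the pool.
sep_groups <- candidate_prefaces(labels, " - ")
remaining  <- labels[setdiff(names(labels), unlist(sep_groups))]

# Pass 2: LCP over what is left, accepted only if long enough (lcp_min = 20).
lcp        <- lcp_word_boundary(remaining)
accept_lcp <- nchar(lcp) >= 20L
```

Here the first two labels are grouped under "Please rate discrimination" by the " - " separator, and the remaining two share an LCP that is cut back from "... your m" to "How satisfied are you with your". The sketch tries a single separator; the documented behaviour repeats the first pass for each entry of sep in order.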
Apply step:
- Variables with an existing question_preface are skipped when overwrite = FALSE (default); a warning is emitted listing the count of skipped variables.
- Variables whose unique suffix would be empty after trimming are always skipped with a per-variable warning.
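The skip rules of the apply step can be illustrated for a single column. This is a rough sketch under stated assumptions: apply_preface() is a hypothetical helper, not a package function, and it warns per column, whereas the documented behaviour reports a count of variables skipped because of existing prefaces.

```r
# Hypothetical helper: write a detected preface and trimmed label to one column.
apply_preface <- function(col, preface, suffix, overwrite = FALSE) {
  # Rule 1: keep an existing preface unless overwrite = TRUE.
  if (!overwrite && !is.null(attr(col, "question_preface", exact = TRUE))) {
    warning("question_preface already set; variable skipped", call. = FALSE)
    return(col)
  }
  # Rule 2: never leave a variable with an empty label.
  if (!nzchar(trimws(suffix))) {
    warning("label would be empty after trimming; variable skipped", call. = FALSE)
    return(col)
  }
  attr(col, "question_preface") <- preface
  attr(col, "label") <- trimws(suffix)
  col
}

x <- 1:5
attr(x, "label") <- "Please rate discrimination - Muslims"
x <- apply_preface(x, "Please rate discrimination", "Muslims")
```

Calling apply_preface() on x a second time with overwrite = FALSE leaves it unchanged and warns; with overwrite = TRUE the preface is replaced silently.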
Data frame integration:
When called on a data frame, the detected preface is written to
attr(col, "question_preface"). Passing the result to as_survey()
automatically picks up both the trimmed label and the preface via the
internal haven metadata extraction step.
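Because the preface is stored as a plain column attribute, it can be inspected with base R before the data frame is passed on. The sketch below sets the attributes by hand to mimic the state after detection; get_attr() is a hypothetical helper and the labels are made-up sample data.

```r
df <- data.frame(fam_a = 1:3, fam_b = 4:6, age = c(30L, 41L, 52L))

# Mimic the column attributes left behind after preface detection.
attr(df$fam_a, "label") <- "Brand A"
attr(df$fam_b, "label") <- "Brand B"
attr(df$fam_a, "question_preface") <- "How familiar are you with"
attr(df$fam_b, "question_preface") <- "How familiar are you with"
attr(df$age, "label") <- "Age of respondent"

# Hypothetical helper: pull one attribute from every column, NA when absent.
get_attr <- function(data, which) {
  vapply(data, function(col) {
    value <- attr(col, which, exact = TRUE)
    if (is.null(value)) NA_character_ else value
  }, character(1))
}

# One row per variable, showing the preface alongside the trimmed label.
meta <- data.frame(
  variable = names(df),
  preface  = get_attr(df, "question_preface"),
  label    = get_attr(df, "label"),
  row.names = NULL,
  stringsAsFactors = FALSE
)
```

A table like meta makes it easy to check which variables received a preface (fam_a, fam_b) and which were left alone (age) before building the survey object.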
See also
Other metadata:
extract_question_preface(),
extract_val_labels(),
extract_var_label(),
extract_var_note(),
set_question_preface(),
set_question_prefaces(),
set_val_labels(),
set_value_labels(),
set_var_label(),
set_var_note(),
set_variable_labels(),
set_variable_notes(),
survey_metadata(),
survey_weighting_history()
Examples
# Data frame with haven-style labels (Qualtrics / SPSS export pattern)
df <- data.frame(
  discrim_a = 1:5,
  discrim_b = 2:6,
  discrim_c = 3:7
)
attr(df$discrim_a, "label") <-
  "Please rate discrimination - Evangelical Christians"
attr(df$discrim_b, "label") <-
  "Please rate discrimination - Muslims"
attr(df$discrim_c, "label") <-
  "Please rate discrimination - Jews"
df <- infer_question_prefaces(df, verbose = FALSE)
attr(df$discrim_a, "label") # "Evangelical Christians"
#> [1] "Evangelical Christians"
attr(df$discrim_a, "question_preface") # "Please rate discrimination"
#> [1] "Please rate discrimination"