distinct() physically removes duplicate rows from a survey design
object, always issuing surveycore_warning_physical_subset. Unlike
dplyr::distinct(), all columns in @data are retained regardless of
which columns are specified in ... — design variables must never be
lost from the survey object.
For subpopulation analyses, use filter() instead — it marks rows
out-of-domain without removing them, preserving valid variance estimation.
Usage
# S3 method for class 'survey_base'
distinct(.data, ..., .keep_all = FALSE)
# S3 method for class 'survey_collection'
distinct(.data, ..., .keep_all = FALSE, .if_missing_var = NULL)
distinct(.data, ..., .keep_all = FALSE)
Arguments
- .data
A survey_base object.
- ...
<data-masking> Optional columns used to determine uniqueness. If empty, all non-design columns are used. Note: .keep_all is always TRUE regardless of what is specified here.
- .keep_all
Accepted for interface compatibility; has no effect. The survey implementation always retains all columns in @data.
- .if_missing_var
Per-call override of collection@if_missing_var. One of "error" or "skip", or NULL (the default) to inherit the collection's stored value. See surveycore::set_collection_if_missing_var().
Value
An object of the same class as .data with the following properties:
- Rows physically reduced to distinct subset (fewer rows possible).
- All columns in @data are retained (.keep_all = TRUE always).
- @variables$visible_vars is unchanged: distinct is a pure row operation.
- @metadata is unchanged.
- @groups is unchanged.
- Always issues surveycore_warning_physical_subset.
Details
Column retention
distinct() always behaves as if .keep_all = TRUE. Specifying columns
in ... controls which columns determine uniqueness — it does not
control which columns appear in the result. This is a deliberate
divergence from dplyr::distinct(df, x, y) which by default drops all
columns except x and y.
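The difference is easiest to see on a plain data frame. The sketch below uses only dplyr and a hypothetical toy data frame (df, x, and y are illustrative names, not part of this package). The second call shows the behavior that the survey method is described as always adopting.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
df <- data.frame(
  x = c(1, 1, 2),
  y = c("a", "b", "c")
)

# dplyr default: only the columns named in ... survive
default_result <- distinct(df, x)
names(default_result)
#> [1] "x"

# With .keep_all = TRUE, uniqueness is still judged on x alone,
# but every column is kept (the first row for each distinct x is retained)
keep_all_result <- distinct(df, x, .keep_all = TRUE)
names(keep_all_result)
#> [1] "x" "y"
nrow(keep_all_result)
#> [1] 2
```

On a survey design object the second form is the only one available, so the weight and stratum columns cannot be dropped by accident.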
Survey collections
When applied to a survey_collection, distinct() is dispatched to each
member independently — there is no cross-survey deduplication. Two
members that share a literally identical row will both retain that row
in their post-distinct() results. This is the V9 contract from the
survey-collection spec; collections deliberately avoid the
bind_rows() analogy here because cross-survey deduplication has no
coherent variance interpretation across designs.
Each member's distinct.survey_base issues
surveycore_warning_physical_subset independently — N firings on an
N-member collection. Capture with withCallingHandlers().
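As a minimal sketch of that capture pattern, the base R code below counts and muffles warnings by class. Because building a real survey_collection is outside the scope of this page, fake_collection_distinct() is a hypothetical stand-in that signals one classed warning per member; only the warning class name is taken from the text above.

```r
# Hypothetical stand-in: signals one warning with the class named on this
# page. This is not the package's implementation.
warn_physical_subset <- function() {
  cond <- structure(
    list(message = "rows physically removed", call = NULL),
    class = c("surveycore_warning_physical_subset", "warning", "condition")
  )
  warning(cond)
}

# Hypothetical stand-in for distinct() on an n-member collection
fake_collection_distinct <- function(n_members) {
  for (i in seq_len(n_members)) warn_physical_subset()
  invisible(n_members)
}

n_caught <- 0L
withCallingHandlers(
  fake_collection_distinct(3L),
  # The handler is matched on the condition class
  surveycore_warning_physical_subset = function(w) {
    n_caught <<- n_caught + 1L
    invokeRestart("muffleWarning")  # suppress this warning and continue
  }
)
n_caught
#> [1] 3
```

With a real collection, the same handler would presumably wrap the distinct() call in place of the stand-in. Matching on the class rather than on the message text leaves unrelated warnings untouched.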
Examples
library(surveytidy)
library(surveycore)
# create a survey design from the pew_npors_2025 example dataset
d <- as_survey(pew_npors_2025, weights = weight, strata = stratum)
# deduplicate on all non-design columns (issues physical-subset warning)
distinct(d)
#> Warning: ! `distinct()` physically removes rows from the survey data.
#> ℹ This is different from `filter()`, which preserves all rows for correct
#> variance estimation.
#> ✔ Use `filter()` for subpopulation analyses instead.
#>
#> ── Survey Design ───────────────────────────────────────────────────────────────
#> <survey_taylor> (Taylor series linearization)
#> Sample size: 5022
#>
#> # A tibble: 5,022 × 65
#> respid mode language languageinitial stratum interview_start interview_end
#> <dbl> <dbl> <dbl> <dbl> <dbl> <date> <date>
#> 1 1470 2 1 NA 10 2025-05-27 2025-05-27
#> 2 2374 2 1 NA 7 2025-05-01 2025-05-01
#> 3 1177 3 1 10 5 2025-03-04 2025-03-04
#> 4 15459 2 1 NA 10 2025-05-05 2025-05-05
#> 5 9849 1 1 9 9 2025-02-22 2025-02-22
#> 6 8178 3 1 9 10 2025-03-10 2025-03-10
#> 7 3682 1 1 9 4 2025-02-27 2025-02-27
#> 8 6999 2 1 NA 10 2025-05-12 2025-05-12
#> 9 9945 2 1 NA 10 2025-05-09 2025-05-09
#> 10 1901 1 1 9 10 2025-03-01 2025-03-01
#> # ℹ 5,012 more rows
#> # ℹ 58 more variables: econ1mod <dbl>, econ1bmod <dbl>, comtype2 <dbl>,
#> # unity <dbl>, crimesafe <dbl>, govprotct <dbl>, moregunimpact <dbl>,
#> # fin_sit <dbl>, vet1 <dbl>, vol12_cps <dbl>, eminuse <dbl>, intmob <dbl>,
#> # intfreq <dbl>, intfreq_collapsed <dbl>, home4nw2 <dbl>, bbhome <dbl>,
#> # smuse_fb <dbl>, smuse_yt <dbl>, smuse_x <dbl>, smuse_ig <dbl>,
#> # smuse_sc <dbl>, smuse_wa <dbl>, smuse_tt <dbl>, smuse_rd <dbl>, …
# deduplicate by one column (all other columns still retained)
distinct(d, cregion)
#> Warning: ! `distinct()` physically removes rows from the survey data.
#> ℹ This is different from `filter()`, which preserves all rows for correct
#> variance estimation.
#> ✔ Use `filter()` for subpopulation analyses instead.
#>
#> ── Survey Design ───────────────────────────────────────────────────────────────
#> <survey_taylor> (Taylor series linearization)
#> Sample size: 4
#>
#> # A tibble: 4 × 65
#> respid mode language languageinitial stratum interview_start interview_end
#> <dbl> <dbl> <dbl> <dbl> <dbl> <date> <date>
#> 1 1470 2 1 NA 10 2025-05-27 2025-05-27
#> 2 15459 2 1 NA 10 2025-05-05 2025-05-05
#> 3 9849 1 1 9 9 2025-02-22 2025-02-22
#> 4 3682 1 1 9 4 2025-02-27 2025-02-27
#> # ℹ 58 more variables: econ1mod <dbl>, econ1bmod <dbl>, comtype2 <dbl>,
#> # unity <dbl>, crimesafe <dbl>, govprotct <dbl>, moregunimpact <dbl>,
#> # fin_sit <dbl>, vet1 <dbl>, vol12_cps <dbl>, eminuse <dbl>, intmob <dbl>,
#> # intfreq <dbl>, intfreq_collapsed <dbl>, home4nw2 <dbl>, bbhome <dbl>,
#> # smuse_fb <dbl>, smuse_yt <dbl>, smuse_x <dbl>, smuse_ig <dbl>,
#> # smuse_sc <dbl>, smuse_wa <dbl>, smuse_tt <dbl>, smuse_rd <dbl>,
#> # smuse_bsk <dbl>, smuse_th <dbl>, smuse_ts <dbl>, radio <dbl>, …
