
distinct() physically removes duplicate rows from a survey design object, always issuing surveycore_warning_physical_subset. Unlike dplyr::distinct(), all columns in @data are retained regardless of which columns are specified in ... — design variables must never be lost from the survey object.

For subpopulation analyses, use filter() instead — it marks rows out-of-domain without removing them, preserving valid variance estimation.

Usage

# S3 method for class 'survey_base'
distinct(.data, ..., .keep_all = FALSE)

# S3 method for class 'survey_collection'
distinct(.data, ..., .keep_all = FALSE, .if_missing_var = NULL)

distinct(.data, ..., .keep_all = FALSE)

Arguments

.data

A survey_base or survey_collection object.

...

<data-masking> Optional columns used to determine uniqueness. If empty, all non-design columns are used. Note: columns named here only define uniqueness; every column in @data is retained, as if .keep_all = TRUE.

.keep_all

Accepted for interface compatibility; has no effect. The survey implementation always retains all columns in @data.

.if_missing_var

Per-call override of collection@if_missing_var. One of "error" or "skip", or NULL (the default) to inherit the collection's stored value. See surveycore::set_collection_if_missing_var().

Value

An object of the same class as .data with the following properties:

  • Rows are physically reduced to the distinct subset, so the result may have fewer rows than the input.

  • All columns in @data are retained (.keep_all = TRUE always).

  • @variables$visible_vars is unchanged — distinct is a pure row operation.

  • @metadata is unchanged.

  • @groups is unchanged.

  • Always issues surveycore_warning_physical_subset.

Details

Column retention

distinct() always behaves as if .keep_all = TRUE. Specifying columns in ... controls which columns determine uniqueness — it does not control which columns appear in the result. This is a deliberate divergence from dplyr::distinct(df, x, y) which by default drops all columns except x and y.
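
The contrast with dplyr's default can be seen on a plain data frame. This is a minimal sketch that uses only dplyr; the data frame and its column names are made up for illustration, and no survey object is involved.

```r
library(dplyr)

# Toy data frame; the columns x, y, z are illustrative only.
df <- data.frame(x = c(1, 1, 2), y = c("a", "a", "b"), z = c(10, 20, 30))

# dplyr default: only the columns named in ... are returned.
only_x <- dplyr::distinct(df, x)
names(only_x)

# With .keep_all = TRUE, the first row for each distinct x is kept
# together with every other column. The survey method is documented
# to behave this way regardless of the .keep_all argument.
all_cols <- dplyr::distinct(df, x, .keep_all = TRUE)
names(all_cols)
```

In both calls the result has two rows; only the second keeps y and z.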

Default deduplication (empty ...)

When ... is empty, deduplication uses all non-design columns. Design variables (strata, PSU, weights, FPC) are excluded from the uniqueness check — deduplicating on them would produce meaningless or survey-corrupting results.
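
The idea of checking uniqueness on non-design columns while keeping every column can be sketched in base R. This is an illustration of the concept, not the package's implementation; the column names wt and strat are hypothetical stand-ins for design variables, and keeping the first occurrence of each duplicate is an assumption borrowed from dplyr's behaviour.

```r
# Toy data; `wt` and `strat` stand in for design variables (hypothetical names).
df <- data.frame(
  age   = c(30, 30, 45),
  sex   = c("f", "f", "m"),
  wt    = c(1.2, 0.8, 1.0),
  strat = c(1, 2, 1)
)
design_vars <- c("wt", "strat")

# Uniqueness is judged on the non-design columns only.
key_cols <- setdiff(names(df), design_vars)
keep <- !duplicated(df[key_cols])

# Every column, design variables included, stays in the result.
deduped <- df[keep, , drop = FALSE]
deduped
```

Rows 1 and 2 match on age and sex, so only the first survives even though their weights and strata differ.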

Design variable warning

If ... includes a design variable, surveytidy_warning_distinct_design_var is issued before the operation. The operation still proceeds after the warning — the user is assumed to know what they are doing.

Survey collections

When applied to a survey_collection, distinct() is dispatched to each member independently — there is no cross-survey deduplication. Two members that share a literally identical row will both retain that row in their post-distinct() results. This is the V9 contract from the survey-collection spec; collections deliberately avoid the bind_rows() analogy here because cross-survey deduplication has no coherent variance interpretation across designs.
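
The difference between per-member and cross-member deduplication can be illustrated with plain data frames. This is a sketch in base R under the assumption that a list of data frames is a fair stand-in for a collection's members; it does not use survey_collection itself.

```r
# Two toy "members" that share the literally identical row (1, "x").
a <- data.frame(id = c(1, 1, 2), v = c("x", "x", "y"))
b <- data.frame(id = c(1, 3), v = c("x", "z"))
members <- list(a = a, b = b)

# Per-member deduplication: each member is handled on its own,
# so the shared row survives once in a and once in b.
per_member <- lapply(members, unique)
sapply(per_member, nrow)

# Cross-member deduplication (the bind_rows-style behaviour that
# collections are described as avoiding) would drop the shared row.
pooled <- unique(do.call(rbind, members))
nrow(pooled)
```

Per-member deduplication leaves four rows in total across the two members, while pooling first would leave three.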

Each member's distinct.survey_base issues surveycore_warning_physical_subset independently — N firings on an N-member collection. Capture with withCallingHandlers().
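
One way to capture the repeated warning is a class-specific calling handler. The sketch below is base R only: fake_member_distinct() is a hypothetical stand-in that signals a warning with the documented class, so the handler pattern can be shown without a real collection.

```r
# Hypothetical stand-in: signals a classed warning, then deduplicates,
# mimicking what each member's distinct() call is described as doing.
fake_member_distinct <- function(x) {
  warning(structure(
    class = c("surveycore_warning_physical_subset", "warning", "condition"),
    list(message = "rows physically removed", call = NULL)
  ))
  unique(x)
}

n_warnings <- 0L
result <- withCallingHandlers(
  lapply(list(c(1, 1, 2), c(3, 3), c(4, 5)), fake_member_distinct),
  surveycore_warning_physical_subset = function(w) {
    n_warnings <<- n_warnings + 1L   # count each firing
    invokeRestart("muffleWarning")   # silence it and resume the computation
  }
)
n_warnings
```

Because the handler matches on the condition class, warnings of other classes are not muffled and still reach the caller.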

See also

filter() for domain-aware row marking (preferred for subpopulation analyses)

Other row operations: drop_na, slice

Examples

library(surveytidy)
library(surveycore)

# create a survey design from the pew_npors_2025 example dataset
d <- as_survey(pew_npors_2025, weights = weight, strata = stratum)

# deduplicate on all non-design columns (issues physical-subset warning)
distinct(d)
#> Warning: ! `distinct()` physically removes rows from the survey data.
#>  This is different from `filter()`, which preserves all rows for correct
#>   variance estimation.
#>  Use `filter()` for subpopulation analyses instead.
#> 
#> ── Survey Design ───────────────────────────────────────────────────────────────
#> <survey_taylor> (Taylor series linearization)
#> Sample size: 5022
#> 
#> # A tibble: 5,022 × 65
#>    respid  mode language languageinitial stratum interview_start interview_end
#>     <dbl> <dbl>    <dbl>           <dbl>   <dbl> <date>          <date>       
#>  1   1470     2        1              NA      10 2025-05-27      2025-05-27   
#>  2   2374     2        1              NA       7 2025-05-01      2025-05-01   
#>  3   1177     3        1              10       5 2025-03-04      2025-03-04   
#>  4  15459     2        1              NA      10 2025-05-05      2025-05-05   
#>  5   9849     1        1               9       9 2025-02-22      2025-02-22   
#>  6   8178     3        1               9      10 2025-03-10      2025-03-10   
#>  7   3682     1        1               9       4 2025-02-27      2025-02-27   
#>  8   6999     2        1              NA      10 2025-05-12      2025-05-12   
#>  9   9945     2        1              NA      10 2025-05-09      2025-05-09   
#> 10   1901     1        1               9      10 2025-03-01      2025-03-01   
#> # ℹ 5,012 more rows
#> # ℹ 58 more variables: econ1mod <dbl>, econ1bmod <dbl>, comtype2 <dbl>,
#> #   unity <dbl>, crimesafe <dbl>, govprotct <dbl>, moregunimpact <dbl>,
#> #   fin_sit <dbl>, vet1 <dbl>, vol12_cps <dbl>, eminuse <dbl>, intmob <dbl>,
#> #   intfreq <dbl>, intfreq_collapsed <dbl>, home4nw2 <dbl>, bbhome <dbl>,
#> #   smuse_fb <dbl>, smuse_yt <dbl>, smuse_x <dbl>, smuse_ig <dbl>,
#> #   smuse_sc <dbl>, smuse_wa <dbl>, smuse_tt <dbl>, smuse_rd <dbl>, …

# deduplicate by one column (all other columns still retained)
distinct(d, cregion)
#> Warning: ! `distinct()` physically removes rows from the survey data.
#>  This is different from `filter()`, which preserves all rows for correct
#>   variance estimation.
#>  Use `filter()` for subpopulation analyses instead.
#> 
#> ── Survey Design ───────────────────────────────────────────────────────────────
#> <survey_taylor> (Taylor series linearization)
#> Sample size: 4
#> 
#> # A tibble: 4 × 65
#>   respid  mode language languageinitial stratum interview_start interview_end
#>    <dbl> <dbl>    <dbl>           <dbl>   <dbl> <date>          <date>       
#> 1   1470     2        1              NA      10 2025-05-27      2025-05-27   
#> 2  15459     2        1              NA      10 2025-05-05      2025-05-05   
#> 3   9849     1        1               9       9 2025-02-22      2025-02-22   
#> 4   3682     1        1               9       4 2025-02-27      2025-02-27   
#> # ℹ 58 more variables: econ1mod <dbl>, econ1bmod <dbl>, comtype2 <dbl>,
#> #   unity <dbl>, crimesafe <dbl>, govprotct <dbl>, moregunimpact <dbl>,
#> #   fin_sit <dbl>, vet1 <dbl>, vol12_cps <dbl>, eminuse <dbl>, intmob <dbl>,
#> #   intfreq <dbl>, intfreq_collapsed <dbl>, home4nw2 <dbl>, bbhome <dbl>,
#> #   smuse_fb <dbl>, smuse_yt <dbl>, smuse_x <dbl>, smuse_ig <dbl>,
#> #   smuse_sc <dbl>, smuse_wa <dbl>, smuse_tt <dbl>, smuse_rd <dbl>,
#> #   smuse_bsk <dbl>, smuse_th <dbl>, smuse_ts <dbl>, radio <dbl>, …