
Design-Based Population Covariance for a Survey Design
Source:R/analysis-covariance.R
get_covariance.RdCompute the design-based estimate of the finite-population Pearson
covariance for every (unordered, by default) pair of numeric variables
selected from x, with optional grouping, uncertainty quantification,
and metadata-driven labelling. Matches the off-diagonal entries of
survey::svyvar() (Kish n/(n-1) correction) on Taylor, replicate,
twophase, and nonprob designs at numerical parity.
Usage
get_covariance(
design,
x,
group = NULL,
redundant = FALSE,
diagonal = FALSE,
variance = "ci",
conf_level = 0.95,
n_weighted = FALSE,
decimals = NULL,
min_cell_n = 30L,
na.rm = TRUE,
label_values = TRUE,
label_vars = TRUE,
name_style = "surveycore",
...,
.id = NULL,
.if_missing_var = NULL
)Arguments
- design
A survey design object:
survey_taylor,survey_replicate,survey_twophase, orsurvey_nonprob. Also accepts asurvey_collection.- x
<
tidy-select> Two or more unquoted variable names. Must resolve to at least two columns. Non-numeric columns are dropped with a warning; if fewer than 2 numeric variables remain, an error is raised.- group
<
tidy-select> Optional grouping variable(s). Combined with any grouping set bygroup_by(). DefaultNULL. Covariances are estimated separately within each group using that group's own weighted means for centring.- redundant
Logical. If
FALSE(default), each unordered pair appears once in supply order (lower-triangle). IfTRUE, both(A, B)and(B, A)are emitted.- diagonal
Logical. If
FALSE(default), self-pairs(x, x)are excluded. IfTRUE, one self-pair per variable is emitted withcovariance = \eqn{\widehat{\mathrm{Var}}(x)}{Var_hat(x)}(the design-based variance – not1).- variance
NULLor a character vector of one or more of"se","ci","var","cv","moe","deff". Default"ci".- conf_level
Numeric scalar in
(0, 1). Default0.95.- n_weighted
Logical. If
TRUE, append ann_weightedcolumn with the pair's pairwise-complete sum of weights. DefaultFALSE.- decimals
Integer or
NULL. If integer, rounds all numeric output columns to this many places. DefaultNULL(no rounding).- min_cell_n
Integer. Minimum pairwise unweighted count before
surveycore_warning_small_cellfires. Default30L(AAPOR).- na.rm
Logical. If
TRUE(default), pairwise-complete deletion per pair, and rows withNAin any group variable are excluded from the output. IfFALSE, NAs propagate to produceNaNestimates; NA group values are retained as their own group row.- label_values
Logical. If
TRUE(default) and the grouping variable has value labels, the group column is converted to a labelled factor.- label_vars
Logical. If
TRUE(default) and variable labels are set in metadata,var1andvar2show labels instead of raw names.- name_style
"surveycore"(default) or"broom". Under"broom", renamescovariance->estimate,se->std.error,ci_low->conf.low,ci_high->conf.high.- ...
Unused. Reserved so that
.idand.if_missing_varremain named-only when asurvey_collectionis passed asdesign.- .id
Character(1) or
NULL. Column name used to identify each survey whendesignis asurvey_collection. For collection inputs,NULL(the default) resolves to the collection's stored@idproperty. Pass a non-NULLvalue to override. Ignored whendesignis a single survey.- .if_missing_var
"error","skip", orNULL. How to handle surveys in a collection that lack one of the requested NSE variables. For collection inputs,NULL(the default) resolves to the collection's stored@if_missing_varproperty. Pass a non-NULLvalue to override. Ignored whendesignis a single survey.
Value
A survey_covariance tibble (also inheriting survey_result).
Columns, in order:
[group_cols...]— group variable columns (when active), first.var1,var2— factor columns identifying the pair (levels inx-supply order).covariance— design-based Pearson covariance estimate (Kish-corrected).NaNfor degenerate cells;0for pairs where at least one variable is constant on the active domain.Uncertainty columns (
se,var,cv,ci_low,ci_high,moe,deff) — only those requested viavariance.n— pairwise unweighted count.n_weighted— pair's sum of weights (only when requested).
Details
Confidence intervals use the normal-Wald approximation on the SE of the
covariance estimate: ci_low = covariance - z * se,
ci_high = covariance + z * se, where z = qnorm((1 + conf_level) / 2).
The bounds are not clamped. Covariance is unbounded — ci_low and
ci_high may have opposite signs and may cross zero. Users who want
clamped intervals can post-process. This behaviour matches
survey::svyvar().
NA handling is pairwise-complete per pair: each ordered pair drops
rows where either variable is NA. There is no na_handling argument;
pairwise is the only policy. This matches survey::svyvar() off-diagonal
pair-at-a-time semantics, not svyvar()'s default listwise deletion
across a multi-variable formula. Numerical parity therefore only holds
when oracle calls are made pair-at-a-time
(survey::svyvar(~x + y, design) per pair).
Under diagonal = TRUE, the self-pair (x, x) returns the design-based
Kish-corrected variance of x on the active domain — not 1 as in
get_corr(). The covariance matrix diagonal is the variance vector, not
the identity. The diagonal-parity gate guarantees that
get_covariance(d, c(x, x), diagonal = TRUE)$covariance and $se equal
get_variance(d, x)$variance and $se numerically (point at 1e-10,
SE at 1e-8) when the active domains match.
Design effect (deff) uses the Goodnight / Mood-Graybill SRS reference
SE_SRS(cov) = sqrt((Var(x) * Var(y) + cov^2) / (n - 1)). When both
the design SE and SRS SE are zero (constant-variable pairs), deff is
set to exactly 0 (0 / 0 guard).
References
Mood, A. M., Graybill, F. A., & Boes, D. C. (1974). Introduction to the Theory of Statistics (3rd ed.). McGraw-Hill.
Lumley, T. (2010). Complex Surveys: A Guide to Analysis Using R. Wiley.
Cochran, W. G. (1977). Sampling Techniques (3rd ed.). Wiley.
Demnati, A., & Rao, J. N. K. (2004). Linearization variance estimators for survey data. Survey Methodology, 30, 17–26.
See also
Other analysis:
clean(),
get_anova(),
get_corr(),
get_diffs(),
get_freqs(),
get_means(),
get_pairwise(),
get_quantiles(),
get_ratios(),
get_t_test(),
get_totals(),
get_variance(),
meta()
Examples
d <- as_survey(
nhanes_2017,
ids = sdmvpsu,
weights = wtint2yr,
strata = sdmvstra,
nest = TRUE
)
get_covariance(d, x = c(ridageyr, bpxsy1))
#> # A tibble: 1 × 6
#> var1 var2 covariance ci_low ci_high n
#> <fct> <fct> <dbl> <dbl> <dbl> <int>
#> 1 Age in years at screening Systolic: Blood pre… 197. 184. 210. 6302
# Include the diagonal (self-pairs return Var(x), not 1)
get_covariance(d, x = c(ridageyr, bpxsy1), diagonal = TRUE)
#> # A tibble: 3 × 6
#> var1 var2 covariance ci_low ci_high n
#> <fct> <fct> <dbl> <dbl> <dbl> <int>
#> 1 Age in years at screening Systolic… 197. 184. 210. 6302
#> 2 Age in years at screening Age in y… 515. 497. 534. 9254
#> 3 Systolic: Blood pres (1st rdg) mm Hg Systolic… 316. 296. 336. 6302
# With grouping
get_covariance(d, x = c(ridageyr, bpxsy1), group = riagendr)
#> # A tibble: 2 × 7
#> riagendr var1 var2 covariance ci_low ci_high n
#> <dbl> <fct> <fct> <dbl> <dbl> <dbl> <int>
#> 1 1 Age in years at screening Systolic: … 162. 149. 174. 3115
#> 2 2 Age in years at screening Systolic: … 234. 216. 252. 3187