Fits a GLM to survey data, producing design-based coefficient estimates and variance-covariance matrix via the Binder (1983) sandwich estimator. All five surveycore design classes are supported.
Arguments
- design
A survey design object created by
as_survey(),as_survey_replicate(),as_survey_twophase(), oras_survey_nonprob().- formula
A model formula in standard R notation (e.g.
y ~ x1 + x2). Mutually exclusive withresponse/predictors. IfNULLandresponseis alsoNULL, errors withsurveycore_error_formula_missing.- response
Character string naming the outcome variable. Programmatic alternative to
formula. Mutually exclusive withformula. Use withpredictorsto build a model formula viareformulate(predictors, response). Suitable forlapply()iteration.- predictors
Character vector of predictor variable names. Used with
responseto build the model formula. Ifresponseis supplied andpredictorsisNULL, an intercept-only model is fitted.- family
A GLM family object specifying the error distribution and link function. Default
gaussian(). Any family accepted bystats::glm()is supported. Forbinomial()andquasibinomial()families, the "non-integer #successes" warning is suppressed because survey weights are non-integer by design.- na.action
How to handle
NAvalues in the model frame. Defaultna.omit(silently drops rows with anyNAin model variables).na.failerrors withsurveycore_error_na_in_datalisting the offending columns and NA counts. Note:na.actionapplies only to model frame variables; survey weights are validated separately.- start
Starting values for the coefficient vector.
- etastart
Starting values for the linear predictor.
- mustart
Starting values for the mean.
- control
A list of GLM control parameters passed to
stats::glm.control().- quiet
Logical. If
TRUE, suppresses convergence warnings emitted bysurvey_glm()and its internal replicate-weight refitting loop. Convergence status is always stored infit@convergedregardless of this setting, so non-convergence can still be detected programmatically. DefaultFALSE.
Value
A survey_glm_fit S7 object.
Details
Variance estimation: Uses the Binder (1983) sandwich estimator, which
decomposes into per-observation score vectors passed to the Phase 0
variance machinery. The bread (X'WX)^(-1) accounts for IRLS working
weights and is correct for all GLM families including binomial and
Poisson.
binomial() family: Wraps the stats::glm() call in
suppressWarnings() to suppress the "non-integer #successes" warning
that fires for every survey-weighted binomial model.
Domain estimation: Use surveytidy::filter() before calling
survey_glm(). The GLM is fit on in-domain rows only; variance
estimation uses the full design for correct design-based SEs.
Multinomial response: cbind() on the LHS of formula is not
supported. Multinomial logistic regression is deferred to a later phase.
Formula to model matrix: survey_glm() passes the formula to
stats::model.matrix() via stats::glm(). Factor and character predictors
are dummy-coded using model.matrix() default contrasts (treatment coding:
first level as reference). Numeric predictors enter as-is. Interaction
terms (:, *) and inline transformations (log(), I()) are supported
as in any standard R formula. The resulting model matrix is n x p where
p is the number of coefficients including the intercept.
Predictor variable types: Predictors may be numeric, integer, logical,
factor, or character. Character predictors are coerced to factor by
stats::model.matrix(). Ordered factors use polynomial contrasts by
default. All other R types (list columns, complex, raw) will produce an
error from stats::model.matrix().
Input assumptions: surveycore assumes (1) each row of design@data
represents one sampled unit; (2) survey weights are positive and finite
for all rows (validated at construction time); (3) the model formula
variables are columns of design@data; (4) the design is correctly
specified before calling survey_glm(). No centering, scaling, or
other pre-processing is applied to predictor variables beyond what the
formula specifies.
Data transformations: No automatic transformation is applied to
predictor or response variables. Factor encoding is handled by
stats::model.matrix() using the active contrasts. Link function
transformations (e.g. log link in poisson()) are applied by the
family object, not by surveycore. To apply custom transformations, use
I() or log() etc. inside the formula.
Row and column names: The coefficient vector returned in
fit@coefficients carries the names produced by stats::model.matrix()
(e.g. "(Intercept)", "sexFemale", "age"). fit@vcov carries the
same names on rows and columns. model.frame.survey_glm_fit() returns the
model frame with row names matching the rows used in fitting (i.e. the
row names of design@data after applying na.action). Rows excluded by
na.action = na.omit do not appear in the model frame.
Missing values: na.action controls handling of NA in model frame
variables (predictors and response). na.omit (default) silently drops
rows with any NA; the variance estimator uses the full design for
correct sandwich SEs. na.fail stops with an informative error listing
all variables containing NA and the row count for each. Survey weights
are validated separately at construction time and must not contain NA.
Performance: Runtime scales as O(n · p²) for the score matrix
computation and O(p³) for the bread matrix (solve). For Taylor designs,
variance estimation adds O(n · H · p²) where H is the number of
strata. For replicate designs it adds O(R · n · p) where R is the
number of replicates. The dominant cost for large n is typically the
stats::glm() IRLS fit (O(n · p² · I) per IRLS iteration).
References
Binder, D.A. (1983) On the variances of asymptotically normal estimators from complex surveys. International Statistical Review 51(3), 279–292.
Binder, D.A. (1991) Use of estimating functions for interval estimation from complex surveys. Proceedings of the American Statistical Association, Section on Survey Research Methods, 34–42.
Lumley, T. and Scott, A. (2014) Tests in surveys with complex sampling. Journal of the Royal Statistical Society: Series B 76(2), 431–452.
Examples
d <- as_survey(gss_2024, ids = vpsu, weights = wtssps, strata = vstrat,
nest = TRUE)
# Linear model: respondent age predicted by education and sex
fit <- survey_glm(d, age ~ educ + sex)
fit@coefficients
#> (Intercept) educ sex
#> 41.7912598 0.4033772 0.3367356
fit@vcov
#> (Intercept) educ sex
#> (Intercept) 7.5695740 -0.363231367 -1.401719626
#> educ -0.3632314 0.025187314 0.002071918
#> sex -1.4017196 0.002071918 0.895403410
# Programmatic interface — suitable for lapply()
results <- lapply(c("age", "educ"), function(v) {
survey_glm(d, response = v, predictors = "sex")
})
