Survey design & calibration

SurveyDesign dataclass

SurveyDesign(weights: str, strata: str | None = None, cluster: str | tuple[str, ...] | None = None, fpc: str | None = None, replicate_weights: tuple[str, ...] | None = None, replicate_type: str = 'jk1')

Column-name bundle describing a survey-design structure.

cluster accepts either a single column name (single-stage cluster sampling) or a tuple of names (multi-stage). For multi-stage designs, PySofra currently uses the outermost PSU for variance estimation and adds a footnote naming the second-stage column as "nested within"; full multi-stage Taylor linearisation is planned.

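replicate_weights and replicate_type enable the jackknife family of variance estimators: each replicate column carries the weights with one PSU dropped, and the variance is computed as scale · Σ_r (θ̂_r − θ̂)², where θ̂ is the full-sample estimate, θ̂_r is the estimate under replicate r, and the scale factor is determined by replicate_type. The "jk1" default sets the scale to (n − 1)/n automatically, where n is the number of replicates.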
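The jackknife formula above can be checked by hand on a toy example. The following is a minimal NumPy sketch, not PySofra's implementation: jk1_variance is a hypothetical helper, n is assumed to be the number of replicates (one per dropped PSU), and the replicate weights are built the usual JK1 way (zero for the dropped PSU, the rest scaled by n / (n − 1)).

```python
import numpy as np

def jk1_variance(theta_full, theta_reps):
    """Hypothetical helper: (n - 1) / n * sum((theta_r - theta_full) ** 2)."""
    theta_reps = np.asarray(theta_reps, dtype=float)
    n = theta_reps.size          # assumed: one replicate per dropped PSU
    scale = (n - 1) / n          # the JK1 scale factor
    return scale * np.sum((theta_reps - theta_full) ** 2)

# Toy data: four PSUs with one observation each and equal base weights.
y = np.array([2.0, 4.0, 6.0, 8.0])
w = np.ones(4)
theta_full = np.average(y, weights=w)   # full-sample weighted mean

# Replicate r sets the weight of PSU r to zero and rescales the others.
n = len(y)
rep_weights = (np.ones((n, n)) - np.eye(n)) * n / (n - 1)
theta_reps = np.array([np.average(y, weights=rw) for rw in rep_weights])

var_jk1 = jk1_variance(theta_full, theta_reps)
```

For a simple mean with equal weights, this reproduces the textbook variance of the mean, s²/n, which makes a convenient sanity check.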

post_stratify

post_stratify(data: DataFrame, base_weights: Series | str, *, strata_cols: list[str] | tuple[str, ...], targets: Mapping[tuple[object, ...], float] | Series) -> pd.Series

Post-stratification calibration over a complete cross-classification.

Parameters:

Name Type Description Default
data DataFrame

Source dataframe.

required
base_weights Series | str

Either the column name of design weights in data or a Series aligned to data.index.

required
strata_cols list[str] | tuple[str, ...]

One or more columns whose Cartesian product defines the post-strata.

required
targets Mapping[tuple[object, ...], float] | Series

Population totals for each stratum. Accepts either:

  • a dict-like keyed by tuples whose length equals len(strata_cols) (e.g. {('M', '<50'): 1200, ...}), or
  • a pandas.Series indexed by those tuples (that is, a Series with a MultiIndex).
required

Returns:

Type Description
Series

Calibrated weights, aligned to data.index.

Raises:

Type Description
KeyError

When a stratum present in the data is missing from targets.
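As an illustration of what post-stratification does, here is a minimal pandas sketch under stated assumptions. post_stratify_sketch is a hypothetical stand-in written for this example, not PySofra's code: it multiplies each base weight by target / current weighted total of its post-stratum, and raises KeyError when a stratum in the data has no target.

```python
import pandas as pd

def post_stratify_sketch(data, base_weights, strata_cols, targets):
    """Hypothetical stand-in: scale weights so each post-stratum sums to its target."""
    cols = list(strata_cols)
    w = data[base_weights] if isinstance(base_weights, str) else base_weights
    target_map = targets.to_dict() if isinstance(targets, pd.Series) else dict(targets)

    # One tuple key per row, e.g. ("M", "<50").
    keys = list(zip(*(data[c] for c in cols)))
    missing = set(keys) - set(target_map)
    if missing:
        raise KeyError(f"strata missing from targets: {sorted(missing, key=str)}")

    # Current weighted total of the stratum each row belongs to.
    current = data[cols].assign(_w=w).groupby(cols)["_w"].transform("sum")
    target = pd.Series([target_map[k] for k in keys], index=data.index, dtype=float)
    return w * target / current

df = pd.DataFrame({
    "sex": ["M", "M", "F", "F", "F"],
    "age": ["<50", "50+", "<50", "<50", "50+"],
    "w":   [1.0, 1.0, 1.0, 3.0, 2.0],
})
targets = {("M", "<50"): 10.0, ("M", "50+"): 20.0,
           ("F", "<50"): 40.0, ("F", "50+"): 30.0}
calibrated = post_stratify_sketch(df, "w", ["sex", "age"], targets)
```

After calibration, the weights within each sex × age cell sum to that cell's target, while the relative sizes of the weights inside a cell are unchanged.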

rake

rake(data: DataFrame, base_weights: Series | str, *, margins: Mapping[str, Mapping[object, float]] | None = None, targets: Mapping[str, Mapping[object, float]] | None = None, max_iter: int = 50, tol: float = 1e-06) -> pd.Series

Raking (iterative proportional fitting) over marginal targets.

Parameters:

Name Type Description Default
data DataFrame

Source dataframe.

required
base_weights Series | str

Either the column name in data or an aligned Series.

required
margins Mapping[str, Mapping[object, float]] | None

Mapping of variable → {level: target_total}. Within one iteration the weights are rescaled to match each variable's targets in turn; the algorithm cycles through the variables until the weights stabilise.

None
targets Mapping[str, Mapping[object, float]] | None

Alias for margins — accepted to match the naming convention used by R's raking functions. Exactly one of margins or targets must be supplied.

None
max_iter int

Maximum number of full sweeps over margins.

50
tol float

Convergence threshold on the largest relative change in any weight between iterations.

1e-06

Returns:

Type Description
Series

Calibrated weights aligned to data.index.
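The loop structure described by margins, max_iter and tol can be sketched as follows. This is a minimal illustration of iterative proportional fitting in pandas, not PySofra's implementation; rake_sketch is a hypothetical name, and the sketch assumes strictly positive base weights and that every level in the data has a target.

```python
import pandas as pd

def rake_sketch(data, base_weights, margins, max_iter=50, tol=1e-6):
    """Hypothetical stand-in for an IPF loop over marginal targets."""
    w = data[base_weights] if isinstance(base_weights, str) else base_weights
    w = w.astype(float)
    for _ in range(max_iter):
        previous = w.copy()
        for var, level_targets in margins.items():
            # Rescale so the weighted total of each level of `var` hits its target.
            current = w.groupby(data[var]).transform("sum")
            target = data[var].map(level_targets).astype(float)
            w = w * target / current
        # Stop once the largest relative change in any weight falls below tol.
        if ((w - previous).abs() / previous).max() < tol:
            break
    return w

df = pd.DataFrame({
    "sex":    ["M", "M", "F", "F"],
    "region": ["N", "S", "N", "S"],
    "w":      [1.0, 2.0, 3.0, 4.0],
})
margins = {
    "sex":    {"M": 60.0, "F": 40.0},
    "region": {"N": 50.0, "S": 50.0},
}
raked = rake_sketch(df, "w", margins)
sex_totals = raked.groupby(df["sex"]).sum()
region_totals = raked.groupby(df["region"]).sum()
```

Unlike post-stratification, only the marginal totals are matched; the joint sex × region cells are not constrained directly.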

design_effect

design_effect(weights: Series) -> float

Kish's design-effect estimate: DEFF ≈ n · Σw² / (Σw)².

A quick QC step after calibration: a large DEFF (≫ 1) means the weights are highly variable and the effective sample size is low.
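The link between weight variability and effective sample size can be made explicit. With CV(w) the coefficient of variation of the weights (population standard deviation divided by the mean), Kish's formula rearranges to:

```latex
\mathrm{DEFF} \approx \frac{n \sum_i w_i^2}{\left(\sum_i w_i\right)^2} = 1 + \mathrm{CV}^2(w),
\qquad
n_{\mathrm{eff}} = \frac{n}{\mathrm{DEFF}} = \frac{\left(\sum_i w_i\right)^2}{\sum_i w_i^2}
```

For example, the weights (1, 1, 1, 5) give DEFF = 4 · 28 / 64 = 1.75, so four respondents carry roughly the information of 4 / 1.75 ≈ 2.3 equally weighted ones.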

Negative weights are not meaningful in a design context (they would flip the contribution of a row), so they are excluded from the computation. If any are present, a UserWarning reports how many rows were dropped, matching the behaviour of tbl_one(..., weights=...). Returns nan when no positive weights remain.
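The behaviour described above can be approximated with a few lines of NumPy. This is a minimal sketch rather than PySofra's code: kish_deff_sketch is a hypothetical name, and it assumes that zero weights are dropped together with negative ones (the text only specifies negative weights).

```python
import warnings
import numpy as np

def kish_deff_sketch(weights):
    """Hypothetical stand-in: n * sum(w^2) / sum(w)^2 over positive weights."""
    w = np.asarray(weights, dtype=float)
    positive = w[w > 0]                  # assumption: zeros are dropped as well
    dropped = w.size - positive.size
    if dropped:
        warnings.warn(f"{dropped} non-positive weight(s) excluded", UserWarning)
    if positive.size == 0:
        return float("nan")              # nothing left to compute on
    n = positive.size
    return float(n * np.sum(positive ** 2) / np.sum(positive) ** 2)

print(kish_deff_sketch([2, 2, 2, 2]))    # 1.0  (equal weights)
print(kish_deff_sketch([1, 1, 1, 5]))    # 1.75 (one dominant weight)
```

Equal weights give exactly 1, and the value grows as a few rows come to dominate the weighted total.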