Survey design & calibration

SurveyDesign `dataclass`

SurveyDesign(weights: str, strata: str | None = None, cluster: str | tuple[str, ...] | None = None, fpc: str | None = None, replicate_weights: tuple[str, ...] | None = None, replicate_type: str = 'jk1')

Column-name bundle describing a survey-design structure.

`cluster` accepts either a single column name (single-stage cluster sampling) or a tuple of names (multi-stage). For multi-stage designs, PySofra currently uses only the outermost PSU for variance estimation and adds a footnote naming the second-stage column as "nested within"; full multi-stage Taylor linearisation is planned.

`replicate_weights` and `replicate_type` enable the jackknife family of variance estimators: each replicate column carries the weights with one PSU dropped, and the variance is computed as replicate_scale * Σ_r (θ̂_r − θ̂)², where θ̂ is the full-sample estimate and θ̂_r is the estimate from replicate r. With the default `replicate_type="jk1"`, replicate_scale is set to (n − 1)/n automatically.
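The jackknife formula can be illustrated with a small NumPy sketch. This is not PySofra's code: `jk1_variance` is a hypothetical helper, and it assumes that n is the number of replicates (one per dropped PSU) and that the estimate of interest is a simple mean.

```python
import numpy as np

def jk1_variance(theta_full, theta_reps):
    """Hypothetical helper: scale * sum((theta_r - theta_full)^2) with the JK1 scale."""
    theta_reps = np.asarray(theta_reps, dtype=float)
    n = theta_reps.size          # assumption: one replicate per dropped PSU
    scale = (n - 1) / n          # the JK1 scale factor
    return float(scale * np.sum((theta_reps - theta_full) ** 2))

# Toy data: four PSUs, each contributing one observation.
y = np.array([2.0, 4.0, 6.0, 8.0])
theta_full = y.mean()

# Delete-one replicate estimates: recompute the mean with each PSU dropped.
theta_reps = np.array([np.delete(y, i).mean() for i in range(y.size)])

var_jk = jk1_variance(theta_full, theta_reps)
```

For a simple mean, this delete-one jackknife reproduces the textbook variance of the mean, s²/n, which makes it a convenient sanity check.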

post_stratify

post_stratify(data: DataFrame, base_weights: Series | str, *, strata_cols: list[str] | tuple[str, ...], targets: Mapping[tuple[object, ...], float] | Series) -> pd.Series

Post-stratification calibration over a complete cross-classification.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	Source dataframe.	required
`base_weights`	`Series \| str`	Either the column name of design weights in `data` or a Series aligned to `data.index`.	required
`strata_cols`	`list[str] \| tuple[str, ...]`	One or more columns whose Cartesian product defines the post-strata.	required
`targets`	`Mapping[tuple[object, ...], float] \| Series`	Population totals for each stratum. Accepts either a `dict`-like keyed by tuples whose length equals `len(strata_cols)` (e.g. `{('M', '<50'): 1200, ...}`) or a `pandas.Series` indexed by those tuples (that is, a Series with a `MultiIndex`).	required

Returns:

Type	Description
`Series`	Calibrated weights, aligned to `data.index`.

Raises:

Type	Description
`KeyError`	When a stratum present in the data is missing from `targets`.
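The adjustment itself can be sketched in plain pandas. This is an illustration of the standard post-stratification step rather than PySofra's implementation: `post_stratify_sketch` is a hypothetical stand-in that assumes each weight is multiplied by its stratum's target total divided by the sum of base weights in that stratum.

```python
import pandas as pd

def post_stratify_sketch(data, base_weights, strata_cols, targets):
    """Hypothetical stand-in: scale weights so each stratum sums to its target."""
    cols = list(strata_cols)
    w = data[base_weights] if isinstance(base_weights, str) else base_weights
    w = w.astype(float)
    # Normalise targets to a plain {tuple: total} dict.
    target_map = targets.to_dict() if isinstance(targets, pd.Series) else dict(targets)
    # One tuple key per row, e.g. ('M', '<50').
    keys = list(zip(*(data[c] for c in cols)))
    missing = set(keys) - set(target_map)
    if missing:
        raise KeyError(f"strata missing from targets: {sorted(missing, key=repr)}")
    # Sum of base weights within each row's stratum, aligned to data.index.
    cell_sum = w.groupby([data[c] for c in cols]).transform("sum")
    target = pd.Series([target_map[k] for k in keys], index=data.index)
    return w * target / cell_sum

df = pd.DataFrame({
    "sex": ["M", "M", "F", "F", "F"],
    "age": ["<50", "<50", "<50", "50+", "50+"],
    "w": [1.0, 1.0, 2.0, 2.0, 2.0],
})
targets = {("M", "<50"): 1200, ("F", "<50"): 800, ("F", "50+"): 1000}
calibrated = post_stratify_sketch(df, "w", strata_cols=["sex", "age"], targets=targets)
```

After calibration, the weights in each stratum sum to that stratum's target, so the overall weighted total equals the sum of the targets.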

rake

rake(data: DataFrame, base_weights: Series | str, *, margins: Mapping[str, Mapping[object, float]] | None = None, targets: Mapping[str, Mapping[object, float]] | None = None, max_iter: int = 50, tol: float = 1e-06) -> pd.Series

Raking (iterative proportional fitting) over marginal targets.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	Source dataframe.	required
`base_weights`	`Series \| str`	Either the column name in `data` or an aligned Series.	required
`margins`	`Mapping[str, Mapping[object, float]] \| None`	Mapping of variable → {level: target_total}. Within one iteration the weights are rescaled to match each variable's targets in turn; the algorithm cycles through the variables until the weights stabilise.	`None`
`targets`	`Mapping[str, Mapping[object, float]] \| None`	Alias for `margins` — accepted to match the naming convention used by R's raking functions. Exactly one of `margins` or `targets` must be supplied.	`None`
`max_iter`	`int`	Maximum number of full sweeps over `margins`.	`50`
`tol`	`float`	Convergence threshold on the largest relative change in any weight between iterations.	`1e-06`

Returns:

Type	Description
`Series`	Calibrated weights aligned to `data.index`.
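The sweep-and-check loop described by `max_iter` and `tol` can be sketched as follows. This is a minimal illustration of iterative proportional fitting, not PySofra's implementation: `rake_sketch` is a hypothetical name, and the sketch assumes strictly positive base weights and margins whose totals agree with one another.

```python
import pandas as pd

def rake_sketch(data, base_weights, margins, max_iter=50, tol=1e-6):
    """Hypothetical stand-in: iterative proportional fitting over marginal targets."""
    w = data[base_weights] if isinstance(base_weights, str) else base_weights
    w = w.astype(float)
    for _ in range(max_iter):
        w_prev = w.copy()
        # One full sweep: match each variable's level totals in turn.
        for var, level_targets in margins.items():
            current = w.groupby(data[var]).transform("sum")  # weighted total of each row's level
            target = data[var].map(level_targets)            # target total of each row's level
            w = w * target / current
        # Largest relative change in any weight over the sweep.
        rel_change = ((w - w_prev).abs() / w_prev).max()
        if rel_change < tol:
            break
    return w

df = pd.DataFrame({
    "sex": ["M", "M", "F", "F"],
    "age": ["young", "old", "young", "old"],
    "w": [1.0, 2.0, 3.0, 4.0],
})
margins = {"sex": {"M": 60.0, "F": 40.0}, "age": {"young": 50.0, "old": 50.0}}
raked = rake_sketch(df, "w", margins)
by_sex = raked.groupby(df["sex"]).sum()
by_age = raked.groupby(df["age"]).sum()
```

Only the marginal totals are matched here; unlike post-stratification, the joint cells are not constrained directly.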

design_effect

design_effect(weights: Series) -> float

Kish's design-effect estimate: DEFF ≈ n · Σw² / (Σw)².

A quick QC check after calibration — large DEFF (≫ 1) means the weights are highly variable and effective sample size is low.

Negative weights are not meaningful in a design context (they would flip the contribution of a row), so they are excluded from the computation. If any are present, a `UserWarning` reports how many rows were dropped, matching the behaviour of `tbl_one(..., weights=...)`. Returns `nan` when no positive weights remain.
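The Kish formula and the handling of negative weights can be sketched in a few lines of NumPy. `kish_deff` is a hypothetical stand-in, not the library's function; in particular, the warning text and the choice to keep zero weights in n are assumptions.

```python
import math
import warnings
import numpy as np

def kish_deff(weights):
    """Hypothetical stand-in: DEFF ~ n * sum(w^2) / (sum(w))^2, negatives dropped."""
    w = np.asarray(weights, dtype=float)
    negative = w < 0
    if negative.any():
        # Assumption: the exact warning message used by the library is not known.
        warnings.warn(f"{int(negative.sum())} negative weight(s) dropped", UserWarning)
        w = w[~negative]
    if w.size == 0 or w.sum() <= 0:
        return math.nan          # no positive weights remain
    n = w.size
    return float(n * np.sum(w ** 2) / np.sum(w) ** 2)

deff_equal = kish_deff([2.0, 2.0, 2.0, 2.0])    # equal weights give DEFF = 1
deff_skewed = kish_deff([1.0, 1.0, 1.0, 3.0])   # 4 * 12 / 36 = 4/3
n_eff = 4 / deff_skewed                         # effective sample size n / DEFF
```

Dividing the sample size by DEFF gives the effective sample size, which is why a large DEFF signals a loss of precision.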
