Survey design & calibration
SurveyDesign (dataclass)
SurveyDesign(weights: str, strata: str | None = None, cluster: str | tuple[str, ...] | None = None, fpc: str | None = None, replicate_weights: tuple[str, ...] | None = None, replicate_type: str = 'jk1')
Column-name bundle describing a survey-design structure.
cluster accepts either a single column name (single-stage
cluster sampling) or a tuple of names (multi-stage). For multi-stage
designs, PySofra currently uses only the outermost PSU for variance
estimation, and a footnote names the second-stage column as
"nested within". Full multi-stage Taylor linearisation is planned.
replicate_weights and replicate_type enable the jackknife
family of variance estimators: each replicate column carries
weights with one PSU dropped, and the variance is computed as
replicate_scale * Σ (θ̂_r − θ̂)², where θ̂ is the full-sample
estimate and θ̂_r is the estimate from replicate r. The default
replicate_type, "jk1", sets replicate_scale to (n − 1)/n
automatically, with n the number of replicates.
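The JK1 formula above can be checked by hand with a minimal NumPy sketch. It does not call PySofra; the helper name `jk1_variance` and the toy data are assumptions made for illustration, and each PSU is assumed to hold a single observation.

```python
import numpy as np


def jk1_variance(theta_full: float, theta_reps) -> float:
    """JK1 variance: (n - 1)/n * sum((theta_r - theta_full)**2).

    Hypothetical helper for illustration, not part of PySofra.
    """
    theta_reps = np.asarray(theta_reps, dtype=float)
    n = theta_reps.size                 # number of replicates
    scale = (n - 1) / n                 # the "jk1" replicate scale
    return float(scale * np.sum((theta_reps - theta_full) ** 2))


# Toy data: 4 PSUs with one observation each and equal base weights.
y = np.array([2.0, 4.0, 6.0, 8.0])
w = np.ones(4)
n = len(y)

# Replicate r drops PSU r (weight 0) and rescales the others by n/(n - 1).
rep_w = np.tile(w, (n, 1)) * (n / (n - 1))
rep_w[np.arange(n), np.arange(n)] = 0.0

theta_full = np.average(y, weights=w)
theta_reps = np.array([np.average(y, weights=rw) for rw in rep_w])
var_jk1 = jk1_variance(theta_full, theta_reps)
```

For the weighted mean of an equal-weight sample like this one, the JK1 result coincides with the textbook variance of the mean, `np.var(y, ddof=1) / n`, which makes a convenient sanity check.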
post_stratify
post_stratify(data: DataFrame, base_weights: Series | str, *, strata_cols: list[str] | tuple[str, ...], targets: Mapping[tuple[object, ...], float] | Series) -> pd.Series
Post-stratification calibration over a complete cross-classification.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | DataFrame | Source dataframe. | required |
| base_weights | Series \| str | Either the column name of design weights in | required |
| strata_cols | list[str] \| tuple[str, ...] | One or more columns whose Cartesian product defines the post-strata. | required |
| targets | Mapping[tuple[object, ...], float] \| Series | Population totals for each stratum. Accepts either: | required |
Returns:
| Type | Description |
|---|---|
| Series | Calibrated weights, aligned to |
Raises:
| Type | Description |
|---|---|
| KeyError | When a stratum present in the data is missing from |
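To make the calibration step concrete, here is a minimal pandas sketch of post-stratification: each row's weight is multiplied by the population total of its post-stratum divided by the current weighted total of that post-stratum. It is not PySofra's implementation. The function name `post_stratify_sketch`, the temporary column `_w`, and the toy data are assumptions, and only the mapping form of the targets is handled.

```python
import pandas as pd


def post_stratify_sketch(data, base_weights, strata_cols, targets):
    """Scale weights so each post-stratum sums to its target total.

    Hypothetical stand-in for illustration; handles mapping targets only.
    """
    w = data[base_weights] if isinstance(base_weights, str) else base_weights
    w = w.astype(float)
    cols = list(strata_cols)

    # One key tuple per row, e.g. ("f", "young").
    row_keys = list(zip(*(data[c] for c in cols)))
    missing = set(row_keys) - set(targets)
    if missing:
        raise KeyError(f"strata missing from targets: {list(missing)}")

    # Current weighted total of each row's post-stratum.
    cell_total = data.assign(_w=w).groupby(cols)["_w"].transform("sum")
    target = pd.Series([targets[k] for k in row_keys],
                       index=data.index, dtype=float)
    return w * target / cell_total


df = pd.DataFrame({
    "sex": ["f", "f", "m", "m"],
    "age": ["young", "old", "young", "young"],
    "w": [1.0, 1.0, 1.0, 3.0],
})
pop_totals = {("f", "young"): 10, ("f", "old"): 20, ("m", "young"): 40}
calibrated = post_stratify_sketch(df, "w", ["sex", "age"], pop_totals)
```

After calibration the two rows in the ("m", "young") cell keep their 1:3 ratio but now sum to 40, and the weights overall sum to the population total of 70.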
rake
rake(data: DataFrame, base_weights: Series | str, *, margins: Mapping[str, Mapping[object, float]] | None = None, targets: Mapping[str, Mapping[object, float]] | None = None, max_iter: int = 50, tol: float = 1e-06) -> pd.Series
Raking (iterative proportional fitting) over marginal targets.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | DataFrame | Source dataframe. | required |
| base_weights | Series \| str | Either the column name in | required |
| margins | Mapping[str, Mapping[object, float]] \| None | Mapping of variable → {level: target_total}. Each variable's targets are summed during one iteration; the algorithm cycles through the variables until the weights stabilise. | None |
| targets | Mapping[str, Mapping[object, float]] \| None | Alias for | None |
| max_iter | int | Maximum number of full sweeps over | 50 |
| tol | float | Convergence threshold on the largest relative change in any weight between iterations. | 1e-06 |
Returns:
| Type | Description |
|---|---|
| Series | Calibrated weights aligned to |
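The iteration described for margins, max_iter and tol can be sketched in a few lines of pandas. This is a generic iterative proportional fitting loop written under stated assumptions, not PySofra's code: the name `rake_sketch` and the toy data are hypothetical, weights are assumed strictly positive, and every level in the data is assumed to have a target.

```python
import pandas as pd


def rake_sketch(data, base_weights, margins, max_iter=50, tol=1e-6):
    """Iterative proportional fitting over marginal totals.

    Hypothetical stand-in for illustration only.
    """
    w = data[base_weights] if isinstance(base_weights, str) else base_weights
    w = w.astype(float).copy()
    for _ in range(max_iter):
        w_prev = w.copy()
        # One full sweep: adjust to each variable's margin in turn.
        for var, level_targets in margins.items():
            current = w.groupby(data[var]).sum()      # weighted total per level
            factor = pd.Series(level_targets, dtype=float) / current
            w = w * data[var].map(factor)
        # Stop once the largest relative change in any weight is below tol.
        if ((w - w_prev).abs() / w_prev).max() < tol:
            break
    return w


df = pd.DataFrame({
    "sex": ["f", "f", "m", "m"],
    "region": ["n", "s", "n", "s"],
    "w": [1.0, 1.0, 1.0, 1.0],
})
margins = {"sex": {"f": 60, "m": 40}, "region": {"n": 70, "s": 30}}
raked = rake_sketch(df, "w", margins)
sex_totals = raked.groupby(df["sex"]).sum()
region_totals = raked.groupby(df["region"]).sum()
```

Unlike post-stratification, raking needs only the marginal totals, so it remains usable when the full cross-classification of population counts is unavailable.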
design_effect
Kish's design-effect estimate: DEFF ≈ n · Σw² / (Σw)².
A quick QC check after calibration: a large DEFF (≫ 1) means the weights are highly variable and the effective sample size is low.
Negative weights are not meaningful in a design context (they would
flip the contribution of a row), so they are excluded from the
computation. If any are present, a UserWarning reports how many
rows were dropped, matching the behaviour of tbl_one(...,
weights=...). Returns nan when no positive weights remain.
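The Kish formula and the negative-weight rule can be illustrated with a short NumPy sketch. The function name `kish_deff_sketch` and the warning text are assumptions rather than PySofra's actual code, and the sketch keeps zero weights in the count n, which is also an assumption.

```python
import warnings

import numpy as np


def kish_deff_sketch(weights) -> float:
    """Kish design effect: n * sum(w**2) / sum(w)**2.

    Hypothetical stand-in for illustration only.
    """
    w = np.asarray(weights, dtype=float)
    negative = w < 0
    if negative.any():
        # Report how many rows are excluded, then drop them.
        warnings.warn(
            f"{int(negative.sum())} row(s) with negative weights dropped",
            UserWarning,
        )
        w = w[~negative]
    if not (w > 0).any():
        return float("nan")          # no positive weights remain
    n = w.size                       # assumption: zero weights still count
    return float(n * np.sum(w ** 2) / np.sum(w) ** 2)


deff_equal = kish_deff_sketch([2.0, 2.0, 2.0, 2.0])    # equal weights give 1.0
deff_skewed = kish_deff_sketch([1.0, 1.0, 1.0, 5.0])   # 4 * 28 / 64 = 1.75
n_effective = 4 / deff_skewed                          # Kish effective sample size
```

Equal weights give a DEFF of exactly 1; the skewed example shows how a single large weight shrinks the effective sample size from 4 to roughly 2.3.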