# Coupling Groups, Copulas and Variable Reordering This guide explains how PAL manages dependencies between stochastic variables. It covers three interconnected concepts: 1. **Coupling Groups** — automatic tracking of related variables 2. **Copulas** — creating dependency structures between variables 3. **Variable Reordering** — how copulas work under the hood A runnable version of all examples below is available at `examples/example_couplings_and_copulas.py`. ## 1. Coupling Groups ### What Is a Coupling Group? Every `StochasticScalar` in PAL belongs to a **coupling group** — a set of variables that are linked by computation. When you create an independent variable, it gets its own coupling group. When you combine variables in arithmetic, their coupling groups merge. ```python from pal import config, distributions, set_random_seed config.n_sims = 10_000 set_random_seed(42) motor = distributions.LogNormal(mu=14, sigma=0.5).generate() prop = distributions.LogNormal(mu=15, sigma=0.8).generate() # Each starts in its own group motor.coupled_variable_group is prop.coupled_variable_group # => False ``` When variables are combined, they become coupled: ```python total = motor + prop # Now all three share the same coupling group motor.coupled_variable_group is prop.coupled_variable_group # => True len(motor.coupled_variable_group) # => 3 ``` Derived variables also join automatically: ```python motor_with_expenses = motor * 1.1 motor_with_expenses.coupled_variable_group is motor.coupled_variable_group # => True len(motor.coupled_variable_group) # => 4 ``` ### Why Coupling Groups Matter Coupling groups solve a critical problem: **when a copula reorders one variable's simulations, all related variables must be reordered in exactly the same way** to preserve their mathematical relationships. Consider this scenario: ```python set_random_seed(42) loss_a = distributions.LogNormal(mu=14, sigma=0.5).generate() loss_b = distributions.LogNormal(mu=15, sigma=0.8).generate() # Derive a variable from loss_a loss_a_expenses = loss_a * 1.15 # 15% expense loading ``` Before any copula, `loss_a` and `loss_a_expenses` have a perfect 1.15× relationship because simulation `i` of both variables corresponds to the same underlying scenario: ``` loss_a first 5: [1751476. 1113583. 2057538. 1557236. 622980.] loss_a_expenses first 5: [2014197. 1280620. 2366169. 1790821. 716427.] Ratio: [1.15 1.15 1.15 1.15 1.15] ``` Now apply a copula to correlate `loss_a` with `loss_b`: ```python from pal import copulas copulas.GaussianCopula( [[1.0, 0.7], [0.7, 1.0]] ).apply([loss_a, loss_b]) ``` After the copula reorders `loss_a`, the 1.15× ratio is **preserved**: ``` Ratio (still 1.15!): [1.15 1.15 1.15 1.15 1.15] ``` This works because PAL's coupling group system automatically reordered `loss_a_expenses` with the same permutation as `loss_a`. **Without coupling groups**, `loss_a_expenses` would retain its original ordering while `loss_a` was shuffled, breaking the 1.15× relationship and producing nonsensical results. ## 2. Copulas A copula defines a dependency structure (correlation pattern) between random variables. PAL provides several copula families, each producing different patterns of dependence, especially in the tails. ### Available Copula Types | Family | Class | Key Parameters | Tail Dependence | |--------|-------|----------------|-----------------| | **Elliptical** | | | | | Gaussian | `GaussianCopula` | Correlation matrix | None | | Student's T | `StudentsTCopula` | Correlation matrix, `dof` | Symmetric | | **Archimedean** | | | | | Gumbel | `GumbelCopula` | `theta`, `n` | Upper | | Clayton | `ClaytonCopula` | `theta`, `n` | Lower | | Frank | `FrankCopula` | `theta`, `n` | None | | Joe | `JoeCopula` | `theta`, `n` | Upper | | **Extreme value** | | | | | MM1 | `MM1Copula` | `delta_matrix`, `theta` | Upper | | Galambos | `GalambosCopula` | `theta`, `d` | Upper | | Hüsler-Reiss | `HuslerReissCopula` | `lambda_matrix` | Upper | | Extremal T | `ExtremalTCopula` | `correlation_matrix`, `nu` | Upper | | **Other** | | | | | Plackett | `PlackettCopula` | `delta` | None | **Elliptical copulas** (Gaussian, Student's T) accept a full correlation matrix and naturally support any number of dimensions. **Archimedean copulas** (Gumbel, Clayton, Frank, Joe) take a single `theta` parameter and the dimension `n`. **Extreme value copulas** (MM1, Galambos, Hüsler-Reiss, Extremal T) model upper tail dependence and are suited for modelling joint extreme events. They have varying parameterisations — see the API docs for details. **Plackett copula** is a bivariate copula controlled by a single parameter `delta` that can model both positive and negative dependence. ### Comparing Copula Types ```python from pal import config, copulas, distributions, set_random_seed import numpy as np config.n_sims = 10_000 ``` #### Independent (no copula) ```python set_random_seed(42) x = distributions.LogNormal(mu=10, sigma=1.0).generate() y = distributions.LogNormal(mu=10, sigma=1.0).generate() np.corrcoef(x.ranks.values, y.ranks.values)[0, 1] # => -0.0034 (approximately zero) ``` #### Gaussian Copula ```python set_random_seed(42) x = distributions.LogNormal(mu=10, sigma=1.0).generate() y = distributions.LogNormal(mu=10, sigma=1.0).generate() copulas.GaussianCopula([[1.0, 0.8], [0.8, 1.0]]).apply([x, y]) np.corrcoef(x.ranks.values, y.ranks.values)[0, 1] # => 0.7853 ``` #### Gumbel Copula (upper tail dependence) ```python set_random_seed(42) x = distributions.LogNormal(mu=10, sigma=1.0).generate() y = distributions.LogNormal(mu=10, sigma=1.0).generate() copulas.GumbelCopula(theta=3.0).apply([x, y]) np.corrcoef(x.ranks.values, y.ranks.values)[0, 1] # => 0.8465 ``` Variables jointly tend to be large together — important for modelling joint catastrophic losses. #### Clayton Copula (lower tail dependence) ```python set_random_seed(42) x = distributions.LogNormal(mu=10, sigma=1.0).generate() y = distributions.LogNormal(mu=10, sigma=1.0).generate() copulas.ClaytonCopula(theta=4.0).apply([x, y]) np.corrcoef(x.ranks.values, y.ranks.values)[0, 1] # => 0.8479 ``` Variables jointly tend to be small together — dependence is concentrated in the lower tail. #### Student's T Copula (symmetric tail dependence) ```python set_random_seed(42) x = distributions.LogNormal(mu=10, sigma=1.0).generate() y = distributions.LogNormal(mu=10, sigma=1.0).generate() copulas.StudentsTCopula( [[1.0, 0.7], [0.7, 1.0]], dof=3 ).apply([x, y]) np.corrcoef(x.ranks.values, y.ranks.values)[0, 1] # => 0.6658 ``` Like the Gaussian copula but with **symmetric tail dependence** — extreme values are more likely to co-occur. The lower the degrees of freedom, the stronger the tail dependence. ### Scatter Plots — Rank Space The best way to understand how copulas differ is to plot the variables in **rank space** (plotting the ranks rather than the raw values). This removes the effect of the marginal distributions and isolates the dependency structure. ![Copula dependency structures in rank space](copula_scatter_plots.png) - **Independent**: uniform scatter with no pattern - **Gaussian**: elliptical concentration along the diagonal - **Gumbel**: strong clustering in the upper-right corner (upper tail) - **Clayton**: strong clustering in the lower-left corner (lower tail) - **Student's T**: similar to Gaussian but with heavier concentration in both corners (both tails) The interactive version of this chart (with plotly) is available in `examples/example_couplings_and_copulas.ipynb`. ## 3. Variable Reordering ### How `apply()` Works When you call `copula.apply([var_x, var_y])`, PAL performs these steps: 1. **Generate copula samples**: The copula creates correlated uniform random variables with the desired dependency structure. 2. **Compute rank indices**: For each copula sample, compute the rank ordering (which simulation should be 1st, 2nd, 3rd, etc.). 3. **Reorder simulations**: Rearrange each variable's simulations so their ranks match the copula's rank structure. 4. **Preserve marginals**: The same set of values is retained — only their ordering changes. 5. **Merge coupling groups**: All input variables (and their coupled variables) are merged into a single coupling group. ### Seeing Reordering in Action To make the reordering concrete, here is a small example with only **8 simulations**. We can see exactly which values move where: ```python config.n_sims = 8 set_random_seed(42) x = distributions.Normal(0, 1).generate() y = distributions.Normal(0, 1).generate() ``` **Before** the copula, X and Y have unrelated orderings: ``` Sim: 0 1 2 3 4 5 6 7 X values: [ 0.752, -0.154, 1.074, 0.517, -1.315, 1.971, 0.710, 0.793] Y values: [-1.135, -0.125, -0.330, 1.452, 0.369, 0.926, -0.142, -0.748] X ranks: [ 4, 1, 6, 2, 0, 7, 3, 5] Y ranks: [ 0, 4, 2, 7, 5, 6, 3, 1] ``` Notice that X and Y ranks are unrelated — when X is high (rank 7 in sim 5), Y has rank 6. When X is low (rank 0 in sim 4), Y has rank 5. ```python copulas.GaussianCopula([[1, 0.9], [0.9, 1]]).apply([x, y]) ``` **After** the copula, the Y values have been **shuffled** so their ranks align with X's ranks: ``` Sim: 0 1 2 3 4 5 6 7 X values: [ 0.752, -0.154, 1.074, 0.517, -1.315, 1.971, 0.710, 0.793] Y values: [-0.125, -0.748, 0.926, -0.142, -1.135, 1.452, -0.330, 0.369] X ranks: [ 4, 1, 6, 2, 0, 7, 3, 5] Y ranks: [ 4, 1, 6, 3, 0, 7, 2, 5] ``` Key observations: - **X values are unchanged** — the same 8 numbers in the same positions - **Y values are the same set of numbers**, but reordered — e.g. the value `1.452` moved from sim 3 to sim 5 - **Ranks now nearly match**: when X has rank 7 (sim 5), Y also has rank 7. When X has rank 0 (sim 4), Y also has rank 0 - **Sorted values are identical** before and after: `np.allclose(sorted(x.values), sorted_x) # => True` - **Rank correlation jumped to 0.9762** (from ~0) This is the fundamental mechanism: copula application is a **permutation**, not a transformation. ### Marginal Preservation A critical property of copula application is that **marginal distributions are exactly preserved**. The same values exist before and after — only their ordering changes: ```python config.n_sims = 10_000 set_random_seed(42) var_x = distributions.Normal(0, 1).generate() var_y = distributions.Normal(0, 1).generate() sorted_x = np.sort(var_x.values) sorted_y = np.sort(var_y.values) copulas.GaussianCopula( [[1.0, 0.9], [0.9, 1.0]] ).apply([var_x, var_y]) # Marginals are unchanged np.allclose(np.sort(var_x.values), sorted_x) # => True np.allclose(np.sort(var_y.values), sorted_y) # => True # But now correlated np.corrcoef(var_x.ranks.values, var_y.ranks.values)[0, 1] # => 0.8903 ``` ### Reordering Across a Chain of Derived Variables Coupling groups enable **transitive reordering** across chains of derived variables: ```python set_random_seed(42) base_loss = distributions.LogNormal(mu=14, sigma=0.5).generate() gross_loss = base_loss * 1.0 expense_loaded = gross_loss * 1.10 tax = expense_loaded * 0.21 net_loss = expense_loaded - tax # All 5 variables share one coupling group len(base_loss.coupled_variable_group) # => 5 ``` When a copula is applied to `base_loss` and an independent variable, **every variable in the chain** is reordered together: ```python cat_loss = distributions.LogNormal(mu=16, sigma=1.2).generate() copulas.GumbelCopula(theta=1.5).apply( [base_loss, cat_loss] ) # All ratios are preserved gross_loss.values[:3] / base_loss.values[:3] # => [1. 1. 1.] expense_loaded.values[:3] / gross_loss.values[:3] # => [1.1 1.1 1.1] tax.values[:3] / expense_loaded.values[:3] # => [0.21 0.21 0.21] ``` ## 4. Multivariate Copulas Elliptical copulas naturally extend to any number of dimensions. Here is an example correlating 4 lines of business: ```python set_random_seed(42) lobs = { "Motor": distributions.LogNormal(mu=14, sigma=0.4).generate(), "Property": distributions.LogNormal(mu=15, sigma=0.6).generate(), "Liability": distributions.LogNormal(mu=13, sigma=0.5).generate(), "Marine": distributions.LogNormal(mu=12, sigma=0.7).generate(), } corr_matrix = [ [1.0, 0.6, 0.3, 0.2], [0.6, 1.0, 0.4, 0.3], [0.3, 0.4, 1.0, 0.5], [0.2, 0.3, 0.5, 1.0], ] copulas.GaussianCopula(corr_matrix).apply(list(lobs.values())) ``` The resulting rank correlations closely match the input: ``` Motor Property Liability Marine Motor 1.000 0.578 0.264 0.177 Property 0.578 1.000 0.361 0.261 Liability 0.264 0.361 1.000 0.479 Marine 0.177 0.261 0.479 1.000 ``` Compare with input: `[0.6, 0.3, 0.2], [0.4, 0.3], [0.5]` — the rank correlations are close but not exact due to finite sample size (10,000 simulations). To be completely precise, we should compare the rank correlation to the Spearman's :math:`\rho_S` of the Gaussian copula, which is related to the underlying correlation parameter :math:`r` by: .. math:: \rho_S = \frac{6}{\pi}\asin\left(\frac{r}{2}\right) ## 5. `generate()` vs `apply()` Copulas have two main methods: ### `generate()` — Create New Samples Returns correlated **uniform** `[0, 1]` samples. Use this when you want raw copula samples to feed into inverse CDF transforms manually: ```python samples = copulas.GaussianCopula( [[1.0, 0.8], [0.8, 1.0]] ).generate() samples[0].mean() # => ~0.5 (uniform) samples[1].mean() # => ~0.5 (uniform) ``` ### `apply()` — Reorder Existing Variables Reorders existing `StochasticScalar` variables to match the copula's dependency structure. This is the most common workflow: ```python v1 = distributions.Gamma(alpha=5, theta=1000).generate() v2 = distributions.Pareto(shape=2, scale=10000).generate() copulas.GaussianCopula( [[1.0, 0.8], [0.8, 1.0]] ).apply([v1, v2]) # Means unchanged - only the ordering changed # Rank correlation: 0.7867 ``` **Key difference**: `generate()` creates new data; `apply()` restructures existing data. In most actuarial models, you'll use `apply()` because you've already generated variables from their marginal distributions and want to impose a dependency structure. ## 6. Frequency-Severity Models with Copulas In actuarial modelling, losses are often represented as compound distributions using `FrequencySeverityModel`. After generating event-level simulations, you aggregate them to get simulation-level totals, then use copulas to impose dependencies between lines of business. ### Two Lines of Business ```python from pal.frequency_severity import FrequencySeverityModel config.n_sims = 10_000 set_random_seed(42) # Motor: 50 claims/year, LogNormal severity motor_model = FrequencySeverityModel( freq_dist=distributions.Poisson(mean=50), sev_dist=distributions.LogNormal(mu=10, sigma=1.5), ) motor_events = motor_model.generate() # Property: 20 claims/year, Pareto severity property_model = FrequencySeverityModel( freq_dist=distributions.Poisson(mean=20), sev_dist=distributions.Pareto(shape=2.5, scale=50_000), ) property_events = property_model.generate() # Aggregate to simulation-level totals motor_agg = motor_events.aggregate() property_agg = property_events.aggregate() ``` Before the copula, the two lines are independent: ``` Before copula: Motor aggregate mean: 3,373,050 Property aggregate mean: 1,664,453 Rank correlation: -0.0069 ``` Now apply a Gaussian copula with ρ=0.6 (both lines worsen in the same economic environment): ```python copulas.GaussianCopula( [[1, 0.6], [0.6, 1]] ).apply([motor_agg, property_agg]) ``` After the copula, losses are correlated but marginals are preserved: ``` After Gaussian copula (rho=0.6): Motor aggregate mean: 3,373,050 Property aggregate mean: 1,664,453 Rank correlation: 0.5780 Combined portfolio: Mean: 5,037,503 Std: 1,744,284 99.5th percentile: 12,192,519 ``` The means are identical before and after — only the joint behaviour changed. The 99.5th percentile of the combined portfolio reflects the positive dependence: bad years for motor tend to also be bad years for property. ### Coupling with Catastrophe Losses You can also use copulas to link frequency-severity aggregate losses with standalone catastrophe variables. Because `aggregate()` and `occurrence()` return `StochasticScalar` objects that share a coupling group with the underlying events, reordering the aggregate automatically reorders the max-event loss too: ```python set_random_seed(42) model = FrequencySeverityModel( freq_dist=distributions.Poisson(mean=30), sev_dist=distributions.LogNormal(mu=10, sigma=1.5), ) events = model.generate() agg_loss = events.aggregate() # total loss per sim max_loss = events.occurrence() # largest single event per sim cat_loss = distributions.LogNormal(mu=16, sigma=1.2).generate() copulas.GumbelCopula(theta=2.0).apply( [agg_loss, cat_loss] ) ``` The Gumbel copula introduces upper tail dependence — years with high attritional losses also tend to have high catastrophe losses: ``` After Gumbel copula (theta=2.0): Agg loss mean: 2,022,548 Max event mean: 637,503 Cat loss mean: 18,398,184 Rank corr (agg-cat): 0.6874 Rank corr (max-cat): 0.5818 ``` Note that the max-event loss is also correlated with the catastrophe loss (rank corr 0.58) even though the copula was only applied to the aggregate. This is because `aggregate()` and `occurrence()` share a coupling group with the underlying `FreqSevSims` events. ## Important Constraints 1. **Variables must be independent before `apply()`**: You cannot apply a copula to variables that are already coupled (i.e., in the same coupling group). PAL raises a `ValueError` if you try. 2. **Coupling is permanent**: Once variables are coupled (either through arithmetic or copula application), they remain in the same coupling group for the rest of the simulation. 3. **Order matters**: The order of variables in the list passed to `apply()` must match the order of dimensions in the copula's correlation matrix. ## References - Nelsen, R. B. (2006). *An Introduction to Copulas*, 2nd ed. Springer. — The standard textbook on copula theory; covers all major families, properties, and constructions. - McNeil, A. J., Frey, R. & Embrechts, P. (2015). *Quantitative Risk Management: Concepts, Techniques and Tools*, 2nd ed. Princeton University Press. — Chapters 7–8 give an accessible treatment of copulas in a risk-management context. - Iman, R. L. & Conover, W. J. (1982). "A distribution-free approach to inducing rank correlation among input variables." *Communications in Statistics — Simulation and Computation*, 11(3), 311–334. — The original rank-reordering algorithm used by PAL's `apply()` method. - Joe, H. (2014). *Dependence Modeling with Copulas*. CRC Press. — Comprehensive reference covering Archimedean, elliptical, and extreme-value copula families. - Embrechts, P., Lindskog, F. & McNeil, A. (2003). "Modelling dependence with copulas and applications to risk management." In *Handbook of Heavy Tailed Distributions in Finance*, Elsevier. — Practical guide to choosing and calibrating copulas for insurance and finance applications.