# Distributions Guide

PAL provides a comprehensive set of statistical distributions for actuarial modelling. This tutorial covers how to choose, parameterise and use them.

## Setup

```python
import numpy as np
from pal import config, distributions, set_random_seed

config.n_sims = 10_000
set_random_seed(42)
```

## Generating Samples

Every distribution has a `generate()` method that returns a `StochasticScalar`, a vector of simulated values:

```python
loss = distributions.LogNormal(mu=10, sigma=1.5).generate()
loss.mean()                        # => 68,673
loss.std()                         # => 205,459
np.percentile(loss.values, 99.5)   # => 1,111,353
```

The number of samples is controlled by `config.n_sims` (default 100,000; the Setup block above lowers it to 10,000).

## Analytical Functions

Distributions also provide `cdf()` and `invcdf()`, which do not require generating samples:

```python
ln = distributions.LogNormal(mu=10, sigma=1.5)
ln.cdf(50_000)     # => 0.7076 (P(X ≤ 50,000))
ln.invcdf(0.5)     # => 22,026 (median)
ln.invcdf(0.995)   # => 1,049,416 (99.5th percentile)
```

These are useful for quick calculations, curve-fitting checks and validating simulation results.
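
The validation use case can be sketched without PAL. The following is a minimal, standalone illustration using only NumPy and the standard library. It assumes the usual lognormal parameterisation (`mu` and `sigma` on the log scale), and the helper `lognormal_cdf` is hypothetical, not part of PAL. It compares the closed-form CDF and median with their simulated counterparts:

```python
import math

import numpy as np

mu, sigma = 10.0, 1.5


def lognormal_cdf(x: float, mu: float, sigma: float) -> float:
    """Closed-form P(X <= x) for a lognormal (hypothetical helper, not PAL API)."""
    z = (math.log(x) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


# Simulate with NumPy directly; PAL's own generator is not used here.
rng = np.random.default_rng(42)
samples = rng.lognormal(mean=mu, sigma=sigma, size=100_000)

# Analytical values versus their empirical estimates.
analytical_cdf = lognormal_cdf(50_000, mu, sigma)
empirical_cdf = float(np.mean(samples <= 50_000))

analytical_median = math.exp(mu)
empirical_median = float(np.median(samples))

print(f"cdf(50,000): analytical {analytical_cdf:.4f}, simulated {empirical_cdf:.4f}")
print(f"median:      analytical {analytical_median:,.0f}, simulated {empirical_median:,.0f}")
```

If the simulated and analytical figures disagree by more than sampling error would explain, the parameterisation is the first thing to check.
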
## Available Distributions

### Severity (Continuous) Distributions

| Distribution | Parameters | Typical Use |
|-------------|------------|-------------|
| `LogNormal` | `mu`, `sigma` | Attritional losses, claim sizes |
| `Gamma` | `alpha`, `theta`, `loc=0` | Aggregate losses, waiting times |
| `Pareto` | `shape`, `scale` | Large/catastrophe losses |
| `GPD` | `shape`, `scale`, `loc` | Excess losses above a threshold |
| `Burr` | `power`, `shape`, `scale`, `loc` | Heavy-tailed loss distributions |
| `Weibull` | `shape`, `scale`, `loc=0` | Time-to-failure, survival analysis |
| `Normal` | `mu`, `sigma` | Symmetric risks, economic variables |
| `Beta` | `alpha`, `beta`, `scale=1`, `loc=0` | Loss ratios, probabilities |
| `Exponential` | `scale`, `loc=0` | Inter-arrival times, simple decay |
| `LogLogistic` | `shape`, `scale`, `loc=0` | Income distributions, survival |
| `Logistic` | `mu`, `sigma` | Growth models |
| `Uniform` | `a`, `b` | Equal-likelihood scenarios |
| `InverseGamma` | `alpha`, `theta`, `loc=0` | Bayesian priors |
| `Paralogistic` | `shape`, `scale`, `loc=0` | Heavy-tailed alternatives |
| `InverseBurr` | `power`, `shape`, `scale`, `loc` | Flexible heavy tails |
| `InverseParalogistic` | `shape`, `scale`, `loc=0` | Heavy-tailed alternatives |
| `InverseWeibull` | `shape`, `scale`, `loc=0` | Extreme value modelling |
| `InverseExponential` | `scale`, `loc=0` | Extreme value modelling |

### Frequency (Discrete) Distributions

| Distribution | Parameters | Typical Use |
|-------------|------------|-------------|
| `Poisson` | `mean` | Claim counts (fixed exposure) |
| `NegBinomial` | `n`, `p` | Over-dispersed claim counts |
| `Binomial` | `n`, `p` | Events out of fixed trials |
| `HyperGeometric` | `ngood`, `nbad`, `population_size` | Sampling without replacement |

## Comparing Severity Distributions

The choice of severity distribution significantly affects tail behaviour.
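
One way to make "tail behaviour" concrete is the ratio of a high percentile to the mean. The sketch below computes that ratio in closed form using only the standard library. It assumes textbook parameterisations (a lognormal with `sigma` on the log scale, and a Pareto Type I whose minimum equals its scale), which may not match PAL's conventions exactly. The function names are hypothetical and not part of PAL.

```python
import math
from statistics import NormalDist

P = 0.995  # percentile level used for the comparison


def lognormal_tail_ratio(sigma: float, p: float = P) -> float:
    """p-quantile divided by the mean for a lognormal; mu cancels out."""
    z = NormalDist().inv_cdf(p)
    return math.exp(z * sigma - 0.5 * sigma**2)


def pareto_tail_ratio(shape: float, p: float = P) -> float:
    """p-quantile divided by the mean for a Pareto Type I; scale cancels out."""
    if shape <= 1.0:
        raise ValueError("the mean is infinite when shape <= 1")
    quantile_over_scale = (1.0 - p) ** (-1.0 / shape)
    mean_over_scale = shape / (shape - 1.0)
    return quantile_over_scale / mean_over_scale


def exponential_tail_ratio(p: float = P) -> float:
    """p-quantile divided by the mean for an exponential (light-tailed baseline)."""
    return -math.log(1.0 - p)


print(f"LogNormal(sigma=1.5): {lognormal_tail_ratio(1.5):.2f}")
print(f"Pareto(shape=2):      {pareto_tail_ratio(2.0):.2f}")
print(f"Exponential:          {exponential_tail_ratio():.2f}")
```

Under these assumptions the lognormal with `sigma=1.5` gives a ratio of roughly 15 and the Pareto with `shape=2` roughly 7, against about 5.3 for an exponential. Because location and scale parameters cancel, the ratio isolates the effect of the shape parameter on the tail.
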
Here are several distributions simulated with illustrative parameters. Their scales differ, so compare the ratio of the 99.5th percentile to the mean rather than the raw values:

```
Distribution                                Mean         Std      99.5th
------------------------------------------------------------------------
LogNormal(mu=10, sigma=1.5)               68,673     205,459   1,111,353
Gamma(alpha=5, theta=1000)                 5,020       2,247      12,551
Pareto(shape=2, scale=10000)              20,107      32,032     149,509
GPD(shape=0.5, scale=1000, loc=0)          2,021       6,406      27,902
Weibull(shape=1.5, scale=1000)               898         615       3,082
```

**Key observations:**

- **LogNormal** shows the most extreme simulated tail at these parameters: the 99.5th percentile is about 16× the mean. Suitable for large-loss classes where extreme events dominate.
- **Pareto** also has a heavy tail (99.5th is 7.4× the mean), and its values are bounded below by the scale parameter.
- **GPD** is the natural choice for modelling excesses above a threshold (peaks-over-threshold approach). Here its 99.5th percentile is about 13.8× the mean.
- **Gamma** is lighter-tailed (99.5th is only 2.5× the mean) and suited to aggregate losses or attritional classes.
- **Weibull** with shape above 1 is also light-tailed (99.5th is 3.4× the mean here), which makes it useful for modelling time-to-failure or operational risks.

## Comparing Frequency Distributions

```
Distribution                          Mean       Std       Max
--------------------------------------------------------------
Poisson(mean=5)                        5.0       2.3        15
Poisson(mean=50)                      50.0       7.0        75
NegBinomial(n=5, p=0.5)                5.0       3.2        22
Binomial(n=100, p=0.1)                10.0       3.0        24
```

- **Poisson**: variance equals the mean. Standard choice when claims arrive independently at a constant rate.
- **Negative Binomial**: variance exceeds the mean (over-dispersed). Use when there is parameter uncertainty or heterogeneity in the claim arrival rate.
- **Binomial**: bounded count (0 to n). Use when there is a fixed number of exposures and each can generate at most one claim.

## Choosing a Severity Distribution

A practical decision tree:

1. **Do you have data above a threshold?** → `GPD` (peaks over threshold)
2. **Is the tail very heavy (power-law)?** → `Pareto` or `Burr`
3. **Is the distribution right-skewed with a moderate tail?** → `LogNormal` or `Gamma`
4. **Is it symmetric?** → `Normal` or `Logistic`
5. **Is it bounded between 0 and 1?** → `Beta`
6. **Modelling time or duration?** → `Weibull` or `Exponential`

## Stochastic Parameters

Distribution parameters can themselves be stochastic. Pass a `StochasticScalar` as a parameter to create a **mixed distribution**:

```python
set_random_seed(42)

# Uncertain claim rate: mean is itself random
uncertain_rate = distributions.Gamma(alpha=25, theta=2).generate()
claims = distributions.Poisson(mean=uncertain_rate).generate()
```

This produces over-dispersed counts because the Poisson mean varies across simulations, adding an extra layer of variability. In the Poisson-Gamma case the mixture is equivalent to a Negative Binomial. With `alpha=25` and `theta=2` the rate has mean 50 and variance 100, so the counts have mean 50 and variance 150, compared with a variance of 50 for a plain Poisson with the same mean.

## Working with Generated Variables

`StochasticScalar` objects support standard arithmetic and numpy operations:

```python
loss = distributions.LogNormal(mu=14, sigma=0.5).generate()

# Arithmetic
with_expenses = loss * 1.10
capped = np.minimum(loss, 5_000_000)

# Statistics
loss.mean()
loss.std()
np.percentile(loss.values, [25, 50, 75, 95, 99, 99.5])

# Visualisation
loss.show_cdf("Loss Distribution")
```

## See Also

- [Getting Started](getting_started.md): first steps with PAL
- [Frequency-Severity Modelling](frequency_severity_modelling.md): combining frequency and severity distributions
- [Coupling Groups, Copulas and Variable Reordering](coupling_groups_and_copulas.md): adding dependencies between variables