A Practical Introduction to Local Stochastic Volatility — for maths mortals like me
2026-06-04
Dedicated to Peter Austing and Fabrice Rouah,
who teach with intuition and clarity.
The trade-off that motivates everything
Two parent models sit at opposite corners. Local volatility (Dupire ) makes instantaneous volatility a deterministic function \sigma_{\mathrm{loc}}(t,S) of time and spot. By construction it reprices every vanilla on the surface; its defect is dynamic. Stochastic volatility (Heston, SABR ) gives volatility a life of its own—a mean-reverting random process correlated with spot—which produces realistic dynamics, but a handful of parameters cannot bend themselves into the shape of an entire market surface.
So the choice is between perfect fit with broken dynamics and good dynamics with imperfect fit. Neither is acceptable on an exotics desk, which must be arbitrage-free against hedgeable vanillas and price the future-smile-sensitive product in hand. The cleanest symptom of local vol’s broken dynamics is the forward smile, and that is what the rest of this note dissects.
Fixing the term “skew.”
Everything below is a statement about one number, so it is worth defining precisely. The implied volatility \sigma_{\mathrm{BS}}(K,T) is the single Black–Scholes volatility that reprices a European option of strike K and maturity T at its market value; at a fixed maturity the map K\mapsto\sigma_{\mathrm{BS}}(K,T) is the smile, and were Black–Scholes literally true it would be a horizontal line. The skew is the slope of that line. Writing log-moneyness against the forward F as k=\ln(K/F), the formal definition is the at-the-money slope
Step 1 — the skew lives in the slope of one curve
Local vol has exactly one mechanism for producing a skew: the dependence of \sigma_{\mathrm{loc}}(t,S) on the spot level. For an equity skew the calibrated surface is downward sloping—high instantaneous vol at low spot, low vol at high spot (Figure 2). The skew is not a feature bolted on top of the model; it is the shape of this curve. Whatever skew the model can ever generate, at any date, must come from spot sitting somewhere on it.
Step 2 — diffusion averages the level-dependence away
The forward smile is the smile seen from a future date T_1, looking forward to T_2. To form it one must condition on where spot is at T_1—and spot at T_1 is not a number but a distribution. Over [0,T_1] the paths diffuse out from S_0, fanning into a spread of arrival levels (Figure 3).
A skew is fundamentally about the correlation between spot moves and vol moves. In local vol that correlation is manufactured purely by spot sliding along the fixed \sigma_{\mathrm{loc}} curve. But by T_1 the paths are scattered across many levels, each sitting on a different part of the curve, so the sharp, deterministic level-dependence that produced a steep skew today is smeared into an average over the diffused distribution. The longer the horizon to T_1, the more diffusion, the more averaging, the weaker the residual skew.
Step 3 — the forward smile comes out flat
Today’s smile is read directly off the sloped curve with no diffusion in between, so it is steep. The forward smile is read off the same mechanism but only after spot has diffused out to T_1, which dilutes the level-dependence. The model therefore predicts a forward smile systematically flatter than the smile observed today (Figure 4). In the Monte-Carlo experiment here the 90–110 skew falls from 7.2 to 3.7 volatility points—roughly a halving—over a one-year forward start.
Why stochastic volatility does not flatten, and how LSV uses this
In a stochastic-volatility model the skew is generated by the volatility process itself—a fixed spot/vol correlation \rho and a vol-of-vol \xi that do not decay merely because time has passed. An option starting at T_1 still sees full-strength \rho and \xi, so it still sees a steep skew: nothing is averaged away because the skew-generating mechanism is intrinsic to the vol process rather than parasitic on spot’s position along a curve. With parameters tuned to match the local-vol model’s today skew, the stochastic-vol forward skew stays essentially put while the local-vol one collapses (Figure 5, Table 1).
| ATM vol | 90–110 skew (vol pts) | |||
|---|---|---|---|---|
| 2-3(lr)4-5 | today | forward | today | forward |
| Local volatility | 17.1 | 16.0 | 7.2 | 3.7 |
| Stochastic vol | 15.9 | 16.2 | 6.9 | 6.4 |
This is precisely the gap LSV closes. Keep a stochastic variance v_t for the dynamics and attach a deterministic leverage function L(t,S) to the spot diffusion:
Gyöngy’s mimicking theorem.
The answer is a corollary of a general result of Gyöngy relating a complicated Itô process to a simple Markovian one.1 In a sentence: any Itô process can be matched, marginal by marginal, by a local-volatility diffusion whose squared local volatility \sigma_{\mathrm{eff}}^2(t,S) equals the conditional expectation of the original instantaneous variance. Here \sigma_{\mathrm{eff}}(t,S) is the deterministic “effective” volatility of the mimicking local-vol model—the single number that, placed at (t,S), makes spot spread at the same average rate as the original process. The two models then share every marginal distribution of S_t, that is, the one-dimensional distribution of spot at each individual date t taken on its own (the histogram of S_t across all paths), as opposed to the joint distribution of the entire trajectory.2
The intuition is that a European vanilla of maturity T depends only on the marginal density of S_T—not on the path taken, and not on whether volatility was random along the way. And that marginal feels only the average squared diffusion coefficient at each level of spot: when a path sits at S_t=S, what spreads the density there is not the particular random value of v_t but its average across all the paths that are at S at time t. That average is the conditional expectation \mathbb{E}[v_t\mid S_t=S]: the mean of the instantaneous variance v_t over the ensemble of paths, conditioned on their sharing the same current spot S_t=S—i.e. taken over just the vertical column of paths above S in Figure 6, not over the whole population. Replace the random v_t at each (t,S) by this conditional mean and the marginal distribution of S—hence every vanilla price—is unchanged (Figure 6). The projection discards the randomness of volatility but keeps precisely the information vanillas can see.
Apply the projection to the LSV diffusion (2). Its instantaneous variance is L^2(t,S)\,v_t, so the effective squared local volatility Gyöngy produces is L^2(t,S)\,\mathbb{E}[v_t\mid S_t=S]. For the model to reprice the entire vanilla surface this must equal Dupire’s local volatility \sigma_{\mathrm{loc}}(t,S)—the unique local-vol function already calibrated to that surface. Equating the two gives the leverage identity, the central equation of LSV:
What the leverage function is.
Read the identity as a ratio (Figure 7, left). The numerator \sigma_{\mathrm{loc}}^2(t,S) is the target: the squared local vol the market demands at (t,S). The denominator \mathbb{E}[v_t\mid S_t=S] is what the bare stochastic-vol process delivers on average when spot is at S. Their ratio L^2(t,S) is the point-by-point correction that bends one onto the other. So the leverage function is a deterministic surface over (t,S) (Figure 7, right) that scales the stochastic volatility up where it falls short of the surface (L>1) and down where it overshoots (L<1). Two limiting cases pin down its meaning:
Switch the randomness off (v_t\equiv1). Then \mathbb{E}[v_t\mid S]=1 and L(t,S)=\sigma_{\mathrm{loc}}(t,S): the leverage function is the local-vol function, and LSV degenerates to pure Dupire.
Suppose the stochastic-vol process already reproduced the surface on its own. Then \mathbb{E}[v_t\mid S]=\sigma_{\mathrm{loc}}^2(t,S) and L\equiv1: no leverage is needed, and the model is pure stochastic vol.
The useful regime sits in between: L is a mild, order-one rescaling that contributes exactly the part of the skew the chosen \rho,\xi leave on the table—the bare conditional average \mathbb{E}[v\mid S] is too flat to bend into the target skew by itself, and the leverage makes up the difference (Figure 7, left).
Because L always restores the fit, the stochastic-vol parameters are not pinned down by vanillas: essentially any reasonable (\rho,\xi,\dots) can be made to reprice the surface once L is solved for. That residual freedom—a mixing weight running from pure local vol (L=\sigma_{\mathrm{loc}}, no stochastic vol) to as much stochastic vol as one cares to add—is what one calibrates to the forward smile and other exotics, and it is where the model risk lives.
Solving the fixed point
At first glance the leverage identity (3) looks like something you can simply evaluate: take the known Dupire surface \sigma_{\mathrm{loc}}^2(t,S), divide by the denominator \mathbb{E}[v_t\mid S_t=S], and read off L. It is not that simple, because the denominator is not a quantity we already have. The conditional average \mathbb{E}[v_t\mid S_t=S] is a feature of the joint distribution of spot and variance (S_t,v_t), and that distribution is whatever comes out of running the model—evolving the SDE (2) forward. But L sits inside that SDE, multiplying the spot diffusion, so the distribution the model produces already depends on L. We are going in a circle: to get L we need \mathbb{E}[v_t\mid S_t=S]; to get that we need the joint distribution; and to get the distribution we already need L (Figure 8).
This is why (3) is a fixed-point equation rather than an explicit formula: we are after a leverage function that reproduces itself when sent once around the loop. The technical name for an SDE of this kind—one whose coefficients depend on the distribution of its own solution, not merely on the current value of the state—is a McKean–Vlasov SDE. Two solvers dominate in practice, and neither cracks it in closed form: the forward-PDE (Fokker–Planck) method of Section 7, which carries the distribution as a deterministic density on a grid and evolves it with a PDE, and the particle (Monte-Carlo) method of Section 8, which carries it as a cloud of simulated paths and evolves them by simulation. Both escape the circularity the same way: L(t,\cdot) is only ever needed to push the model from t to t+\mathrm{d}t, so the loop can be broken locally in time.
Concretely, chop time into steps of length \Delta t and carry the joint distribution of (S_t,v_t) forward one step at a time. At the start of a step we already hold the time-t distribution—which is all the identity needs. Read the conditional mean \mathbb{E}[v_t\mid S_t=S] off it, form the leverage on that slice through the identity (3), L(t,S)=\sigma_{\mathrm{loc}}(t,S)\big/\sqrt{\mathbb{E}[v_t\mid S_t=S]}, then freeze this L(t,\cdot) and advance every path one Euler–Maruyama step (Appendix B),
Everything else—the bootstrap, the identity (3), the conditional expectation—is common to both solvers; they part company only in how they represent the advancing distribution (Figure 10), and the next two sections take each in turn.
The forward PDE (Fokker–Planck) method
Think of the joint density p(t,S,v) as a sheet of probability mass laid over the (S,v) plane, and push it forward in time. It obeys the Fokker–Planck (Kolmogorov forward) equation—a conservation law stating that probability is neither created nor destroyed, only transported by the drift and spread by the diffusion:
Discretise the plane into a grid and (5) becomes a large, sparse linear update that marches p one step. The pay-off is that once a slice is solved you hold the entire density, so the conditional expectation is a direct quadrature: for each spot column, average the variance against the density down that column (Figure 12),
The method is deterministic and noise-free: the density is known everywhere, so the conditional expectation and any greeks come out smooth, and the vanilla fit can be made very accurate. The cost is that this is a genuinely two-dimensional PDE. On a finite-difference grid the spatial operator couples each node to its neighbours in both S and v, and an explicit time march is throttled by stability limits, so one steps implicitly. Rather than invert the full two-dimensional system at every step, the standard choice is an alternating-direction implicit (ADI) scheme—Craig–Sneyd or Hundsdorfer–Verwer—which splits each step into a tridiagonal sweep along S and another along v, carrying the mixed \partial_{Sv} term (the \rho coupling) explicitly; each step is then a handful of cheap one-dimensional solves rather than one large two-dimensional one (Appendix A sets the Craig–Sneyd scheme out in full). Even so, work scales as (spot grid) \times (variance grid) \times (time steps), and every extra factor (stochastic rates, a second asset) adds a whole grid dimension, so the curse of dimensionality bites quickly. The wings are the delicate part: where the density is thin the conditional expectation is a ratio of two tiny numbers, and the boundary in v near zero needs care, especially when the Feller condition3 is violated and mass piles up at v=0. (Lipton’s work is the standard reference.)
Algorithm 1 collects the forward-PDE calibration as a single forward sweep: at each time level read the conditional mean off the current density, set the leverage from the identity, freeze it, and march the density one ADI step.
The particle method
Instead of a grid, represent the joint distribution by a large cloud of simulated paths, each particle carrying its own (S,v). The cloud is the density—dense where probability is high, sparse where it is low. To estimate \mathbb{E}[v\mid S=S^\ast] one does not integrate; one looks at the particles whose spot sits near S^\ast and averages their variances, weighting each by its closeness through a kernel K_\delta (a Nadaraya–Watson regression; Appendix C explains how a kernel works),
This sidesteps the 2D PDE entirely, scales gracefully to extra factors (just give each particle more state), and is fast—which is why it is the desk favourite (Guyon & Henry-Labordère ). The price is Monte-Carlo noise, and the bandwidth \delta is a bias–variance dial (Figure 14): too small and the conditional expectation is built from a handful of neighbours and is pure noise; too large and it over-smooths, biasing the estimate toward the global mean and washing out the very spot-dependence one is trying to capture. Standard practice is a maturity- and density-dependent bandwidth scaling like \delta_t \propto \sigma\sqrt{t}\,N^{-1/5} (the kernel-regression rate), sometimes in a k-nearest-neighbour form so the window adapts to local density. The wings hurt here too: sparse particles make the denominator both noisy and liable to be small, which blows the leverage up, so one floors the denominator and/or caps L. And because the estimate is biased at finite N, the calibration carries a systematic error—calibrate, then reprice the vanillas through the Monte Carlo and check the residual.
Algorithm 2 is the same bootstrap carried by the cloud: estimate each particle’s conditional mean by kernel regression, set its local leverage, then push every particle one Euler–Maruyama step. Calibration and simulation are literally the same loop.
| Forward PDE | Particle method | |
|---|---|---|
| Joint distribution as | density on a grid | cloud of paths |
| \mathbb{E}[v\mid S] via | quadrature down columns | kernel regression |
| Character | deterministic, noise-free | Monte-Carlo, noisy |
| Calibration | separate, reusable grid | fused with the pricing pass |
| Extra factors | adds a grid dimension | adds state per particle |
| Main weakness | 2D cost, v\!=\!0 boundary | bandwidth, MC bias |
Calibration versus pricing.
A final practical contrast: how each solver relates calibration to pricing. The forward-PDE method is a self-contained, deterministic calibration—one forward sweep produces the whole leverage surface L(t,S) tabulated on the grid, a reusable object that one then hands to a separate pricer (typically a Monte Carlo for the exotic in hand). The particle method instead fuses the two: calibration and path propagation are the same forward pass, with L built on the fly wherever the particles sit, so the product can be priced on the very paths that calibrated it. To reuse a particle-calibrated surface for other products one simply records L(t,\cdot) slice by slice as the pass proceeds, recovering a gridded surface much like the PDE’s.
Appendices
Appendix A. The Craig–Sneyd ADI scheme
Section 7 marched the Fokker–Planck equation (5) one step with an alternating-direction implicit (ADI) scheme. This appendix sets out one concrete member of that family—the Craig–Sneyd scheme —in enough detail to implement. The obstacle it is built to dodge is the mixed \partial_{Sv} term: a fully implicit step would couple every grid node to every other through both directions at once, a large two-dimensional linear solve. ADI keeps only one direction implicit at a time—a cheap tridiagonal solve—and carries the mixed term explicitly.
Finite differences, explicit and implicit.
Finite differencing replaces derivatives on a grid (spacing h) by differences of neighbouring values—for instance \partial_{xx}u\approx(u_{i+1}-2u_i+u_{i-1})/h^2. Doing this to the spatial derivatives of a diffusion equation \partial_t u=\partial_{xx}u turns it into a system of ODEs in time, \mathrm{d}U/\mathrm{d}t = A\,U, with A tridiagonal (each node coupled only to its two neighbours). What remains is how to step time. An explicit step (forward Euler) reads the new values straight off the old,
| explicit (forward Euler) | implicit (backward Euler) | |
|---|---|---|
| Work per step | one matrix–vector product | one linear solve |
| Stability | conditional, \Delta t\lesssim h^2/2 | unconditional |
| Step set by | stability (tiny steps) | accuracy |
What breaks in two dimensions is the implicit solve: the mixed \partial_{Sv} term couples the two grid directions, so I-\Delta t\,A is no longer tridiagonal but a wide-bandwidth two-dimensional system, expensive to factor. ADI is the device that keeps implicit stability while restoring the cheap tridiagonal solves.
Method of lines and operator splitting.
Apply the same recipe to the Fokker–Planck equation (5): discretise the (S,v) plane (central differences in space, time left continuous) so the grid densities form a vector U(t) obeying \mathrm{d}U/\mathrm{d}t = A\,U. In two dimensions the spatial operator splits by direction,
A_1 collects the S-derivatives (\partial_S,\partial_{SS}: the drift and the \tfrac12 L^2 v S^2 diffusion);
A_2 collects the v-derivatives (\partial_v,\partial_{vv}: the mean-reversion drift and the \tfrac12\xi^2 v diffusion);
A_0 is the mixed term \partial_{Sv} (the \rho\xi coupling).
The coefficients of A are frozen at the values implied by the leverage L(t,\cdot) computed at the start of the step (Section 7), so A is constant across the step. The split is engineered around one fact: with the nodes ordered along grid lines, A_1 couples a node only to its neighbours in S, so A_1 is block-diagonal—one tridiagonal block per line of constant v—and likewise A_2 is tridiagonal along v. Only the mixed operator A_0 couples the two directions, and it is never inverted.
The scheme.
Fix \theta=\tfrac12. One step from U^{n} (time t_n) to U^{n+1} (t_n+\Delta t) runs in two passes. A predictor—one explicit full step, then one implicit correction per direction:
Cost, accuracy, stability.
A step costs a fixed number of tridiagonal sweeps—two in each direction—plus a few sparse matrix–vector products: O(N_S N_v) work for an N_S\times N_v grid, linear in the number of nodes. Inverting the full two-dimensional operator directly costs far more (a dense factorisation scales as the cube of the node count, and even sparse 2D factorisations suffer heavy fill-in). The explicit predictor on its own (the Douglas scheme) is only first-order accurate once a mixed derivative is present; the Craig–Sneyd correction restores second-order accuracy in \Delta t at \theta=\tfrac12. And unlike a fully explicit march—whose stable step shrinks like the square of the grid spacing—the scheme stays stable at the large steps one actually wants, which is the whole point of going implicit.
Extra factors (stochastic rates, a second asset) split the same way: one more direction A_3,\dots and one more family of tridiagonal sweeps per step. The mixed terms then multiply, and variants such as the Modified Craig–Sneyd or Hundsdorfer–Verwer schemes handle several of them more robustly —but the structure is the one above.
Appendix B. Euler–Maruyama time stepping
The particle method (Section 8) and the discretised step of Section 7 advance the SDEs by Euler–Maruyama, the stochastic cousin of the forward-Euler step (8). For a generic Itô process \mathrm{d}X_t = a(X_t,t)\,\mathrm{d}t + b(X_t,t)\,\mathrm{d}W_t it reads, over a step \Delta t,
Accuracy.
Euler–Maruyama has strong order \tfrac12 (the pathwise error, matching individual trajectories, shrinks like \sqrt{\Delta t}) but weak order 1 (errors in expectations—hence option prices—shrink like \Delta t). For pricing it is the weak order that matters, so the scheme is first-order in practice; the Milstein scheme adds a \tfrac12\,b\,b'\,((\Delta W)^2-\Delta t) correction to reach strong order 1 .
The variance pitfall.
Applied naively to the CIR variance \mathrm{d}v_t=\kappa(\theta-v_t)\,\mathrm{d}t+\xi\sqrt{v_t}\,\mathrm{d}W_t^{v}, a Gaussian kick can push v_{t+\Delta t} below zero, where \sqrt{v} is undefined—the more so when the Feller condition is violated. Two standard repairs keep the scheme alive: full truncation, which uses \sqrt{v^{+}} (i.e. \max(v,0)) in the coefficients , and Andersen’s moment-matched quadratic-exponential (QE) scheme, more accurate near the origin . This is what Section 8 means by “stepped by the QE or full-truncation scheme, never naive Euler”. Austing discusses the practical choices.
Appendix C. How the kernel works
Section 8 estimates the conditional mean \mathbb{E}[v_t\mid S_t=S^\ast] from the particle cloud with a kernel-weighted average (7). This appendix unpacks what the kernel is doing.
A kernel K_\delta is just a weighting function—a smooth bump centred at zero that is largest there and decays as its argument moves away. The canonical choice is the Gaussian,
The conditional mean we want is the average variance among the paths that sit at S^\ast. The cloud almost never has particles exactly at S^\ast, so instead of an impossible exact match we take a soft neighbourhood: weight every particle by its closeness K_\delta(S^\ast-S_t^i) and form the weighted average of the variances. That is precisely (7); the denominator \sum_i K_\delta(S^\ast-S_t^i) simply normalises the weights so they sum to one. Equivalently, it fits a constant to the variances of the nearby particles, with nearness measured by the kernel—the simplest form of a local regression.
The bandwidth \delta is the one real choice, and it is the bias–variance dial of Figure 14. A small \delta keeps only the few closest particles: the estimate follows the true \mathbb{E}[v\mid S] faithfully (low bias) but, built from a handful of points, is noisy (high variance). A large \delta averages over a wide window: smooth and stable (low variance) but blurred toward the global mean, flattening out the very spot-dependence we are trying to measure (high bias). A hard window—count every particle within \pm\delta equally and ignore the rest—is the special case of a rectangular kernel; the smooth bump of (14) simply avoids the jitter of particles popping in and out of a sharp cutoff as S^\ast moves.
99
P. Austing. Smile Pricing Explained. Palgrave Macmillan, 2014.
B. Dupire. Pricing with a smile. Risk, 7(1):18–20, 1994.
S. L. Heston. A closed-form solution for options with stochastic volatility with applications to bond and currency options. Review of Financial Studies, 6(2):327–343, 1993.
P. S. Hagan, D. Kumar, A. S. Lesniewski, D. E. Woodward. Managing smile risk. Wilmott Magazine, pages 84–108, September 2002.
I. Gyöngy. Mimicking the one-dimensional marginal distributions of processes having an Itô differential. Probability Theory and Related Fields, 71(4):501–516, 1986.
A. Lipton. The vol smile problem. Risk, 15(2):61–65, 2002.
J. Guyon, P. Henry-Labordère. Being particular about calibration. Risk, 25(1):88–93, 2012.
J. Guyon, P. Henry-Labordère. Nonlinear Option Pricing. Chapman & Hall/CRC Financial Mathematics Series, 2014.
I. J. D. Craig, A. D. Sneyd. An alternating-direction implicit scheme for parabolic equations with mixed derivatives. Computers & Mathematics with Applications, 16(4):341–350, 1988.
K. J. in ’t Hout, S. Foulon. ADI finite difference schemes for option pricing in the Heston model with correlation. International Journal of Numerical Analysis and Modeling, 7(2):303–320, 2010.
W. Hundsdorfer, J. G. Verwer. Numerical Solution of Time-Dependent Advection–Diffusion–Reaction Equations. Springer, 2003.
P. E. Kloeden, E. Platen. Numerical Solution of Stochastic Differential Equations. Springer, 1992.
L. B. G. Andersen. Simple and efficient simulation of the Heston stochastic volatility model. Journal of Computational Finance, 11(3):1–42, 2008.
R. Lord, R. Koekkoek, D. van Dijk. A comparison of biased simulation schemes for stochastic volatility models. Quantitative Finance, 10(2):177–194, 2010.
A model is Markovian in spot if its future evolution depends on the present only through the current spot S_t: there is no extra hidden state carrying memory. A local-volatility model, in which spot is driven by a deterministic function of time and the current spot alone, is Markovian in this sense; a stochastic-vol model is not, because the variance v_t is a second state variable the future also depends on. The Markovian projection is the operation of replacing the non-Markovian model by the Markovian one that keeps the same one-dimensional distributions— collapsing the extra state v_t by averaging over it at each spot.↩︎
Matching the marginal of each S_t is far weaker than matching the full path distribution. The mimicking model gets every vanilla right—those depend only on a single marginal—but generally misprices path-dependent and forward-starting payoffs, which see the joint distribution across dates. That gap is exactly why the forward smile can differ between two models that share today’s vanillas.↩︎
For the CIR variance process \mathrm{d}v_t=\kappa(\theta-v_t)\,\mathrm{d}t+\xi\sqrt{v_t}\,\mathrm{d}W_t^{v} the Feller condition is 2\kappa\theta\ge\xi^2. When it holds the mean-reversion is strong enough relative to the vol-of-vol to repel the process from the origin, and v_t stays strictly positive; when it is violated (2\kappa\theta<\xi^2, common in equity calibrations with a large \xi) the origin becomes attainable and probability mass accumulates at v=0. There the variance diffusion \xi^2 v vanishes, so the v=0 boundary must be discretised carefully to keep the scheme stable and mass-conserving.↩︎