Bayesian Lexicostatistics

How old is Vedic Sanskrit?

Vedic manuscripts are all less than 1000 years old, but how old is the language?

Let’s compare it to Hindi, a direct descendant.

Just one trait

Datum: Vedic śíras ‘head’ are Hindi sir ‘head’ are cognate.

Suppose the word for ‘head’ gets replaced by an etymologically unrelated word on average once every 5000 years.

Every year, the probability of replacement is 1 in 5000.
The replacement rate is $\rho = 0.0002\,\textrm{yr}^{-1}$.

The probability that ‘head’ survives $t$ years is $$P(\textrm{survival} \given t) = \exp(-\rho t) = \exp(-t/5000).$$

There are two outcomes:

The survival outcome has a median at 3466 years ago, pretty close to what scholars think! But there’s a lot of uncertainty. If we chop off 5% of the mass on either end, we’re left with:

There is a 90% probability that Sanskrit is between 260 and 15000 years.

A second trait

Datum: Vedic tíṣṭati ‘stand’ > Hindi khaḍā hona ‘stand’.

What if we built the model with this trait instead? Let’s say the replacement rate for ‘stand’ is also once every 5000 years. So the model stays the same, but the outcome of interest is the other one.

You’ve probably noticed that this outcome has unbounded area. Infinitely larger than the other outcome, so infinitely more probable! The model has no a priori notion of how old languages can be. For explaning this single datum, the older Sanskrit is, the more probable the observed outcome.

Bayesian prior

Let’s be Bayesian and add in prior knowledge.

Lower bound: the oldest Vedic manuscript is around 1000 years old.
Upper bound: Hard to say. Humans dispersed from Africa 130000 years ago. Last glacial maximum was 24000 years ago.

Not helpful. Let’s try another tack. Vedas were transmitted orally. We don’t know of many oral traditions that surived more than 1000 years, so let’s say there is a 50% chance of an oral tradition perishing every 1000 years, which works out to a continuous death rate of 0.0007 per year. Run the clock backwards from when Vedic was finally attested textually to get this probability distribution for the age of Vedic:

The median is at 2000 years ago, by design. The 90%ile interval is $[1074, 5320]$. Yes, I pulled the death rate number out of a hat, but this prior is not overly dismissive of the possibility of Vedic being very old.

Synthesizing prior knowledge and data

Let’s combine prior knowledge with evidence from ‘stand’. For convenience, let’s write the outcome of ‘stand’ as $x_\textrm{stand}$.

$x_\textrm{stand}=1$ if ‘stand’ survives in Hindi.
$x_\textrm{stand}=0$ if ‘stand’ was replaced in Hindi.

Then, $P(x_\textrm{stand}, t) = P(t) \cdot P(x_\textrm{stand}\given t)$. Graphically:

$P(t)$

$P(x_\textrm{stand}\given t)$

$P(x_\textrm{stand}, t)$

$P(t)$ is a prior probability: the a priori probability that Vedic has age $t$.
$P(x_\textrm{stand}\given t)$ is a conditional probability: the probability of outcome $x_\textrm{stand}$, given that Vedic has age $t$.
$P(x_\textrm{stand}, t)$ is a joint probability: the probability of outcome $x_\textrm{stand}$ and of Vedic having age $t$.

$P(t)$ is a continuous probability distribution; the vertical axis is unlabeled because it’s understood that the area under the curve integrates to 1. $P(x_\textrm{stand} \given t)$ is a discrete distribution; its value is some proportion of 1, so the vertical axis is labeled. $P(x_\textrm{stand}, t)$ is, again, a continuous distribution; it’s understood that the area under the curve integrates to 1.

Thanks to the prior, the outcome where ‘stand’ is replaced has finite area; thanks to the data, it’s shifted right with respect to the prior. The median age is 2550, and the 90%ile interval is $[1140, 6430]$.

We can derive posterior distributions for $t$ by normalizing each outcome in the joint probability. We scale each curve so that it has area 1 under it.

$P(t \given x_\textrm{stand} = 1)$

$P(t \given x_\textrm{stand} = 0)$

$P(t\given x_\textrm{stand})$ is a posterior probability. It is a probability distribution for the quantity of interest $t$, after seeing some data, viz. the outcome $x_\textrm{stand}$.

What’s likelihood?

$P(x_\textrm{stand}\given t)$ can be called a model likelihood, or just likelihood. Why call it that if you can just call it a conditional probability? It’s a matter of perspective.

Mathematically speaking, $P(x_\textrm{stand}\given t)$ is a function with two inputs.
If you fix $t$ and vary $x_\textrm{stand}$, it tells you how probable the outcome $x_\textrm{stand}$ is, so it’s a conditional probability.
If you fix $x_\textrm{stand}$ and vary $t$, it tells you how good a particular setting of $t$ is for explaining the data (which is fixed after all) so it’s an indicator of model likelihood.
Depending on which you consider the “business end” of $P(x_\textrm{stand}\given t)$, one or the other term is more appropriate.

Likelihood invokes a Fillmorean frame.

The likelihood is often written $P(X \given \Theta)$.
$X$ is the data. (In this case, there is only one data point, $x_\textrm{stand}$.)
$\Theta$ is a collection of parameters. (In this case, the only parameter is $t$.)
The likelihood is often used by itself to estimate $\Theta$. The analyst looks for the setting of $\Theta$ that maximizes the likelihood. This procedure is called maximum likelihood. As we saw, this led to an absurd result in the case of ‘stand’.
A Bayesianist will use the likelihood in conjunction with a prior on $\Theta$ to obtain a posterior distribution for $\Theta$.

Combine evidence

Can we combine evidence from ‘head’ and ‘stand’? If we posit that replacement in ‘head’ and ‘stand’ are independent events, then it’s just multiplication: $$P(x_\textrm{head}, x_\textrm{stand} \given t) = P(x_\textrm{head} \given t) \cdot P(x_\textrm{stand} \given t).$$

The new likelihood is two likelihood terms multiplied together. Plotting the probability of the four outcomes, as a function of time, gives:

Multiplying by the prior, we get:

Normalize to get posterior distributions.

Conditioned on one in two traits surviving, the posterior median for $t$ is 2170, and the 90%ile interval is $[1100, 5210]$.

Ten traits

The posterior distribution from ten meaning classes is more informative. Here are posterior distributions for $t$, conditioned on $n$ traits suriving. (They are not all to scale with one another.)

For ten words with replacement rates of around 1 in 5000 years, five survived (‘head’, ‘laugh’, ‘root’, ‘wind’, ‘sea’) and five did not (‘dog’, ‘red’, ‘day’, ‘stand’, ‘ice’). The curve for this outcome (bright red) has median 2680. Its 90%ile interval is $[1330, 4970]$.

With more traits as data, the model is less sensitive to the prior. If we assume that oral traditions have a 1 in 4 chance of perishing every 1000 years (as opposed to 1 in 2) the posterior 90%ile interval shifts to $[1560, 6180]$. However, if there is “no prior”, the 90%ile interval shifts but a lot, to $[1810, 7520]$. This is a situation where making “no assumptions” is actually quite tendentious because it implicitly assumes that every year stretching back to the Big Bang (and beyond!) is equally likely a date for Vedic Sanskrit.

Onward to phylogenetics

There are many things to account for.

Replacement rates are not givens. They should be estimated by looking at language pairs like Latin/French, where the time interval is basically known.
Languages don’t all evolve at the same rate. Rates estimated for Latin/French cannot automatically be applied to Vedic/Hindi.
Language pairs overlap. Latin/French overlaps with Latin/Spanish. Both speak to rates of change, but not independently. We can model this with a tree, but now we have an extra node (Proto Western Romance) whose time we do not know, and whose state is not observed.
If we introduce a third Romance language like Italian, we have another problem. We don’t necessary know which of the three (Italian, French, Spanish) branched off first. The topology is unknown.

The solution to all of these problems is the same. Introduce random variables for everything that we don’t absolutely know, and construct plausible probability distributions for them.

Notation

There are $L$ attested languages (numbered $1, \ldots, L$) and $N$ traits.
The data $\bfx$ is an $L \times N$ matrix of zeros and ones; $x_{ln}$ indicates whether language $l$ has trait $n$.
There are protolanguages (splits) that are not attested. These are numbered $L+1, \ldots, L'$, for a total of $L'$ languages, attested or not.
To accomodate unattested languages, let’s define an augmented version of $\bfx$ called $\bfy$ that has extra rows for unattested languages. It is an $L' \times N$ matrix whose first $L$ rows are identical to $\bfx$.
Branches are numbered too. Branch $l$ is the branch immediately rootward of language $l$.

For the sake of concreteness, let’s say a trait is whether a language uses a particular etymological root in the most basic word for a particular sense.

Tree prior

Construct a tree prior, which is a distribution over topologies and chronologies. Posit:

A topology $T$ with distribution $P(T)$.
A time-depth $t_l$ for every language $l$. Let $\bft = (t_1, \ldots, t_{L'})$ and posit a distribution $P(\bft)$.

The tree prior is where we put prior notions of how old various languages and proto-languages are.

In the general case, the distribution over split times depends on $T$, so it should be $P(\bft \given T)$. However, it’s extremely common for $P(T, \bft)$ to factor neatly into $P(T) \cdot P(\bft)$, so that’s what we’re doing here.

Trait model

The previous model accounted only for traits that died on the way from Vedic to Hindi; it did not also model traits coming into existence. Now we model both death and birth.

We posit:

A basic rate of trait death $\delta$ with distribution $P(\delta)$.
A basic rate of trait birth $\beta$ with distribution $P(\beta)$.

In the long run, a trait will be:

Alive with probability $\pi_1 = \frac{\beta}{\beta+\delta},$
Dead with probability $\pi_0 = \frac{\delta}{\beta+\delta}.$

This is its stationary distribution. (Stationary just means “in the long run”.) Over time, a trait will approach stationarity. For language $l$ and its parent $m$, this plot shows the probability of $l$ having trait $n$, if $m$ has it (top curve) or lacks it (bottom curve). The horizontal axis is the time spanned by the two languages.

$P(y_{ln} = 1 \given y_{mn} = 1, \Theta)$

$P(y_{ln} = 1 \given y_{mn} = 0, \Theta)$

$\beta / (\beta + \delta)$

If a trait has birth and death rates of $\beta$ and $\delta$, its state approaches stationarity at the rate of $\beta + \delta.$

Rate heterogeneity priors

To model rate heterogeneity, posit:

A rate modifier $b_l$ that pertains to branch $l$, with distribution $P(b_l)$.
A rate modifier $s_n$ that pertains to trait $n$, with distribution $P(s_n)$.

Now that we’re done introducing model parameters, let’s define $\Theta = (\beta, \delta, \bfb, \bfs, T, \bft)$ as a convenient way to refer to all of them at once.

Trait $n$ on branch $l$ approaches stationarity at rate $\rho_{ln} = (\beta + \delta) b_ls_n$. \begin{align} P(y_{ln} = 1 \given y_{mn} = 1, \Theta) &= \pi_1 + \pi_0\exp(-\rho_{ln}(t_m - t_l)),\\ P(y_{ln} = 0 \given y_{mn} = 1, \Theta) &= \pi_0 - \pi_0\exp(-\rho_{ln}(t_m - t_l)).\\ P(y_{ln} = 1 \given y_{mn} = 0, \Theta) &= \pi_1 - \pi_1\exp(-\rho_{ln}(t_m - t_l)),\\ P(y_{ln} = 0 \given y_{mn} = 0, \Theta) &= \pi_0 + \pi_1\exp(-\rho_{ln}(t_m - t_l)). \end{align}

We construct a probability distribution for the state of the root language $r$ by drawing from the stationary distribution: \begin{align*} P(y_{rn} = 1 \given \beta, \delta) &= \pi_1,\\ P(y_{rn} = 0 \given \beta, \delta) &= \pi_0. \end{align*}

Likelihood terms

Complete the model by defining $$P(\bfx_n \given \beta, \delta, \bfb, \bfs, T, \bft),$$ where

$\bfx_n = (x_{1n}, \ldots, x_{Ln})^T$ is the $n$th column of $\bfx$. It is the distribution of trait $n$ in the $L$ attested languages,
$\beta, \delta$ are the basic birth and death rates,
$\bfb = (b_1, \ldots, b_{L'})$ are the branch-specific rate modifiers,
$\bfs = (s_1, \ldots, s_N)$ are the trait-specific rate modifiers,
$T$ is the topology,
$\bft = (t_1, \ldots, t_{L'})$ are language time-depths.

Now we can express the likelihood term as $P(\bfx_n \given \Theta)$.

Likelihood details

As described above, each branch $l$ in the tree induces a conditional probability $P(x_{ln} \given x_{mn}, \Theta).$

We also specified a probability distribution for the state of the root, $P(x_{rn} \given \Theta).$

Intuitively one wants to multiply these probabilities together, but this gets us $P(\bfy_n \given \Theta),$ which is the probability for the distribution of trait $n$ in all languages, including unattested ones. To get from $P(\bfy_n \given \Theta)$ to $P(\bfx_n \given \Theta)$, we have to sum over all possible settings for $\bfy_n$ that are compatible with $\bfx_n$. The standard procedure for this is described in Joseph Felsenstein’s Inferring Phylogenies, pages 251-255, in sections titled “Computing the likelihood of a tree” and “Economizing on the computation”.

Construct the posterior probability distribution

In the Vedic/Hindi example, we had:

Prior probability $P(t)$.
Conditional probabilities $P(x_n \given t)$ for traits $n=1, \ldots, 10$.
Joint probability $P(\bfx, t) = P(t) \cdot \prod_{n=1}^{10} P(x_n \given t).$
Posterior probability $P(t \given \bfx) = P(\bfx, t) / Z$, where $Z$ is a normalization constant.

Now we have:

Prior probability $P(\Theta) = P(\beta) \cdot P(\delta) \cdot P(\bfb) \cdot P(\bfs) \cdot P(T) \cdot P(\bft)$.
Conditional probability $P(\bfx_n \given \Theta)$ for traits $n=1, \ldots, N$.
Joint probability $P(\bfx, \Theta) = P(\Theta) \cdot \prod_{n=1}^N P(\bfx_n \given \Theta).$
Posterior probability $P(\Theta \given \bfx) = P(\bfx, \Theta) / Z$, where $Z$ is a normalization constant.

MrBayes and BEAST will generate samples from the posterior probability. To get the posterior probability for something we care about like $t_\textrm{Vedic}$, just take the samples as they come, note $t_\textrm{Vedic}$, and throw everything else away! With enough samples, you can easily calculate things like mean, median, and belief intervals.

Model versus inference

A model should not be conflated with inference, which is the method used to compute results. A Bayesian model is:

A prior over parameters $P(\Theta)$, and
A likelihood $P(x \given \Theta)$.

The model is supposed to contain all of the scientific insight. You shouldn’t have to understand anything about inference to interpret the model or the result.

The model determines the result, whereas inference is merely a way to find the result.

Often the same model is amenable to many different inference strategies, all of which are supposed to yield the same result. MCMC is just one of many such strategies.

In an ideal world, none of us would have to worry about inference; we could just write down the model, and a computer program will do inference for us.