Bayesian Lexicostatistics

How old is Vedic Sanskrit?

Vedic manuscripts are all less than 1000 years old, but how old is the language?

Let’s compare it to Hindi, a direct descendant.

Just one trait

Datum: Vedic śíras ‘head’ and Hindi sir ‘head’ are cognate.

Suppose the word for ‘head’ gets replaced by an etymologically unrelated word on average once every 5000 years.

The probability that ‘head’ survives $t$ years is $$P(\textrm{survival} \given t) = \exp(-\rho t) = \exp(-t/5000).$$

There are two outcomes:

The survival outcome has a median at 3466 years ago, pretty close to what scholars think! But there’s a lot of uncertainty. If we chop off 5% of the mass on either end, we’re left with:

There is a 90% probability that Sanskrit is between 260 and 15000 years old.
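If you want to check these numbers, here is a minimal NumPy sketch. It treats every age as equally likely a priori, which is what reading the survival curve as a distribution over $t$ implicitly does; only the 1-in-5000-years replacement rate comes from above.

```python
import numpy as np

# P(survival | t) = exp(-t / 5000).  Normalized over t (with an implicit
# flat prior on age), this is an exponential distribution with scale 5000.
scale = 5000.0
median = scale * np.log(2)                 # ~3466 years
lo, hi = -scale * np.log([0.95, 0.05])     # central 90% interval, ~[256, 14979]
print(f"median ≈ {median:.0f}, 90% interval ≈ [{lo:.0f}, {hi:.0f}]")
```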

A second trait

Datum: Vedic tíṣṭati ‘stand’ and Hindi khaḍā hona ‘stand’ are not cognate.

What if we built the model with this trait instead? Let’s say the replacement rate for ‘stand’ is also once every 5000 years. So the model stays the same, but the outcome of interest is the other one.

You’ve probably noticed that this outcome has unbounded area. It is infinitely larger than the other outcome, so infinitely more probable! The model has no a priori notion of how old languages can be. For explaining this single datum, the older Sanskrit is, the more probable the observed outcome.

Bayesian prior

Let’s be Bayesian and add in prior knowledge.

Not helpful. Let’s try another tack. The Vedas were transmitted orally. We don’t know of many oral traditions that survived more than 1000 years, so let’s say there is a 50% chance of an oral tradition perishing every 1000 years, which works out to a continuous death rate of about 0.0007 per year. Run the clock backwards from when Vedic was finally attested textually to get this probability distribution for the age of Vedic:

The median is at 2000 years ago, by design. The 90%ile interval is $[1074, 5320]$. Yes, I pulled the death rate number out of a hat, but this prior is not overly dismissive of the possibility of Vedic being very old.
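Here is the same kind of sanity check for the prior, as a NumPy sketch. One assumption is mine: that textual attestation sits at roughly 1000 years ago, which is what makes the median land at 2000 by design.

```python
import numpy as np

# Oral-tradition clock, run backwards from textual attestation.
# A 50% chance of perishing per 1000 years is a rate of ln(2)/1000 ≈ 0.0007/yr.
attested = 1000.0                # assumed age of textual attestation (years ago)
rate = np.log(2) / 1000.0
median = attested + np.log(2) / rate          # 2000, by design
lo = attested - np.log(0.95) / rate           # ~1074
hi = attested - np.log(0.05) / rate           # ~5322
print(f"median ≈ {median:.0f}, 90% interval ≈ [{lo:.0f}, {hi:.0f}]")
```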

Synthesizing prior knowledge and data

Let’s combine prior knowledge with evidence from ‘stand’. For convenience, let’s write the outcome of ‘stand’ as $x_\textrm{stand}$.

Then, $P(x_\textrm{stand}, t) = P(t) \cdot P(x_\textrm{stand}\given t)$. Graphically:

[Figure: $P(t) \times P(x_\textrm{stand}\given t) = P(x_\textrm{stand}, t)$, shown as three plots side by side]

$P(t)$ is a continuous probability distribution; the vertical axis is unlabeled because it’s understood that the area under the curve integrates to 1. $P(x_\textrm{stand} \given t)$ is a discrete distribution; its value is some proportion of 1, so the vertical axis is labeled. $P(x_\textrm{stand}, t)$ is, again, a continuous distribution; it’s understood that the area under the curve integrates to 1.

Thanks to the prior, the outcome where ‘stand’ is replaced has finite area; thanks to the data, it’s shifted right with respect to the prior. The median age is 2550, and the 90%ile interval is $[1140, 6430]$.

We can derive posterior distributions for $t$ by normalizing each outcome in the joint probability. We scale each curve so that it has area 1 under it.

[Figure: the two posterior curves, $P(t \given x_\textrm{stand} = 1)$ and $P(t \given x_\textrm{stand} = 0)$]
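To make the prior-times-likelihood-then-normalize recipe concrete, here is a small grid-based sketch. The grid bounds and the assumed attestation age of 1000 years are mine; everything else follows the setup above.

```python
import numpy as np

# Posterior P(t | 'stand' replaced) on a grid of candidate ages.
t = np.linspace(1000, 50_000, 200_001)
dt = t[1] - t[0]
prior = np.exp(-(np.log(2) / 1000) * (t - 1000))   # oral-tradition clock prior
likelihood = 1 - np.exp(-t / 5000)                 # P('stand' replaced | t)
posterior = prior * likelihood
posterior /= posterior.sum() * dt                  # normalize: area under curve = 1

cdf = np.cumsum(posterior) * dt
lo, median, hi = np.interp([0.05, 0.5, 0.95], cdf, t)
print(f"median ≈ {median:.0f}, 90% interval ≈ [{lo:.0f}, {hi:.0f}]")
# Should reproduce the quoted median of 2550 and interval [1140, 6430].
```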

What’s likelihood?

$P(x_\textrm{stand}\given t)$ can be called a model likelihood, or just likelihood. Why call it that if you can just call it a conditional probability? It’s a matter of perspective.

Likelihood invokes a Fillmorean frame.

Combine evidence

Can we combine evidence from ‘head’ and ‘stand’? If we posit that replacement in ‘head’ and ‘stand’ are independent events, then it’s just multiplication: $$P(x_\textrm{head}, x_\textrm{stand} \given t) = P(x_\textrm{head} \given t) \cdot P(x_\textrm{stand} \given t).$$

The new likelihood is two likelihood terms multiplied together. Plotting the probabilities of the four outcomes as a function of time gives:

Multiplying by the prior, we get:

Normalize to get posterior distributions.

Conditioned on one of the two traits surviving, the posterior median for $t$ is 2170, and the 90%ile interval is $[1100, 5210]$.
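The same grid sketch extends directly to two independent traits; under the same assumptions as before, it should land on the numbers just quoted.

```python
import numpy as np

# Two independent traits: 'head' survived, 'stand' was replaced.
t = np.linspace(1000, 50_000, 200_001)
dt = t[1] - t[0]
prior = np.exp(-(np.log(2) / 1000) * (t - 1000))
likelihood = np.exp(-t / 5000) * (1 - np.exp(-t / 5000))   # survived × replaced
posterior = prior * likelihood
posterior /= posterior.sum() * dt

cdf = np.cumsum(posterior) * dt
lo, median, hi = np.interp([0.05, 0.5, 0.95], cdf, t)
print(f"median ≈ {median:.0f}, 90% interval ≈ [{lo:.0f}, {hi:.0f}]")
# Should reproduce the quoted median of 2170 and interval [1100, 5210].
```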

Ten traits

The posterior distribution from ten meaning classes is more informative. Here are posterior distributions for $t$, conditioned on $n$ traits surviving. (They are not all to scale with one another.)

For ten words with replacement rates of around 1 in 5000 years, five survived (‘head’, ‘laugh’, ‘root’, ‘wind’, ‘sea’) and five did not (‘dog’, ‘red’, ‘day’, ‘stand’, ‘ice’). The curve for this outcome (bright red) has median 2680. Its 90%ile interval is $[1330, 4970]$.
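Here is the ten-trait version of the same sketch, again under my assumed attestation age of 1000 years; it should come out close to the bright-red curve’s numbers.

```python
import numpy as np

# Ten independent traits, each with a 1-in-5000-years replacement rate:
# five survived, five were replaced.
t = np.linspace(1000, 50_000, 200_001)
dt = t[1] - t[0]
prior = np.exp(-(np.log(2) / 1000) * (t - 1000))
p_survive = np.exp(-t / 5000)
likelihood = p_survive**5 * (1 - p_survive)**5
posterior = prior * likelihood
posterior /= posterior.sum() * dt

cdf = np.cumsum(posterior) * dt
lo, median, hi = np.interp([0.05, 0.5, 0.95], cdf, t)
print(f"median ≈ {median:.0f}, 90% interval ≈ [{lo:.0f}, {hi:.0f}]")
# Should land close to the quoted median of 2680 and interval [1330, 4970].
```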

With more traits as data, the model is less sensitive to the prior. If we assume that oral traditions have a 1 in 4 chance of perishing every 1000 years (as opposed to 1 in 2), the posterior 90%ile interval shifts to $[1560, 6180]$. However, with “no prior” at all, the 90%ile interval shifts by quite a lot, to $[1810, 7520]$. This is a situation where making “no assumptions” is actually quite tendentious, because it implicitly assumes that every year stretching back to the Big Bang (and beyond!) is equally likely as a date for Vedic Sanskrit.

Onward to phylogenetics

There are many things to account for.

The solution to all of these problems is the same. Introduce random variables for everything that we don’t absolutely know, and construct plausible probability distributions for them.

Notation

For the sake of concreteness, let’s say a trait is whether a language uses a particular etymological root in the most basic word for a particular sense.

Tree prior

Construct a tree prior, which is a distribution over topologies and chronologies. Posit:

The tree prior is where we put prior notions of how old various languages and proto-languages are.

In the general case, the distribution over split times depends on $T$, so it should be $P(\bft \given T)$. However, it’s extremely common for $P(T, \bft)$ to factor neatly into $P(T) \cdot P(\bft)$, so that’s what we’re doing here.

Trait model

The previous model accounted only for traits that died on the way from Vedic to Hindi; it did not also model traits coming into existence. Now we model both death and birth.

We posit:

In the long run, a trait will be: present with probability $\pi_1 = \beta / (\beta + \delta)$, and absent with probability $\pi_0 = \delta / (\beta + \delta)$.

This is its stationary distribution. (Stationary just means “in the long run”.) Over time, a trait will approach stationarity. For language $l$ and its parent $m$, this plot shows the probability of $l$ having trait $n$, if $m$ has it (top curve) or lacks it (bottom curve). The horizontal axis is the time spanned by the two languages.

[Figure: $P(y_{ln} = 1 \given y_{mn} = 1, \Theta)$ (top curve) and $P(y_{ln} = 1 \given y_{mn} = 0, \Theta)$ (bottom curve), both approaching the asymptote $\beta / (\beta + \delta)$]

If a trait has birth and death rates of $\beta$ and $\delta$, its state approaches stationarity at the rate of $\beta + \delta.$

Rate heterogeneity priors

To model rate heterogeneity, posit:

Now that we’re done introducing model parameters, let’s define $\Theta = (\beta, \delta, \bfb, \bfs, T, \bft)$ as a convenient way to refer to all of them at once.

Trait $n$ on branch $l$ approaches stationarity at rate $\rho_{ln} = (\beta + \delta)\, b_l s_n$. \begin{align} P(y_{ln} = 1 \given y_{mn} = 1, \Theta) &= \pi_1 + \pi_0\exp(-\rho_{ln}(t_m - t_l)),\\ P(y_{ln} = 0 \given y_{mn} = 1, \Theta) &= \pi_0 - \pi_0\exp(-\rho_{ln}(t_m - t_l)),\\ P(y_{ln} = 1 \given y_{mn} = 0, \Theta) &= \pi_1 - \pi_1\exp(-\rho_{ln}(t_m - t_l)),\\ P(y_{ln} = 0 \given y_{mn} = 0, \Theta) &= \pi_0 + \pi_1\exp(-\rho_{ln}(t_m - t_l)). \end{align}
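As a sanity check on these formulas, here is a small sketch that builds the branch transition matrix; the specific values of $\beta$, $\delta$, and the branch span are illustrative, not taken from any dataset.

```python
import numpy as np

def transition_matrix(beta, delta, b_l, s_n, span):
    """2x2 transition matrix for one trait along one branch.
    Rows: parent state (0 = absent, 1 = present); columns: child state."""
    pi1 = beta / (beta + delta)           # stationary P(present)
    pi0 = delta / (beta + delta)          # stationary P(absent)
    rho = (beta + delta) * b_l * s_n      # rate of approach to stationarity
    decay = np.exp(-rho * span)
    return np.array([
        [pi0 + pi1 * decay, pi1 - pi1 * decay],   # parent lacks the trait
        [pi0 - pi0 * decay, pi1 + pi0 * decay],   # parent has the trait
    ])

P = transition_matrix(beta=1 / 10_000, delta=1 / 5_000, b_l=1.0, s_n=1.0, span=2_000)
print(P)                  # each row sums to 1
print(P.sum(axis=1))
```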

We construct a probability distribution for the state of the root language $r$ by drawing from the stationary distribution: \begin{align*} P(y_{rn} = 1 \given \beta, \delta) &= \pi_1,\\ P(y_{rn} = 0 \given \beta, \delta) &= \pi_0. \end{align*}

Likelihood terms

Complete the model by defining $$P(\bfx_n \given \beta, \delta, \bfb, \bfs, T, \bft),$$ where $\bfx_n$ is the vector of states of trait $n$ across the attested languages.

Now we can express the likelihood term as $P(\bfx_n \given \Theta)$.

Likelihood details

As described above, each branch $l$ in the tree induces a conditional probability $P(y_{ln} \given y_{mn}, \Theta).$

We also specified a probability distribution for the state of the root, $P(y_{rn} \given \Theta).$

Intuitively one wants to multiply these probabilities together, but this gets us $P(\bfy_n \given \Theta),$ which is the probability for the distribution of trait $n$ in all languages, including unattested ones. To get from $P(\bfy_n \given \Theta)$ to $P(\bfx_n \given \Theta)$, we have to sum over all possible settings for $\bfy_n$ that are compatible with $\bfx_n$. The standard procedure for this is described in Joseph Felsenstein’s Inferring Phylogenies, pages 251-255, in sections titled “Computing the likelihood of a tree” and “Economizing on the computation”.
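Here is a minimal sketch of that sum over unattested languages for a single binary trait, in the spirit of Felsenstein’s pruning algorithm. The toy tree, the per-branch transition matrices, and all the names are placeholders of mine, not the notation of Felsenstein, MrBayes, or BEAST.

```python
import numpy as np

def partial_likelihoods(node, tips, children, trans):
    """Return [L(absent), L(present)]: P(observed tips below node | node's state)."""
    if node in tips:                       # attested language: state is observed
        observed = tips[node]
        return np.array([1.0 - observed, float(observed)])
    L = np.ones(2)
    for child in children[node]:
        child_L = partial_likelihoods(child, tips, children, trans)
        L *= trans[child] @ child_L        # sum over the child's (unattested) state
    return L

# Toy tree: root -> (A, B).  trans[c][i, j] = P(child c in state j | parent in state i).
children = {"root": ["A", "B"]}
trans = {"A": np.array([[0.9, 0.1], [0.3, 0.7]]),
         "B": np.array([[0.8, 0.2], [0.4, 0.6]])}
tips = {"A": 1, "B": 0}                    # A has the trait, B lacks it
pi = np.array([0.6, 0.4])                  # stationary distribution at the root

L_root = partial_likelihoods("root", tips, children, trans)
print(pi @ L_root)                         # P(x_n | Theta) for this one trait
```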

Construct the posterior probability distribution

In the Vedic/Hindi example, we had a prior over a single unknown, the age $t$, and a likelihood term per trait; multiplying them and normalizing gave the posterior over $t$. Now we have a prior over all of $\Theta$ and a likelihood term $P(\bfx_n \given \Theta)$ per trait; multiplying them and normalizing gives the posterior over $\Theta$.

MrBayes and BEAST will generate samples from the posterior distribution. To get the posterior for something we care about, like $t_\textrm{Vedic}$, just take the samples as they come, note $t_\textrm{Vedic}$, and throw everything else away! With enough samples, you can easily calculate things like the mean, the median, and belief intervals.
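For instance, with stand-in draws in place of real MrBayes or BEAST output, the summaries are one-liners:

```python
import numpy as np

# Stand-in for sampled ages of Vedic pulled from an MCMC trace; a real run
# would supply these numbers instead.
rng = np.random.default_rng(0)
t_vedic = rng.gamma(shape=9.0, scale=300.0, size=10_000)

print("mean  :", t_vedic.mean())
print("median:", np.median(t_vedic))
print("90% interval:", np.percentile(t_vedic, [5, 95]))
```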

Model versus inference

A model should not be conflated with inference, which is the method used to compute results. A Bayesian model is:

The model is supposed to contain all of the scientific insight. You shouldn’t have to understand anything about inference to interpret the model or the result.

The model determines the result, whereas inference is merely a way to find the result.

Often the same model is amenable to many different inference strategies, all of which are supposed to yield the same result. MCMC is just one of many such strategies.

In an ideal world, none of us would have to worry about inference; we could just write down the model, and a computer program would do the inference for us.