This is a sketch of a solution to Task: text to Bayes rationality.
The paradigm is Bayesian epistemology. The broader task is to infer a rational worldview from empirical observation. Here, we use a collection of documents as our link to the real world: we observe that somebody created a document like this.
Roughly speaking, we infer our rational worldview by enforcing Bayes’ rule across all possible combinations of model weights and observations. The engine of this arrangement is a language model conditioned on propositional knowledge, paired with a knowledge model conditioned on language.
Preliminaries
In reality, there are at least two Bayes’ rules: the discrete and the continuous. We use the continuous form:
$$f_{X|Y=y}(x) = f_{Y|X=x}(y) f_{X}(x) / f_{Y}(y)$$
where each function is a probability density function / conditional density.
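As a quick sanity check, a conjugate Gaussian toy model (my own example, not part of the system) satisfies this identity exactly:

```python
import numpy as np

def npdf(x, mean, var):
    """Density of a scalar normal distribution."""
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Conjugate toy model: X ~ N(0, 1) and Y | X = x ~ N(x, 1).
# Standard Gaussian conjugacy gives Y ~ N(0, 2) and X | Y = y ~ N(y/2, 1/2).
x, y = 0.7, 1.3
lhs = npdf(x, y / 2, 0.5)                                    # f_{X|Y=y}(x)
rhs = npdf(y, x, 1.0) * npdf(x, 0.0, 1.0) / npdf(y, 0.0, 2.0)
print(np.isclose(lhs, rhs))  # Bayes' rule holds: prints True
```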
To make a continuous distribution over something discrete like words, we use a traditional word embedding summed with positional encoding, then passed through the PDF of a multivariate normal distribution with inferred mean and covariance matrix. (How this interacts with the positional encoding I’m not clear on….)
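A minimal sketch of that density computation, with random stand-ins for the learned embedding tables and the inferred mean and covariance (every name and size here is a placeholder, not the actual design):

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab_size, max_len = 8, 100, 32  # arbitrary placeholder sizes

# Random stand-ins for the learned tables: a word embedding and a
# positional encoding that are summed before scoring.
word_emb = rng.normal(size=(vocab_size, d))
pos_emb = rng.normal(size=(max_len, d))

# Random stand-ins for the inferred mean and covariance of the MVN.
mu = rng.normal(size=d)
A = rng.normal(size=(d, d))
cov = A @ A.T + np.eye(d)  # symmetric positive definite

def log_density(token_id, position):
    """Log-density of an embedded token (word + positional) under the MVN."""
    x = word_emb[token_id] + pos_emb[position]
    diff = x - mu
    _, logdet = np.linalg.slogdet(cov)
    quad = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + quad)

print(log_density(token_id=3, position=0))
```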
The multivariate normal is particularly useful because it can be arbitrarily marginalized onto any subset of the components of the random vector; this follows from the fact that any linear transformation of a multivariate normal random vector (and projection onto a subset of components is one such transformation) is itself multivariate normal.
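Concretely, marginalizing a multivariate normal onto a subset of components just means keeping the matching entries of the mean and the matching rows and columns of the covariance. A quick Monte Carlo check (toy numbers, not part of the system):

```python
import numpy as np

rng = np.random.default_rng(0)

# Joint distribution over a 4-dimensional knowledge vector (toy numbers).
mu = np.array([0.0, 1.0, -0.5, 2.0])
A = rng.normal(size=(4, 4))
cov = 0.25 * A @ A.T + np.eye(4)  # symmetric positive definite

# Marginalizing onto components {0, 2}: select the matching entries of
# the mean and the matching rows/columns of the covariance.
idx = [0, 2]
mu_marg = mu[idx]
cov_marg = cov[np.ix_(idx, idx)]

# Monte Carlo sanity check: sample the joint, discard the other
# components, and compare empirical moments with the analytic marginal.
samples = rng.multivariate_normal(mu, cov, size=200_000)
mean_ok = np.allclose(samples[:, idx].mean(axis=0), mu_marg, atol=0.05)
cov_ok = np.allclose(np.cov(samples[:, idx].T), cov_marg, atol=0.05)
print(mean_ok, cov_ok)
```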
Distributions of interest
There are five:
- $P(\vec{w})$—a general language model. This decomposes by the chain rule as $P(\vec{w}) = \Pi_{i} P(w_i | \vec{w}_{j < i})$.
  Implementation: unclear; we need a probabilistic language model; can we get a probabilistic interpretation of a transformer?
- $P(K)$—a general knowledge model. How likely, a priori, is a belief or statement to be true?
  Implementation: a multivariate normal would be a starting point.
- $P(\vec{w} | K)$—the knowledge-conditional language model. This is the probability of a document $\vec{w}$ given some assertion $K$ about the state of the world, the nature of reality, or whatever. $K$ may make claims about only a subset of reality; the world is a complex place, so it’s helpful to be able to discuss parts of it rather than always the whole. This is enabled by the marginalizability of the multivariate normal discussed above. By the chain rule this decomposes to $\Pi_{i} P(w_i | \vec{w}_{j < i}, K)$.
  Implementation: uncertain; a multivariate normal parameterized by a transformer with $K$ as input?
- $P(K | \vec{w})$—the language-conditional knowledge model. Given a word and its context, how likely is an assertion about our model to be true?
  Implementation: uncertain; another probabilistic transformer? A multivariate normal whose parameters are a function of $\vec{w}$, perhaps the output of a transformer?
- $P(K | J)$ where $K$ and $J$ are disjoint propositions—a hypotheticals model. What does assuming part of our model say about the rest of our model?
  Implementation: a multivariate normal parameterized by the output of a transformer.
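As one possible shape for the “multivariate normal parameterized by a transformer” idea, here is a minimal sketch in which a fixed random linear map stands in for the transformer; every name, dimension, and parameterization choice below is a placeholder, not the actual design:

```python
import numpy as np

rng = np.random.default_rng(1)
d_ctx, d_k = 16, 4  # context-representation and knowledge-vector sizes (arbitrary)

# Stand-in for a transformer: fixed random linear maps from a pooled
# context representation h to the MVN parameters. In the real system
# these would be the learned outputs of a transformer over the document.
W_mu = 0.1 * rng.normal(size=(d_k, d_ctx))
W_L = 0.1 * rng.normal(size=(d_k * d_k, d_ctx))

def knowledge_given_words(h):
    """Parameters (mu, cov) of P(K | w) as functions of the context h."""
    mu = W_mu @ h
    raw = (W_L @ h).reshape(d_k, d_k)
    L = np.tril(raw, k=-1)               # strictly lower triangle
    L += np.diag(np.exp(np.diag(raw)))   # positive diagonal -> valid Cholesky factor
    return mu, L @ L.T                   # covariance is positive definite by construction

h = rng.normal(size=d_ctx)
mu, cov = knowledge_given_words(h)
print(mu.shape, cov.shape)
```

Parameterizing the Cholesky factor rather than the covariance directly is one common way to guarantee positive definiteness for any network output.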
Training Procedure
Randomly sample word-with-context $\vec{w}$ and knowledge vector $\vec{k}$. Randomly partition $\vec{k}$ into disjoint vectors $\vec{q}$ and $\vec{r}$. Compute the gradient of the loss:
$$\mathfrak{L}_{Int} = [P(\vec{q} | \vec{r}) - P(\vec{r} | \vec{q}) P(\vec{q}) / P(\vec{r})]^2$$
$$\mathfrak{L}_{Obs} = [P(\vec{w} | \vec{k}) - P(\vec{k} | \vec{w}) P(\vec{w}) / P(\vec{k})]^2$$
$$\mathfrak{L} = \mathfrak{L}_{Int} + \mathfrak{L}_{Obs}$$
and feed it to your favorite optimizer.
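The loss itself is cheap to write down once the eight density values are in hand; in this sketch they are plain numbers standing in for the learned models, just to show the shape of the computation:

```python
def bayes_consistency_loss(p_q_r, p_r_q, p_q, p_r, p_w_k, p_k_w, p_w, p_k):
    """L = L_Int + L_Obs: squared violations of Bayes' rule on the
    internal partition (q, r) and on the observation pair (w, k)."""
    l_int = (p_q_r - p_r_q * p_q / p_r) ** 2
    l_obs = (p_w_k - p_k_w * p_w / p_k) ** 2
    return l_int + l_obs

# A Bayes-consistent set of densities drives the loss to (effectively) zero:
# 0.3 == 0.6 * 0.2 / 0.4 and 0.5 == 0.25 * 0.1 / 0.05.
print(bayes_consistency_loss(0.3, 0.6, 0.2, 0.4, 0.5, 0.25, 0.1, 0.05))

# An inconsistent set incurs a positive penalty.
print(bayes_consistency_loss(0.9, 0.6, 0.2, 0.4, 0.5, 0.25, 0.1, 0.05))
```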
The first part critically evaluates the interrelationship of model components. The second part critically evaluates the explanatory power of the model relative to empirical observation.