Language models, at their core, are very stupid: they blindly predict the next word given the preceding words. This leaves them profoundly vulnerable to the biases and inaccuracies of their training data.
Human annotations are applied late in the game to reduce spurious, hallucinatory, extremist, discriminatory, and other undesired outputs, but this after-the-fact reshaping is symptomatic of the fact that there is no critical thinking in the language model proper. It can’t assess the reasonableness of the text it’s generating, only the likelihood that the words would be spoken. Of course, some things are so widely accepted that they go without saying, while others, like propaganda, are repeated precisely because of their untruth: how often something is said is a poor guide to whether it is true.
The task, then, is to distill from biased textual inputs a rational world model, insofar as one is implicit in the training data.
What makes a model rational?
A “rational” model does its best both to be internally consistent and to comport with empirical observations. If this consistency respects Bayes’ rule, it embodies Bayesian epistemology; here we term such a model Bayes rational.
The first requirement of Bayes rationality, internal consistency, can be seen as adherence to Bayes’ rule between all pairs of model parameters:
$$P(\theta_i | \theta_j) = P(\theta_j | \theta_i) P(\theta_i) / P(\theta_j)$$
This can be enforced by minimizing the loss function:
$$\mathfrak{L}_{Int} \equiv \sum_{i} \sum_{j \neq i} [P(\theta_i | \theta_j) - P(\theta_j | \theta_i) P(\theta_i) / P(\theta_j)]^2$$
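To make this concrete, here is a minimal PyTorch sketch of $\mathfrak{L}_{Int}$, assuming the model parameters have been discretized so that the pairwise conditionals and marginals can be held in tables. The names `cond` and `p_theta` and the table layout are illustrative assumptions, not anything prescribed above.

```python
import torch

def internal_consistency_loss(cond: torch.Tensor, p_theta: torch.Tensor) -> torch.Tensor:
    """L_Int: squared deviation from Bayes' rule over all ordered
    parameter pairs (i, j) with i != j.

    cond[i, j]  ~ P(theta_i | theta_j)   (hypothetical estimated table)
    p_theta[i]  ~ P(theta_i)             (hypothetical estimated marginals)
    """
    # Bayes' rule prediction: bayes[i, j] = P(theta_j | theta_i) * P(theta_i) / P(theta_j)
    bayes = cond.T * p_theta.unsqueeze(1) / p_theta.unsqueeze(0)
    sq = (cond - bayes) ** 2
    # Exclude the diagonal terms (i == j)
    mask = 1.0 - torch.eye(cond.shape[0])
    return (sq * mask).sum()
```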
But having internal consistency is not enough to render a worldview rational. Theory must also relate to observation in a manner consistent with Bayes’ rule. Observation consistency applies Bayes’ rule to every model parameter-observation pair:
$$P(x_i | \theta_j) = P(\theta_j | x_i) P(x_i) / P(\theta_j)$$
This relationship is captured by the loss:
$$\mathfrak{L}_{Obs} \equiv \sum_{i} \sum_{j} [P(x_i | \theta_j) - P(\theta_j | x_i) P(x_i) / P(\theta_j)]^2$$
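Under the same assumptions, a companion sketch for $\mathfrak{L}_{Obs}$, with hypothetical tables `lik` holding the likelihoods $P(x_i | \theta_j)$ and `post` holding the posteriors $P(\theta_j | x_i)$:

```python
def observation_consistency_loss(lik: torch.Tensor, post: torch.Tensor,
                                 p_x: torch.Tensor, p_theta: torch.Tensor) -> torch.Tensor:
    """L_Obs: squared deviation from Bayes' rule over every
    observation-parameter pair (x_i, theta_j).

    lik[i, j]  ~ P(x_i | theta_j)   (hypothetical likelihood table)
    post[j, i] ~ P(theta_j | x_i)   (hypothetical posterior table)
    p_x[i]     ~ P(x_i),  p_theta[j] ~ P(theta_j)
    """
    # Bayes' rule prediction: bayes[i, j] = P(theta_j | x_i) * P(x_i) / P(theta_j)
    bayes = post.T * p_x.unsqueeze(1) / p_theta.unsqueeze(0)
    return ((lik - bayes) ** 2).sum()
```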
The combination of internal rational consistency and rational consistency with observation is embodied by a Bayes rationality loss:
$$\mathfrak{L} = \mathfrak{L}_{Int} + \mathfrak{L}_{Obs}$$
The task of extracting a Bayes rational model of the world is complete to the extent that this loss is minimized.
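As a toy illustration of the minimization, one could treat the probability tables themselves as free parameters (softmax-normalized logits) and drive $\mathfrak{L} = \mathfrak{L}_{Int} + \mathfrak{L}_{Obs}$ toward zero by gradient descent. The sizes, learning rate, and step count below are arbitrary assumptions; in the intended setting the tables would be distilled from a language model rather than optimized freely.

```python
# Hypothetical sizes for a discretized toy problem.
n_theta, n_x = 4, 6

logits = {
    "cond":    torch.randn(n_theta, n_theta, requires_grad=True),  # -> P(theta_i | theta_j)
    "lik":     torch.randn(n_x, n_theta, requires_grad=True),      # -> P(x_i | theta_j)
    "post":    torch.randn(n_theta, n_x, requires_grad=True),      # -> P(theta_j | x_i)
    "p_theta": torch.randn(n_theta, requires_grad=True),
    "p_x":     torch.randn(n_x, requires_grad=True),
}
opt = torch.optim.Adam(list(logits.values()), lr=0.05)

for step in range(500):
    opt.zero_grad()
    # Softmax over dim 0 keeps each column a normalized distribution.
    cond    = torch.softmax(logits["cond"], dim=0)
    lik     = torch.softmax(logits["lik"], dim=0)
    post    = torch.softmax(logits["post"], dim=0)
    p_theta = torch.softmax(logits["p_theta"], dim=0)
    p_x     = torch.softmax(logits["p_x"], dim=0)

    loss = (internal_consistency_loss(cond, p_theta)
            + observation_consistency_loss(lik, post, p_x, p_theta))
    loss.backward()
    opt.step()
```

Note that free-floating tables like these can satisfy Bayes’ rule trivially; the sketch only shows the mechanics of the minimization, whereas in the distillation task the tables would be constrained by the language model’s content.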
Art by emma marie andersson under a Creative Commons license.