• A privacy-preserving protocol for age-verified web applications

    Poorly designed bills have recently passed in California, Colorado, New York, and Illinois, requiring intrusive, privacy-compromising age-verification procedures for websites. The motivation is to prevent children and adolescents from accessing pornography and social media. However, the ill-considered means of achieving this may be worse than doing nothing at all: by forcing users to disclose their age online, and forcing sites to process and potentially store this information, a user’s age is exposed to malign actors, especially given the inevitability of data breaches.

    Others have covered the deficiencies of those bills in detail. What I want to do here is sketch an alternative to the recently legislated approaches, which would maximally preserve privacy while providing parents the tools they need to prevent their children’s access to age-inappropriate materials.

    Principles

    • No information about the user disclosed to websites. This is for the obvious reason that any information disclosed becomes a privacy risk, giving malicious actors online power over the user through knowledge of their identity. In the case of age-verification, this would include knowledge of whether the user is a child, proclaiming the vulnerable status of children to the entire Internet so to speak. Even well-intended sites collecting such data can be hacked, and inevitably will be.
    • Parents in the driver’s seat. Parents know their children’s needs better than the state legislature. Operating systems, browsers, and application software should empower parents to choose the appropriate access level for their children.
    • Voluntary before mandatory. Web services and operating system and application developers should be given space to voluntarily implement an age-verification system, rather than legislatures with little technical knowledge attempting to design the system for them. The ESRB video game rating system was developed in this way, and was a great success.
    • Non-users pay no penalty. Users not interested in age verification should not be required to do anything. The Internet and the Web should continue functioning as previously.

    Design

    • Site age requirement declarations in DNS. The Domain Name System is a distributed, decentralized database and as such is a perfect place for site age requirements to be stored. Until such a time as a DNS AGE record (or similar) might be standardized, a TXT record will be used. Details of the record value are given below. The record indicates which ages the domain’s materials are appropriate for, applying to all sub-domains as well.
    • Client-side parental controls and enforcement of age requirements. Client software including the operating system, web browser, and other internet-connected applications should allow parents to indicate the age rating their children can access, and whether to enforce websites’ ratings or something stricter. In practice, all of these scenarios would fetch the AGE record from DNS (or equivalent TXT record), which would not reveal the user’s interest in age requirements to the domain in question. The operating system could prohibit connections to domains for which the age requirement was not met. Alternately, browsers and applications could query the operating system to determine if a given age requirement is met, and refuse to fetch corresponding content. Either way, and crucially, no information is disclosed to the site whose age requirements are being enforced. Only a DNS query is leaked, and this can be avoided if necessary by running one’s own DNS server.

    Age requirements record design

    Age requirements are better expressed relative to an age rather than relative to a date of birth. This prevents needless updates, allowing long TTLs on records.

    We want the age requirements record to be able to specify any subset of ages, where age can be any non-negative quantity of seconds, potentially fractional.

    Consider one possible format:

    < 12 years; >= 70 years

    This would mean “all ages less than 12 years old, and all ages at least 70 years old”.

    Of course this could be minified thus:

    <12y;>=70y

    The more typical requirement of at least 18 years of age would be:

    >= 18 years

    TXT record alternate

    Until an AGE record is formalized, a TXT record with the prefix AGEVERIFY: followed by the requirements could be used instead. For example, a TXT record with a value of AGEVERIFY:>= 18 years would be equivalent to an AGE record with a value of >= 18 years.
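
    As a concreteness check, here is a minimal sketch in Python of parsing these records. The function names, the restriction to year units, and the 365.25-day year are my assumptions, not part of the proposal as written.

```python
import re

YEAR_SECONDS = 365.25 * 86400  # assumed conversion; ages are in seconds

CLAUSE = re.compile(r'^(<=|>=|<|>)\s*(\d+(?:\.\d+)?)\s*(?:y|years)$')

def parse_requirements(record: str):
    """Parse e.g. '<12y;>=70y' or '>= 18 years' into (operator, seconds) pairs."""
    clauses = []
    for raw in record.split(';'):
        m = CLAUSE.match(raw.strip())
        if m is None:
            raise ValueError(f'bad clause: {raw!r}')
        op, qty = m.groups()
        clauses.append((op, float(qty) * YEAR_SECONDS))
    return clauses

def parse_txt_record(value: str):
    """Handle the TXT-record form: an AGEVERIFY: prefix, then the requirements."""
    prefix = 'AGEVERIFY:'
    if not value.startswith(prefix):
        raise ValueError('not an age-verification TXT record')
    return parse_requirements(value[len(prefix):])

def age_allowed(age_seconds: float, clauses) -> bool:
    """The record is a union of clauses: any satisfied clause admits the age."""
    ops = {'<': age_seconds.__lt__, '<=': age_seconds.__le__,
           '>': age_seconds.__gt__, '>=': age_seconds.__ge__}
    return any(ops[op](threshold) for op, threshold in clauses)
```

    Under this sketch, parse_requirements('<12y;>=70y') admits a ten-year-old or a seventy-year-old but not a thirty-year-old.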

    AGE record recursiveness

    This proposal requires that AGE records (and their TXT record equivalents) be interpreted recursively: a subdomain’s effective age requirement is the intersection of its own requirement and those of all parent domains. Subdomain requirements can therefore only become more restrictive, never more permissive. It would be appropriate, for example, for a top-level domain such as .xxx, primarily (or exclusively) hosting adult materials, to declare a minimum age which would apply to all registered subdomains.
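
    A sketch of the recursive interpretation, with each domain’s record reduced to an age predicate (the records map and its contents here are hypothetical, not part of the proposal):

```python
def effective_allowed(domain: str, records: dict, age_seconds: float) -> bool:
    """Check every record on the chain from the TLD down to the full domain;
    an age is admitted only if it satisfies all of them (their intersection)."""
    labels = domain.split('.')
    for i in range(len(labels)):
        suffix = '.'.join(labels[i:])
        predicate = records.get(suffix)
        if predicate is not None and not predicate(age_seconds):
            return False  # an ancestor's requirement can only restrict further
    return True
```

    For example, with a .xxx TLD record requiring 18+ and a site.xxx record requiring 21+, a nineteen-year-old would be admitted by the TLD record but rejected by the subdomain’s stricter one.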

    Analysis

    Unlike the approaches thus far legislated, this protocol would allow parents to restrict their kids’ access to adult materials without revealing anything about their children’s age to websites and applications. It is maximally privacy-preserving and maximally-parent-empowering.

    The burden of implementation would rest on operating systems and/or client applications. However, as it would provide a useful feature to users, such implementations are likely to happen without legislative mandate.

    Site operators would be empowered to declare age requirements. Given the threat of more onerous regulations, sites are incentivized to set their requirements somewhat strictly. If this is seen as insufficient, it could be legislated that certain categories of site must declare certain minimums.

    Conclusion

    It’s my view that legislatures acting on age verification now have jumped the gun and will come to regret the rash and intrusive approaches mandated.

    Mandatory age-verification for online content by means of forced disclosure of age status is perhaps the most dangerous and destructive approach that could be imagined.

    Requiring sites to instead estimate age and elicit proof of age is nearly as dangerous, as this private information will inevitably be stolen in a data breach.

    It would be wiser to give technologists time to develop privacy-respecting means of ensuring children don’t access age-inappropriate materials, rather than mandating draconian and technologically-naive systems which could in fact harm children more than help them.

    As this post demonstrates, it is technically feasible to do this in a way that reveals no information about children online, and which empowers parents to restrict children’s access to adult materials as they deem appropriate. Privacy, parent-empowerment, a preference for voluntary implementation, and requiring no change for non-users make this approach preferable to the draconian mandates currently causing justifiable alarm across the Internet.

    Request for Comments

    Of course this is only an outline of a fully-fleshed specification, and details would need to be filled in. But does this approach hold up? Is there something I’m missing? Please comment on this post or email me at ageverify@joshhansen.tech

  • On the products of Bernoulli random variables

    I took another detour into the world of binary neural networks. In the interests of evaluating dot-product-like functions of two binary vectors, I wanted to assess how the distributions of multiplied bits (equivalent to ANDed bits) relate to the distributions of the individual bits. This post captures the results of this side-trip.

    Products of independent Bernoullis

    The simplest case is independence: $$X \sim Bern(\pi_{X}), \qquad Y \sim Bern(\pi_{Y}), \qquad P(X,Y) = P(X)P(Y)$$

    Let $Z = XY$ be the product of the two bits, equivalent to $Z=X \land Y$. As $Z \in \{0, 1\}$ we can model $Z \sim Bern(\pi_Z)$ for some $\pi_Z$.

    The truth table for the four possible outcomes guides us here:

    $X$   $Y$   $Z$   $P(X)$        $P(Y)$
    0     0     0     $1-\pi_X$     $1-\pi_Y$
    0     1     0     $1-\pi_X$     $\pi_Y$
    1     0     0     $\pi_X$       $1-\pi_Y$
    1     1     1     $\pi_X$       $\pi_Y$

    $P(Z=1) = P(X=1)P(Y=1) = \pi_X \pi_Y$; a bit of arithmetic also shows that $P(Z=0) = 1 - \pi_X \pi_Y$. With $\pi_X \in [0,1] \land \pi_Y \in [0,1] \implies \pi_X \pi_Y \in [0, 1]$, we see that $$Z \sim Bern(\pi_X \pi_Y)$$
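
    This is easy to confirm numerically; a quick Monte Carlo check with arbitrary parameters of my choosing:

```python
import random

random.seed(0)
pi_x, pi_y = 0.7, 0.4
n = 200_000

# Draw independent X and Y bits and average their product.
z_mean = sum((random.random() < pi_x) * (random.random() < pi_y)
             for _ in range(n)) / n

# The empirical frequency should be close to pi_X * pi_Y = 0.28.
assert abs(z_mean - pi_x * pi_y) < 0.01
```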

    Products of Non-Independent Bernoullis

    Suppose that $X \sim Bern(\pi_X)$. A second Bernoulli random variable will be dependent on $X$ only if its single parameter is a function of $X$, so let $Y|X \sim Bern(g(X))$ where $g : \{0, 1\} \to [0,1]$ generates the parameter of $Y$’s distribution.

    As there are only two possible outcomes from such a function $g$ we denote them individually as: $$\pi_{Y|X=0} = g(0)$$ and $$\pi_{Y|X=1} = g(1)$$

    We again are interested in the distribution of a given $Z = XY = X \land Y$, and again the truth table leads us to a straightforward solution:

    $X$   $Y$   $Z$   $P(X)$        $P(Y|X)$
    0     0     0     $1-\pi_X$     $1-\pi_{Y|X=0}$
    0     1     0     $1-\pi_X$     $\pi_{Y|X=0}$
    1     0     0     $\pi_X$       $1-\pi_{Y|X=1}$
    1     1     1     $\pi_X$       $\pi_{Y|X=1}$

    $$P(Z=1) = P(X=1) P(Y=1 | X =1) = \pi_{X} \pi_{Y | X = 1} = \pi_{X} g(1)$$

    A bit of arithmetic shows that $$P(Z=0) = 1 - \pi_{X} \pi_{Y|X=1} = 1 - \pi_{X} g(1)$$

    With $\pi_{X} \in [0,1] \land \pi_{Y|X=1} \in [0,1] \implies \pi_{X} \pi_{Y | X = 1} \in [0,1]$ we conclude that $$Z \sim Bern(\pi_{X} \pi_{Y | X = 1})$$

    The parameter $\pi_{Y|X=0} = g(0)$ for the case where $X=0$ drops out; it is redundant as $Y=1 | X = 0$ is simply another way for $Z$ to equal zero.

    In other words, the distribution of $Z$ depends only on the distributions of $X$ and of $Y | X = 1$, not on $Y | X = 0$. The constraint that all events must result in either one or the other of two outcomes allows the behavior of the combined system to be fully characterized with a constant number of parameters, and the distribution of the product / Boolean AND even of dependent Bernoullis is simply yet another Bernoulli.
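
    Again a quick Monte Carlo check, with toy parameters chosen so that $g(0)$ is large, making its irrelevance to $Z$ visible:

```python
import random

random.seed(1)
pi_x = 0.6
g = {0: 0.9, 1: 0.25}  # g(0) is large, but should not affect Z at all
n = 200_000

count = 0
for _ in range(n):
    x = random.random() < pi_x
    y = random.random() < g[int(x)]  # Y | X ~ Bern(g(X))
    count += int(x and y)

z_mean = count / n
# The empirical frequency should be close to pi_X * g(1) = 0.15.
assert abs(z_mean - pi_x * g[1]) < 0.01
```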

  • A normally-distributed position encoding

    As part of my effort to develop a probabilistic interpretation of transformer language models, I became interested in alternative position encodings to that used in Attention Is All You Need.

    A position encoding can be characterized as some function from a non-negative integer to a vector of reals:

    $$e : \mathbb{N} \to \mathbb{R}^{K}$$

    or as a matrix carrying out the same function for a finite number of sequence positions, where the encoded vector can be used to reconstruct the sequence position, and where deltas between sequence positions can be captured by a linear map.

    The point of the sequence encoding is to inform a language model of where in a sequence a word was located, rather than merely what the word was, and to allow the model to refer to token positions in both absolute and relative terms. Typically the encoding is summed with the token embedding. This can be viewed as a generalization of, rather than alternative to, concatenation.

    Requirements

    My requirements for a position encoding:

    1. That the encoding be similarly invertible by an artificial neural network as is the sine/cosine encoding in common use.
    2. That relative positions can be captured easily by a neural network model, as described in subsection 3.5 of Attention Is All You Need. This means that we can train a linear map which can transform position encodings from one position to the correct position encoding of a relative position. This property seems to hold in general; we do not evaluate it empirically.
    3. That the encoded vector play nicely with probabilistic token embeddings, i.e. have a well-understood statistical distribution. Even though position encodings will be deterministic, it would be helpful to be able to interpret them as just another random variable—one which happens to have all its mass on a single outcome.

    A Normally Distributed Position Encoding

    We might as well start with the best-understood distribution and try to generate normally distributed position encodings.

    More specifically, we want to construct encoding vectors of dimension $K$ such that each element at index $k$ is distributed according to the univariate normal distribution $\mathcal{N}(\mu_k, \sigma_k)$. This is equivalent to a multivariate normal with no covariance between components (a special property of the multivariate normal; lack of covariance does not typically imply independence.)

    To get encodings that are normally distributed according to these distributions (if non-random) we reformulate the problem in terms of the sequence position divided by a maximum length $N$, giving us encodings as functions of real-valued positions distributed evenly within $[0, 1]$:

    $$e(i) = e'(i / N)$$

    (This assumes a known maximum length, which is a disadvantage relative to the sine/cosine encoding.)

    With inputs in $[0, 1]$ we now find a corresponding sample in each of the $K$ normals such that the same percentage of the distribution lies below it: in other words, we use the inverse CDF of the normal distribution, which is commonly available. Let

    $$F^{-1}_k : [0, 1] \to \mathbb{R}$$

    be the inverse CDF of $\mathcal{N}(\mu_k, \sigma_k)$. Then:

    $$e'(x) = [F^{-1}_{1}(x), F^{-1}_{2}(x), …, F^{-1}_{K}(x)]$$ and

    $$e(i) = [F^{-1}_{1}(i/N), F^{-1}_{2}(i/N), …, F^{-1}_{K}(i/N)]$$
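
    A sketch of this encoding using the standard library’s inverse normal CDF. Clamping $i/N$ away from 0 and 1 is my addition, since the inverse CDF diverges at the endpoints:

```python
from statistics import NormalDist

def normal_encoding(i: int, n: int, mus, sigmas, eps: float = 1e-6):
    """e(i) = [F^{-1}_k(i/n)]_k, with i/n clamped into (eps, 1 - eps)."""
    p = min(max(i / n, eps), 1 - eps)
    return [NormalDist(mu, sigma).inv_cdf(p) for mu, sigma in zip(mus, sigmas)]
```

    Each component is monotone in the position, so distinct positions remain distinguishable after encoding.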

    Comparison Encodings

    For evaluation purposes, we investigate the invertibility of the following encodings:

    1. sincos: Attention Is All You Need-style sine/cosine encoding. Even components $2k$ are $\sin(pos / 10000^{2k / K})$ and odd components $2k+1$ are $\cos(pos / 10000^{2k / K})$.
    2. direct1: the first dimension of the encoding equals $i/N$; the rest are zeros.
    3. directN: all dimensions of the encoding equal $i/N$.
    4. linear_normal: the encoding is the position as a decimal ($i/N$) multiplied by a vector of random reals plus a random bias, all initialized as samples from standard normal.
    5. linear_normal_learned: like linear_normal, but the weights and bias are learned during training rather than static.
    6. linear_uniform: like linear_normal, but with weights and bias initialized from a uniform distribution on $[-1, 1]$.
    7. linear_uniform_learned: like linear_uniform, but the weights and bias are learned rather than static.
    8. normal: the normally distributed encoding described above.
    9. normal_learned: like normal, but the parameters of the normals are all learned.

    Inversion

    The invertibility of an encoding is measured by training a function

    $$g : \mathbb{R}^{K} \to [0, 1]$$

    that attempts to recover the original position divided by the sequence length, from the position encoding vector. For simplicity, we let $g$ be an affine transform with sigmoid activation function (in other words, a standard “dense” neural network layer):

    $$g(\vec{x}) = \sigma(\vec{x} \cdot \vec{w}^{T} + b)$$

    where $\sigma$ is the standard logistic function. The loss is the mean squared error; the optimizer is Adam with a learning rate of $10^{-3}$.
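
    A pure-Python miniature of the inverter, with plain gradient descent standing in for Adam and a toy directN-style encoding as the target; everything here is illustrative rather than the experiment’s actual code:

```python
import math
import random

random.seed(0)
K, N, lr = 4, 32, 0.5

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

# Toy encoding to invert: directN-style, every component equals i/N.
data = [([i / N] * K, i / N) for i in range(N + 1)]

w = [random.gauss(0, 1) for _ in range(K)]
b = 0.0
for _ in range(2000):
    for x, target in data:
        y = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        # d(MSE)/d(pre-activation) for a single example
        grad = 2 * (y - target) * y * (1 - y)
        w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
        b -= lr * grad

mse = sum((sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - t) ** 2
          for x, t in data) / len(data)
```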

    Empirical Evaluation

    Each of the nine encodings was evaluated in terms of the speed with which the optimizer could recover the position from the position encoding. The mean across 100 random initializations is shown, trained for 20000 iterations. A few views are shown to highlight different trends; note the axes vary between images.

    • Overview. This view of the entire data series shows clearly that the direct encodings (red and blue) are curiously the worst of all. While directly encoding into all $K$ dimensions instead of just 1 makes inversion of the encoding easier, it still lags far behind other methods. This view also shows that at a certain resolution, over a long enough training time, there is no real distinction between methods.
    • Overview 1k. The first 1000 iterations. In the shorter run, however, the different encodings display markedly different behavior in invertibility. normal and normal_learned are the clear standouts, unexpectedly, as they were formulated for statistical intelligibility, not for invertibility.
    • The “junction” where all encodings perform similarly, prior to diverging around 20-25 iterations.
    • sincos advancement. The early performance of normal, normal_learned, linear_normal_learned, and linear_uniform_learned is shown, with sincos overtaking all but normal and normal_learned by about 1200 iterations.
    • The money shot. Here we see all the essential relationships between the encodings. All but normal, normal_learned, and sincos eventually converge on the same performance as the direct encodings. linear_normal_learned and linear_uniform_learned are interesting for achieving peak inversion performance sooner than sincos; but in the long run they also converge on the performance of direct1 and directN. Meanwhile the normally distributed encodings normal and normal_learned perform best by far until finally being overtaken by sincos late in the game, around 12,000 iterations.

    Implementation

    The following should get the position encoding analysis running:

    git clone https://github.com/joshhansen/MLPortfolio.git
    cd MLPortfolio/pos-enc-invert
    
    # Optional: `git switch --detach faf064b`
    # to get the code version at time of publication.
    
    python -m venv ./venv
    source ./venv/bin/activate
    pip install -r packages.txt
    
    python pos-enc-invert.py

    Data

    The invertibility loss data discussed in this post can be accessed in the GitHub repository.

    Discussion

    It is surprising that the most direct procedures for position encoding (multiplying by a random vector; representing positions as values in $[0,1]$) are the worst performers, all converging upon essentially the same performance in the long run.

    In all cases, allowing the encoding parameters to be learned leads to quicker convergence of the inverter model, though seemingly to the same final inversion loss.

    In all cases, the normally distributed variants are inverted faster than the uniformly distributed ones.

    sincos as expected performs well in both inversion convergence speed and in long-run performance. It was originally selected with care.

    Unexpectedly, given that they were chosen for their well-characterized statistical distribution rather than for ease of inversion, the normally distributed encodings normal and normal_learned converge far faster, to a far lower loss, than all other encodings considered, until being overtaken late by sincos. normal reaches a near-horizontal asymptote by about 400 iterations; normal_learned by about 300.

    Conclusion

    It remains unclear why a position encoding that yields normally distributed (if non-random) vectors is so easy for a neural network to invert—even more so than for the sine/cosine encoding in common use and formulated largely for its invertibility.

    What’s more clear is that the method proposed here should have utility as a standalone position encoding, and may also serve as a useful part of a broader effort to develop probabilistic interpretations of transformer language models.

  • Concept: Rational Reader

    This is a sketch of a solution to Task: text to Bayes rationality.

    The paradigm is Bayesian epistemology. The broader task is to infer a rational worldview from empirical observation. Here, we use a collection of documents as our link to the real world: we observe that somebody created a document like this.

    Roughly speaking, we infer our rational worldview by enforcing Bayes’ rule across all possible combinations of model weights and observations. The engine of this arrangement is a language model conditioned on propositional knowledge, paired with a knowledge model conditioned on language.

    Preliminaries

    In reality, there are at least two Bayes’ rules: the discrete and the continuous. We use the continuous form:

    $$f_{X|Y=y}(x) = f_{Y|X=x}(y) f_{X}(x) / f_{Y}(y)$$

    where each function is a probability density function / conditional density.

    To make a continuous distribution over something discrete like words, we use a traditional word embedding summed with positional encoding, then passed through the PDF of a multivariate normal distribution with inferred mean and covariance matrix. (How this interacts with the positional encoding I’m not clear on….)

    The multivariate normal is particularly useful because it can be arbitrarily marginalized to any subset of components of the random vector; this results from the fact that a multivariate normal parameterized by a linear transform of another’s parameters is also multivariate normal.
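
    In code, this marginalization is just indexing (toy, list-based representation; a real implementation would use a linear-algebra library):

```python
def marginalize(mu, cov, keep):
    """Marginal of a multivariate normal over the kept components: take the
    corresponding sub-vector of the mean and sub-matrix of the covariance."""
    sub_mu = [mu[i] for i in keep]
    sub_cov = [[cov[i][j] for j in keep] for i in keep]
    return sub_mu, sub_cov
```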

    Distributions of interest

    There are five:

    • $P(\vec{w})$—a general language model. This decomposes by the chain rule as $P(\vec{w}) = \Pi_{i} P(w_i | \vec{w}_{j < i})$.

      Implementation: unclear; we need a probabilistic language model; can we get a probabilistic interpretation of a transformer?
    • $P(K)$—a general knowledge model. How likely, a priori, is a belief or statement to be true?

      Implementation: a multivariate normal would be a starting point
    • $P(\vec{w} | K)$—the knowledge-conditional language model. This is the probability of a document $\vec{w}$ given some assertion $K$ about the state of the world, the nature of reality, or whatever. $K$ may make claims about only a subset of reality; the world is a complex place, so it’s helpful to be able to discuss parts of it rather than always the whole. This is enabled by the marginalizability of the multivariate normal as discussed above. Of course by the chain rule this decomposes to $\Pi_{i} P(w_i | \vec{w}_{j < i}, K)$.

      Implementation: uncertain; a multivariate normal parameterized by a transformer with $K$ as input?
    • $P(K | \vec{w})$—the language-conditional knowledge model. Given a word and its context, how likely is an assertion about our model to be true?

      Implementation: uncertain; another probabilistic transformer? A multivariate normal whose parameters are a function of $\vec{w}$, perhaps the output of a transformer?
    • $P(K|J)$ where $K$ and $J$ are disjoint propositions—a hypotheticals model. What does assuming part of our model say about the rest of our model?

      Implementation: multivariate normal parameterized by output of a transformer

    Training Procedure

    Randomly sample word-with-context $\vec{w}$ and knowledge vector $\vec{k}$. Randomly partition $\vec{k}$ into disjoint vectors $\vec{q}$ and $\vec{r}$. Compute the gradient of the loss:

    $$\mathfrak{L}_{Int} = [P(\vec{q} | \vec{r}) - P(\vec{r} | \vec{q}) P(\vec{q}) / P(\vec{r})]^2$$

    $$\mathfrak{L}_{Obs} = [P(\vec{w} | \vec{k}) - P(\vec{k} | \vec{w}) P(\vec{w}) / P(\vec{k})]^2$$

    $$\mathfrak{L} = \mathfrak{L}_{Int} + \mathfrak{L}_{Obs}$$

    and feed it to your favorite optimizer.

    The first part critically evaluates the interrelationship of model components. The second part critically evaluates the explanatory power of the model relative to empirical observation.
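
    A toy version of one training step’s loss, with scalar placeholders standing in for the model’s density outputs (all numbers here are made up for illustration):

```python
def sq_residual(p_a_given_b: float, p_b_given_a: float,
                p_a: float, p_b: float) -> float:
    """[P(a|b) - P(b|a) P(a) / P(b)]^2 -- the squared Bayes-rule violation."""
    return (p_a_given_b - p_b_given_a * p_a / p_b) ** 2

# Hypothetical densities for one sampled (w, k) and one split of k into (q, r).
p_q_r, p_r_q, p_q, p_r = 0.2, 0.5, 0.1, 0.25   # internal part
p_w_k, p_k_w, p_w, p_k = 0.03, 0.6, 0.01, 0.2  # observational part

loss = sq_residual(p_q_r, p_r_q, p_q, p_r) + sq_residual(p_w_k, p_k_w, p_w, p_k)
# These placeholder values happen to satisfy Bayes' rule, so the loss is ~0;
# perturbing any of them makes it positive.
```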

  • Task: text to Bayes rationality

    Language models, at the core, are very stupid, blindly predicting the next word given the preceding words. This leaves them profoundly vulnerable to the biases and inaccuracies of the training data.

    Human annotations are applied late in the game to reduce spurious, hallucinatory, extremist, discriminatory, and other undesired outputs, but this after-the-model reshaping is symptomatic of the fact that there is no critical thinking in the language model proper. It can’t assess the reasonableness of the text it’s generating; only the likelihood that the words would be spoken. Of course, some things are so widely accepted that they go without saying; and others, like propaganda, are repeated precisely because of their untruth.

    This task is to distill from biased textual inputs a rational world-model, so far as it is implicit in the training data.

    What makes a model rational?

    A “rational” model does its best to be both internally consistent and to comport with empirical observations. If this consistency respects Bayes’ rule, it embodies Bayesian epistemology; we here term such a model Bayes rational.

    The first requirement of Bayes rationality, internal consistency, can be seen as adherence to Bayes’ rule between all pairs of model parameters:

    $$P(\theta_i | \theta_j) = P(\theta_j | \theta_i) P(\theta_i) / P(\theta_j)$$

    This can be enforced by minimizing the loss function:

    $$\mathfrak{L}_{Int} \equiv \sum_{i} \sum_{j \neq i} [P(\theta_i | \theta_j) - P(\theta_j | \theta_i) P(\theta_i) / P(\theta_j)]^2$$

    But having internal consistency is not enough to render a worldview rational. Theory must also relate to observation in a manner consistent with Bayes’ rule. Observation consistency applies Bayes’ rule to every model parameter-observation pair:

    $$P(x_i | \theta_j) = P(\theta_j | x_i) P(x_i) / P(\theta_j)$$

    This relationship is captured by the loss:

    $$\mathfrak{L}_{Obs} \equiv \sum_i \sum_{j \neq i} [P(x_i | \theta_j) - P(\theta_j | x_i) P(x_i) / P(\theta_j)]^2$$

    The combination of internal rational consistency and rational consistency with observation is embodied by a Bayes rationality loss:

    $$\mathfrak{L} = \mathfrak{L}_{Int} + \mathfrak{L}_{Obs}$$

    The task of extracting a Bayes rational model of the world is complete to the extent that this loss is minimized.
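
    As a sanity check on the loss definitions: conditionals derived from a joint by the definition of conditional probability satisfy Bayes’ rule, so each term vanishes, while any perturbed conditional contributes positively (toy, discrete numbers of my choosing):

```python
def bayes_residual(p_i_given_j: float, p_j_given_i: float,
                   p_i: float, p_j: float) -> float:
    """One term of the Bayes rationality loss."""
    return (p_i_given_j - p_j_given_i * p_i / p_j) ** 2

# Joint over two binary "parameters": P(a)=0.4, P(b)=0.5, P(a,b)=0.3.
p_a, p_b, p_ab = 0.4, 0.5, 0.3
p_a_given_b = p_ab / p_b  # 0.6, by the definition of conditional probability
p_b_given_a = p_ab / p_a  # 0.75

# Consistent conditionals: this term of the loss is (numerically) zero.
loss = bayes_residual(p_a_given_b, p_b_given_a, p_a, p_b)
```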

    Art by emma marie andersson under a Creative Commons license.

  • Dataset: Wikimedia Commons Image Super-Resolution

    TODO full description

    TODO make available on torrent?

    A large dataset of lossless images from Wikimedia Commons, with 25%, 50%, and 128×128 downscales, plus train and validation splits for an image super-resolution task.

  • Task: image to video

    Given a single frame, produce the video that would follow it, as would occur in a movie.

    https://paperswithcode.com/task/image-to-video