Category: Projects

Projects I’m working on

  • A normally-distributed position encoding

    As part of my effort to develop a probabilistic interpretation of transformer language models, I became interested in alternative position encodings to that used in Attention Is All You Need.

    A position encoding can be characterized as some function from a non-negative integer to a vector of reals:

    $$e : \mathbb{N} \to \mathbb{R}^{K}$$

    or as a matrix carrying out the same function for a finite number of sequence positions, where the encoded vector can be used to reconstruct the sequence position, and where deltas between sequence positions can be captured by a linear map.

    The point of the position encoding is to inform a language model of where in a sequence a word is located, rather than merely what the word is, and to allow the model to refer to token positions in both absolute and relative terms. Typically the encoding is summed with the token embedding; this can be viewed as a generalization of, rather than an alternative to, concatenation.
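
    To see the concatenation point concretely, here is a toy sketch (the dimensions and values are invented for illustration): if the token embedding and the position encoding happen to occupy disjoint dimensions, their sum is exactly their concatenation, so summing simply drops the disjointness requirement.

    import numpy as np

    # Token embedding occupies the first 4 dimensions; position encoding the last 4.
    token_emb = np.array([0.3, -1.2, 0.7, 0.05, 0.0, 0.0, 0.0, 0.0])
    pos_enc = np.array([0.0, 0.0, 0.0, 0.0, 0.1, 0.9, -0.4, 0.2])

    # Summing the two vectors reproduces their concatenation exactly.
    assert np.allclose(token_emb + pos_enc,
                       np.concatenate([token_emb[:4], pos_enc[4:]]))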

    Requirements

    My requirements for a position encoding:

    1. That the encoding be invertible by an artificial neural network, much as the sine/cosine encoding in common use is.
    2. That relative positions can be captured easily by a neural network model, as described in subsection 3.5 of Attention Is All You Need. This means that we can train a linear map which transforms the encoding of one position into the correct encoding of a relative position (see the sketch after this list). This property seems to hold in general; we do not evaluate it empirically.
    3. That the encoded vector play nicely with probabilistic token embeddings, i.e. have a well-understood statistical distribution. Even though position encodings will be deterministic, it would be helpful to be able to interpret them as just another random variable—one which happens to have all its mass on a single outcome.
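
    As an illustration of requirement 2 (a sketch only, not part of the evaluation below), we can fit a single linear map by least squares that shifts any sine/cosine encoding forward by a fixed offset; for that encoding the fit is essentially exact. The dimensions and offset here are arbitrary.

    import numpy as np

    def sincos(pos: int, K: int) -> np.ndarray:
        """Attention Is All You Need-style encoding; assumes K is even."""
        k = np.arange(0, K, 2)  # the values 2k = 0, 2, 4, ...
        enc = np.empty(K)
        enc[0::2] = np.sin(pos / 10000 ** (k / K))
        enc[1::2] = np.cos(pos / 10000 ** (k / K))
        return enc

    K, N, delta = 16, 512, 5
    E = np.stack([sincos(i, K) for i in range(N - delta)])                  # rows are e(i)
    E_shifted = np.stack([sincos(i + delta, K) for i in range(N - delta)])  # rows are e(i + delta)

    # Least-squares fit of T such that E @ T ~= E_shifted.
    T, *_ = np.linalg.lstsq(E, E_shifted, rcond=None)
    max_err = np.abs(E @ T - E_shifted).max()  # essentially zero for sincos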

    A Normally Distributed Position Encoding

    We might as well start with the best-understood distribution and try to generate normally distributed position encodings.

    More specifically, we want to construct encoding vectors of dimension $K$ such that each element at index $k$ is distributed according to the univariate normal distribution $\mathcal{N}(\mu_k, \sigma_k)$. This is equivalent to a multivariate normal with no covariance between components (an equivalence special to the multivariate normal; in general, zero covariance does not imply independence).

    To get encodings that follow these distributions (albeit deterministically), we reformulate the problem in terms of the sequence position divided by a maximum length $N$, giving us encodings as functions of real-valued positions spread evenly within $[0, 1]$:

    $$e(i) = e'(i / N)$$

    (This assumes a known maximum length, which is a disadvantage relative to the sine/cosine encoding.)

    With inputs in $[0, 1]$ we now find, for each of the $K$ normals, the value below which that same fraction of the distribution lies: in other words, we apply the inverse CDF of the normal distribution, which is commonly available. Let

    $$F^{-1}_k : [0, 1] \to \mathbb{R}$$

    be the inverse CDF of $\mathcal{N}(\mu_k, \sigma_k)$. Then:

    $$e'(x) = [F^{-1}_{1}(x), F^{-1}_{2}(x), …, F^{-1}_{K}(x)]$$ and

    $$e(i) = [F^{-1}_{1}(i/N), F^{-1}_{2}(i/N), …, F^{-1}_{K}(i/N)]$$
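
    A minimal sketch of this construction, assuming SciPy's norm.ppf as the inverse CDF (the actual implementation in the repository may differ). Since $F^{-1}(0)$ and $F^{-1}(1)$ are infinite, the sketch maps integer positions into the open interval $(0, 1)$ via $(i+1)/(N+1)$; that shift is an illustrative choice, not part of the definition above.

    import numpy as np
    from scipy.stats import norm

    def normal_encoding(i: int, N: int, mus: np.ndarray, sigmas: np.ndarray) -> np.ndarray:
        """Encode position i (0 <= i < N) as K inverse-CDF values, one per normal."""
        x = (i + 1) / (N + 1)                      # real-valued position strictly inside (0, 1)
        return norm.ppf(x, loc=mus, scale=sigmas)  # shape (K,)

    K, N = 64, 512
    mus, sigmas = np.zeros(K), np.ones(K)  # K standard normals
    enc = normal_encoding(10, N, mus, sigmas)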

    Comparison Encodings

    For evaluation purposes, we investigate the invertibility of the following encodings:

    1. sincos: Attention Is All You Need-style sine/cosine encoding. Even components $2k$ are $\sin(pos / 10000^{2k / K})$ and odd components $2k+1$ are $\cos(pos / 10000^{2k / K})$.
    2. direct1: the first dimension of the encoding equals $i/N$; the rest are zeros.
    3. directN: all dimensions of the encoding equal $i/N$.
    4. linear_normal: the encoding is the position as a decimal ($i/N$) multiplied by a vector of random weights, plus a random bias, all initialized as samples from a standard normal (see the sketch after this list).
    5. linear_normal_learned: like linear_normal, but the weights and bias are learned during training rather than static.
    6. linear_uniform: like linear_normal, but with weights and bias initialized from a uniform distribution on $[-1, 1]$.
    7. linear_uniform_learned: like linear_uniform, but the weights and bias are learned rather than static.
    8. normal: the normally distributed encoding described above.
    9. normal_learned: like normal, but the parameters of the normals are all learned.
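
    For concreteness, here is a sketch of a few of the simpler comparison encodings (the names, dimensions, and random seed are illustrative rather than taken from the repository):

    import numpy as np

    rng = np.random.default_rng(0)
    K, N = 64, 512

    def direct1(i: int) -> np.ndarray:
        """First dimension is i / N; the rest are zeros."""
        e = np.zeros(K)
        e[0] = i / N
        return e

    def directN(i: int) -> np.ndarray:
        """All K dimensions equal i / N."""
        return np.full(K, i / N)

    w = rng.standard_normal(K)  # static random weights...
    b = rng.standard_normal(K)  # ...and bias, sampled from a standard normal

    def linear_normal(i: int) -> np.ndarray:
        """Position as a decimal, scaled by random weights, plus a random bias."""
        return (i / N) * w + b

    The *_learned variants treat the weights, bias, or normal parameters as trainable rather than fixed.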

    Inversion

    The invertibility of an encoding is measured by training a function

    $$g : \mathbb{R}^{K} \to [0, 1]$$

    that attempts to recover the original position divided by the sequence length, from the position encoding vector. For simplicity, we let $g$ be an affine transform with sigmoid activation function (in other words, a standard “dense” neural network layer):

    $$g(\vec{x}) = \sigma(\vec{x} \cdot \vec{w}^{T} + b)$$

    where $\sigma$ is the standard logistic function. The loss is the mean squared error; the optimizer is Adam with a learning rate of $10^{-3}$.
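
    A minimal sketch of the inverter, assuming PyTorch (the repository may structure this differently), where encodings is a batch of encoding vectors and targets holds the corresponding positions divided by $N$:

    import torch

    K = 64
    g = torch.nn.Sequential(
        torch.nn.Linear(K, 1),  # affine transform: x . w^T + b
        torch.nn.Sigmoid(),     # logistic activation, squashing the output into (0, 1)
    )
    optimizer = torch.optim.Adam(g.parameters(), lr=1e-3)
    loss_fn = torch.nn.MSELoss()

    def train_step(encodings: torch.Tensor, targets: torch.Tensor) -> float:
        """One Adam step on a batch of (encoding, position / N) pairs."""
        optimizer.zero_grad()
        loss = loss_fn(g(encodings).squeeze(-1), targets)
        loss.backward()
        optimizer.step()
        return loss.item()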

    Empirical Evaluation

    Each of the nine encodings was evaluated in terms of the speed with which the optimizer could recover the position from the position encoding. The mean loss across 100 random initializations, each trained for 20,000 iterations, is shown. A few views are presented to highlight different trends; note that the axes vary between images.

    Overview. This view of the entire data series shows clearly that the direct encodings (red and blue) are curiously the worst of all. While directly encoding into all $K$ dimensions instead of just 1 makes inversion of the encoding easier, it still lags far behind other methods. This view also shows that at a certain resolution, over a long enough training time, there is no real distinction between methods.
    Overview 1k. The first 1000 iterations. In the shorter run, however, the different encodings display markedly different behavior in invertibility. normal and normal_learned are the clear standouts, unexpectedly as they were formulated for statistical intelligibility, not for invertibility.
    The “junction” where all encodings perform similarly, prior to diverging around 20-25 iterations.
    sincos advancement. The early performance of normal, normal_learned, linear_normal_learned, and linear_uniform_learned is shown, with sincos overtaking all but normal and normal_learned by about 1200 iterations.
    The money shot. Here we see all the essential relationships between the encodings. All but normal, normal_learned, and sincos eventually converge on the same performance as the direct encodings. linear_normal_learned and linear_uniform_learned are interesting for reaching peak inversion performance sooner than sincos, but in the long run they too converge on the performance of direct1 and directN. Meanwhile, the normally distributed encodings normal and normal_learned perform best by far until finally being overtaken by sincos late in the game, around 12,000 iterations.

    Implementation

    The following should get the position encoding analysis running:

    git clone https://github.com/joshhansen/MLPortfolio.git
    
    # Optional: `git switch --detach faf064b`
    # to get the code version at time of publication.
    
    cd MLPortfolio/pos-enc-invert
    
    python -m venv ./venv
    
    source ./venv/bin/activate
    
    pip install -r packages.txt
    
    python pos-enc-invert.py

    Data

    The invertibility loss data discussed in this post can be accessed in the GitHub repository.

    Discussion

    It is surprising that the most direct procedures for position encoding (multiplying by a random vector; representing positions as values in $[0,1]$) are the worst performers, all converging upon essentially the same performance in the long run.

    In all cases, allowing the encoding parameters to be learned leads to quicker convergence of the inverter model, but seemingly to the same final inversion loss.

    In all cases, the normally distributed encodings are inverted faster than the uniformly distributed ones.

    sincos, as expected, performs well in both inversion convergence speed and long-run performance; it was, after all, originally selected with care.

    Unexpectedly, given that they were chosen for their well-characterized statistical distribution rather than for ease of inversion, the normally distributed encodings normal and normal_learned converge far faster, to a far lower loss, than all other encodings considered, until being overtaken late by sincos. normal reaches a near-horizontal asymptote by about 400 iterations, and normal_learned by about 300.

    Conclusion

    It remains unclear why a position encoding that yields normally distributed (if non-random) vectors is so easy for a neural network to invert, even more so than the sine/cosine encoding in common use, which was formulated largely with invertibility in mind.

    What’s more clear is that the method proposed here should have utility as a standalone position encoding, and may also serve as a useful part of a broader effort to develop probabilistic interpretations of transformer language models.

  • Ghee 0.6 – the tastiest way to manage your data

    Introducing Ghee 0.6, the latest version of the tastiest way to manage your data!

    Ghee is an experiment in leveraging modern filesystem features to implement a data management system, providing a key-value database, Git-style commits, and extensive tools for manipulation of extended attributes (xattrs).

    The focus of this release is the introduction of the commit-management subcommands commit, log, restore, and reset. These are modeled after their Git equivalents, but utilize Btrfs copy-on-write filesystem semantics, including read-only snapshots, to efficiently track changes.

    Using Btrfs generalizes change tracking, efficiently handling not only text files, but arbitrary binary blobs as well.

    It is hoped that this could lead to a version control system that handles large files in an integrated manner, whereas large file support in Git is tacked on separately – in the case of Git-LFS, requiring an additional server to implement.

    When using Ghee as a database, the commit, restore, and reset commands provide transaction-like functionality, allowing modifications to be built incrementally but finalized all at once – and rolled back when mistakes are made.

    The main question about what Ghee will be good for is how efficiently it can handle actual database workloads. I suspect the answer will be: not very well, judging by Michael Sproul’s experience with Butter DB, which is built on a similar architecture.

    But for many workflows, it’s not necessary to serve large numbers of queries: Ghee would be well-suited for data scientists developing datasets and statistical models, for example.

    It will be interesting to see what happens here at the meeting-place of databases, filesystems, and version control systems. Imagine adding WHERE clauses to your merge command, a la ghee merge origin annotations-20231002 -w age>45.

    Now hosted at Codeberg… check it out and, as always, send bug reports my way!

  • Ghee 0.4 – The tastiest way to work with Linux extended attributes (xattrs)

    Introducing Ghee 0.4, the newest release of the premier tool for manipulating Linux extended attributes! (Coverage of the 0.3 release is available on Reddit.)

    This latest release adds a Rustyline-based REPL and additional tools for using the filesystem as a relational database. The new init subcommand lets you declare the primary key by which a directory (and its subdirectories) are indexed, while ins and del now allow insertion and deletion of records while keeping related indices up to date. ls is helpful in the REPL, showing Ghee’s view of the world.

    In addition to direct management of extended attributes, Ghee is designed to implement a relational data model built around xattrs while offloading as much functionality as feasible to the filesystem.

    As such, Ghee does nothing special to ensure the integrity of stored data. You are encouraged to provide integrity not on top of Ghee but underneath it, through your choice of filesystem. For example, ZFS, Btrfs, and Bcachefs all provide checksum-based integrity checks.

    Next steps include filling in missing features in existing subcommands and using copy-on-write snapshots to provide a Git-inspired workflow, something like:

    • ghee diff ./people: show how the ./people table has changed since last commit
    • ghee commit ./people -m "message!": commit the ./people table in its current form with message “message!”
    • ghee restore ./people gf037d2c98: restore the ./people table to its state in commit gf037d2c98
    • ghee log ./people: show commit messages for all commits in the ./people table.

    As I am a user of Btrfs, CoW-based features will be implemented with Btrfs in mind first. If this proves successful it could be extended to other filesystems.

    Of course, I hope it goes without saying that version 0.4 of any project should NOT be used in mission-critical contexts where the cost of data loss would be substantial.

    Thanks in advance for any thoughts, questions, or suggestions!

  • Ghee 0.3 – The tastiest way to work with Linux extended attributes (xattrs)

    Introducing Ghee 0.3, the newest release of the premier tool for manipulating Linux extended attributes!

    Originally known as Hatter and then, regrettably, as Mattress, this tastiest of tools has been redubbed Ghee after the clarified butter popular in Indian cuisine, and as a reference to the Btrfs filesystem, which originally convinced me that much database functionality has now been subsumed by advanced filesystem features.

    This new release adds SQL WHERE-style predicates to filter by, e.g. ghee get --where age >= 65 ./people, and makes get recursive by default (the old behavior is still available behind the --flat flag).

    The idea is for Ghee to implement as much of a relational data model as possible using the filesystem itself as a substrate. Design principles:

    1. Folders are tables
    2. Files are records
    3. Relative paths are primary keys
    4. Extended attributes are non-primary-key columns
    5. Enforce schema only when present
    6. The file contents are user-controlled; only directory structure, filenames, and extended attributes are used by Ghee
    7. Use of filesystem features should be preferred over implementing features directly in Ghee, e.g. locking, Btrfs subvolumes, snapshots, incremental backup
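
    To make principles 1-4 concrete, here is a sketch using only standard Linux xattr calls from Python, with no Ghee involved; the table and column names are illustrative:

    import os
    from pathlib import Path

    # A "people" table: the folder is the table, the file is a record,
    # and the relative path ("Sofia") is the primary key.
    table = Path("./people")
    table.mkdir(exist_ok=True)
    record = table / "Sofia"
    record.touch()  # file contents are left entirely to the user

    # Non-primary-key columns live in extended attributes (Linux only).
    os.setxattr(record, "user.id", b"1")
    os.setxattr(record, "user.state", b"WA")

    print(os.listxattr(record))               # ['user.id', 'user.state']
    print(os.getxattr(record, "user.state"))  # b'WA'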

    Would love to hear any comments. Apologies for the name changes—third time’s the charm, I think this one’ll stick.

  • Mattress 0.2.1 (formerly Hatter)

    Mattress is a command line tool for working with Linux extended attributes (xattrs).

    Because someone else’s awesome project already occupied the hatter crate, I’ve changed the name of my project from “Hatter” to “Mattress” which, weird as it is, has the advantage of actually including “attr” as a substring.

    The executable name has correspondingly changed from htr to mtr.

    This version begins the introduction of simple database-like features, implemented using the filesystem and extended attributes as a substrate.

    Mattress sees the world in a peculiar way: it interprets a filesystem folder as a database table with one record for each file in the folder, indexed by the “primary key” of the filename.

    A nested hierarchy of directories is seen by Mattress as a database table indexed by the compound key corresponding to the nested subpath, with one “record” per file contained under the folder, recursively.

    Consider this folder ./people of personnel records:

    $ mtr get ./people/*
    ./people/Sandeep    user.id     2
    ./people/Sandeep    user.name   Sandeep
    ./people/Sandeep    user.state  CA
    ./people/Sofia      user.id     1
    ./people/Sofia      user.name   Sofia
    ./people/Sofia      user.state  WA
    ./people/Wulfrum    user.id     0
    ./people/Wulfrum    user.name   Wulfrum
    ./people/Wulfrum    user.state  CA

    Suppose we want to index not by the name as now, but by the id. We can do this using the new idx command.

    $ mtr idx -v -k id ./people ./people:id
    ./people/Sandeep -> ./people:id/2
    ./people/Sofia -> ./people:id/1
    ./people/Wulfrum -> ./people:id/0

    The arrows show the hardlinks mapping the original ./people folder to the indexed view ./people:id.

    We can also index by compound keys, such as here where we index by (state,id):

    $ mtr idx -v -k state -k id ./people ./people:state:id
    ./people/Sandeep -> ./people:state:id/CA/2
    ./people/Sofia -> ./people:state:id/WA/1
    ./people/Wulfrum -> ./people:state:id/CA/0

    I have some “magic” planned to speed up the get command and ease the ergonomics (letting you reference e.g. state, which will be taken from the path rather than from the per-file xattrs.) Eventually I’d like to allow for SQL SELECT-style conditions, but that’s for another day.

    (Note: this project is now known as Ghee.)

  • Hatter: a command line tool for working with Linux extended attributes (xattrs)

    In my current, semi-stealth machine learning project, I’m experimenting with using Linux filesystem extended attributes as a sort of “poor man’s” database to store annotations in.

    I’m not really sure this is turning out better than using, say, a SQLite database, but it’s been interesting to try.

    In the process, I built my own command line tool for manipulating xattrs, which I’m releasing under a GPL3 license, called Hatter. It’s written in Rust since that’s what I’ve mostly been writing the past few months.

    There are probably many bugs, but it’s working alright for me, so I figured I’d unleash it on the world. Just… don’t go crazy

    https://git.disroot.org/joshhansen/Hatter

    (NOTE: This project is now known as Ghee.)