We have liftoff!
We're now beating the random baseline on a 10x10 board, with a greedy policy trained only on self-play data:
Evaluating umpire AIs:
r wins: 459
./ai/agz/6.agz wins: 541
Draws: 0
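A head-to-head tally like the one above can be reproduced in miniature. Everything below is a stand-in, not the Umpire engine: a simple subtraction game ("take 1-3 flags, last flag wins") plays the role of the game, and an optimal policy plays the role of the trained model. The evaluation shape - alternate who moves first, count wins per side - is the same.

```python
import random

def random_policy(count, rng):
    # Baseline: take a uniformly random legal number of flags.
    return rng.randint(1, min(3, count))

def optimal_policy(count, rng):
    # Take (count mod 4) flags when that's a legal winning move,
    # otherwise fall back to a random legal move.
    take = count % 4
    return take if 1 <= take <= min(3, count) else random_policy(count, rng)

def play_game(first, second, rng, start=21):
    """Players alternately take 1-3 flags; whoever takes the last flag
    wins. Returns 0 if `first` wins, 1 if `second` wins."""
    count, turn = start, 0
    policies = (first, second)
    while count > 0:
        count -= policies[turn](count, rng)
        if count == 0:
            return turn
        turn = 1 - turn

def evaluate(policy_a, policy_b, n_games, seed=0):
    rng = random.Random(seed)
    wins = [0, 0]  # [policy_a wins, policy_b wins]
    for g in range(n_games):
        # Alternate who moves first to remove first-mover bias.
        if g % 2 == 0:
            wins[play_game(policy_a, policy_b, rng)] += 1
        else:
            wins[1 - play_game(policy_b, policy_a, rng)] += 1
    return wins
```

Alternating the first mover matters in any turn-based evaluation; without it, a first-move advantage can masquerade as a stronger policy.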
The model is a basic convolutional neural network whose input is the context surrounding the city or unit taking the next action. The weights serialize to 166KB, so it's easy to ship alongside the game.
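It's easy to see how a small CNN lands in that weight-file ballpark by counting parameters. The layer sizes below are purely illustrative - the post only says "basic CNN", so none of these channel counts or window sizes are the actual Umpire architecture:

```python
def conv2d_params(c_in, c_out, k):
    # One k x k kernel per (input, output) channel pair, plus a bias per output channel.
    return c_in * c_out * k * k + c_out

def dense_params(n_in, n_out):
    return n_in * n_out + n_out

# Hypothetical architecture: two 3x3 conv layers over a featurized
# tile context, flattened over a 5x5 window, then a small dense head
# producing a scalar action value.
params = (
    conv2d_params(16, 32, 3)
    + conv2d_params(32, 32, 3)
    + dense_params(32 * 5 * 5, 48)
    + dense_params(48, 1)
)
bytes_f32 = params * 4  # one f32 per weight
print(params, "params,", bytes_f32 / 1024, "KB")
```

A few tens of thousands of f32 weights come out to the low hundreds of kilobytes before any compression, which is the right order of magnitude for a model bundled with a game binary.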
The key to training was to increase the number of training instances - "The Unreasonable Effectiveness of Data", after all. With purely random algorithms doing the self-play, data generation can run extremely fast.
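The data-collection loop for random self-play can be sketched with a toy game in place of the Umpire engine. Here Nim stands in: record every (state, action, player) transition, then label each record with the outcome from that player's point of view once the game ends.

```python
import random

def random_selfplay_episode(rng, piles=(3, 4, 5)):
    """One episode of random-vs-random Nim. Returns a list of
    (state, action, outcome) training instances, where outcome is
    +1 if the acting player went on to win and -1 otherwise."""
    state = list(piles)
    player, transitions = 0, []
    while any(state):
        pile = rng.choice([i for i, n in enumerate(state) if n > 0])
        take = rng.randint(1, state[pile])
        transitions.append((tuple(state), (pile, take), player))
        state[pile] -= take
        if not any(state):
            winner = player  # normal play: taking the last object wins
        player = 1 - player
    return [(s, a, 1 if p == winner else -1) for s, a, p in transitions]

def generate_dataset(n_episodes, seed=0):
    rng = random.Random(seed)
    data = []
    for _ in range(n_episodes):
        data.extend(random_selfplay_episode(rng))
    return data
```

Because the players are random, each episode costs almost nothing beyond the engine's own move execution, so the instance count scales linearly with however much compute you throw at it.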
The next obstacle will be the transport mechanic: land units in Umpire must board a transport ship to move between continents. Since air and sea units cannot capture cities, an AI must understand this mechanic at some level to operate on the full 180x90 map, where large stretches of ocean separate the land masses.
With purely random players, the likelihood of this mechanic getting triggered and thus entering the training data seems fairly low. But if we throw enough training episodes at it, we'll see it eventually. This may necessitate multithreading the self-play code to run multiple games simultaneously, and optimizing the game engine for throughput.
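The multithreaded self-play idea can be sketched as fanning episodes out across a worker pool and counting how often the rare event shows up. The per-turn transport probability below is an invented stand-in for the engine's actual behavior, and Python threads are only illustrative here - a Rust engine would get true parallelism from native threads or processes.

```python
import random
from concurrent.futures import ThreadPoolExecutor

# Assumed per-turn odds of random play triggering the transport
# mechanic; a biased coin stands in for the real engine check.
TRANSPORT_CHANCE = 0.002

def plays_out_transport(seed, turns=300):
    """Simulate one random-play episode and report whether the rare
    transport event fired at least once."""
    rng = random.Random(seed)
    return any(rng.random() < TRANSPORT_CHANCE for _ in range(turns))

def count_transport_episodes(n_episodes, workers=4):
    # Each episode is independent, so they parallelize trivially:
    # map seeds to workers and sum the hits.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(plays_out_transport, range(n_episodes)))
```

Even at long odds per turn, enough parallel episodes surface the mechanic: at 0.2% per turn over 300 turns, roughly 45% of episodes contain at least one transport event, so the bottleneck is raw episode throughput rather than luck.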