I fixed the determinism problem. When networks are ranked in the tournament, each one now always plays its best move. When generating training data, moves are randomized, with the best move being the most likely.
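Here is a minimal sketch of the two selection modes. It is illustrative rather than the actual implementation: the function name `choose_move`, the `explore` flag, and the use of softmax weighting for the randomized mode are all assumptions, since the only requirement stated above is that the best move be the most likely one.

```python
import math
import random

def choose_move(scores, explore, rng=random):
    """Pick a move from `scores`, a dict of legal move -> network score.

    Hypothetical helper: higher scores are assumed to mean better moves.
    """
    moves = list(scores)
    if not explore:
        # Ranking mode: deterministic, always the highest-scoring move.
        return max(moves, key=lambda m: scores[m])
    # Training-data mode: sample with softmax weights (an assumed scheme),
    # so the best move is the most likely but not the only possible choice.
    top = max(scores.values())
    weights = [math.exp(scores[m] - top) for m in moves]
    return rng.choices(moves, weights=weights, k=1)[0]
```

Calling `choose_move(scores, explore=False)` gives repeatable tournament games, while `explore=True` produces varied games for training.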
I also replaced the two random players with a single new randomized player. This player first looks for a winning move; if there is none, it looks for a move that blocks a win by its opponent. Only if neither exists does it play a completely random move. With this one change, the new player completely dominated the tournament, not losing a single game (which is expected, since the networks haven’t really been trained yet).
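The win, then block, then random priority can be sketched as follows. The game is not named here, so this example assumes tic-tac-toe on a 9-cell list purely for illustration; `LINES`, `completing_move`, and `heuristic_move` are hypothetical names, not the real code.

```python
import random

# Assumption: a 3x3 tic-tac-toe board stored as a list of 9 cells,
# each holding a player mark or None when empty.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def completing_move(board, player):
    """Return a cell that completes a line for `player`, or None."""
    for line in LINES:
        marks = [board[i] for i in line]
        if marks.count(player) == 2 and marks.count(None) == 1:
            return line[marks.index(None)]
    return None

def heuristic_move(board, player, opponent, rng=random):
    # 1. Take a winning move if one exists.
    win = completing_move(board, player)
    if win is not None:
        return win
    # 2. Otherwise block the opponent's winning move.
    block = completing_move(board, opponent)
    if block is not None:
        return block
    # 3. Otherwise play a uniformly random empty cell.
    empty = [i for i, cell in enumerate(board) if cell is None]
    return rng.choice(empty)
```

Checking for a win before a block matters: when both are available, taking the win ends the game immediately.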
I’m now running an overnight training session, so I’ll be able to rank the networks again in the morning.