My laptop and my HTPC are very different machines. My laptop has a fast dual-core processor, while my HTPC has a slower quad-core. My laptop has a mid-range mobile GPU (NVIDIA GTX 860M), whereas the HTPC has a much better desktop GPU (NVIDIA GTX 1060).
I bring this up because, while adding to the graphical version of my testing suite, I ran some tests and wanted to compare the results between the two machines. What I found was that for small batches the laptop would dominate, and the HTPC wouldn’t really come into its own until the batch size was sufficiently increased.
Everything up to now has been running in a single (background) thread on the CPU. This means that small batches, which devote a much larger percentage of their workload to administrative overhead, are processed more quickly on my laptop. Large batches allow the GTX 1060 to dominate, but at small batch sizes that advantage isn’t enough to make up for the slower quad core.
So, I began looking at ways to multithread the workload. Ideally this wouldn’t matter too much, as the main bottleneck would be the GPU and there’s only one of those. However, we don’t live in an ideal world, and the work of running a game, converting the game board to a format the model wants to see, sending that data to the GPU, getting back the results, and interpreting them is quite taxing.
The main method I use for playing games is simple:
Result PlayMatch(Player1, Player2)
I modified this in an obvious way:
(Result Structure) PlayMatches(Player1, Player2, NumberOfMatches)
Essentially it was the same thing, just repeating NumberOfMatches times. For small batches (10 games) the difference in speed between the two was negligible, roughly 1%. Chalk that up to random variation. That’s to be expected.
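As a rough illustration of that batch wrapper, here is a minimal sketch in Python. The names play_match and play_matches mirror the signatures above, but the bodies are hypothetical: the stand-in match just picks a random outcome, and the "result structure" is assumed to be a simple tally.

```python
import random
from collections import Counter

def play_match(player1, player2, rng):
    """Hypothetical stand-in for a real match: picks an outcome at random."""
    return rng.choice([player1, player2, "draw"])

def play_matches(player1, player2, number_of_matches, seed=0):
    """Play the same pairing repeatedly and tally the outcomes."""
    rng = random.Random(seed)
    results = Counter()
    for _ in range(number_of_matches):
        results[play_match(player1, player2, rng)] += 1
    return results

tally = play_matches("model_a", "model_b", 10)
print(sum(tally.values()))  # total games played in the batch
```

The only thing the wrapper adds over a single match is the loop and the tally, which is why a small batch should cost about the same as calling the single-match method repeatedly.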
The real fun came when I further modified the function. Calling the batch function now spawns a background thread and returns its result only when finished. The way my test was structured, I was using two of these threads at a time (though I could easily modify it to use more, there isn’t a point on my laptop, and I only publish to the HTPC when I consider myself to have reached a milestone).
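The shape of that change can be sketched as follows, again in Python with hypothetical names. It assumes one fresh thread per batch, two batches in flight at a time, and a caller that blocks until both finish; play_batch is a placeholder for the real work. Note that in CPython, threads doing pure Python CPU work will not run in parallel because of the global interpreter lock, so this shows the structure rather than the speedup.

```python
import threading

def play_batch(number_of_matches):
    """Hypothetical stand-in for a batch of matches; returns a fake summary."""
    return {"played": number_of_matches}

def play_batch_in_background(number_of_matches):
    """Start a fresh thread for one batch; return the thread and a result holder."""
    holder = {}

    def worker():
        holder["result"] = play_batch(number_of_matches)

    thread = threading.Thread(target=worker)
    thread.start()
    return thread, holder

def run_two_batches(number_of_matches):
    """Run two batches concurrently and return both results once finished."""
    jobs = [play_batch_in_background(number_of_matches) for _ in range(2)]
    results = []
    for thread, holder in jobs:
        thread.join()  # the thread ends when its batch is done
        results.append(holder["result"])
    return results

print(run_two_batches(10))
```

Each call pays the cost of creating and tearing down a thread, and that cost is the same whether the batch holds 10 games or 1000.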
Using background threads, I saw something interesting: the threaded version was now 20% slower than the non-threaded version when running games in batches of 10.
I expected there to be a slight hit (every time I start a batch, I spawn a new thread, which then dies when the batch is finished), but that 20% penalty shocked me. That’s some heavy overhead.
So I ran the test again, this time with batches of 100 games, and found a better result: the threaded version was now 10% faster. Increasing the batch size from 100 to 1000 yielded the same 10% advantage for multithreading.
I think what I’m seeing here is the limit of what I can achieve by multithreading on my laptop (where I do development). Even if I made the CPU-bound sections even faster, I likely wouldn’t see any performance increase on my laptop, as that’s not where the bottleneck is. It’s possible I’d see a much greater benefit on my HTPC as the CPU relative to the GPU there is so much weaker, especially if I went up to 4 threads to take full advantage of the quad core. If adding a second working thread saved me 10% of my time on large batches, it’s possible that adding a third and fourth would cut my time further (and I wouldn’t mind getting a 30% reduction in time running tests, though more likely the real-world advantage would be 20-25%… I’ll just have to test and see).
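One rough way to sanity-check that guess is Amdahl's law, which I am swapping in here as a back-of-the-envelope model, not as something measured. The sketch below assumes that "10% faster" means the two-thread run took 90% of the single-thread time, that the parallelizable share of the work splits perfectly across threads, and that thread overhead is zero. The function names are my own.

```python
def parallel_fraction(time_ratio, threads):
    """Solve Amdahl's law, T(n) = (1 - p) + p / n, for the parallel share p."""
    return (1.0 - time_ratio) * threads / (threads - 1)

def predicted_time_ratio(p, threads):
    """Predicted run time with `threads` threads, relative to one thread."""
    return (1.0 - p) + p / threads

# Assumed input: two threads took 90% of the single-thread time.
p = parallel_fraction(0.90, threads=2)
for n in (2, 3, 4):
    print(n, round(predicted_time_ratio(p, n), 3))
```

With the laptop's numbers this works out to a parallel share of about 20%, which would predict savings of roughly 13% at three threads and 15% at four. The HTPC's weaker CPU likely means a larger parallel share there, so these figures are a floor on what to expect rather than a forecast, and only a real test will settle it.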
Beyond that, though, I don’t think I’ll spend much time pursuing further multithreading. The bulk of the work is still on the GPU, and I have other optimizations in mind that could give even better performance boosts.