At least, in a program making extensive use of a compute-heavy library. The effort is better spent on the library itself, rather than the calling program.
I wrote in my previous post that I had better optimizations in mind. Well, up until now, I’ve been running individual games and then collating the results. This week, I’ve spent my time rewriting my methods to run batches of games and to submit entire batches to the GPU for processing.
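To make the difference concrete, here is a minimal sketch of the two calling patterns. It is in Python purely for illustration and does not use CNTK at all: `CountingModel` is a hypothetical stand-in for a network evaluation, and the function names are assumptions, not anything from my actual code. The point is only that the batched version hands the library one large call instead of many small ones.

```python
from typing import List, Sequence

State = List[float]  # hypothetical encoding of one game's current state


class CountingModel:
    """Hypothetical stand-in for a network: scores a batch and counts calls."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, batch: Sequence[State]) -> List[float]:
        self.calls += 1
        # Placeholder "evaluation": one number per state in the batch.
        return [sum(state) for state in batch]


def evaluate_one_by_one(states: Sequence[State], model: CountingModel) -> List[float]:
    """Old approach: one library call per game."""
    return [model([state])[0] for state in states]


def evaluate_as_batch(states: Sequence[State], model: CountingModel) -> List[float]:
    """New approach: a single library call for the whole batch of games."""
    return model(list(states))
```

Both functions return the same scores; what changes is how many times the model is invoked, which is where the per-call overhead lives.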
Compared to running individual games, running batches tops out at about 12x faster (1100% increase). Compared to my previous result of a 10% increase, that’s a serious boost. Keep in mind, that’s still using only a single thread to run the entire batch.
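For anyone wanting to check the arithmetic between "times faster" and "percent increase", a small sketch follows. The helper names are my own, chosen for illustration.

```python
def percent_increase(factor: float) -> float:
    """Convert a speedup factor (e.g. 12x) into a percent increase."""
    return (factor - 1.0) * 100.0


def factor_from_percent(percent: float) -> float:
    """Convert a percent increase (e.g. 10%) into a speedup factor."""
    return 1.0 + percent / 100.0


print(percent_increase(12.0))     # 12x as fast is an 1100% increase
print(factor_from_percent(10.0))  # a 10% increase is roughly a 1.1x factor
```

Put that way, the earlier 10% gain was about 1.1x, against roughly 12x for batching.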
I then looked at adding some more threading to the mix, à la my previous enhancements, and found the results to be on par with the single-threaded batch (within ±1% of it). Once you submit a chunk of data to the CNTK library, it appears to make ready use of all available cores and do the heavy lifting for you, making your own multithreading largely redundant.
On the theory that the garbage collector could be a problem, as I’m spawning a huge number of generic collections (Lists of lists of data), I then looked at some rather complicated restructuring in order to reuse those collections and hopefully lighten the load on the garbage collector.
Again, the results were within ±1% of the naive version.
All of which is to say that the team over at Microsoft have done an amazing job optimizing CNTK for use. The single most important thing you can do is get your data into as large a chunk as possible (given constraints of your graphics memory) and send it as a chunk to the library. Once you do that, CNTK takes over and makes amazing use of the hardware at its disposal.
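As a rough sketch of the "largest chunk that fits" idea, the helper below splits a workload into batches no larger than a chosen cap. The name `chunked` and the parameter `max_batch_size` are assumptions for illustration; the right cap depends on your model and graphics memory and has to be found by experiment.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], max_batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive batches of at most max_batch_size items."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    for start in range(0, len(items), max_batch_size):
        # The final batch may be smaller than the cap.
        yield list(items[start:start + max_batch_size])


# Ten hypothetical games, with a cap of four per submission.
for batch in chunked(list(range(10)), 4):
    print(batch)
```

Each yielded batch would then go to the library in a single call, so the cap should be as large as memory allows.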