Optimization wonderland extravaganza

Discussion in 'Planetary Annihilation General Discussion' started by varrak, June 27, 2014.

  1. varrak

    varrak Official PA

    Messages:
    169
    Likes Received:
    1,237
    Some of you lovely folks reminded me it's been a really long time since I did an update. Sorry 'bout that. I've been busy.

    "Doing what?" I hear you cry (probably). Well, lot's of things. Some bug fixing (e.g. making the decals on the planet surface render properly), adding features to make the artists happy, stuff like that.

    But the biggest focus has been performance. Below is a screengrab from UberProbe (our internal profiling tool), showing a capture from a late game.

    Now, for this to make sense, I need to describe a couple of things. This view shows time spent on the CPU (where we are hitting the biggest performance issues). Time is the x axis (so it reads from left to right). In this case, a frame is taking around 40ms (so running at around 25fps). Each vertical segment (so the gray and the blue background bars) is a thread (there are two in this capture). The green blocks show time spent in functions, with the function callstack (i.e. things that call into other things) arranged vertically.

    single-threaded.jpg

    Now, the observant reader will notice that we're doing all our work on a single thread here. This has made the rapid development schedule a lot easier to manage, but it's also restricted our performance quite a bit. Here, we can see a chunk of time spent in the "update" part of our loop (where we process data sent from the server, update entity positions, and so on), and then even more time spent in the "render" section.

    So my focus has been on moving some of this work onto other cores, so we can make the most of all those nice beefy multi-core CPUs you guys are playing on. This is not trivial (at all). Some reasons this is hard:

    1. Graphics calls can only be made on one thread:

    OpenGL has the concept of a "context thread", which means you can only make OpenGL calls on the thread (or CPU core) that owns the context. You can move the context around, but that's expensive, and it would also require some fancy synchronization footwork.

    DirectX 11 and above support something called "deferred contexts", where you can record GPU command streams on any thread, and then just submit them on your main thread. This means you can spread rendering work across multiple cores. I believe Mantle (AMD's API) can do that too. Unfortunately, OpenGL does not support this, even with an extension, so we are SOL.

    2. Threads need to play nicely together.

    We have collections of entities (things in the game - particle systems, units, buildings, etc) and you can't have multiple threads touching these collections at the same time. Bad things can happen (well, the game will crash).

    3. There are temporal dependencies...

    By that I mean, there's stuff in the render part that depends on stuff that's being done in the update. That's why the update is done first. This makes it very hard to do things in parallel.

    4. The operating system is messing with you. And watching you.

    We use things called "synchronization primitives" to manage how we access shared data across threads. These include "mutexes", "critical sections" and something I added called "ticket locks". The first two are operating system concepts, so they have some overhead. The last one is my own thingie; to wait on it, you do something called "spinning", where you just loop around a small piece of code waiting for something to change. Ticket locks spin, which is OK if you do it for very small periods of time, but if you do it too much, the operating system goes "oh-ho! You appear to be misbehaving! I am going to deprioritize you now!" and your framerate goes in the toilet.
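The post doesn't show the engine's actual ticket lock, but the idea can be sketched in a few lines: each thread takes a numbered ticket and spins until the "now serving" counter reaches its number. Everything here (names included) is a hypothetical illustration, not Uber's code.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Sketch of a ticket lock. FIFO-fair, but waiting threads burn CPU by
// spinning -- fine for very short critical sections, bad if held too long
// (the OS scheduler will deprioritize a thread that spins excessively).
class TicketLock {
public:
    void lock() {
        // Take the next ticket, then spin until that ticket is served.
        unsigned my = next_.fetch_add(1, std::memory_order_relaxed);
        while (serving_.load(std::memory_order_acquire) != my) {
            // Busy-wait ("spinning").
        }
    }
    void unlock() {
        serving_.fetch_add(1, std::memory_order_release);
    }
private:
    std::atomic<unsigned> next_{0};
    std::atomic<unsigned> serving_{0};
};

// Demo: several threads increment a shared counter under the lock.
int ticket_lock_demo(int threads, int iters) {
    TicketLock lock;
    int counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                lock.lock();
                ++counter;  // protected by the ticket lock
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

With a correct lock, every increment survives; without one, the threads would race and lose updates (or worse).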

    You also get the OS scheduling other stuff, which takes time away from your threads. It's not polite about it either, and will do so right when you're in the middle of something. Fortunately, the way our threading now works, we don't see this have much of an effect.

    Before I waffle more, here's another picture. This one shows the current state of things:


    threaded.jpg

    You can see here, we now have a number of threads running, doing work.

    I'm actually getting this out in two phases, and there are two methods I'm using to scale. What will be shipping out next (soon... soooooon...) is the following:

    First, I added a "parallel for" concept. This allows us to take a chunk of work (say, 1000 units that need updating) and spread the work equally across many threads. The left-most part of the probe here shows that. Basically, unit skinning/animation, updates, etc. are done in a big parallel update loop that just reduces the time it takes to run.

    The implementation I ended up with actually makes each thread take a large chunk of the remaining work (so for 4 threads and 1000 units, each thread would take 200 or so the first time), and then a proportionally smaller chunk each successive time. Since different bits of work can take different amounts of time, this helps avoid too much overhead from "thread drainout" - where all the other threads are stalled waiting for one that took on too much to finish. It also means we reduce the book-keeping overhead of making each thread pull one item at a time (since they grab batches). And, as well, that helps with cache coherency, since you have threads all pounding on data that's adjacent in memory.
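This shrinking-chunk scheme (often called "guided" scheduling) can be sketched as follows. This is a hypothetical reconstruction of the idea, not the engine's implementation; `GuidedRange` and `chunk_sizes` are made-up names.

```cpp
#include <algorithm>
#include <atomic>
#include <vector>

// Each grab claims roughly (remaining / numThreads) items, so chunks start
// large and shrink toward 1. Big early chunks keep bookkeeping cheap and
// cache-friendly; small late chunks limit end-of-loop "drainout".
struct GuidedRange {
    std::atomic<int> next{0};
    int total;
    int threads;
    GuidedRange(int total, int threads) : total(total), threads(threads) {}

    // Claim the next chunk; returns false when the work is exhausted.
    bool grab(int& begin, int& end) {
        for (;;) {
            int cur = next.load();
            if (cur >= total) return false;
            int remaining = total - cur;
            int chunk = std::max(1, remaining / threads);
            if (next.compare_exchange_weak(cur, cur + chunk)) {
                begin = cur;
                end = cur + chunk;
                return true;
            }
        }
    }
};

// Walk the schedule on one thread, just to show the shrinking chunk sizes.
std::vector<int> chunk_sizes(int total, int threads) {
    GuidedRange range(total, threads);
    std::vector<int> sizes;
    int b, e;
    while (range.grab(b, e)) sizes.push_back(e - b);
    return sizes;
}
```

For 1000 items and 4 threads, the first grab takes 250, the next 187, and so on down to single items at the tail.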

    So for animations in a very late game, I was seeing around 10ms to update them all. By doing it in parallel, that cuts it to 2-3ms, which makes a big difference.

    The parallel-for implementation I added also has the concept of an asynchronous version. What that means is you can say "do all this in parallel, but let me keep doing something else". So for particle system depth-sorting and GPU data-writes, I just throw that off over the fence and it gets done in the background. Later in the frame when we want to draw them, they're all just magically ready. In the single-threaded implementation, I saw this work take up to 20ms a frame (which is a lot!). By making it asynchronous, that 20ms just goes away.
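The "throw it over the fence, collect it at draw time" pattern can be sketched with standard C++ futures (the engine's own async parallel-for isn't shown in the post, and `sort_particles_async` is an invented name for illustration):

```cpp
#include <algorithm>
#include <future>
#include <vector>

// Kick off the particle depth sort on a background thread and return a
// future. The caller keeps doing other frame work; calling .get() right
// before drawing blocks only if the sort hasn't finished yet -- usually
// it's "magically ready" by then.
std::future<std::vector<int>> sort_particles_async(std::vector<int> depths) {
    return std::async(std::launch::async, [depths]() mutable {
        std::sort(depths.begin(), depths.end());  // background depth sort
        return depths;
    });
}
```

The win is that the 20ms of sorting overlaps other work instead of sitting on the critical path of the frame.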

    This gets a particular late-game replay on my machine from around 8-10fps to around 20fps. Not there yet, but much, much better.

    Coming slightly less soon (since it's not fully stabilized yet) is moving a large chunk of the update to run in parallel with the main render block. This can cut another large chunk of time out of the frame, and make even more improvements. To do this, I had to break the render chunk out into multiple bits, and schedule stuff in the background. I also had to make a lot of the changes we make to collections (meshes, lights, particles) deferred; that means, when the parallel update makes a change to a light, we don't change the light just yet. We take the delta, and add it to a list. When we're done with the rendering, we just batch-add all those changes before looping to the next frame.
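The deferred-delta idea can be sketched like this (a hypothetical illustration of the pattern, not the engine's code; `DeferredChanges` and `Light` are invented names):

```cpp
#include <functional>
#include <mutex>
#include <vector>

struct Light { float intensity = 1.0f; };

// While update runs in parallel with render, changes to shared objects
// are recorded as deltas instead of applied immediately. Once rendering
// for the frame is done, the deltas are batch-applied before the next
// frame starts, so the renderer never sees a half-updated collection.
class DeferredChanges {
public:
    // Called from the parallel update: record the change, don't mutate.
    void defer(std::function<void()> delta) {
        std::lock_guard<std::mutex> g(mutex_);
        deltas_.push_back(std::move(delta));
    }
    // Called on the render thread between frames: apply everything at once.
    void apply() {
        std::lock_guard<std::mutex> g(mutex_);
        for (auto& d : deltas_) d();
        deltas_.clear();
    }
private:
    std::mutex mutex_;
    std::vector<std::function<void()>> deltas_;
};
```

The render thread only ever observes the state from before the batch-apply, which is what makes the update/render overlap safe.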

    This change gets us even more speed. In the above example, that 8-10fps late game now runs 25-30fps (with some hitches, which I'm working on AFTER that).

    On a quad-core system (like my work machine) this gets CPU usage from around 8% to 25-33%. Bear in mind, on a hyperthreaded system the ideal is not actually 100%, since you may actually run slower in that case (I can talk more about that if anyone is interested). In this case, 50% or so is ideal.

    Even though I'd love to be able to set everyone's CPUs on fire because we're just maximizing their power, I doubt we'll get close to doing that. But we're getting better, and the gains will keep on coming.
  2. nickste88

    nickste88 Member

    Messages:
    49
    Likes Received:
    11
    Nice, so in short this means less laggy games when a couple of people or AIs spam armies.
    I like it, can't wait for the next update, I hope the performance work is in it :)
  3. nixtempestas

    nixtempestas Post Master General

    Messages:
    1,216
    Likes Received:
    746
    always interested in the nitty gritty details.
  4. plink

    plink Active Member

    Messages:
    176
    Likes Received:
    89
    Really great read. Thanks for taking the time!
  5. jtibble

    jtibble Active Member

    Messages:
    149
    Likes Received:
    89
    Looking at the second chart of the current parallelization, it looks like the majority of the animation updates are done 4-5ms before the render begins. You say that all of the rendering depends on all of the update, but is that strictly true? Would it be possible to begin rendering some of the completely-updated entities/collections before the update cycle is complete?


    Alternatively, what about starting the next update cycle in a thread halfway through the rendering? If the rendering depends on the previous update, would starting the update cycle earlier (perhaps even halfway through the previous frame) help cut huge amounts of time off the entire loop?
  6. RMJ

    RMJ Active Member

    Messages:
    587
    Likes Received:
    234
    This is one of the things that has always made me wonder: whether it will ever be possible for games to be written in a way that they use whatever cores are available without all the fuss, or, even better, for CPUs, like GPUs, to just split the work across however many cores or threads the system has.

    It's probably because I don't have a clue about this, but it just seems odd that there isn't, in general, more effort being put into making it easier for game developers to use multiple cores. I mean, at some point in the future, I would imagine CPUs will have 10-20-30-40 cores, and it sounds like it's already a damn nightmare making games run on 4 cores.
  7. SXX

    SXX Post Master General

    Messages:
    6,896
    Likes Received:
    1,812
    I suppose this topic is about client-side performance improvements and it can't make the simulation faster, but maybe I'm wrong? :rolleyes:
    PeggleFrank likes this.
  8. cptconundrum

    cptconundrum Post Master General

    Messages:
    4,186
    Likes Received:
    4,900
    This is really one of the biggest problems in software engineering right now. It is probably not possible to get a compiler to efficiently split a single-threaded program into a multi-threaded one without any bugs. This sort of program really needs to be written by an experienced programmer or it just won't work.

    In addition to that, there is also another problem. Think of each thread as an assembly line in a factory. You can build a car faster by having one assembly line for the frame and another for the engine, but eventually you need to merge the lines together and actually put the engine in your car. In this scenario, you can keep splitting off more tasks into their own assembly line only down to a certain level. Eventually it gets to a point where you're better off just doing more things in one line than trying to do something small and then keep merging it back together. Even worse, there are things that just can't be done on a parallel assembly line at all. It doesn't make a lot of sense to set up the electrical wiring on this car before the frame is finished, for example.

    This all means that there is really going to be a point where adding more cores to a computer stops producing favorable results. You will still benefit if you are trying to run several totally separate programs at once or if you want to do something highly parallelizable like video encoding, but not really for games.
    LavaSnake, lokiCML, Quitch and 3 others like this.
  9. exterminans

    exterminans Post Master General

    Messages:
    1,881
    Likes Received:
    986
    Quite simple to explain actually ;)

    As you may know, modern CPUs are superscalar; that means even a single core has multiple instances of various components, so it can execute multiple instructions (from the same thread!) in parallel. This works very well if the compiler managed to keep side effects between neighboring instructions low, since the CPU core can detect that and start scheduling them in parallel across the different resources.

    Hyperthreading comes into play when the compiler failed - that is, if the code happens to be organized in such a way that not all resources of a specific type are fully used. (Either that, or because a different thread has a different workload which can use DIFFERENT resources, e.g. one thread using the FPU for floating point arithmetic, the other one using the ALU for integer arithmetic.) So Hyperthreading attempts to fill the idle resources with instructions from the other thread.

    Works quite well if this can increase the usage of a specific resource type (e.g. integer units). Doesn't work at all if the one thread is already hitting the limit on a specific resource type, because in that case Hyperthreading will have to cut back resource allocation by the first thread in favor of the second thread.

    So you suddenly have 2 threads, each running at half speed. Hence the 50% CPU usage prediction for Hyperthreaded systems.

    Btw.: it works slightly differently with AMD's modules. They behave very differently from Intel's Hyperthreading for SOME instructions (since the corresponding resources are shared module-wide), while for other instructions each core (there are 2 cores per module) has its very own resources. So it can sometimes outperform Hyperthreading, and sometimes it suffers the very same problems.

    It is not THAT difficult. The Intel compiler for example can do parts of the parallelization for you even without any changes to the code. It can do that, because it has a good understanding of access patterns to the main memory so it can auto-detect code sections suitable for parallelization.

    And then there are also frameworks like OpenMP (compatible with VC, GCC and the Intel compiler, but not with Clang yet) which make parallelization a breeze.
    Most important: OpenMP is very efficient; you can already gain a huge performance plus from parallelizing comparably SMALL work shares. Automatically parallelizing a for-loop with only a few hundred iterations and only a dozen floating point operations per iteration already scales almost perfectly.
    That's quite important, since small work shares are far easier to parallelize than big ones. The issue is mostly race conditions, which can only be avoided by smart access patterns and by defining critical sections (sections where only one thread must ever work). Planning this for small sections is quite easy with some practice. Finding suitable patterns and identifying critical sections for larger chunks of code is a nightmare.
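To make the "parallelizing is as simple as adding pragmas" point concrete, here is a minimal OpenMP-style loop (a generic example, not from the thread; without an OpenMP-enabled build the pragma is simply ignored and the loop runs serially with the same result):

```cpp
#include <vector>

// The pragma splits the loop iterations across threads, and the
// reduction(+:total) clause gives each thread a private partial sum
// that is combined safely at the end -- no manual mutexes needed.
double sum_of_squares(const std::vector<double>& xs) {
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < (long)xs.size(); ++i) {
        total += xs[i] * xs[i];
    }
    return total;
}
```

The reduction clause is exactly the kind of "smart access pattern" mentioned above: it removes the race on `total` without serializing the loop body.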
    Last edited: June 27, 2014
    vyolin, PeggleFrank, Remy561 and 3 others like this.
  10. popededi

    popededi Well-Known Member

    Messages:
    784
    Likes Received:
    553
    As a user who falls into the "knows enough to be dangerous" category, I'm proud of myself for understanding some of the words in your post.

    Thanks for the update!
    varrak likes this.
  11. shotforce13

    shotforce13 Well-Known Member

    Messages:
    543
    Likes Received:
    400
    Thanks for the cake, it's delicious
    PeggleFrank, Remy561 and varrak like this.
  12. nixtempestas

    nixtempestas Post Master General

    Messages:
    1,216
    Likes Received:
    746
    As I have actually written a (simple, rather silly) multi-threaded game, I'll use it as an example of some of the nightmares that can occur.

    The game was a text based "shoot down the saucers" game, where saucers spawned on the top of the screen and tried moving across, while the player controlled a rocket launcher on the bottom of the screen to shoot it down, all while the saucers would drop bombs randomly on the player.

    Every single saucer, rocket, and bomb was its own thread, along with a few others. Every single one of these threads was asynchronous, meaning there was no global timer keeping things in sync; if you fired a rocket, it would have its own timer, as would each saucer, etc.

    This means they were all doing their own collision detection and updating the screen, etc.

    Because of this, you never know exactly when something is going to happen relative to anything else, and if something conflicts, horrible things can happen. For example, if you are halfway through updating a saucer position, and a rocket decides it wants to update, it could start printing really weird things all over the place (like turning the rocket into a saucer, often leaving random bits of text everywhere).

    This is where the mutexes varrak mentioned come in. Mutex stands for mutual exclusion (yeah, computer scientists have weird ways of making names), and is used to block a thread from interrupting another thread in a critical section (code that must be completed in its entirety before anything else can use those resources).

    So back to the game example: I had a mutex lock for updating the screen. Every time a thread wanted to update the screen, it would request the lock; if it was unlocked, it becomes locked and execution proceeds. Once done, you unlock it (if you forget to do this, your screen will just freeze up, as nobody else will be able to access the screen update).
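In modern C++ the same screen-lock idea looks like this (a generic sketch, not the original assignment's POSIX code; `draw_at` and `screen` are invented for the example). `std::lock_guard` locks in its constructor and unlocks in its destructor, so the "forgot to unlock, screen freezes" failure mode can't happen even on early returns:

```cpp
#include <mutex>
#include <string>

std::mutex screen_mutex;
std::string screen;  // stand-in for the terminal display

// Critical section: only one thread may write to the screen at a time,
// so a saucer update can never interleave with a rocket update.
void draw_at(const std::string& sprite) {
    std::lock_guard<std::mutex> g(screen_mutex);  // acquire; auto-release
    screen += sprite;
}
```

With raw POSIX mutexes you call `pthread_mutex_lock`/`pthread_mutex_unlock` by hand, which is exactly where the forgotten-unlock bug creeps in.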


    You can probably start guessing why this can't be done totally automatically. The computer has no sure-fire way of identifying a program's critical sections. While it may be able to guess that posting stuff to the screen would be one, other things, like collision detection, are not so simple, as there is no way for the compiler to detect that it is a critical section and not just a standard loop (along with the other reasons mentioned above by cptconundrum).

    While sorian's AI may some day surpass us puny humans, until then, it requires a human to make these decisions.
    Last edited: June 27, 2014
    PeggleFrank and cptconundrum like this.
  13. exterminans

    exterminans Post Master General

    Messages:
    1,881
    Likes Received:
    986
    You should really try again with OpenMP. (Though you really shouldn't handle objects as individual threads! Threads should work on a shared data set, unless you have very good reasons not to, such as a dedicated network thread or the like.)
    Stuff like manual management of mutexes is hidden away very well, and parallelizing is as simple as adding pragmas, so there's no risk of forgetting to unlock a mutex. It also makes it very comfortable to switch between parallel/single sections frequently, which isn't exactly recommended when using the bare POSIX/Win32 APIs for threading, due to the performance overhead when done wrong.

    Plus: You can run the entire code with parallelization switched off so you can get a reference implementation. Very helpful for debugging race conditions by comparing intermediate results.
  14. nixtempestas

    nixtempestas Post Master General

    Messages:
    1,216
    Likes Received:
    746
    It was for a university assignment, written in C with posix threads. We weren't given the choice to use any such fancy libraries.

    Otherwise, I'm all for using libraries like that, no point reinventing the wheel.
  15. japporo

    japporo Active Member

    Messages:
    224
    Likes Received:
    118
    In the pre-hyperthreading era, you could only be doing as many tasks (i.e. threads) at once as you had cores for. If whatever you were doing had to wait on something, the core basically did nothing while it was waiting, which is inefficient.

    With hyperthreading, the processor pretends to have twice as many cores as it actually does. This means that each physical core can be assigned two tasks at once and whenever it gets stuck on one of the tasks, it can switch right over to the other instead of being idle. This gets you a little bit more efficiency when there are a lot of tasks going on at once.

    When you have a bunch of really intensive tasks, which is often encountered in games and enterprise applications, there is very little time in which a task has to wait on things and therefore there's no "slack" to take advantage of to switch to another task. In other words, at 50% processor utilization, all of the underlying physical cores are fully utilized.

    Unlikely. There are small scale ways of recognizing opportunities for parallelization (e.g. auto-vectorization, MapReduce, etc.) but the most effective way to speed up software has always been selecting the right architecture and the right algorithms and making the right trade-offs and that requires a human mind.
    cptconundrum and nixtempestas like this.
  16. wondible

    wondible Post Master General

    Messages:
    3,315
    Likes Received:
    2,089
    If you'll forgive a silly question....

    Given that chronocam gives the game a concept of multiple points in time rather than a single mutable state, why did you choose to focus on breaking up a single frame rather than handing a frame to each processor and cycling through them? The lag time shouldn't be much worse than the current single-core performance, I would think.
  17. varrak

    varrak Official PA

    Messages:
    169
    Likes Received:
    1,237
    It's a different concept. Chronocam controls the data flow from the server. The client is still rendering frames the same way. The client's loop works (at a very high level approximation) like so:

    1. Download a snapshot of state from the server (which could be at any time - aka chronocam)
    2. Create any new objects that we need to display.
    3. Destroy any objects that went away.
    4. Update the state of everything.
    5. Draw everything that's visible.

    We do these 5 steps repeatedly, as fast as possible. To run at 60fps, we need to do all that in 16 milliseconds. That's not a lot of time.
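The five steps above can be sketched as a skeleton loop body. This is a hypothetical reconstruction for illustration only; `Snapshot`, `Client`, and `run_frame` are invented names, and the real client's update and draw steps are elided:

```cpp
#include <algorithm>
#include <vector>

struct Snapshot { std::vector<int> alive_ids; };  // step 1: server state

struct Client {
    std::vector<int> objects;  // objects currently displayed, by id
    int frames_drawn = 0;

    // One iteration of the five-step loop; called as fast as possible.
    void run_frame(const Snapshot& snap) {
        // 2. Create any new objects we need to display.
        for (int id : snap.alive_ids)
            if (std::find(objects.begin(), objects.end(), id) == objects.end())
                objects.push_back(id);
        // 3. Destroy any objects that went away.
        objects.erase(std::remove_if(objects.begin(), objects.end(),
            [&](int id) {
                return std::find(snap.alive_ids.begin(), snap.alive_ids.end(),
                                 id) == snap.alive_ids.end();
            }), objects.end());
        // 4. Update the state of everything.  (elided)
        // 5. Draw everything that's visible.  (elided; count the frame)
        ++frames_drawn;
    }
};
```

The 16ms budget for 60fps has to cover all five steps, which is why shaving milliseconds off update and render matters so much.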
    ArchieBuld likes this.
  18. RMJ

    RMJ Active Member

    Messages:
    587
    Likes Received:
    234
    Yeah, I figured as much, but at least I did learn a few things. Thanks for the long responses :)

    It's a real shame it takes so much work to make a game use all the cores, but it's definitely worth it.

    It's probably also a good thing that CPUs then won't go all crazy and have an insane number of cores; that would be programmer hell, from what I understand.
  19. wondible

    wondible Post Master General

    Messages:
    3,315
    Likes Received:
    2,089
    So, the data isn't stored on the client? If you go back in chronocam, it has to redownload everything again?
  20. websterx01

    websterx01 Post Master General

    Messages:
    1,682
    Likes Received:
    1,063
    Intel is making the first hyperthreaded octa-core desktop CPU - have fun, game programmers! It's your dream! :p
    PeggleFrank, tollman and RMJ like this.

Share This Page