Reproducible Performance Measurements

30.08.2024

Eugene Yastremsky

Lead programmer

Introduction

While working on performance optimizations in any software, it's essential to measure the performance in the first place. A common maxim here is never to optimize without first looking at some metric. Even though it's sometimes possible to make simple optimizations without prior investigation, just by looking at the code, it wouldn't hurt to measure them too.

In this article, we'll look at common challenges that arise during game optimizations, and the ways reproducible performance measurements help overcome them.

Discern micro-optimizations

Profilers are powerful tools for finding bottlenecks and tracking optimization progress, but when we talk about games, which can be very complex systems with many moving parts, just using the profiler may not be enough. For example, sampling profilers statistically measure the impact of functions in a running program and thus give us an estimate of their running times. If we're doing micro-optimizations, we might find that the difference in microseconds we are looking for is lost in the noise and the profiler's overhead, which makes it hard to tell whether the optimization is worth keeping.

Probably the easiest approach, even though it is time-consuming, is to take many measurements and look at their average, hoping that the noise cancels out and the overhead becomes steadier.

Counterintuitively, the opposite approach of looking at a single capture frame by frame (no averaging at all) can be useful too: the effect might be visible only on a particular frame, and with averaging it can be drowned out by adjacent frames. This, of course, depends on the optimization we're performing: it might be expected to give a constant gain or a burst gain on particular frames.

If the effect still can't be seen after trying various approaches, it's better to throw the optimization away. Not all ideas work out and that's OK. There is no reason to commit it even if no pessimization was found either: the code is probably more complicated than the original, and there is no need to reduce maintainability for no measurable gain.

TL;DR:

Measure many times, and average the results.
Draw graphs to see per-frame changes.
If the change is still indiscernible, get rid of it – no need to complicate code for little to no gain.
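As a rough illustration of "measure many times and average", the sketch below compares two sets of captures and asks whether the difference in means stands out from the run-to-run noise. The timing values, the function names, and the threshold of two combined standard errors are all assumptions made for this example, not something the workflow prescribes.

```python
import statistics

def summarize(samples):
    """Return (mean, standard error of the mean) for a list of timings."""
    mean = statistics.fmean(samples)
    sem = statistics.stdev(samples) / len(samples) ** 0.5
    return mean, sem

def is_discernible(before, after, k=2.0):
    """True if the means differ by more than k combined standard errors.

    k=2.0 is an arbitrary illustrative threshold (an assumption).
    """
    mean_b, sem_b = summarize(before)
    mean_a, sem_a = summarize(after)
    combined = (sem_b ** 2 + sem_a ** 2) ** 0.5
    return abs(mean_b - mean_a) > k * combined

# Hypothetical average frame times in milliseconds, one value per capture.
before = [16.9, 17.2, 16.7, 17.1, 17.0]
after_clear = [16.1, 16.3, 16.0, 16.2, 16.4]   # gain well above the noise
after_noisy = [16.8, 17.3, 16.6, 17.2, 16.9]   # gain lost in the noise

print(is_discernible(before, after_clear))  # True
print(is_discernible(before, after_noisy))  # False
```

A check like this only covers the averaged view; a burst gain on a few frames can still hide behind an unchanged mean, which is why the per-frame graphs remain useful.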

Find pessimizations

Sometimes our intuition and expertise fail us, and what we thought would be an improvement leads to a decrease in performance. In the simplest case, we find the pessimization in the place where we expected an improvement, so of course we throw away the idea and move on to the next one.

That said, the situation may be more complicated: we might observe an increase in performance in one place and a decrease in another, for example in different frames, on different threads, or in different resources. In these cases it isn't as easy as throwing away the idea. A decision has to be made: do we trade performance in one place for another?

A common example is the memory-speed trade-off. One of the most common optimization techniques is caching, and it almost always costs additional memory. If we're good on memory, it's often an easy decision, but if we're on a constrained console and near the out-of-memory limit, the decision gets significantly harder, as keeping the change might necessitate additional memory optimization.

Another possibility is a timing trade-off. Taking the same example, filling the cache might be expensive, so we lose on the one frame where we perform the caching, but win on subsequent frames. Is the initial drop tolerable? To make an informed decision, it's important to look at data for multiple frames, and potentially at multiple signals (e.g. we fill the cache on one thread, but the subsequent improvement is observed on another).

Another good reason to look at different signals is that a pessimization might appear unexpectedly in a different place. For instance, you add a cache and the studied spot gets better, but by adding a field to the structure you've made it bigger. Now the data doesn't fit into the CPU cache as well, throughput drops, and some other spot becomes the bottleneck. This kind of thing can happen to anyone: games are often too large to hold in one's head, we have to work with a small part of the system, and it's easy to miss the other running parts.

TL;DR:

Throw away definitive pessimizations.
Draw graphs to see per-frame changes – a pessimization may happen in frames other than the improved ones.
Check many signals – frame times, thread times, GPU load, etc. – a pessimization may happen in a different system than the one being optimized.
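To make the "check many signals, frame by frame" advice concrete, here is a minimal sketch that scans several per-frame signals and reports where the optimized capture is slower than the baseline. The signal names, the sample values, and the 0.5 ms tolerance are hypothetical and chosen only for illustration.

```python
def find_regressions(before, after, tolerance_ms=0.5):
    """List (signal, frame, delta_ms) where `after` is slower than `before`.

    `before` and `after` map a signal name to per-frame times in ms.
    Frames are compared pairwise, so the captures must be aligned.
    """
    regressions = []
    for signal in before:
        pairs = zip(before[signal], after[signal])
        for frame, (b, a) in enumerate(pairs):
            delta = a - b
            if delta > tolerance_ms:
                regressions.append((signal, frame, round(delta, 2)))
    return regressions

# Hypothetical data: a cache is filled on frame 1 of the game thread,
# later frames get faster there, but the render thread gets slower.
before = {
    "game_thread": [8.0, 8.1, 8.0, 8.2],
    "render_thread": [6.0, 6.1, 6.0, 6.1],
}
after = {
    "game_thread": [8.0, 11.5, 6.9, 7.0],
    "render_thread": [6.0, 6.1, 6.9, 7.0],
}

for signal, frame, delta in find_regressions(before, after):
    print(f"{signal}: frame {frame} is {delta} ms slower")
```

In this made-up data the game thread pays once for filling the cache, while the render thread shows a lasting slowdown that a look at the game thread alone would miss.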

Optimization-induced bugs

One potential side effect of optimizations is breakage in gameplay or visuals. When performing optimizations, it's often clear whether the idea is dangerous or harmless. A logical refactor shouldn't break anything (though due to a mistake it still can, we're only human), but something more drastic, such as cutting out a system that seems to be unused by our particular game but is still processed at runtime, may very well lead to problems.

For this reason, it's important not to get tunnel vision on performance numbers alone, but also to keep track of how the game looks and plays. Of course, you should also have QA, but that's out of the scope of this article, and it's no good to submit a change that clearly breaks something and waste QA's time.

Obvious problems can be caught just by looking at the game screen, so don't forget to do that! For a finer look, it's immensely useful to record video of the gameplay before and after the change and then play the recordings side by side.

Performance data itself can also be an indicator of breakage: a large change in load can indicate that some important system no longer works correctly, even though the visuals seem fine at first glance. After all, the bug might be slight or even cumulative.

If all seems good but you know the change is dangerous, notify QA and give them some context: which systems were touched and which are most likely to break. This might help them tremendously.

TL;DR:

Look at the game screen.
Record gameplay videos before and after the optimization.
Sanity check the performance numbers.
Notify QA of dangerous changes.

Reproducibility and determinism

If we're profiling our game in most non-trivial cases (e.g. when there isn't one large, obvious bottleneck), we need to capture reproducible game sequences; otherwise, comparing the optimized results to the base case can get difficult (both numerically and visually when hunting for bugs). The way to achieve this differs from game to game and may require some creativity if the game is particularly badly behaved. To simplify this a little, we're seeking three interrelated properties: play determinism, game logic determinism, and system determinism.

Play determinism means our ability to play the game the same way with each measurement. One way to get it is the ability to script gameplay from the player's perspective (meaning the developer or QA, of course; leaving it in a shipping build would essentially constitute a cheat). The basic form of this can be a virtual controller: you do a test run while capturing your inputs, and later, when profiling, you replay the input capture, repeating every action in exactly the same fashion. An alternative to input capture is the ability to hook into every action programmatically, which allows manual construction of sequences that happen exactly the way required, without the need to know the controls very well or depend on variable input latency.

Of course, for this to work properly, you need to start from the same state (save, level, stage, etc.), and the game logic must be deterministic (e.g. NPCs perform the same actions, their spawns are predetermined, and if randomness is used, the seed is the same between captures). Some games are necessarily deterministic (for example, competitive ones), but others might not have such a constraint; if that's the case, it's still advisable to be deterministic for the sake of simpler development and testing. On the other hand, if a game doesn't have game logic determinism, our options become quite limited. One should still strive to replicate the situation in as many parameters as possible, take more measurements and average them, and keep the capture length to a minimum, trying to capture the exact problematic spot.

Even if the game logic is fully deterministic, the game itself as a software system almost never is. Consider starting a level 20 seconds after the game boots versus 30 seconds after: the memory can be in a different state, garbage collection (if there is any) may fire at a different time, and other processes running on the device, which you have no control over, also use system resources. Generally, you can't do much to alleviate these problems (aside from doing every test from a clean boot, which is a good idea anyway), but if they are severe, a development task to reduce their impact is in order. For example, garbage collection can be forced at key moments, say before a level starts, so that we have a more consistent memory state. If a game is resource-constrained or has poor memory management, switching to better allocators or lifetime management can help.

TL;DR:

Use input capture or scripting to craft reproducible sequences.
Keep the capture short and to the point.
Average multiple captures.
Reduce indirect hardware/software effects on performance capture.
In case of a severe lack of system determinism – switch to more consistent low-level solutions or manually tweak them near points of interest.
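The combination of input capture and a fixed random seed can be sketched with a toy simulation. Everything here (the `simulate` function, the input encoding, the seed value) is hypothetical and stands in for whatever the game and engine actually provide.

```python
import random

def simulate(inputs, seed):
    """Toy game loop: the player moves by input, an NPC moves by seeded RNG."""
    rng = random.Random(seed)      # same seed gives the same NPC behavior
    player, npc = 0, 0
    for move in inputs:            # one captured input per simulated frame
        player += move
        npc += rng.choice((-1, 0, 1))
    return player, npc

# "Capture": the inputs recorded during one manual test run (made-up values).
captured_inputs = [1, 1, 0, -1, 1, 1, 0, 1]

# "Replay": the same inputs with the same seed repeat the run exactly.
first = simulate(captured_inputs, seed=1234)
second = simulate(captured_inputs, seed=1234)
print(first == second)  # True
```

This covers only play and game logic determinism. System determinism (memory state, garbage collection, other processes) is outside what a replay mechanism like this can control.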

Example analysis tooling

Here is an example of some simple-to-assemble tooling for reproducible capture analysis that can be a good addition to conventional profilers. As a prerequisite, we'll need the above-mentioned game-specific capabilities to play out the same situation and to capture raw performance data. In this example, we're using Unreal Engine, so for the latter we can use the stock FPSChart: issue the StartFPSChart and StopFPSChart console commands at the start and end of your gameplay sequence. This will output a CSV file with raw data: frame time, game thread, render thread, GPU time, and dynamic resolution percentage. From these basic metrics, we can make a very handy visualization that makes it easy to evaluate both fine and averaged performance.

We can use any dataframe software (good candidates are pandas and polars for Python, DataFrames.jl for Julia, data.frame/data.table for R) and a plotting library of choice.

Here are the main capabilities to implement:

Loading the data set, and naming measurements.
Time-alignment of series.
Printing summary – mean/average FPS, thread times, GPU time.
Rolling average of the data.
Overlaid plotting of time series.
Aggregating multiple time series together (avg, min, max, median).
Overlaid plotting of distribution histograms.

With this set of functionality, we get the following full workflow:

1. Capture/script the gameplay sequence.
2. Make N measurements of the same sequence before and after the optimizations.
3. Run the script to collect and name the measured data (with some manual naming to discern before/after data files).
4. Load the data and align it by time (retaining only common time). Depending on the way you hook into replaying your sequence, this may be needed to alleviate possible latency. In the worst case, when you have no automation, this will be especially important.
5. Apply a rolling average to the data to smooth out the noise.
6. Print the summary to see if there is an apparent average/median change.
7. Plot overlaid aggregated graphs and analyze them (be careful with this view: while aggregating multiple runs can sometimes be beneficial, other times it can average too much or otherwise cause constructive interference spikes, which may be misleading depending on the degree of determinism).
8. Check measurements without aggregation, possibly toning down the rolling average, to see if there are specific fine changes that get lost in all the averaging or don't get reproduced in every run.
9. Plot distribution histograms, if necessary, for additional visualization.

This particular workflow can be employed for both small and large captures, but it especially excels in the latter case: the information is already thin (compared to a sampling profiler), so it's easy to capture longer time intervals, while averaging makes the data digestible.

TL;DR:

Add or use existing performance collection mechanisms.
Visualize data to better see the changes.
Check out both averaged and unaveraged data due to reproducibility artifacts: the less deterministic the game, the more averaging we need, but, to our dismay, the more fine changes get lost to noise and inconsistent behavior.
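A few of the listed capabilities (loading, time alignment, rolling average, summary) can be sketched with the Python standard library alone; a dataframe library such as pandas would express the same steps more compactly. The column names `time_s` and `frame_ms` and the inline CSV data are assumptions for this example and are not the actual FPSChart column names. Plotting is left out.

```python
import csv
import io
import statistics

def load_capture(text):
    """Parse CSV text into a list of (time_s, frame_ms) rows."""
    reader = csv.DictReader(io.StringIO(text))
    return [(float(r["time_s"]), float(r["frame_ms"])) for r in reader]

def align(a, b):
    """Keep only the samples inside the time range common to both captures."""
    start = max(a[0][0], b[0][0])
    end = min(a[-1][0], b[-1][0])

    def keep(rows):
        return [(t, v) for t, v in rows if start <= t <= end]

    return keep(a), keep(b)

def rolling_mean(values, window):
    """Trailing rolling average; the first samples use a shorter window."""
    return [statistics.fmean(values[max(0, i - window + 1): i + 1])
            for i in range(len(values))]

def summary(rows):
    """Mean and median frame time, plus the FPS implied by the mean."""
    values = [v for _, v in rows]
    mean_ms = statistics.fmean(values)
    return {"mean_ms": mean_ms,
            "median_ms": statistics.median(values),
            "avg_fps": 1000.0 / mean_ms}

# Hypothetical captures; the second one started half a second later.
BEFORE_CSV = "time_s,frame_ms\n0.0,20.0\n0.5,20.0\n1.0,30.0\n1.5,20.0\n2.0,20.0\n"
AFTER_CSV = "time_s,frame_ms\n0.5,16.0\n1.0,24.0\n1.5,16.0\n2.0,16.0\n2.5,16.0\n"

before, after = align(load_capture(BEFORE_CSV), load_capture(AFTER_CSV))
print(summary(before))
print(summary(after))
print(rolling_mean([v for _, v in before], window=2))
```

Aggregating several runs (avg, min, max, median per time step) would follow the same pattern once all captures are aligned to a common time range.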

Summary

Performance optimization is a multifarious exercise in which thinking up and implementing the optimization idea is often the smaller part of the work. The rest is measuring the effect, making trade-off decisions, spotting pessimizations and bugs, and communicating with the rest of the team. For all these tasks, reproducibility of the performance capture is important and very beneficial: it not only simplifies an already difficult endeavor, but also improves the precision of the data, which in turn informs better decisions.
