I spent the last couple of years helping to build this game
A fast-paced multiplayer FPS with destructible environments
Performance is critical
Nobody would want to play like this
Though some still do...
There's a reason big companies keep investing in faster software
this is just what 15 minutes of googling shows
Naively: the lower the latency, the better the throughput
Just like gravity, Queueing Theory isn't kind
Bad engineering decisions can absolutely ruin both!
All software ultimately runs on some kind of hardware
Even if it takes a long chain of abstractions to get there
≠
Let's do some math!
A really simple algorithm giving good results
Each boid follows 3 simple rules
Alignment
Cohesion
Separation
For each boid out of N
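The per-boid loop above, as a minimal sketch. The rule weights, field names, and in-place update order here are hypothetical stand-ins, not the talk's actual code:

```cpp
#include <cassert>
#include <vector>

// Hypothetical minimal boid: position and velocity.
struct Boid { float x, y, vx, vy; };

// One naive O(N^2) step: every boid looks at every other boid within
// `radius` and applies the three rules (alignment, cohesion, separation).
void step(std::vector<Boid>& boids, float radius, float dt) {
    const float r2 = radius * radius;
    for (auto& b : boids) {
        float ax = 0, ay = 0;  // alignment: sum of neighbour velocities
        float cx = 0, cy = 0;  // cohesion: sum of neighbour positions
        float sx = 0, sy = 0;  // separation: push away from neighbours
        int n = 0;
        for (const auto& o : boids) {
            float dx = o.x - b.x, dy = o.y - b.y;
            float d2 = dx * dx + dy * dy;
            if (&o == &b || d2 > r2) continue;   // skip self and far boids
            ax += o.vx; ay += o.vy;
            cx += o.x;  cy += o.y;
            sx -= dx;   sy -= dy;
            ++n;
        }
        if (n > 0) {
            // Weights are arbitrary illustration values.
            b.vx += 0.05f * (ax / n - b.vx)   // steer toward mean velocity
                  + 0.01f * (cx / n - b.x)    // steer toward centre of mass
                  + 0.05f * sx;               // steer away from crowding
            b.vy += 0.05f * (ay / n - b.vy)
                  + 0.01f * (cy / n - b.y)
                  + 0.05f * sy;
        }
        b.x += b.vx * dt;
        b.y += b.vy * dt;
    }
}
```

The inner loop over all N boids for each of the N boids is where the O(N²) flop count comes from.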
Except it's not that easy!
This time we basically guessed right
Let's correct that to ~32 Mflop/sec
1400x slower than the machine's hypothetical peak speed
Why?
I still write a lot of it
Is it better?
It's the same code though, what's the difference?
We are still 62x slower
We won't improve the algorithm
Let's profile!
Tracy, Unreal Insights, Optick
VerySleepy, Intel VTune, AMD uProf, Superluminal
Both are very useful
State-of-the-art hybrid profilers are magic. Look at magic-trace
Wait, are we slower now?
~0.65Gflop/sec
Let's take a closer look!
This is quite bad!
Let's see why it's bad
Memory performance hasn't kept up with CPU performance
That's why we have caches!
Another famous example
At 4k boids we no longer fit into L2 cache!
Which is a whopping 4 MB on this machine
Let's move the cold data out
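A sketch of what a hot/cold split can look like. The cold fields here are hypothetical stand-ins for whatever per-boid data the frame loop never touches:

```cpp
#include <cassert>

// Before: hot simulation fields interleaved with cold data,
// so every cache line we pull in is mostly wasted.
struct BoidAoS {
    float x, y, vx, vy;   // hot: read every frame, for every pair
    char  name[32];       // cold: only for debug UI (hypothetical field)
    int   team, skin;     // cold (hypothetical fields)
};

// After: hot data packed densely; cold data lives in a parallel array
// indexed by the same boid index.
struct BoidHot  { float x, y, vx, vy; };
struct BoidCold { char name[32]; int team, skin; };

static_assert(sizeof(BoidHot) == 16, "4 hot boids per 64-byte cache line");
```

Same total data, but the inner loop now streams only the 16 hot bytes per boid, so far more boids fit in L2.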
1.52x faster!
For many things we can't hope to fit in cache
But we don't need to!
Predictability of access is key
Our memory layout
That's the default in Python and many other languages
Prefetcher doesn't like it!
We want this
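Roughly, the two layouts side by side (field names assumed). Array-of-Structs is the default object layout; Struct-of-Arrays gives each field its own dense, contiguous stream:

```cpp
#include <cassert>
#include <vector>

// Array-of-Structs: each boid's fields sit together, so iterating one
// field strides through memory. In Python and similar languages a list
// of objects is even worse: every element is a pointer chase.
struct Boid { float x, y, vx, vy; };
using BoidsAoS = std::vector<Boid>;

// Struct-of-Arrays: each field is one contiguous array — exactly the
// linear, predictable access pattern the hardware prefetcher rewards.
struct BoidsSoA {
    std::vector<float> x, y, vx, vy;
};
```

The simulation code barely changes; only the indexing does (`boids[i].x` becomes `boids.x[i]`).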
What's the difference?
~95ms -> ~83ms, 15% improvement
No changes other than data layout!
1.15Gflop/sec
Still 41x slower!
µops | latency | CPI | port
Source: Agner Fog
Let's avoid it then
\[ \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} < max \;\Rightarrow\; (x_2 - x_1)^2 + (y_2 - y_1)^2 < max^2 \]
->
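The same transformation in code. Both sides of the comparison are non-negative, so squaring preserves the inequality and the predicate is unchanged (function names are illustrative):

```cpp
#include <cassert>
#include <cmath>

// Square root is one of the slower common float operations
// (high latency, per Agner Fog's instruction tables).
bool is_close_slow(float dx, float dy, float max) {
    return std::sqrt(dx * dx + dy * dy) < max;  // pays for a sqrt per pair
}

// For a "closer than max?" test we never need the actual distance:
// compare squared values instead.
bool is_close_fast(float dx, float dy, float max) {
    return dx * dx + dy * dy < max * max;       // same predicate, no sqrt
}
```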
What's the difference?
~83ms -> ~69ms, another 20%!
Note on VTune's supremacy
We're at ~1.4 Gflops/sec
Still 33x slower than our potential
But those 47 Gflop/sec are for the whole CPU!
We're using merely 12.5% of our CPU
Let's use more!
Boids lend themselves well to parallelism
Kinda like this
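A sketch of that partitioning with plain `std::thread`: each thread gets one contiguous slice of the boid range, reads the shared input snapshot, and writes only its own slice, so no locks are needed. The per-boid kernel here is a trivial stand-in for the real rules loop:

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Stand-in for the real per-boid update (the O(N^2) rules loop).
void update_one(const std::vector<float>& in, std::vector<float>& out,
                std::size_t i) {
    out[i] = in[i] + 1.0f;
}

// Split [0, n) into one contiguous slice per thread.
void parallel_step(const std::vector<float>& in, std::vector<float>& out,
                   unsigned num_threads) {
    std::vector<std::thread> pool;
    std::size_t n = in.size();
    for (unsigned t = 0; t < num_threads; ++t) {
        std::size_t begin = n * t / num_threads;
        std::size_t end   = n * (t + 1) / num_threads;
        pool.emplace_back([&, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                update_one(in, out, i);   // writes only this thread's slice
        });
    }
    for (auto& th : pool) th.join();
}
```

Reading from an immutable snapshot and writing to a separate output buffer is what makes the split embarrassingly parallel.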
Let's see how it goes
Pretty good! 69ms -> 12.5ms
~7.6Gflop/sec
Was our threading scheme a good idea?
It's all about caches again!
Modern CPU is a distributed system
A scary picture - MOESI state machine
That's how we get atomics and memory ordering
CPU thinks of data in cache lines
Usually that's a 64-byte chunk of data
You can't actually load less data from RAM
So using some amazing MS Paint art
This is called false sharing
Our boids are just 12 bytes each
We run a real chance of multiple threads contending!
Let's share in bigger chunks!
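One way to implement "bigger chunks", sketched below under the slides' numbers (12-byte boids, 64-byte cache lines): round the per-thread chunk boundaries so they always land on a cache-line boundary, meaning no line is ever written by two threads. This helper is a hypothetical sketch, not the talk's actual code:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>   // std::lcm
#include <utility>
#include <vector>

constexpr std::size_t kBoidSize  = 12;  // bytes per boid (from the slides)
constexpr std::size_t kCacheLine = 64;  // typical x86 cache line

// Smallest boid count whose total byte size is a whole number of cache
// lines: lcm(12, 64) / 12 = 16 boids = 192 bytes = exactly 3 lines.
constexpr std::size_t kChunk = std::lcm(kBoidSize, kCacheLine) / kBoidSize;

// Split [0, n) into per-thread ranges whose boundaries are multiples of
// kChunk; only the final range may end off a chunk boundary.
std::vector<std::pair<std::size_t, std::size_t>>
partition(std::size_t n, std::size_t threads) {
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    std::size_t chunks = (n + kChunk - 1) / kChunk;
    std::size_t begin = 0;
    for (std::size_t t = 0; t < threads; ++t) {
        std::size_t end = std::min(n, (chunks * (t + 1) / threads) * kChunk);
        ranges.emplace_back(begin, end);
        begin = end;
    }
    return ranges;
}
```

With interleaved or byte-granular splits, two threads can own boids on the same line and the line ping-pongs between cores; chunking at cache-line multiples removes that contention.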
12.5ms -> 8.8ms, ~40% better!
~11.3Gflop/sec
So far we have been generally doing one thing at a time
In other words we've been doing scalar processing
All modern CPUs are vector machines!
There's actually no scalar float processing on x86-64!
SIMD - Single Instruction Multiple Data
SIMD extensions are architecture-specific
Let's vectorize!
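A small taste of what vectorizing the distance kernel looks like, using x86-64 SSE intrinsics (architecture-specific, as noted above; portable wrappers exist and come up later). The function names are illustrative, and the sketch assumes `n` is a multiple of 4:

```cpp
#include <cassert>
#include <immintrin.h>   // SSE intrinsics — x86-64 only

// Scalar: one squared distance per iteration.
void dist2_scalar(const float* dx, const float* dy, float* out, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = dx[i] * dx[i] + dy[i] * dy[i];
}

// SSE: four squared distances per iteration — each intrinsic applies
// one operation to a whole 128-bit lane of 4 floats.
void dist2_simd(const float* dx, const float* dy, float* out, int n) {
    for (int i = 0; i < n; i += 4) {       // assumes n % 4 == 0
        __m128 vx = _mm_loadu_ps(dx + i);
        __m128 vy = _mm_loadu_ps(dy + i);
        __m128 d2 = _mm_add_ps(_mm_mul_ps(vx, vx), _mm_mul_ps(vy, vy));
        _mm_storeu_ps(out + i, d2);
    }
}
```

Note this only pays off because of the earlier SoA work: `dx`/`dy` must be dense float streams for the 4-wide loads to make sense.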
14.5ms
~6.7Gflop/sec
🤔
That's single threaded!
2.2ms = ~44 GFlop/sec
🚀
Let's call it!
Autovectorization won't cut it
Compiler won't drastically restructure data for you
is very different from
Compiler also won't rewrite your algorithm to be branchless
Examples before had optimization enabled but achieved 0% vectorization
For much simpler cases compiler does help
So do you need to hand-roll assembly or horrible intrinsic-infested code?
No!
There are better and portable ways now!
Though sometimes you might want to still hand-roll ASM
We took a look at some of the performance factors
We also skipped so many more
We did get a 1350x boost in performance, by...
We could also get rid of the $O(n^2)$ algorithm and/or move to the GPU
But I'm doing X development, it doesn't apply!
Can you maybe...
Some great resources
Thank you!
Questions?