Going fast!

Software performance on modern systems

Dmytro Shynkar

A bit of context

I spent the last couple of years helping to build this game

A fast paced multiplayer FPS with destructible environments

  • Fast and responsive movement
  • Everything around is destructible
  • (Un)controllably chaotic battles

Performance is critical

Nobody would want to play like this

Though some still do...

So basically

Not only games

  • The Amazon latency-to-revenue study's sources are somewhat dubious
  • Mobile applications' battery usage
  • Cloud/infra costs
  • Much more

There's a reason big companies keep investing into faster software

Some examples

What is Performance?

  • How fast a single unit of work gets done - latency
  • How much work gets done in a unit of time - throughput

Latency

Latency

  • Often the main reason software feels slow
  • Crazy important for games and any other interactive media
  • Oftentimes heavily impacted by IO performance
  • Average values are dangerously misleading

Throughput

  • More a measure of the whole system's performance
  • Absolutely critical for non-interactive workloads

Naively: the lower the latency, the better the throughput

Just like gravity, Queueing Theory isn't kind

Bad engineering decisions can absolutely ruin both!
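Queueing theory's unkindness can be glimpsed through Little's law (stated here as a rule of thumb, not from the demo):

\[ L = \lambda W \]

where \(L\) is the mean number of requests in flight, \(\lambda\) the throughput, and \(W\) the mean latency. Drive a system toward full utilization to maximize \(\lambda\) and queues grow, so \(W\) climbs sharply.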

Other types of software performance

  • Scalability
  • Utilization
  • Energy efficiency
  • IOPS
  • Probably infinitely more

Hardware context

All software ultimately runs on some kind of hardware

Even if it takes a long chain of abstractions to get there

Why does it matter?

6502 Processor vs. Core i9 Processor
We'll see why and how shortly

Enough rambling

Demo time

How's the performance

Let's do some math!

Boids

A really simple algorithm giving good results

Each boid follows 3 simple rules

Alignment

Cohesion

Separation

For each boid out of N

  • ~6N flops of distance calculations + ~10 flops of rule logic
  • Depending on the implementation, a distance calculation may be just 2-3 flops
  • ~2.5 kflops per boid, or ~1M flops for 400 boids
  • Let's be generous and say 2M flops per iteration
  • So at 15 FPS we arrive at ~30M flop/s
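Where those flops come from, as a minimal sketch (hypothetical names, not the demo code): every boid scans all N others, so each frame is O(N²) pair checks.

```rust
// A minimal, hypothetical boid update: only the alignment rule is shown;
// cohesion and separation would accumulate in the same inner loop.
#[derive(Clone, Copy)]
struct Boid { x: f32, y: f32, vx: f32, vy: f32 }

fn update(boids: &mut [Boid], radius: f32) {
    let old = boids.to_vec(); // snapshot so every boid sees the same frame
    for b in boids.iter_mut() {
        let (mut ax, mut ay, mut n) = (0.0f32, 0.0f32, 0u32);
        for o in &old {
            let (dx, dy) = (o.x - b.x, o.y - b.y);
            if dx * dx + dy * dy < radius * radius { // ~5 flops per pair
                ax += o.vx; // alignment: average neighbors' velocity
                ay += o.vy;
                n += 1;
            }
        }
        if n > 0 {
            // steer a little toward the neighborhood average
            b.vx += (ax / n as f32 - b.vx) * 0.05;
            b.vy += (ay / n as f32 - b.vy) * 0.05;
        }
        b.x += b.vx;
        b.y += b.vy;
    }
}
```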

Profiling time

Except it's not that easy!

  • Frame time includes input handling and drawing
  • Profilers to the rescue!

Demo time

Profiling time

This time we basically guessed right

Let's correct to ~32M flop/s

This machine

  • AMD Ryzen 9 5900HS
  • 32GB DDR4
  • RTX 3060

47 GFlops/sec

Just on the CPU

Napkin math

1400x slower than the machine's hypothetical peak

Why?

How Python works

Let's look at what runs

I still write a lot of it

Rewrite it in Rust!

Is it better?

Demo time

Seems to be!

  • ~3x or 96Mflop/s
  • ~25x or 0.8Gflop/s

It's the same code though, what's the difference?

Abstract machine and optimization

  • Rust isn't "close to the metal"
  • Neither is C or C++
  • In every case we're programming against an abstract machine
  • That's where we get undefined behavior and other fun things from
  • The closest you can get to metal is assembly
  • It's not that scary!
  • PS: We are not touching Verilog

What next?

We are still 62x slower

We won't improve the algorithm

Let's profile!

Short intro to profilers

  • Instrumentation
  • Sampling

Instrumentation

  • Manually place zones to measure
  • Very low overhead
  • Uninstrumented code won't show up

Tracy, Unreal Insights, Optick

Sampling

  • Periodically stops the program to record the call stack
  • Aggregation can miss short-lived events
  • Vendor-specific ones provide a lot on top

VerySleepy, Intel VTune, AMD uProf, Superluminal

Both are very useful

State-of-the-art hybrid profilers are magic. Look at Magic Trace

Profiling time

Wait, are we slower now?

~0.65Gflop/sec

Let's take a closer look!

Memory architecture

This is quite bad!

Let's see why it's bad

Memory performance hasn't kept up with CPU performance

That's why we have caches!

Another famous example

At 4k boids we no longer fit into L2 cache!

Which is a whopping 4MB on this machine

Let's move the cold data out
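A hot/cold split might look like this (a sketch with hypothetical fields, not the demo's actual structs):

```rust
// Before: everything in one struct; rarely-used fields drag down
// cache density for the hot O(N^2) neighbor loop.
struct BoidFat {
    x: f32, y: f32, vx: f32, vy: f32, // hot: read every neighbor check
    name: String,                     // cold: only touched when rendering
    color: [u8; 4],                   // cold
}

// After: the hot loop iterates a dense array of small structs, so far
// more boids fit per cache line and in L2.
struct BoidHot { x: f32, y: f32, vx: f32, vy: f32 }
struct BoidCold { name: String, color: [u8; 4] }

// Parallel arrays: index i in `hot` and `cold` describes the same boid.
struct World { hot: Vec<BoidHot>, cold: Vec<BoidCold> }
```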

1.52x faster!

More about caches

For many things we can't hope to fit in cache

But we don't need to!

Predictability of access is key

Our memory layout

That's the default in Python and many other languages

Prefetcher doesn't like it!

We want this
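In Rust terms, the two layouts might look like this (a sketch, assuming the change is from pointer-chasing to contiguous storage):

```rust
#[derive(Clone, Copy)]
struct Boid { x: f32, y: f32, vx: f32, vy: f32 }

// Scattered: each boid is its own heap allocation (the shape a Python
// list of objects has). Next-element addresses are unpredictable, so
// the hardware prefetcher can't run ahead of the loop.
type Scattered = Vec<Box<Boid>>;

// Contiguous: boids packed back to back. Iteration is a linear scan
// with a constant stride, which the prefetcher handles almost for free.
type Contiguous = Vec<Boid>;
```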

What's the difference?

~95ms -> ~83ms, 15% improvement

No changes other than data layout!

1.15Gflop/sec

Still 41x slower!

Digging further!

Instruction table excerpt: µops, latency, CPI, port usage

Source: Agner Fog

Let's avoid it then

\[ \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} < max \Rightarrow (x_2 - x_1)^2 + (y_2 - y_1)^2 < max^2 \]
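In code, the trick is to compare squared distances and skip the sqrt entirely (a sketch):

```rust
// sqrt is one of the pricier float ops (see Agner Fog's tables); for a
// "is it within radius?" test we never need the actual distance.
fn within_radius(x1: f32, y1: f32, x2: f32, y2: f32, max: f32) -> bool {
    let (dx, dy) = (x2 - x1, y2 - y1);
    // comparing both sides squared gives the same answer, sqrt-free
    dx * dx + dy * dy < max * max
}
```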

->

What's the difference?

~83ms -> ~69ms, another 20%!

Note on VTune's supremacy

Going parallel

We're at ~1.4 Gflops/sec

Still 33x slower than our potential

But those 47Gflops are for the whole CPU!

We're using merely 12.5% of our CPU

Let's use more!

Boids lend themselves well to parallelism

  • Double buffer old and new boids
  • Give each thread a boid to update
  • Swap buffers

Kinda like this
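A sketch of the scheme with scoped threads (not the demo's exact code; the movement step stands in for the full flocking rules):

```rust
use std::thread;

#[derive(Clone, Copy)]
struct Boid { x: f32, vx: f32 }

// Double buffering: every thread reads the immutable `old` snapshot and
// writes a disjoint chunk of `new`, so no locks are needed.
// Assumes old.len() == new.len().
fn step_parallel(old: &[Boid], new: &mut [Boid], threads: usize) {
    let chunk = (old.len() + threads - 1) / threads; // ceil division
    thread::scope(|s| {
        for (i, out) in new.chunks_mut(chunk).enumerate() {
            s.spawn(move || {
                for (j, b) in out.iter_mut().enumerate() {
                    let me = old[i * chunk + j];
                    // the real code runs the flocking rules here
                    *b = Boid { x: me.x + me.vx, vx: me.vx };
                }
            });
        }
    }); // scope joins all threads before returning
}
```

After the step, the caller swaps the buffers (e.g. with `std::mem::swap`) and repeats.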

Let's see how it goes

Pretty good! 69ms -> 12.5ms

~7.6Gflop/sec

Sharing data

Was our threading schema a good idea?

It's all about caches again!

A modern CPU is a distributed system

A scary picture - MOESI state machine

That's how we get atomics and memory ordering

CPU thinks of data in cache lines

Usually that's a 64-byte chunk of data

You can't actually load less data from RAM

So using some amazing MS Paint art


This is called false sharing

Our boids are just 12 bytes each

We run a real chance of multiple threads contending!

Let's share in bigger chunks!
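Chunked distribution in a sketch: each thread takes a contiguous run of boids rounded to whole cache lines, so neighboring threads rarely write to the same line (sizes and names here are assumptions, not the demo's constants):

```rust
// 12-byte boids: ~5 fit in one 64-byte cache line. If thread A updates
// boid 0 while thread B updates boid 1, they ping-pong the same line
// between cores (false sharing). Contiguous, line-aligned chunks confine
// that to at most one boundary line per pair of neighboring threads.
const CACHE_LINE: usize = 64;
const BOID_SIZE: usize = 12;

fn chunk_ranges(n: usize, threads: usize) -> Vec<(usize, usize)> {
    let per_line = CACHE_LINE / BOID_SIZE; // boids per cache line
    let raw = (n + threads - 1) / threads; // naive even split
    // round the chunk up to a whole number of cache lines' worth of boids
    let chunk = ((raw + per_line - 1) / per_line) * per_line;
    (0..threads)
        .map(|t| (t * chunk, ((t + 1) * chunk).min(n)))
        .filter(|&(a, b)| a < b) // drop empty trailing chunks
        .collect()
}
```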

Demo Time

12.5ms -> 8.8ms, ~40% better!

~11.3Gflop/sec

Data parallelism

So far we have been generally doing one thing at a time

  • Load one boid
  • Load one more boid
  • Calculate distance
  • Compare to some threshold
  • ...

In other words we've been doing scalar processing

All modern CPUs are vector machines!

There's actually no scalar float processing on x86-64!

SIMD - Single Instruction Multiple Data

SIMD extensions are architecture-specific

  • X86-64: SSE, AVX2, AVX512
  • ARM: Neon, SVE
  • RISC-V: RVV
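A flavor of the idea, with plain 4-element arrays standing in for 128-bit vector registers (a sketch: real code would use std::simd, intrinsics, or one of the libraries mentioned later):

```rust
// Scalar processing: one squared distance per loop iteration.
fn dist2_scalar(ax: f32, ay: f32, bx: f32, by: f32) -> f32 {
    let (dx, dy) = (bx - ax, by - ay);
    dx * dx + dy * dy
}

// SIMD flavor: 4 squared distances at once. With std::simd's f32x4,
// each arithmetic line below compiles to a single vector instruction
// instead of this per-lane loop.
fn dist2_x4(ax: [f32; 4], ay: [f32; 4], bx: [f32; 4], by: [f32; 4]) -> [f32; 4] {
    let mut out = [0.0f32; 4];
    for lane in 0..4 {
        let (dx, dy) = (bx[lane] - ax[lane], by[lane] - ay[lane]);
        out[lane] = dx * dx + dy * dy;
    }
    out
}
```

Note this also needs the data laid out so that 4 boids' coordinates are adjacent, which is exactly why the compiler can't do it for you.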

Let's vectorize!

Demo Time

14.5ms

~6.7Gflop/sec

🤔

That's single threaded!

2.2ms = ~44 GFlop/sec

🚀

Let's call it!

Notes on SIMD implementation

Autovectorization won't cut it

Compiler won't drastically restructure data for you

An array of structures

is very different to

a structure of arrays

Compiler also won't rewrite your algorithm to be branchless

The earlier examples had optimizations enabled but achieved 0% vectorization

For much simpler cases compiler does help

So do you need to hand-roll assembly or horrible intrinsic-infested code?


No!

There are better and portable ways now!

  • Intel's ISPC
  • Rust's std::simd, which I used
  • Google's Highway for C++

Though sometimes you might still want to hand-roll ASM

Summary

We took a look at some of the performance factors

  • Execution model
  • Memory caches
  • Memory access and prefetching
  • Instruction cost
  • Parallelism
  • False sharing
  • Vectorization

We also skipped so many more

  • Branch prediction
  • Atomics and synchronization
  • Instruction scheduling
  • System call overhead
  • Memory bandwidth
  • IO: sync and async
  • And many more

We did get a 1350x boost in performance by...

  • Thinking about what the workload is
  • Measuring the performance
  • Following the data and showing mechanical sympathy

We could also get rid of the $O(n^2)$ algorithm and/or move to the GPU

But I'm doing X development, it doesn't apply!

  • No need to hand-optimize everything from the get-go
  • Just leave the possibility to optimize at all
  • Optimizability is a crucial non-functional requirement

Can you maybe...

  • Pick a compiled language: Rust, Go, Swift, even Java
  • Not over-generalize and abstract from the get-go
  • Bring your data closer to your processing
  • Make your code measurable and measure it
  • Consider efficiency as one of the engineering decision drivers

Some great resources

Thank you!

Questions?