Going fast!

Software performance on modern systems

Dmytro Shynkar

A bit of context

I spent the last couple of years helping to build this game

A fast paced multiplayer FPS with destructible environments

  • Fast and responsive movement
  • Everything around is destructible
  • (Un)controllably chaotic battles

Performance is critical

Nobody would want to play like this

Though some still do...

So basically

Not only games

  • The Amazon latency-to-revenue study's sources are somewhat dubious
  • Mobile applications' battery usage
  • Cloud/infra costs
  • Much more

There's a reason big companies keep investing into faster software

Some examples

What is Performance?

  • How fast a single unit of work gets done - latency
  • How much work gets done in a unit of time - throughput

Latency

Latency

  • Often the main reason software feels slow
  • Crazy important for games and any other interactive media
  • Oftentimes heavily impacted by IO performance
  • Average values are dangerously misleading

Throughput

  • More a measure of the whole system's performance
  • Absolutely critical for non-interactive workloads

Naively: the lower the latency, the better the throughput

Just like gravity, Queueing Theory isn't kind

Bad engineering decisions can absolutely ruin both!
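Queueing theory's unkindness can be glimpsed through Little's law (stated here as a rule of thumb, not from the demo):

\[ L = \lambda W \]

where \(L\) is the mean number of requests in flight, \(\lambda\) the throughput, and \(W\) the mean latency. Drive a system toward full utilization to maximize \(\lambda\) and queues grow, so \(W\) climbs sharply.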

Other types of software performance

  • Scalability
  • Utilization
  • Energy efficiency
  • IOPS
  • Probably infinitely more

Hardware context

All software ultimately runs on some kind of hardware

Even if it takes a long chain of abstractions to get there

Why does it matter?

6502 Processor vs. Core i9 Processor
We'll see why and how shortly

Enough rambling

Demo time

How's the performance

Let's do some math!

Boids

A really simple algorithm giving good results

Each boid follows 3 simple rules

Alignment

Cohesion

Separation

For each boid out of N

  • ~6N flops of distance calculations + ~10 flops of rule logic
  • Depending on the implementation, a distance calculation may be just 2-3 flops
  • ~2.5 kflops per boid, or ~1M flops for 400 boids
  • Let's be generous and say 2M flops per iteration
  • So at 15 FPS we arrive at ~30M flop/s
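Where those flops come from, as a minimal sketch (hypothetical names, not the demo code): every boid scans all N others, so each frame is O(N²) pair checks.

```rust
// A minimal, hypothetical boid update: only the alignment rule is shown;
// cohesion and separation would accumulate in the same inner loop.
#[derive(Clone, Copy)]
struct Boid { x: f32, y: f32, vx: f32, vy: f32 }

fn update(boids: &mut [Boid], radius: f32) {
    let old = boids.to_vec(); // snapshot so every boid sees the same frame
    for b in boids.iter_mut() {
        let (mut ax, mut ay, mut n) = (0.0f32, 0.0f32, 0u32);
        for o in &old {
            let (dx, dy) = (o.x - b.x, o.y - b.y);
            if dx * dx + dy * dy < radius * radius { // ~5 flops per pair
                ax += o.vx; // alignment: average neighbors' velocity
                ay += o.vy;
                n += 1;
            }
        }
        if n > 0 {
            // steer a little toward the neighborhood average
            b.vx += (ax / n as f32 - b.vx) * 0.05;
            b.vy += (ay / n as f32 - b.vy) * 0.05;
        }
        b.x += b.vx;
        b.y += b.vy;
    }
}
```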

Profiling time

Except it's not that easy!

  • Frame time includes input handling and drawing
  • Profilers to the rescue!

Demo time

Profiling time

This time we basically guessed right

Let's correct to ~32M flop/s

This machine

  • AMD Ryzen 9 5900HS
  • 32GB DDR4
  • RTX 3060

47 GFlops/sec

Just on the CPU

Napkin math

1400x slower than the machine's hypothetical peak

Why?

How Python works

Let's look at what runs

I still write a lot of it

Rewrite it in Rust!

Is it better?

Demo time

Seems to be!

  • ~3x or 96Mflop/s
  • ~25x or 0.8Gflop/s

It's the same code though, what's the difference?

Abstract machine and optimization

  • Rust isn't "close to the metal"
  • Neither is C or C++
  • In every case we're programming against an abstract machine
  • That's where we get undefined behavior and other fun things from
  • The closest you can get to metal is assembly
  • It's not that scary!
  • PS: We are not touching Verilog

What next?

We are still 62x slower

We won't improve the algorithm

Let's profile!

Short intro to profilers

  • Instrumentation
  • Sampling

Instrumentation

  • Manually place zones to measure
  • Very low overhead
  • Uninstrumented code won't show up

Tracy, Unreal Insights, Optick

Sampling

  • Periodically stops the program to record the call stack
  • Aggregation can miss short-lived events
  • Vendor-specific ones provide a lot on top

VerySleepy, Intel VTune, AMD uProf, Superluminal

Both are very useful

State-of-the-art hybrid profilers are magic. Look at Magic Trace

Profiling time

Wait, are we slower now?

~0.65Gflop/sec

Let's take a closer look!

Memory architecture

This is quite bad!

Let's see why it's bad

Memory performance hasn't kept up with CPU performance

That's why we have caches!

Another famous example

At 4k boids we no longer fit into L2 cache!

Which is a whopping 4MB on this machine

Let's move the cold data out
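A hot/cold split might look like this (a sketch with hypothetical fields, not the demo's actual structs):

```rust
// Before: everything in one struct; rarely-used fields drag down
// cache density for the hot O(N^2) neighbor loop.
struct BoidFat {
    x: f32, y: f32, vx: f32, vy: f32, // hot: read every neighbor check
    name: String,                     // cold: only touched when rendering
    color: [u8; 4],                   // cold
}

// After: the hot loop iterates a dense array of small structs, so far
// more boids fit per cache line and in L2.
struct BoidHot { x: f32, y: f32, vx: f32, vy: f32 }
struct BoidCold { name: String, color: [u8; 4] }

// Parallel arrays: index i in `hot` and `cold` describes the same boid.
struct World { hot: Vec<BoidHot>, cold: Vec<BoidCold> }
```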

1.52x faster!

More about caches

For many things we can't hope to fit in cache

But we don't need to!

Predictability of access is key

Our memory layout

That's the default in Python and many other languages

Prefetcher doesn't like it!

We want this
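In Rust terms, the two layouts might look like this (a sketch, assuming the change is from pointer-chasing to contiguous storage):

```rust
#[derive(Clone, Copy)]
struct Boid { x: f32, y: f32, vx: f32, vy: f32 }

// Scattered: each boid is its own heap allocation (the shape a Python
// list of objects has). Next-element addresses are unpredictable, so
// the hardware prefetcher can't run ahead of the loop.
type Scattered = Vec<Box<Boid>>;

// Contiguous: boids packed back to back. Iteration is a linear scan
// with a constant stride, which the prefetcher handles almost for free.
type Contiguous = Vec<Boid>;
```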

What's the difference?

~95ms -> ~83ms, 15% improvement

No changes other than data layout!

1.15Gflop/sec

Still 41x slower!

Digging further!

Instruction table excerpt: µops, latency, CPI, port usage

Source: Agner Fog

Let's avoid it then

\[ \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} < max \Rightarrow (x_2 - x_1)^2 + (y_2 - y_1)^2 < max^2 \]
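In code, the trick is to compare squared distances and skip the sqrt entirely (a sketch):

```rust
// sqrt is one of the pricier float ops (see Agner Fog's tables); for a
// "is it within radius?" test we never need the actual distance.
fn within_radius(x1: f32, y1: f32, x2: f32, y2: f32, max: f32) -> bool {
    let (dx, dy) = (x2 - x1, y2 - y1);
    // comparing both sides squared gives the same answer, sqrt-free
    dx * dx + dy * dy < max * max
}
```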

->

What's the difference?

~83ms -> ~69ms, another 20%!

Note on VTune's supremacy

Going parallel

We're at ~1.4 Gflops/sec

Still 33x slower than our potential

But those 47Gflops are for the whole CPU!

We're using merely 12.5% of our CPU

Let's use more!

Boids lend themselves well to parallelism

  • Double buffer old and new boids
  • Give each thread a boid to update
  • Swap buffers

Kinda like this
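A sketch of the scheme with scoped threads (not the demo's exact code; the movement step stands in for the full flocking rules):

```rust
use std::thread;

#[derive(Clone, Copy)]
struct Boid { x: f32, vx: f32 }

// Double buffering: every thread reads the immutable `old` snapshot and
// writes a disjoint chunk of `new`, so no locks are needed.
// Assumes old.len() == new.len().
fn step_parallel(old: &[Boid], new: &mut [Boid], threads: usize) {
    let chunk = (old.len() + threads - 1) / threads; // ceil division
    thread::scope(|s| {
        for (i, out) in new.chunks_mut(chunk).enumerate() {
            s.spawn(move || {
                for (j, b) in out.iter_mut().enumerate() {
                    let me = old[i * chunk + j];
                    // the real code runs the flocking rules here
                    *b = Boid { x: me.x + me.vx, vx: me.vx };
                }
            });
        }
    }); // scope joins all threads before returning
}
```

After the step, the caller swaps the buffers (e.g. with `std::mem::swap`) and repeats.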

Let's see how it goes

Pretty good! 69ms -> 12.5ms

~7.6Gflop/sec

Sharing data

Was our threading schema a good idea?

It's all about caches again!

A modern CPU is a distributed system

A scary picture - MOESI state machine

That's how we get atomics and memory ordering

CPU thinks of data in cache lines

Usually that's a 64-byte chunk of data

You can't actually load less data from RAM

So using some amazing MS Paint art


This is called false sharing

Our boids are just 12 bytes each

We run a real chance of multiple threads contending!

Let's share in bigger chunks!
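Chunked distribution in a sketch: each thread takes a contiguous run of boids rounded to whole cache lines, so neighboring threads rarely write to the same line (sizes and names here are assumptions, not the demo's constants):

```rust
// 12-byte boids: ~5 fit in one 64-byte cache line. If thread A updates
// boid 0 while thread B updates boid 1, they ping-pong the same line
// between cores (false sharing). Contiguous, line-aligned chunks confine
// that to at most one boundary line per pair of neighboring threads.
const CACHE_LINE: usize = 64;
const BOID_SIZE: usize = 12;

fn chunk_ranges(n: usize, threads: usize) -> Vec<(usize, usize)> {
    let per_line = CACHE_LINE / BOID_SIZE; // boids per cache line
    let raw = (n + threads - 1) / threads; // naive even split
    // round the chunk up to a whole number of cache lines' worth of boids
    let chunk = ((raw + per_line - 1) / per_line) * per_line;
    (0..threads)
        .map(|t| (t * chunk, ((t + 1) * chunk).min(n)))
        .filter(|&(a, b)| a < b) // drop empty trailing chunks
        .collect()
}
```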

Demo Time

12.5ms -> 8.8ms, ~40% better!

~11.3Gflop/sec

Data parallelism

So far we have been generally doing one thing at a time

  • Load one boid
  • Load one more boid
  • Calculate distance
  • Compare to some threshold
  • ...

In other words we've been doing scalar processing

All modern CPUs are vector machines!

There's actually no scalar float processing on x86-64!

SIMD - Single Instruction Multiple Data

SIMD extensions are architecture-specific

  • X86-64: SSE, AVX2, AVX512
  • ARM: Neon, SVE
  • RISC-V: RVV
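A flavor of the idea, with plain 4-element arrays standing in for 128-bit vector registers (a sketch: real code would use std::simd, intrinsics, or one of the libraries mentioned later):

```rust
// Scalar processing: one squared distance per loop iteration.
fn dist2_scalar(ax: f32, ay: f32, bx: f32, by: f32) -> f32 {
    let (dx, dy) = (bx - ax, by - ay);
    dx * dx + dy * dy
}

// SIMD flavor: 4 squared distances at once. With std::simd's f32x4,
// each arithmetic line below compiles to a single vector instruction
// instead of this per-lane loop.
fn dist2_x4(ax: [f32; 4], ay: [f32; 4], bx: [f32; 4], by: [f32; 4]) -> [f32; 4] {
    let mut out = [0.0f32; 4];
    for lane in 0..4 {
        let (dx, dy) = (bx[lane] - ax[lane], by[lane] - ay[lane]);
        out[lane] = dx * dx + dy * dy;
    }
    out
}
```

Note this also needs the data laid out so that 4 boids' coordinates are adjacent, which is exactly why the compiler can't do it for you.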

Let's vectorize!

Demo Time

14.5ms

~6.7Gflop/sec

🤔

That's single threaded!

2.2ms = ~44 GFlop/sec

🚀

Let's call it!

Notes on SIMD implementation

Autovectorization won't cut it

Compiler won't drastically restructure data for you

An array of structures

is very different to

a structure of arrays

Compiler also won't rewrite your algorithm to be branchless

The earlier examples had optimizations enabled but achieved 0% vectorization

For much simpler cases compiler does help

So do you need to hand-roll assembly or horrible intrinsic-infested code?


No!

There are better and portable ways now!

  • Intel's ISPC
  • Rust's std::simd, which I used
  • Google's Highway for C++

Though sometimes you might still want to hand-roll ASM

Summary

We took a look at some of the performance factors

  • Execution model
  • Memory caches
  • Memory access and prefetching
  • Instruction cost
  • Parallelism
  • False sharing
  • Vectorization

We also skipped so many more

  • Branch prediction
  • Atomics and synchronization
  • Instruction scheduling
  • System call overhead
  • Memory bandwidth
  • IO: sync and async
  • And many more

We did get a 1350x boost in performance by...

  • Thinking about what the workload is
  • Measuring the performance
  • Following the data and showing mechanical sympathy

We could also get rid of the $O(n^2)$ algorithm and/or move to the GPU

But I'm doing X development, it doesn't apply!

  • No need to hand-optimize everything from the get-go
  • Just leave the possibility to optimize at all
  • Optimizability is a crucial non-functional requirement

Can you maybe...

  • Pick a compiled language: Rust, Go, Swift, even Java
  • Not over-generalize and abstract from the get-go
  • Bring your data closer to your processing
  • Make your code measurable and measure it
  • Consider efficiency as one of the engineering decision drivers

Some great resources

Thank you!

Questions?