Performant Code

Everyone loves software that feels nice and fast – instant feedback, quick boot times, getting the job done in as short amount of time as possible. On the other hand, everyone hates using slow software. Think about the last time you opened MS Word – how long did you have to wait before you could start writing?

The Point

In an environment where you’re billed per compute cycle, like cloud computing services, waste directly translates to higher bills. I have personally been on teams where thousands of dollars per month were saved by optimising a computation in the hotpath. In an embedded environment, where compute resources are fixed, suboptimal code can limit the overall capabilities of the system and lead to degredations in performance. When developing a product, you want a team that practices performant software engineering to ensure the success of your project.

Background

So how did we get here? In a world of libraries, frameworks, and seemingly infinite abstractions, it’s no wonder that the average program has a ton of stuff included that is not really used, nonetheless the computer still has to process all of it. Cache lines get clogged with unused code. Assets weighing millions or billions of bytes are loaded that the user never even looks at. Unimportant things are loaded before the things you care about.

How do we get high performance code? Not to be coy, but it’s by not writing low-performance code! But can it really be that simple? Yes, because the fastest code is the code that is never executed.

Cache

The interplay between cache lines, cache levels, data alignment, speculative execution and branch prediction are some things that can impact how quickly a CPU can get through a given piece of code. Designing for these are usually micro-optimisations, but they can add up quickly in the context of a large program. It is always a good idea to keep these in mind when writing code.

Data alignment, for example, is so easy to do that you might even be able to find a formatter for your editor that can do it for you. The idea is that in a struct you want types of the members to be ordered from greatest to least size, which minimises the amount of padding required. This can be a huge benefit when the struct is loaded into cache because not only do you improve the information density of your data, but you don’t push other data out of the way to make room for the waste. But it gets worse, with unaligned data the computer has to do even more cycles just moving this memory around, instead of doing the business logic computations.

Compilers

Compilers do a lot of impressive and amazing things, but they aren’t a magic wand for performant code; besides, optimisation isn’t even their primary job. The benchmark of a good compiler, I will argue, is it’s predictability. A compiler can only do a reasonably good and predictable job at converting high level programming language code into corresponding machine instructions. Sure, most compliers will give optimisation their best go, but if you intentionally or otherwise explicitly tell a compiler that you want more scopes created and more allocations made, it’s going to have to listen to you. You are, afterall, the advisor of your compiler.

It’s worth spending the time to get familiar with what kind of code your compiler emits, because once you understand that, you can then help guide your complier toward generating the kind of code you want to see instead of forcing it to guess, or worse: give up and do it long-hand. Matt Godbolt’s Compiler Explorer is a great learning tool for trying out different compilers on different architectures.

Assembly Comparison

Just the other day an engineer asked me what library to use for input validation. But in this program they only needed to ensure two things: 1) the first letter of the string is a capital letter, and 2) that the string length was not greater than 30. So I suggested to just write a validation function that will live in our codebase, and we can modify it or extend it as needed, rather than importing a new dependency. Let’s use Compiler Explorer to compare the output of two approaches to a hand-written validation function.

Optimal

Look at our two line function - there’s one allocation and all conditionals are in-lined. It’s about as efficient as you can make it a mere 25 assembly instructions, while still being clear enough to read.

func validate(s string) bool {
    l := len(s)
    return l > 0 && l < 30 && s[0] >= 'A' && s[0] <= 'Z'
}

Suboptimal

For comparison, check this unoptimised version clocking in at more than double the instruction count (60 vs 25) in the generated assembly code.

func validate(s string) bool {
    l := len(s)
    if (l < 0) {
        return false
    }

    if (l > 30) {
        return false
    }

    return s[0] >= 'A' && s[0] <= 'Z'
}

Although both functions do the same thing, they do it in drastically different ways.

About NB Embedded Pty Ltd

NB Embedded Pty Ltd specialises in smart electronic devices for commercial and industrial organisations in Australia, Europe, and Asia. Based in Brisbane, Australia, we have the expertise to turn your concept into a product ready for the market, enhance your current product range, or provide customised smart devices to improve your internal processes. Contact us for a fixed cost consultation to get practical advice on your project. Talk with us today.