I’ve always been intrigued by Assembly. Every time I’ve looked at the generated Assembly for my C or Zig code, it’s always been interesting to see:
- How the simple things we take for granted are not so trivial at the machine level.
- How compilers cleverly optimise the source code to take advantage of the structure of Assembly.
I believe that one of the best ways to learn something is to take a dive and try implementing it yourself. Based on this, we’ll go on a journey here to implement memcpy in asm, and see what we can learn about machine code.
By the end of this article, you should have a newfound appreciation of all the wonderful work compilers do to turn our high-level code into machine code. You should also have a greater understanding of when it is necessary to drop down to Assembly to utilise the machine’s resources beyond what a compiler can achieve.
This article is written for x86_64 assembly running on Linux, linked using the ELF64 format, and conforms to the System-V ABI with the AMD64 calling convention. If any of those words didn’t make sense to you, stop here and go through the intro and lesson 0 of this amazing introduction to Assembly.
I use an Intel processor, so most of the processor-specific information comes from Intel, but there is a significant overlap with AMD since they both use the x64 instruction set.
Defining the Interface
While we are writing the entire thing in Assembly, we still want it to be callable from our C code. To this end, add this to your main.c
file:
#include "stdint.h"
extern void* asmcpy(void *restrict dest, void *restrict source, uint64_t count);
The extern
keyword allows us to define a prototype for a function which exists somewhere outside the specified C code. It will be dynamically or statically linked into the generated executable file during the linking phase.
asmcpy takes in a pointer to the destination, a pointer to the source, and a count of the bytes to move between the two locations. Even though we’re not going to be compiling the asmcpy routine (it’s already in Assembly), the restrict keyword is still a good hint to let the users of this function know that our function does not expect the source and destination pointers to overlap.
Taking the destination as the first argument and the source as the second argument makes things line up very nicely with the rdi and rsi registers, which is going to come in handy later.
Hello World
We’re going to try to get a “Hello World” up and running first, before we get into the guts of this. For now, I want to make sure that our function is receiving the source and destination arguments properly. First, here’s our C file:
#include "stdint.h"
extern void asmcpy(void *restrict dest, const void *restrict source, uint64_t count);
int main() {
char *source = "source\n";
char *dest = "dest";
asmcpy(dest, source, 8);
}
Here’s our asmcpy.asm
file:
global asmcpy
section .text
asmcpy:
mov rax, 1
mov rdi, 1
;rsi already contains a pointer to the source data
;rdx already contains the length of the message
syscall
ret
Okay, let’s break this down:
global asmcpy
is a directive to expose the asmcpy
block so it can be accessed during the linking process. The asmcpy
exposed here is the same one being expected by the extern void asmcpy
line in our C program. It is the job of the linker to combine these symbols into a final executable.
section .text contains executable instructions. In this case, we only have the asmcpy block, but there could be a lot more.
In our asmcpy block, we have the actual instructions executed for this routine. This is as good a time as any to remind you that we’re using the System-V Application Binary Interface, so the arguments passed to our function are stored in rdi, rsi, rdx, rcx, r8, and r9 for the 1st, 2nd, 3rd, 4th, 5th, and 6th arguments, respectively.
In order to write to the console, we’ll store the number which corresponds to POSIX write in rax, and trigger an interrupt with the syscall instruction. The interrupt handler will read the contents of rax and call the corresponding function from the pre-defined syscalls.
mov rax, 1 communicates that we want to access POSIX write to print to the console, and we’ll use the same calling convention to specify the arguments we want to pass to write.
mov rdi, 1 passes 1 as the first argument to POSIX write. As a reminder, here’s the function prototype for POSIX write:
ssize_t write(int fd, const void *buf, size_t count);
1 in this case corresponds to stdout, which is the console output.
rsi and rdx — which correspond to the buf and count arguments for write, respectively — have already been set to point to the source buffer and the count of bytes to move, so we’re good.
With all of this in place, the syscall instruction triggers the interrupt I mentioned earlier, and the source buffer should be printed to the screen. ret returns to the calling function (int main, in this case).
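For comparison, here’s a rough C equivalent of what our little asm stub ends up doing, expressed through libc’s write wrapper (a sketch for illustration only, not part of the article’s build):

#include <unistd.h>

int main(void) {
    const char *source = "source\n";
    /* fd 1 = stdout; roughly the same call our asm stub makes through the kernel */
    write(1, source, 7);
    return 0;
}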
Assembling, Compiling, and Linking
First, compile main.c
to object code:
gcc -c -g -O3 main.c -o main.o
Then, assemble the asmcpy
file:
nasm -f elf64 asmcpy.asm -o asmcpy.o #Install nasm if you don't have it
Link the two files:
gcc main.o asmcpy.o -o main
And finally, we can run the code:
./main
#Output: source
If you followed all the steps, you should have some output on your console screen.
Getting Down To Business
Alright, now that everything’s wired up properly, it’s time to work on our implementation of asmcpy. If you’re going to get reasonably good with Assembly, I’d advise that you get used to reading the documentation for different instructions. Felix Cloutier’s x86 reference is a good start.
Note: If this is your first time doing this, you’re most likely going to get overwhelmed by the sheer number of instructions you’ll see in that reference. In that case, I advise you to slow down, take a deep breath, and only read up on the instructions you immediately need. Over time, you’ll get better at using the reference. I’ll do a short demonstration below.
Reading Assembly Instruction References
The instruction we’re after is MOVS, which moves data from a string to another string. Let’s analyse the reference page in sections:
The first section mentions the name of the instruction, as well as the different variants of it. It’s quite common for an instruction to have different variants for operating on bytes (MOVSB), words (2 bytes, MOVSW), double words (MOVSD), or others (not applicable here).
Next, we have the Instruction Summary Table, which organises information about the instruction, its variants, the opcodes, the short description, and other bits of knowledge:
The Opcode Column: Stores the opcode — the byte(s) which will be in the instruction stream fed to the processor. You’ll notice that there are different opcodes for each of the variants. This is how the processor knows which of these to execute.
The Instruction column shows the syntax of the instruction, along with the expected operands for each form. In the first entry, for example, MOVS m8, m8 indicates that this instruction takes an 8-bit/1-byte memory location as the first operand, and another 8-bit/1-byte memory location as the second operand. But how did I know that m8 meant 8-bit memory? I’ll let you know in a bit.
The Op/En column contains a key into the Instruction Operand Encoding Table right below this Instruction Summary Table, which provides the information on each of the operands for this instruction, if any.
64-Bit Mode tells us if this opcode is supported in 64-bit mode. For the first entry, valid indicates that this opcode is supported. For most instructions that you run, you won’t need to think too much about this.
Compatibility Mode does the same thing, but for the compatibility mode used by processors. Again, most of the legacy/compatibility-related material in the references is not going to be particularly relevant for common scenarios.
Description does exactly what it sounds like. We can infer from this line that this form of the instruction moves a byte from the memory address held in RSI to the memory address held in RDI.
Instruction Operand Encoding Table
This table gives more information about the operands to an instruction. Each entry in this table takes the Op/En field from the Instruction Summary Table and specifies whether it has one, two, three, or even four operands, as well as whether the operands are registers, memory operands, or immediate values.
Using the first entry from the Instruction Summary Table as an example, the ZO
key there tells us that MOVS m8, m8
takes no operands.
But wait, how can an instruction with two operands in its syntax take no operands? I’m going to save you the stress of getting lost on that one and tell you that MOVS m8, m8 only exists to document what MOVSB does. The same applies to MOVS m16, m16 and MOVSW, MOVS m32, m32 and MOVSD, and so on. This is confirmed by the fact that MOVSB and MOVS m8, m8 share the same opcode in the table.
Description
Description gives a more thorough breakdown of what exactly the instruction does, its operation procedure, etc.
Before we proceed, this is where I’ll recommend Intel’s Software Developer Manuals, specifically the combined Volume 2, which contains a complete reference for the instruction set and other related, useful information. This is also how I knew that m8 refers to 8-bit/1-byte memory.
MOVSB
is a relatively easy-to-understand instruction, so you can get the gist of it without consulting the manual, but others aren’t so straightforward.
Operation, Flags, and The Other Fields
Operation gives a pseudocode explanation of what this instruction does.
Flags lets us know which flags, if any, are modified by the instruction’s execution.
Most of the other fields explain the various exceptions which can be thrown by this instruction for various reasons. It’s worth taking a glance at, as they can be useful when debugging.
In case you haven’t put it together from the reference yet (no shame there), MOVS moves a byte, word, doubleword, or quadword from the memory address in RSI/ESI to the address in RDI/EDI. It then advances or decrements both registers by 1, 2, 4, or 8 bytes (depending on the variant), with the direction determined by the direction flag. The REP prefix allows us to repeat this operation until RCX reaches zero.
With all of this information, we’re ready for the first implementation of asmcpy. I encourage you to try it out on your own, though, before you proceed.
Implementing Asmcpy
Here we go.
global asmcpy
section .text
asmcpy:
cld
mov rcx, rdx
rep movsb
sub rdi, rdx
mov rax, rdi
ret
- First, we need to clear the DF (direction flag) to make sure we’re going forward when we loop. This is what CLD does.
- Then, mov rcx, rdx moves the count of bytes into rcx. In the AMD64 calling convention, the third argument to the function is held in rdx. Our prototype is
extern void asmcpy(void *restrict dest, void *restrict source, uint64_t count);
which means that count is stored in rdx. However, rep repeats the operation (movsb, in this case) and decrements rcx until rcx is zero, so we need to move rdx into rcx.
- Then rep movsb repeats the byte-by-byte move until rcx hits zero, at which point rep terminates. movsb copies bytes from rsi to rdi. From our calling convention, the source pointer is stored in rsi, and the destination is stored in rdi. Everything lines up perfectly.
- We subtract the count from the destination pointer to undo the forward advancement made by rep (remember, rep advances both pointers).
- Lastly, we store rdi in rax, which is the register used for the return value when the function terminates.
Pretty compact, eh?
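For reference, here’s a rough C sketch of what the routine boils down to (asmcpy_in_c is just an illustrative name; the hardware’s rep movsb is free to move data in larger chunks internally, but the visible effect is the same):

#include <stdint.h>

void *asmcpy_in_c(void *dest, const void *source, uint64_t count) {
    unsigned char *d = dest;
    const unsigned char *s = source;
    while (count--)   /* what rep movsb does: one byte per iteration */
        *d++ = *s++;
    return dest;      /* mov rax, rdi */
}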
Let’s run some tests to make sure everything’s correct:
#include "stdint.h"
#include "stdio.h"
#include "stdlib.h"
#include "string.h"
extern void asmcpy(void *restrict dest, void *restrict source, uint64_t count);
void bench(uint64_t size) {
char *src = malloc(size);
char *dst = malloc(size);
for (size_t i = 0; i < size; i++)
src[i] = (char)(i & 0xFF);
asmcpy(dst, src, size);
if (memcmp(dst, src, size) != 0) {
fprintf(stderr, "FAILED correctness check (size=%zu)n", size);
exit(1);
}
}
typedef uint64_t u64;
typedef uint32_t u32;
int main() {
u64 sizes[] = {
2, 3, 4, 5, 8, 9, 16, 17, 32, 33, 64, 65,
128, 129, 256, 257, 512, 513, 768, 769, 1024, 1025};
const int len = sizeof(sizes) / sizeof(sizes[0]);
for (int i = 0; i < len; i++) {
bench(sizes[i]);
}
}
Everything passes. Sweet!
Note: It might seem like I randomly chose these numbers, but they’ll be pretty important later on.
We’re not done yet, though. What kind of project would this be if we didn’t compare the performance of our implementation to something much more polished? We’re going to compare asmcpy
to Libc’s memcpy
and see how our implementation stacks up!
Benchmarking Against Libc
Getting a Monotonic Clock
x64 (or x86_64, if that’s more familiar to you) processors have a counter called the timestamp counter (TSC) that monotonically increments every CPU cycle, and is reset when the CPU is powered off. This counter can be read with the RDTSC instruction. To learn more about this counter and how it works, you’ll have to take a glance at the combined Volume 3 of Intel’s Software Developer Manuals. Specifically, chapter 19. There’s also the reference page for the RDTSC instruction.
For statistical consistency, we’ll take multiple samples of each, and then compare the min, max, avg, and common operating range of these implementations. First, something minimal.
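To make that concrete, summarising the samples could look something like this (a minimal sketch; summarise is a hypothetical helper, not taken from the benchmark code):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* prints the min, max, and average of the collected cycle counts (assumes runs > 0) */
void summarise(const uint64_t results[], int runs) {
    uint64_t min = results[0], max = results[0], sum = 0;
    for (int i = 0; i < runs; i++) {
        if (results[i] < min) min = results[i];
        if (results[i] > max) max = results[i];
        sum += results[i];
    }
    printf("min=%" PRIu64 " max=%" PRIu64 " avg=%.1f cycles\n", min, max, (double)sum / runs);
}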
Creating a Minimal Benchmark
Reading the TSC with C Intrinsics
#include "x86intrin.h"
uint64_t start = __rdtsc();
Intrinsics allow us to write C code that maps almost exactly to Assembly instructions. Here, it’s the RDTSC instruction, which returns the current value of the TSC.
Setting up the Rest
void bench_memcpy(u64 size, int runs, u64 results[]) {
char *src = NULL;
char *dst = NULL;
if (init_buffers(&src, &dst, size) != 0) {
fprintf(stderr, "ERROR while creating buffersn");
exit(2);
}
for (int i = 0; i < runs; i++) {
u64 start = __rdtsc();
memcpy(dst, src, size);
u64 end = __rdtsc();
assert(end > start);
u64 diff = end - start;
assert(diff < end);
if (memcmp(dst, src, size) != 0) {
fprintf(stderr, "FAILED correctness check (size=%zu)n", size);
exit(1);
}
}
}
I considered using a function pointer to allow using the same bench
for both routines, but I wasn’t sure how that would hinder the optimiser.
Running this code as-is would probably give messy results due to modern CPUs employing out-of-order(OOO) execution. To explain OOO simply, let’s take a look at a simple set of assembly instructions:
mov rbx, [rsi]
add rcx, rdx
Even though the mov
comes first here, the processor is allowed to execute the add
before the mov
is done.
The primary reason for this is performance — modern CPUs have several execution units for handling different operations such as loads, stores, basic arithmetic, etc. During execution, multiple assembly instructions are fetched from the instruction stream, decoded into some combination of micro-ops, and dispatched to different execution units in the processor. In this way, multiple instructions can be in execution at the same time!
In this case, the load from memory might cause the execution units for that instruction to stall for several cycles due to memory latency. However, the adder units in the processor are not hindered by this stall, and are free to work on the add
instruction. As long as all the results are rearranged to look as if they had executed in sequential order, everything is fine.
Note: This is a simple summary. I recommend this video by Matt Godbolt for a really exciting breakdown.
So, what does this have to do with our code? Well, let’s look at the benchmark again:
u64 start = __rdtsc();
memcpy(dst, src, size);
u64 end = __rdtsc();
Do you see it, yet? Let’s look at the generated assembly for some clues (I’ll show how to do this later).
rdtsc
shl $0x20, %rdx
mov %rax, %r15
mov %r13, %rsi
mov %r12, %rdi
or %rdx, %r15
mov %rbp, %rdx
call 0x401070 <memcpy@plt>
rdtsc
Even if rdtsc is executed before and after memcpy, the CPU could reorder the execution such that rdtsc actually takes its reading while memcpy is being executed. This means we’d get inaccurate results. Here’s a line from the official reference:
…The RDTSC instruction is not a serializing instruction. It does not necessarily wait until all previous instructions have been executed before reading the counter. Similarly, subsequent instructions may begin execution before the read operation is performed…
So, how do we solve this? We’ll use serialising instructions. But what is a serialising instruction? I’ll give a brief explainer.
Serialisation And Program Order
Nothing is as it seems. In the pursuit of performance, modern CPUs perform all sorts of trickery under the hood to ensure that the hardware is kept as busy as possible. This includes reordering of memory loads and stores to optimise CPU bandwidth, and directly transmitting data between instructions instead of memory. Say we had two instructions:
mov qword [rdi], 10
mov rdx, [rsi]
And both of these instructions accessed the same memory location; rather than writing to the cache or main memory and then reading it again for the second instruction, the processor is allowed to forward the write from the first instruction directly to the second, optimising memory bandwidth.
Intel CPUs (and some others) also employ buffering of writes to memory: instead of writing multiple times to memory, the writes might be buffered in one of several internal buffers — including the store buffer, write-combining buffer, or some other intermediate location — and flushed to memory only when necessary. If you’d like to learn more about this, you’ll have to take a look at Volume 3 of — you guessed it — Intel’s Software Developer Manuals. Specifically, chapter 13.
All of this, especially when combined with the OOO mechanism from before, means that at any point in time, it can be very difficult to know which instructions have fully retired and what the current state of memory is.
Serialising instructions serve as checkpoints in the program order which guarantee various levels of consistency before proceeding. Take LFENCE, for example — its reference page states:
Performs a serializing operation on all load-from-memory instructions that were issued prior the LFENCE instruction. Specifically, LFENCE does not execute until all prior instructions have completed locally, and no later instruction begins execution until LFENCE completes. In particular, an instruction that loads from memory and that precedes an LFENCE receives data from memory prior to completion of the LFENCE..
Visualising it would look something like this:
If you read beyond this part, though, you’ll encounter a caveat…
…(An LFENCE that follows an instruction that stores to memory might complete before the data being stored have become globally visible.) Instructions following an LFENCE may be fetched from memory before the LFENCE, but they will not execute (even speculatively) until the LFENCE completes.
Globally visible here refers to whether or not other processors/cores on the same system are able to see the results of an instruction. Again, because writes to memory are buffered, a write performed by a processor might not be visible to other processors in the same system until the writes are flushed to memory — there are other mechanisms around this, but those aren’t immediately relevant.
We’re running single-threaded code here, so the part about stores not being globally visible is fine; we only care that the instructions are done executing and have registered the data to be written to memory, not exactly when it reaches memory.
Serialising with RDTSCP and LFENCE
RDTSCP reads the TSC, and also reads a processor ID into a separate register. Like LFENCE, it waits until previous instructions are complete — unlike LFENCE, however, it does not stop instructions which come after it from being executed while it is running.
With all of this in mind, we can rewrite our code to make sure no funny business is going on.
u64 rdtscp() {
u32 id;
return __rdtscp(&id);
}
void bench_memcpy(u64 size, int runs, u64 results[]) {
char *src = NULL;
char *dst = NULL;
if (init_buffers(&src, &dst, size) != 0) {
fprintf(stderr, "ERROR while creating buffersn");
exit(2);
}
assert(src != NULL && dst != NULL);
for (int i = 0; i < runs; i++) {
u64 start = rdtscp();
__asm__ __volatile__("lfence" ::: "memory");
memcpy(dst, src, size);
u64 end = rdtscp();
__asm__ __volatile__("lfence" ::: "memory");
assert(end > start);
u64 diff = end - start;
assert(diff < end);
if (memcmp(dst, src, size) != 0) {
fprintf(stderr, "FAILED correctness check (size=%zu)n", size);
exit(1);
}
results[i] = diff;
}
free(src);
free(dst);
}
We’re not done yet, though.
Even though the TSC ticks at a steady rate, the core clock itself does not stay fixed; laptops in particular dynamically adjust their frequencies depending on their current load and power state. This means the same amount of work can take a different number of TSC ticks from run to run, so when running on a laptop (which I am), we’ll have to keep the processor frequency as stable as possible to avoid unfair results. If you’re on a desktop PC, you don’t really need to do this.
Stabilising the Frequency on Linux
First, disable hyperthreading/SMT. It can lead to some unpredictability in results.
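On recent kernels, one way to do this (an assumption on my part; check your distro’s documentation) is through the sysfs SMT control:

echo off | sudo tee /sys/devices/system/cpu/smt/control #turns SMT off until the next reboot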
CPU frequency scaling is implemented via the CPUFreq subsystem on Linux. While we could interact with this subsystem directly, I like to use cpupower as my preferred frontend, though you’re free to use whatever you wish.
I use an Intel Ice Lake processor, but you should be able to follow along for most cases.
First, we need to get information about the frequency range the hardware supports by running:
cpupower -c <Specific core here> frequency-info
Pay attention to the “hardware limits” section. From this, we can see that this core is capable of operating between 400 MHz and 3.6 GHz. We’ll need this information to specify an operating range.
Note: In my tests, I was not able to reliably pin my CPU to a specific frequency. Rather, it seems like mobile CPUs only take “suggestions”, and do their best to minimise fluctuations around the specified frequencies. If someone has a way around this, please reach out to me.
To “suggest” to the CPU to stay in a reasonable range, we’ll use the frequency-set --min and frequency-set --max options to pin it:
sudo cpupower -c 3 frequency-set -g performance
sudo cpupower -c 3 frequency-set --min 2500000
sudo cpupower -c 3 frequency-set --max 2600000
If you watch the CPU frequencies with watch -n1 'cat /proc/cpuinfo | grep -i "mhz"', you’ll notice that the frequencies occasionally wander outside the range, but this has been stable enough for tests.
Pinning and Minimising Scheduling
We’ve successfully suggested that CPU 3 stay within a specified frequency range. Now, we need to make sure that our program only executes on that CPU. This can be done with taskset:
taskset -c <core> <program>
Then, to make sure that the current program isn’t scheduled out by the kernel, we’ll increase the scheduling priority with the nice utility on Linux (this probably isn’t necessary since we already have taskset, but let’s just be safe):
sudo nice -n -20 <program> #-20 is the highest priority
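Putting the pieces together, a typical invocation (assuming core 3 and our main binary from earlier) would look something like:

sudo nice -n -20 taskset -c 3 ./main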
Benchmarking Our Code
After all of that work, we’re ready to get some results. The link to the full code for this run is available here. Some things, first:
- For this implementation, we’ll try to optimise up to 2048 bytes — half of the page size on many systems. That should be enough to cover a lot of general use cases. Translation: I haven’t figured out how to beat memcpy beyond that.
- You’ll notice that the input data is segmented in the source code. This is because I found that earlier runs would sometimes affect the later results, particularly as the input sizes increased — I suspect this has to do with caching, eviction, or some other memory issue; I’m not sure yet — but separating the inputs into different benchmarks resolved this. This might differ on your system, and you should probably experiment to find the right configuration. Who would have thought that microbenchmarking manually would be so tricky? Perhaps I should use a proven library…
- Because of the above-mentioned point, we’ll focus more on the relation between the numbers, as opposed to the numbers themselves.
In addition to the optimisation for size, these tests are designed to check performance for aligned and unaligned operations. This blog provides a good explanation of the topic. First, we’ll take a look at the generated assembly to make sure that nothing funny’s going on, using objdump -d main | less…
Due to the compiler reshuffling our code all over the place, you might have difficulty finding the benchmarks. You can hit the / key on your keyboard to search and enter “rdtscp” as your search term. This’ll take you to the first occurrence of that instruction in the text. Cool, we can see that the sequence is held: rdtscp, then lfence, asmcpy, rdtscp again, and lfence to finish it off. Bouncing around the searches, we can find the block for memcpy:
All seems to be well.
Now, let’s get to the results:
Glibc’s memcpy
is twice as fast as our implementation for small strings. Let’s try for some bigger inputs:
Interestingly, this seems to be the inverse of what I expected; since rep movsb moves a byte at a time, I thought we’d be very competitive for small strings, and less so as the size increased. Instead, it seems like there is some sort of hidden optimisation going on in the CPU; the performance matches memcpy more and more closely for larger strings. (This is a good reminder to verify assumptions about code.)
At this point, it’s time to take a closer look at how rep
and movs
work.
Rep and Movs — Under the Hood
Looking through Volume 3 of the Software Developer Manual, there’s a section on “Fast-String Operation”.
Interestingly, this might be a very good explainer for why the performance of rep movsb
was so competitive for large strings — even though rep movsb
specifies byte-by-byte moves, the processor is allowed to alter its execution internally to move groups of these elements at a time. This is another example of CPUs employing clever optimisations for significant performance gains.
How much of a gain, you may ask? Well, let’s replace the rep movsb
with a naive movsb
loop and see the differences.
global asmcpy
section .text
asmcpy:
dec rdx
movsb
jne asmcpy
ret
The dec
instruction decrements the content of a register by 1, and sets the zero flag
in the CPU if the new value of the register is zero. (If you’re not familiar with jumps and flags in Assembly, here’s a good explainer for you).
Next, movsb performs the byte-level move from rsi to rdi.
Then, we jump to the start of the loop if the zero flag
has not been set.
Here are the results from this:
Now this is more like what I expected — we start out competitive with memcpy for small strings, and then get worse and worse as the input grows bigger. By the time we get to 256 bytes of input, asmcpy is more than 20x slower than memcpy. It’s clear, then, that rep movsb is doing some heavy optimisation in the processor to allow us to keep up with memcpy.
Asmcpy, Revised
It’s clear that rep movsb
is at least within the range of memcpy
for large strings, so we’ll focus our optimisation efforts on making sure that we can keep up with memcpy
for smaller strings, and progressively optimise on the way up.
Small Strings
Much of the improvement for small strings will centre around a single concept — modern CPUs are heavily optimised to transfer blocks of data, rather than single bytes at a time. This is even observable from the results: even though larger string sizes mean more data being moved, there isn’t a linear scale-up in the execution time of the different implementations.
With that in mind, here is our revised implementation:
global asmcpy
section .text
asmcpy:
cmp rdx, 4
jae .four
mov rcx, rdx
cld
.min:
dec rdx
movsb
jne .min
sub rdi, rcx
mov rax, rdi
ret
.four:
cmp rdx, 8
jae .eight
mov ecx, [rsi + rdx*1 - 4]
mov [rdi + rdx*1 - 4], ecx
movsd
sub rdi, 4
mov rax, rdi
ret
.eight:
cmp rdx, 16
jae .xmm
mov rcx, [rsi + rdx*1 - 8]
mov [rdi + rdx * 1 - 8], rcx
movsq
sub rdi, 8
mov rax, rdi
ret
Okay, we already know that the movsb
loop is fine for small pieces of data, so we’ll leave that.
First, we have separate blocks for different input ranges. This is done to make sure that small strings only result in two copies at maximum. Control flow in Assembly is implemented by jumping between blocks based on different conditions.
Take the entry block of asmcpy, for example: first, we compare rdx (the count) with 4 and jump if the count is greater than or equal to 4. We accomplish this using the cmp and jae instructions. cmp takes two operands and performs an integer subtraction between them (i.e. a - b); it discards the result, but sets the eflags register based on it. If you are unfamiliar with how control flow works at the CPU level, I’d recommend this brief write-up.
There’s an interesting trick here: whenever we have data that straddles two native word sizes, i.e. sizes between 4 and 8, or 8 and 16, we do two native-word copies and align the second one to the end of the buffer. While we do “over-copy” a few bytes (the two copies may overlap), it’s worth it for the efficiency of doing word-sized copies.
mov ecx, [rsi + rdx*1 - 4]
is just the CPU’s built-in way of doing array math. The syntax follows a simple pattern:
base + (index * scale) + displacement
Where
- base – usually points to the start of a memory region (think of it as arr).
- index – which element you want (like i in arr[i]).
- scale – the size of each element.
- displacement – a fixed offset added on top (can be positive or negative).
So in our example:
- rsi is the base (pointer to the start of the array).
- rdx is the index.
- scale = 1, because each element is 1 byte (char).
- displacement = -4, which effectively steps back one dword.
Put together:
mov ecx, [rsi + rdx*1 - 4]
is (roughly) equivalent to:
ecx = *(uint32_t *)&arr[i - 4]; /* load the 4 bytes ending at arr[i - 1] */
We only use this for sizes from one native word size up to, but not including, the next (e.g. 4, 5, 6, and 7, but not 8), so we’re fine.
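If the addressing maths still feels abstract, here’s the same overlapping-copy trick for the 4-to-7-byte range written as a C sketch (copy_4_to_7 is a hypothetical helper, not part of asmcpy):

#include <stdint.h>
#include <string.h>

/* copy n bytes (4 <= n < 8) using two possibly-overlapping 4-byte moves */
static void copy_4_to_7(void *dest, const void *source, uint64_t n) {
    uint32_t head, tail;
    memcpy(&tail, (const char *)source + n - 4, 4); /* mov ecx, [rsi + rdx - 4] */
    memcpy(&head, source, 4);                       /* movsd reads [rsi]        */
    memcpy((char *)dest + n - 4, &tail, 4);         /* mov [rdi + rdx - 4], ecx */
    memcpy(dest, &head, 4);                         /* movsd writes [rdi]       */
}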
Let’s see how we stack up against memcpy
again:
Neat! We’re actually on par with memcpy
for small strings. Let’s go on to the next chunk.
SIMD!!!
I’m sure most people expected that we’d cover SIMD at some point in this. If you don’t know what SIMD is, I’ll give you a brief explainer.
SIMD, which stands for Single Instruction Multiple Data, is a mechanism on modern CPUs which allows them to apply one operation to multiple pieces of data with a single instruction. SIMD works by packing data into wide registers and operating in parallel on these “lanes”. SIMD allows an operation like this:
To be expressed in a single instruction:
paddw xmm0, xmm1
where xmm0
and xmm1
are two of the special-purpose SIMD registers available to CPUs.
This allows us to compute the equivalent of 8 adds in one cycle. A very useful thing to have for parallelism. If you’re interested in a deeper exploration of SIMD, I’d recommend this article.
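For reference, the same packed addition can be written from C with intrinsics; this sketch uses SSE2’s _mm_add_epi16, which maps to the paddw instruction shown above:

#include <emmintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    int16_t a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    int16_t b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    int16_t out[8];

    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i vsum = _mm_add_epi16(va, vb); /* eight 16-bit adds in one instruction */
    _mm_storeu_si128((__m128i *)out, vsum);

    for (int i = 0; i < 8; i++)
        printf("%d ", out[i]);
    printf("\n");
    return 0;
}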
While SIMD does allow for parallel operations, we’ll mostly be using it here to let us perform 16-, 32-, and even 64-byte moves in one go with the corresponding xmm, ymm, and zmm registers.
.xmm:
cmp rdx, 32
jae .ymm
movups xmm0, [rsi + rdx - 16]
movups xmm1,[rsi]
movups [rdi + rdx - 16], xmm0
movups [rdi], xmm1
mov rax, rdi
ret
.ymm:
cmp rdx, 64
jae .zmm
vmovups ymm0, [rsi + rdx - 32]
vmovups ymm1, [rsi]
vmovups [rdi + rdx - 32], ymm0
vmovups [rdi], ymm1
mov rax, rdi
ret
.zmm:
cmp rdx, 128
jae .prepare
vmovups ymm0, [rsi]
vmovups ymm1, [rsi+32]
vmovups [rdi], ymm0
vmovups [rdi+32], ymm1
vmovups ymm2, [rsi+rdx-64]
vmovups ymm3, [rsi+rdx-32]
vmovups [rdi+rdx-64], ymm2
vmovups [rdi+rdx-32], ymm3
mov rax, rdi
ret
Shall we unpack this, then?
This is the same trick as before, only done with bigger sizes. movups and vmovups are SIMD instructions that move 16 and 32 bytes of unaligned data, respectively. You’ll soon realise that SIMD mnemonics can often be a mouthful, but you get used to it.
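In C-intrinsics terms, the .xmm block above boils down to something like this (copy_16_to_31 is a hypothetical name, purely for illustration):

#include <emmintrin.h>
#include <stdint.h>

/* copy n bytes (16 <= n < 32) with two possibly-overlapping 16-byte moves */
static void copy_16_to_31(void *dest, const void *source, uint64_t n) {
    __m128i tail = _mm_loadu_si128((const __m128i *)((const char *)source + n - 16));
    __m128i head = _mm_loadu_si128((const __m128i *)source);
    _mm_storeu_si128((__m128i *)((char *)dest + n - 16), tail); /* movups [rdi + rdx - 16], xmm0 */
    _mm_storeu_si128((__m128i *)dest, head);                    /* movups [rdi], xmm1            */
}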
I really wanted to employ AVX-512, since my 10th-gen processor does have support for it, but I kept getting inexplicable performance dips with the AVX-512 instructions and registers, so we’ll stick with the 32-byte registers provided for us by AVX2.
Okay, let’s run some tests.
Looks like the initial optimisations have paid off. We’re fully on par with memcpy!
The Final Form
Okay, this is the point where we start employing some more tools. We’ve reached the limit of the registers available to us from the system, so it’s time to start using loops:
.prepare:
xor rcx, rcx
lea rax, [rdx-64]
.beyond:
cmp rdx, 2048
ja .rep
vmovups ymm0, [rsi+rcx]
vmovups [rdi+rcx], ymm0
vmovups ymm1, [rsi+rcx+32]
vmovups [rdi+rcx+32], ymm1
add rcx, 64
cmp rcx, rax
jb .beyond
vmovups ymm0, [rsi + rax + 32]
vmovups ymm1,[rsi+rax]
vmovups [rdi + rax + 32], ymm0
vmovups [rdi + rax], ymm1
ret
.rep:
cld
mov rcx, rdx
rep movsb
sub rdi, rdx
mov rax, rdi
ret
The trick here is this — copy as many 64-byte chunks of the source to the destination as possible until there are less than 64 bytes left, and then do an “over-copy” like we did before to copy the tail.
- xor rcx, rcx is a little trick to clear a register (set it to zero).
- We’ve not really looked at the lea instruction yet, but it’s useful for doing pointer arithmetic — lea rax, [rdx-64] roughly translates to “take the address in rdx, subtract 64 from it, and store that in rax”. There’s no law that stops us from using lea on non-addresses, though, so lea is often really useful for doing arithmetic in a single instruction. Here, we’re taking the count of elements in rdx, subtracting 64 from it, and storing the result in rax. This lets us know how many iterations of the 64-byte loop we can run.
- You’ll notice that there are two 32-byte moves. While this is done to allow moving 64 bytes in one iteration of the loop, there’s also a secondary reason: modern CPUs are able to execute multiple load and store operations simultaneously, but they need the instructions to be free of data dependencies. By unrolling the loop, we’ve given the processor more opportunities for instruction-level parallelism. Loop unrolling is a common technique used by compilers.
- And finally, the tail of the data. Again, we use the technique of “over-copying” the last 64 bytes from the source to the destination.
E.g. say we had to copy 142 bytes from src to dst: the loop runs twice and covers bytes 0–127, and the tail then copies the last 64 bytes (78–141), overlapping part of what was already written.
As the last part of the check, we’ll see if the input data is greater than 2048 bytes. If it is, we can freely pass it to
rep movsb
— it has proven capable of keeping up with memcpy
for this range of data.
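In C terms, the loop-plus-tail strategy looks roughly like this (a sketch for counts between 128 and 2048; copy_in_64_byte_chunks is a hypothetical name):

#include <stdint.h>
#include <string.h>

static void copy_in_64_byte_chunks(char *dest, const char *source, uint64_t n) {
    uint64_t i = 0;
    uint64_t tail = n - 64;                 /* lea rax, [rdx - 64]           */
    do {
        memcpy(dest + i, source + i, 64);   /* the two 32-byte vmovups pairs */
        i += 64;
    } while (i < tail);                     /* cmp rcx, rax / jb .beyond     */
    memcpy(dest + tail, source + tail, 64); /* over-copy the last 64 bytes   */
}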
At this point, you know what comes next. Let’s look at some results!
Looks like asmcpy is pretty much on par with memcpy. That brings a smile to my face.
In Conclusion…
This was honestly really fun. I set out on this journey with a relatively shallow understanding of SIMD, Assembly instructions, CPU-specific details, and how high-level primitives map to machine code. Now, I’m a lot more familiar with all of these, and can better appreciate all of the work that has to go into writing compilers and optimised code.
I hope this was pretty fun for you, too. Maybe it’s time for you to tackle something like this but for a different Libc routine. Perhaps it’s time to start the Rewrite It In Assembly movement.
Talk to me on Twitter.