Compiling MethodScript to LLVM

Part I

In this inaugural post on the MethodScript blog, I wanted to describe the basics of how the MethodScript compiler works, with particular attention to how it uses LLVM. I’ll also dive a bit into LLVM itself, which should help potential future contributors, but should also be of interest to those who want to learn more about LLVM and don’t care about MethodScript at all. Indeed, very little MethodScript is actually going to be covered here, and we’ll mostly be using C to demonstrate some of the more advanced stuff that isn’t yet implemented in MethodScript. This should also serve as a decent enough “getting started” guide for anyone who wants to write their own programming language down the road. This is a bit of a long post, so I’ve split it into two parts. In part one, the basics will be covered, and in part two, I’ll dive deeper into some of LLVM’s optimizations.

LLVM Introduction

Before diving into the specifics, I’d like to introduce LLVM itself a little bit. LLVM originally stood for Low Level Virtual Machine, though the project is now just “LLVM”. It is a standard and set of tools that allows various frontends to output a standardized Intermediate Representation (IR) or bitcode, which is then compiled into architecture/OS specific executables using various backends. Many languages support LLVM without you necessarily knowing about it: Clang is a frontend that compiles C/C++ through LLVM, llgo does the same for Go, and rustc for Rust. Lots of backends are supported, including ARM, PPC, x86, x86_64, and most interestingly, WASM, which means that languages that have an LLVM frontend compiler can all compile directly to WebAssembly.

LLVM contains a number of optimizations, and these optimizations apply at the IR/bitcode level, which means that any optimizations developed for LLVM should work equally well with any language. Later in this post, I’ll talk about a few of those optimizations, and about Static Single Assignment, a somewhat unique way of writing code. In MethodScript, for various reasons (some perhaps not the best) we use IR, not bitcode. IR is human readable, so in the case of a blog post this is better anyways, but in general, IR and bitcode are completely interchangeable, so for the rest of the post I will only refer to the IR.

Creating the AST (Abstract Syntax Tree)

The first step of writing a programming language is parsing the text of the program itself. This is ultimately parsed into an object called an Abstract Syntax Tree, which is basically the data structure that represents a program. Consider the following simple code:

2 + 3 + 4

In this example, we’re using what’s called infix notation; that is, the plus signs go in between the things being added. But we could rewrite this to use prefix notation instead, and use the name “add” instead of “+”, to make it look more like a function call.

add(2, 3, 4)

In fact, this is precisely what MethodScript does, and why it got its name. Internally, everything is a function, and while syntax sugar exists to allow you to write code as you would expect in a modern programming language, using functional notation such as add() is equally correct. In any case, the first step is to do a thing called lexing, which is to read through each individual character in the code, and create tokens from that. In the case of this code, we would end up with 8 tokens:

  • add (identifier, in this case a function identifier, because it precedes a left parenthesis)
  • left parenthesis
  • 2
  • comma
  • 3
  • comma
  • 4
  • right parenthesis

The lexer uses context in order to figure out what some of the tokens are, but many are left as unknown at this point, since we’re really only looking at a few characters at a time (the current character, plus a couple before and after it). The next step is to take this flat list of tokens and create the AST from it; the parser is responsible for this. It’s called a tree because that’s a good way to visualize the relationships between the tokens. The parser creates a data structure that looks something like this:

A tree diagram showing the AST for the program "add(2, 3, 4)". At the top is a box labelled "add", with three circles under, with arrows pointing from the add box to each. The circles are labelled with 2, 3, and 4.
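
In code, such a tree might be represented with a structure like the following. This is a hypothetical C sketch, not MethodScript’s actual internal representation:

/* One node in the tree: a value plus zero or more child subtrees. */
typedef struct Node {
    const char *value;       /* "add", "2", "3", or "4" */
    struct Node **children;  /* NULL/empty for the constant leaves */
    int child_count;
} Node;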

Any remaining unknown tokens are classified at this point as well, and other tree modifications happen, for instance inserting implied functions. Next up, some basic optimizations are done, for instance removing dead code, or running various validations on some functions (such as the regex related ones) to ensure that the passed in regex is correct. Actually, the compiler is smart enough to detect that the add function is always going to return the same result for this calculation, so it will simply replace this whole tree with the value “9”, but assume we skip this optimization for now. In this example, the tree is fairly simple: the numbers are the child nodes, and they don’t have any further children. However, the tree can be much larger, and nodes can be composed of other functions as well as constants that were hardcoded in the source.

All the steps so far are the same whether you’re running in the interpreter or compiling to native code. If you’re running your code in the interpreter, the runtime starts executing the AST based on the values in the tree, but if you’re compiling to a native executable, the compiler walks the tree and converts each node into equivalent LLVM IR code. In both cases, we walk the tree in a depth first, recursive fashion, meaning that in this example, we would process the nodes in order: 2, 3, 4, add. In this case, “process” is a strong word for what happens with the constants, but the add function actually adds the numbers up. Had the numbers been the result of some other function call, that call would have to be processed first before the add function could be processed. In this context, when I say “processed” I mean evaluated/run in the case of the interpreter, and written out as LLVM IR in the case of native code.
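
To make the depth first walk concrete, here’s a hypothetical C sketch of what the native codegen pass might look like, reusing the Node struct from above (emit_ir is a made-up helper that writes the IR for a single node; this is not MethodScript’s actual implementation):

void emit_ir(Node *node); /* hypothetical: writes the LLVM IR for one node */

void emit_node(Node *node) {
    /* Process all children first (2, 3, 4)... */
    for (int i = 0; i < node->child_count; i++) {
        emit_node(node->children[i]);
    }
    /* ...and only then the node itself (add). */
    emit_ir(node);
}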

Compiling LLVM IR

Assuming we’re writing out native code, then each function is expected to evaluate the tree and output the relevant LLVM IR. First though, let’s look at a basic Hello World written directly in LLVM.

main.ll:

source_filename = "main.ms"
target datalayout = "e-m:w-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-pc-windows-msvc19.30.30528"

@.strings.0 = linkonce_odr dso_local unnamed_addr constant [13 x i8] c"Hello World!\00"

define dso_local i32 @main(i32 %0, i8** %1) {
  %3 = call i32 @puts(i8* getelementptr inbounds ([13 x i8], [13 x i8]* @.strings.0, i64 0, i64 0))
  ret i32 0
}

declare dso_local i32 @puts(i8*)

; This is a line comment

At the top, we have meta information about the program, what the source file name was, the alignment and other information about the binary layout of the end executable, and what platform we’re targeting. For MethodScript, this is basically just copied from what clang does. In the future, we may end up diverging from clang, but for the most part doing what clang does without a good reason to do otherwise seems to be a good approach.

Under that, we are defining a global constant named “@.strings.0” with the content “Hello World!”, which ends with a null terminator. Next, we have the global function “@main”, in which we call the function puts(), passing it a reference to the string (getelementptr in this case is getting a reference to the first element in the string, essentially casting the value, which has type [13 x i8], to a value of type i8*). It returns a value, which is stored in the local variable %3 (and is unused). Then we return the 32 bit value 0. Finally, we declare the function puts, which tells the compiler that this will be provided by some external library later. In MethodScript, the basic interaction with the system is done through the C standard library, which is where puts() is actually implemented. This will be linked in at a later stage, but has to be declared at this point. The “;” character starts a line comment.

This is an actual (but manually pared down) dump of the MethodScript program sys_out("Hello World!");. I’ll discuss how this is generally created in the next section.
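
For reference, the IR above is essentially what you’d get for the following C program:

#include <stdio.h>

int main(int argc, char **argv) {
    puts("Hello World!"); /* %3 = call i32 @puts(...) */
    return 0;             /* ret i32 0 */
}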

However the IR was generated, the next step is to actually create the executable. Depending on the platform, these steps look slightly different, but overall the process is to compile the IR with the llc program, which produces object files, and then link those using either the system’s linker (i.e. link.exe on Windows or ld on Linux) or LLVM’s linker, lld. On Windows, this might look something like this:

llc.exe --filetype=obj -o=main.obj -O2 main.ll
lld-link.exe /out:"main.exe" /entry:main msvcrt.lib libcmt.lib /libpath:"C:\Program Files (x86)\Windows Kits\10\um\x64" /defaultlib:"C:\Program Files (x86)\Windows Kits\10\Lib\10.0.19041.0\ucrt\x64\ucrt.lib" /subsystem:console main.obj

The most important things to note here are that we’re passing in main.ll and requesting an object file output, then taking that object file and linking it against the files which contain the C standard library. These files are included in the Windows SDK, with additional files located in the Visual Studio build tools. (If you’re interested in more specifics, the source code which configures all this is here; it’s subject to change, and in particular some cleanup would be good.)
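
For comparison, on Linux the same two steps might look something like this (a sketch; here cc is any C compiler driver, which pulls in the C standard library by default when linking):

llc --filetype=obj -o main.o -O2 main.ll
cc -o main main.o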

Once these commands are run, then voila, we should have an executable that can be run directly!

Creating LLVM IR

In the above section, we talked about the details of how to compile the IR once you have it. Once you’ve figured it out, it’s fairly easy to comprehend and explain, but it was actually a large problem to figure out, particularly in a seamless and user friendly way on all supported platforms. But this is actually the less interesting question; the more interesting question is how to generate the IR in the first place. In MethodScript, each individual function is responsible for generating its own IR, which the compiler simply assembles. So let’s look at the generated code for some different MethodScript functions. For the rest of this post, I’ll only be showing the output of the main function and omitting the rest of the boilerplate for readability, but if you want to try compiling these examples yourself to see the full output, you can do so with mscript --asm main.ms after you install MethodScript, as everything described here is in the latest version of MethodScript.

rand():

define dso_local i32 @main(i32 %0, i8** %1) {
  %3 = call i32 @rand()
  %4 = call i32 @rand()
  %5 = shl i32 %4, 16
  %6 = or i32 %3, %5
  %7 = uitofp i32 %6 to double
  %8 = fdiv double %7, 2147483647.0
  ret i32 0
}

This is the result of the program rand() on a Windows machine. Interestingly, Windows and *nix OSes have different behavior in how the C standard library generates random numbers, so MethodScript abstracts this difference away with a different implementation on Windows, which is more complicated, but also more interesting to explain. MethodScript defines that the rand() function with no arguments returns a double between 0 and 1, while the C standard library’s rand() returns an integer between 0 and RAND_MAX. RAND_MAX varies between Windows and *nix: on Windows, it’s 2**15-1, and on *nix it’s 2**31-1. It would be nice to have the same resolution on both, so on Windows, we just generate 2 random integers, then combine them to get a number of sufficient size. In the code above, you see we call rand() twice, take one of the values, shift it left (shl) by 16 bits, then bitwise or it with the second value. After that, the code for Windows and *nix is the same: we do an unsigned integer to floating point conversion, then divide that number by the now standardized RAND_MAX, which gives us a number between 0 and 1. I’ve omitted a crucial detail, which is that MethodScript automatically inserts code to seed rand with time() at the beginning of the program, so that you don’t get the same sequence of numbers on every run. This still has flaws, particularly if a new process is started multiple times a second, but it will be improved upon in future versions.
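
A rough C equivalent of the Windows codepath above might look like this (ms_rand is a hypothetical name, not MethodScript’s actual code):

#include <stdlib.h>

/* Hypothetical helper: combine two 15-bit rand() results into one
   wider value, then scale to a double between 0 and 1. */
double ms_rand(void) {
    unsigned int a = (unsigned int)rand();  /* %3 */
    unsigned int b = (unsigned int)rand();  /* %4 */
    unsigned int wide = a | (b << 16);      /* shl + or */
    return (double)wide / 2147483647.0;     /* uitofp + fdiv */
}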

Now let’s see what we get when we build on this a little, and output it to the screen.

sys_out(rand()):

define dso_local i32 @main(i32 %0, i8** %1) {
  %3 = call i32 @rand()
  %4 = call i32 @rand()
  %5 = shl i32 %4, 16
  %6 = or i32 %3, %5
  %7 = uitofp i32 %6 to double
  %8 = fdiv double %7, 0x41DFFFFFFFC00000

  %9 = alloca [318 x i8], align 1
  %10 = getelementptr inbounds [318 x i8], [318 x i8]* %9, i64 0, i64 0
  %11 = call i32 (i8*, i8*, ...) @sprintf(i8* %10, i8* getelementptr inbounds ([6 x i8], [6 x i8]* @.strings.0, i64 0, i64 0), double %8)
  %12 = call i32 @puts(i8* %10)
  ret i32 0
}

You’ll notice the first several lines are identical, but the next bit is new. Essentially what is happening here is that we are taking the value returned by rand and automatically converting it to a string (using a stack buffer and the C standard library’s sprintf function), which is then output to the screen using puts(), similar to the hello world example above. The code to convert the value to a string is generic, so we simply allocate a buffer large enough to cover any possible output, which turns out to be 317 characters (+1 for the null terminator). This can probably be optimized down the road, as we would generally like to avoid allocations (using the alloca instruction) as much as possible. More about that in Part II.
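
In C terms, the new lines do roughly this (a sketch; the real format string is whatever lives in @.strings.0, and "%f" here is an assumption):

#include <stdio.h>

/* Sketch of the generic double-to-string conversion and output. */
void print_double(double d) {
    char buf[318];          /* %9  = alloca [318 x i8] */
    sprintf(buf, "%f", d);  /* %11 = call @sprintf     */
    puts(buf);              /* %12 = call @puts        */
}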

Let’s look at one final example, which is to store the rand() value in a variable first, then output the variable.

double @d = rand();
sys_out(@d);

define dso_local i32 @main(i32 %0, i8** %1) {
  %3 = call i32 @rand()
  %4 = call i32 @rand()
  %5 = shl i32 %4, 16
  %6 = or i32 %3, %5
  %7 = uitofp i32 %6 to double
  %8 = fdiv double %7, 0x41DFFFFFFFC00000

  %9 = alloca double, align 8
  store double %8, double* %9, align 8
  %10 = load double, double* %9, align 8

  %11 = alloca [318 x i8], align 1
  %12 = getelementptr inbounds [318 x i8], [318 x i8]* %11, i64 0, i64 0
  %13 = call i32 (i8*, i8*, ...) @sprintf(i8* %12, i8* getelementptr inbounds ([6 x i8], [6 x i8]* @.strings.0, i64 0, i64 0), double %10)
  %14 = call i32 @puts(i8* %12)
  ret i32 0
}

Again, you’ll recognize the top and bottom parts from the previous example, but now we have new lines in the middle. We take the result of the rand() call (which is stored in the local variable %8) and store it onto the stack in memory (the pointer to the memory location is stored in %9), and then immediately after that, we load it back into a local variable… again? That’s right, you’re reading this seemingly nonsensical code correctly. The alloca instruction allocates memory in RAM, store puts a value in it, and load pulls it out.
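
In C terms, those middle three lines do roughly the following (a sketch, with r standing in for the rand() result held in %8):

double store_and_reload(double r) {
    double d;        /* %9  = alloca double       */
    d = r;           /* store double %8 into %9   */
    double tmp = d;  /* %10 = load double from %9 */
    return tmp;
}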

At first glance, this seems like inefficient code, and if LLVM didn’t offer some very specific optimizations, you’d be correct. However, it does, and since it turns out that from the front end’s perspective, this code is much easier to write (the rand(), assignment, and sys_out() calls are all very isolated from each other) and we know that LLVM will eventually clean up after us anyways, we can get the best of both worlds here.

In Part II, we’ll cover how that specifically works, and discuss more about optimizations in LLVM in general which allow this to work.

Part II

In this part, I’ll talk about how various LLVM optimizations work, and in particular explain why the example above isn’t any less efficient to write, and also explain some additional optimizations that LLVM provides. Before discussing further though, let’s take a step back and explain a bit about computer memory in general.

Memory Model of a Computer

A diagram showing an abstract explanation of computer memory. There are two rows, the top row is labelled Logical, and contains 3 segments, Registers, Stack, and Heap. The second row is labeled Physical, and has 3 segments, CPU, CPU Cache (L1/L2), and RAM. Labels show that segments to the left are faster, and segments to the right are slower. Arrows point from the Registers segment to the CPU segment, and from Stack to RAM and Heap to RAM.
Abstract Image of Computer Memory

On the physical layer, we have the CPU, which contains registers, which are extremely finite in size and number. (Something like 16 general purpose registers, each 64 bits wide, on modern CPUs.) However, this memory is extremely fast, able to be accessed and used in just a few CPU cycles. There is also the L1/L2 (and sometimes L3+) cache, which is progressively larger and slower, but still fairly finite, and which is managed directly by the hardware. And finally, there’s the RAM, which is very large (many gigabytes in a modern desktop computer) and, for most programs, effectively infinite. At a logical level, the registers exist in the CPU, and the stack and the heap exist in the RAM. The further right you go in this diagram, the slower things get, so all else being equal, you’d prefer to use things on the left of the chart. Of course, not everything is equal, and so sometimes we’re forced to use things on the right, but in many cases, we can get away with just using registers.

In the above code snippets, you will have noticed the variables starting with the “%” sign. These are local variables, and while it’s not exactly accurate to say so, you can generally think of them as CPU registers. LLVM will transparently handle converting these to actual registers where possible, and will avoid using RAM if it can. How it does this is actually super interesting. But first, let’s talk about why the variable names in the above code are all just increasing numbers.

Static Single Assignment

LLVM requires a tradeoff by programmers in order to be able to more easily run various optimizations. It requires the IR to be written in static single assignment (SSA) form, which effectively means that each local variable can only be assigned once. You can sort of think about it as if every variable declaration were const or final. When hand writing code, this would be extremely annoying, but luckily, most LLVM IR isn’t handwritten; it’s written by a compiler, so it tends to be easy enough to deal with. There’s an additional restriction, which is that variables which are not named (that is, variables that are just numbers) must be incremented with each use, and not skipped. You’ll notice in all the above examples that %2 is skipped, however; that’s because there’s an implied “code block”, which is labelled with the label “2”. The variables passed in to the function are %0 and %1 (argc and argv). Basically, this just means that we need to have a counter that steadily increases as we generate the code.
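
In a frontend, that counter really can be as simple as the following (a hypothetical sketch, not MethodScript’s actual code):

/* %0 and %1 are the function arguments, and %2 is the implied entry
   block label, so the first available temporary is %3. */
static int next_value = 3;

int fresh_value(void) {
    return next_value++;
}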

Using SSA means that LLVM can much more easily detect certain situations. Consider this example:

y = 2
y = 3
x = y

While as humans, we can see that the assignment of 2 to y is a useless operation, and that x = 3 is a much less verbose way to write this code, this is more difficult for computers to figure out. However, if we rewrite this in SSA form, it becomes much easier to see.

y1 = 2
y2 = 3
x = y2

Now we can see that y1 is never used and so can be immediately discarded, and that the value 3 can always just be substituted in place of y2.

Note that this restriction only applies to local variables, though; we can still store values into RAM as many times as we want.

%3 = alloca double
%3 = alloca double ; Wrong!

%4 = alloca double
store double 1.0, double* %4
store double 2.0, double* %4
; Ok! Changing values within memory is allowed.

Recall from the previous part that the alloca instruction allocates memory in RAM, store puts a value in it, and load pulls it out. So we can’t re-assign the value %3, but we can store different values within that spot in memory as many times as necessary. Remember the inefficient example from before?

define dso_local i32 @main(i32 %0, i8** %1) {
  ; rand()
  %3 = call i32 @rand()
  %4 = call i32 @rand()
  %5 = shl i32 %4, 16
  %6 = or i32 %3, %5
  %7 = uitofp i32 %6 to double
  %8 = fdiv double %7, 0x41DFFFFFFFC00000

  ; assign the value of rand to @d
  %9 = alloca double, align 8
  store double %8, double* %9, align 8
  %10 = load double, double* %9, align 8

  ; sys_out(@d)
  %11 = alloca [318 x i8], align 1
  %12 = getelementptr inbounds [318 x i8], [318 x i8]* %11, i64 0, i64 0
  %13 = call i32 (i8*, i8*, ...) @sprintf(i8* %12, i8* getelementptr inbounds ([6 x i8], [6 x i8]* @.strings.0, i64 0, i64 0), double %10)
  %14 = call i32 @puts(i8* %12)
  ret i32 0
}

Because we’re using SSA, LLVM can apply an optimization called mem2reg, which promotes memory accesses to register accesses, which, as stated above, will be much faster when all else is equal. We can run individual optimizations against the IR with the opt tool (included in the MethodScript toolchain installation) and see the output. Let’s run the mem2reg optimization on this code.

opt.exe main.ll -S -mem2reg:

define dso_local i32 @main(i32 %0, i8** %1) {
  %3 = call i32 @rand()
  %4 = call i32 @rand()
  %5 = shl i32 %4, 16
  %6 = or i32 %3, %5
  %7 = uitofp i32 %6 to double
  %8 = fdiv double %7, 0x41DFFFFFFFC00000

  %9 = alloca [318 x i8], align 1
  %10 = getelementptr inbounds [318 x i8], [318 x i8]* %9, i64 0, i64 0
  %11 = call i32 (i8*, i8*, ...) @sprintf(i8* %10, i8* getelementptr inbounds ([6 x i8], [6 x i8]* @.strings.0, i64 0, i64 0), double %8)
  %12 = call i32 @puts(i8* %10)
  ret i32 0
}

Aha! So now we can see that it got rid of the unnecessary alloca/store/load pattern, and the code now looks the same as if we had simply written sys_out(rand()) and skipped the variable entirely. This pattern of getting rid of allocas enables many other optimizations as well, because generally speaking, the optimizations mostly operate on the local variables which are in SSA form, rather than on values that end up being put into RAM.

Note: I keep heavily suggesting that alloca allocates memory in RAM. This isn’t strictly true; the only thing it strictly does is allocate memory “somewhere” which is of sufficient size to hold the specified type. In most cases, if it survives optimization, it will end up allocating memory on the stack (in RAM). There are some backends where even this is not true (GPU programming, for instance), but conceptually, it’s fine to think of this as an allocation on the stack in RAM.

Phi Nodes

But this leads to an interesting question: what happens when you have branches? If a different value is assigned to a variable depending on the result of an if statement, then it seems like it would be impossible to use all const/final variables in your program. At a high level, the way around this is to simply always use the alloca/store/load pattern. That’s allowed, and that’s how the compiler will output it. However, this means that we need to access RAM, which is slower, so… how do we get around this? Φ nodes! (Hereafter always written as phi nodes.) First, let’s consider a short C program which demonstrates the problem.

int main(int argc, char** argv) {
	int i;
	if(argc > 1) {
		i = 42;
	} else {
		i = 7;
	}
	return i;
}

We’re declaring i up front, then dynamically assigning it based on the input arguments, and finally returning it. Now let’s look at the output IR code. To do this with C code, we can use the command clang -S --emit-llvm -O0 main.c

define dso_local i32 @main(i32 %0, i8** %1) #0 {
  %3 = alloca i32, align 4
  %4 = alloca i8**, align 8
  %5 = alloca i32, align 4
  %6 = alloca i32, align 4         ; int i
  store i32 0, i32* %3, align 4
  store i8** %1, i8*** %4, align 8
  store i32 %0, i32* %5, align 4
  %7 = load i32, i32* %5, align 4
  %8 = icmp sgt i32 %7, 1          ; argc > 1
  br i1 %8, label %9, label %10    ; if argc > 1, go to 9, else go to 10

9:                                                ; preds = %2
  store i32 42, i32* %6, align 4   ; i = 42
  br label %11                     ; go to 11

10:                                               ; preds = %2
  store i32 7, i32* %6, align 4    ; i = 7
  br label %11                     ; go to 11

11:                                               ; preds = %10, %9
  %12 = load i32, i32* %6, align 4 ; load i
  ret i32 %12                      ; return i
}

There are some additional instructions here that we haven’t covered, but it’s fairly simple: icmp does an integer comparison, and br jumps to the specified block depending on whether the condition is true. The more important bit is the blocks; “9:” for instance creates a block with the label “9”, which can be jumped to from the br instruction (among a few others). Clang adds a few extra things that aren’t necessary. If you trace through, you’ll see that, for instance, %3 and %4 are never read from, but just ignore that for now. Otherwise, you can hopefully understand the rest of this code from my comments, and you can see that it is basically following the alloca/store/load pattern, just doing a different store depending on which block we enter.

You can also maybe see that this is basically a 1:1 mapping of the original C code, and it wouldn’t be too difficult to generate in your frontend compiler. But, from a performance perspective, this is worse! We’re forced to use RAM because we can’t assign to registers multiple times.

Or are we? Let’s run mem2reg on this and see what we get. NOTE: If you’re following along at home and want to run the opt command against this code, you’ll need to remove the optnone attribute from the source. This attribute is added by the -O0 flag, which disables optimizations, and it prevents any future optimizations from being run on these functions.

opt -S -mem2reg

define dso_local i32 @main(i32 %0, i8** %1) #0 {
  %3 = icmp sgt i32 %0, 1
  br i1 %3, label %4, label %5

4:                                                ; preds = %2
  br label %6

5:                                                ; preds = %2
  br label %6

6:                                                ; preds = %5, %4
  %.0 = phi i32 [ 42, %4 ], [ 7, %5 ]
  ret i32 %.0
}

Ok, so the allocas and stores and loads are all gone, and we see a new instruction at the bottom, phi. The phi instruction allows you to select a value based on the block that you just came from. So if we came from block 4, then it would return 42, and if we came from block 5, it would return 7. And that’s how LLVM gets rid of alloca/store/load even with branching!

Of course, we might now notice that the code blocks are empty, and so there’s probably more cleanup we can do here. Indeed, if we run all optimizations on the code, we get something even simpler.

opt -S -O3:

define dso_local i32 @main(i32 %0, i8** nocapture readnone %1) local_unnamed_addr #0 {
  %3 = icmp sgt i32 %0, 1
  %.0 = select i1 %3, i32 42, i32 7
  ret i32 %.0
}

So actually, because this code is really simple, phi nodes aren’t even necessary here; fully optimized, it uses the select instruction instead, which returns one of two values based on the value of a boolean.
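
In C terms, select is essentially the ternary operator; the fully optimized function behaves like this:

int main(int argc, char **argv) {
    return argc > 1 ? 42 : 7; /* icmp sgt + select */
}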

Loop Optimization and Instruction Combination

LLVM contains a number of other optimizations as well, so let’s look at some others.

int main(int argc, char** argv) {
	for(int i = 1; i < 10; i++) {
		argc++;
	}
	return argc;
}

In this C code, we’re looping from 1 up to (but not including) 10, adding 1 to argc on each of the nine iterations, then returning argc at the end. Let’s see the raw IR for this.

clang -S --emit-llvm -O0

define dso_local i32 @main(i32 %0, i8** %1) #0 {
  %3 = alloca i32, align 4
  %4 = alloca i8**, align 8
  %5 = alloca i32, align 4
  %6 = alloca i32, align 4
  store i32 0, i32* %3, align 4
  store i8** %1, i8*** %4, align 8
  store i32 %0, i32* %5, align 4
  store i32 1, i32* %6, align 4     ; int i = 1
  br label %7

7:                                                ; preds = %13, %2
  %8 = load i32, i32* %6, align 4   ; load i into %8
  %9 = icmp slt i32 %8, 10          ; i < 10
  br i1 %9, label %10, label %16    ; if i < 10, go to 10, else goto 16

10:                                               ; preds = %7
  %11 = load i32, i32* %5, align 4  
  %12 = add nsw i32 %11, 1          ; add 1 to argc
  store i32 %12, i32* %5, align 4
  br label %13                      ; goto 13

13:                                               ; preds = %10
  %14 = load i32, i32* %6, align 4
  %15 = add nsw i32 %14, 1          ; i++
  store i32 %15, i32* %6, align 4
  br label %7, !llvm.loop !3

16:                                               ; preds = %7
  %17 = load i32, i32* %5, align 4
  ret i32 %17                       ; return argc
}

Fairly straightforward, if you know assembly concepts, anyways. Let’s mem2reg it and see what it does. (Remember to remove optnone!)

define dso_local i32 @main(i32 %0, i8** %1) #0 {
  br label %3

3:                                               ; preds = %7, %2
  ; argc (%.01), input arg if block 2, %6 if block 7
  %.01 = phi i32 [ %0, %2 ], [ %6, %7 ]
  ; i (%.0), 1 if block 2, %8 if block 7
  ; recall that there's an implied 2: at the top, 
  ; so block 2 means "the start"
  %.0 = phi i32 [ 1, %2 ], [ %8, %7 ]
  ; i (%.0) < 10
  %4 = icmp slt i32 %.0, 10
  ; if true, block 5, false, block 9
  br i1 %4, label %5, label %9

5:                                                ; preds = %3
  ; argc (%.01) + 1
  %6 = add nsw i32 %.01, 1
  ; go to 7
  br label %7

7:                                                ; preds = %5
  ; i (%.0) + 1
  %8 = add nsw i32 %.0, 1
  ; goto block 3
  br label %3, !llvm.loop !3

9:                                                ; preds = %3
  ; return argc (%.01)
  ret i32 %.01
}

This code is a bit harder to follow, so I’ve added a bunch of line comments to explain it. It’s actually not super important to understand, though, because the next optimization is the more interesting one: loop unrolling. But -mem2reg is a prerequisite for most other optimizations to run.

Loop unrolling basically makes the code bigger for the sake of faster execution. However, in the case of constant sized loops, we can actually fully unroll them, and get rid of quite a lot of code. Let’s run the -loop-unroll optimization now.

opt -S -loop-unroll

define dso_local i32 @main(i32 %0, i8** %1) #0 {
  br label %3

3:                                                ; preds = %2
  br label %4

4:                                                ; preds = %3
  %5 = add nsw i32 %0, 1
  br label %6

6:                                                ; preds = %4
  br label %8

7:                                                ; preds = %31
  %.01.lcssa = phi i32 [ %30, %31 ]
  ret i32 %.01.lcssa

8:                                                ; preds = %6
  %9 = add nsw i32 %5, 1
  br label %10

10:                                               ; preds = %8
  br label %11

11:                                               ; preds = %10
  %12 = add nsw i32 %9, 1
  br label %13

13:                                               ; preds = %11
  br label %14

14:                                               ; preds = %13
  %15 = add nsw i32 %12, 1
  br label %16

16:                                               ; preds = %14
  br label %17

17:                                               ; preds = %16
  %18 = add nsw i32 %15, 1
  br label %19

19:                                               ; preds = %17
  br label %20

20:                                               ; preds = %19
  %21 = add nsw i32 %18, 1
  br label %22

22:                                               ; preds = %20
  br label %23

23:                                               ; preds = %22
  %24 = add nsw i32 %21, 1
  br label %25

25:                                               ; preds = %23
  br label %26

26:                                               ; preds = %25
  %27 = add nsw i32 %24, 1
  br label %28

28:                                               ; preds = %26
  br label %29

29:                                               ; preds = %28
  %30 = add nsw i32 %27, 1
  br label %31

31:                                               ; preds = %29
  br i1 false, label %32, label %7

32:                                               ; preds = %31
  br label %33

33:                                               ; preds = %32
  unreachable
}

Oh my. It almost seems like the original code would have been better than this. But wait! There’s a lot of kind of dumb code here. In particular, we’re always jumping to the next block, so maybe we can get rid of the blocks and the br instructions? Yes! We can do that by simplifying the control flow graph with the -simplifycfg optimization. That gives us:

define dso_local i32 @main(i32 %0, i8** %1) #0 {
  %3 = add nsw i32 %0, 1
  %4 = add nsw i32 %3, 1
  %5 = add nsw i32 %4, 1
  %6 = add nsw i32 %5, 1
  %7 = add nsw i32 %6, 1
  %8 = add nsw i32 %7, 1
  %9 = add nsw i32 %8, 1
  %10 = add nsw i32 %9, 1
  %11 = add nsw i32 %10, 1
  ret i32 %11
}

Aha, that’s much easier to read. But now we’re just doing a bunch of add 1’s, one after the other. Can we combine these? Yes! With the -instcombine optimization:

define dso_local i32 @main(i32 %0, i8** %1) #0 {
  %3 = add nsw i32 %0, 9
  ret i32 %3
}

And now finally, we have our fully optimized code. Basically, we’re just adding 9 to argc, but in a really dumb way. However, LLVM saw right through that and fixed up our code for us anyways.
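
In C terms, LLVM has rewritten our nine-iteration loop into the equivalent of this:

int main(int argc, char **argv) {
    return argc + 9; /* add nsw i32 %0, 9 */
}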

Loop unrolling can happen on loops whose iteration count isn’t known at compile time, too. I couldn’t figure out a way to get LLVM to demonstrate this, but here is a code sample from Wikipedia that shows how one might manually unroll such a loop:

#include <stdio.h>

/* The number of entries processed per loop iteration.                        */
/* Note that this number is a 'constant constant' reflecting the code below.  */
#define BUNCHSIZE (8)

int main(void)
{ 
  int i = 0;                                    /* counter */
  int entries = 50;                             /* total number to process    */
  int repeat;                                   /* number of while repetitions*/
  int left = 0;                                 /* remainder (process later)  */ 
 
  /* If the number of elements is not divisible by BUNCHSIZE,                 */ 
  /* get repeat times required to do most processing in the while loop        */

  repeat = (entries / BUNCHSIZE);                /* number of times to repeat */
  left   = (entries % BUNCHSIZE);                /* calculate remainder       */

  /* Unroll the loop in 'bunches' of 8                                        */ 
  while (repeat--) 
  { 
    printf("process(%d)\n", i    );
    printf("process(%d)\n", i + 1); 
    printf("process(%d)\n", i + 2); 
    printf("process(%d)\n", i + 3); 
    printf("process(%d)\n", i + 4); 
    printf("process(%d)\n", i + 5); 
    printf("process(%d)\n", i + 6); 
    printf("process(%d)\n", i + 7);

    /* update the index by amount processed in one go                         */ 
    i += BUNCHSIZE;
  }

  /* Use a switch statement to process remaining by jumping to the case label */ 
  /* at the label that will then drop through to complete the set             */ 
  switch (left) 
  {
     case 7 : printf("process(%d)\n", i + 6);   /* process and rely on drop 
                                                   through                    */
     case 6 : printf("process(%d)\n", i + 5); 
     case 5 : printf("process(%d)\n", i + 4);  
     case 4 : printf("process(%d)\n", i + 3);  
     case 3 : printf("process(%d)\n", i + 2); 
     case 2 : printf("process(%d)\n", i + 1);   /* two left                   */
     case 1 : printf("process(%d)\n", i);       /* just one left to process   */ 
     case 0 : ;                                 /* none left                  */
  } 
}

Basically, you can see that what happens here is that the main loop is run floor(entries / BUNCHSIZE) times, and then the remainder is handled by duplicated code in a switch statement with fallthrough. Code duplication is bad when a human writes it, but when an optimizing compiler writes it, it can sometimes be a great thing.

But why? Why would we do partial unrolling? Well, it comes down to two primary reasons: first, the decision of which branch to take next can be rather expensive by itself, and second, because of how CPUs work in general, using a concept called branch prediction. A CPU core handles a single instruction at a time, but there is a pipeline of instructions waiting to be moved forward in the queue at each clock cycle. Branch prediction is required because the instructions that are going to run after the current one can be pre-loaded into the pipeline. But if the prediction is wrong, it becomes a very expensive operation, because the CPU has to backtrack by clearing out the incorrect instructions in the pipeline and loading the other branch instead, and in the meantime it can’t run any program instructions. So as a general rule of thumb, if you reduce the number of times a branch occurs, you also reduce the number of times branch prediction can fail, and especially for long running or tight loops this can speed up the code substantially. I won’t go into more detail here, but if you’re interested in further reading, read up on CPU instruction pipelines and branch prediction.

LLVM contains a LOT of optimizations, some of which need to run multiple times, and there are plenty of others that I didn’t cover at all here. But hopefully you get an idea of how knowing the underlying optimizations can make it way easier to write your frontend, even without sacrificing binary speed.

Conclusion

While many of the topics discussed here aren’t quite yet implemented in MethodScript, it’s good to understand them at a deeper level. If you’re interested in further reading, here are some resources that I’ve found interesting:

If I’ve piqued your interest in the topic, then please consider joining us and helping to contribute to MethodScript. MethodScript is a small but growing language, and you’d be getting in on the ground floor of helping to design a newish language. Check out the homepage at https://methodscript.com for more information.

Thanks for reading!
