Writing a MiniC-to-MSIL compiler in F# - Part 5 - Code generation
- This post is part of the series Writing a MiniC-to-MSIL compiler in F#.
- You can find the code for this series in the Mini-C GitHub repository.
Introduction
Let’s do a quick recap. By the end of the last part on building an intermediate representation, we were able to take source code like this:
int main() {
return 123 + 456;
}
and turn it into an intermediate representation (IR) like this:
{ // ILClass
Fields = [ { Type = typeof<int>; Name = "a" } ];
Methods = [
{ // ILMethod
Name = "main";
ReturnType = typeof<int>;
Parameters = [];
Locals = [];
Body = [ IL.Ldc_I4(123); IL.Ldc_I4(456); IL.Add; IL.Ret ];
}
]
}
In this part, we’ll focus on code generation:
In computing, code generation is the process by which a compiler’s code generator converts some intermediate representation of source code into a form (e.g., machine code) that can be readily executed by a machine.
We’ll be converting our intermediate representation into MSIL. Fortunately, our IR is already very similar to MSIL - a choice designed to make code generation easier.
Let’s get started. The code for this part is in CodeGenerator.fs in the GitHub repository.
Emitting MSIL
MSIL is a bytecode format - if you opened a .NET DLL or executable file in a binary viewer, all you’d see would be a long sequence of bytes. Nobody (presumably) wants to build their .NET applications by writing bytes.
The next level up is writing IL yourself. Microsoft provides a tool, ilasm.exe
, which assembles IL into .NET bytecode (or a Portable Executable (PE), as it’s called). If you were to write IL by hand for the trivial example program above, it would look like this (I have removed some items for brevity):
.assembly extern mscorlib
{
.publickeytoken = (B7 7A 5C 56 19 34 E0 89 )
.ver 4:0:0:0
}
.assembly TrivialExample
{
.hash algorithm 0x00008004
.ver 1:0:0:0
}
.module TrivialExample.exe
.class private auto ansi beforefieldinit TrivialExample.Program
extends [mscorlib]System.Object
{
.method private hidebysig static int32
Main() cil managed
{
.entrypoint
.maxstack 2
IL_0000: ldc.i4 123
IL_0001: ldc.i4 456
IL_0002: add
IL_0003: ret
}
}
Whenever a .NET language like C#, or Mini-C, is compiled, the compiler is essentially creating one of these IL files, and then assembling it into .NET bytecode. But to save you the trouble of building IL text files yourself, the .NET framework includes a nice API for generating IL, in the System.Reflection.Emit
namespace. Instead of, for example, writing this line in an IL file:
ldc.i4 123
we can just call this method:
ilGenerator.Emit(OpCodes.Ldc_I4, 123)
This is quite meta: we’re using a .NET API to generate (another) .NET application.
The main System.Reflection.Emit
types that we’ll need are:
Generating MSIL methods
In the Mini-C compiler, at least, most of the hard work in code generation is expended in generating methods. So that’s where we’ll start. First, a couple of helpful type aliases:
type MethodMappingDictionary = Dictionary<string, MethodInfo>
type FieldMappingDictionary = Dictionary<ILVariable, FieldInfo>
These will help us keep track of the generated versions of methods and fields.
We’ll wrap all the code to generate methods into a MethodGenerator
type (I’ll discuss in the next part my feelings on my use of classes / types in this compiler. I think I’d put the pieces together differently
if I was doing it now. I have been a C# developer for a long time, and I think it shows. There are often better ways of doing things in F#.)
type MethodGenerator(typeBuilder : TypeBuilder, ilMethod : ILMethod,
methodMappings : MethodMappingDictionary,
fieldMappings : FieldMappingDictionary) =
...
The ILMethod
object is the intermediate representation that we built in the last part.
First, we’ll create a new MethodBuilder
:
let methodAttributes = MethodAttributes.Public ||| MethodAttributes.Static
let methodBuilder = typeBuilder.DefineMethod(ilMethod.Name, methodAttributes)
Later on, when we’re generating the IL for a method call, we need to pass as the argument a MethodInfo
object. To support that, we need to keep track of all the MethodInfo
objects we’ve created for the functions in our program. That’s what the methodMappings
dictionary is for.
do methodMappings.Add(ilMethod.Name, methodBuilder)
(Rather confusingly, the *Builder
types inherit from the *Info
types; i.e. MethodBuilder
inherits from MethodInfo
.)
let ilGenerator = methodBuilder.GetILGenerator()
ilGenerator
is the key object here; we’ll use it to generate all the IL instructions for methods.
Next, we need some code to handle labels. In the IL examples above, the labels appear at the start of IL instructions (for example, IL_0001:
). If you’re writing IL by hand, you can give meaningful names to labels. We don’t need to do that, but we do need to use labels as branch targets. For example, at the end of a while
loop, we need to branch back to the beginning of the loop to test the condition again. The thing we’re branching back to is a label.
We’ll keep track of labels using another dictionary:
let labelMappings = new Dictionary<IL.ILLabel, System.Reflection.Emit.Label>()
And then for a given IL.ILLabel
, we’ll create a corresponding System.Reflection.Emit.Label
if we haven’t already, and return it:
let getLabel ilLabel =
if labelMappings.ContainsKey ilLabel then
labelMappings.[ilLabel]
else
let label = ilGenerator.DefineLabel()
labelMappings.Add(ilLabel, label)
label
Now we get to one of the core functions in this whole compiler - the function that actually generates IL instructions from the intermediate representation. In a real-world compiler, this function would be much larger. But in Mini-C, we can get away with using a small subset of the available MSIL instructions.
let emitOpCode (ilGenerator : ILGenerator) = function
| Add -> ilGenerator.Emit(OpCodes.Add)
| Br(l) -> ilGenerator.Emit(OpCodes.Br, getLabel l)
| Brfalse(l) -> ilGenerator.Emit(OpCodes.Brfalse, getLabel l)
| Brtrue(l) -> ilGenerator.Emit(OpCodes.Brtrue, getLabel l)
| Call(n) -> ilGenerator.Emit(OpCodes.Call, methodMappings.[n])
| CallClr(m) -> ilGenerator.Emit(OpCodes.Call, m)
| Ceq -> ilGenerator.Emit(OpCodes.Ceq)
| Cge -> ilGenerator.Emit(OpCodes.Clt)
ilGenerator.Emit(OpCodes.Ldc_I4_0)
ilGenerator.Emit(OpCodes.Ceq)
| Cgt -> ilGenerator.Emit(OpCodes.Cgt)
| Cle -> ilGenerator.Emit(OpCodes.Cgt)
ilGenerator.Emit(OpCodes.Ldc_I4_0)
ilGenerator.Emit(OpCodes.Ceq)
| Clt -> ilGenerator.Emit(OpCodes.Clt)
| Dup -> ilGenerator.Emit(OpCodes.Dup)
| Div -> ilGenerator.Emit(OpCodes.Div)
| Label(l) -> ilGenerator.MarkLabel(getLabel l)
| Ldarg(i) -> ilGenerator.Emit(OpCodes.Ldarg, i)
| Ldc_I4(i) -> ilGenerator.Emit(OpCodes.Ldc_I4, i)
| Ldc_R8(r) -> ilGenerator.Emit(OpCodes.Ldc_R8, r)
| Ldelem(t) -> ilGenerator.Emit(OpCodes.Ldelem, t)
| Ldlen -> ilGenerator.Emit(OpCodes.Ldlen)
| Ldloc(i) -> ilGenerator.Emit(OpCodes.Ldloc, i)
| Ldsfld(v) -> ilGenerator.Emit(OpCodes.Ldsfld, fieldMappings.[v])
| Mul -> ilGenerator.Emit(OpCodes.Mul)
| Neg -> ilGenerator.Emit(OpCodes.Neg)
| Newarr(t) -> ilGenerator.Emit(OpCodes.Newarr, t)
| Pop -> ilGenerator.Emit(OpCodes.Pop)
| Rem -> ilGenerator.Emit(OpCodes.Rem)
| Ret -> ilGenerator.Emit(OpCodes.Ret)
| Starg(i) -> ilGenerator.Emit(OpCodes.Starg, i)
| Stelem(t) -> ilGenerator.Emit(OpCodes.Stelem, t)
| Stloc(i) -> ilGenerator.Emit(OpCodes.Stloc, i)
| Stsfld(v) -> ilGenerator.Emit(OpCodes.Stsfld, fieldMappings.[v])
| Sub -> ilGenerator.Emit(OpCodes.Sub)
The only two that aren’t direct mappings are IL.Cge
(>=
) and IL.Cle
(<=
). That’s because these instructions don’t exist in IL. So for Cge
, for example, we instead:
- Compare using the less-than operator. If the first value is greater than or equal to the second value, this will push 0 onto the stack.
- Load the constant
0
onto the stack. - If these two values are equal, then the first value must be greater than or equal to the second value.
(As an aside: I don’t know why MSIL doesn’t have a Cge
instruction. Do any readers know?)
And finally, let’s bring it all together:
member x.Generate() =
methodBuilder.SetReturnType ilMethod.ReturnType
methodBuilder.SetParameters (List.toArray (ilMethod.Parameters |> List.map (fun p -> p.Type)))
let defineParameter index name =
methodBuilder.DefineParameter(index, ParameterAttributes.In, name) |> ignore
ilMethod.Parameters |> List.iteri (fun i p -> defineParameter (i + 1) p.Name)
let emitLocal (ilGenerator : ILGenerator) variable =
ilGenerator.DeclareLocal(variable.Type).SetLocalSymInfo(variable.Name)
ilMethod.Locals |> List.iter (emitLocal ilGenerator)
ilMethod.Body |> List.iter (emitOpCode ilGenerator)
let rec last =
function
| head :: [] -> head
| head :: tail -> last tail
| _ -> failwith "Empty list."
if (last ilMethod.Body) <> Ret then
ilGenerator.Emit(OpCodes.Ret)
(Apparently there’s some unwritten rule that you can’t have a proper functional program without having a function somewhere in it that uses recursion to get the last item in a list. So I’ve obliged.)
The only wrinkle here is that if there isn’t an explicit return
at the end of the function in the original source code, we need to make sure we still add one to the generated code, otherwise our executable will fail runtime verification.
That’s quite a whirlwind tour through IL generation. Obviously, we’ve only scratched the surface of what’s possible, but hopefully this gives you a rough idea of what’s involved.
Generating MSIL types
We can now generate IL for each of the functions in our Mini-C programs. Next, we need to generate IL for fields, and bring it all together into a .NET type. Mini-C doesn’t have types, but MSIL methods can only exist inside a type. We’ll generate one type as a container for all the fields and functions in any given Mini-C program.
First, some code to generate fields, and store the mapping from IL.Field
to System.Reflection.FieldInfo
(when we refer to fields inside method IL instructions, we need to use the FieldInfo
object):
type CodeGenerator(moduleBuilder : ModuleBuilder, ilClass : ILClass, moduleName : string) =
let fieldMappings = new FieldMappingDictionary()
let generateField (typeBuilder : TypeBuilder) (ilField : ILVariable) =
let fieldAttributes = FieldAttributes.Public ||| FieldAttributes.Static
let fieldBuilder = typeBuilder.DefineField(ilField.Name, ilField.Type, fieldAttributes)
fieldMappings.Add(ilField, fieldBuilder)
...
Next, the all-important method that creates a new type, generates the fields and methods for that type, and finally returns both the type and the main
method that will act as the program entry point:
type CodeGenerator(moduleBuilder : ModuleBuilder, ilClass : ILClass, moduleName : string) =
...
member x.GenerateType() =
let typeAttributes = TypeAttributes.Abstract ||| TypeAttributes.Sealed ||| TypeAttributes.Public
let typeBuilder = moduleBuilder.DefineType(moduleName + ".Program", typeAttributes)
ilClass.Fields |> List.iter (generateField typeBuilder)
let methodMappings = new MethodMappingDictionary()
let generateMethod ilMethod =
let methodGenerator = new MethodGenerator(typeBuilder, ilMethod, methodMappings, fieldMappings)
methodGenerator.Generate()
ilClass.Methods |> List.iter generateMethod
(typeBuilder.CreateType(), typeBuilder.GetMethod("main"))
And that’s pretty much it! It took me quite a lot of trial and error to get this working. Some bugs were hard to fix, because if you get IL generation wrong, your executable files can crash in odd and unhelpful ways. If you pop from the stack more times than you push, or if you don’t have a Ret
at the end of your methods, or… several other categories of error, you’ll get an exception at runtime. I think I could have used PEVerify to catch some of these errors without running the program, but I didn’t know about that at the time.
We’ve almost reached the end of this little journey through a .NET compiler. In the next and final part, I’ll show the compiler entry point (the function that actually takes Mini-C source code as a string, and compiles it into a .NET assembly). I’ll also discuss how I would have done things differently with the benefit of hindsight. See you then!