There’s an existing toy MLIR dialect as part of the MLIR tutorial documentation, so I’ve renamed my dialect from toy to silly, and updated all the references to ‘toy calculator’ to ‘silly compiler’ or ‘silly language’. There’s no good reason to use this language, nor the compiler, so the name is very appropriate. It was, however, an excellent learning tool. The toy namespace is renamed, as are various file names, and all the MLIR operators, function prefixes, and so forth.
In addition to the big rename, other changes since the V5 tag include:
- A GET builtin (the language can now do I/O, not just O.)
- FOR loop support.
- Something much closer to a consistent coding style (FooBar for structures, fooBar for functions, and no more mixing of PascalCase, camelCase, and underscore_separated variable names.)
- Almost all of the auto variables have been purged for clarity.
- I’ve removed the ‘using namespace mlir’ in lowering.cpp. Many of my mlir:: namespace references already had the namespace tag, so removing this allowed for more consistency. I may revert this if it proves too cumbersome, but if I do, I’ll remove all the mlir:: qualifiers consistently (unless they are needed for disambiguation).
- User errors in the parser/builder no longer log the internal file:line:func of the code that spots them, just the file:line location of the code with the error. Those errors are now reported with mlir::emitError().
- Declarations in scf.for and scf.if/else regions are now supported.
- The error test script is now merged into bin/testit, so there’s just one script to run the regression tests.
- Switched to /// style doxygen markup.
GET
Here’s a sample program with a GET call:
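Reconstructed loosely (the exact silly keywords below are approximate, inferred from the MLIR that follows), it declares an i32 variable, reads it with GET, and prints it:

```
DECLARE x;
GET x;
PRINT x;
```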
and the corresponding MLIR output:
module {
func.func @main() -> i32 {
"silly.scope"() ({
"silly.declare"() <{type = i32}> {sym_name = "x"} : () -> ()
%0 = silly.get : i32
silly.assign @x = %0 : i32
%1 = silly.load @x : i32
silly.print %1 : i32
%c0_i32 = arith.constant 0 : i32
"silly.return"(%c0_i32) : (i32) -> ()
}) : () -> ()
"silly.yield"() : () -> ()
}
}
In the generated MLIR, I’ve split the GET builtin into an SSA operation for the get itself (returning the %0 value in the example above), plus an internal AssignOp, roughly as if the GET statement had been rewritten as an explicit assignment of the get result to x, with the type information for the get riding on the assignment variable. That choice doesn’t model the language in an ideal way. However, there are plenty of other places where my generated MLIR also isn’t a great one-to-one match for the language, so I don’t feel too bad about having done that, but I might make different choices if I wanted a lowering pass that transformed the silly dialect into something that represented a different language.
Here’s the corresponding LLVM-IR for that MLIR (with the debug info stripped out):
declare void @__silly_print_i64(i64)
declare i32 @__silly_get_i32()
define i32 @main() !dbg !4 {
%1 = alloca i32, i64 1, align 4
%2 = call i32 @__silly_get_i32()
store i32 %2, ptr %1, align 4
%3 = load i32, ptr %1, align 4
%4 = sext i32 %3 to i64
call void @__silly_print_i64(i64 %4)
ret i32 0
}
The store/load pair is a leftover of the symbol references. There’s some remnant of that left in the assembly without optimization:
   0: push   %rax
   1: call   6
        2: R_X86_64_PLT32  __silly_get_i32-0x4
   6: mov    %eax,0x4(%rsp)
   a: movslq %eax,%rdi
   d: call   12
        e: R_X86_64_PLT32  __silly_print_i64-0x4
  12: xor    %eax,%eax
  14: pop    %rcx
  15: ret
but with optimization, we are left with everything in registers:
   0: push   %rax
   1: call   6
        2: R_X86_64_PLT32  __silly_get_i32-0x4
   6: movslq %eax,%rdi
   9: call   e
        a: R_X86_64_PLT32  __silly_print_i64-0x4
   e: xor    %eax,%eax
  10: pop    %rcx
  11: ret
FOR
Here’s a little FOR test program:
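The listing here is a loose reconstruction (exact keywords approximated from the MLIR below): two loops over x, one stepping by 1 and one by 2:

```
DECLARE x;
FOR x = 1 TO 10
    PRINT x;
END
FOR x = 1 TO 10 BY 2
    PRINT x;
END
```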
This prints 1-10 and 1,3,5,7,9 respectively. Here’s the MLIR (with location information stripped out):
module {
func.func @main() -> i32 {
"silly.scope"() ({
"silly.declare"() <{type = i32}> {sym_name = "x"} : () -> ()
%c1_i64 = arith.constant 1 : i64
%0 = arith.trunci %c1_i64 : i64 to i32
%c11_i64 = arith.constant 11 : i64
%1 = arith.trunci %c11_i64 : i64 to i32
%c1_i64_0 = arith.constant 1 : i64
%2 = arith.trunci %c1_i64_0 : i64 to i32
scf.for %arg0 = %0 to %1 step %2 : i32 {
silly.assign @x = %arg0 : i32
%6 = silly.load @x : i32
silly.print %6 : i32
}
%c1_i64_1 = arith.constant 1 : i64
%3 = arith.trunci %c1_i64_1 : i64 to i32
%c11_i64_2 = arith.constant 11 : i64
%4 = arith.trunci %c11_i64_2 : i64 to i32
%c2_i64 = arith.constant 2 : i64
%5 = arith.trunci %c2_i64 : i64 to i32
scf.for %arg0 = %3 to %4 step %5 : i32 {
silly.assign @x = %arg0 : i32
%6 = silly.load @x : i32
silly.print %6 : i32
}
%c0_i32 = arith.constant 0 : i32
"silly.return"(%c0_i32) : (i32) -> ()
}) : () -> ()
"silly.yield"() : () -> ()
}
}
Observe that I did something sneaky in there: I’ve inserted a ‘silly.assign’ from the scf.for loop induction variable at the beginning of the loop body, so that subsequent symbol-based lookups just work. It would be cleaner to make the FOR loop variable private to the loop body (and have the builder reference the SSA induction variable directly, via forOp.getRegion().front().getArgument(0)), instead of requiring a variable in the enclosing scope, but I did it this way to avoid the need for any additional DWARF instrumentation for that variable. Basically, I was being lazy, and letting the implementation guide the language “design”. Is that a hack? Absolutely!
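One detail worth calling out: the loops above use 11 as their upper bound even though the program prints through 10, because scf.for iterates over the half-open interval [lb, ub) using a signed less-than comparison (it shows up as icmp slt in the LLVM-IR below). A small Python model of that semantics (scf_for is my name for the sketch, not a real API):

```python
def scf_for(lb, ub, step, body):
    """Model scf.for semantics: iterate the half-open interval
    [lb, ub) with a signed less-than test, as icmp slt does."""
    i = lb
    while i < ub:      # br i1 (icmp slt i, ub)
        body(i)        # loop body, with i as the induction variable
        i += step      # add i, step

first, second = [], []
scf_for(1, 11, 1, first.append)   # the first loop: step 1
scf_for(1, 11, 2, second.append)  # the second loop: step 2
print(first, second)
```

Both loops use 11 as the exclusive bound (the %c11_i64 constants above), which with the signed less-than test yields the 1-10 and 1,3,5,7,9 sequences.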
Here’s the corresponding LLVM-IR:
declare void @__silly_print_i64(i64)
define i32 @main() {
%1 = alloca i32, i64 1, align 4
#dbg_declare(ptr %1, !9, !DIExpression(), !8)
br label %2
2: ; preds = %5, %0
%3 = phi i32 [ 1, %0 ], [ %8, %5 ]
%4 = icmp slt i32 %3, 11
br i1 %4, label %5, label %9
5: ; preds = %2
store i32 %3, ptr %1, align 4
%6 = load i32, ptr %1, align 4
%7 = sext i32 %6 to i64
call void @__silly_print_i64(i64 %7)
%8 = add i32 %3, 1
br label %2
9: ; preds = %2
br label %10
10: ; preds = %13, %9
%11 = phi i32 [ 1, %9 ], [ %16, %13 ]
%12 = icmp slt i32 %11, 11
br i1 %12, label %13, label %17
13: ; preds = %10
store i32 %11, ptr %1, align 4
%14 = load i32, ptr %1, align 4
%15 = sext i32 %14 to i64
call void @__silly_print_i64(i64 %15)
%16 = add i32 %11, 2
br label %10
17: ; preds = %10
ret i32 0
; uselistorder directives
uselistorder ptr %1, { 2, 3, 0, 1 }
}
and the unoptimized codegen:
   0: push   %rbx
   1: sub    $0x10,%rsp
   5: mov    $0x1,%ebx
   a: cmp    $0xa,%ebx
   d: jg     23
   f: nop
  10: mov    %ebx,0xc(%rsp)
  14: movslq %ebx,%rdi
  17: call   1c
        18: R_X86_64_PLT32  __silly_print_i64-0x4
  1c: inc    %ebx
  1e: cmp    $0xa,%ebx
  21: jle    10
  23: mov    $0x1,%ebx
  28: cmp    $0xa,%ebx
  2b: jg     44
  2d: nopl   (%rax)
  30: mov    %ebx,0xc(%rsp)
  34: movslq %ebx,%rdi
  37: call   3c
        38: R_X86_64_PLT32  __silly_print_i64-0x4
  3c: add    $0x2,%ebx
  3f: cmp    $0xa,%ebx
  42: jle    30
  44: xor    %eax,%eax
  46: add    $0x10,%rsp
  4a: pop    %rbx
  4b: ret
At O2, the optimizer chooses to unroll both loops completely, generating code like:
   0: push   %rax
   1: mov    $0x1,%edi
   6: call   b
        7: R_X86_64_PLT32  __silly_print_i64-0x4
   b: mov    $0x2,%edi
  10: call   15
        11: R_X86_64_PLT32  __silly_print_i64-0x4
  15: mov    $0x3,%edi
  1a: call   1f
        1b: R_X86_64_PLT32  __silly_print_i64-0x4
  1f: mov    $0x4,%edi
  24: call   29
        25: R_X86_64_PLT32  __silly_print_i64-0x4
  29: mov    $0x5,%edi
  2e: call   33
        2f: R_X86_64_PLT32  __silly_print_i64-0x4
  33: mov    $0x6,%edi
  38: call   3d
        39: R_X86_64_PLT32  __silly_print_i64-0x4
  ...
SCF Region declarations
In the V5 tag of the compiler, a program like this wouldn’t work:
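Loosely sketched (keywords approximated from the MLIR below), the program assigns to x, conditionally declares and prints y inside the if body, then prints a string:

```
DECLARE x;
x = 3;
IF x < 4
    DECLARE y;
    y = 42;
    PRINT y;
END
PRINT "Done.";
```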
This is because my DeclareOp needs to be in a region that has an associated symbol table (my ScopeOp). I've dealt with this by changing the insertion point for any declares to the beginning of the ScopeOp for the function (either the implicit main function, or a user defined function).
MLIR for the above program now looks like this:
module {
func.func @main() -> i32 {
"silly.scope"() ({
"silly.declare"() <{type = i32}> {sym_name = "y"} : () -> ()
"silly.declare"() <{type = i32}> {sym_name = "x"} : () -> ()
%c3_i64 = arith.constant 3 : i64
silly.assign @x = %c3_i64 : i64
%0 = silly.load @x : i32
%c4_i64 = arith.constant 4 : i64
%1 = "silly.less"(%0, %c4_i64) : (i32, i64) -> i1
scf.if %1 {
%c42_i64 = arith.constant 42 : i64
silly.assign @y = %c42_i64 : i64
%3 = silly.load @y : i32
silly.print %3 : i32
}
%2 = "silly.string_literal"() <{value = "Done."}> : () -> !llvm.ptr
silly.print %2 : !llvm.ptr
%c0_i32 = arith.constant 0 : i32
"silly.return"(%c0_i32) : (i32) -> ()
}) : () -> ()
"silly.yield"() : () -> ()
}
}
The declares for x and y are no longer in program order, but no program can observe that internal change, since I don't provide any explicit addressing operations.
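The insertion-point trick behaves like this: each DeclareOp is created at the then-current start of the ScopeOp region, so a declare encountered later in the program surfaces ahead of earlier ones (which is why y precedes x in the MLIR above). A rough Python model of that builder behavior (hoist_declares and the statement tuples are illustrative, not the actual builder code):

```python
def hoist_declares(stmts):
    """Model the builder: declares are inserted at the start of the
    scope (so later declares end up first); all other statements
    keep their original relative order."""
    hoisted, rest = [], []
    for s in stmts:
        if s[0] == "declare":
            hoisted.insert(0, s)  # insertion point is the scope start
        else:
            rest.append(s)
    return hoisted + rest

# Statement stream for the example program: x declared at top level,
# y declared inside the if body.
scope = [
    ("declare", "x"),
    ("assign", "x", 3),
    ("if", "x < 4"),
    ("declare", "y"),
    ("assign", "y", 42),
    ("print", "y"),
]
print(hoist_declares(scope))
```

Running this puts the y declare first, then x, then the rest of the statements untouched, matching the declare order in the generated MLIR.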
Here's the generated LLVM-IR for this program:
@str_0 = private constant [5 x i8] c"Done."
declare void @__silly_print_string(i64, ptr)
declare void @__silly_print_i64(i64)
define i32 @main() !dbg !4 {
%1 = alloca i32, i64 1, align 4
%2 = alloca i32, i64 1, align 4
store i32 3, ptr %2, align 4
%3 = load i32, ptr %2, align 4
%4 = sext i32 %3 to i64
%5 = icmp slt i64 %4, 4
br i1 %5, label %6, label %9
6: ; preds = %0
store i32 42, ptr %1, align 4
%7 = load i32, ptr %1, align 4
%8 = sext i32 %7 to i64
call void @__silly_print_i64(i64 %8)
br label %9
9: ; preds = %6, %0
call void @__silly_print_string(i64 5, ptr @str_0)
ret i32 0
}
Without optimization, the codegen is:
   0: push   %rax
   1: movl   $0x3,(%rsp)
   8: xor    %eax,%eax
   a: test   %al,%al
   c: jne    20
   e: movl   $0x2a,0x4(%rsp)
  16: mov    $0x2a,%edi
  1b: call   20
        1c: R_X86_64_PLT32  __silly_print_i64-0x4
  20: mov    $0x5,%edi
  25: mov    $0x0,%esi
        26: R_X86_64_32  .rodata
  2a: call   2f
        2b: R_X86_64_PLT32  __silly_print_string-0x4
  2f: xor    %eax,%eax
  31: pop    %rcx
  32: ret
And with optimization, the branching on constant values is purged, leaving just gorp for the print calls:
