There’s an existing toy MLIR dialect, part of the MLIR tutorial documentation, so I’ve renamed my dialect from toy to silly, and updated all the references to ‘toy calculator’ to ‘silly compiler’ or ‘silly language’. There’s no good reason to use this language, nor the compiler, so the name is very appropriate. It was, however, an excellent learning tool. The toy namespace is renamed, as are various file names, all the MLIR operators, function prefixes, and so forth.
In addition to the big rename, other changes since the V5 tag include:
- A GET builtin (can now do I/O, not just O).
- FOR loop support.
- Something much closer to a consistent coding style now (FooBar for structures, fooBar for functions, no more use of all of PascalCase, camelCase, and underscore separated variables).
- Almost all of the auto variables have been purged for clarity.
- I’ve removed the ‘using namespace mlir’ in lowering.cpp. Many of my mlir:: namespace references already had the namespace tag, so removing this allowed for more consistency. I may revert this if it proves too cumbersome, but if I do, I’ll remove all the mlir:: qualifiers consistently (unless they are needed for disambiguation).
- User errors in the parser/builder no longer log the internal file:line:func for the code that spots them, but just the file:line location of the code with the error. Those errors are now reported with mlir::emitError().
- Declarations in scf.for and scf.if/else regions are now supported.
- The error test script is now merged into bin/testit, so there’s just one script to run the regression test.
- Switched to /// style doxygen markup.
GET
Here’s a sample program with a GET call:
and the corresponding MLIR output:
module {
func.func @main() -> i32 {
"silly.scope"() ({
"silly.declare"() <{type = i32}> {sym_name = "x"} : () -> ()
%0 = silly.get : i32
silly.assign @x = %0 : i32
%1 = silly.load @x : i32
silly.print %1 : i32
%c0_i32 = arith.constant 0 : i32
"silly.return"(%c0_i32) : (i32) -> ()
}) : () -> ()
"silly.yield"() : () -> ()
}
}
In the generated MLIR, I’ve split the GET builtin into two pieces: an SSA value for the get itself (%0 in the example above), and an internal AssignOp, kind of as if the statement was:
with the type information for the get riding on the assignment variable. That choice doesn’t model the language in an ideal way. However, there are plenty of other places where my generated MLIR also isn’t a great one-to-one match for the language, so I don’t feel too bad about having done that. I might make different choices if I wanted to have a lowering pass that transformed the silly dialect into something that represented a different language.
Here’s the corresponding LLVM-IR for that MLIR (with the DI stripped out):
declare void @__silly_print_i64(i64)
declare i32 @__silly_get_i32()
define i32 @main() !dbg !4 {
%1 = alloca i32, i64 1, align 4
%2 = call i32 @__silly_get_i32()
store i32 %2, ptr %1, align 4
%3 = load i32, ptr %1, align 4
%4 = sext i32 %3 to i64
call void @__silly_print_i64(i64 %4)
ret i32 0
}
The store/load pair is there because of the symbol references. There’s some remnant of that left in the assembly without optimization:
0: push %rax
1: call 6
2: R_X86_64_PLT32 __silly_get_i32-0x4
6: mov %eax,0x4(%rsp)
a: movslq %eax,%rdi
d: call 12
e: R_X86_64_PLT32 __silly_print_i64-0x4
12: xor %eax,%eax
14: pop %rcx
15: ret
but with optimization, we are left with everything in registers:
0: push %rax
1: call 6
2: R_X86_64_PLT32 __silly_get_i32-0x4
6: movslq %eax,%rdi
9: call e
a: R_X86_64_PLT32 __silly_print_i64-0x4
e: xor %eax,%eax
10: pop %rcx
11: ret
FOR
Here’s a little FOR test program:
INT32 x;
FOR ( x : (1, 11) )
{
PRINT x;
};
FOR ( x : (1, 11, 2) )
{
PRINT x;
};
This prints 1-10 and 1,3,5,7,9 respectively. Here’s the MLIR (with location information stripped out):
module {
func.func @main() -> i32 {
"silly.scope"() ({
"silly.declare"() <{type = i32}> {sym_name = "x"} : () -> ()
%c1_i64 = arith.constant 1 : i64
%0 = arith.trunci %c1_i64 : i64 to i32
%c11_i64 = arith.constant 11 : i64
%1 = arith.trunci %c11_i64 : i64 to i32
%c1_i64_0 = arith.constant 1 : i64
%2 = arith.trunci %c1_i64_0 : i64 to i32
scf.for %arg0 = %0 to %1 step %2 : i32 {
silly.assign @x = %arg0 : i32
%6 = silly.load @x : i32
silly.print %6 : i32
}
%c1_i64_1 = arith.constant 1 : i64
%3 = arith.trunci %c1_i64_1 : i64 to i32
%c11_i64_2 = arith.constant 11 : i64
%4 = arith.trunci %c11_i64_2 : i64 to i32
%c2_i64 = arith.constant 2 : i64
%5 = arith.trunci %c2_i64 : i64 to i32
scf.for %arg0 = %3 to %4 step %5 : i32 {
silly.assign @x = %arg0 : i32
%6 = silly.load @x : i32
silly.print %6 : i32
}
%c0_i32 = arith.constant 0 : i32
"silly.return"(%c0_i32) : (i32) -> ()
}) : () -> ()
"silly.yield"() : () -> ()
}
}
Observe that I did something sneaky in there: I’ve inserted a ‘silly.assign’ from the scf.for loop induction variable at the beginning of the loop body, so that subsequent symbol-based lookups just work. It would be cleaner to make the FOR loop variable private to the loop body, and have the builder reference the SSA induction variable directly (forOp.getRegion().front().getArgument(0)), instead of requiring a variable in the enclosing scope. I did it this way to avoid the need for any additional DWARF instrumentation for that variable. Basically, I was being lazy, and letting implementation guide the language “design”. Is that a hack? Absolutely!
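As a rough C++ model of what the loop body does with the induction variable (a sketch only: runFor is an invented name, and a positive step is assumed, as scf.for requires):

```cpp
#include <cstdint>
#include <vector>

// Models FOR ( x : (start, end, step) ). The range is half-open, and the
// induction variable is copied into the enclosing-scope variable x at the
// top of each iteration, like the inserted silly.assign.
// Assumes step > 0. Returns the values that would be printed.
inline std::vector<int32_t> runFor(int32_t start, int32_t end, int32_t step,
                                   int32_t &x) {
    std::vector<int32_t> printed;
    for (int32_t iv = start; iv < end; iv += step) {
        x = iv;                // silly.assign @x = %arg0 : i32
        printed.push_back(x);  // silly.load @x, then silly.print
    }
    return printed;
}
```

One consequence of keeping x in the enclosing scope is that it still holds the last induction value after the loop exits, which would not be the case if the loop variable were private to the loop body.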
Here’s the corresponding LLVM-IR:
declare void @__silly_print_i64(i64)
define i32 @main() {
%1 = alloca i32, i64 1, align 4
#dbg_declare(ptr %1, !9, !DIExpression(), !8)
br label %2
2: ; preds = %5, %0
%3 = phi i32 [ 1, %0 ], [ %8, %5 ]
%4 = icmp slt i32 %3, 11
br i1 %4, label %5, label %9
5: ; preds = %2
store i32 %3, ptr %1, align 4
%6 = load i32, ptr %1, align 4
%7 = sext i32 %6 to i64
call void @__silly_print_i64(i64 %7)
%8 = add i32 %3, 1
br label %2
9: ; preds = %2
br label %10
10: ; preds = %13, %9
%11 = phi i32 [ 1, %9 ], [ %16, %13 ]
%12 = icmp slt i32 %11, 11
br i1 %12, label %13, label %17
13: ; preds = %10
store i32 %11, ptr %1, align 4
%14 = load i32, ptr %1, align 4
%15 = sext i32 %14 to i64
call void @__silly_print_i64(i64 %15)
%16 = add i32 %11, 2
br label %10
17: ; preds = %10
ret i32 0
; uselistorder directives
uselistorder ptr %1, { 2, 3, 0, 1 }
}
and the unoptimized codegen:
0: push %rbx
1: sub $0x10,%rsp
5: mov $0x1,%ebx
a: cmp $0xa,%ebx
d: jg 23
f: nop
10: mov %ebx,0xc(%rsp)
14: movslq %ebx,%rdi
17: call 1c
18: R_X86_64_PLT32 __silly_print_i64-0x4
1c: inc %ebx
1e: cmp $0xa,%ebx
21: jle 10
23: mov $0x1,%ebx
28: cmp $0xa,%ebx
2b: jg 44
2d: nopl (%rax)
30: mov %ebx,0xc(%rsp)
34: movslq %ebx,%rdi
37: call 3c
38: R_X86_64_PLT32 __silly_print_i64-0x4
3c: add $0x2,%ebx
3f: cmp $0xa,%ebx
42: jle 30
44: xor %eax,%eax
46: add $0x10,%rsp
4a: pop %rbx
4b: ret
At O2 optimization, LLVM chooses to unroll both loops completely, generating code like:
0: push %rax
1: mov $0x1,%edi
6: call b
7: R_X86_64_PLT32 __silly_print_i64-0x4
b: mov $0x2,%edi
10: call 15
11: R_X86_64_PLT32 __silly_print_i64-0x4
15: mov $0x3,%edi
1a: call 1f
1b: R_X86_64_PLT32 __silly_print_i64-0x4
1f: mov $0x4,%edi
24: call 29
25: R_X86_64_PLT32 __silly_print_i64-0x4
29: mov $0x5,%edi
2e: call 33
2f: R_X86_64_PLT32 __silly_print_i64-0x4
33: mov $0x6,%edi
38: call 3d
39: R_X86_64_PLT32 __silly_print_i64-0x4
...
SCF Region declarations
In the V5 tag of the compiler, a program like this wouldn’t work:
INT32 x;
x = 3;
IF ( x < 4 )
{
INT32 y;
y = 42;
PRINT y;
};
PRINT "Done.";
This is because my DeclareOp needs to be in a region that has an associated symbol table (my ScopeOp). I've dealt with this by changing the insertion point for any declares to the beginning of the ScopeOp for the function (either the implicit main function, or a user defined function).
MLIR for the above program now looks like this:
module {
func.func @main() -> i32 {
"silly.scope"() ({
"silly.declare"() <{type = i32}> {sym_name = "y"} : () -> ()
"silly.declare"() <{type = i32}> {sym_name = "x"} : () -> ()
%c3_i64 = arith.constant 3 : i64
silly.assign @x = %c3_i64 : i64
%0 = silly.load @x : i32
%c4_i64 = arith.constant 4 : i64
%1 = "silly.less"(%0, %c4_i64) : (i32, i64) -> i1
scf.if %1 {
%c42_i64 = arith.constant 42 : i64
silly.assign @y = %c42_i64 : i64
%3 = silly.load @y : i32
silly.print %3 : i32
}
%2 = "silly.string_literal"() <{value = "Done."}> : () -> !llvm.ptr
silly.print %2 : !llvm.ptr
%c0_i32 = arith.constant 0 : i32
"silly.return"(%c0_i32) : (i32) -> ()
}) : () -> ()
"silly.yield"() : () -> ()
}
}
The declares for x and y are no longer in program order, but no program can observe that internal change, as I don't provide any explicit addressing operations.
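The reordering falls out of always inserting at the start of the scope. Here is a toy model of that in C++, using a plain vector of strings rather than the MLIR API (ScopeModel and buildExample are hypothetical names for illustration; with the real builder this would presumably involve saving the insertion point and moving it to the start of the ScopeOp's block):

```cpp
#include <string>
#include <vector>

// Toy stand-in for a scope's operation list. This is not the MLIR API.
struct ScopeModel {
    std::vector<std::string> ops;

    // Declarations always go to the start of the scope, wherever the
    // builder happens to be when it encounters them.
    void declare(const std::string &name) {
        ops.insert(ops.begin(), "declare " + name);
    }

    // Everything else is appended in program order.
    void emit(const std::string &op) { ops.push_back(op); }
};

// Replays the IF example: x is declared first, y is declared later (inside
// the IF body), so y ends up ahead of x in the scope.
inline std::vector<std::string> buildExample() {
    ScopeModel scope;
    scope.declare("x");
    scope.emit("assign x");
    scope.emit("if");    // the scf.if body is collapsed to one entry here
    scope.declare("y");  // encountered inside the IF body, hoisted to the scope
    scope.emit("print Done.");
    return scope.ops;
}
```

In this model the resulting order is declare y, declare x, then the remaining operations in program order, matching the order of the two silly.declare operations in the MLIR above.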
Here's the generated LLVM-IR for this program:
@str_0 = private constant [5 x i8] c"Done."
declare void @__silly_print_string(i64, ptr)
declare void @__silly_print_i64(i64)
define i32 @main() !dbg !4 {
%1 = alloca i32, i64 1, align 4
%2 = alloca i32, i64 1, align 4
store i32 3, ptr %2, align 4
%3 = load i32, ptr %2, align 4
%4 = sext i32 %3 to i64
%5 = icmp slt i64 %4, 4
br i1 %5, label %6, label %9
6: ; preds = %0
store i32 42, ptr %1, align 4
%7 = load i32, ptr %1, align 4
%8 = sext i32 %7 to i64
call void @__silly_print_i64(i64 %8)
br label %9
9: ; preds = %6, %0
call void @__silly_print_string(i64 5, ptr @str_0)
ret i32 0
}
Without optimization, the codegen is:
0: push %rax
1: movl $0x3,(%rsp)
8: xor %eax,%eax
a: test %al,%al
c: jne 20
e: movl $0x2a,0x4(%rsp)
16: mov $0x2a,%edi
1b: call 20
1c: R_X86_64_PLT32 __silly_print_i64-0x4
20: mov $0x5,%edi
25: mov $0x0,%esi
26: R_X86_64_32 .rodata
2a: call 2f
2b: R_X86_64_PLT32 __silly_print_string-0x4
2f: xor %eax,%eax
31: pop %rcx
32: ret
And with optimization, the branching on constant values is purged, leaving just gorp for the print calls:
0: push %rax
1: mov $0x2a,%edi
6: call b
7: R_X86_64_PLT32 __silly_print_i64-0x4
b: mov $0x5,%edi
10: mov $0x0,%esi
11: R_X86_64_32 .rodata
15: call 1a
16: R_X86_64_PLT32 __silly_print_string-0x4
1a: xor %eax,%eax
1c: pop %rcx
1d: ret