Month: July 2025

A first PyTorch program to build a simple pattern recognizer.

July 12, 2025 math and physics play

What is it called if you use AI to build an AI?  That’s what I’ve done to understand a bit of how PyTorch works: building a little program that trains a simple network to attempt to recognize Fibonacci-like sequences.

Fibonacci recap.

Recall that the Fibonacci series is defined by a recurrence relationship where the next term is the sum of the previous two terms
\begin{equation}\label{eqn:pytorch:10}
F_k = F_{k-1} + F_{k-2}.
\end{equation}
Two starting values are required to get things going, and the usual choices are \( 0, 1 \) or \( 1, 1 \).

I liked the idea of using something deterministic for training data, and asked Claude to build me a simple NN that used Fibonacci-like sequences
\begin{equation}\label{eqn:pytorch:20}
F_k = \alpha F_{k-1} + \beta F_{k-2},
\end{equation}
as training data, with starting values \( a = F_0, b = F_1 \). It turns out that choices of \( \alpha, \beta \) greater than one make the training blow up, since the sequence values grow quickly and get out of control. It was possible to work around that by renormalizing the input data with a log transformation and re-exponentiating afterwards. However, I decided that such series were less interesting than those closer to the Fibonacci series itself, and disabled that log renormalization by default (a command line option --logtx is available to force it; if used, it is required for both training and inference.)
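
For concreteness, here is a minimal sketch (my own illustration, not the generated program's actual data pipeline) of how such sequences and the (input, target) training pairs can be built:

import numpy as np

def fib_like(alpha, beta, a, b, n):
    """Generate n terms of F_k = alpha*F_{k-1} + beta*F_{k-2}, starting from a, b."""
    seq = [a, b]
    for _ in range(n - 2):
        seq.append(alpha * seq[-1] + beta * seq[-2])
    return np.array(seq)

# Each training sample is a pair of consecutive terms plus the term that follows.
seq = fib_like(alpha=1.0, beta=1.0, a=1.0, b=1.0, n=10)  # the classic Fibonacci series
X = np.stack([seq[:-2], seq[1:-1]], axis=1)              # rows of [F_{k-2}, F_{k-1}]
y = seq[2:]                                              # the corresponding F_k targets

Running this with \( \alpha \) or \( \beta \) greater than one shows how quickly the terms grow, which is exactly what made the un-normalized training blow up.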

Neural Network Architecture

There are a couple building blocks that the network uses.

  • \(\mathrm{ReLU} \) (Rectified Linear Unit) is an activation function in PyTorch:
    \begin{equation*}
    \textrm{ReLU}(x) = \max(0, x).
    \end{equation*}
    If the input is positive, the output is the input; if the input is negative or zero, the output is zero.
  • \( \mathrm{Dropout} \).
    Dropout is a regularization technique that randomly zeroes some neurons during training to prevent overfitting. For each neuron during training:
    \begin{equation*}
    y_i =
    \begin{cases}
    0 & \text{with probability } p \\
    \frac{x_i}{1-p} & \text{with probability } 1-p
    \end{cases}
    \end{equation*}
    where \( p \) is the dropout probability, \( x_i \) is the input to the neuron, and \( y_i \) is the output. With the 10\% dropout probability used here, some inputs are zeroed at random, and whatever survives is scaled up slightly (by \( 1/0.9 \approx 1.11 \)); a quick demonstration follows this list.
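
Both pieces are stock PyTorch modules. Here is a minimal sketch (my own, not part of the generated program) showing their behaviour on a small tensor:

import torch
import torch.nn as nn

x = torch.tensor([-2.0, -0.5, 0.0, 1.0, 3.0])

relu = nn.ReLU()
print(relu(x))       # tensor([0., 0., 0., 1., 3.]) -- negative inputs clamped to zero

dropout = nn.Dropout(p=0.1)
dropout.train()      # dropout is only active in training mode
print(dropout(x))    # surviving entries are scaled by 1/(1 - 0.1), roughly 1.11x

dropout.eval()       # in eval mode dropout is a pass-through
print(dropout(x))    # tensor([-2.0000, -0.5000, 0.0000, 1.0000, 3.0000])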

The model is a feedforward neural network with the following structure:

Input Layer

  • Input: \(\mathbf{x} = [f_{k-2}, f_{k-1}] \in \mathbb{R}^2\)
  • Output: \(f_k \in \mathbb{R}\)

Hidden Layers

The network has 3 hidden layers with ReLU activations:

\begin{equation*}
\mathbf{h}_1 = \textrm{ReLU}(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1)
\end{equation*}
\begin{equation*}
\mathbf{h}_1' = \textrm{Dropout}(\mathbf{h}_1, p=0.1)
\end{equation*}

\begin{equation*}
\mathbf{h}_2 = \textrm{ReLU}(\mathbf{W}_2 \mathbf{h}_1' + \mathbf{b}_2)
\end{equation*}
\begin{equation*}
\mathbf{h}_2' = \textrm{Dropout}(\mathbf{h}_2, p=0.1)
\end{equation*}

\begin{equation*}
\mathbf{h}_3 = \textrm{ReLU}(\mathbf{W}_3 \mathbf{h}_2' + \mathbf{b}_3)
\end{equation*}

Output Layer

\begin{equation*}
\hat{f}_k = \mathbf{W}_4 \mathbf{h}_3 + \mathbf{b}_4
\end{equation*}

Where:

  • \(\mathbf{W}_1 \in \mathbb{R}^{32 \times 2}, \mathbf{b}_1 \in \mathbb{R}^{32}\)
  • \(\mathbf{W}_2, \mathbf{W}_3 \in \mathbb{R}^{32 \times 32}, \mathbf{b}_2, \mathbf{b}_3 \in \mathbb{R}^{32}\)
  • \(\mathbf{W}_4 \in \mathbb{R}^{1 \times 32}, \mathbf{b}_4 \in \mathbb{R}^{1}\)
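
Counting weights and biases from those dimensions, the whole model is tiny:

\begin{equation*}
(2 \times 32 + 32) + 2 (32 \times 32 + 32) + (32 \times 1 + 1) = 96 + 2112 + 33 = 2241 \text{ parameters}.
\end{equation*}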

Essentially, we have some FMA-like operations, but on matrices rather than individual floats, with non-linear filters between the layers.  Eons ago, I recall using sigmoid functions as those filters.  It looks like modern networks use activations, like ReLU, that are cheaper to evaluate and more amenable to parallel computation.

The setup for the network layers is pretty simple:

import torch
import torch.nn as nn
import torch.optim as optim

class SequencePredictor(nn.Module):
    def __init__(self, input_size=2, hidden_size=32, output_size=1):
        super(SequencePredictor, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, hidden_size)
        self.fc4 = nn.Linear(hidden_size, output_size)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.1)

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.relu(self.fc2(x))
        x = self.dropout(x)
        x = self.relu(self.fc3(x))
        x = self.fc4(x)
        return x
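
As a quick sanity check (my own snippet, not part of the generated program), a forward pass through a freshly constructed model confirms the shapes, and the parameter count matches the 2241 computed above:

model = SequencePredictor()
x = torch.FloatTensor([[1.0, 1.0], [1.0, 2.0]])    # a batch of [F_{k-2}, F_{k-1}] pairs
print(model(x).shape)                              # torch.Size([2, 1]): one prediction per row
print(sum(p.numel() for p in model.parameters())) # 2241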

Training.

Training itself takes only a handful of lines of code:

    # Convert to PyTorch tensors
    X_tensor = torch.FloatTensor(X_norm)
    y_tensor = torch.FloatTensor(y_norm).unsqueeze(1)
    
    # Create model
    model = SequencePredictor()
    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    # Training loop
    print("Training model...")
    losses = []

    for epoch in range(epochs):
        # Forward pass
        predictions = model(X_tensor)
        loss = criterion(predictions, y_tensor)

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        losses.append(loss.item())

        if (epoch + 1) % 200 == 0:
            print(f'Epoch [{epoch+1}/{epochs}], Loss: {loss.item():.6f}')

A lot of this is a black box, but unlike more complicated and sophisticated models, it looks pretty feasible to learn what each of these steps is doing. The optimizer.step() call applies a gradient descent update in the current neighbourhood (Adam is a gradient descent variant that maintains per-parameter adaptive learning rates), using the gradients that loss.backward() computed.
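
As a rough mental model, the plain gradient descent version of optimizer.step() would look something like the following hand-rolled sketch (not what Adam literally does; Adam additionally keeps running estimates of the first and second moments of each parameter's gradient):

# A stand-in for optimizer.step(), reusing the model and gradients from the loop above.
# Assumes loss.backward() has already populated p.grad for every parameter.
lr = 0.001
with torch.no_grad():
    for p in model.parameters():
        if p.grad is not None:
            p -= lr * p.grad   # nudge each weight a small step against its gradient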

Results

The classic Fibonacci series wasn’t in the training data, and the model only achieves order-of-magnitude predictions for it:

Testing on α=1, β=1, a=1, b=1...

Step True     Predicted  Absolute Error       Relative Error %
----------------------------------------------------------------
0    1.0      1.0        0.0                  0.0                 
1    1.0      1.0        0.0                  0.0                 
2    2.0      2.7        0.7                  35.9                
3    3.0      4.2        1.2                  39.3                
4    5.0      6.4        1.4                  27.8                
5    8.0      9.5        1.5                  19.0                
6    13.0     13.6       0.6                  4.9                 
7    21.0     18.8       2.2                  10.6                
8    34.0     25.0       9.0                  26.6                
9    55.0     33.4       21.6                 39.3                

Total absolute error: 38.30111360549927
Total relative error: 203.36962890625

Of the training data, one particular sequence matches fairly closely:

Testing on α=0.911811, β=0.857173, a=1.45565, b=1.65682...

Step True     Predicted  Absolute Error       Relative Error %
----------------------------------------------------------------
0    1.5      1.5        0.0                  0.0
1    1.7      1.7        0.0                  0.0
2    2.8      3.7        0.9                  32.8
3    3.9      5.4        1.5                  37.9
4    6.0      8.2        2.3                  38.0
5    8.8      12.2       3.4                  38.5
6    13.1     17.5       4.3                  33.0
7    19.5     23.7       4.2                  21.3
8    29.0     31.7       2.7                  9.2
9    43.2     43.2       0.1                  0.1

Total absolute error: 19.270923376083374
Total relative error: 210.85977474600077

The worst-matching sequence from the training data has increasing relative error as the sequence progresses:

Testing on α=0.942149, β=1.03861, a=1.36753, b=1.94342...

Step True     Predicted  Absolute Error       Relative Error %
----------------------------------------------------------------
0    1.4      1.4        0.0                  0.0
1    1.9      1.9        0.0                  0.0
2    3.3      3.7        0.5                  15.1
3    5.1      5.6        0.6                  11.0
4    8.2      8.5        0.3                  4.2 
5    13.0     12.5       0.4                  3.3 
6    20.7     17.8       2.9                  14.1
7    33.0     24.0       9.0                  27.3
8    52.6     32.1       20.4                 38.9
9    83.8     43.7       40.0                 47.8

Total absolute error: 74.20500636100769
Total relative error: 161.64879322052002

To get a better idea of how the model does against the training data, here are a couple of plots, one with good matching initially and then divergence:

and one that matches well on average, but with no great specific matching anywhere:

and plots against all the training data:

 

It would be more interesting to train this against something that isn’t completely monotonic, perhaps some set of characteristic functions: sine, sinc, exp, …  Since the model ends up extrapolating from the first couple of elements, having training data that starts with wider variation would be useful.
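
A sketch of what that could look like (the function choices and sampling grid here are just placeholders): slice each function into the same \( [y_{k-2}, y_{k-1}] \rightarrow y_k \) windows used for the Fibonacci-like data.

import numpy as np

def make_windows(f, xs):
    """Slice f(xs) into ([y_{k-2}, y_{k-1}], y_k) training samples."""
    ys = f(xs)
    X = np.stack([ys[:-2], ys[1:-1]], axis=1)
    return X, ys[2:]

xs = np.linspace(0.0, 4.0 * np.pi, 200)
X_sin, y_sin = make_windows(np.sin, xs)      # oscillatory, non-monotonic
X_sinc, y_sinc = make_windows(np.sinc, xs)   # numpy's normalized sinc
X_exp, y_exp = make_windows(np.exp, xs)      # monotonic, kept for contrast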

Added FUNCTION/CALL support to my toy compiler

July 7, 2025 clang/llvm

I’ve tagged V4 for my toy language and MLIR-based compiler.

See the Changelog for the gory details (or the commit history).  There are three specific new features, relative to the V3 tag:

    1. Adds support (grammar, builder, lowering) for function declarations and function calls. Much of the work for this was done in the branch use_mlir_funcop_with_scopeop, later squashed and merged as a big commit. Here’s an example:

      FUNCTION bar ( INT16 w, INT32 z )
      {
          PRINT "In bar";
          PRINT w;
          PRINT z;
          RETURN;
      };
      
      FUNCTION foo ( )
      {
          INT16 v;
          v = 3;
          PRINT "In foo";
          CALL bar( v, 42 );
          PRINT "Called bar";
          RETURN;
      };
      
      PRINT "In main";
      CALL foo();
      PRINT "Back in main";
      

      Here is the MLIR for this program:

      module {
        func.func private @foo() {
          "toy.scope"() ({
            "toy.declare"() <{type = i16}> {sym_name = "v"} : () -> ()
            %c3_i64 = arith.constant 3 : i64
            "toy.assign"(%c3_i64) <{var_name = @v}> : (i64) -> ()
            %0 = "toy.string_literal"() <{value = "In foo"}> : () -> !llvm.ptr
            toy.print %0 : !llvm.ptr
            %1 = "toy.load"() <{var_name = @v}> : () -> i16
            %c42_i64 = arith.constant 42 : i64
            %2 = arith.trunci %c42_i64 : i64 to i32
            "toy.call"(%1, %2) <{callee = @bar}> : (i16, i32) -> ()
            %3 = "toy.string_literal"() <{value = "Called bar"}> : () -> !llvm.ptr
            toy.print %3 : !llvm.ptr
            "toy.return"() : () -> ()
          }) : () -> ()
          "toy.yield"() : () -> ()
        }
        func.func private @bar(%arg0: i16, %arg1: i32) {
          "toy.scope"() ({
            "toy.declare"() <{param_number = 0 : i64, parameter, type = i16}> {sym_name = "w"} : () -> ()
            "toy.declare"() <{param_number = 1 : i64, parameter, type = i32}> {sym_name = "z"} : () -> ()
            %0 = "toy.string_literal"() <{value = "In bar"}> : () -> !llvm.ptr
            toy.print %0 : !llvm.ptr
            %1 = "toy.load"() <{var_name = @w}> : () -> i16
            toy.print %1 : i16
            %2 = "toy.load"() <{var_name = @z}> : () -> i32
            toy.print %2 : i32
            "toy.return"() : () -> ()
          }) : () -> ()
          "toy.yield"() : () -> ()
        }
        func.func @main() -> i32 {
          "toy.scope"() ({
            %c0_i32 = arith.constant 0 : i32
            %0 = "toy.string_literal"() <{value = "In main"}> : () -> !llvm.ptr
            toy.print %0 : !llvm.ptr
            "toy.call"() <{callee = @foo}> : () -> ()
            %1 = "toy.string_literal"() <{value = "Back in main"}> : () -> !llvm.ptr
            toy.print %1 : !llvm.ptr
            "toy.return"(%c0_i32) : (i32) -> ()
          }) : () -> ()
          "toy.yield"() : () -> ()
        }
      }
      

      Here’s a sample program that passes a literal value in the CALL:

      FUNCTION bar ( INT16 w )
      {
          PRINT w;
          RETURN;
      };
      
      PRINT "In main";
      CALL bar( 3 );
      PRINT "Back in main";
      

      The MLIR for this one looks like:

      module {
        func.func private @bar(%arg0: i16) {
          "toy.scope"() ({
            "toy.declare"() <{param_number = 0 : i64, parameter, type = i16}> {sym_name = "w"} : () -> ()
            %0 = "toy.load"() <{var_name = @w}> : () -> i16
            toy.print %0 : i16
            "toy.return"() : () -> ()
          }) : () -> ()
          "toy.yield"() : () -> ()
        }
        func.func @main() -> i32 {
          "toy.scope"() ({
            %c0_i32 = arith.constant 0 : i32
            %0 = "toy.string_literal"() <{value = "In main"}> : () -> !llvm.ptr
            toy.print %0 : !llvm.ptr
            %c3_i64 = arith.constant 3 : i64
            %1 = arith.trunci %c3_i64 : i64 to i16
            "toy.call"(%1) <{callee = @bar}> : (i16) -> ()
            %2 = "toy.string_literal"() <{value = "Back in main"}> : () -> !llvm.ptr
            toy.print %2 : !llvm.ptr
            "toy.return"(%c0_i32) : (i32) -> ()
          }) : () -> ()
          "toy.yield"() : () -> ()
        }
      }
      

      I’ve implemented a two-stage lowering, where the toy.scope, toy.yield, toy.call, and toy.return ops are stripped out, leaving just the func and llvm dialects. Code from that stage of the lowering is cleaner looking:

      llvm.mlir.global private constant @str_1(dense<[66, 97, 99, 107, 32, 105, 110, 32, 109, 97, 105, 110]> : tensor<12xi8>) {addr_space = 0 : i32} : !llvm.array<12 x i8>
      func.func private @__toy_print_string(i64, !llvm.ptr)
      llvm.mlir.global private constant @str_0(dense<[73, 110, 32, 109, 97, 105, 110]> : tensor<7xi8>) {addr_space = 0 : i32} : !llvm.array<7 x i8>
      func.func private @__toy_print_i64(i64)
      func.func private @bar(%arg0: i16) {
        %0 = llvm.mlir.constant(1 : i64) : i64
        %1 = llvm.alloca %0 x i16 {alignment = 2 : i64, bindc_name = "w.addr"} : (i64) -> !llvm.ptr
        llvm.store %arg0, %1 : i16, !llvm.ptr
        %2 = llvm.load %1 : !llvm.ptr -> i16
        %3 = llvm.sext %2 : i16 to i64
        call @__toy_print_i64(%3) : (i64) -> ()
        return
      }
      func.func @main() -> i32 {
        %0 = llvm.mlir.constant(0 : i32) : i32
        %1 = llvm.mlir.addressof @str_0 : !llvm.ptr
        %2 = llvm.mlir.constant(7 : i64) : i64
        call @__toy_print_string(%2, %1) : (i64, !llvm.ptr) -> ()
        %3 = llvm.mlir.constant(3 : i64) : i64
        %4 = llvm.mlir.constant(3 : i16) : i16
        call @bar(%4) : (i16) -> ()
        %5 = llvm.mlir.addressof @str_1 : !llvm.ptr
        %6 = llvm.mlir.constant(12 : i64) : i64
        call @__toy_print_string(%6, %5) : (i64, !llvm.ptr) -> ()
        return %0 : i32
      }
      

      There are some dead-code constants left there (%3), seemingly due to type conversion, but they get stripped out nicely by the time we get to LLVM-IR:

      @str_1 = private constant [12 x i8] c"Back in main"
      @str_0 = private constant [7 x i8] c"In main"
      
      declare void @__toy_print_string(i64, ptr)
      
      declare void @__toy_print_i64(i64)
      
      define void @bar(i16 %0) {
        %2 = alloca i16, i64 1, align 2
        store i16 %0, ptr %2, align 2
        %3 = load i16, ptr %2, align 2
        %4 = sext i16 %3 to i64
        call void @__toy_print_i64(i64 %4)
        ret void
      }
      
      define i32 @main() {
        call void @__toy_print_string(i64 7, ptr @str_0)
        call void @bar(i16 3)
        call void @__toy_print_string(i64 12, ptr @str_1)
        ret i32 0
      }
    2. Generalize NegOp lowering to support all types, not just f64.
    3. Allow PRINT of string literals, avoiding the requirement to use a variable. Example:

          %0 = "toy.string_literal"() <{value = "A string literal!"}> : () -> !llvm.ptr loc(#loc)
          "toy.print"(%0) : (!llvm.ptr) -> () loc(#loc)

       

The next obvious thing to do for the language/compiler would be to implement conditionals (IF/ELIF/ELSE) and loops. I think that there are MLIR dialects to facilitate both (the scf dialect for structured control flow, and the affine dialect for loops).

However, having now finished this function support feature (which I’ve been working on for quite a while), I’m going to take a break from this project. Even though I’ve only been working on this toy compiler project in my spare time, it periodically invades my thoughts. With all that I have to learn for my new job, I’d rather have one less extra thing to think about, so that I don’t feel pulled in too many directions at once.