What is it called when you use AI to build an AI? That’s what I’ve done to learn a bit about how PyTorch works, building a little program that trains a simple network to attempt to recognize Fibonacci-like sequences.
Fibonacci recap
Recall that the Fibonacci series is defined by a recurrence relation in which each term is the sum of the previous two
\begin{equation}\label{eqn:pytorch:10}
F_k = F_{k-1} + F_{k-2}.
\end{equation}
Two specific values are required to start things off, and the usual two such starting values are \( 0, 1 \), or \( 1, 1 \).
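For concreteness, a couple of lines of Python generate the series (a throwaway sketch, not part of the program discussed below):

def fibonacci(n, a=0, b=1):
    # Each term is the sum of the previous two.
    seq = [a, b]
    for _ in range(n - 2):
        seq.append(seq[-1] + seq[-2])
    return seq

print(fibonacci(10))  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]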
I liked the idea of using something deterministic for training data, so I asked Claude to build me a simple NN that used Fibonacci-like sequences
\begin{equation}\label{eqn:pytorch:20}
F_k = \alpha F_{k-1} + \beta F_{k-2},
\end{equation}
as training data, with starting values \( a = F_0, b = F_1 \). It turns out that choices of \( \alpha, \beta \) greater than one make the neural network training blow up, as the values increase quickly and get out of control. It was possible to work around that by renormalizing the input data using a log transformation, and then re-exponentiating it afterwards. However, I decided that such series were less interesting than those closer to the Fibonacci series itself, and disabled the log renormalization by default (a command line option --logtx is available to force it; if used, it is required for both training and inference.)
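A minimal sketch of the training data generation might look like the following; the function names here are mine, not necessarily what the generated program uses:

import numpy as np

def generate_sequence(alpha, beta, a, b, length=12):
    # Fibonacci-like recurrence: F_k = alpha * F_{k-1} + beta * F_{k-2}
    seq = [a, b]
    for _ in range(length - 2):
        seq.append(alpha * seq[-1] + beta * seq[-2])
    return np.array(seq)

# The --logtx workaround compresses the values before training, and
# expands predictions back afterwards:
def log_transform(x):
    return np.log1p(x)     # log(1 + x), well behaved at x = 0

def inverse_log_transform(y):
    return np.expm1(y)     # exp(y) - 1, the inverse map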
Neural Network Architecture
There are a couple of building blocks that the network uses (a short PyTorch demo of both follows the definitions).
- \( \mathrm{ReLU} \) (Rectified Linear Unit) is an activation function in PyTorch
\begin{equation*}
\textrm{ReLU}(x) = \max(0, x).
\end{equation*}
If the input is positive, the output is the input; if the input is negative or zero, the output is zero.
- \( \mathrm{Dropout} \) is a regularization technique that randomly sets some neurons to zero during training to prevent overfitting. During training, for each neuron:
\begin{equation*}
y_i =
\begin{cases}
0 & \text{with probability } p \\
\frac{x_i}{1-p} & \text{with probability } 1-p
\end{cases}
\end{equation*}
where \( p \) is the dropout probability, \( x_i \) is the input to the neuron, and \( y_i \) is the output. With the 10\% dropout probability used here, some inputs are zeroed at random, and whatever survives is scaled up slightly (by \( 1/(1 - 0.1) \approx 1.11 \)).
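Here is the promised demo of both building blocks, a standalone sketch rather than an excerpt from the program:

import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8)

relu = nn.ReLU()
print(relu(x))           # negative entries are clamped to zero

dropout = nn.Dropout(p=0.1)
dropout.train()          # dropout is only active in training mode
print(dropout(relu(x)))  # ~10% of entries zeroed, survivors scaled by 1/0.9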
The model is a feedforward neural network with the following structure:
Input Layer
- Input: \(\mathbf{x} = [f_{k-2}, f_{k-1}] \in \mathbb{R}^2\)
- Output: \(f_k \in \mathbb{R}\)
Hidden Layers
The network has 3 hidden layers with ReLU activations:
\begin{equation*}
\mathbf{h}_1 = \textrm{ReLU}(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1)
\end{equation*}
\begin{equation*}
\mathbf{h}_1′ = \textrm{Dropout}(\mathbf{h}_1, p=0.1)
\end{equation*}
\begin{equation*}
\mathbf{h}_2 = \textrm{ReLU}(\mathbf{W}_2 \mathbf{h}_1′ + \mathbf{b}_2)
\end{equation*}
\begin{equation*}
\mathbf{h}_2′ = \textrm{Dropout}(\mathbf{h}_2, p=0.1)
\end{equation*}
\begin{equation*}
\mathbf{h}_3 = \textrm{ReLU}(\mathbf{W}_3 \mathbf{h}_2′ + \mathbf{b}_3)
\end{equation*}
Output Layer
\begin{equation*}
\hat{f}_k = \mathbf{W}_4 \mathbf{h}_3 + \mathbf{b}_4
\end{equation*}
Where:
- \(\mathbf{W}_1 \in \mathbb{R}^{32 \times 2}, \mathbf{b}_1 \in \mathbb{R}^{32}\)
- \(\mathbf{W}_2, \mathbf{W}_3 \in \mathbb{R}^{32 \times 32}, \mathbf{b}_2, \mathbf{b}_3 \in \mathbb{R}^{32}\)
- \(\mathbf{W}_4 \in \mathbb{R}^{1 \times 32}, \mathbf{b}_4 \in \mathbb{R}^{1}\)
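Counting weights and biases shows how tiny this model is:
\begin{equation*}
(32 \times 2 + 32) + 2 \times (32 \times 32 + 32) + (1 \times 32 + 1) = 96 + 2112 + 33 = 2241
\end{equation*}
parameters in total.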
Essentially, we have some FMA-like operations, but on matrices rather than individual floats, with non-linear filters between the layers. Eons ago, I recall using sigmoid functions as those filters. Modern networks seem to favour simpler non-linearities like ReLU, which are cheaper and more amenable to parallel computation.
The setup for the network layers is pretty simple:

class SequencePredictor(nn.Module):
    def __init__(self, input_size=2, hidden_size=32, output_size=1):
        super(SequencePredictor, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, hidden_size)
        self.fc4 = nn.Linear(hidden_size, output_size)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.1)

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.relu(self.fc2(x))
        x = self.dropout(x)
        x = self.relu(self.fc3(x))
        x = self.fc4(x)
        return x
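A quick sanity check (mine, not part of the original script) confirms the shapes and the parameter count above:

import torch

model = SequencePredictor()
x = torch.randn(4, 2)       # a batch of four (f_{k-2}, f_{k-1}) pairs
print(model(x).shape)       # torch.Size([4, 1])
print(sum(p.numel() for p in model.parameters()))  # 2241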
Training
The training loop doesn’t take much code either:
# Convert to PyTorch tensors
X_tensor = torch.FloatTensor(X_norm)
y_tensor = torch.FloatTensor(y_norm).unsqueeze(1)

# Create model
model = SequencePredictor()
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
print("Training model...")
losses = []
for epoch in range(epochs):
    # Forward pass
    predictions = model(X_tensor)
    loss = criterion(predictions, y_tensor)

    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    losses.append(loss.item())

    if (epoch + 1) % 200 == 0:
        print(f'Epoch [{epoch+1}/{epochs}], Loss: {loss.item():.6f}')
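The excerpt uses X_norm and y_norm without showing how they were produced. One plausible scheme, and only a guess at what the script actually does, is to scale everything by the largest training value:

import numpy as np

# Hypothetical: X holds the (f_{k-2}, f_{k-1}) input pairs, y the targets.
scale = np.abs(y).max()
X_norm = X / scale
y_norm = y / scale
# Model outputs are mapped back to sequence values by multiplying by scale.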
A lot of this is a black box, but unlike more complicated and sophisticated models, it looks pretty feasible to learn what each of these steps is doing. It looks like the optimizer is doing a step of gradient descent in the current neighbourhood.
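In spirit: loss.backward() fills in p.grad for every parameter, and optimizer.step() nudges each parameter against its gradient. Plain gradient descent would look like the sketch below; Adam does the same kind of update, but rescales each step using running estimates of the gradient’s mean and variance:

learning_rate = 0.001
with torch.no_grad():
    for p in model.parameters():
        p -= learning_rate * p.grad   # step downhill along the gradient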
Results
The classic Fibonacci series wasn’t in the training data, and the model only achieves order-of-magnitude predictions for it:
Testing on α=1, β=1, a=1, b=1...
Step   True   Predicted   Absolute Error   Relative Error %
----------------------------------------------------------------
   0    1.0         1.0              0.0                0.0
   1    1.0         1.0              0.0                0.0
   2    2.0         2.7              0.7               35.9
   3    3.0         4.2              1.2               39.3
   4    5.0         6.4              1.4               27.8
   5    8.0         9.5              1.5               19.0
   6   13.0        13.6              0.6                4.9
   7   21.0        18.8              2.2               10.6
   8   34.0        25.0              9.0               26.6
   9   55.0        33.4             21.6               39.3
Total absolute error: 38.30111360549927
Total relative error: 203.36962890625
Of the training data, one particular sequence matches fairly closely:
Testing on α=0.911811, β=0.857173, a=1.45565, b=1.65682...
Step   True   Predicted   Absolute Error   Relative Error %
----------------------------------------------------------------
   0    1.5         1.5              0.0                0.0
   1    1.7         1.7              0.0                0.0
   2    2.8         3.7              0.9               32.8
   3    3.9         5.4              1.5               37.9
   4    6.0         8.2              2.3               38.0
   5    8.8        12.2              3.4               38.5
   6   13.1        17.5              4.3               33.0
   7   19.5        23.7              4.2               21.3
   8   29.0        31.7              2.7                9.2
   9   43.2        43.2              0.1                0.1
Total absolute error: 19.270923376083374
Total relative error: 210.85977474600077
The worst matching sequence from the training data has increasing relative error as the sequence progresses:
Testing on α=0.942149, β=1.03861, a=1.36753, b=1.94342...
Step   True   Predicted   Absolute Error   Relative Error %
----------------------------------------------------------------
   0    1.4         1.4              0.0                0.0
   1    1.9         1.9              0.0                0.0
   2    3.3         3.7              0.5               15.1
   3    5.1         5.6              0.6               11.0
   4    8.2         8.5              0.3                4.2
   5   13.0        12.5              0.4                3.3
   6   20.7        17.8              2.9               14.1
   7   33.0        24.0              9.0               27.3
   8   52.6        32.1             20.4               38.9
   9   83.8        43.7             40.0               47.8
Total absolute error: 74.20500636100769
Total relative error: 161.64879322052002
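For reference, here is how a table like the ones above could be produced, reusing the hypothetical generate_sequence and scale from the earlier sketches. Whether the actual script feeds the model its own predictions or the true previous terms isn’t shown in these excerpts, so feeding it the true terms here is an assumption:

def test_sequence(model, alpha, beta, a, b, steps=10):
    true_seq = generate_sequence(alpha, beta, a, b, length=steps)
    model.eval()                  # disables dropout for inference
    with torch.no_grad():
        for k in range(2, steps):
            x = torch.FloatTensor([[true_seq[k - 2], true_seq[k - 1]]]) / scale
            pred = model(x).item() * scale
            rel = 100 * abs(pred - true_seq[k]) / true_seq[k]
            print(f'{k}  {true_seq[k]:10.1f}  {pred:10.1f}  {rel:6.1f}%')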
To get a better idea of how the model does against the training data, here are a couple of plots, one with good matching initially, and then divergence:
and one with a good average on the whole, but no great specific matching anywhere:
and plots against all the training data:
It would be more interesting to train this against something that isn’t completely monotonic, perhaps some set of characteristic functions (sine, sinc, exp, ...). Since the model ends up extrapolating from the first couple of elements, having training data that starts with wider variation would be useful.