Suppose we have a parametrised morphism (\mathbb{R}^p, f):\mathbb{R}^n \to \mathbb{R}, where \mathbb{R}^p is the parameter object and f:\mathbb{R}^{p+n} \to \mathbb{R} is a morphism in a skeletal category of smooth maps (it represents a loss function of a parameter vector in \mathbb{R}^p and an input in \mathbb{R}^n). Given a set of training examples \{X_1, \ldots, X_m\}, where each X_i \in \mathbb{R}^n, we want to fit a parameter vector \Theta \in \mathbb{R}^p such that the output of f on the training examples is minimized.
We can achieve this by creating an update lens for each training example. This update lens reads the current parameters \Theta and updates them according to the gradient of the loss function f at the example X_i. We start by substituting the training example X_i into f, resulting in a morphism f_i:\mathbb{R}^p \to \mathbb{R} defined by f_i(\Theta) = f(\Theta, X_i). By applying the reverse differential lens functor (ReverseDifferentialLensFunctor)
\mathbf{R}: \mathrm{Smooth} \to \mathrm{Lenses}(\mathrm{Smooth}),
on f_i, we obtain a lens \mathbf{R}(f_i):(\mathbb{R}^p, \mathbb{R}^p) \to (\mathbb{R}^1, \mathbb{R}^1). The get-morphism of this lens reads the current parameters \Theta and computes the loss f_i(\Theta), while the put-morphism \mathbb{R}^p \times \mathbb{R}^1 \to \mathbb{R}^p is given by (\Theta, r) \mapsto r\,J_{f_i}(\Theta), where J_{f_i}(\Theta) \in \mathbb{R}^{1 \times p} is the Jacobian matrix of f_i evaluated at \Theta; that is, it rescales the gradient of f_i by the incoming signal r.
The One-Epoch update lens for the example X_i is then obtained by precomposing an optimizer lens (e.g., gradient descent, Adam, etc.) with the composite \mathbf{R}(f_i) \cdot \varepsilon, where \varepsilon:(\mathbb{R}^1, \mathbb{R}^1) \to (\mathbb{R}^1, \mathbb{R}^0) is the lens defined by:
Get morphism: the identity morphism on \mathbb{R}^1.
Put morphism: the morphism \mathbb{R}^1 \times \mathbb{R}^0 \cong \mathbb{R}^1 \to \mathbb{R}^1 that seeds the reverse pass with the constant -1 (cf. the factor -1 in the put-morphisms displayed in the examples section below).
This lens merely negates the gradient signal.
Suppose we choose the optimizer lens to be the gradient descent optimizer with learning rate \eta = 0.01 > 0. The resulting One-Epoch update lens for the example X_i is then the lens (\mathbb{R}^p, \mathbb{R}^p) \to (\mathbb{R}^1, \mathbb{R}^0) whose get-morphism is \Theta \mapsto f_i(\Theta) and whose put-morphism is \Theta \mapsto \Theta - \eta\, J_{f_i}(\Theta)^T. Now, we can start with a random parameter vector \Theta_0 \in \mathbb{R}^p and apply the put-morphism of the One-Epoch update lens for X_1 to obtain a new parameter vector \Theta_1, then use \Theta_1 and the One-Epoch update lens for X_2 to obtain \Theta_2, and so on. After going through all training examples, we have completed one epoch of training. To perform multiple epochs of training, we simply repeat the process.
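In GAP, such an epoch-by-epoch loop might look like the following sketch (the variable update_lenses is a hypothetical list holding the One-Epoch update lens of each training example; the PutMorphism accessor is shown in the examples section below):

gap> theta := [ 0, 0 ];;  # initial parameter vector (here p = 2)
gap> for epoch in [ 1 .. 100 ] do
>        for lens in update_lenses do
>            theta := PutMorphism( lens )( theta );  # one update per training example
>        od;
>    od;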
For example, suppose we start with the parametrised morphism (\mathbb{R}^2, f):\mathbb{R}^2 \to \mathbb{R} where f:\mathbb{R}^{2+2} \to \mathbb{R} is defined by f(\theta_1, \theta_2, x_1, x_2) = (x_1-\theta_1)^2 + (x_2-\theta_2)^2, where \Theta := (\theta_1, \theta_2) \in \mathbb{R}^2 represents the parameters and x = (x_1, x_2) \in \mathbb{R}^2 is the input. Given training examples X_1 = (1,2) and X_2 = (3,4), the morphism f_1:\mathbb{R}^2 \to \mathbb{R} is defined by f_1(\theta_1, \theta_2) = (1 - \theta_1)^2 + (2 - \theta_2)^2 with Jacobian matrix J_{f_1}(\theta_1, \theta_2) = [2(\theta_1 - 1), 2(\theta_2 - 2)] \in \mathbb{R}^{1 \times 2}. Thus, the One-Epoch update lens for X_1 has put-morphism (\theta_1, \theta_2) \mapsto (0.98\,\theta_1 + 0.02,\ 0.98\,\theta_2 + 0.04), and similarly (with f_2(\theta_1, \theta_2) = (3 - \theta_1)^2 + (4 - \theta_2)^2) the One-Epoch update lens for X_2 has put-morphism (\theta_1, \theta_2) \mapsto (0.98\,\theta_1 + 0.06,\ 0.98\,\theta_2 + 0.08). Suppose we start with the parameter vector \Theta = (0,0). Then:
After applying the update lens for X_1: \Theta_1 = (0.98 \cdot 0 + 0.02, 0.98 \cdot 0 + 0.04) = (0.02, 0.04).
After applying the update lens for X_2: \Theta_2 = (0.98 \cdot 0.02 + 0.06, 0.98 \cdot 0.04 + 0.08) = (0.0796, 0.1192).
Thus, after one epoch of training, the updated parameters are \Theta_2 = (0.0796, 0.1192). Repeating this process for multiple epochs further refines the parameters. Eventually, we expect them to converge to \Theta = (2, 3), which minimizes the average loss over the training examples: the point minimizing the sum of squared distances to (1, 2) and (3, 4) is their midpoint (2, 3). See the examples section for an implementation of this process in GAP.
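Independently of the package, this arithmetic can be checked directly in plain GAP by writing out the two put-morphisms as ordinary functions:

gap> put_1 := theta -> [ 0.98 * theta[1] + 0.02, 0.98 * theta[2] + 0.04 ];;
gap> put_2 := theta -> [ 0.98 * theta[1] + 0.06, 0.98 * theta[2] + 0.08 ];;
gap> put_2( put_1( [ 0, 0 ] ) );  # one epoch starting from (0,0)
[ 0.0796, 0.1192 ]
gap> theta := [ 0, 0 ];;
gap> for i in [ 1 .. 1000 ] do theta := put_2( put_1( theta ) ); od;
gap> theta;  # approaches the fixed point near (2, 3)
[ 2.0101, 3.0101 ]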
Given a parametrised (loss) morphism (\mathbb{R}^p, f):\mathbb{R}^n \to \mathbb{R} and a set of training examples \{X_1, \ldots, X_m\} with each X_i \in \mathbb{R}^n, it may be beneficial to use mini-batches during training when the number of training examples m is large. Given a positive integer batch_size, the loss morphism is first batched using Batchify: we create a new parametrised morphism (\mathbb{R}^p, f_{batch}):\mathbb{R}^{batch\_size \cdot n} \to \mathbb{R} where f_{batch}(\Theta, X_{i_1}, \ldots, X_{i_{batch\_size}}) = \frac{1}{batch\_size} \sum_{j=1}^{batch\_size} f(\Theta, X_{i_j}). We then divide the training examples into mini-batches of size batch_size (padding the list with randomly chosen examples if necessary to make its length divisible by batch_size) and treat each mini-batch as a single training example. Now, we can repeat the training process described above using the batched loss morphism and the new training examples. For example, if the parametrised morphism is (\mathbb{R}^2, f):\mathbb{R}^2 \to \mathbb{R} where f(\theta_1, \theta_2, x_1, x_2) = (x_1-\theta_1)^2 + (x_2-\theta_2)^2, and the training examples are [[1,2], [3,4], [5,6], [7,8], [9,10]], then for batch_size = 2 the batched loss morphism is (\mathbb{R}^2, f_{batch}):\mathbb{R}^4 \to \mathbb{R} where f_{batch}(\theta_1, \theta_2, x_1, x_2, x_3, x_4) = \frac{1}{2} \left( (x_1-\theta_1)^2 + (x_2-\theta_2)^2 + (x_3-\theta_1)^2 + (x_4-\theta_2)^2 \right) (see the Batchify operation). Since the number of training examples is not divisible by batch_size, we pad the list with a randomly chosen example (say, [1,2]). The new training examples are then [[1,2,3,4], [5,6,7,8], [9,10,1,2]].
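The padding-and-chunking step can be sketched in plain GAP as follows (the variable names are illustrative; the package performs this step internally):

gap> examples := [ [ 1, 2 ], [ 3, 4 ], [ 5, 6 ], [ 7, 8 ], [ 9, 10 ] ];;
gap> batch_size := 2;;
gap> padded := ShallowCopy( examples );;
gap> while Length( padded ) mod batch_size <> 0 do
>        Add( padded, Random( examples ) );  # pad with a randomly chosen example
>    od;
gap> batched_examples := List( [ 1 .. Length( padded ) / batch_size ],
>        i -> Concatenation( padded{ [ (i-1) * batch_size + 1 .. i * batch_size ] } ) );;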
‣ OneEpochUpdateLens( parametrised_morphism, optimizer, training_examples, batch_size ) ( operation )
Returns: a morphism in a category of lenses (the epoch update lens)
Create an update lens for one epoch of training.
The argument parametrised_morphism must be a morphism in a category of parametrised morphisms whose target has rank 1 (a scalar loss).
The argument optimizer is a function which takes the number of parameters p and returns an optimizer lens in the category of lenses over Smooth. Typical examples are Lenses.GradientDescentOptimizer, Lenses.AdamOptimizer, etc.
The list training_examples must contain at least one example; each example is a dense list representing a vector in \mathbb{R}^n.
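For example, with f and optimizer constructed as in the examples section below, the One-Epoch update lens over the two training examples [1,2] and [3,4] with batch size 1 is obtained as follows:

gap> update_lens := OneEpochUpdateLens( f, optimizer, [ [ 1, 2 ], [ 3, 4 ] ], 1 );
(ℝ^2, ℝ^2) -> (ℝ^1, ℝ^0) defined by:

Get Morphism:
------------
ℝ^2 -> ℝ^1

Put Morphism:
------------
ℝ^2 -> ℝ^2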
‣ OneEpochUpdateLens( parametrised_morphism, optimizer, training_examples_path, batch_size ) ( operation )
Returns: a morphism in a category of lenses (the epoch update lens)
Same as OneEpochUpdateLens, but reads the training examples from a file. The file is evaluated using EvalString and is expected to contain a GAP expression evaluating to a dense list of examples.
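A sketch (the file name examples.g is hypothetical): one might write the examples to a file and pass its path:

gap> PrintTo( "examples.g", "[ [ 1, 2 ], [ 3, 4 ] ]" );  # file content is a GAP expression
gap> update_lens := OneEpochUpdateLens( f, optimizer, "examples.g", 1 );;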
‣ Fit( one_epoch_update_lens, nr_epochs, initial_weights ) ( operation )
Returns: a list of final weights
Perform nr_epochs epochs of training using the given one_epoch_update_lens and initial weights initial_weights.
The lens one_epoch_update_lens must have get-morphism \mathbb{R}^p \to \mathbb{R}^1 and put-morphism \mathbb{R}^p \to \mathbb{R}^p for the same p as the length of initial_weights. The option verbose controls whether to print the loss at each epoch.
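For example, with update_lens as in the examples section below, ten epochs starting from (0,0), with the per-epoch loss printing suppressed:

gap> theta := Fit( update_lens, 10, [ 0, 0 ] : verbose := false );
[ 0.668142, 1.00053 ]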
gap> Smooth := SkeletalCategoryOfSmoothMaps( );
SkeletalSmoothMaps
gap> Para := CategoryOfParametrisedMorphisms( Smooth );
CategoryOfParametrisedMorphisms( SkeletalSmoothMaps )
gap> Lenses := CategoryOfLenses( Smooth );
CategoryOfLenses( SkeletalSmoothMaps )
gap> D := [ Smooth.1, Smooth.1, Smooth.1, Smooth.1 ];
[ ℝ^1, ℝ^1, ℝ^1, ℝ^1 ]
gap> p1 := ProjectionInFactorOfDirectProduct( Smooth, D, 1 );
ℝ^4 -> ℝ^1
gap> p2 := ProjectionInFactorOfDirectProduct( Smooth, D, 2 );
ℝ^4 -> ℝ^1
gap> p3 := ProjectionInFactorOfDirectProduct( Smooth, D, 3 );
ℝ^4 -> ℝ^1
gap> p4 := ProjectionInFactorOfDirectProduct( Smooth, D, 4 );
ℝ^4 -> ℝ^1
gap> f := PreCompose( (p3 - p1), Smooth.Power(2) )
>      + PreCompose( (p4 - p2), Smooth.Power(2) );
ℝ^4 -> ℝ^1
gap> dummy_input := CreateContextualVariables( [ "theta_1", "theta_2", "x1", "x2" ] );
[ theta_1, theta_2, x1, x2 ]
gap> Display( f : dummy_input := dummy_input );
ℝ^4 -> ℝ^1
‣ (x1 + (- theta_1)) ^ 2 + (x2 + (- theta_2)) ^ 2
gap> f := MorphismConstructor( Para, Para.2, [ Smooth.2, f ], Para.1 );
ℝ^2 -> ℝ^1 defined by:

Underlying Object:
-----------------
ℝ^2

Underlying Morphism:
-------------------
ℝ^4 -> ℝ^1
gap> Display( f : dummy_input := dummy_input );
ℝ^2 -> ℝ^1 defined by:

Underlying Object:
-----------------
ℝ^2

Underlying Morphism:
-------------------
ℝ^4 -> ℝ^1
‣ (x1 + (- theta_1)) ^ 2 + (x2 + (- theta_2)) ^ 2
gap> optimizer := Lenses.GradientDescentOptimizer( :learning_rate := 0.01 );
function( n ) ... end
gap> dummy_input := CreateContextualVariables( [ "theta_1", "theta_2", "g1", "g2" ] );
[ theta_1, theta_2, g1, g2 ]
gap> Display( optimizer( 2 ) : dummy_input := dummy_input );
(ℝ^2, ℝ^2) -> (ℝ^2, ℝ^2) defined by:

Get Morphism:
------------
ℝ^2 -> ℝ^2
‣ theta_1
‣ theta_2

Put Morphism:
------------
ℝ^4 -> ℝ^2
‣ theta_1 + 0.01 * g1
‣ theta_2 + 0.01 * g2
gap> update_lens_1 := OneEpochUpdateLens( f, optimizer, [ [ 1, 2 ] ], 1 );
(ℝ^2, ℝ^2) -> (ℝ^1, ℝ^0) defined by:

Get Morphism:
------------
ℝ^2 -> ℝ^1

Put Morphism:
------------
ℝ^2 -> ℝ^2
gap> dummy_input := CreateContextualVariables( [ "theta_1", "theta_2" ] );
[ theta_1, theta_2 ]
gap> Display( update_lens_1 : dummy_input := dummy_input );
(ℝ^2, ℝ^2) -> (ℝ^1, ℝ^0) defined by:

Get Morphism:
------------
ℝ^2 -> ℝ^1
‣ ((1 + (- theta_1)) ^ 2 + (2 + (- theta_2)) ^ 2) / 1 / 1

Put Morphism:
------------
ℝ^2 -> ℝ^2
‣ theta_1 + 0.01 * (-1 * (0 + 0 + (1 * ((2 * (1 + (- theta_1)) ^ 1 * -1 + 0) * 1 + 0 + 0 + 0) * 1 + 0 + 0 + 0) * 1 + 0))
‣ theta_2 + 0.01 * (-1 * (0 + 0 + 0 + (0 + 1 * (0 + (0 + 2 * (2 + (- theta_2)) ^ 1 * -1) * 1 + 0 + 0) * 1 + 0 + 0) * 1))
gap> update_lens_1 := SimplifyMorphism( update_lens_1, infinity );
(ℝ^2, ℝ^2) -> (ℝ^1, ℝ^0) defined by:

Get Morphism:
------------
ℝ^2 -> ℝ^1

Put Morphism:
------------
ℝ^2 -> ℝ^2
gap> Display( update_lens_1 : dummy_input := dummy_input );
(ℝ^2, ℝ^2) -> (ℝ^1, ℝ^0) defined by:

Get Morphism:
------------
ℝ^2 -> ℝ^1
‣ (theta_1 - 1) ^ 2 + (theta_2 - 2) ^ 2

Put Morphism:
------------
ℝ^2 -> ℝ^2
‣ 0.98 * theta_1 + 0.02
‣ 0.98 * theta_2 + 0.04
gap> update_lens_2 := OneEpochUpdateLens( f, optimizer, [ [ 3, 4 ] ], 1 );
(ℝ^2, ℝ^2) -> (ℝ^1, ℝ^0) defined by:

Get Morphism:
------------
ℝ^2 -> ℝ^1

Put Morphism:
------------
ℝ^2 -> ℝ^2
gap> Display( update_lens_2 : dummy_input := dummy_input );
(ℝ^2, ℝ^2) -> (ℝ^1, ℝ^0) defined by:

Get Morphism:
------------
ℝ^2 -> ℝ^1
‣ ((3 + (- theta_1)) ^ 2 + (4 + (- theta_2)) ^ 2) / 1 / 1

Put Morphism:
------------
ℝ^2 -> ℝ^2
‣ theta_1 + 0.01 * (-1 * (0 + 0 + (1 * ((2 * (3 + (- theta_1)) ^ 1 * -1 + 0) * 1 + 0 + 0 + 0) * 1 + 0 + 0 + 0) * 1 + 0))
‣ theta_2 + 0.01 * (-1 * (0 + 0 + 0 + (0 + 1 * (0 + (0 + 2 * (4 + (- theta_2)) ^ 1 * -1) * 1 + 0 + 0) * 1 + 0 + 0) * 1))
gap> update_lens_2 := SimplifyMorphism( update_lens_2, infinity );
(ℝ^2, ℝ^2) -> (ℝ^1, ℝ^0) defined by:

Get Morphism:
------------
ℝ^2 -> ℝ^1

Put Morphism:
------------
ℝ^2 -> ℝ^2
gap> Display( update_lens_2 : dummy_input := dummy_input );
(ℝ^2, ℝ^2) -> (ℝ^1, ℝ^0) defined by:

Get Morphism:
------------
ℝ^2 -> ℝ^1
‣ (theta_1 - 3) ^ 2 + (theta_2 - 4) ^ 2

Put Morphism:
------------
ℝ^2 -> ℝ^2
‣ 0.98 * theta_1 + 0.06
‣ 0.98 * theta_2 + 0.08
gap> update_lens := OneEpochUpdateLens( f, optimizer, [ [ 1, 2 ], [ 3, 4 ] ], 1 );
(ℝ^2, ℝ^2) -> (ℝ^1, ℝ^0) defined by:

Get Morphism:
------------
ℝ^2 -> ℝ^1

Put Morphism:
------------
ℝ^2 -> ℝ^2
gap> Display( update_lens : dummy_input := dummy_input );
(ℝ^2, ℝ^2) -> (ℝ^1, ℝ^0) defined by:

Get Morphism:
------------
ℝ^2 -> ℝ^1
‣ ( ((1 + (- theta_1)) ^ 2 + (2 + (- theta_2)) ^ 2) / 1 + ((3 + (- theta_1)) ^ 2 + (4 + (- theta_2)) ^ 2) / 1 ) / 2

Put Morphism:
------------
ℝ^2 -> ℝ^2
‣ theta_1 + 0.01 * (-1 * (0 + 0 + (1 * ((2 * (1 + (- theta_1)) ^ 1 * -1 + 0) * 1 + 0 + 0 + 0) * 1 + 0 + 0 + 0) * 1 + 0)) + 0.01 * (-1 * (0 + 0 + (1 * ((2 * (3 + (- (theta_1 + 0.01 * (-1 * (0 + 0 + (1 * ((2 * (1 + (- theta_1)) ^ 1 * -1 + 0) * 1 + 0 + 0 + 0) * 1 + 0 + 0 + 0) * 1 + 0))))) ^ 1 * -1 + 0) * 1 + 0 + 0 + 0) * 1 + 0 + 0 + 0) * 1 + 0))
‣ theta_2 + 0.01 * (-1 * (0 + 0 + 0 + (0 + 1 * (0 + (0 + 2 * (2 + (- theta_2)) ^ 1 * -1) * 1 + 0 + 0) * 1 + 0 + 0) * 1)) + 0.01 * (-1 * (0 + 0 + 0 + (0 + 1 * (0 + (0 + 2 * (4 + (- (theta_2 + 0.01 * (-1 * (0 + 0 + 0 + (0 + 1 * (0 + (0 + 2 * (2 + (- theta_2)) ^ 1 * -1) * 1 + 0 + 0) * 1 + 0 + 0) * 1))))) ^ 1 * -1) * 1 + 0 + 0) * 1 + 0 + 0) * 1))
gap> update_lens := SimplifyMorphism( update_lens, infinity );
(ℝ^2, ℝ^2) -> (ℝ^1, ℝ^0) defined by:

Get Morphism:
------------
ℝ^2 -> ℝ^1

Put Morphism:
------------
ℝ^2 -> ℝ^2
gap> Display( update_lens : dummy_input := dummy_input );
(ℝ^2, ℝ^2) -> (ℝ^1, ℝ^0) defined by:

Get Morphism:
------------
ℝ^2 -> ℝ^1
‣ theta_1 ^ 2 - 4 * theta_1 + theta_2 ^ 2 - 6 * theta_2 + 15

Put Morphism:
------------
ℝ^2 -> ℝ^2
‣ 0.9604 * theta_1 + 0.0796
‣ 0.9604 * theta_2 + 0.1192
gap> "If we used only update_lens_1, the parameters converge to (1,2)";;
gap> theta := [ 0, 0 ];;
gap> for i in [ 1 .. 1000 ] do theta := PutMorphism( update_lens_1 )( theta ); od;
gap> theta;
[ 1., 2. ]
gap> "If we used only update_lens_2, the parameters converge to (3,4)";;
gap> theta := [ 0, 0 ];;
gap> for i in [ 1 .. 1000 ] do theta := PutMorphism( update_lens_2 )( theta ); od;
gap> theta;
[ 3., 4. ]
gap> "If we use the combined update_lens, the parameters converge to (2,3)";;
gap> theta := [ 0, 0 ];;
gap> for i in [ 1 .. 1000 ] do theta := PutMorphism( update_lens )( theta ); od;
gap> theta;
[ 2.0101, 3.0101 ]
gap> "Instead of manually applying the put-morphism, we can use the Fit operation:";;
gap> "For example, to fit theta = (0,0) using 10 epochs:";;
gap> theta := [ 0, 0 ];;
gap> theta := Fit( update_lens, 10, theta );
Epoch 0/10 - loss = 15
Epoch 1/10 - loss = 13.9869448
Epoch 2/10 - loss = 13.052687681213568
Epoch 3/10 - loss = 12.19110535502379
Epoch 4/10 - loss = 11.39655013449986
Epoch 5/10 - loss = 10.663813003077919
Epoch 6/10 - loss = 9.9880895506637923
Epoch 7/10 - loss = 9.3649485545394704
Epoch 8/10 - loss = 8.790302999738083
Epoch 9/10 - loss = 8.2603833494932317
Epoch 10/10 - loss = 7.7717128910720641
[ 0.668142, 1.00053 ]
In this example, let us find a solution of the equation \theta^3-\theta^2-4=0. We can reframe this as a minimization problem by considering the parametrised morphism (\mathbb{R}^1, f):\mathbb{R}^0 \to \mathbb{R}^1 where f(\theta) = (\theta^3-\theta^2-4)^2: a parameter \theta is a root of the equation exactly when the loss f(\theta) attains its minimum value 0.
gap> Smooth := SkeletalCategoryOfSmoothMaps( );
SkeletalSmoothMaps
gap> Para := CategoryOfParametrisedMorphisms( Smooth );
CategoryOfParametrisedMorphisms( SkeletalSmoothMaps )
gap> Lenses := CategoryOfLenses( Smooth );
CategoryOfLenses( SkeletalSmoothMaps )
gap> f := Smooth.Power( 3 ) - Smooth.Power( 2 ) - Smooth.Constant( [ 4 ] );
ℝ^1 -> ℝ^1
gap> Display( f );
ℝ^1 -> ℝ^1
‣ x1 ^ 3 + (- x1 ^ 2) + - 4
gap> f := PreCompose( f, Smooth.Power( 2 ) );
ℝ^1 -> ℝ^1
gap> Display( f );
ℝ^1 -> ℝ^1
‣ (x1 ^ 3 + (- x1 ^ 2) + - 4) ^ 2
gap> f := MorphismConstructor( Para, Para.0, [ Smooth.1, f ], Para.1 );
ℝ^0 -> ℝ^1 defined by:

Underlying Object:
-----------------
ℝ^1

Underlying Morphism:
-------------------
ℝ^1 -> ℝ^1
gap> dummy_input := CreateContextualVariables( [ "theta" ] );
[ theta ]
gap> Display( f : dummy_input := dummy_input );
ℝ^0 -> ℝ^1 defined by:

Underlying Object:
-----------------
ℝ^1

Underlying Morphism:
-------------------
ℝ^1 -> ℝ^1
‣ (theta ^ 3 + (- theta ^ 2) + -4) ^ 2
gap> optimizer := Lenses.AdamOptimizer( :learning_rate := 0.01,
>      beta1 := 0.9, beta2 := 0.999, epsilon := 1.e-7 );
function( n ) ... end
gap> dummy_input := CreateContextualVariables( [ "t", "m", "v", "theta", "g" ] );
[ t, m, v, theta, g ]
gap> Display( optimizer( 1 ) : dummy_input := dummy_input );
(ℝ^4, ℝ^4) -> (ℝ^1, ℝ^1) defined by:

Get Morphism:
------------
ℝ^4 -> ℝ^1
‣ theta

Put Morphism:
------------
ℝ^5 -> ℝ^4
‣ t + 1
‣ 0.9 * m + 0.1 * g
‣ 0.999 * v + 0.001 * g ^ 2
‣ theta + 0.01 / (1 - 0.999 ^ t) * ((0.9 * m + 0.1 * g) / (1.e-07 + Sqrt( (0.999 * v + 0.001 * g ^ 2) / (1 - 0.999 ^ t) )))
gap> update_lens := OneEpochUpdateLens( f, optimizer, [ [ ] ], 1 );
(ℝ^4, ℝ^4) -> (ℝ^1, ℝ^0) defined by:

Get Morphism:
------------
ℝ^4 -> ℝ^1

Put Morphism:
------------
ℝ^4 -> ℝ^4
gap> dummy_input := CreateContextualVariables( [ "t", "m", "v", "theta" ] );
[ t, m, v, theta ]
gap> Display( update_lens : dummy_input := dummy_input );
(ℝ^4, ℝ^4) -> (ℝ^1, ℝ^0) defined by:

Get Morphism:
------------
ℝ^4 -> ℝ^1
‣ (theta ^ 3 + (- theta ^ 2) + -4) ^ 2 / 1 / 1

Put Morphism:
------------
ℝ^4 -> ℝ^4
‣ t + 1
‣ 0.9 * m + 0.1 * (-1 * (1 * (2 * (theta ^ 3 + (- theta ^ 2) + -4) ^ 1 * (3 * theta ^ 2 + (- 2 * theta ^ 1)) * 1) * 1 * 1))
‣ 0.999 * v + 0.001 * (-1 * (1 * (2 * (theta ^ 3 + (- theta ^ 2) + -4) ^ 1 * (3 * theta ^ 2 + (- 2 * theta ^ 1)) * 1) * 1 * 1)) ^ 2
‣ theta + 0.01 / (1 - 0.999 ^ t) * ((0.9 * m + 0.1 * (-1 * (1 * (2 * (theta ^ 3 + (- theta ^ 2) + -4) ^ 1 * (3 * theta ^ 2 + (- 2 * theta ^ 1)) * 1) * 1 * 1))) / (1.e-07 + Sqrt( (0.999 * v + 0.001 * (-1 * (1 * (2 * (theta ^ 3 + (- theta ^ 2) + -4) ^ 1 * (3 * theta ^ 2 + (- 2 * theta ^ 1)) * 1) * 1 * 1)) ^ 2) / (1 - 0.999 ^ t) )))
gap> Fit( update_lens, 10000, [ 1, 0, 0, 8 ] : verbose := false );
[ 10001, 4.11498e-13, 1463.45, 2. ]
gap> UnderlyingMorphism( f )( [ 2. ] );
[ 0. ]