Training neural networks to perform various tasks is an essential operation in many machine learning applications. The main workhorses for training, especially in deep learning, are SGD and Adam. Both are first order optimization methods. In other words, they compute the gradient, which indicates (more or less) the direction of the desired solution, and then take a step in that direction, where the step size is normally called the learning rate.
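As a toy illustration of such a first-order step (the loss, values and names below are made up for this sketch, not taken from any particular library or model):

import torch

# toy quadratic loss with its minimum at the origin
theta = torch.tensor([1.0, -2.0], requires_grad=True)
lr = 0.1                              # the learning rate (step size)

loss = (theta ** 2).sum()
loss.backward()                       # compute the gradient
with torch.no_grad():
    theta -= lr * theta.grad          # step against the gradient, scaled by lr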
Second order optimization methods use not only the gradient but also the curvature (Hessian) information to update the solution. For this reason, they normally perform better than first order methods (faster convergence). However, there is a big drawback: calculating the Hessian is computationally expensive.
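To make the role of curvature concrete, here is a toy sketch of a single Newton step on a made-up two-parameter quadratic loss (purely illustrative); forming the full Hessian like this is exactly what becomes infeasible when the number of parameters is large:

import torch

def f(theta):                                        # toy loss with minimum at (1, -2)
    return (theta[0] - 1.0) ** 2 + 10.0 * (theta[1] + 2.0) ** 2

theta = torch.zeros(2)
g = torch.autograd.functional.jacobian(f, theta)     # gradient
H = torch.autograd.functional.hessian(f, theta)      # full Hessian: expensive for large models
theta = theta - torch.linalg.solve(H, g)             # Newton step: theta - H^{-1} g
print(theta)                                         # reaches the minimum (1, -2) in one step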
We can combine the best of both first order and second order optimization techniques: use only the gradient, but approximate the Hessian from successive gradients. This is essentially what is done in the BFGS (Broyden-Fletcher-Goldfarb-Shanno) method. There is a whole family of such methods, generally called quasi-Newton methods.
The BFGS algorithm can be slightly modified to work in situations where the number of unknowns is too large to fit the Hessian in memory; this is the well known limited-memory BFGS, or LBFGS. While BFGS maintains (and has to store) an approximation to the full Hessian, LBFGS only stores a small set of vectors and builds a reduced-rank approximation from them. Therefore LBFGS needs much less memory to operate. In the calibration of radio astronomical data (SAGECal), we use LBFGS very heavily.
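For the curious, here is a textbook-style sketch of the LBFGS two-loop recursion, which applies the inverse Hessian approximation to a gradient using only a short history of parameter and gradient differences. This is only an illustration of the idea, not the actual SAGECal or PyTorch implementation:

import torch

def lbfgs_direction(grad, s_list, y_list):
    """Two-loop recursion: approximate H^{-1} @ grad from the stored pairs
    s_i = x_{i+1} - x_i (parameter change) and y_i = g_{i+1} - g_i (gradient
    change), without ever forming the Hessian itself."""
    q = grad.detach().clone()
    rho = [1.0 / torch.dot(y, s) for s, y in zip(s_list, y_list)]
    alpha = [None] * len(s_list)
    # first loop: newest pair to oldest
    for i in reversed(range(len(s_list))):
        alpha[i] = rho[i] * torch.dot(s_list[i], q)
        q -= alpha[i] * y_list[i]
    # scale by an estimate of the inverse Hessian magnitude
    gamma = torch.dot(s_list[-1], y_list[-1]) / torch.dot(y_list[-1], y_list[-1])
    r = gamma * q
    # second loop: oldest pair to newest
    for i in range(len(s_list)):
        beta = rho[i] * torch.dot(y_list[i], r)
        r += s_list[i] * (alpha[i] - beta)
    return r            # the search direction is -r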
On the other hand, stochastic algorithms read only a small amount of data at a time; for example, only a small batch of images is used per iteration when training a neural network. We have recently modified our LBFGS algorithm to work in a stochastic mode as well. The stochastic LBFGS algorithm can therefore handle a large number of unknowns and also a large amount of input (training) data, by reading a small batch at every iteration. Both Adam and SGD are also stochastic algorithms. Ideally, stochastic operation is the best situation we can aim for, but there are caveats: we should be careful to check that it converges!
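As a sketch of what "stochastic" means in practice, here is a minimal mini-batch loop with plain SGD on a made-up dataset (all names and sizes here are illustrative); the stochastic LBFGS described below plugs into the same kind of loop, except that optimizer.step() takes a closure:

import torch
from torch.utils.data import TensorDataset, DataLoader

# toy dataset: 1000 samples, read 32 at a time (one mini-batch per iteration)
X = torch.randn(1000, 10)
y = torch.randn(1000, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

for xb, yb in loader:                 # each step sees only one small batch
    opt.zero_grad()
    loss = loss_fn(model(xb), yb)
    loss.backward()
    opt.step()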
Various groups have developed and improved the LBFGS algorithm in PyTorch, and what we use is not the only working version. Each has its own pros and cons, and we do not go into detail here. What we present are the essential details a new user needs in order to use our version of LBFGS for machine learning. We also present two examples, one with full-batch training and one with stochastic (mini-batch) training.
In PyTorch, the LBFGS routine needs as input a method that calculates the training error and the gradient, generally called the closure. This is the single most important piece of Python code needed to run LBFGS in PyTorch. Here is the example code from the PyTorch documentation, with a small modification.
for input, target in dataset:
    def closure():
        if torch.is_grad_enabled():
            optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)
        if loss.requires_grad:
            loss.backward()
        return loss
    optimizer.step(closure)

Note that before we call optimizer.zero_grad(), we check whether gradient calculation is enabled (and needed). Similarly, before we call loss.backward(), we check whether this is possible via loss.requires_grad. This is needed because not every call to closure() inside LBFGS requires a gradient calculation (specifically, during the line search). In contrast, for gradient descent methods these modifications are not necessary, because the gradient is always used whenever closure() is called.
One can get confused about how variables are passed to closure(). One way of understanding this is by considering the similarity with a MATLAB function handle. Quoting the MATLAB documentation: any non-input variables pass their values to the function at the time the (MATLAB) function handle is created. The same applies to a closure(). Here is a lengthier example, where more than one closure() is created.
for i, (data1, data2, data3) in enumerate(zip(trainloader1, trainloader2, trainloader3), 0):
    # get the inputs
    inputs1, labels1 = data1
    inputs2, labels2 = data2
    inputs3, labels3 = data3
    # wrap them in Variable
    inputs1, labels1 = Variable(inputs1), Variable(labels1)
    inputs2, labels2 = Variable(inputs2), Variable(labels2)
    inputs3, labels3 = Variable(inputs3), Variable(labels3)

    def closure1():
        if torch.is_grad_enabled():
            optimizer1.zero_grad()
        outputs = net1(inputs1)
        loss = criterion1(outputs, labels1)
        if loss.requires_grad:
            loss.backward()
        return loss

    def closure2():
        if torch.is_grad_enabled():
            optimizer2.zero_grad()
        outputs = net2(inputs2)
        loss = criterion2(outputs, labels2)
        if loss.requires_grad:
            loss.backward()
        return loss

    def closure3():
        if torch.is_grad_enabled():
            optimizer3.zero_grad()
        outputs = net3(inputs3)
        loss = criterion3(outputs, labels3)
        if loss.requires_grad:
            loss.backward()
        return loss

    optimizer1.step(closure1)
    optimizer2.step(closure2)
    optimizer3.step(closure3)
For the first example, we consider a Rule 110 problem, where we use the full dataset for training. The example Python script given here can be used directly with the modified LBFGS, once the closure() is defined properly. We create the LBFGS optimizer as

optimizer = torch.optim.LBFGS([x], history_size=7, max_iter=10, line_search_fn=True, batch_mode=False)

Note that batch_mode is set to False here because we are using the full batch for training.
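For readers who do not want to download the full script, here is a minimal, self-contained sketch of full-batch training on the Rule 110 truth table. It uses the stock torch.optim.LBFGS so that it runs without our modified optimizer (with our version you would instead pass batch_mode=False and line_search_fn=True as shown above); the network size and epoch count are arbitrary choices for this sketch:

import torch

# Rule 110 truth table: input is (left, center, right), output is the next cell state
X = torch.tensor([[0, 0, 0], [0, 0, 1], [0, 1, 0], [0, 1, 1],
                  [1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=torch.float32)
y = torch.tensor([0, 1, 1, 1, 0, 1, 1, 0], dtype=torch.float32).unsqueeze(1)

model = torch.nn.Sequential(torch.nn.Linear(3, 8), torch.nn.ReLU(),
                            torch.nn.Linear(8, 1), torch.nn.Sigmoid())
loss_fn = torch.nn.BCELoss()
optimizer = torch.optim.LBFGS(model.parameters(), history_size=7, max_iter=10,
                              line_search_fn='strong_wolfe')

def closure():
    if torch.is_grad_enabled():
        optimizer.zero_grad()
    loss = loss_fn(model(X), y)     # full batch: all 8 patterns at once
    if loss.requires_grad:
        loss.backward()
    return loss

for epoch in range(20):
    loss = optimizer.step(closure)
    print(epoch, float(loss))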
In figure 1, the x-axis shows the total CPU time taken and the y-axis shows the training error. Even though LBFGS uses more CPU time per iteration, it converges faster than SGD or Adam.
To compare the performance of the stochastic LBFGS algorithm, we use a simple convolutional neural network model from the PyTorch examples (get the code from here), with the CIFAR-10 dataset. We use a batch size of 32 for training, and the LBFGS optimizer is created as

optimizer = torch.optim.LBFGS(net.parameters(), history_size=10, max_iter=4, line_search_fn=True, batch_mode=True)

Note that now batch_mode is set to True for the stochastic mode of operation.
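Putting it together, a training loop for this setup could look roughly like the sketch below. Here net and trainloader (with batch_size=32) are assumed to come from the PyTorch CIFAR-10 tutorial code linked above, the number of epochs is arbitrary, and batch_mode / line_search_fn=True are options of our modified optimizer, not of the stock torch.optim.LBFGS:

import torch

criterion = torch.nn.CrossEntropyLoss()
# options as above: batch_mode=True enables the stochastic (mini-batch) mode
optimizer = torch.optim.LBFGS(net.parameters(), history_size=10, max_iter=4,
                              line_search_fn=True, batch_mode=True)

for epoch in range(2):                      # number of epochs is arbitrary here
    for inputs, labels in trainloader:      # one mini-batch (32 images) per step
        def closure():
            if torch.is_grad_enabled():
                optimizer.zero_grad()
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            if loss.requires_grad:
                loss.backward()
            return loss
        optimizer.step(closure)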
In figure 2 we show the training error (cross entropy loss) for the CIFAR-10 dataset. Even though LBFGS gives a lower training error, the verification error for this example is higher for LBFGS than for SGD and Adam. More work in this area needs to be done, and this is very much work in progress.
It is possible to train KAN networks (pykan) using the bound-constrained version of LBFGS, lbfgsb.py. Run the PDE example to see how it works.
What we have presented is a very simple introduction to using our improved version of LBFGS. Work is continuously being done to improve its performance. For example, try changing the activation from ReLU to ELU and see what happens! A promising application of our new LBFGS code is federated learning, and the code is here.
The technical details are found in the following publications (see the references in them for related work).