Simulated Annealing Algorithms


By Malcolm McLean

Example code (C source)
Parallel example (MPI)

Simulated Annealing was introduced by Kirkpatrick (1983) as a method for optimising general complex systems. His examples were the Travelling Salesman problem, in which a salesman must complete a tour of several cities, connected by roads of varying lengths, visiting each city exactly once by the shortest possible route, and the real-world problem of placing wires on an integrated circuit.

Simulated annealing is an enhancement of iterative improvement, also known as the evolutionary algorithm. In a simple evolutionary algorithm, the program starts with a random initial solution, and then slightly mutates or perturbs it. If the new solution is better than the old, it replaces it. Iterative improvement works when each small change in solution space equates to a small change in result space, and when the result surface, or “energy surface”, is convex, with a single basin. If the relationship between solution and result is very complex, and these conditions do not hold, the algorithm may become trapped in a local minimum, like a golf ball coming to rest in a shallow depression rather than on the lowest part of the course.


Simulated Annealing resolves the problem by applying a temperature factor. A move is always accepted if it leads to an improvement, and sometimes accepted if it leads to a deterioration, with a probability based on the temperature. At high temperatures local minima are easily jumped out of; at low temperatures the algorithm becomes effectively iterative improvement. By starting at a high temperature and slowly reducing it, the entire result surface can be sampled, and the search directed to more favourable regions.


Simulated Annealing can be applied to virtually any optimisation problem. All that is required is a scalar measure of “fitness” for each solution, a way of generating an initial solution, and a way of perturbing the solution so that a reasonable number of mutants are viable. The last requirement is in practice the most stringent. Often successful evolutionary programming relies on the representation of the solution rather than upon any particular search strategy. For instance, it is quite easy to make a random initial string converge on the sequence “Methinks it is like a weasel” if the fitness function is simply the sum of the ASCII code differences between corresponding characters, as sketched below. However, if we add the stipulation that all intermediates must be valid English sentences, then the problem becomes very tricky indeed.
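
As an illustration, a fitness function for the weasel problem might look something like this in C (the function name and the use of a simple sum of absolute differences are assumptions of the sketch):

#include <stdlib.h>

/* Fitness of a candidate string: the sum of the absolute ASCII
   differences from the target, so lower is better and zero is an
   exact match. Assumes the two strings are of equal length. */
int weasel_fitness(const char *candidate, const char *target)
{
  int score = 0;

  while (*target)
    score += abs(*candidate++ - *target++);
  return score;
}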


Though it is a general-purpose method, Simulated Annealing was inspired by a real physical phenomenon, the statistical mechanics of many-body systems at a given temperature.


At a high temperature, atoms move around due to thermal energy. As collisions occur, some atoms speed up whilst others slow down. Overall energy, of course, remains constant, due to conservation of energy. The distribution of speeds is given by the Maxwell-Boltzmann distribution, where it is assumed that velocity in the x, y and z directions is normally distributed. Total speed is


sqrt( vx^2 + vy^2 + vz^2 )


leading to the characteristic skewed distribution of speeds.


In a gas, the molecules will have this distribution of velocities, given by the equation


Nj/N = exp( -E(j) / (k * T) ) / sum over i of exp( -E(i) / (k * T) )


T is the temperature, j is an arbitrary energy state, or binned velocity, Nj is the number of molecules in state j, N is the total number of molecules, and the sum in the denominator runs over all possible states. k is the Boltzmann constant, 1.380649 × 10^-23 J K^-1, which is the gas constant R divided by Avogadro’s number, i.e. the gas constant per molecule. Where the molecules are complex, and have different internal energies of conformation, the Boltzmann distribution also gives the distribution of states. The ratio Nj/N is the proportion of molecules that are in state j, and the denominator of the equation is known as the “partition function”.
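
As a sketch, the distribution can be computed directly; the function name and the convention of passing kT in the same units as the energies are assumptions of the example.

#include <math.h>

/* Fill prob[i] with the Boltzmann probability of state i.
   energy[] holds the state energies and kT the product of the
   Boltzmann constant and the temperature, in the same units as
   the energies. z is the partition function. */
void boltzmann(const double *energy, double *prob, int n, double kT)
{
  double z = 0.0;
  int i;

  for (i = 0; i < n; i++)
    z += exp(-energy[i] / kT);
  for (i = 0; i < n; i++)
    prob[i] = exp(-energy[i] / kT) / z;
}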

Now imagine that a protein or other big molecule is at a non-zero temperature. As it is buffeted by solvent molecules, it changes conformation. Since the changes are random, this is termed a Monte Carlo process, after the roulette tables of the resort. Over time, the conformations must match the Boltzmann distribution. Assuming that each change is relatively small, the molecule takes a random walk in conformation space. Some changes will increase internal energy, others will decrease it. To allow the molecule to reach the Boltzmann distribution over time, we need the Metropolis criterion for acceptance (Metropolis, 1953). If a move is energetically favourable, or neutral, it is always accepted. If it is unfavourable, it is accepted with a probability of p = exp( -d / (k*T) ).


k is the Boltzmann constant, d the energy difference, and T the temperature.

Or in code


#include <stdlib.h>   /* for rand() and RAND_MAX */
#include <math.h>     /* for exp() */

/*
  Apply the Metropolis criterion for acceptance
  Params: current - current energy level
          test - energy of the new conformation
          temperature - the temperature of the simulation in Kelvin
  Returns: 1 to accept the move, 0 to reject it.
*/
int metrop(double current, double test, double temperature)
{
  double p;
  double k = 1.38;  /* Boltzmann constant multiplied by 10^23;
                       energies must be in matching units of 10^-23 J */

  if(test <= current)
    return 1;
  else
  {
    /* uniform random number in [0, 1) */
    p = rand() / (RAND_MAX + 1.0);
    if( p < exp( (current - test) / (k * temperature) ) )
      return 1;
    else
      return 0;
  }
}

Note that the Boltzmann distribution will only result if there is no mutation pressure – for every proposed move from state A to state B there must be an equal chance of the opposite move, from state B to state A, being proposed. If a move is rejected, the current state is sampled again, so its count is still incremented. It is also necessary for every state to be accessible from every other, with energy barriers crossable within the time of the simulation; all barriers will eventually be crossed if the simulation is run for infinite time. This is called the ergodic property. Note also that the density of states – the denominator of the Boltzmann equation – is as important as the energy of a state in determining how often it is occupied. This represents entropy in a physical system. If there are a large number of disordered states similar in energy to the lowest-energy, ordered state, the system will spend most of its time in disorder.
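
For a continuous state, a symmetric move might be sketched like this (the uniform step and the choice of a single coordinate are assumptions of the example):

#include <stdlib.h>

/* Perturb one randomly chosen coordinate by a uniform step in
   [-step, step). Because the step distribution is symmetric about
   zero, the reverse move is proposed with equal probability, which
   preserves the reversibility the Boltzmann distribution requires. */
void perturb(double *x, int n, double step)
{
  int i = rand() % n;                     /* pick a coordinate */
  double u = rand() / (RAND_MAX + 1.0);   /* uniform in [0, 1) */

  x[i] += (2.0 * u - 1.0) * step;
}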


If the temperature is lowered slowly enough, the system always converges on the most stable state. However, if it is lowered too quickly, or “quenched”, it can become trapped in a local minimum. This also has a physical analogue. Spin glasses, which are not glasses at all but a type of magnetic material which is to normal magnets as glass is to a true crystal, consist of particles with magnetic moments which become trapped in random orientations. There are a vast number of metastable states of the system, corresponding to local minima of the energy function. However, if the glass is repeatedly heated and cooled, it will eventually adopt the most stable state, with all particles in the same orientation.


Cooling schedules


If a Simulated Annealing system is cooled slowly enough, with a logarithmic schedule, such that


T[i] = R / log(i + 1)


with suitably large R, it will always converge on the minimum (Geman and Geman, 1984). However R often has to be so large that this is not a practical schedule for most applications.


A faster schedule is linear cooling, under which the inverse temperature 1/T rises linearly with the step number.

T[i] = R / i

The most commonly used schedule is exponential

T[i] = T[0] * a^i

where a is slightly under 1.0. (This is equivalent to multiplying the temperature by a at each step.)


Hajek (1988) proved that exponential cooling could not guarantee convergence. The initial temperature, the cooling factor, and the number of runs or the terminating temperature must be selected to give reasonable odds of finding the minimum, within the constraints of available processing time.


Another common schedule is the plateau system with termination. Temperature is lowered in a series of discrete steps, with a relatively large number of Metropolis moves carried out at each temperature. If the number of accepted moves exceeds a certain limit, the temperature is lowered to the next level prematurely. This tends to cut down the initial high-temperature search.
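
A sketch of the plateau system, for a one-dimensional state (energy() and the exponential drop between plateaux are assumptions of the example; metrop() is the routine given above):

#include <stdlib.h>

extern double energy(double x);   /* problem-specific score - assumed */
extern int metrop(double current, double test, double temperature);

/* Plateau cooling: hold each temperature for up to max_steps moves,
   dropping to the next level early once max_accepts moves have been
   accepted. */
double plateau_anneal(double x, double T, double Tmin, double factor,
                      int max_steps, int max_accepts)
{
  while (T > Tmin)
  {
    int step, accepts = 0;

    for (step = 0; step < max_steps && accepts < max_accepts; step++)
    {
      /* symmetric trial move in [-0.5, 0.5) */
      double trial = x + rand() / (RAND_MAX + 1.0) - 0.5;
      if (metrop(energy(x), energy(trial), T))
      {
        x = trial;
        accepts++;
      }
    }
    T *= factor;                  /* drop to the next plateau */
  }
  return x;
}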


The tabu search (Glover, 1986) is a refinement of the algorithm such that previously visited states are not revisited. Note that strictly this breaks the reversibility property of a Metropolis walk, and therefore the ensemble of states no longer represents the Boltzmann distribution. The problem with the algorithm is that every state visited has to be stored. However, if it is possible to hash the values of the state, the search can be relatively fast. Oakley (2005) found that tabu searches were poor for searching the conformational space of proteins, but good at searching localised regions of the energy surface.
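
If states can be hashed, the visited check might be sketched like this (the hash function, the table size, and the tolerance of collisions are assumptions of the example):

#include <stddef.h>

#define TABU_SIZE 1048576            /* table size - an assumption */

static unsigned char seen[TABU_SIZE];

/* Hash the raw bytes of a state and record the visit. Returns 1 if
   the slot was already marked (treat the move as tabu), 0 otherwise.
   Hash collisions will occasionally mark a fresh state as tabu. */
int tabu(const void *state, size_t len)
{
  const unsigned char *p = state;
  unsigned long h = 5381;
  size_t i;

  for (i = 0; i < len; i++)
    h = h * 33 + p[i];               /* djb2-style hash */
  h %= TABU_SIZE;
  if (seen[h])
    return 1;
  seen[h] = 1;
  return 0;
}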


“Quenching” means dropping the temperature to zero, so that the search becomes a steepest descent; the term comes, by analogy, from the physical process whereby the swordsmith repeatedly heats and then quickly cools his blade. One relatively successful algorithm is to do a relatively coarse-grained simulated annealing run, with steepest descent searches to find the local minimum in between each step. As the temperature of the coarse run is reduced, the zero-temperature minimisation step gets longer.
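
The hybrid scheme might be sketched like this (anneal_step() and steepest_descent() are assumed helper routines, and the schedule for lengthening the descent is arbitrary):

extern double anneal_step(double x, double T);        /* one coarse SA move - assumed */
extern double steepest_descent(double x, int iters);  /* local minimiser - assumed */

/* Coarse-grained annealing with a zero-temperature minimisation
   between steps. The descent is lengthened as the temperature falls. */
double hybrid_anneal(double x, double T, double Tmin, double factor)
{
  int iters = 10;                  /* initial descent length - an assumption */

  while (T > Tmin)
  {
    x = anneal_step(x, T);          /* coarse, high-temperature move */
    x = steepest_descent(x, iters); /* polish to the local minimum */
    T *= factor;
    iters += 10;                    /* minimisation step gets longer */
  }
  return x;
}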


There have been some attempts to select the cooling schedule automatically. One of the simplest is to choose an initial temperature at which the initial acceptance rate is approximately 50%. The problem is that if the start configuration is completely random, then the chance of a random perturbation making an improvement is already close to 50%. Depending on the move set, the state may move quickly or slowly to the space of interesting solutions. However, at the initial temperature, the algorithm must be able to sample the whole of the energy surface, the so-called “ergodic” property.
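
The calibration might be sketched like this (random_move_delta(), the doubling strategy, and the batch size are assumptions of the example; metrop() is the routine given above, so energies are in its scaled units):

extern double random_move_delta(void);  /* energy change of a trial move - assumed */
extern int metrop(double current, double test, double temperature);

/* Double the temperature until roughly half of a batch of trial
   moves are accepted, and return it as the starting temperature. */
double initial_temperature(double T, int trials)
{
  for (;;)
  {
    int i, accepted = 0;

    for (i = 0; i < trials; i++)
      if (metrop(0.0, random_move_delta(), T))
        accepted++;
    if (accepted * 2 >= trials)
      return T;
    T *= 2.0;
  }
}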

The Lam (1988) schedule is an adaptive exponential schedule which attempts to match the cooling to the statistical properties of the move set, by measuring the variance of the average energy change in each move. The inverse temperature is updated as


s[k+1] = s[k] + lambda * (1/sigma) * (1/(s[k]*s[k] * sigma*sigma)) * (vchange/2)


s is 1/T, or the reciprocal of the temperature.

lambda is a quality factor

sigma is the standard deviation of the energy, computed over a rolling window in which earlier states are gradually down-weighted

vchange is the variance of the average energy change during a proposed move.

(see Chu 1999 for a summary of the method).
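
Transcribed directly from the formula above (maintaining sigma and vchange as rolling statistics of the run is left to the caller):

/* One update of the inverse temperature s = 1/T under the Lam
   schedule, as given above. */
double lam_update(double s, double lambda, double sigma, double vchange)
{
  return s + lambda * (1.0 / sigma) *
             (1.0 / (s * s * sigma * sigma)) * (vchange / 2.0);
}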


Parallel implementations

Simulated annealing is inherently a serial algorithm, since the state of the system at any time step is dependent on its state in the previous step. The score function varies from problem to problem, and can sometimes be parallelised, but often not, because the function has to be simple enough to allow for millions of evaluations in reasonable time. Modern cheap parallel machines are built from commodity personal computer processors, networked together. Adding extra processors is thus relatively cheap, but the speed of communication is poor. A practical algorithm has to minimise inter-processor communication, but it can be relatively extravagant with processors.

The simplest way of parallelising simulated annealing is to make multiple independent runs. The algorithm is run N times in parallel, and the lowest value taken. If all runs converge on the same value, that is a powerful indication that the true minimum has been found.
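
A minimal MPI sketch of this approach (anneal() is an assumed routine that performs one complete annealing run from the processor’s own random seed):

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

extern double anneal(void);   /* one complete annealing run - assumed */

int main(int argc, char **argv)
{
  int rank;
  double local, best;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  srand(rank + 1);                /* give each processor its own random stream */
  local = anneal();               /* runs proceed with no communication */
  /* collect the minimum energy over all runs on processor 0 */
  MPI_Reduce(&local, &best, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
  if (rank == 0)
    printf("best energy %g\n", best);
  MPI_Finalize();
  return 0;
}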


The replica exchange method (Swendsen 1986, Hansmann 1997) involves two parallel runs at different temperatures, Ti and Tj, with an exchange accepted with probability

p = exp( -(1/(k*Tj) - 1/(k*Ti)) * (Ei - Ej) )

The high temperature run samples the conformational space, whilst the low temperature run ensures that the local minimum is found. Where there are more than two processors, some schedule for exchanging must be found; normally adjacent temperatures exchange. Since only the temperatures are exchanged, just two temperature values, rather than two sets of coordinates, need to be physically passed between processors. The replicas have no other contact with each other, so when all are in an identical state we have a reasonable assurance that the algorithm has converged.
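
The exchange test, transcribed from the formula above (the scaled Boltzmann constant follows the convention of metrop() and is an assumption):

#include <stdlib.h>
#include <math.h>

/* Decide whether two replicas, at temperatures Ti and Tj with
   energies Ei and Ej, should swap temperatures. */
int exchange(double Ei, double Ti, double Ej, double Tj)
{
  double k = 1.38;   /* Boltzmann constant x 10^23, as in metrop() */
  double p = exp( -(1.0/(k*Tj) - 1.0/(k*Ti)) * (Ei - Ej) );

  return rand() / (RAND_MAX + 1.0) < p;
}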


Another parallel algorithm, culljump, is to make several runs at the same temperature, and periodically cull the bottom half of the distribution, replacing it with copies of the surviving solutions. An optimal result very quickly spreads to all the rest of the processors, yet variation is maintained for a few steps. This algorithm appears to scale well.



References

Azencott R (ed) (1992) Simulated Annealing: Parallelization Techniques. John Wiley and Sons, New York.

Glover F (1986) Comput Oper Res 13: 533.

Swendsen RH, Wang JS (1986) Replica Monte Carlo simulation of spin-glasses. Physical Review Letters 57: 2607-2609.

Hansmann UHE (1997) Parallel tempering algorithm for conformational studies of biological molecules. Chemical Physics Letters 281: 140-150.

Hajek B (1988) Cooling schedules for optimal annealing. Math Oper Res 13: 311-329. See also Azencott (1992).

Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E (1953) Equation of state calculations by fast computing machines. J Chem Phys 21: 1087-1092.

Lam J, Delosme JM (1988) Performance of a new annealing schedule. Paper 22.1, 25th ACM/IEEE Design Automation Conference.

Chu KW, Deng Y, Reinitz J (1999) Parallel simulated annealing by mixing of states. Journal of Computational Physics 148: 646-662.
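
Kirkpatrick S, Gelatt CD, Vecchi MP (1983) Optimization by simulated annealing. Science 220: 671-680.

Geman S, Geman D (1984) Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence 6: 721-741.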