In Proceedings of the IEEE International Workshop on Emerging Technologies and Factory Automation, August 11-14, 1992, Melbourne, Australia, pp435-440.

Learning Logic Functions Explicitly by

Back-Propagation in NOR-Nets

David C. Keenan

116 Bowman Parade, Bardon QLD 4065, Australia

*Abstract* This paper describes networks whose units approximate simple logic gates (in particular NOR gates). These networks can learn by back-propagation and the trained network can be implemented directly, quickly and cheaply using existing programmable logic devices rather than special processors or custom VLSI hardware. Results are given for a preliminary experiment where a minimal NOR network learns XOR.

I. The Problem

Artificial neural networks can often avoid the need to design from explicit rules by *learning* to approximate an unknown logic function or finite automaton, but they are not (at time of writing) easily or cheaply implemented in fast hardware. They can be implemented cheaply but slowly by simulation on conventional computers, or quickly but expensively on special "neural" processors or as custom VLSI hardware.

Electronic logic devices such as PLDs (programmable logic devices) provide an easy, compact, cheap and fast hardware implementation of logic functions or finite automata, but PLDs cannot learn and so must be designed from explicit rules.

We could have the best of both worlds if neural network learning procedures such as back-propagation could be applied directly to simulated networks of logic gates so that the trained network could be implemented directly in PLDs.

There may be other applications for a network which can *learn* an unknown discrete function and provide *explicit* logical rules as a result. Of course artificial neural networks can learn to approximate continuous functions as well, but we are only concerned here with discrete functions and automata.

A. Back-Propagation

Back-propagation is a gradient descent search procedure most often applied to layered networks (both feedforward and recurrent) of the conventional smoothed-threshold-of-weighted-sum units, also called *sigmoid* units [1], [4], [6].

Gradient descent search in networks relies on being able to find the partial derivative of some error measure with respect to each weight in the network. We then change each weight by a small amount in the direction which will reduce the error. The error function most often used is the sum of the squares of the differences between the desired and actual activations over the output units.

The back-propagation algorithm uses the chain rule (for differentiation of composite functions) to substantially reduce the computation required to find the derivatives for the weights to non-output units. This is viewed as propagating the differences at the output units back through the network while activation functions are replaced by their derivative.

B. Programmable Logic Devices

A PLD is an electronic logic device; typically a 24 pin dual-in-line package containing a few hundred logic gates whose interconnections and external connections are not fully determined. That is, large parts of the connection matrix may be determined by *programming* the device.

In early devices, programming was a once-only process which consisted in blowing microscopic fuses to leave only the desired connections. Modern PLDs are reprogrammable about 10,000 times and cost around $10. The programmable part of the connection matrix is still referred to as the *fuse-map*. Considered as a weight matrix, the fuse-map consists of ‘weights’ which are either 0 (not connected) or 1 (connected).

Folded-NOR PLDs have only high fan-in NOR gates in place of the AND and OR layers of more conventional PLDs. Some of the NOR gates are dedicated in an output layer, but others are entirely uncommitted so that it is possible to create very general networks. Each input and fedback output is available both direct and complemented. These are referred to as *two-line* inputs. Other ‘fuses’ provide for each output to be direct or complemented, synchronous or asynchronous, connected or disconnected. If an output is disconnected, its pin may be used as an input since it is still fed back.

Fig. 1. Simplified schematic diagram of a typical folded-NOR PLD.

The functions implemented by a single PLD are limited by the number of input and output pins but there is scope for implementing more complex functions by stacking multiple PLDs, and PLDs with 68 pins have recently become available.

II. Trainable NOR Nets

Since it is possible to create any logic function with NOR (or NAND) gates alone [5], [8], [9], and since PLDs consisting solely of NOR gates are readily available, I tried to determine how back-propagation could be applied to networks of pseudo NOR gates.

Ideally the units would be connected by weights which can vary between 0 (not connected) and 1 (connected), so when the network has learnt, all weights will be very close to either 0 or 1.

For input and output (unit activation) values we will use 0 for ‘false’ and 1 for ‘true’. This gives the correct behaviour for NOR gates since an unconnected input must be equivalent to an input which is always ‘false’, whereas for NAND gates, ‘unconnected’ would have to be equivalent to ‘true’.

Given two-line inputs, any Boolean function may be implemented in two layers of NOR gates (equivalent to Conjunctive Normal Form, CNF). However, with only two layers, the number of hidden units required may be exponential in the number of inputs. This is the case with parity functions, a generalisation of XOR to more than two inputs. Fortunately a linear number of gates per layer may be obtained with only a logarithmic number of layers.

The binary version of Steinbuch’s *learning matrix* [10] has weights of only 0 or 1 but each layer asserts only one output at a time and a multi-layer learning algorithm is not provided.

A. Ski-Slope NOR Units

Preliminary experiments attempted to apply back-propagation to *fuzzy* NOR units [3]. On analysing the reasons for a lack of success with these units, it was realised that a sigmoid which is symmetrical about (0, 0) could be used to approximate a NOR function if only the right half was used and logically inverted (complemented). That is, we can use a conventional summing unit provided that the weights and activations are never negative and the soft-threshold function looks like a ski slope. However, rather than complementing the right half of a sigmoid, we can use a much simpler ski-slope function, such as negative exponential (e^{-x}), whose derivative is simply its negation.

Fig. 2. Comparison of exponential, *y* = e^{-x} (solid line), and complemented sigmoid. The region of interest is *x* ≥ 0.

Our units will have the following activation function:

*a*_{i} = φ(Σ_{j} *w*_{ij} *a*_{j})

where *a*_{i} is the activation of unit *i*, *w*_{ij} is the weight from unit *j* to unit *i*, and φ(*x*) is a ski-slope squashing function such as e^{-x} for *x* ≥ 0.

The differential equations for back-propagation in a ski-slope NOR net are similar to those for conventional sigmoid units except for the negation and slight simplification caused by using a pure negative exponential.

Given the activation function

*a*_{i} = e^{-Σ_{j} *w*_{ij} *a*_{j}}

and the usual sum of squares error function

*E* = ½ Σ_{i} (*t*_{i} - *a*_{i})^{2}

where *t*_{i} is the target activation of unit *i*, then for weights to output units we have

∂*E*/∂*w*_{ij} = -(*t*_{i} - *a*_{i})(-*a*_{i}) *a*_{j} = (*t*_{i} - *a*_{i}) *a*_{i} *a*_{j}

and for other weights

∂*E*/∂*w*_{jk} = -*a*_{j} *a*_{k} Σ_{i} (*t*_{i} - *a*_{i}) *a*_{i} *w*_{ij}

where the subscript *i* ranges only over output units.

Implementation details are given in three appendices which describe changes to the *EPDP* BP simulator source code and simulation files [4].

Fig. 3. Ski-slope NOR for two inputs, *z* = e^{-(x+y)}.

On learning a discrete function we hope that the trained network will contain weights which are either very large or close to zero, which we can interpret as connected or not connected for regular NOR gates.

III. An Experiment

The *EPDP* BP simulator was modified to simulate ski-slope NOR units as described in the appendices. The ski-slope function used was

φ(*x*) = e^{-x}

The error back-propagation code was modified to use the derivative of this ski-slope function.

∂φ/∂*x* = -φ(*x*)

A. The XOR Test

The test used in the following experiments is the learning of the XOR problem by a NOR net with two two-line inputs, two hidden units and one output unit. This is the minimum NOR net capable of implementing XOR. It can implement XOR in only two ways as shown in Fig. 4. It will be seen that these two implementations are identical under exchange of the two hidden NOR units.

Fig. 4. The two minimal implementations of XOR by a NOR net.

All trials used a learning rate of 0.25 with no momentum term and were considered successful if the total sum of squares error fell below 0.16. All trials used sequential (not randomly permuted) training and adjusted the weights after every pattern (rather than every epoch).

All weights were constrained to be positive and all biases zero and unmodifiable. ‘Hardwired’ units were included to compute the complement for each input (see Appendix C). The following four training patterns were used in the sequence shown.

*a*_{0}  *a*_{1}  *t*

0 0 0

0 1 1

1 0 1

1 1 0

B. Results

The network never failed to learn XOR. The trained network always corresponded to one of the two implementations shown in Fig. 4.

100 trials were conducted with random initial weights in the range 0 to 0.1. The number of epochs required ranged from 303 to 413.

10 trials were conducted with random initial weights in the range 0.5 to 0.6. The number of epochs required ranged from 271 to 335.

10 trials were conducted with random initial weights in the range 1.0 to 1.1. The number of epochs required ranged from 238 to 326.

It seems to matter little what the initial weights are, so long as they are not identical and not too large or too widely spread. It also appears likely that the learning rate can be increased substantially without causing oscillations.

IV. Conclusions

While many more experiments need to be performed, these preliminary results at least allow the possibility of trainable networks which have a direct interpretation in Boolean algebra or propositional logic and may be directly implemented in available electronic logic hardware.

V. Future Directions

Ski-slope NOR nets should be tested for their ability to learn more complex functions. Higher order parity functions should be attempted with two layer as well as deeper networks. Sequential functions such as latches and counters should be attempted.

We are interested in the question of what classes of Boolean functions and automata are *probably approximately learnable* [11] by this method, since we want to be able to use incomplete training sets, ones which are merely polynomial in the number of inputs and feedbacks, rather than exponential.

Most PLDs have an additional fuse for each output to determine whether the output is to be direct or complemented. In a two layer NOR net this has the effect of changing from CNF (conjunctive normal form) to DNF (disjunctive normal form) when inputs are also exchanged for their complements. Most functions require fewer hidden units in one form than in the other so it would be useful if the net could *learn* this fuse as well. This may be as simple as determining whether the output is more often true (DNF) or false (CNF) over the input combinations.

We could further examine sensitivity to initial weights and learning rate. Using initial weights near zero should have the additional advantage of minimising the number of connections. We could also try a decay term to minimise the number of connections.

We could try something similar to Fahlman’s cascade correlation algorithm [2], where hidden units are incrementally added to the net during learning.

How do ski-slope NOR and sigmoid units differ in computational power? Composing two ski-slope NOR units produces a kind of sigmoid unit. The sigmoid function is very skew but sigmoid nonetheless. The weight between the two NOR units (or rather its logarithm) plays a role somewhat like a bias term. The availability of both direct and complemented inputs compensates to some degree for the fact that weights are not allowed to be negative.

These observations suggest investigating how well ski-slope NOR nets can learn to approximate continuous functions, instead of merely binary ones. Of course, in this case the advantage of cheap hardware implementation using PLDs is lost.

Sejnowski and Rosenberg’s NETtalk [7] implements a discrete function which maps English text to phonemes. (An aside: I wonder how well NETtalk pronounces its first author’s name?) It might be instructive to attempt to train a NOR net with the NETtalk training data and to determine the number of hidden units/layers required. Sejnowski & Rosenberg used conventional sigmoid units with 203 inputs, 80 hidden units and 26 output units.

In addition to the possibility for direct hardware implementation, such a trained ‘NORtalk’ would provide a set of logical rules which could be used to map text into phonemes algorithmically.

Appendix A

Fixing a Bug and Adding a Feature Relating to Weight Constraints in the EPDP BP Simulator

Here are the changes required to fix a bug in the BP (back propagation) simulator from *EPDP* [4]. The bug is as follows: if your network (.net) file has no ‘constraints:’ section, then positive and negative constraints declared with the predefined constraint characters ‘p’ or ‘n’ are silently ignored for both weights and biases. Adding an empty ‘constraints:’ section is enough to work around this bug if you do not wish to change the source code.

We also give the changes required to add an obvious feature which doesn’t exist in the standard BP. That is, when a constraint character is defined with *both* a floating point number *and* the attribute ‘random’, then the random amount is *added* to the floating point value. Previously the ‘random’ attribute would take precedence and the number would be ignored. This feature is useful in experimenting with ski-slope NOR networks.

Simulator Source Code Changes

*In the file ‘weights.c’ make the following changes:* From near the end of the function ‘read_constraints’, cut the following

positive_constraints = ((double **)
    emalloc((unsigned int)(sizeof(double *) * MAXCONSTRAINTS)));
for (i = 0; i < MAXCONSTRAINTS; i++)
    positive_constraints[i] = NULL;
negative_constraints = ((double **)
    emalloc((unsigned int)(sizeof(double *) * MAXCONSTRAINTS)));
for (i = 0; i < MAXCONSTRAINTS; i++)
    negative_constraints[i] = NULL;

and paste it near the beginning of the function ‘define_net’, between

constants['n' - 'a'].negative = TRUE;

and

while (fscanf(in_stream, "%s", string) != EOF)

That fixes the bug. The remaining changes add the feature. In the function ‘read_network’ change

if (con[ch - 'a'].random) {
    if (con[ch - 'a'].positive) {
        :
        weight[r][since_first] = wrange * rnd();
        :
    else if (con[ch - 'a'].negative) {
        :
        weight[r][since_first] = wrange * (rnd() - 1);
        :
    else
        weight[r][since_first] = wrange * (rnd() - .5);
}
else
    weight[r][since_first] = con[ch - 'a'].value;

to

weight[r][since_first] = con[ch - 'a'].value;
if (con[ch - 'a'].random) {
    if (con[ch - 'a'].positive) {
        :
        weight[r][since_first] += wrange * rnd();
        :
    else if (con[ch - 'a'].negative) {
        :
        weight[r][since_first] += wrange * (rnd() - 1);
        :
    else
        weight[r][since_first] += wrange * (rnd() - .5);
}

where the dots stand for sections which don’t change.

In the function ‘read_biases’ make corresponding changes where ‘bias’ is substituted for ‘weight’ above.

Note that the file ‘weights.c’ also forms part of other *EPDP* simulators, IAC, CS and PA.

*In the file ‘bp.c’ make similar changes to those above:* In the function ‘reset_weights’ change

if (constants[ch - 'a'].random) {
    if (constants[ch - 'a'].positive) {
        weight[j][i] = wrange * rnd();
    }
    else if (constants[ch - 'a'].negative) {
        weight[j][i] = wrange * (rnd() - 1);
    }
    else
        weight[j][i] = wrange * (rnd() - .5);
}
else
    weight[j][i] = constants[ch - 'a'].value;

to

weight[j][i] = constants[ch - 'a'].value;
if (constants[ch - 'a'].random) {
    if (constants[ch - 'a'].positive) {
        weight[j][i] += wrange * rnd();
    }
    else if (constants[ch - 'a'].negative) {
        weight[j][i] += wrange * (rnd() - 1);
    }
    else
        weight[j][i] += wrange * (rnd() - .5);
}

and, in the same function, make the corresponding changes where ‘bias’ is substituted for ‘weight’ above.

Appendix B

Making the EPDP BP Simulator Symmetric about Zero

Here are the changes required to make the BP simulator from *EPDP* [4] use a sigmoid which is symmetric about zero (-1, +1) for faster and more reliable learning.

The sigmoid given below is not only symmetrical about zero but has a midpoint slope of 1, so as to equalise the initial learning rates of output vs. hidden units. See the last two paragraphs of the answer to Q.5.1.1 in Appendix E of *EPDP* for a discussion of this problem. I also prefer a midpoint slope of 1.0 because it’s easier to reason about, and it eliminates a multiplication from the simulator.

Simulation File Changes

Because the range is doubled, and the slope quadrupled, certain parameters need to be scaled accordingly. Since the maximum error is doubled, sums of squares of errors will be quadrupled.

Here’s how to adjust the parameters in your startup (.str) files.

ecrit -> ecrit * 4

tmax -> 1 - (2 * (1 - tmax)) eg. 1.0 -> 1.0, 0.9 -> 0.8

lrate -> lrate / 4

wrange -> wrange / 4.

You might also take the opportunity to ‘set mode lgrain pattern’ (the default) since this is less susceptible to sticking at local minima.

The weights and biases in any weight initialisation (.wts) files should be divided by 4.

If you are only making these changes in preparation for making the changes for the trainable-NOR simulation in Appendix C then don’t bother changing your pattern files now. Otherwise you should adjust the values in your pattern (.pat) files according to

value -> value * 2 - 1.

If the values are Boolean you can just change all the zeros to hyphens. Remember that ‘+’ can be used instead of ‘1’, ‘.’ used instead of ‘0’ and ‘-’ instead of ‘-1’. I find that using ‘-’ for ‘false’ and ‘1’ for ‘true’ gives the most readable-at-a-glance patterns.

Simulator Source Code Changes

In the file ‘weights.c’ change:

double lrate = 0.5;

double wrange = 1;

to

double lrate = 0.125;

double wrange = 0.25;

Note that the file ‘weights.c’ also forms part of other *EPDP* simulators, IAC, CS and PA.

*In the file ‘bp.c’ make the following changes:* In the function ‘logistic’ change

if (x > 15.935773) return(.99999988);
else if (x < -15.935773) return(.00000012);
else return(1.0 / (1.0 + exp(-1.0 * x)));

to

if (x > 22) return( 0.9999999999999999997);       /* largest double < 1.0   */
else if (x < -22) return(-0.9999999999999999999); /* smallest double > -1.0 */
else return((2.0 / (1.0 + exp(-2.0 * x))) - 1.0);

Note: different systems may require different cutoffs to prevent a result of 1.0 or -1.0 while ensuring monotonicity. The above is correct for a Macintosh using double-precision IEEE arithmetic.

Change all occurrences of the word ‘logistic’ to ‘sigmoid’ since the above function no longer corresponds to the logistic function.

In the function ‘compute_error’ change

if(target[j] >= 0) /* We care about this one */

to

if(target[j] >= -1.0) /* We care about this one */

so that we can still use a negative number (< -1) to indicate ‘no target’ while allowing actual target values down to -1.0.

Also change

del = delta[i] = error[i] * activation[i] * (1.0 - activation[i]);

to

del = delta[i] = error[i] * (1.0 - (activation[i] * activation[i]));

which multiplies the error by the derivative of our symmetric sigmoid. Note that 1 - *a*·*a* = (1 - *a*)(1 + *a*).

In the function ‘setinput’ change

if ( *pp < 0.0 ) {

to

if ( *pp < -1.0 ) {

so that we can still use negative numbers (< -1) in patterns to refer to the previous activation of a unit while allowing actual input values down to -1. We are unlikely to want the previous value of unit 1 since it will usually be an input.

In the function ‘settarget’ change

else if(target[i] == 0.0) {

target[i] = 1 - tmax;

to

else if(target[i] == -1.0) {

target[i] = - tmax;

In the function ‘sumstats’ change

if (target[j] >= 0) {

to

if (target[j] >= -1.0) {

Appendix C

Making the EPDP BP Simulator Simulate Ski-Slope NOR Units

Here are the changes required to make the BP simulator from *EPDP* [4] simulate trainable NOR gates so that the trained network explicitly embodies logical rules and can be directly implemented by electronic logic devices. The changes given below are intended to be applied *after* the changes given in Appendices A and B.

Simulation File Changes

In your network (.net) files, all weights must be constrained to be positive, eg. use ‘p’s or ‘%p’. All biases must be zero and unmodifiable, eg. use ‘.’s or ‘%.’.

Since every input must appear both direct and complemented, you must either double the number of inputs, or preferably add a ‘hardwired’ unit for every input to compute its complement. For recurrent networks, the use of hardwired units is the only reasonable alternative for supplying direct and complemented values of the so called context units.

Such a unit will be connected to its associated input or context unit by a large positive unmodifiable weight (45 is as good as infinity when using double precision arithmetic) and will take no input from anywhere else. To specify an unmodifiable weight, use an upper case character for the constraint.

Fig. 5. A ski-slope NOR net showing input-complement units.

These hardwired units can be referred to as the *input-complement layer* to distinguish them from the ‘real’ hidden layer(s), i.e. the trainable hidden layer(s). The output of such an input-complement unit will be treated as an input in regard to connection to the rest of the network. That is, it will usually be connected via modifiable weights to the first ‘real’ hidden layer, in the same way as its associated direct input.

Your template (.tem) files are likely to need changing to display the modified network.

If your pattern (.pat) files have been changed for the symmetrical BP so that ‘false’ is represented by ‘-1’ or ‘-’ then they need to be changed back to ‘0’ or ‘.’. Any existing weight initialisation (.wts) files will be useless.

Simulations can probably be run without requiring changes to parameters in startup (.str) files, although better results may be obtained by changing them once some understanding is gained.

Simulator Source Code Changes

*In the file ‘bp.c’ make the following changes:* After the function ‘sigmoid’ (which was previously called ‘logistic’), add a new function

double ski_slope (double x) {
    if (x > 44) return(0.0000000000000000001); /* smallest double > 0.0 */
    else return(exp(-1.0 * x));
}

In the functions ‘init_output’, ‘cycle’ and ‘compute_output’, change

activation[i] = sigmoid(net);

to

activation[i] = ski_slope(net);

When combined with biases fixed at zero, and weights which are constrained to be positive, this gives us our pseudo-NOR function.

In the function ‘compute_error’, change

del = delta[i] = error[i] * (1.0 - (activation[i] * activation[i]));

to

del = delta[i] = error[i] * activation[i] * -1.0;

which multiplies the error by the derivative of our ski-slope activation function.

Acknowledgement

Thanks go to Andrew Smith (Launceston, TAS) for updating the *EPDP* BP source code so I could modify and compile it on my Macintosh, and for introducing me to *Laws of Form* [9] which was a source of inspiration regarding the universality of unbounded fan-in NOR gates. This work was carried out while the author was supported by an Australian Postgraduate Coursework Award.

References

[1] J. L. Elman, "Finding structure in time," *Cognitive Science*, vol. 14, pp. 179-211, 1990.

[2] S. E. Fahlman and C. Lebiere, "The Cascade-Correlation Learning Algorithm," Carnegie Mellon University, Tech Report CMU-CS-90-100, 1990.

[3] A. Kandel and S. C. Lee, *Fuzzy Switching and Automata: Theory and Applications*. New York: Crane Russak, 1979. St Lucia, Australia: University of Queensland Press, 1979.

[4] J. L. McClelland and D. E. Rumelhart, "Training Hidden Units: The Generalized Delta Rule," in *Explorations in Parallel Distributed Processing: A Handbook of Models, Programs, and Exercises* (includes software on floppy disk). Cambridge, MA: MIT Press, 1988.

[5] M. L. Minsky, "Neural Networks: Automata made up of parts," in *Computation: Finite and Infinite machines*. Englewood Cliffs, NJ: Prentice-Hall, 1967, pp. 32-66.

[6] D. E. Rumelhart, G. E. Hinton and R. J. Williams, "Learning internal representations by error propagation," in *Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1 Foundations*, D.E. Rumelhart, J.L. McClelland and the PDP Research Group Eds. Cambridge, MA: MIT Press, 1986, pp. 354-361.

[7] T. J. Sejnowski and C. R. Rosenberg, "Parallel networks that learn to pronounce English text," *Complex Systems,* vol. 1, pp. 145-168, 1987.

[8] H. M. Sheffer, "A set of five independent postulates for Boolean algebras, with application to logical constants," *Transactions of the American Mathematical Society*, vol. 14, pp. 481-488, 1913.

[9] G. Spencer-Brown, *Laws of Form. *London: George Allen and Unwin, 1969. 2nd edition, New York: Julian Press, 1972. 3rd edition, New York: E.P.Dutton Paperback, 1979.

[10] K. Steinbuch and U. A. W. Piske, "Learning Matrices and their Applications," *IEEE Transactions on Electronic Computers,* pp. 846-862, 1963.

[11] L. G. Valiant, "A theory of the learnable," *Communications of the ACM*, vol. 27, no. 11, pp. 1134-1142, 1984.