Computer Science 294-7 Lecture #20
Compute Blocks


Notes by Luca Carloni


1 Programmable Logic Array

1.1 PLAs, LUTs, PALs

An arbitrary Boolean function can be expressed in a canonical form called the two-level sum-of-products representation. This representation can be mapped into a very regular implementation by an automated process. The circuit structure that makes this possible is called the Programmable Logic Array (PLA). Fig. 1.1 illustrates the high regularity of the PLA logic structure: a first layer of gates implements the AND operations, producing the product terms (minterms), while a second layer realizes the OR functions, producing the sum terms.

Fig. 1.1 PLA

In this particular case, we have a PLA with K inputs, N product terms, and M outputs. Each small yellow square in the AND-plane represents a memory cell that allows us to build a product term from a selection of the input signals (and their complements). The memory cells in the OR-plane allow a product term to be included in a chosen subset of the output functions. PLAs provide a fast implementation of large product terms and can be used as the logic block within an FPGA by linking them with programmable interconnect, just as with LUTs. On the other hand, there are some interesting functions (e.g., parity, arithmetic) for which the number of product terms is exponential in the number of inputs. It is interesting to compare the structure of a PLA with that of a ROM memory array (or a LUT, see Fig. 1.2).
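To make the roles of the two planes concrete, here is a minimal behavioral sketch (not tied to any particular device) of a PLA with K inputs, N product terms, and M outputs; the programmed memory cells are modelled simply as lists of connections.

    # Minimal behavioral sketch of a PLA with K inputs, N product terms, and M
    # outputs. and_plane[j] lists the literals (input index, complemented?) of
    # product term j; or_plane[i] lists the product terms feeding output i.
    def eval_pla(inputs, and_plane, or_plane):
        products = [all((not inputs[k]) if neg else inputs[k] for k, neg in term)
                    for term in and_plane]
        return [int(any(products[j] for j in terms)) for terms in or_plane]

    # example: f0 = a*b' + c, f1 = a*b'   (K=3 inputs, N=2 terms, M=2 outputs)
    and_plane = [[(0, False), (1, True)], [(2, False)]]
    or_plane = [[0, 1], [0]]
    print(eval_pla([1, 0, 0], and_plane, or_plane))   # -> [1, 1]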

Fig. 1.2 Memory (LUT)

Topologically, the two structures are identical. The only difference is that the decoder (AND-plane) of the ROM enumerates all possible minterms, while the AND-plane of the PLA realizes only a limited set of them. Finally, Fig. 1.3 illustrates a Programmable Array Logic device (PAL), which has a fixed OR-plane and a programmable AND-plane.

Fig 1.3 PAL
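The comparison with a memory can also be made in terms of configuration bits. The counts below are a back-of-the-envelope sketch (not the area model used in [2]), assuming one bit per crosspoint: a K-input, M-output memory stores 2^K x M bits, while a PLA needs about 2K AND-plane bits per product term (one for the true and one for the complemented form of each input) plus M OR-plane bits per term.

    # Rough configuration-bit counts (a sketch, not the area model of [2]).
    def lut_bits(k, m):
        # a memory/LUT enumerates all 2**k minterms
        return 2 ** k * m

    def pla_bits(k, n, m):
        # n product terms, each with 2k AND-plane cells and m OR-plane cells
        return n * (2 * k + m)

    for k in (4, 8, 12, 16):
        print(k, lut_bits(k, 4), pla_bits(k, n=13, m=4))
    # the memory grows exponentially in k, the PLA only linearly
    # (for a fixed number of product terms)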

In [1], Brown and Rose give a complete tutorial on the commercially available Field-Programmable Devices (FPDs), providing a clear classification into three main categories: simple PLDs, complex PLDs, and FPGAs.

1.2 PLA-based FPGAs

In [2], Kouloheris and El Gamal investigate experimentally the tradeoff between the area of a PLA-based FPGA and its cell granularity. They propose PLAs as an area-efficient alternative to LUTs on the basis of the following two considerations:
  1. a LUT with many more than 4 inputs is prohibitively large, since its size grows exponentially with the number of inputs;
  2. it has been found that, on average, the functions mapped into LUTs use considerably fewer product terms than the LUT capacity.
The analysis of Kouloheris and El Gamal is carried out on a variety of benchmark sets, with an area model that accounts for global routing. A plot of total area versus (K, M) (where K and M are, respectively, the number of PLA inputs and outputs) gives the following results: the smallest total area is obtained for a PLA with 8-10 inputs, 3-4 outputs, and 12-13 product terms; for K > 4, the differences between 2-, 3-, and 4-output cells are not statistically significant.

1.3 PLA-based FPGAs vs. LUT-based FPGAs

Kouloheris and El Gamal compared the smallest LUT implementations with the smallest PLA implementations, assuming the same programming technology, namely EPROM cells. The total area of the PLA-cell implementation ranges from 80% to 130% of the LUT one (excluding the ECC benchmarks, which give very bad results, namely 300% worse). A disadvantage of these PLA-based implementations is that they dissipate static power. On the other hand, the authors claim that the PLAs require on average 25% fewer wiring tracks and 40% fewer levels of logic, which would lead to better performance. Still, it should be considered that a reduction in logic levels generally produces an increase in the average fanout per cell output, which can limit the possible performance gain.

1.4 PLA Area Optimization Techniques

Statistical analysis shows that, on average, only about half of the inputs are involved in any PLA product term. This suggests that a fixed product-term folding could be used to reduce the PLA size (Fig. 1.4).

Fig. 1.4 PLA Area Optimization: Product-term Folding
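One way to read the folding of Fig. 1.4 (an interpretative sketch, not necessarily the exact scheme of the figure): two product terms whose used-input sets are disjoint can share a single physical AND-plane row, and terms that each touch only about half of the inputs are natural candidates for such pairing.

    # Interpretative sketch of product-term folding: two terms can share one
    # physical AND-plane row when the sets of inputs they actually use are
    # disjoint. A term is a list of (input index, complemented?) literals, as
    # in the PLA sketch of Section 1.1.
    def support(term):
        return {k for k, _ in term}

    def can_share_row(term_a, term_b):
        return support(term_a).isdisjoint(support(term_b))

    t1 = [(0, False), (1, True)]      # uses inputs 0 and 1
    t2 = [(2, False), (3, False)]     # uses inputs 2 and 3
    print(can_share_row(t1, t2))      # True -> the two terms can be folded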

Moreover, the fact that only about 10% of the product terms are shared between outputs suggests fixing the OR-plane, as in PALs (Fig. 1.5).

Fig. 1.5 PLA Area Optimization: Fixed OR-Plane

1.5 A Commercial PLA Device: ALTERA 9000

The general architecture of the ALTERA 9000 is shown in Fig. 1.6a: the FastTrack interconnect provides the communication among the Logic Array Blocks (LABs) and the I/O cells.

Fig. 1.6a ALTERA 9000: General Architecture

Each LAB contains 16 macrocells having the structure illustrated in Fig. 1.6b: a programmable AND-plane feeds an OR gate and a flip-flop.

Fig. 1.6b ALTERA 9000: Logic Array Block
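As a behavioral sketch (a hypothetical model, not a gate-accurate description of the device), such a macrocell can be described as a small AND-OR structure whose result is captured by a flip-flop on each clock edge, reusing the product-term representation of Section 1.1.

    # Hypothetical behavioral model of a product-term macrocell: programmable
    # AND-plane rows ORed together and registered by a flip-flop.
    class Macrocell:
        def __init__(self, terms):
            self.terms = terms   # each term: list of (input index, complemented?)
            self.q = 0           # flip-flop state

        def clock(self, inputs):
            products = [all((not inputs[k]) if neg else inputs[k] for k, neg in t)
                        for t in self.terms]
            self.q = int(any(products))   # OR output captured on the clock edge
            return self.q

    mc = Macrocell([[(0, False), (1, True)], [(2, False)]])   # a*b' + c
    print(mc.clock([1, 0, 0]))   # -> 1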

2 Universal Logic Module

2.1 ULMs and FPGAs

Universal Logic Modules (ULMs) are logic blocks capable of realizing all functions of a fixed number of variables, assuming that permutations and negations of the variables are provided outside these blocks. Old research on ULMs [3] and new work on FPGAs had not been related until recently, when studies started appearing on the usefulness of ULMs as logic blocks in FPGAs [4,5]. ULMs are defined as blocks with m general-purpose inputs that can realize any function of up to n inputs, with n < m, under the assumption that permutations and negations of signals are generated cost-free outside the logic block. This assumption virtually holds for FPGAs.

2.2 Equivalence Classes of Boolean Functions

The set of Boolean functions of n variables can be divided into equivalence classes under the following operations: negation of the inputs (N), permutation of the inputs (P), and negation of the output (N). Equivalence under all three operations is called NPN-equivalence. The following table shows the equivalence classes for the Boolean functions of 2 variables (here @ denotes XOR): we have, respectively, 5 N-equivalence classes, 4 NP-equivalence classes, and 3 NPN-equivalence classes.
Function | N   | NP  | NPN
---------+-----+-----+-----
0        |     |     |
1        |     |     |
a        | a   | a   | a
a'       |     |     |
b        | b   |     |
b'       |     |     |
ab       | ab  | ab  | ab
a'b      |     |     |
ab'      |     |     |
a'b'     |     |     |
a+b      | a+b | a+b |
a'+b     |     |     |
a+b'     |     |     |
a'+b'    |     |     |
a@b      | a@b | a@b | a@b
a'@b'    |     |     |
---------+-----+-----+-----
16       | 5   | 4   | 3
(An entry marks the representative of a new equivalence class; a blank cell means the function falls into the class of the representative listed above it.)
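These counts can be reproduced mechanically. The short enumeration below computes the N-, NP-, and NPN-equivalence classes of the two-variable functions by brute force; following the table, the two constant functions are left out of the class counts.

    # Brute-force count of N-, NP-, and NPN-equivalence classes of the
    # (non-constant) two-variable Boolean functions.
    from itertools import permutations, product

    VARS = 2
    ROWS = list(product((0, 1), repeat=VARS))    # all (a, b) input combinations

    def transform(tt, perm, in_neg, out_neg):
        # apply an input permutation, input negations, and an output negation
        # to a truth table given as {(a, b): value}
        new = {}
        for inputs, value in tt.items():
            permuted = tuple(inputs[p] for p in perm)
            negated = tuple(v ^ n for v, n in zip(permuted, in_neg))
            new[negated] = value ^ out_neg
        return tuple(new[r] for r in ROWS)

    def canonical(bits, use_perm, use_out_neg):
        # canonical representative of the class of `bits` under the chosen ops
        tt = dict(zip(ROWS, bits))
        perms = list(permutations(range(VARS))) if use_perm else [tuple(range(VARS))]
        out_negs = (0, 1) if use_out_neg else (0,)
        return min(transform(tt, p, n, o)
                   for p in perms
                   for n in product((0, 1), repeat=VARS)
                   for o in out_negs)

    # the 14 non-constant functions of two variables
    funcs = [bits for bits in product((0, 1), repeat=4) if len(set(bits)) > 1]

    for label, use_perm, use_out_neg in (("N", False, False),
                                         ("NP", True, False),
                                         ("NPN", True, True)):
        print(label, len({canonical(f, use_perm, use_out_neg) for f in funcs}))
    # prints: N 5, NP 4, NPN 3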
Fig. 2.1 shows an example of a ULM, namely a ULM.2, which can implement all the functions of two variables either by routing the two input variables a and b (or their complements) appropriately onto the input pins y0, y1, y2, or by assigning the constant values 0 and 1 to these pins.

Fig. 2.1 An Example: ULM.2

The following table shows a set of possible assignments for y0, y1, y2 realizing all the 2-input functions.
Function | y0 | y1 | y2
---------+----+----+----
0        | 0  | a  | 1
1        | 1  | a  | 0
a        | 1  | a  | 1
a'       | 0  | a  | 0
b        | b  | a  | b'
b'       | b' | a  | b
ab       | b  | a  | 1
a'b      | 0  | a  | b'
ab'      | b' | a  | 1
a'b'     | 0  | a  | b
a+b      | 1  | a  | b'
a'+b     | b  | a  | 0
a+b'     | 1  | a  | b
a'+b'    | b' | a  | 0
a@b      | b' | a  | b'
a'@b'    | b' | a  | b
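The exact gate-level structure of the ULM.2 is the one shown in Fig. 2.1. As a sketch of how such a table can be checked mechanically, the snippet below uses a 2:1 multiplexer as the candidate module (an assumption for this sketch; it is one classic three-pin ULM for two-variable functions, and it is consistent with y1 = a appearing in every row above as a select-like input) and exhaustively searches over pin assignments drawn from {0, 1, a, a', b, b'}.

    # Universality check for a candidate three-pin module; the module assumed
    # here is a 2:1 multiplexer, not necessarily the exact circuit of Fig. 2.1.
    from itertools import product

    def mux(y0, y1, y2):
        # f = y1'*y0 + y1*y2, with y1 acting as the select line
        return y0 if y1 == 0 else y2

    SOURCES = {"0": lambda a, b: 0, "1": lambda a, b: 1,
               "a": lambda a, b: a, "a'": lambda a, b: 1 - a,
               "b": lambda a, b: b, "b'": lambda a, b: 1 - b}

    def realizable(target):
        # return a pin assignment (y0, y1, y2) realizing target(a, b), if any
        for names in product(SOURCES, repeat=3):
            if all(mux(*(SOURCES[n](a, b) for n in names)) == target(a, b)
                   for a in (0, 1) for b in (0, 1)):
                return names
        return None

    # all 16 two-variable functions, as truth tables indexed by (a, b)
    all_funcs = [lambda a, b, t=t: t[2 * a + b] for t in product((0, 1), repeat=4)]
    assert all(realizable(f) is not None for f in all_funcs)
    print(realizable(lambda a, b: a ^ b))   # one valid pin assignment for a XOR b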

2.3 Replacing LUTs with ULMs for FPGAs?

Each computational block of an FPGA can be implemented with a ULM instead of a LUT. However, observe that, to compensate for the lack of internal programmability, a ULM always has more inputs than the equivalent LUT. The following table shows the relationship between the number of inputs n of the class of functions to be realized and the minimum number of inputs of the corresponding ULM:
n | ULM inputs
--+-----------
2 |  3
3 |  5
4 |  8
5 | 13
6 | 21
Hence, if we suppose that the LUT.4s within an FPGA are replaced with the corresponding ULM.4s, we can see the impact of the increased number of inputs. In fact, Fig. 2.2 shows that the number of switches necessary to route a channel of length 10 more than doubles. Notice that we also accounted for the depopulation of input switches that a LUT permits thanks to its input permutability and that is precluded for a ULM.

Fig. 2.2 ULM Input and Switches

In [5], Zilic and Vranesic propose a class of ULM circuits for FPGAs that limits the number of input pins to n by using separate programming bits. They also present a methodology for the systematic development of ULM circuits based on the BDD representation of Boolean functions. They give an explicit construction of a ULM.3 that can replace a 3-input LUT using only 5 programming bits (i.e., saving 3 bits). Moreover, they propose a practical solution with 13 bits for implementing the 222 NPN-equivalence classes of functions of 4 variables. They claim that these ULMs give advantages both in logic block area and in internal delay.
However, this approach demands more flexibility from the network (one can no longer permute LUT inputs) and more interconnect programming bits. Under dense coding assumptions, giving up input permutability alone costs roughly log2(k!) bits (4-5 bits for k=4) per LUT. Consequently, the bits saved in the compute block are likely to be required in the network instead. With non-dense interconnect encoding, the cost in network bits is even greater, as we see in Fig. 2.2.
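For reference, the quoted figure is just log2(k!) evaluated at k = 4; the few lines below tabulate it for small k.

    # interconnect bits needed to recover input-permutation freedom under a
    # dense coding assumption: log2(k!) per k-input block
    import math

    for k in (2, 3, 4, 5, 6):
        print(k, round(math.log2(math.factorial(k)), 2))
    # k = 4 gives log2(24) ~ 4.58, i.e. the "4-5 bits per LUT" quoted above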


References

  1. S. Brown and J. Rose. Architecture of FPGAs and CPLDs: A Tutorial. IEEE Design and Test of Computers, 13(2):42--57, Summer 1996.
  2. J. Kouloheris and A. El Gamal. PLA-based FPGA Area versus Cell Granularity. In Proceedings of the Custom Integrated Circuits Conference, pages 4.3.1--4. IEEE, May 1992.
  3. X. Chen and S. L. Hurst. A Comparison of Universal-Logic-Module Realizations and Their Application in the Synthesis of Combinatorial and Sequential Logic Networks. IEEE Transactions on Computers, 31(2):140--147, February, 1982.
  4. C. C. Lin and M. Marek-Sadowska. Universal Logic Gate for FPGA Design. In Proceedings of ICCAD-94, pages 164--168, 1994.
  5. Z. Zilic and Z. G. Vranesic. Using BDDs to Design ULMs for FPGAs. In Proceedings of the International Symposium on Field Programmable Gate Arrays, pages 24--30, February 1996.
