# On Low Complexity Bit Parallel Polynomial Basis Multipliers 

Arash Reyhani-Masoleh and M. Anwar Hasan<br>Centre for Applied Cryptographic Research, University of Waterloo, Waterloo, Ontario, Canada N2L 3G1.<br>\{areyhani, ahasan\}@uwaterloo.ca


#### Abstract

Representing finite field elements with respect to the polynomial (or standard) basis, we consider a bit parallel multiplier architecture for the finite field $G F\left(2^{m}\right)$. Time and space complexities of such a multiplier heavily depend on the field defining irreducible polynomials. Based on a number of important classes of irreducible polynomials, we give exact complexity analyses of the multiplier gate count and time delay. In general, our results match or outperform the previously known best results in similar classes. We also present exact formulations for the coordinates of the multiplier output. Such formulations are expected to be useful to efficiently implement the multiplier using hardware description languages, such as VHDL and Verilog, without having much knowledge of finite field arithmetic.


Keywords: Finite or Galois field, Mastrovito multiplier, pentanomial, polynomial basis, trinomial and equally-spaced polynomial.

## 1 Introduction

With the rapid expansion of the Internet and wireless communications, more and more digital systems are equipped with some form of cryptosystem to provide various kinds of data security. Many such cryptosystems rely on computations in very large finite fields and require these computations to be fast [51]. Among the basic arithmetic operations over the finite field $G F\left(2^{m}\right)$, addition is easily realized using $m$ two-input XOR gates, while multiplication is costly in terms of gate count and time delay.

In the past, many bit parallel multipliers were proposed (see for example [3, 9, 2, 11, 6, 10]). In [43], Mastrovito proposed an algorithm along with its hardware architecture for polynomial basis (PB) multiplication. In his scheme, a binary matrix is first formed and then multiplied with a binary vector to obtain the required result. Halbutogullari and Koc have given a method for constructing the Mastrovito multiplier for arbitrary irreducible polynomials [2]. This method considers general as well as special classes of irreducible polynomials such as trinomials, all-one polynomials (AOPs) and equally-spaced polynomials (ESPs). So far, for these special polynomials, the XOR gate count and time delay of the Halbutogullari-Koc algorithm appear to be the lowest. In [11], Zhang and Parhi give a systematic method to design the Mastrovito multiplier. Moreover, in [11], the method is extended to design the modified Mastrovito multiplication scheme proposed in [8]. They also present new results on the complexities of the Mastrovito multiplier for two classes of irreducible pentanomials. Recently, Rodriguez-Henriquez and Koc in [7] have proposed a PB multiplier for a special case of pentanomials and have given its time and gate complexities.

In this article, first we review the multiplication scheme and its bit-parallel architecture presented in [6]. Then, using the reduction matrix $\mathbf{Q}$, the complexities of the multiplier based on a number of irreducible polynomials are obtained. We also present explicit formulations for the output coordinates of the multiplier in terms of its inputs. Such formulations can be directly coded using VHDL or Verilog languages to implement an efficient multiplier by someone who is not that familiar with finite field arithmetic. It is shown that for general irreducible polynomials, the space and time complexities of the proposed structure are lower than those available in the literature in terms of combined gate count and time delay. Furthermore, this architecture has fewer signals to be routed which is advantageous for VLSI implementation.

## 2 Polynomial Basis Multiplications over $G F\left(2^{m}\right)$

Let $P(x)=x^{m}+\sum_{i=0}^{m-1} p_{i} x^{i}$ be a monic irreducible polynomial over $G F(2)$ of degree $m$, where $p_{i} \in G F(2)$ for $i=0,1, \cdots, m-1$. Let $\alpha \in G F\left(2^{m}\right)$ be a root of $P(x)$, i.e., $P(\alpha)=0$. Then the set $\left\{1, \alpha, \alpha^{2}, \cdots, \alpha^{m-1}\right\}$ is referred to as the polynomial or standard basis and each element of $G F\left(2^{m}\right)$ can be written with respect to (w.r.t.) the polynomial basis (PB). Let $A$ be an element in $G F\left(2^{m}\right)$, then the representation of $A$ w.r.t. the PB is $A=\sum_{i=0}^{m-1} a_{i} \alpha^{i}, a_{i} \in$ $\{0,1\}$, where $a_{i}$ 's are the coordinates. For convenience, these coordinates will be denoted in vector notation ${ }^{1}$ as $\mathbf{a}=\left[a_{0}, a_{1}, a_{2}, \cdots, a_{m-1}\right]^{T}$, where $T$ denotes the transposition. Using this vector notation, the representation of $A$ can be written as $A=\boldsymbol{\alpha}^{T} \mathbf{a}$, where $\boldsymbol{\alpha}=\left[1, \alpha, \alpha^{2}, \cdots, \alpha^{m-1}\right]^{T}$. Let $S$ be the binary polynomial of degree not more than $2 m-2$ obtained by the direct multiplication of the PB representations of any two elements $A$ and $B$ of $G F\left(2^{m}\right)$, i.e.,

$$
\begin{equation*}
S=\left(\sum_{i=0}^{m-1} a_{i} \alpha^{i}\right) \cdot\left(\sum_{j=0}^{m-1} b_{j} \alpha^{j}\right)=\sum_{k=0}^{m-1} d_{k} \alpha^{k}+\sum_{k=0}^{m-2} e_{k} \alpha^{m+k} \tag{1}
\end{equation*}
$$

where

$$
\begin{align*}
& \mathbf{d}=\left[d_{0}, d_{1}, \cdots, d_{m-1}\right]^{T}=\mathbf{L b},  \tag{2}\\
& \mathbf{e}=\left[e_{0}, e_{1}, \cdots, e_{m-2}\right]^{T}=\mathbf{U b}, \tag{3}
\end{align*}
$$

$$
\mathbf{L} \triangleq\left[\begin{array}{cccccc}
a_{0} & 0 & 0 & 0 & \cdots & 0 \\
a_{1} & a_{0} & 0 & 0 & \cdots & 0 \\
a_{2} & a_{1} & a_{0} & 0 & \cdots & 0 \\
\vdots & \vdots & \ddots & \ddots & \ddots & \vdots \\
a_{m-2} & a_{m-3} & \cdots & a_{1} & a_{0} & 0 \\
a_{m-1} & a_{m-2} & \cdots & a_{2} & a_{1} & a_{0}
\end{array}\right], \quad \mathbf{U} \triangleq\left[\begin{array}{cccccc}
0 & a_{m-1} & a_{m-2} & \cdots & a_{2} & a_{1} \\
0 & 0 & a_{m-1} & \cdots & a_{3} & a_{2} \\
\vdots & \vdots & \ddots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & 0 & a_{m-1} & a_{m-2} \\
0 & 0 & \cdots & 0 & 0 & a_{m-1}
\end{array}\right] . \tag{4}
$$
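The products $\mathbf{L b}$ and $\mathbf{U b}$ are simply the low and high halves of the unreduced polynomial product. The following minimal Python sketch illustrates this; the function name `split_product` and the list-of-bits representation (index $i$ holds the coefficient of $\alpha^{i}$) are assumptions made for illustration only and are not part of the original architecture.

```python
def split_product(a, b):
    """Return (d, e), the low and high halves of the unreduced product in (1)-(3).

    a and b are lists of m bits; a[i] is the coefficient of alpha^i.
    """
    m = len(a)
    s = [0] * (2 * m - 1)            # coefficients of S, degree <= 2m-2
    for i in range(m):
        for j in range(m):
            s[i + j] ^= a[i] & b[j]  # one AND per (i, j) pair, accumulated by XOR
    d = s[:m]                        # d_k multiplies alpha^k,     0 <= k <= m-1
    e = s[m:]                        # e_k multiplies alpha^(m+k), 0 <= k <= m-2
    return d, e


# (1 + alpha)(alpha + alpha^2) = alpha + alpha^3 for m = 3
print(split_product([1, 1, 0], [0, 1, 1]))   # -> ([0, 1, 0], [1, 0])
```

The double loop visits each of the $m^{2}$ partial products once, which mirrors the $m^{2}$ AND gates of the IP-network.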

Then, the product $C=A \cdot B$ can be obtained by the following modulo reduction.

$$
\begin{equation*}
C \triangleq \sum_{i=0}^{m-1} c_{i} \alpha^{i} \equiv S \bmod P(\alpha) \tag{5}
\end{equation*}
$$

Definition 1. [3] The reduction matrix $\mathbf{Q}$ is an $m-1$ by $m$ binary matrix which is obtained from

$$
\begin{equation*}
\boldsymbol{\alpha}^{\uparrow} \equiv \mathbf{Q} \boldsymbol{\alpha}(\bmod P(\alpha)) \tag{6}
\end{equation*}
$$

where $\boldsymbol{\alpha}^{\uparrow}=\left[\alpha^{m}, \alpha^{m+1}, \cdots, \alpha^{2 m-2}\right]^{T}$.

Theorem 1. [6] Let $C$ be the product of $A$ and $B \in G F\left(2^{m}\right)$. Then,

$$
\begin{equation*}
\mathbf{c}=\left[c_{0}, c_{1}, \cdots, c_{m-1}\right]^{T}=\mathbf{d}+\mathbf{Q}^{T} \mathbf{e} \tag{7}
\end{equation*}
$$

where $\mathbf{d}$, $\mathbf{e}$ and $\mathbf{Q}$ are defined in (2), (3), and (6) respectively.
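As a small worked illustration of Definition 1 and Theorem 1 (an example added here under the assumption $m=3$ and $P(x)=x^{3}+x+1$, which is irreducible over $G F(2)$), note that $\alpha^{3}=1+\alpha$ and $\alpha^{4}=\alpha+\alpha^{2}$. Hence

$$
\mathbf{Q}=\left[\begin{array}{lll}
1 & 1 & 0 \\
0 & 1 & 1
\end{array}\right], \quad \mathbf{c}=\mathbf{d}+\mathbf{Q}^{T} \mathbf{e}=\left[\begin{array}{l}
d_{0}+e_{0} \\
d_{1}+e_{0}+e_{1} \\
d_{2}+e_{1}
\end{array}\right] .
$$

For $A=1+\alpha$ and $B=\alpha+\alpha^{2}$, the direct product is $S=\alpha+\alpha^{3}$, so $\mathbf{d}=[0,1,0]^{T}$ and $\mathbf{e}=[1,0]^{T}$. The expressions above then give $\mathbf{c}=[1,0,0]^{T}$, i.e., $C=1$, which agrees with $\alpha+\alpha^{3}=\alpha+(1+\alpha)=1$.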
The corresponding architecture for polynomial basis multiplication over $G F\left(2^{m}\right)$ is shown in Figure 1. This structure is divided into two parts: the IP-network and the $\mathbf{Q}$-network. The IP-network has $m$ blocks (denoted as $I_{0}, I_{1}, \cdots, I_{m-1}$) which generate the vectors $\mathbf{d}$ and $\mathbf{e}$ in accordance with (2) and (3), using $m^{2}$ AND gates and $(m-1)^{2}$ XOR gates. Using (2) and (3), the delays for $d_{j}, 0 \leq j \leq m-1$, and $e_{i}, 0 \leq i \leq m-2$, can be calculated from

$$
\begin{align*}
T\left(d_{j}\right) & =T_{A}+\left\lceil\log _{2}(j+1)\right\rceil T_{X}, \quad 0 \leq j \leq m-1  \tag{8}\\
T\left(e_{i}\right) & =T_{A}+\left\lceil\log _{2}(m-i-1)\right\rceil T_{X}, \quad 0 \leq i \leq m-2 \tag{9}
\end{align*}
$$

In Figure 1, the $\mathbf{Q}$-network takes $\mathbf{d}$ and $\mathbf{e}$ as inputs and generates $\mathbf{c}$. It is noted that the number of lines on the interconnection bus IB is fixed and is equal to the number of $e_{j}$'s, i.e., $m-1$. In Figure 1, there are three buses, $A, B$ and IB, and the total number of lines on the buses is $3 m-1$.

In the following sections, we attempt to minimise the number of XOR gates of the $\mathbf{Q}$-network for special irreducible polynomials, namely equally-spaced polynomials, trinomials, and pentanomials. We start with equally-spaced polynomials, which are very structured and will help us present the remaining special cases with less difficulty.

## 3 Multipliers Using Equally-Spaced Polynomials

Definition 2. A polynomial $P(x)=x^{n s}+x^{(n-1) s}+\cdots+x^{s}+1$, over $G F(2)$, with $n s=m$ and $1 \leq s \leq\left\lfloor\frac{m}{2}\right\rfloor$, is called an equally-spaced polynomial (denoted as $s$-ESP) of degree $m$.


Fig. 1. Architecture of the multiplier over $G F\left(2^{m}\right)$, where $C S^{i}$ represents an $i$-fold cyclic shift.

When $s=1$, we have the 1-ESP, which is the same as the all-one polynomial (AOP) and has the highest Hamming weight among all polynomials of degree $m$. On the other hand, $s=\left\lfloor\frac{m}{2}\right\rfloor$ results in the least Hamming weight irreducible polynomial (i.e., trinomial) of degree $m$. It is easy to check that for an equally-spaced trinomial, $m$ is even and $s=\frac{m}{2}$.

Theorem 2. For an $s$-ESP based multiplier over $G F\left(2^{m}\right)$, the number of AND gates $\left(N_{A}\right)$, the number of XOR gates $\left(N_{X}\right)$ and time delay $\left(T_{C}\right)$ are $N_{A}=m^{2}$, $N_{X}=m^{2}-s$, and $T_{C}=T_{A}+\left(1+\left\lceil\log _{2} m\right\rceil\right) T_{X}$, respectively.

Proof. When $\alpha$ is a root of the $s$-ESP of degree $m$ as defined above, we have

$$
\alpha^{m+i}= \begin{cases}\alpha^{i}+\alpha^{s+i}+\cdots+\alpha^{(n-1) s+i}, & 0 \leq i<s,  \tag{10}\\ \alpha^{i-s}, & s \leq i \leq m-2 .\end{cases}
$$

Using (10), the reduction matrix $\mathbf{Q}$ is obtained as

$$
\mathbf{Q}=\left[\begin{array}{c}
\begin{array}{cccc}
\mathbf{I}_{s} & \mathbf{I}_{s} & \cdots & \mathbf{I}_{s}
\end{array} \\
\begin{array}{cc}
\mathbf{I}_{m-s-1} & \mathbf{0}_{s+1}
\end{array}
\end{array}\right] \tag{11}
$$

where $\mathbf{I}_{j}$ is the $j \times j$ identity matrix and $\mathbf{0}_{s+1}$ is a zero matrix which has $m-s-1$ rows and $s+1$ columns. The graphical representations of $\mathbf{Q}$ in (11) for different values of $s$ are shown in Figure 2. In this figure, non-zero entries of $\mathbf{Q}$ are shown as small squares.

In order to obtain exact expressions for $N_{X}$ and $T_{C}$, first we obtain the coordinates of $C$. To this end, from Theorem 1 and (11), one can write

$$
\begin{equation*}
c_{j}=d_{j}^{\prime}+e_{j \bmod s}, \quad 0 \leq j \leq m-1 \tag{12}
\end{equation*}
$$



Fig. 2. Graphical representations of the locations of non-zero entries of $\mathbf{Q}$ for $s$-ESP $P(x)=x^{n s}+x^{(n-1) s}+\cdots+x^{s}+1, m=n s$. (a) $s=1$ (AOP), (b) $1<s<\frac{m}{2}$, (c) $s=\frac{m}{2}$ (trinomial).

where

$$
d_{j}^{\prime}= \begin{cases}d_{j}+e_{j+s} & 0 \leq j \leq m-s-2,  \tag{13}\\ d_{j} & m-s-1 \leq j \leq m-1 .\end{cases}
$$

Thus, using (12) and (13), the exact XOR gate count for an $s$-ESP based multiplier is $N_{X}=m^{2}-s$. Also, by using (8) and (9), $d_{j}^{\prime}$ of (13) can be generated with a maximum gate delay of $T_{A}+\left(1+\left\lceil\log _{2} m\right\rceil\right) T_{X}$.
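The output expressions (12) and (13) translate almost directly into code. The Python sketch below is only an illustration under stated assumptions: the names `esp_multiply`, `reference_multiply` and `bits` are hypothetical, field elements are lists of bits with index $i$ holding the coefficient of $\alpha^{i}$, and the reference routine is an ordinary shift-and-reduce multiplier used purely as a cross-check.

```python
def esp_multiply(a, b, s):
    """Multiply in GF(2^m) defined by the s-ESP of degree m = len(a), using (12)-(13)."""
    m = len(a)
    assert m % s == 0 and 1 <= s <= m // 2
    # IP-network: d = L b and e = U b from (2)-(3).
    prod = [0] * (2 * m - 1)
    for i in range(m):
        for j in range(m):
            prod[i + j] ^= a[i] & b[j]
    d, e = prod[:m], prod[m:]
    # Eq. (13): d'_j = d_j + e_{j+s} while j <= m-s-2, otherwise d'_j = d_j.
    d1 = [d[j] ^ e[j + s] if j <= m - s - 2 else d[j] for j in range(m)]
    # Eq. (12): c_j = d'_j + e_{j mod s}; only e_0..e_{s-1} are needed here.
    return [d1[j] ^ e[j % s] for j in range(m)]


def reference_multiply(a, b, p_low):
    """Shift-and-reduce cross-check; p_low[i] is the coefficient p_i of P(x), i < m."""
    m = len(a)
    c = [0] * m
    for j in reversed(range(m)):     # Horner form: c = c * alpha + b_j * a
        carry = c[-1]
        c = [0] + c[:-1]
        if carry:                    # alpha^m = sum of p_i alpha^i
            c = [x ^ y for x, y in zip(c, p_low)]
        if b[j]:
            c = [x ^ y for x, y in zip(c, a)]
    return c


def bits(x, m):
    """Integer -> list of m bits, least significant first."""
    return [(x >> i) & 1 for i in range(m)]


# Exhaustive check for the AOP x^4 + x^3 + x^2 + x + 1 (m = 4, s = 1).
m, s = 4, 1
p_low = [1 if i % s == 0 else 0 for i in range(m)]
for x in range(1 << m):
    for y in range(1 << m):
        assert esp_multiply(bits(x, m), bits(y, m), s) == \
            reference_multiply(bits(x, m), bits(y, m), p_low)
```

Counting the XOR operations after the IP-network gives $m-s-1$ for (13) and $m$ for (12), which together with the $(m-1)^{2}$ XOR gates of the IP-network yields $m^{2}-s$.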

It is worth mentioning that the resultant number of signal lines on IB reduces from $m-1$ to $s$, which is considerably lower than the $s$-ESP based Mastrovito multiplier which has $\frac{m(m-s)}{2 s}+m$ signal lines [4]. Thus, the total number of lines on the buses of the multiplier is $2 m+s$.

## 4 Extension to More Generic Polynomials

Here we consider irreducible polynomials of the form $P(x)=x^{m}+x^{k_{t}}+\cdots+$ $x^{k_{2}}+x^{k_{1}}+1$, where $1 \leq k_{1}<k_{2}<\cdots<k_{t} \leq \frac{m}{2}$. The Hamming weight of $P(x)$ is $t+2$ and the degree of the second leading term is less than or equal to $\frac{m}{2}$. All five binary fields recommended by NIST for ECDSA can be constructed by such irreducible polynomials.

In order to apply the general formulation stated in Section 2 to these polynomials, first we obtain the corresponding $\mathbf{Q}$ matrix. Note that all the rows of the $\mathbf{Q}$ matrix are the PB representations of $\alpha^{m+i}, 0 \leq i \leq m-2$, where $\alpha$ is a root of $P(x)$. Since $P(\alpha)=0$, then $\alpha^{m}=1+\alpha^{k_{1}}+\alpha^{k_{2}}+\cdots+\alpha^{k_{t}}$. Thus, the 0-th row, i.e., $i=0$, has only ones in these $t+1$ columns of $\mathbf{Q}: 0, k_{1}, k_{2}, \cdots, k_{t}$. The consecutive rows of this matrix can be obtained by using a linear feedback shift register (LFSR). As a result, the rows with $i=0$ to $m-k_{t}-1$ of $\mathbf{Q}$ have $t+1$ ones.
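The LFSR-style construction of the rows of $\mathbf{Q}$ can be sketched in a few lines of Python. This is a minimal illustration, assuming a hypothetical helper `reduction_matrix(m, ks)` where `ks` holds $(k_{1}, \ldots, k_{t})$; it is not code from the original work.

```python
def reduction_matrix(m, ks):
    """Rows of Q for P(x) = x^m + x^{k_t} + ... + x^{k_1} + 1, generated LFSR-style.

    Row i holds the polynomial-basis coordinates of alpha^(m+i), 0 <= i <= m-2.
    """
    feedback = [0] * m
    for k in (0, *ks):
        feedback[k] = 1                 # alpha^m = 1 + alpha^k_1 + ... + alpha^k_t
    rows = [feedback[:]]
    for _ in range(m - 2):
        prev = rows[-1]
        nxt = [0] + prev[:-1]           # multiply by alpha: shift one column right
        if prev[-1]:                    # a one leaving column m-1 is fed back
            nxt = [x ^ y for x, y in zip(nxt, feedback)]
        rows.append(nxt)
    return rows


for row in reduction_matrix(3, (1,)):   # P(x) = x^3 + x + 1
    print(row)
# [1, 1, 0]
# [0, 1, 1]
```

For $m=8$ and $(k_{1}, k_{2}, k_{3})=(1,3,4)$, rows $0$ to $m-k_{t}-1=3$ produced by this sketch each contain $t+1=4$ ones, in line with the statement above.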

The $\mathbf{Q}$ matrices for $t=1$ and $t=3$ (i.e., trinomials and pentanomials, respectively) are shown in Figure 3. As shown in this figure, row $i, 0 \leq i \leq m-k_{t}-1$, of $\mathbf{Q}$ has $t+1$ ones corresponding to the $t+1$ segmented lines. When the last column of $\mathbf{Q}$ contains a one, which takes place in row $i=m-k_{j}-1, j=t, \cdots, 2,1$, the next row originates $t+1$ new lines in columns $0, k_{1}, k_{2}$, up to $k_{t}$, provided that there are no previous lines that pass through these columns. If there exists a previous line that passes through the column $k_{j}, 1 \leq j \leq t$, then the previous line terminates in column $k_{j}-1$ and no new line originates from column $k_{j}$ due to XORing of two lines. This happens in row $\frac{m}{2}$ and column $\frac{m}{2}$ in Figure 2(c) for trinomials when $k_{1}=\frac{m}{2}$. This is also the case for pentanomials where $t=3$ and it is shown in Figures 3(c) and 3(d) for $k_{1}=1$ and $1<k_{1} \leq \frac{m}{2}$, respectively.

Fig. 3. Graphical representations of the reduction matrix $\mathbf{Q}$ for trinomials: (a) $k=k_{1}=1$, (b) $1<k<\frac{m}{2}$ (see Figure 2(c) for $k_{1}=\frac{m}{2}$); and for pentanomials: (c) $k_{1}=1$, (d) $1<k_{1} \leq \frac{m}{2}$.

We divide the lines of $\mathbf{Q}$ into $t+1$ sets (see Figure 4 for $t=3$ ) such that $\mathbf{Q}=\mathbf{Q}_{0}+\mathbf{Q}_{1}+\mathbf{Q}_{2}+\cdots+\mathbf{Q}_{t}$ where non-zero entries of $\mathbf{Q}_{i}, 0 \leq i \leq t$ start from the column $k_{i}$ (assume that $k_{0}=0$ ). It is noted that the last non-zero entry of sub-matrix $\mathbf{Q}_{i}, 1 \leq i \leq t$ is in column $m-1$, whereas the one in $\mathbf{Q}_{0}$ is in column $m-2$. Moreover, the number of ones in each column of $\mathbf{Q}_{i}, 0 \leq i \leq t$ is at most $t+1$ if $k_{1}>1$, and $t$ if $k_{1}=1$.


Fig. 4. Graphical representations of submatrices of $\mathbf{Q}=\mathbf{Q}_{0}+\mathbf{Q}_{1}+\mathbf{Q}_{2}+\mathbf{Q}_{3}$ for pentanomials $P(x)=x^{m}+x^{k_{3}}+x^{k_{2}}+x^{k_{1}}+1$, where $1<k_{1}<k_{2}<k_{3} \leq \frac{m}{2}$ (see Figure 3 for $\mathbf{Q}$). (a) $\mathbf{Q}_{0}$, (b) $\mathbf{Q}_{1}$, (c) $\mathbf{Q}_{2}$, (d) $\mathbf{Q}_{3}$.

Theorem 3. The number of XOR gates and the time delay of the multiplier based on the irreducible polynomial $P(x)=x^{m}+x^{k_{t}}+\cdots+x^{k_{2}}+x^{k_{1}}+1$, $1 \leq k_{1}<k_{2}<\cdots<k_{t} \leq \frac{m}{2}$, are

$$
N_{X}=(m+t)(m-1)
$$

$$
T_{C}=T_{A}+\left(\left\lceil\log _{2}(t+1)\right\rceil+\left\lceil\log _{2}\left(\left\lceil\frac{t}{2}\right\rceil+1\right)\right\rceil+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}
$$

Proof. Let us denote $\mathbf{e}^{(i)}=\left[e_{0}^{(i)}, e_{1}^{(i)}, \cdots, e_{m-1}^{(i)}\right]^{T}=\mathbf{Q}_{i}^{T} \mathbf{e}, 0 \leq i \leq t$, then using Theorem 1, we can obtain the coordinates of the pentanomial based multiplication as

$$
\begin{equation*}
\mathbf{c}=\mathbf{d}+\mathbf{e}^{(0)}+\mathbf{e}^{(1)}+\mathbf{e}^{(2)}+\cdots+\mathbf{e}^{(t)} . \tag{14}
\end{equation*}
$$

First, let us assume $k_{1} \neq 1$. Using $\mathbf{Q}_{0}$ (see Figure 4 (a) for $t=3$ ), the elements of $\mathbf{e}^{(0)}$ are as follows:

$$
e_{j}^{(0)}= \begin{cases}e_{j}+e_{j+m-k_{t}}+\cdots+e_{j+m-k_{2}}+e_{j+m-k_{1}}, & \text { if } 0 \leq j \leq k_{1}-2  \tag{15}\\ e_{j}+e_{j+m-k_{t}}+\cdots+e_{j+m-k_{2}} & \text { if } k_{1}-1 \leq j \leq k_{2}-2 \\ \vdots & \vdots \\ e_{j}+e_{j+m-k_{t}} & \text { if } k_{t-1}-1 \leq j \leq k_{t}-2 \\ e_{j} & \text { if } k_{t}-1 \leq j \leq m-2 \\ 0 & \text { if } j=m-1\end{cases}
$$

The total number of XOR gates to form the $e_{j}^{(0)}$'s, $0 \leq j \leq k_{t}-2$, is $N_{1}=t\left(k_{1}-1\right)+(t-1)\left(k_{2}-k_{1}\right)+\cdots+k_{t}-k_{t-1}=\sum_{i=1}^{t} k_{i}-t$. Let $T\left(e_{j}^{(0)}\right)$ denote the time delay due to gates to find $e_{j}^{(0)}$. As seen in (15), the longest path delay is to obtain $e_{0}^{(0)}=e_{0}+e_{m-k_{t}}+\cdots+e_{m-k_{2}}+e_{m-k_{1}}$, i.e., $T\left(e_{j}^{(0)}\right) \leq T\left(e_{0}^{(0)}\right)$. In order to reduce this delay, we first add pairs of terms other than $e_{0}$, e.g., $e_{m-k_{j}}+e_{m-k_{i}}$, $1 \leq i, j \leq t, i \neq j$. We then add these $\left\lceil\frac{t}{2}\right\rceil$ signals to $e_{0}$ using a binary tree of XOR gates. Since $T\left(e_{j}\right)=T_{A}+\left\lceil\log _{2}(m-j-1)\right\rceil T_{X}$, then $T\left(e_{m-k_{j}}+e_{m-k_{i}}\right) \leq T_{X}+T\left(e_{m-k_{t}}\right)=T_{A}+\left(1+\left\lceil\log _{2}\left(k_{t}-1\right)\right\rceil\right) T_{X} \leq T_{A}+\left\lceil\log _{2}(m-1)\right\rceil T_{X}$, where the last inequality is due to $k_{t} \leq \frac{m}{2}$. Thus, we have

$$
T\left(e_{j}^{(0)}\right) \leq \begin{cases}T_{A}+\left(\left\lceil\log _{2}\left(\left\lceil\frac{t}{2}\right\rceil+1\right)\right\rceil+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}, & \text { if } 0 \leq j \leq k_{t}-2  \tag{16}\\ T_{A}+\left\lceil\log _{2}(m-1)\right\rceil T_{X} & \text { if } k_{t}-1 \leq j \leq m-2\end{cases}
$$

By reusing the signals of $e_{j}^{(0)}$ 's, the coordinates of $\mathbf{e}^{(i)}$, for $1 \leq i \leq t$, can be obtained as

$$
e_{j}^{(i)}= \begin{cases}0, & \text { if } 0 \leq j \leq k_{i}-1,  \tag{17}\\ e_{j-k_{i}}^{(0)}, & \text { otherwise. }\end{cases}
$$

This results in the coordinates of $C=A B$ as

$$
c_{j}=d_{j}+ \begin{cases}e_{j}^{(0)} & \text { if } 0 \leq j \leq k_{1}-1  \tag{18}\\ e_{j}^{(0)}+e_{j}^{(1)} & \text { if } k_{1} \leq j \leq k_{2}-1 \\ \vdots & \vdots \\ e_{j}^{(0)}+e_{j}^{(1)}+\cdots+e_{j}^{(t-1)} & \text { if } k_{t-1} \leq j \leq k_{t}-1 \\ e_{j}^{(0)}+e_{j}^{(1)}+\cdots+e_{j}^{(t)} & \text { if } k_{t} \leq j \leq m-2 \\ e_{j}^{(1)}+e_{j}^{(2)}+\cdots+e_{j}^{(t)} & \text { if } j=m-1\end{cases}
$$

by using (14). To realize (18) in hardware, one requires $N_{2}=m+\left(k_{2}-k_{1}\right)+2\left(k_{3}-k_{2}\right)+\cdots+(t-1)\left(k_{t}-k_{t-1}\right)+t\left(m-k_{t}-1\right)+t-1=(t+1) m-\sum_{i=1}^{t} k_{i}-1$ XOR gates. Thus, the total number of XOR gates needed for the multiplier is $(m-1)^{2}+N_{1}+N_{2}=(m+t)(m-1)$.

To obtain the time delay of the proposed multiplier, we use a binary tree for each coordinate in (18). For $j \notin\left[k_{t}, m-2\right]$, it is seen in (18) that $T_{C} \leq\left\lceil\log _{2}(t+1)\right\rceil T_{X}+T\left(e_{0}^{(0)}\right)$ and the proof is complete by using (16). Now, we need only to obtain the time delay of the $c_{j}$'s for $k_{t} \leq j \leq m-2$. For $j \in\left[k_{t}, m-2\right]$, if we form $c_{j}=\left(d_{j}+e_{j}^{(0)}\right)+e_{j}^{(1)}+e_{j}^{(2)}+\cdots+e_{j}^{(t)}$ such that $d_{j}+e_{j}^{(0)}$ is calculated first, then

$$
\begin{aligned}
T\left(d_{j}+e_{j}^{(0)}\right) & \leq T_{A}+\left(1+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X} \\
& \leq T_{A}+\left(\left\lceil\log _{2}\left(\left\lceil\frac{t}{2}\right\rceil+1\right)\right\rceil+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}
\end{aligned}
$$

Also, using (17) and (16), one can see

$$
T\left(e_{j}^{(t)}\right) \leq T_{A}+\left(\left\lceil\log _{2}\left(\left\lceil\frac{t}{2}\right\rceil+1\right)\right\rceil+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}
$$

which implies that

$$
T_{C} \leq T_{A}+\left(\left\lceil\log _{2}(t+1)\right\rceil+\left\lceil\log _{2}\left(\left\lceil\frac{t}{2}\right\rceil+1\right)\right\rceil+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}
$$

and the proof is complete.

In addition to the three buses shown in Figure 1, there will now be another bus in the middle of the $\mathbf{Q}$-network for the signals $e_{j}^{(0)}$, $0 \leq j \leq k_{t}-2$. Thus, the total number of lines on the buses is $3 m+k_{t}-2$.
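Since (15), (17) and (18) express every output coordinate in terms of $\mathbf{d}$ and $\mathbf{e}$, they can be coded directly. The Python sketch below is an illustration only: the name `generic_multiply` and the bit-list representation (index $i$ holds the coefficient of $\alpha^{i}$) are assumptions, and the loops compute each $e_{j}^{(0)}$ and $c_{j}$ as plain XOR sums without modelling the binary-tree ordering used in the delay analysis.

```python
def generic_multiply(a, b, ks):
    """Product in GF(2^m) for P(x) = x^m + x^{k_t} + ... + x^{k_1} + 1, k_t <= m/2.

    Follows (15), (17) and (18); ks = (k_1, ..., k_t) in increasing order.
    """
    m = len(a)
    assert ks[0] >= 1 and 2 * ks[-1] <= m and list(ks) == sorted(set(ks))
    # IP-network: d = L b and e = U b from (2)-(3).
    prod = [0] * (2 * m - 1)
    for i in range(m):
        for j in range(m):
            prod[i + j] ^= a[i] & b[j]
    d, e = prod[:m], prod[m:]
    # Eq. (15): e0_j = e_j + sum of e_{j+m-k_i} over all i with j <= k_i - 2.
    e0 = [0] * m                       # e0[m-1] stays 0
    for j in range(m - 1):
        e0[j] = e[j]
        for k in ks:
            if j <= k - 2:
                e0[j] ^= e[j + m - k]
    # Eqs. (17)-(18): c_j = d_j + e0_j + sum of e0_{j-k_i} over all i with j >= k_i.
    c = []
    for j in range(m):
        cj = d[j] ^ e0[j]
        for k in ks:
            if j >= k:
                cj ^= e0[j - k]
        c.append(cj)
    return c


# alpha * alpha^7 = alpha^8 = 1 + alpha + alpha^3 + alpha^4 for x^8 + x^4 + x^3 + x + 1
a = [0, 1, 0, 0, 0, 0, 0, 0]
b = [0, 0, 0, 0, 0, 0, 0, 1]
print(generic_multiply(a, b, (1, 3, 4)))   # -> [1, 1, 0, 1, 1, 0, 0, 0]
```

A hardware description would use the same index ranges; only the grouping of the XOR operands would change to meet the delay bound of Theorem 3.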

Corollary 1. For $k_{1}=1$ and $t>1$, the time delay would reduce to

$$
T_{A}+\left(\left\lceil\log _{2}(t+1)\right\rceil+\left\lceil\log _{2}\left\lceil\frac{t}{2}\right\rceil\right\rceil+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}
$$

Based on the above results, one can obtain the time delay and the number of XOR gates for the trinomial based multiplier by substituting $t=1$ in Theorem 3 for $k_{1} \neq \frac{m}{2}$ and $s=\frac{m}{2}$ in Theorem 2 for $k_{1}=\frac{m}{2}$. Note that the results for $k_{1}=\frac{m}{2}$ are obtained using the implementation of the $\frac{m}{2}$-ESP based multiplier.
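The closed-form expressions of Theorem 3 and Corollary 1 are easy to evaluate for concrete parameters. The following Python sketch does so; the function names are hypothetical, and the AND gate count $m^{2}$ is the IP-network figure from Section 2.

```python
def ceil_log2(n):
    """Exact ceil(log2(n)) for an integer n >= 1."""
    return (n - 1).bit_length()


def theorem3_complexity(m, t, k1_is_one=False):
    """Return (N_A, N_X, XOR levels) from Theorem 3, or Corollary 1 if k_1 = 1 and t > 1.

    The time delay is T_A + levels * T_X.
    """
    n_and = m * m
    n_xor = (m + t) * (m - 1)
    half = (t + 1) // 2                                  # ceil(t / 2)
    if k1_is_one and t > 1:
        middle = ceil_log2(half)                         # Corollary 1
    else:
        middle = ceil_log2(half + 1)                     # Theorem 3
    levels = ceil_log2(t + 1) + middle + ceil_log2(m - 1)
    return n_and, n_xor, levels


print(theorem3_complexity(163, 3))         # -> (26569, 26892, 12)
print(theorem3_complexity(163, 3, True))   # -> (26569, 26892, 11)
print(theorem3_complexity(233, 1))         # -> (54289, 54288, 10)
```

For $t=1$ the XOR count reduces to $m^{2}-1$ and the XOR depth to $2+\left\lceil\log _{2}(m-1)\right\rceil$, and for $t=3$ the XOR count is $m^{2}+2 m-3$.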

## 5 Special Classes of Pentanomials

A polynomial with five non-zero coefficients, i.e., $P(x)=x^{m}+x^{k_{3}}+x^{k_{2}}+x^{k_{1}}+1$, where $1 \leq k_{1}<k_{2}<k_{3} \leq m-1$, is called a pentanomial of degree $m$. The non-zero constant term is due to the irreducibility property needed to define the field. In terms of the values of the $k_{i}$'s, the pentanomials can be divided into a number of different classes. Below we consider two special classes of irreducible pentanomials as proposed in [11].

### 5.1 Class 1: $k_{3} \leq \frac{m}{2}$

For this class of irreducible pentanomial where $k_{3} \leq \frac{m}{2}$, one can apply $t=3$ to the complexity results we have presented in Section 4. This yields the following.

Corollary 2. The gate counts and time delay of the multiplier for the pentanomial $P(x)=x^{m}+x^{k_{3}}+x^{k_{2}}+x^{k_{1}}+1$, where $1 \leq k_{1}<k_{2}<k_{3} \leq \frac{m}{2}$, are

$$
\begin{aligned}
N_{A} & =m^{2}, \\
N_{X} & =m^{2}+2 m-3, \\
T_{C} & =\left\{\begin{array}{l}
T_{A}+\left(3+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}, \text { if } k_{1}=1 \\
T_{A}+\left(4+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}, \text { otherwise, }
\end{array}\right.
\end{aligned}
$$

and the number of lines on the buses is $3 m+k_{3}-2$.

The number of XOR gates can be reduced if we choose a pentanomial such that $k_{1}=k_{3}-k_{2}$. Towards this, let us introduce the following set of new signals

$$
\begin{equation*}
e_{j}^{\prime}=e_{j+m-k_{3}}+e_{j+m-k_{2}}, 0 \leq j \leq k_{2}-2 . \tag{19}
\end{equation*}
$$

Equation (19) can be used to generate $e_{j}^{(0)}, 0 \leq j \leq k_{2}-2$, by substituting $t=3$ in (15) as follows

$$
e_{j}^{(0)}= \begin{cases}e_{j}+e_{j}^{\prime}+e_{j+m-k_{1}}, & \text { if } 0 \leq j \leq k_{1}-2  \tag{20}\\ e_{j}+e_{j}^{\prime} & \text { if } k_{1}-1 \leq j \leq k_{2}-2 \\ e_{j}+e_{j+m-k_{3}} & \text { if } k_{2}-1 \leq j \leq k_{3}-2 \\ e_{j} & \text { if } k_{3}-1 \leq j \leq m-2 \\ 0 & \text { if } j=m-1\end{cases}
$$

The total number of XOR gates needed to generate the $e_{j}^{(0)}$'s (see (20)) is $N_{1}=k_{1}+k_{2}+k_{3}-3$, of which $k_{2}-1$ are due to (19). Also, the maximum delay due to gates in (20) is

$$
T\left(e_{j}^{(0)}\right) \leq \begin{cases}T_{A}+\left(2+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X} & \text { if } 0 \leq j \leq k_{1}-2  \tag{21}\\ T_{A}+\left(1+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X} & \text { if } k_{1}-1 \leq j \leq k_{3}-2 \\ T_{A}+\left\lceil\log _{2}(m-1)\right\rceil T_{X} & \text { if } k_{3}-1 \leq j \leq m-1\end{cases}
$$

Lemma 1. With symbols defined as above, one has

$$
\begin{aligned}
& e_{j}^{(0)}+e_{j}^{(1)}=e_{j+k_{2}-m}^{\prime}, \text { for } m-k_{2} \leq j \leq m-2, \\
& e_{j}^{(2)}+e_{j}^{(3)}=e_{j-k_{2}}^{(0)}+e_{j-k_{2}}^{(1)}, \text { for } k_{3} \leq j \leq m-1 .
\end{aligned}
$$

Let us represent $e_{j}^{(01)}, 0 \leq j \leq m-1$, as the elements of $\left(\mathbf{Q}_{0}+\mathbf{Q}_{1}\right)^{T} \mathbf{e}$, where $\mathbf{Q}_{0}$ and $\mathbf{Q}_{1}$ are shown in Figure 4(a) and Figure 4(b), respectively. Then, substituting $t=3$ in the general case given in (18) and using the above lemma, we can obtain the coordinates of $C=A B$ as follows:

$$
\begin{equation*}
c_{j}=d_{j}+e_{j}^{(01)}+e_{j-k_{2}}^{(01)}, 0 \leq j \leq m-1, \tag{22}
\end{equation*}
$$

where $e_{j-k_{2}}^{(01)}=0$ for $j<k_{2}$, and

$$
e_{j}^{(01)}= \begin{cases}e_{j}^{(0)} & \text { if } 0 \leq j \leq k_{1}-1  \tag{23}\\ e_{j}^{(0)}+e_{j}^{(1)} & \text { if } k_{1} \leq j \leq m-k_{2}-1 \\ e_{j+k_{2}-m}^{\prime} & \text { if } m-k_{2} \leq j \leq m-2 \\ e_{j}^{(1)} & \text { if } j=m-1\end{cases}
$$

As seen in (23), one has to realize $e_{j}^{(0)}+e_{j}^{(1)}$ for all $k_{1} \leq j \leq m-k_{2}-1$, which requires $m-k_{2}-k_{1}$ XOR gates. Once the $e_{j}^{(01)}$'s are obtained, equation (22) requires $2 m-k_{2}$ XOR gates. Thus, the total number of XOR gates needed for the multiplier is $(m-1)^{2}+N_{1}+m-k_{2}-k_{1}+2 m-k_{2}=m^{2}+m+k_{1}-2$, where the last step uses $k_{3}-k_{2}=k_{1}$. Due to the reuse of the terms $e_{j}^{\prime}, 0 \leq j \leq k_{2}-2$, and $e_{j}^{(0)}+e_{j}^{(1)}, k_{1} \leq j \leq m-k_{2}-1$, the additional lines needed on the bus in the $\mathbf{Q}$-network are $\left(k_{2}-1\right)$ and $\left(m-k_{1}-k_{2}\right)$, respectively. Thus, the total number of lines on the buses is increased to $4 m+k_{2}-3$.
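The signal reuse described by (19), (20), (23) and (22) can be sketched in Python as follows. The function name `class1_multiply` and the bit-list representation are assumptions for illustration; the code follows the index ranges of the equations and does not model gate-level timing.

```python
def class1_multiply(a, b, k1, k2, k3):
    """Product for P(x) = x^m + x^k3 + x^k2 + x^k1 + 1 with k3 - k2 = k1 and k3 <= m/2.

    Follows (19), (20), (23) and (22).
    """
    m = len(a)
    assert 1 <= k1 < k2 < k3 and 2 * k3 <= m and k3 - k2 == k1
    # IP-network: d = L b and e = U b from (2)-(3).
    prod = [0] * (2 * m - 1)
    for i in range(m):
        for j in range(m):
            prod[i + j] ^= a[i] & b[j]
    d, e = prod[:m], prod[m:]
    # Eq. (19): shared signals e'_j, 0 <= j <= k2-2.
    ep = [e[j + m - k3] ^ e[j + m - k2] for j in range(k2 - 1)]
    # Eq. (20): e0_j, reusing e'_j; e0[m-1] stays 0.
    e0 = [0] * m
    for j in range(m - 1):
        if j <= k1 - 2:
            e0[j] = e[j] ^ ep[j] ^ e[j + m - k1]
        elif j <= k2 - 2:
            e0[j] = e[j] ^ ep[j]
        elif j <= k3 - 2:
            e0[j] = e[j] ^ e[j + m - k3]
        else:
            e0[j] = e[j]
    # Eq. (23): e01_j, with e1_j = e0_{j-k1} taken from (17).
    e01 = [0] * m
    for j in range(m):
        if j < k1:
            e01[j] = e0[j]
        elif j <= m - k2 - 1:
            e01[j] = e0[j] ^ e0[j - k1]
        elif j <= m - 2:
            e01[j] = ep[j + k2 - m]      # Lemma 1: reuse of e'
        else:
            e01[j] = e0[j - k1]
    # Eq. (22): c_j = d_j + e01_j + e01_{j-k2}, the last term being 0 for j < k2.
    return [d[j] ^ e01[j] ^ (e01[j - k2] if j >= k2 else 0) for j in range(m)]


# alpha * alpha^7 = alpha^8 = 1 + alpha + alpha^3 + alpha^4 for x^8 + x^4 + x^3 + x + 1
a = [0, 1, 0, 0, 0, 0, 0, 0]
b = [0, 0, 0, 0, 0, 0, 0, 1]
print(class1_multiply(a, b, 1, 3, 4))   # -> [1, 1, 0, 1, 1, 0, 0, 0]
```

The factorisation $1+x^{k_{1}}+x^{k_{2}}+x^{k_{3}}=\left(1+x^{k_{1}}\right)\left(1+x^{k_{2}}\right)$, valid when $k_{3}=k_{1}+k_{2}$, is what allows the single intermediate vector $e_{j}^{(01)}$ to be used twice in (22).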

To obtain the time delay of the proposed multiplier, we use Table 1, which shows the maximum delay of the signals used in (22) for the given ranges of $j$ in each row. In this table, $i, 0 \leq i \leq 4$, represents the time delay of $T_{A}+\left(i+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}$, and the numbers inside brackets are for $k_{1}=1$. Also, $x$ marks which of $e_{j}^{(01)}$ and $e_{j-k_{2}}^{(01)}$ is added to $d_{j}$ first to obtain $c_{j}$. In each row of this table, the delays are obtained for the first value of $j$ in the given range. This is because, as $j$ increases, the time delays of the signals used in each row of this table decrease. As seen in this table, the maximum delay of the multiplier is $T_{A}+\left(4+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}$. For $k_{1}=1$, only one signal, i.e., $c_{k_{3}}$, has the delay of $T_{A}+\left(4+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}$. One can reduce this delay to $T_{A}+\left(3+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}$ if only $c_{k_{3}}$ is realized as $c_{k_{3}}=\left(\left(d_{k_{3}}+e_{k_{3}}^{(0)}\right)+e_{k_{3}}^{(1)}\right)+e_{k_{3}-k_{2}}^{(01)}$ by using one extra XOR gate.

Table 1. Maximum time delays of the signals, where $i, 0 \leq i \leq 4$, represents the time delay of $T_{A}+\left(i+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}$, numbers inside brackets are for $k_{1}=1$, and $x$ determines either $e_{j}^{(01)}$ or $e_{j-k_{2}}^{(01)}$ to be added first with $d_{j}$.

| $j$ | $e_{j}^{(0)}$ | $e_{j}^{(1)}$ | $e_{j}^{(01)}$ | $e_{j-k_{2}}^{(01)}$ | $d_{j}+x$ | $c_{j}$ |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| $0 \leq j \leq k_{1}-1$ | $2(1)$ | - | $2(1), x$ | - | 3 | 3 |
| $k_{1} \leq j \leq k_{2}-1$ | 1 | $2(1)$ | $3(2), x$ | - | $4(3)$ | $4(3)$ |
| $k_{2} \leq j \leq k_{3}-1$ | 1 | $2(1)$ | $3(2)$ | $2(1), x$ | $3(2)$ | $4(3)$ |
| $k_{3} \leq j \leq k_{3}+k_{1}-1$ | 0 | 1 | $2, x$ | $3(2)$ | 3 | 4 |
| $k_{3}+k_{1} \leq j \leq m-k_{2}-1$ | 0 | 0 | $1, x$ | $3(2)$ | 2 | $4(3)$ |
| $m-k_{2} \leq j \leq m-2$ | 0 | 0 | $1, x$ | $3(2)$ | 2 | $4(3)$ |
| $j=m-1$ | - | 0 | $1, x$ | 1 | 2 | 3 |

Based on the above results, we can state the following.

Theorem 4. The gate counts and time delay of the multiplier based on the pentanomial $P(x)=x^{m}+x^{k_{3}}+x^{k_{2}}+x^{k_{1}}+1$, where $1 \leq k_{1}<k_{2}<k_{3} \leq \frac{m}{2}$ and $k_{3}-k_{2}=k_{1}$, are

$$
N_{A}=m^{2}
$$

$$
\begin{aligned}
N_{X} & = \begin{cases}m^{2}+m & \text { if } k_{1}=1 \\
m^{2}+m+k_{1}-2 & \text { otherwise },\end{cases} \\
T_{C} & =\left\{\begin{array}{l}
T_{A}+\left(3+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}, \text { if } k_{1}=1 \\
T_{A}+\left(4+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}, \text { otherwise },
\end{array}\right.
\end{aligned}
$$

and the number of lines on the buses is $4 m+k_{2}-3$.

Remark 1. To verify that class 1 irreducible pentanomials exist, we have used a Maple ${ }^{\mathrm{TM}}$ program for $m \in[160,600]$ and have found that at least one irreducible pentanomial exists for each $m$ in this range. This is of interest to elliptic curve cryptosystem designers. In order to minimise the number of XOR gates of the multiplier, we have obtained irreducible pentanomials such that $k_{1}$ is minimum. We have also observed that $k_{1}$ is less than or equal to six for all $m$ in the above mentioned range.

It is noted that the pentanomial presented in [7] is a special case when $k_{1}=1$.

### 5.2 Class 2: $m-k_{3}=k_{3}-k_{2}=k_{2}-k_{1}=s, \frac{m-1}{8} \leq s \leq \frac{m-1}{3}$

We refer to polynomials $P(x)=x^{m}+x^{k_{3}}+x^{k_{2}}+x^{k_{1}}+1$, where $1 \leq k_{1}<k_{2}<k_{3} \leq m-1$ and $m-k_{3}=k_{3}-k_{2}=k_{2}-k_{1}=s$, as class 2 type. Similar to the other special irreducible polynomials, here we first obtain the corresponding reduction matrix. Then the coordinates and complexities of the multiplier can be obtained. Based on the values of $s$ (or $k_{1}=m-3 s$), we can divide the reduction matrix into different forms. Because of lack of space, only three of them are presented here. These $\mathbf{Q}$ matrices for $\frac{m-1}{8} \leq s \leq \frac{m-1}{3}$ (or $1 \leq k_{1} \leq 5 s+1$) are shown in Figure 5. Based on this figure, we can state the following theorem.


Fig. 5. Graphical representations of the reduction matrix $\mathbf{Q}$ for class 2 pentanomials $P(x)=x^{m}+x^{k_{3}}+x^{k_{2}}+x^{k_{1}}+1$, where $m-k_{3}=k_{3}-k_{2}=k_{2}-k_{1}=s$. (a) $\frac{m-1}{4} \leq s \leq \frac{m-1}{3}$ or $1 \leq k_{1} \leq s+1$ (see Figure 2(a) for $k_{1}=s$ ), (b) $\frac{m-1}{5} \leq s<\frac{m-1}{4}$ or $s+1<k_{1} \leq 2 s+1$, (c) $\frac{m-1}{8} \leq s<\frac{m-1}{5}$ or $2 s+1<k_{1} \leq 5 s+1$.

Theorem 5. The gate counts and the time delay of the multiplier for the pentanomial $P(x)=x^{m}+x^{m-s}+x^{m-2 s}+x^{m-3 s}+1$, for $\frac{m-1}{8} \leq s \leq \frac{m-1}{3}$, are

$$
\begin{aligned}
N_{A} & =m^{2}, \\
N_{X} & = \begin{cases}m^{2}+m-s-1, & \text { if } \frac{m-1}{4} \leq s \leq \frac{m-1}{3}, \\
m^{2}+2 m-5 s-2, & \text { if } \frac{m-1}{5} \leq s<\frac{m-1}{4}, \\
m^{2}+m-2, & \text { if } \frac{m-1}{8} \leq s<\frac{m-1}{5},\end{cases} \\
T_{C} & = \begin{cases}T_{A}+\left(3+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}, & \text { if } \frac{m-1}{5} \leq s \leq \frac{m-1}{3}, \\
T_{A}+\left(4+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}, & \text { otherwise. }\end{cases}
\end{aligned}
$$
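For a given pair $(m, s)$, the applicable case of Theorem 5 can be selected with integer comparisons, which avoids rounding issues in the fractional bounds. The Python sketch below is illustrative only; the function name `class2_complexity` is hypothetical, and it does not test whether the resulting pentanomial is irreducible.

```python
def class2_complexity(m, s):
    """Return (N_A, N_X, XOR levels) from Theorem 5 for
    P(x) = x^m + x^(m-s) + x^(m-2s) + x^(m-3s) + 1.

    The time delay is T_A + levels * T_X.
    """
    # Integer forms of the bounds (m-1)/8 <= s <= (m-1)/3.
    assert 8 * s >= m - 1 and 3 * s <= m - 1
    log_term = (m - 2).bit_length()          # ceil(log2(m-1)) for m >= 2
    if 4 * s >= m - 1:                       # (m-1)/4 <= s <= (m-1)/3
        n_xor, extra = m * m + m - s - 1, 3
    elif 5 * s >= m - 1:                     # (m-1)/5 <= s < (m-1)/4
        n_xor, extra = m * m + 2 * m - 5 * s - 2, 3
    else:                                    # (m-1)/8 <= s < (m-1)/5
        n_xor, extra = m * m + m - 2, 4
    return m * m, n_xor, extra + log_term


print(class2_complexity(8, 2))    # -> (64, 69, 6)
print(class2_complexity(10, 2))   # -> (100, 108, 7)
print(class2_complexity(8, 1))    # -> (64, 70, 7)
```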

Remark 2. Using Maple ${ }^{\mathrm{TM}}$, we have found that there exist 147 values of $m$, where $m \in[160,600]$, such that the polynomial $P(x)=x^{m}+x^{m-s}+x^{m-2 s}+x^{m-3 s}+1$, $1 \leq s \leq \frac{m-1}{3}$, is irreducible. Among them, only 23 have $1 \leq s<\frac{m-1}{8}$.

Table 2. Comparison of related polynomial basis multipliers.

| Reference | Special Case | \#XOR | Time delay |
| :---: | :---: | :---: | :---: |
| $P(x)=x^{n s}+x^{(n-1) s}+\cdots+x^{s}+1, m=n s$ |  |  |  |
| This paper, [2], [11] |  | $m^{2}-s$ | $T_{A}+\left(1+\left\lceil\log _{2} m\right\rceil\right) T_{X}$ |
| $P(x)=x^{m}+x^{k}+1$ |  |  |  |
| [9], [2], [11] | $k=1$ | $m^{2}-1$ | $T_{A}+\left(1+\left\lceil\log _{2} m\right\rceil\right) T_{X}$ |
| [9], [2], [11] | $1<k \leq \frac{m}{2}$ | $m^{2}-1$ | $T_{A}+\left(2+\left\lceil\log _{2} m\right\rceil\right) T_{X}$ |
| This paper, [10] | $1 \leq k \leq \frac{m}{2}$ | $m^{2}-1$ | $T_{A}+\left(2+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}$ |
| $P(x)=x^{m}+x^{k_{t}}+\cdots+x^{k_{2}}+x^{k_{1}}+1,1 \leq k_{1}<k_{2}<\cdots<k_{t} \leq \frac{m}{2}$ |  |  |  |
| [11] | - | $(m+t)(m-1)$ | $T_{A}+\left(2 t+\left\lceil\log _{2} m\right\rceil\right) T_{X}$ |
| This paper | $t>1$ | $(m+t)(m-1)$ | $T_{A}+\left(\left\lceil\log _{2}\left(\left\lceil\frac{t}{2}\right\rceil+1\right)\right\rceil+\left\lceil\log _{2}(t+1)\right\rceil+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}$ |
| $P(x)=x^{m}+x^{k_{3}}+x^{k_{2}}+x^{k_{1}}+1,1 \leq k_{1}<k_{2}<k_{3} \leq \frac{m}{2}$ |  |  |  |
| [11] | $k_{1} \geq 1$ | $m^{2}+2 m-3$ | $T_{A}+\left(6+\left\lceil\log _{2} m\right\rceil\right) T_{X}$ |
| This paper | $k_{1}>1$ | $m^{2}+2 m-3$ | $T_{A}+\left(4+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}$ |
| This paper | $k_{1}=1$ | $m^{2}+2 m-3$ | $T_{A}+\left(3+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}$ |
| This paper | $k_{3}-k_{2}=k_{1}$ | $m^{2}+m+k_{1}-2$ | $T_{A}+\left(4+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}$ |
| [7] | $k_{3}-k_{2}=k_{1}=1$ | $m^{2}+m+2 k_{2}$ | $T_{A}+\left(3+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}$ |
| This paper | $k_{3}-k_{2}=k_{1}=1$ | $m^{2}+m$ | $T_{A}+\left(3+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}$ |
| This paper, [7] | $k_{i}=i$ | $m^{2}+m$ | $T_{A}+\left(3+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}$ |
| $P(x)=x^{m}+x^{m-s}+x^{m-2 s}+x^{m-3 s}+1$ |  |  |  |
| [11] | $1 \leq s \leq \frac{m-1}{3}$ | $m^{2}+4 m-5 s-5$ | $T_{A}+\left(\left\lfloor\frac{d}{4}\right\rfloor+4+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}$ |
|  | $s \leq \frac{m-1}{3}$ | $\geq m^{2}+2.33 m-7$ | $\geq T_{A}+\left(4+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}$ |
| This paper | $\frac{m-1}{8} \leq s \leq \frac{m-1}{3}$ | $\leq m^{2}+m$ | $\leq T_{A}+\left(4+\left\lceil\log _{2}(m-1)\right\rceil\right) T_{X}$ |
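The two delay expressions listed in Table 2 for the generic polynomial with $t$ middle terms can be compared numerically. The Python sketch below is an illustration under stated assumptions: the function names are invented here, the $2t+\lceil\log_{2} m\rceil$ expression is the one the table lists for [11], and the three-logarithm expression of the $t>1$ row is taken to be the one for the multiplier of this paper, as the discussion in Section 6 suggests. Both functions return the number of XOR levels that follow the AND level $T_{A}$.

```python
def ceil_log2(n: int) -> int:
    """Exact ceil(log2(n)) for an integer n >= 1."""
    return (n - 1).bit_length()


def xor_levels_ref11(m: int, t: int) -> int:
    """XOR levels of the generic row listed for [11]: 2t + ceil(log2 m)."""
    return 2 * t + ceil_log2(m)


def xor_levels_proposed(m: int, t: int) -> int:
    """XOR levels of the generic t > 1 row:
    ceil(log2(ceil(t/2) + 1)) + ceil(log2(t + 1)) + ceil(log2(m - 1))."""
    half_t = (t + 1) // 2                   # ceil(t / 2)
    return ceil_log2(half_t + 1) + ceil_log2(t + 1) + ceil_log2(m - 1)
```

For $t=3$ the two functions reduce to $6+\lceil\log_{2} m\rceil$ and $4+\lceil\log_{2}(m-1)\rceil$, which are the pentanomial delays listed in the table; with the illustrative value $m=163$ they evaluate to 14 and 12 XOR levels.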

## 6 Complexity Results and Concluding Remarks

In this article, time and space complexities of bit parallel multipliers for $G F\left(2^{m}\right)$ have been considered. A comparison of our newly derived gate counts and delays with those of existing ones is shown in Table 2. As seen in this table, for the trinomial $x^{m}+x+1$, the multiplier of Figure 1 has one additional XOR gate delay compared to the best one available in the literature, i.e., [2, 11]. However, our results for the ESPs and trinomials $(k \neq 1)$ match the corresponding best results available ([2, 11] and [9]). Also, the resultant gate and time complexities for trinomials match those presented in [10].

Table 3. Comparison of the structure of Figure 1 with the Mastrovito multiplier in terms of the number of lines on the buses.

| Multipliers | \# Lines on the buses |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | trinomial | $s$-ESP | pentanomial | generic |
| Mastrovito [4] | $3 m-1$ | $\frac{m(m-s)}{2 s}+2 m$ | $5 m-3$ | $(t+2)(m-1)+2$ |
| This paper | $3 m-1$ | $2 m+s$ | $\leq 4 m+k_{2}$ | $3 m+k_{t}-2$ |

For a more generic irreducible polynomial as discussed in Section 4, the multiplier in Figure 1 has the same gate count but a shorter time delay compared to [11]. For class 1 pentanomials, this multiplier is faster than [11] and has fewer XOR gates if the special case of $k_{3}-k_{2}=k_{1}$ is used. This proposed special case of class 1 covers the case of pentanomials reported in [7] where $k_{1}=1$. Compared to the multiplier proposed in [7], the multiplier discussed in this paper for the special case of $k_{1}=k_{3}-k_{2}=1$ has $2 k_{2}$ fewer XOR gates, and it matches the one proposed in [7] for $k_{1}=1$ and $k_{2}=2$. Also, for class 2 pentanomials, our multiplier is either faster or has the same gate delay and has at least $1.33 m-7$ fewer XOR gates than the multiplier reported in [11].
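The saving quoted for class 2 pentanomials follows from the corresponding rows of Table 2, by subtracting the upper bound on the XOR count of the present multiplier from the lower bound listed for [11]:

$$
\underbrace{\left(m^{2}+2.33 m-7\right)}_{\text{lower bound, [11]}}-\underbrace{\left(m^{2}+m\right)}_{\text{upper bound, this paper}}=1.33 m-7 .
$$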

In a VLSI implementation, in addition to the gate counts, the number of lines on the buses is an important parameter that determines the space complexity and consequently the actual time delay. Table 3 compares this metric of the proposed architecture with that of the Mastrovito multiplier [4]. As shown in this table, the architectures discussed here have fewer lines on the buses than the well-known Mastrovito multiplier.
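To make the entries of Table 3 concrete, the following Python sketch evaluates them for given parameters. It is only an illustration: the function names are assumptions made here, each function returns the pair (Mastrovito, this paper), and for pentanomials the second value is the upper bound $4 m+k_{2}$ rather than an exact count.

```python
def bus_lines_trinomial(m: int) -> tuple[int, int]:
    """Trinomial column of Table 3: 3m - 1 for both architectures."""
    return 3 * m - 1, 3 * m - 1


def bus_lines_esp(m: int, s: int) -> tuple[int, int]:
    """s-ESP column of Table 3, for m = n*s."""
    if m % s != 0:
        raise ValueError("an s-ESP requires m to be a multiple of s")
    n = m // s
    # m(m - s) / (2s) equals s*n*(n - 1)/2, which is always an integer.
    mastrovito = s * n * (n - 1) // 2 + 2 * m
    proposed = 2 * m + s
    return mastrovito, proposed


def bus_lines_pentanomial(m: int, k2: int) -> tuple[int, int]:
    """Pentanomial column of Table 3; the second value is an upper bound."""
    return 5 * m - 3, 4 * m + k2


def bus_lines_generic(m: int, t: int, k_t: int) -> tuple[int, int]:
    """Generic column of Table 3 for a polynomial with t middle terms."""
    return (t + 2) * (m - 1) + 2, 3 * m + k_t - 2
```

With the illustrative values $m=163$ and $k_{2}=6$, `bus_lines_pentanomial(163, 6)` gives 812 lines for the Mastrovito structure against at most 658 for the structure of Figure 1.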

Acknowledgements. This work has been supported in part by an NSERC postdoctoral fellowship awarded to A. Reyhani-Masoleh and in part by an NSERC grant awarded to M. A. Hasan.

## References

1. G. B. Agnew, R. C. Mullin, and S. A. Vanstone. "An Implementation of Elliptic Curve Cryptosystems Over $F_{2^{155}}$". IEEE J. Selected Areas in Communications, 11(5):804-813, June 1993.
2. A. Halbutogullari and C. K. Koc. "Mastrovito Multiplier for General Irreducible Polynomials". IEEE Transactions on Computers, 49(5):503-518, May 2000.
3. E. D. Mastrovito. "VLSI Designs for Multiplication over Finite Fields $G F\left(2^{m}\right)$ ". In LNCS-357, Proc. AAECC-6, pages 297-309, Rome, July 1988. Springer-Verlag.
4. E. D. Mastrovito. VLSI Architectures for Computation in Galois Fields. PhD thesis, Linköping Univ., Linköping, Sweden, 1991.
5. A.J. Menezes, I.F. Blake, X. Gao, R.C. Mullin, S.A. Vanstone, and T. Yaghoobian. Applications of Finite Fields. Kluwer Academic Publishers, 1993.
6. A. Reyhani-Masoleh and M. A. Hasan. "A New Efficient Architecture of Mastrovito Multiplier over $G F\left(2^{m}\right)$ ". In $20^{\text {th }}$ Biennial Symposium on Communications, pages 59-63, Kingston, Ontario, Canada, May 2000.
7. F. Rodriguez-Henriquez and C. K. Koc. "Parallel Multipliers Based on Special Irreducible Pentanomials". IEEE Transactions on Computers, to appear, 2003, available at http://islab.oregonstate.edu/koc/Publications.html.
8. L. Song and K. K. Parhi. "Low Complexity Modified Mastrovito Multipliers over Finite Fields $G F\left(2^{M}\right)$ ". In ISCAS-99, Proc. IEEE International Symposium on Circuits and Systems, pages 508-512, 1999.
9. B. Sunar and C. K. Koc. "Mastrovito Multiplier for All Trinomials". IEEE Transactions on Computers, 48(5):522-527, May 1999.
10. H. Wu. "Bit-Parallel Finite Field Multiplier and Squarer Using Polynomial Basis". IEEE Transactions on Computers, 51(7):750-758, July 2002.
11. T. Zhang and K. K. Parhi. "Systematic Design of Original and Modified Mastrovito Multipliers for General Irreducible Polynomials". IEEE Transactions on Computers, 50(7):734-748, July 2001.

[^0]:    ${ }^{1}$ In this paper, vectors and matrices are shown with small and capital bold faces, respectively.

