Data flow algorithms for processors with vector extensions: Handling actors with internal state

Full use of the parallel computation capabilities of present and expected CPUs and GPUs requires use of vector extensions. Yet many actors in data flow systems for digital signal processing have internal state (or, equivalently, an edge that loops from the actor back to itself) that imposes serial dependencies between actor invocations, making vectorization across actor invocations impossible. Ideally, issues of inter-thread coordination required by serial data dependencies should be handled by code written by parallel programming experts that is separate from code specifying signal processing operations. The purpose of this paper is to present one approach for doing so in the case of actors that maintain state. We propose a methodology for using the parallel scan (also known as prefix sum) pattern to create algorithms for multiple simultaneous invocations of such an actor that result in vectorizable code. Two examples of applying this methodology are given: (1) infinite impulse response filters and (2) finite state machines. The correctness and performance of the resulting IIR filters are studied.


I. INTRODUCTION
Processors available for embedded systems increasingly provide two forms of data parallelism. One form is the use of a graphics processing unit (GPU) for compute purposes [1]. The other is the increase in the bit width of vectors of integer or floating point numbers that can be operated on by a single instruction in a single instruction, multiple data (SIMD) fashion. For example, Intel and AMD have added the Advanced Vector Extensions (AVX) [2], [3] capability to their processors. Some ARM designs include NEON, a similar vector SIMD instruction set extension. Such vector extensions have been found to increase the speed of numerically intensive tasks reasonably similar to signal processing while reducing power consumption per computation [4]. Multicore CPUs with such vector extensions have been found (by workers examining numerical simulation workloads not too dissimilar from DSP workloads) to achieve approximately one third of the price-performance (i.e., gigaflops per dollar) and power-performance (i.e., gigaflops per watt) of a GPU external to the CPU [5], without the real estate requirements of an external GPU. Both GPUs and vector extensions provide fine-grained parallelism in the form of the ability to fetch or store multiple data items from or to evenly strided memory locations and to perform several instances of the same mathematical operation on different data with one machine instruction.
There are three options for programming vector extensions. One is assembly language, which is so time-consuming that it is rarely practical. The second is to code in standard imperative languages (e.g., C, C++) and compile using vectorizing compilers. Such compilers attempt to locate parallelism in inner loops that have no data dependencies between loop iterations, or few enough dependencies of the right sort, that vector instructions can be generated to compute the results of several adjacent loop iterations at once. If such an attempt is successful, we say that the loop has been vectorized. However, different compilers, and different versions of the same compiler, differ wildly in their ability to successfully vectorize the same loop [6]. The third option is to write the most compute-intensive loops in a kernel language such as OpenCL [7] or CUDA [8]. In such languages, parallelism is made explicit by separating the specification of computations that may proceed in parallel (called "work items" in OpenCL) from the specification of both the size and location of the data on which to compute. A compiler for a kernel language can produce code that uses vector extensions to execute several work items in parallel. Kernel languages have therefore been found to take better advantage of both vector extensions and opportunities for multithreading than do vectorizing compilers [9]. Furthermore, the only way to program GPUs is using either OpenCL or CUDA.

Now consider using the parallelism made usable by implementing data flow actors in a kernel language. Suppose an actor maintains no internal state, that is, never writes to any variables that may be read during a later firing of the actor. Then a kernel language implementation that effectively obtains parallelism is straightforward. First, a schedule is chosen where the actor is to be fired some number N times in a row. That portion of the schedule is then implemented by invoking a kernel implementing the actor functionality, where the work items process the N data items, with no synchronization or passing of data required between them.
This straightforward strategy breaks down when the actor has an edge that is a self-loop, that is, an edge that begins and ends at that actor. Suppose that data placed on the edge are specified to arrive at the next execution of the actor. Then this self-loop is equivalent to the actor having one or more internal variables, that is, internal state, maintained within the actor. Indeed, some dataflow formalisms provide the appearance of internal state as a convenience for programmers while implementing it as a self-loop. Even though many data flow formalisms do not permit internal state, below we write about actors with internal state with the understanding that they can be specified in those formalisms with the aid of self-loops.
Having internal state means that past values of the actor's inputs may influence all future outputs. Two simple examples of this (which will be referred to repeatedly below) are an actor that represents an infinite impulse response (IIR) filter and an actor that implements a finite state machine (FSM). In the case of an IIR filter, the state is some fixed number of previous filter output values. In the case of an FSM, it is the current state. One may not simply invoke a kernel implementing the actor on N data items, because the actor's operation on the ith data item depends on the state after the completion of the actor's computation on the (i − 1)st data item.
Kernel languages include coordination mechanisms such as barriers and fences to permit information to be passed between work items via shared memory. However, correctly programming using such mechanisms is time-consuming and error-prone. The clarity of the code is also reduced because the desired signal processing functionality is mixed in with these synchronization considerations. Ideally, issues of inter-workgroup coordination required by serial data dependencies should be handled by code written by parallel programming experts that is separate from code specifying signal processing operations. The purpose of this paper is to present one approach for doing so in the case of actors that maintain state.
We do not consider the problem of arriving at a schedule that permits the large number of consecutive actor firings that allow such vectorized code to run effectively. Such scheduling is considered in [10].

II. PROPOSED METHODOLOGY
The approach proposed here is based on identifying the transformation that an actor performs on its state variables as a function of each input.If any two such transformations can be composed with reasonable efficiency, then there is a parallel method for composing a large number of transformations with little additional overhead.
To formalize this notion, let x_1, x_2, ... be the scalar- or vector-valued inputs to the actor at the first, second, and subsequent invocations of the actor. Let s_0 be the initial values of the variables local to the actor that represent its internal state, and let s_i represent the values of those variables after the actor reads and processes input x_i. Without loss of generality, the outputs of the actor are either (1) a subset of the state variables or (2) computed from the state variables simply and quickly. The mapping performed by the actor can then be written as a function f where s_i = f(s_{i−1}, x_i). Or, once input x_i is available, we can write s_i = f_i(s_{i−1}), where f_i(s) = f(s, x_i). Now suppose that the actor is scheduled to run N times successively and that the first invocation of this block of N is the kth one during the program execution. Then the successive values of the state variables are f_k(s_{k−1}), then (f_{k+1} ∘ f_k)(s_{k−1}), and so on, where ∘ represents functional composition. The final value of the state variables is (f_{k+N−1} ∘ ⋯ ∘ f_k)(s_{k−1}). Because the actor outputs are among the state variables, all of these functional compositions must be computed. However, since functional composition is an associative operation, the parallel prefix [11], also called parallel scan [12], pattern can be applied to yield a parallel algorithm for computing all of the required functional compositions at once. The details of the scan algorithm are beyond the scope of this article; they may be found in the references cited in this paragraph. What is important about scan for the present purposes is:
• Given objects a_1, a_2, ..., a_n from a set C and an associative binary operation ⊕ on C, scan computes a_1, a_1 ⊕ a_2, a_1 ⊕ a_2 ⊕ a_3, ..., a_1 ⊕ ⋯ ⊕ a_n. (Some implementations also require a neutral element e of the same type as the a_i's, such that e ⊕ a = a for all a ∈ C.)
• When there are sufficient processors or parallel lanes in vector mode instructions, the parallel execution of the scan requires O(log n) time using O(n) calls to the ⊕ function.
• There are a number of implementations of scan available in kernel languages [13]–[16]. These implementations encapsulate the barriers, fences, and multiple kernel invocations required in a correct and efficient implementation of scan, and provide an interface through which the using programmer provides the ⊕ function. Generally the ⊕ function is relatively straightforward to write because it must not have dependencies on variables other than its arguments and so cannot use barriers, fences, or other inter-thread synchronization or communication.
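To fix the semantics of the scan operation used throughout this paper, the following minimal serial sketch may help. It is illustrative only (the names `inclusive_scan` and `op` are ours, not from any cited library); a kernel-language implementation computes the same prefixes in O(log n) parallel depth.

```python
# Serial reference for inclusive scan with a user-supplied associative
# operator "op" (the ⊕ of the text). Returns all prefixes
# a_1, a_1⊕a_2, ..., a_1⊕...⊕a_n.
def inclusive_scan(items, op):
    results = []
    acc = None
    for a in items:
        # Fold the next element into the running prefix.
        acc = a if acc is None else op(acc, a)
        results.append(acc)
    return results

# Example: running sums, since + is associative.
print(inclusive_scan([1, 2, 3, 4], lambda x, y: x + y))  # [1, 3, 6, 10]
```

The only requirement the pattern places on `op` is associativity; the examples below supply functional composition for `op`.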
The basic idea of the remainder of this paper is to use the functional composition operator ∘ as ⊕ and to apply scan to compute the chains of functional compositions f_k, f_{k+1} ∘ f_k, and so on. This results in the following methodology for producing algorithms suitable for implementing actors for processors with vector features using kernel languages.
1) Identify a representation for the state space s.
2) Identify a representation for the state transformations f_i, including how to create f_i given actor input x_i. The representation must also be suitable for representing all possible compositions of the state transformations.
3) Create an efficient algorithm for composing the state transformations.
The reason that this is a methodology and not an algorithm is that some creativity is often required in order to arrive at a suitable representation and composition algorithm. Also, there is no guarantee that the resulting parallel algorithm will yield a net speedup.

Now we present two examples of applying this methodology, namely scan-based parallel actors for IIR filters and finite state machines.

III. EXAMPLE 1: IIR FILTERS

Consider an IIR filter whose ith output given inputs x[i], x[i−1], ... is

y[i] = Σ_{q=1}^{Q} a_q y[i−q] + Σ_{p=0}^{P−1} b_p x[i−p].

The state required to compute the succeeding output y[i] contains the previous Q outputs and P − 1 previous inputs, or as a vector, s_{i−1} = (y[i−1], ..., y[i−Q], x[i−1], x[i−2], ..., x[i−P+1], 1)^T. (The reason for the constant, final element 1 is so that an affine operation can be represented in a matrix.) The transformation f_i can be represented by a matrix M(x_i), a function of x_i, whose rth row tells how to compute the value of the rth state variable after the ith firing of the IIR filter actor. It can be verified by substitution that this matrix does map the state vector at step i − 1 to that at step i, i.e., s_i = M(x_i) s_{i−1}. Since the composition of the transformations represented by two matrices is their matrix product, composition of the transformations f is performed by multiplication of the matrix representations M. That is, for any m > 0, the composition f_{i+m} ∘ ⋯ ∘ f_i is represented by the product M(x_{i+m}) ⋯ M(x_i).

This yields the following algorithm for an IIR filter actor. At application start time, the state vector s of the actor is initialized. Suppose that the actor is invoked N successive times on input data x_k, ..., x_{k+N−1}. Then N outputs y_k, ..., y_{k+N−1} and the updated value of s are computed thus:
1) Compute M(x_k), ..., M(x_{k+N−1}) in parallel. (There are no data dependencies among these computations.)
2) Apply the scan operation to the matrices, where the binary operator is matrix multiplication. Call the results R_k, ..., R_{k+N−1}, where R_i = M(x_i) ⋯ M(x_k).
3) For i = k, ..., k + N − 1, let output y_i be the first element of R_i s.
4) Update s to be R_{k+N−1} s.
Note that although the matrices M are sparse, in general the matrices R will have full upper triangles. Vector extensions are generally designed so that the dot products in the inner loops of matrix multiplications will vectorize, so this method should vectorize on such processors.
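A minimal sketch of the matrix-scan approach, for the first-order special case y[i] = a·y[i−1] + b·x[i] (so s = (y, 1)^T and M(x) = [[a, b·x], [0, 1]]), may make the steps concrete. This is our illustration, not the paper's OpenCL implementation; the serial loop stands in for the parallel scan, and the composition operator multiplies the later matrix on the left.

```python
import numpy as np

def iir_scan(x, a, b, y_prev=0.0):
    """Compute IIR outputs for inputs x via the matrix-scan algorithm."""
    # Step 1: form M(x_i) for each input (no data dependencies here).
    ms = [np.array([[a, b * xi], [0.0, 1.0]]) for xi in x]
    # Step 2: scan with matrix multiplication; op(A, B) = B @ A, so that
    # the running prefix is R_i = M(x_i) @ ... @ M(x_k).
    rs, acc = [], np.eye(2)
    for m in ms:                  # serial stand-in for the parallel scan
        acc = m @ acc
        rs.append(acc)
    # Steps 3-4: outputs are the first element of each R_i @ s.
    s = np.array([y_prev, 1.0])
    return [float((r @ s)[0]) for r in rs]

# Impulse response of y[i] = 0.5*y[i-1] + x[i]: expected geometric decay.
print(iir_scan([1.0, 0.0, 0.0, 0.0], a=0.5, b=1.0))  # [1.0, 0.5, 0.25, 0.125]
```

The result matches the direct serial recursion, which is the essential correctness check before handing the ⊕ operator to a scan library.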

IV. EXAMPLE 2: FINITE STATE MACHINES
Let M = (Σ, S, T, s_0, A) be an FSM with alphabet Σ, state set S, transition function T : Σ × S → S, initial state s_0 ∈ S, and accepting states A ⊆ S. Assume again that an actor implementing this FSM is scheduled to run N times consecutively with inputs x_i ∈ Σ, for i ∈ {k, ..., k + N − 1}, yielding corresponding Boolean outputs y_i which are true if and only if M is in an accepting state after reading x_i. We show two parallel scan-based algorithms for computing the outputs of the actor. The first will be given by exhibiting a state vector s and matrices M(x_i) so that the algorithm presented in the previous section can be reused. The second does some pre-computation given M so that table lookups substitute for the matrix multiplications.
Without loss of generality, identify the |S| states with the integers 1, ..., |S|, with 1 being the initial state and the accepting states being the highest-numbered |A| states. In the vector-matrix algorithm, s is a column vector of length |S| containing a one at the index corresponding to the current state of M and all other values zero. The initial value of s is s_0 = (1, 0, ..., 0)^T. Let M(x_i) contain a 1 at index (r, c) if T(x_i, c) = r and zero otherwise. Then, as required for a representation of the transition function, if the machine is in state c before reading symbol x_i, that is, s_{i−1} has its one at index c, then s_i = M(x_i) s_{i−1} is in state r, i.e., has its 1 at index r. Matrix multiplication is again the representation of composition of the transition functions, and the algorithm of the previous section can be applied, changing only the step where the outputs are computed:
1) Compute M(x_k), ..., M(x_{k+N−1}) in parallel.
2) Apply the scan operation to the matrices, where the binary operator is matrix multiplication.
Call the results R_k, ..., R_{k+N−1}.
3) For i = k, ..., k + N − 1, let output y_i be true if and only if R_i s has its 1 at one of the |A| highest indices.
4) Update s to be R_{k+N−1} s.
In contrast to the case of IIR filters, with FSMs the matrices R are sparse: like the M's, they will have only one nonzero per row, and that element's value must be a 1. All vectors s likewise contain all zeros except for one element with value 1. So, the matrices M and R could be stored as vectors of length |S| containing the index of the 1 in each row, and the vectors s could be stored as the index of the element with value one. Matrix-matrix multiplication then would take O(|S|) time and matrix-vector multiplication O(1) time.
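The compact representation just described can be sketched as follows (our illustration, with hypothetical names): each matrix M(x) collapses to an array `t` where `t[c] = T(x, c)`, and composing two transformations is a gather taking O(|S|) time.

```python
def compose(f, g):
    """Apply f first, then g: (g ∘ f)[c] = g[f[c]]. O(|S|) time."""
    return [g[s] for s in f]

def fsm_scan(transitions, inputs, start=0, accepting=frozenset()):
    """Boolean outputs of an FSM actor over a block of inputs, via scan."""
    rs, acc = [], None
    for x in inputs:              # serial stand-in for the parallel scan
        step = transitions[x]
        acc = step if acc is None else compose(acc, step)
        rs.append(acc)
    # Output i is true iff the state reached from `start` is accepting.
    return [r[start] in accepting for r in rs]

# Example: parity machine over {0, 1}; accepts after an odd number of 1s.
# transitions[x][state] -> next state; state 0 = even, state 1 = odd.
T = {0: [0, 1], 1: [1, 0]}
print(fsm_scan(T, [1, 0, 1, 1], start=0, accepting={1}))
```

States here are numbered from 0 rather than 1 for Python indexing; the accepting-state test replaces the "1 at one of the |A| highest indices" check of the vector form.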
Further speed improvement is possible by doing some computation once the FSM is known but before the first invocation of the actor. As the matrices M and R contain only 1's and 0's, there can be only a finite number of them. So all the possible products M(x_i) ⋯ M(x_j) of matrices M(x) for x ∈ Σ can be carried out in advance and a table built of the results. Each of these matrices can be identified with an integer, so functional composition is no longer matrix multiplication but a lookup of an entry in a 2D integer array. (Those knowledgeable in abstract algebra will recognize that this lookup table is isomorphic to the transition semigroup of the FSM [17].) Functional composition now takes O(1) time. For details of this representation, of the resulting greedy algorithm for constructing the lookup table, and a study of lookup table size as a function of FSM size for one particular signal processing application, see [18]. Previous table-driven methods yielding unbounded concurrency in FSM execution are described in [19].
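A sketch of this precomputation (our hypothetical code, not the construction detailed in [18]): enumerate all distinct compositions of the per-symbol transformations, number them, and tabulate composition so that the scan's ⊕ becomes a single 2D table lookup.

```python
def build_semigroup(generators):
    """generators: tuples mapping state -> next state, one per symbol.
    Returns (elements, index, composition table) of the closure."""
    elems, index = [], {}
    for g in generators:
        if g not in index:
            index[g] = len(elems)
            elems.append(g)
    frontier = list(elems)
    while frontier:               # close under composition with generators
        new = []
        for f in frontier:
            for g in generators:
                h = tuple(g[s] for s in f)    # apply f first, then g
                if h not in index:
                    index[h] = len(elems)
                    elems.append(h)
                    new.append(h)
        frontier = new
    n = len(elems)
    # table[i][j] = id of elems[j] ∘ elems[i] (apply i first, then j).
    table = [[index[tuple(elems[j][s] for s in elems[i])] for j in range(n)]
             for i in range(n)]
    return elems, index, table

# Parity machine generators: identity (symbol 0) and swap (symbol 1).
elems, index, table = build_semigroup([(0, 1), (1, 0)])
print(len(elems))  # 2: the transition semigroup here is {identity, swap}
```

Once the table is built, the scan operator is `lambda i, j: table[i][j]` on integer ids, which is the O(1) composition claimed above.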
V. EXPERIMENTAL VERIFICATION

References [18] and [20] present applications verifying the correctness, and studies of the performance, of the FSM method running on GPUs and on multicore CPUs. On GPUs, the lookup tables of the previous section were stored in constant memory, which permits irregular, simultaneous access from multiple threads. Those studies did not consider the question of vectorization.
The vectorizability of the lookup table method of the previous section was studied by coding the scan algorithm in C and compiling it with the Intel C++ Composer XE for Windows version 2013 SP1 vectorizing compiler, targeting the AVX vector extensions. The inner loop of the scan could not be vectorized. Vectorizing this loop requires simultaneous reads of random addresses within the lookup table. Examination of the AVX vector extension instruction set [3] shows that it and earlier x86 vector extensions support only simultaneous reads of adjacent memory addresses. Thus, the table lookup is not vectorizable, and the compiler identifies the entire loop as not being efficiently vectorizable. AVX-512 supports gather instructions that simultaneously read randomly indexed values into a vector register [3]. Hence this limitation is probably not permanent, but this assertion remains to be tested.
A software implementation of a simple SDF system was constructed in the Python 3 language. The IIR filter actor was implemented in OpenCL 1.2 using the parallel scan operator provided by PyOpenCL version 2013.2. Testing was performed on an Apple MacBook Pro with a dual-core Intel i5-2435M CPU at 2.40 GHz and Apple's OpenCL library. Figure 1 shows the output of the IIR filter y[n] = 0.99 y[n − 1] + 0.01 x[n] when excited with an impulse, yielding the expected exponentially decaying RC-filter response.
Actor execution times were measured using OpenCL profiling events inserted into the command queue before and after execution of the scan kernels. Figure 2 shows repeated measurements of the throughput of the IIR filter actor as a function of N. The peak throughput obtained is about 88 megasamples per second on a relatively low-power processor. The throughput increases with N up to approximately 100,000 samples; there is a classic trade-off between throughput and latency.

VI. CONCLUSIONS
Full use of the parallel computation capabilities of present and expected CPUs and GPUs requires use of vector extensions. Yet many actors in data flow systems for digital signal processing have internal state that imposes serial dependencies between actor invocations, making vectorization across actor invocations impossible. We have presented a novel methodology that in some cases permits such vectorization through use of the parallel scan pattern. We presented examples of the use of the methodology on two types of actor useful in practice, along with experimental evaluation of the resulting actors.

Fig. 1. Output of IIR filter. Input is a unit impulse at sample index 128.

Fig. 2. Boxplot of 20 repeated measurements of the base-2 logarithm of the number of consecutive actor invocations N (in this implementation, the same as the size of the sample buffers passed to the actor) vs. throughput in samples per second of the IIR filter.