FIR Filter Implementation

4.1 What is the basic algorithm for implementing FIR filters?

Structurally, FIR filters consist of just two things: a sample delay line and a set of coefficients. To implement the filter:

Put the input sample into the delay line.

Multiply each sample in the delay line by the corresponding coefficient and accumulate the result.

Shift the delay line by one sample to make room for the next input sample.

4.2 How do I implement FIR filters in C?

There are lots of possibilities, including a few tricks. To illustrate, we have provided a set of FIR filter algorithms implemented in C called “FirAlgs.c” in ScopeFIR’s distribution file. FirAlgs.c includes the following functions:

fir_basic: Illustrates the basic FIR calculation described above by implementing it very literally.

fir_circular: Illustrates how circular buffers are used to implement FIRs.

fir_shuffle: Illustrates the “shuffle down” technique used by some of Texas Instruments’ processors.

fir_split: Splits the FIR calculation into two flat (non-circular) pieces to avoid the use circular buffer logic and shuffling.

fir_double_z: Uses a double-sized delay line so that the FIR calculation can be done using a flat buffer.

fir_double_h: Similar to fir_double_z, this uses a double-sized coefficient so that the FIR calculation can be done using a flat buffer.

4.3 How do I implement FIR filters in assembly?

FIR assembly algorithms are quite processor-specific, but the most common system uses a circular buffer mechanism provided by the DSP processor. The basic steps are:

Configure the circular buffer; load the coefficient and delay-line pointers. Then, for each input sample:

Store the incoming data in the delay line; increment the delay-line pointer.

Clear the multiplier-accumulator.

Loop over all coefficients/delays; accumulate the values obtained by multiplying the coefficients by the delayed samples.

Round or truncate the result as the FIR output.

Alternatively, a “shuffle down” method is used in Texas Instruments’ older fixed-point processors to implement circular buffers. The processor literally moves each sample delay values by one slot during each multiply-accumulate (via the “MACD” instruction).

Each DSP microprocessor manufacturer provides example FIR assembly code in its data books or its application handbooks, so be sure to look at those before you “reinvent the circular buffer”.

4.4 How do I test my FIR implementation?

Here are a few methods:

Impulse Test: A very simple and effective test is to put an impulse into it (which is just a “1” sample followed by at lest N – 1 zeroes.) You can also put in an “impulse train”, with the “1” samples spaced at least N samples apart. If all the coefficients of the filter come out in the proper order, there is a good chance your filter is working correctly. (You might want to test with non-linear phase coefficients so you can see the order they come out.) We recommend you do this test whenever you write a new FIR filter routine.

Step Test: Input N or more “1” samples. The output after N samples, should be the sum (DC gain) of the FIR filter.

Sine Test: Input a sine wave at one or more frequencies and see if the output sine has the expected amplitude.

Swept FM Test: From Eric Jacobsen: “My favorite test after an impulse train is to take two identical instances of the filter under test, use them as I and Q filters and put a complex FM linear sweep through them from DC to Fs/2. You can do an FFT on the result and see the complete frequency response of the filter, make sure the phase is nice and continuous everywhere, and match the response to what you’d expect from the coefficient set, the precision, etc.”

4.5 What tricks are useful in implementing FIR filters?

FIR tricks center on two things 1) not calculating things that don’t need to be calculated, and 2) “faking” circular buffers in software.

4.5.1 How do I skip needless calculations?

First, if your filter has zero-valued coefficients, you don’t actually have to calculate those taps; you can leave them out. A common case of this is “half-band” filter, which have the property that every-other coefficient is zero.

Second, if your filter is “symmetric” (linear phase), you can “pre-add” the samples which will be multiplied by the same coefficient value, prior to doing the multiply. Since this technique essentially trades an add for a multiply, it isn’t really useful in DSP microprocessors which can do a multiply in a single instruction cycle. However, it is useful in ASIC implementations (in which addition is usually much less expensive than multiplication); also, some newer DSP processors now offer special hardware and instructions to make use of this trick.

4.5.2 How do I fake circular buffers in software?

When hardware support for circular buffers isn’t available, you have to “fake” them. Also, since ANSI C has no construct to describe circular buffers, most C compilers can’t generate code to use them, even if the target processor has them.

You can always implement a circular buffer by duplicating the logic of a circular buffer in software (and many have), but the overhead can be prohibitive; the circular-fake might take several instructions to implement, compared to just a single instruction to do the multiply-accumulate operation. Therefore you need to fake it.

Here are several basic techniques to fake circular buffers:

Split the calculation: You can split any FIR calculation into its “pre-wrap” and “post-wrap” parts. By splitting the calculation into these two parts, you essentially can do the circular logic only once, rather than once per tap. (See fir_double_z in FirAlgs.c above.)

Duplicate the delay line: For a FIR with N taps, use a delay line of size 2N. Copy each sample to its proper location, as well as at location-plus-N. Therefore, the FIR calculation’s MAC loop can be done on a flat buffer of N points, starting anywhere within the first set of N points. The second set of N delayed samples provides the “wrap around” comparable to a true circular buffer. (See fir_double_z in FirAlgs.c above.)

Duplicate the coefficients: This is similar to the above, except that the duplication occurs in terms of the coefficients, not the delay line. Compared to the previous method, this has a calculation advantage of not having to store each incoming sample twice, and it also has a memory advantage when the same coefficient set will be used on multiple delay lines. (See fir_double_h in FirAlgs.c above.)

Use block processing: In block processing, you use a delay line which is a multiple of the number of taps. You therefore only have to move the data once per block to implement the delay-line mechanism. When the block size becomes “large”, the overhead of a moving the delay line once per block becomes negligible.