Digital Signal Processing Tricks

One of the things that makes DSP so interesting is its wide variety of tricks, usually to reduce the processing required for a given operation. The right DSP trick can be a nugget of gold. Here, we provide a gold mine of well known - and not so well known - DSP tricks.

Filtering Tricks

DSP Trick: Simple Filter Coefficient Interpolation

From: (Eric Jacobsen)
Subject: DSP Trick - Simple Filter Coefficient Interpolation
Date: 23 Oct 1999 00:00:00 GMT
Newsgroups: comp.dsp

Often it is necessary to design a FIR filter coefficient set that is longer than can be practically computed using common automated routines like Parks-McClellan or its Remez subroutine. In most cases this problem can be alleviated by using P-M or Remez to design a short coefficient set and then interpolating it up to the desired length. A simple interpolation method that avoids some of the distortions associated with linear, spline, or polynomial interpolation uses the interpolating capability of the DFT. The procedure is as follows:

  1. Design a shortened FIR filter that can be computed using an appropriate design technique.
  2. Take a forward DFT of this shortened FIR coefficient set.
  3. Zero pad the result up to the desired length of the final filter.
  4. Take the inverse-DFT of the zero-padded vector.
  5. The real part of the inverse-DFT is the interpolated coefficient set of the desired length.

It is always prudent to verify that the interpolated filter has the desired response.

Eric Jacobsen
Minister of Algorithms, EF Data Corp.

Editor's Note: This trick has some similarities to another dspGuru article, "How to Interpolate in the Time-Domain by Zero-Padding in the Frequency Domain".

DSP Trick: FIR Filtering in C

Subject: DSP Trick: Filtering
Date: 1999/04/21
Newsgroups: comp.dsp

Name: Filtering in C

Category: Programming Trick

Application: FIR-Filtering

Advantages: Simple and fast


This is more of a C trick, but it might also be applicable in DSP environments. It allows FIR filtering to be computed quickly when the filter length contains a factor of, say, 4 (other factors are also possible).

The Trick:

Let's suppose you have an array for the FIR filter coefficients w[LEN] and an array for the delay line x[LEN]. A straightforward approach to do the filtering operation would be:

   y=0; //will contain your result
   for (i=0; i<LEN; i++)
      y = y + w[i]*x[i];

However, many processors don't like the index calculations, so it is sometimes advantageous to do fewer of them. Here's how it goes (suppose LEN is divisible by four; as stated above, the trick also works for other factors):

   w_end_ptr  = &w[LEN]; // load sentinel
   w_ptr      = w;
   x_ptr      = x;
   y0         = 0.; // we assume floating point here, so scaling is
   y1         = 0.; // not an issue
   y2         = 0.;
   y3         = 0.;
//----Do the filtering------------------------------------------
   while(w_ptr != w_end_ptr)
   {
      y0 = y0 + w_ptr[0]*x_ptr[0];
      y1 = y1 + w_ptr[1]*x_ptr[1];
      y2 = y2 + w_ptr[2]*x_ptr[2];
      y3 = y3 + w_ptr[3]*x_ptr[3];
      w_ptr = w_ptr + 4;
      x_ptr = x_ptr + 4;
   }
   y = y0+y1+y2+y3;         // y will contain the filtering result

I had pretty good results with this on a SPARC.

Rainer Storn

DSP Trick: Filtering in QAM transmitters and receivers

Subject: Re: DSP Tricks
From: Allan Herriman
Date: 1999/04/22
Newsgroups: comp.dsp

Name: Filtering in QAM transmitters and receivers. When *NOT* to do what the textbooks tell you to do.

Category: Hardware architecture, or implementation

Application: QAM receivers (with hardware emphasis)

Advantages: The textbook descriptions of QAM receivers sometimes miss practical details. "Optimal" solutions may not be the best ones... These tricks are simply a list of possible reasons for deviating from "normal" QAM filter design.

Introduction: (what the textbooks tell you to do)

Any modem which modulates a linear channel (AM, ASK, FM, FSK, PM, PSK, BPSK, QAM, QPSK, or even baseband signalling) subject to noise will use filtering to improve the error rate in the receiver.

In general, there will be filters on the output signals (tx) and the input signals (rx), and also the bit in the middle (the channel).

There are two parts to this:

  1. Eliminate ISI (Inter Symbol Interference) due to limitations in the frequency or phase response of the channel.
  2. Use matched filtering to produce a maximum likelihood receiver.

1. ISI can be eliminated if the channel (including tx and rx filters) frequency response is (1) linear phase, and (2) has symmetry about a point at half the symbol rate (Fs/2).

    ^ H(f)
1.0 |--------\
    |         \
    |          \
    |           \
    |            \
0.0 +----------------------------->f
            ^  ^  ^
            |  |  (1+a)·Fs/2
            |  Fs/2
            (1-a)·Fs/2
'a' here is actually 'alpha', the rolloff factor. (Although I have seen (1+alpha) used instead of alpha.) Alpha can be between 0 and 1, but commonly it will be 0.3 to 0.5 (or 1.3 to 1.5 using the other definition) for 30% to 50% excess bandwidth.

For some reason (because the maths isn't too hard?), most implementations use a "raised cosine" response. The rolloff section between (1-a)·Fs/2 and (1+a)·Fs/2 is actually a half cycle of a cosine wave.

Note: it is also possible to use an adaptive equaliser in the receiver (either before or after the symbol decisions are made) which can reduce ISI. But this is "simply" a filter which adapts its response so the above requirement is met. Adaptive equalisers are used when the channel response is unknown or changing. They may either "adapt" to a training sequence and then remain fixed (like a fax machine), or they continuously adapt.

2. A matched filter will produce the lowest errors in the receiver output for a channel which adds white gaussian noise (an AWGN channel) if the rx filter impulse response is the time inverse of the tx pulse shape. (The tx pulse shape is determined by the tx filter.) In the frequency domain, this means that the magnitude responses of the rx and tx filters are the same, but the phase responses are opposite (and the combination has zero phase (linear phase in practice)). The matched filter output is only valid at the symbol sampling instant.

(This was inherent in the maths. If you want to know more, look at a textbook.) For example, if we transmit square pulses, then the rx filter should have a square impulse response. This would be an integrate-and-dump filter.

3. Combining 1 and 2 results in the following:

An optimal modem will use root-raised cosine filtering in the tx and rx filters. (A root-raised cosine filter puts "half" the response in the tx and "half" in the rx filter, so that the product in the frequency domain is a raised cosine.) The total channel response will have zero ISI, and the tx and rx filters are the same, so we have minimised the probability of errors.

The Tricks

The above description can be found in any communications textbook. Now for what the textbooks leave out: some examples of when *not* to use "optimal" filters.

Trick #1:

Must meet transmit spectral mask because:

  1. Certain regulatory bodies place restrictions on the tx spectrum from a modem. For RF modems, the out-of-band emissions sometimes have to be < -80dBc.
  2. Sometimes, the tx signal will interfere with the rx signal at the same end of the link in nearby channels. This is known as NEXT (Near End Cross Talk). In the case of an RF modem, the tx signal can be more than 100dB stronger than the rx signal, so NEXT can be a big problem.

Both of these place limits on the tx filter. This will entail:

  1. Using a small alpha.
  2. Truncating the tails of the tx filter frequency response.
    • This will result in degraded performance.
    • Truncate the rx filter response as well.
  3. Using a non-root-raised cosine tx filter. Pick one that allows a sharper rolloff.
  4. Allocating more of the raised cosine filter to the tx, and less to the rx

Trick #2:

Interfering signal has non-white spectrum. (AWGN assumption was made in the matched filter derivation.) Known narrowband interferers can be handled by putting a notch in the rx filter. If the notch is very narrow, the tx filter needn't be changed. Adjacent channel interference can be handled by making the rx filter slightly narrower. (See Trick #1 above)

Trick #3:

Symbol timing recovery problems. A matched filter produces a maximum likelihood estimate of the input symbol at a particular instant only. This assumes that this instant is known. Some simpler symbol-timing recovery schemes may require sub-optimal filtering to work. For example, wideband rx and tx filters allow signal transition detection to be used for symbol timing recovery. (This is how a UART works.) Symbol timing recovery is usually easier with larger alpha. (Books could be written about symbol-timing recovery. Any takers?)

Trick #4:

When one of the filters cannot be controlled. Perhaps the receiver uses analog filtering only, possibly a SAW filter in the IF (passband) or an RLC filter at baseband (BTW, 2nd and 3rd order Butterworth filters have been used here). This filter will only be a rough approximation of a root-raised-cosine response, and will not have linear phase. This can be compensated for in the (FIR) tx filter.

Trick #5:

When there are significant non-linearities (in the tx output amplifier). Usually, the requirement will be to have the smallest amount of AM in the tx, which allows the average output power to be higher for a given amount of spectral spreading (due to the non-linearity). This may require wider tx filters and narrower rx filters. Useful where power efficiency is important (satellite links, handheld equipment, etc). There is also a case for using a larger alpha here. In extreme cases, it is possible to pick a modulation scheme that has a constant-amplitude constellation. (OQPSK, GMSK, etc.)

Trick #6:

When the rx filter is inside a feedback loop controlling carrier phase or frequency tracking. The group delay of the rx filter limits the tracking bandwidth of these loops (due to stability considerations). If a wider loop bandwidth is required (perhaps because of capture range or perhaps poor phase noise performance in the up- and downconverters), then the rx filter may need to be changed if it is not possible to move it outside the loop. In this case, allocate more of the raised cosine filter to the tx, and less to the rx (or try harder to move it outside the loop).

DSP Trick: Using Parks-McClellan to Design a Non-Linear Phase FIR Filter

From: (Eric Jacobsen)
Subject: DSP Trick - Using P-M to design a non-linear phase FIR filter.
Date: 23 Oct 1999 00:00:00 GMT
Newsgroups: comp.dsp

It is possible to use the Parks-McClellan algorithm to design FIR filters with non-linear phase response. For example, a FIR filter with the equivalent response of a Butterworth filter can be designed using the P-M routine.

First, take a common Butterworth description like that in Parks and Burrus where H(x) is a complex function. Create two PM input grids using the real and imaginary components of the Butterworth response. Use PM (or "remez" or whatever it's called on your favorite system) to design two FIR filters using the respective input grids, but turn the 'Hilbert' switch on for the one derived from the imaginary component. Sum the results, i.e., add together the nth coefficients of each filter to create a single N-tap filter.

The resulting FIR filter (assuming you've done your job to make sure everything converges) will have the desired response from the original expression used to generate the PM input grids.

Eric Jacobsen
Minister of Algorithms

Function Approximation Tricks

In DSP, the exact value of a mathematical function usually isn't needed: a degree of inaccuracy usually can be tolerated, and computational speed usually is a primary concern. The right combination of accuracy and speed is very much application-specific.

Approximation of mathematical functions, therefore, is an important subject in DSP. Highly accurate approximations are well known. However, here we present some less-well known approximation DSP tricks which provide an outstanding combination of speed and accuracy.

DSP Trick: Magnitude Estimator

From: Grant Griffin
Subject: DSP Trick: Magnitude Estimator
Date: 02 Oct 1999 00:00:00 GMT

Name: Magnitude Estimator

Category: Algorithmic

Application: Signal strength measurement, AM demodulation, etc.

Advantages: This estimation algorithm is very quick compared to calculating magnitude to full precision using a square-root operation.


Given a complex number represented as "I + jQ", the magnitude is: sqrt(I**2 + Q**2). Unfortunately, the square-root operation in this formula is inherently slow and complicated to calculate, either in software or in hardware.

For applications which do not require a full-precision magnitude value, the use of a magnitude _estimation_ can save calculations. This well-known and widely used algorithm is a true gem of DSP because it provides considerable savings in calculation, at the cost of only a minimal loss of accuracy.

The Trick:

Given a complex number whose real part is "I" and whose imaginary part is "Q", the algorithm estimates the magnitude as:

Mag ~= Alpha * max(|I|, |Q|) + Beta * min(|I|, |Q|)

"Alpha" and "Beta" are two constants whose values can be chosen to trade among RMS error, peak error, and implementation complexity.

How It Works:

The absolute value operations fold the complex number into the range of 0-90 degrees, and the min/max operations further fold it into the range of 0-45 degrees. Within this limited range, a linear combination of I and Q is a good approximation of magnitude.

Values for Alpha and Beta:

A program to demonstrate and test this algorithm is given at the bottom. It includes a table of the most useful values for Alpha and Beta. The program prints the following:

             Alpha * Max + Beta * Min Magnitude Estimator

Name                  Alpha           Beta       Avg Err   RMS   Peak
                                                 (linear)  (dB)  (dB)
Min RMS Err      0.947543636291 0.392485425092   0.000547 -32.6 -25.6
Min Peak Err     0.960433870103 0.397824734759  -0.013049 -31.4 -28.1
Min RMS w/ Avg=0 0.948059448969 0.392699081699   0.000003 -32.6 -25.7
1, Min RMS Err   1.000000000000 0.323260990000  -0.020865 -28.7 -23.8
1, Min Peak Err  1.000000000000 0.335982538000  -0.025609 -28.3 -25.1
1, 1/2           1.000000000000 0.500000000000  -0.086775 -20.7 -18.6
1, 1/4           1.000000000000 0.250000000000   0.006456 -27.6 -18.7
Frerking         1.000000000000 0.400000000000  -0.049482 -25.1 -22.3
1, 11/32         1.000000000000 0.343750000000  -0.028505 -28.0 -24.8
1, 3/8           1.000000000000 0.375000000000  -0.040159 -26.4 -23.4
15/16, 15/32     0.937500000000 0.468750000000  -0.018851 -29.2 -24.1
15/16, 1/2       0.937500000000 0.500000000000  -0.030505 -26.9 -24.1
31/32, 11/32     0.968750000000 0.343750000000  -0.000371 -31.6 -22.9
31/32, 3/8       0.968750000000 0.375000000000  -0.012024 -31.4 -26.1
61/64, 3/8       0.953125000000 0.375000000000   0.002043 -32.5 -24.3
61/64, 13/32     0.953125000000 0.406250000000  -0.009611 -31.8 -26.6


If you need a more accurate estimation than this algorithm provides, you can use some variation of it. For example, varying values of Alpha and Beta can be taken from a small lookup table, driven by the relative size of min and max values. Another possibility is to use this estimate as the "seed" of an iterative magnitude estimator.


As an alternative, consider using the CORDIC algorithm, especially in hardware applications.


This algorithm is described in Understanding Digital Signal Processing by Richard G. Lyons [Lyo97], in Digital Signal Processing in Communication Systems by Marvin E. Frerking [Fre94], and elsewhere.

Thanks to Clay S. Turner for providing some of the coefficients.

The Code:

#include <math.h>
#include <stdio.h>
/**************************************************************************
 * Name: mag_est.c
 *
 * Synopsis:
 *   Demonstrates and tests the "Alpha * Max + Beta * Min" magnitude
 *   estimation algorithm.
 *
 * Description:
 *   This program demonstrates the "Alpha, Beta" algorithm for
 *   estimating the magnitude of a complex number.  Compared to
 *   calculating the magnitude directly using sqrt(I^2 + Q^2), this
 *   estimation is very quick.
 *
 *   Various values of Alpha and Beta can be used to trade among RMS
 *   error, peak error, and coefficient complexity.  This program
 *   includes a table of the most useful values, and it prints out the
 *   resulting RMS and peak errors.
 *
 * Copyright 1999  Grant R. Griffin
 *                    The Wide Open License (WOL)
 *
 * Permission to use, copy, modify, distribute and sell this software
 * and its documentation for any purpose is hereby granted without
 * fee, provided that the above copyright notice and this license
 * appear in all source copies. THIS SOFTWARE IS PROVIDED "AS IS"
 * for more information.
 *************************************************************************/

double alpha_beta_mag(double alpha, double beta, double inphase,
                      double quadrature)
{
   /* magnitude ~= alpha * max(|I|, |Q|) + beta * min(|I|, |Q|) */
   double abs_inphase = fabs(inphase);
   double abs_quadrature = fabs(quadrature);

   if (abs_inphase > abs_quadrature) {
      return alpha * abs_inphase + beta * abs_quadrature;
   } else {
      return alpha * abs_quadrature + beta * abs_inphase;
   }
}

double decibels(double linear)
{
   #define SMALL 1e-20
   if (linear <= SMALL) {
      linear = SMALL;
   }
   return 20.0 * log10(linear);
}

void test_alpha_beta(char *name, double alpha, double beta,
                     int num_points)
{
   #define PI 3.141592653589793
   int ii;
   double phase, real, imag, err, avg_err, rms_err;
   double peak_err = 0.0;
   double sum_err = 0.0;
   double sum_err_sqrd = 0.0;
   double delta_phase = (2.0 * PI) / num_points;

   for (ii = 0; ii < num_points; ii++) {
      phase = delta_phase * ii;
      real = cos(phase);
      imag = sin(phase);
      err = sqrt(real * real + imag * imag)
            - alpha_beta_mag(alpha, beta, real, imag);
      sum_err += err;
      sum_err_sqrd += err * err;
      err = fabs(err);
      if (err > peak_err) {
         peak_err = err;
      }
   }

   avg_err = sum_err / num_points;
   rms_err = sqrt(sum_err_sqrd / num_points);
   printf("%-16s %14.12lf %14.12lf  %9.6lf %4.1lf %4.1lf\n",
          name, alpha, beta, avg_err, decibels(rms_err),
          decibels(peak_err));
}

int main(void)
{
   #define NUM_CHECK_POINTS 100000

   typedef struct tagALPHA_BETA {
      char *name;
      double alpha;
      double beta;
   } ALPHA_BETA;

   #define NUM_ALPHA_BETA 16
   const ALPHA_BETA coeff[NUM_ALPHA_BETA] = {
      { "Min RMS Err",      0.947543636291, 0.3924854250920 },
      { "Min Peak Err",     0.960433870103, 0.3978247347593 },
      { "Min RMS w/ Avg=0", 0.948059448969, 0.3926990816987 },
      { "1, Min RMS Err",              1.0,     0.323260990 },
      { "1, Min Peak Err",             1.0,     0.335982538 },
      { "1, 1/2",                      1.0,      1.0 / 2.0  },
      { "1, 1/4",                      1.0,      1.0 / 4.0  },
      { "Frerking",                    1.0,            0.4  },
      { "1, 11/32",                    1.0,     11.0 / 32.0 },
      { "1, 3/8",                      1.0,      3.0 / 8.0  },
      { "15/16, 15/32",        15.0 / 16.0,     15.0 / 32.0 },
      { "15/16, 1/2",          15.0 / 16.0,      1.0 / 2.0  },
      { "31/32, 11/32",        31.0 / 32.0,     11.0 / 32.0 },
      { "31/32, 3/8",          31.0 / 32.0,      3.0 / 8.0  },
      { "61/64, 3/8",          61.0 / 64.0,      3.0 / 8.0  },
      { "61/64, 13/32",        61.0 / 64.0,     13.0 / 32.0 }
   };

   int ii;

   printf("\n             Alpha * Max + Beta * Min Magnitude Estimator\n\n");
   printf("Name                  Alpha           Beta       Avg Err   RMS   Peak\n");
   printf("                                                 (linear)  (dB)  (dB)\n");
   for (ii = 0; ii < NUM_ALPHA_BETA; ii++) {
      test_alpha_beta(coeff[ii].name, coeff[ii].alpha, coeff[ii].beta,
                      NUM_CHECK_POINTS);
   }
   return 0;
}

DSP Trick: Quick-and-Dirty Logarithms

From: Ray Andraka <>
Subject: Quick and dirty logarithm trick
Date: 22 Jun 2000 00:00:00 GMT
Newsgroups: comp.dsp

Name: Quick and dirty logarithms

Category: Algorithmic

Application: Anywhere a fairly precise logarithm of a value is needed quickly.

Advantages: Very fast, very little computation. The look-up table is small for precision to well under a dB. Compact hardware implementation. Works for arbitrary logarithm bases.

Disadvantages: Precision limited by table size. Requires normalized floating point input.

Introduction: Logarithms are usually computed using a Taylor series or another iterative algorithm. This takes up valuable clock cycles or hardware. Such algorithms are also usually limited to a specific logarithm base. This trick takes advantage of the normalization and the exponent in floating point representations.

The Trick: Floating point representation of a number separates the number into an exponent (usually in excess notation), a sign and a significand, also known as a mantissa. The number represented is then N=M*2^E.

The log (in whatever base is convenient) of that number is LOG(N) = LOG(M * 2^E) = LOG(M) + E*LOG(2). If we assume M is normalized, 1 <= M < 2, then 0 <= LOG(M) < LOG(2). The E*LOG(2) term gives us the coarse estimate of the LOG (to within 6dB), which can be found by just multiplying the exponent by a constant (LOG(2) in the appropriate base). For IEEE floating point, a constant offset may need to be applied to the stored exponent to correct for the excess-127 notation.

If finer precision is required, and here's the real trick, then we can obtain an estimate of the logarithm of the mantissa and add that to the logarithm of the exponent to get a more accurate estimate. The logarithm of the mantissa can be found using a small look-up table addressed by the most significant bits of M (the most significant bit is always 1 if M is normalized, so that bit need not be used in the address). The more bits of M used, the larger the table, and the more accurate the logarithm. A single bit from M will get you close to 3dB, while a 5 bit address (32 entry table) will get close to 1/4 dB accuracy, which is usually all that is needed.

This trick works well in hardware (FPGA) implementations, since the table size is kept small enough to be implemented in one logic cell. For hardware implementations, it may be convenient to use the exponent to address a table instead of using a multiplier.

-Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.

DSP Trick: Fixed-Point Atan2 With Self Normalization

Subject: DSP Trick:  Fixed-pt. Atan2 with Self Normalization
From: Jim Shima
Date: 1999/04/23
Newsgroups: comp.dsp

Name: Fast fixed-pt. atan2 calculation with self normalization

Category: Algorithmic

Application: Used when one wants to compute the 4-quadrant arctangent of a complex number (or any number with x-y coordinates) with a self-normalizing function.

Example Applications: digital FM demodulation, phase angle computations

Advantages: Fast


Computing a 4-quadrant arctangent on DSPs has been the subject of many discussions. Several techniques such as table lookup and polynomial expansion are well known.

In fixed-point DSPs, some normalization of the complex number may be necessary, effectively implementing a hard limiter or amplitude invariance function.

In fact, computing:

theta = atan(y/x)

includes the necessary normalization, but in a fixed-pt. DSP, the division can result in values outside the fixed-pt. range of [-1,1).

Also, for certain applications, such as digital FM demodulation, any amplitude fluctuations must be removed before computing the phase angle.

The Trick:

I computed a self-normalizing ratio depending on the quadrant that the complex number resides in. For a complex number z, let x=Re(z) and y=Im(z).

For a complex number in quadrant I, compute the ratio:

r = (x - |y|) / (x + |y|)     (1)

To get the phase angle, compute:

theta1 = pi/4 - pi/4*r (2)

Likewise, if the complex number resides in quadrant II, compute the ratio:

r = (x + |y|) / (|y| - x)     (3)

And to get the quadrant II phase angle, compute:

theta2=3*pi/4 - pi/4*r (4)

If it turns out that the complex number was really in quad IV instead of quad I, just negate the answer resulting from (2).

Likewise, do the same if the number was in quad III instead of quad II. By doing this, you have a 4-quadrant arctan function.

The max error using equations (2) or (4) is a little less than 0.07 radians (and only at a few angles). The accuracy of the estimator is actually quite good considering it uses a 1st-order polynomial to estimate the phase angle.

If you use a higher degree polynomial, it turns out that the even powers of the poly will disappear (due to the odd function), thus relaxing some of the computational load.


To obtain better accuracy (a max error of .01 rads), one can replace equations (2) and (4) with:

theta1=0.1963 * r^3 - 0.9817 * r + pi/4 (2a)
theta2=0.1963 * r^3 - 0.9817 * r + 3*pi/4 (4a)

Equations (2a) or (4a) can be computed using 2 MACs, which is not much more computation for a 7x increase in accuracy.

Here is some C pseudocode (not optimized in any fashion) using equations (1)-(4):

// Fast arctan2
float arctan2(float y, float x)
{
   float coeff_1 = pi/4;
   float coeff_2 = 3*coeff_1;
   float abs_y = fabs(y) + 1e-10;  // kludge to prevent 0/0 condition
   float r, angle;
   if (x >= 0)
   {
      r = (x - abs_y) / (x + abs_y);
      angle = coeff_1 - coeff_1 * r;
   }
   else
   {
      r = (x + abs_y) / (abs_y - x);
      angle = coeff_2 - coeff_1 * r;
   }
   if (y < 0)
      return(-angle);     // negate if in quad III or IV
   else
      return(angle);
}
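The higher-accuracy variant using equations (2a) and (4a) can be sketched the same way (my own code, not from the original post):

```c
#include <math.h>

/* Fast atan2 approximation using the cubic equations (2a) and (4a);
   max error is roughly 0.01 radians. */
float arctan2_cubic(float y, float x)
{
    const float pi_4  = 3.14159265f / 4.0f;
    const float pi3_4 = 3.0f * pi_4;
    float abs_y = fabsf(y) + 1e-10f;   /* avoid 0/0 */
    float r, angle;

    if (x >= 0.0f) {
        r = (x - abs_y) / (x + abs_y);
        angle = 0.1963f * r * r * r - 0.9817f * r + pi_4;   /* (2a) */
    } else {
        r = (x + abs_y) / (abs_y - x);
        angle = 0.1963f * r * r * r - 0.9817f * r + pi3_4;  /* (4a) */
    }
    return (y < 0.0f) ? -angle : angle;  /* negate in quads III, IV */
}
```

As in the linear version, the quadrant folding does the normalization, so no divide by the magnitude is ever needed.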

DSP Trick: Simultaneous Parabolic Approximation of Sin and Cos

From: Olli Niemitalo
Subject: Trick: Simultaneous parabolic approximation of sin and cos
Newsgroups: comp.dsp
Date: 2001-06-30 05:15:09 PST

Name: Simultaneous parabolic approximation of sin and cos

Category: Algorithmic

Application: When you need both sin and cos at once, and you need 'em fast, and using multiplications and parabolic approximation is OK, try this. Possible applications are audio panning, mixing fader, maybe even realtime filter coefficient calculations and 2D/3D rotation transformations.

Advantages: Cheap, only one or two multiplies per approximation pair. No discontinuities.


A quarter of a sinusoid cycle is approximated with a parabola. The remaining quarters are horizontally and vertically flipped versions of the approximated quarter. Instead of creating two different approximations, the same polynomial is used for both sin and cos. The positive crossing point of sin(angle) and cos(angle), at angle=pi/4, is made to correspond to x=0 in the polynomial. This puts the center of a quarter at x=0 and turns switching between quarters (and between sin and cos) into sign flips of x and y.

The Trick:

The range that is approximated is:

x=-1/2 .. +1/2, angle=0 .. pi/2

Angle and x are related by:

x = angle * 2/pi - 1/2

The approximation: ("~=" stands for "approximately equals")

sin(angle) ~= sinapprox(x) = (2 - 4c) x^2 + c + x
cos(angle) ~= cosapprox(x) = (2 - 4c) x^2 + c - x

As you see, the only difference is in the last sign, so you can calculate the part left from that first and then add or sub x to get sin or cos. For different angle ranges, you have to flip signs of the whole equation or the sign of the x term. See the example source code.

c is a constant that can be fine-tuned for different applications. With c yet undefined, the polynomials hit 0 and 1 in the right places. Here are some differently optimized (and approximated) values for c:

A) c ~= 0.7035

B) c ~= 0.71256755058

C) c = sqrt(2)/2 ~= 0.70710678118654752440

D) c = 3/4 = 0.75 (the one-mul choice!)

To compare the errors produced by different choices of c, we look at the maximums and minimums of the following error measures, for different c:

error=sinapprox(x) - sin(angle)

Min: A) -0.0213 B) -0.0150 C) -0.0187 D) 0
Max: A) +0.0212 B) +0.0271 C) +0.0235 D) +0.0560

error=sqrt(sinapprox(x)^2 + cosapprox(x)^2) - 1

Min: A) -0.0261 B) -0.0155 C) -0.0214 D) 0
Max: A)  0      B) +0.0155 C) 0 D) +0.1250

The different c were optimized as follows:

A) was optimized for minimum maximum absolute difference to sin.

B) was optimized for minimum maximum absolute difference of sqrt(sinapprox(x)^2+cosapprox(x)^2) to 1. This is important for example in a rotation transformation, where you want the dimensions to stretch as little as possible from the original. Note, though, that C) gives the lowest total magnitude "wobbling" range.

C) was optimized similarly to B but never lets sqrt(sinapprox(x)^2+cosapprox(x)^2) exceed 1. This is useful if you calculate filter coefficients and don't want unstable poles. Also, sinapprox(0) and cosapprox(0) give sqrt(2)/2, which is the correct value of sin(pi/4) and cos(pi/4).
D) was optimized for a continuous derivative, which reduces high harmonics. This is good if you are creating sine and cosine signals rather than doing "random access" to the function. Also, it eliminates the other multiplication, making the algo one-mul and low bit-depth:

x=-1/2 .. +1/2

sinapprox(x)=3/4 - x^2 + x
cosapprox(x)=3/4 - x^2 - x

Example program:

#include <stdio.h>
#include <math.h>

// This program gives approximated float sin and cos for a fixed point
// phase input. Rather than messing with radians, phase is normalized so
// that phase = 0 corresponds to angle = 0 and phase = 2^BITSPERCYCLE
// corresponds to angle = 2pi. Angles outside this range also give
// correct results. Calculations are optimized to do with minimum
// operations and to avoid the unary minus operator.

// Put here your choices of bit-depth, phase, and the constant C
#define BITSPERCYCLE   10
#define BITSPERQUARTER (BITSPERCYCLE-2)
#define INPUTPHASE     666
#define C              0.70710678118654752440f

// Some useful constants (PI and <math.h> are not part of algo)
#define PI (float)3.1415926535897932384626433832795

int main(void) {
    int phasein = INPUTPHASE; // Phase input
    float sinout, cosout;
    // Modulo phase into quarter, convert to float 0..1
    float modphase = (phasein & ((1<<BITSPERQUARTER)-1))
                     * (1.0f/(1<<BITSPERQUARTER));
    // Extract quarter bits
    int quarter = phasein & (3<<BITSPERQUARTER);
    // Recognize quarter
    if (!quarter) {
        // First quarter, angle = 0 .. pi/2
        float x = modphase - 0.5f;      // 1 sub
        float temp = (2 - 4*C)*x*x + C; // 2 mul, 1 add
        sinout = temp + x;              // 1 add
        cosout = temp - x;              // 1 sub
    } else if (quarter == 1<<BITSPERQUARTER) {
        // Second quarter, angle = pi/2 .. pi
        float x = 0.5f - modphase;      // 1 sub
        float temp = (2 - 4*C)*x*x + C; // 2 mul, 1 add
        sinout = x + temp;              // 1 add
        cosout = x - temp;              // 1 sub
    } else if (quarter == 2<<BITSPERQUARTER) {
        // Third quarter, angle = pi .. 1.5pi
        float x = modphase - 0.5f;      // 1 sub
        float temp = (4*C - 2)*x*x - C; // 2 mul, 1 sub
        sinout = temp - x;              // 1 sub
        cosout = temp + x;              // 1 add
    } else {
        // Fourth quarter, angle = 1.5pi..2pi
        float x = modphase - 0.5f;      // 1 sub
        float temp = (2 - 4*C)*x*x + C; // 2 mul, 1 add
        sinout = x - temp;              // 1 sub
        cosout = x + temp;              // 1 add
    }
    // Print the results; not part of the algo
    float angle = 2*PI*phasein/(1<<BITSPERCYCLE);
    printf("phase: %d/%d, angle: %f\n\n", phasein, 1<<BITSPERCYCLE, angle);
    printf("sinapprox: %.6f, cosapprox: %.6f\n", sinout, cosout);
    printf("      sin: %.6f,       cos: %.6f\n", sin(angle), cos(angle));
    return 0;
}

DSP Trick: Square Root Computation

Date: Mon, 19 Apr 1999 17:50:23 +0200
From: Rainer Storn <>
Newsgroups: comp.dsp
Subject: Square root computation

Name: Square root computation

Category: Algorithmic Trick

Application: e.g. Magnitude computation

Advantages: Simple and fast


Once in a while the square root of data has to be computed on a DSP. A method that comes to mind is polynomial approximation of the square root function. A much simpler method, however is the usage of Newton's method or, if division is a problem, successive approximation.

The Trick:

1) Newton's method:

Newton's method states that if you want to find the zero of a function f(x) you may start at some guessing point x1 and compute the next guessing point for the zero x2 according to

x2 = x1 - f(x1)/f'(x1)    (1)

In the next step make x1 := x2 and repeat equation (1) etc. until the difference between x1 and x2 is sufficiently small. Newton's method does not always converge, but if it does, it converges fast. In the special case of the square root computation Newton's method is guaranteed to converge.

Let's say you want to compute x = sqrt(z) then make

f(x) = x*x - z.  (2)

If we compute f(x) = 0 we immediately see, that x must be the square root of z.

Now plug (2) into (1) to get

x2 = x1 - (x1*x1 - z)/(2*x1)
   = x1 - 0.5*x1 + 0.5*z/x1
   = 0.5*(x1 + z/x1)

which is the recursive equation to be used.

The starting value for x1 is not critical but the closer it is to the real zero the better. I usually use x1 = z.

2) Successive approximation:

If division is a problem, then successive approximation might be the right choice to compute sqrt(x). Successive approximation basically does a binary search to find the square root. A little C-style pseudocode might best explain how it is done:

//----y = sqrt(x)---------
  a = 2^ceil(b/2); //b is the wordlength you are using
  y = a;           //y = a is the first estimate for sqrt(x)
  for (i=0; i<ceil(b/2); i++)   //each iteration achieves one bit
  {                             //of accuracy
     a    = a*0.5;
     diff = x - y*y;
     if (diff > 0)
        y = y + a;
     else if (diff < 0)
        y = y - a;
  }

Again, there might be better choices for the starting point but the one given is safe.

Rainer Storn

DSP Trick: Polynomial Evaluation via Horner's Rule

Date: Tue, 20 Apr 1999 09:28:02 +0200
From: rainer storn <>
Newsgroups: comp.dsp
Subject: Trick: Polynomial Evaluation (update)

Name: Polynomial Evaluation

Category: Algorithmic Trick

Application: e.g. Polynomial approximation of a transcendental function

Advantages: Simple and fast

Introduction: When evaluating a polynomial it is advantageous to use Horner's rule as this nicely fits the MAC architectures of DSPs.

The Trick:

Suppose you want to compute a polynomial:

p(x)=p0 +p1*x + p2*x^2 + ... + pk*x^k (1)

Horner's rule to compute this polynomial is given by:

p(x)=p0 + (p1 + (p2 + ... + (pk)*x ... )*x)*x (2)

or in a little C-snippet

//----computes result=p(x) with p[j] = pj------
   result = 0;
   for (i=k; i>0; i--)
      result = (result + p[i])*x;
   result = result + p[0];

Another, even better possibility is:

//----computes result=p(x) with p[j] = pj------
   result = p[k];
   for (i=k-1; i>=0; i--)
      result = result*x + p[i];

Data Conversion Tricks

DSP Trick: Fast Floating-Point to mu-Law Conversion

From: Jim Thomas
Subject: Fast floating point to mu-law conversion
Date: 6 May 1999

Name: Fast floating point to mu-law conversion

Category: Algorithmic

Application: When you need to convert audio data to mu-law and don't have a lot of time to mess around.

Advantages: Very Fast (I implemented this in a seven cycle loop on a Sharc processor).

Introduction: This trick is based upon the happy coincidence that mu-law compressed data is very much like a poor man's floating point format. Mu-law is a companding method fully described in the ITU specification G.711. I believe this algorithm to be fully compliant.

The Trick: The algorithm is performed in a few steps. The steps are described below. Although I have implemented and tested this approach in assembly on a Sharc, I am not at liberty to post the code (I do not own it). This approach will work when the data is already in floating-point format as per IEEE 754. Other floating point formats (e.g. Texas Instruments) may work as well, but will require adaptation. The input data is expected to range from -1.0 to +1.0. I present a C algorithm, which will not be fast, but should get the basic idea across.

#define MAX   (8031.0/8192.0)
#define BIAS  (33.0/8192.0)
#define MAGIC (0x16f)
float x;
long  mu;
long  *x_as_long;
/* limit the sample between +/- max and take absolute value */
x = input_sample;
if ((x < -MAX) || (x > MAX))
  x = MAX;
else if (x < 0)
  x = -x;
/* add bias */
x += BIAS;
/* Extract the segment and quantization bits from the exponent and the
mantissa.  Since we have limited the range of the signal already, the
exponent will be well restricted. The pointer "x_as_long" is used to
avoid unwanted type conversion.  In assembly, this is easy. */
x_as_long = (long*)&x;
mu = (*x_as_long >> 19) & 0xff;
/* Unfortunately, mu needs a slight (but magical) adjustment */
mu = MAGIC - mu;
/* All that remains is to splice in the sign bit */
if (input_sample >= 0)
  mu |= 0x80;

DSP Trick: Gray Code Conversion

From: Jerry Avins <>
Subject: Converting Between Binary and Gray Code
Date: 28 Sep 2000
Newsgroups: comp.dsp

Name: Binary / Gray Code Conversion

Category: Algorithm

Application: Some sensors send information in Gray code. These must be converted to binary in order to do arithmetic with it. Occasionally, it is necessary to convert back.

Advantages: Higher speed or smaller code than other ways I've seen. (A neat way to do this is elusive enough that I've seen it done with a look-up table.)

Introduction: The by-the-book conversion of Gray to binary is O(N-1), where N is the word length. One of my algorithms is O(log(N)-1), where N is the word length rounded up to a power of two. The other is O(N-1) (including N-1 subroutine calls), but the code is very compact, especially if binary-to-Gray is also needed.

A Gray code is one in which adjacent numbers differ by one symbol. There are many Gray Codes, even in binary. They can be devised in any base. When Gray (or Gray code) is used without specifying which one, what is meant is reflected binary Gray. To convert binary to Gray, it is only necessary to XOR the original unsigned binary with a copy of itself that has been right shifted one place. That is easy to do with common instructions. The canonical way to convert Gray to binary is to XOR the bits one at a time, starting with the two highest bits, using the newly calculated bit in the next XOR. Ugh! I show two easier ways here. Robert Sherry wrote the actual C code. The function declarations make these shorts and the code assumes 16 bits; you can change that.

The Trick:

/* The purpose of this function is to convert an unsigned
   binary number to reflected binary Gray code. */
unsigned short binaryToGray(unsigned short num)
{
        return (num>>1) ^ num;
}

A tricky Trick: for up to 2^n bits, you can convert Gray to binary by
performing (2^n) - 1 binary-to-Gray conversions. All you need is the
function above and a 'for' loop.

/* The purpose of this function is to convert a reflected binary
   Gray code number to a binary number. */
unsigned short grayToBinary(unsigned short num)
{
        unsigned short temp = num ^ (num>>8);
        temp ^= (temp>>4);
        temp ^= (temp>>2);
        temp ^= (temp>>1);
        return temp;
}

Miscellaneous Tricks

DSP Trick: Complex Downconverters for Signals at Fs/4 or 3Fs/4

Date: Thu, 22 Apr 1999 02:52:42 GMT
From: Allan Herriman <>
Newsgroups: comp.dsp
Subject: Re: DSP Tricks
THIS WORK IS PLACED IN THE PUBLIC DOMAIN

Name: Complex downconverters for signals at Fs/4 or 3Fs/4

Category: Hardware architecture

Application: Use for QAM modem receivers in FPGAs. Don't use if the local oscillator needs to be frequency agile.

Advantages: This trick produces a trivially simple fixed frequency complex downconverter without needing an NCO or sine lookup tables


"Classical" modem receiver architecture using complex baseband processing:

passband       +---+            +---------+
signal in      |   |            | FIR     |     (baseband)
>----+-------->| X |----------->| Lowpass |---> real
     |         |   |            | Filter  |     output
     |         +---+            +---------+     I
     |           ^
     |           | Cosine
     |    +------+
     |    |
     |    |    +---+            +---------+
     |    |    |   |            | FIR     |     (baseband)
     +----|--->| X |----------->| Lowpass |---> imaginary
          |    |   |            | Filter  |     output
          |    +---+            +---------+     Q
          |      ^
+------------+   | Sine
| Quadrature |   |
| Local      |---+
| Oscillator |

The 'X' are multipliers.

[ Typically, the FIR lowpass filters will be decimating types if the signal bandwidth is much less than the sample frequency, Fs. Polyphase filters may also be used here to resample the signals at the symbol rate (or 2x the symbol rate, etc) if this is not an integer submultiple of Fs. ]

The Trick:

If the local oscillator isn't required to have a controllable frequency or phase, and it is possible to centre the input signal around Fs/4 or 3Fs/4, then the hardware can be dramatically simplified.

Local oscillator outputs at Fs/4:

Cos: +1, 0, -1, 0, +1, 0, -1, etc.
Sin:  0, +1, 0, -1, 0, +1, 0, etc.

  1. Since we only need to multiply by 0, -1 or +1, the multiplier is trivial.
  2. Every second input to each filter is 0, so the filters only need to run at Fs/2. (In this case, we only need to multiply by -1 or +1.)
  3. The multiplier can be removed altogether, and the +/- performed in the filter, either by twiddling the coefficients or by telling the accumulator to subtract.
  4. Since the Cos and Sin terms are never non-zero at the same time, it may be possible to use just one FIR filter, designed to handle I/Q interleaved data. This filter will have two accumulators which alternate between input samples.
passband       +---------+
signal in      | FIR     |---> real baseband output I
>------------->| Lowpass |
               | Filter  |---> imaginary baseband output Q

There are filter chips designed to do this, such as the HSP43168 from Harris. (Although there are better chips for new designs...) This trick also potentially saves routing in FPGAs.

Note: if this is a QAM receiver, then a separate carrier phase correction stage after the lowpass filters will be required. Also, any frequency errors in the input signal will be present during filtering. The solution is to:

  1. Tolerate this problem (after quantifying its effects).
  2. Use an AFC loop which controls the frequency error before downconversion.
  3. Don't use this trick - use an NCO as the LO.

DSP Trick: Sinusoidal Tone Generator

Subject: Trick : Tone Generation
From: Darrell <>
Date: 1999/04/19
Newsgroups: comp.dsp

Name: Sinusoidal tone generator

Category: Algorithm

Application: When you need to generate a continuous tone of a specific frequency and amplitude.

Advantages: Fast and accurate.


I've seen this trick in a few different places, although never the (very simple) derivation. It is based on the z-transform for sin(wn).

The Trick:

                                     -1
                            sin(w)*z
   y(w) = F{sin(wn)} = ----------------------       [1, pg. 159]
                                     -1    -2
                        1 - 2*cos(w)*z   + z

                        -1    -2            -1
   y(w)*[1 - 2*cos(w)*z   + z  ] = sin(w)*z

Taking the inverse z-transform:

   y[n] - 2*cos(w)*y[n-1] + y[n-2] = sin(w)*delta[n-1]
   y[n] = 2*cos(w)*y[n-1] - y[n-2]

Solving for the initial conditions, which absorb the impulse term so the recursion can start at n=1 (y[0] and y[-1] are trivial to compute):

   y[0]  = sin(0) = 0
                              Ft              Ft
   y[-1] = sin(-w) = sin(-2*pi*--) = -sin(2*pi*--)
                              Fs              Fs

where Ft is the tone frequency and Fs is the sampling frequency. For example, to generate a 1 kHz tone with a sampling frequency of 8 kHz:

   y[0]   = 0
   y[-1]  = -1/sqrt(2) ~= -0.7071
   cos(w) = 1/sqrt(2) ~= 0.7071
   y[n] = 1.414*y[n-1] - y[n-2]
   n |  y[n-2] |  y[n-1] |   y[n]
   1 | -0.7071 |  0.0000 |  0.7071
   2 |  0.0000 |  0.7071 |  1.0000
   3 |  0.7071 |  1.0000 |  0.7071
   4 |  1.0000 |  0.7071 |  0.0000
   5 |  0.7071 |  0.0000 | -0.7071
   6 |  0.0000 | -0.7071 | -1.0000
   7 | -0.7071 | -1.0000 | -0.7071
   8 | -1.0000 | -0.7071 |  0.0000

Note that all calculations were done with 64-bit floating point and rounded by hand.

The z-transform of this example shows two poles on the unit circle at +/- pi/4 radians (at +/- w in general). I've never encountered stability issues but care should be taken to evaluate quantization effects on the system.


[1] Discrete-Time Signal Processing [Opp89]


DSP Trick: Fixed-Point DC Blocking Filter with Noise-Shaping

From: robert bristow-johnson <>
Subject: Fixed-Point DC Blocking Filter with Noise-Shaping
Date: 22 Jun 2000 00:00:00 GMT, Updated: 17 Apr 2001
Newsgroups: comp.dsp

Name: Fixed-Point DC Blocking Filter With Noise-Shaping

Category: Algorithmic

Application: Used anytime DC offsets need to be removed (audio level metering or nearly any audio signal processing to avoid clicks). Since A/D converters and their front end circuitry cannot be expected to be totally DC free, this is useful partly to ensure that 'silence' is equivalent to 'zero'.

Advantages: Utterly wipes out DC without introducing its own DC


The use of a digital differentiator along with a 'leaky' integrator (essentially a one-pole LPF) to accomplish DC blocking is nothing new. It is essentially a 1st order HPF.

  • Differentiator: y[n]=x[n] - x[n-1]
  • Leaky Integrator: y[n]=pole*y[n-1] + x[n]
    (0 < 1-pole << 1) if pole=1, then it is a non-leaky integrator and completely undoes the differentiator.

However, doing this with a fixed-point DSP (such as the 56K) introduces a couple of problems. First the order of operations has to be considered.

Leaky Integrator first followed by differentiator: Since the differentiator truly wipes out DC no matter what the numerical issues are, this has the advantage that no DC gets into the output. The disadvantage is that if the pole is very close to 1 (which is necessary if you want your DC blocker to just block DC and other very low frequencies), the gain of the leaky integrator blows up at low frequencies, enough that saturation is virtually assured and the post differentiation will not fix that horrible distortion.

Differentiator first followed by leaky integrator: DC is killed in the first stage and the gains at low frequencies are reduced enough to guarantee that the integrator does not saturate. Also, there is _no_ quantization error introduced by the differentiator so there isn't the problem of the integrator boosting quantization noise at low frequencies. The disadvantage is that the integrator creates its own DC problems due to quantization (rounding or truncating down) and limit cycling. This limit cycling happens when the amount that y[n] is reduced by multiplying by pole is equal to 1/2 LSB and rounding returns y[n+1] back to its previous non-zero value y[n]. It gets stuck on a non-zero value that gets measurably large as pole gets close to 1.

The Trick:

We use the 2nd option above (differentiator first) and deal with the limit-cycling problem using error-shaping or noise-shaping with a zero on DC (Randy Yates called this "fraction saving"). By feeding back the quantization error that occurs in the one-pole LPF (the leaky integrator), one can design an error-to-output transfer function to have a desired frequency response (to literally shape the noise or error spectrum to steer noise away from where you don't want it to where you can tolerate it). With a zero of this noise transfer placed right on z=1 (or at DC) we force the amplitude of the error in the output to be zero at the frequency of 0 (at the expense of increasing the error amplitude at Nyquist). We literally will have infinite S/N ratio at DC. Therefore if no DC goes into the LPF (and the differentiator sees to that), then no DC comes out. Even if the quantization is not rounding to nearest but always rounding down (definitely a DC bias here) any DC introduced gets killed by the error shaping (this is what happens with 'fraction saving').

       x[n] = input to DC blocking filter (and to differentiator)
       d[n] = output of differentiator and input to LPF
       y[n] = output of LPF and of DC blocking filter
       e[n] = quantization error of LPF (or leaky integrator)
       d[n] = x[n] - x[n-1]   (no quantization error)
       y[n] = Quantize{ pole*y[n-1] + d[n] - e[n-1] }
       e[n] = y[n] - (pole*y[n-1] + d[n] - e[n-1])
       y[n] = (pole*y[n-1] + d[n] - e[n-1])            + e[n]
       y[n] = (pole*y[n-1] + x[n] - x[n-1] - e[n-1])   + e[n]
       y[n] = (y[n-1] - (1-pole)*y[n-1] + x[n] - x[n-1] - e[n-1]) + e[n]

I got this down to 3 instructions per sample in a 56K (using sample buffering, another trick, and pipelining) but here is what it might look like using C code:

// let's say sizeof(short) = 2 (16 bits) and sizeof(long) = 4 (32 bits)
short x[], y[];
long acc, A, prev_x, prev_y;
double pole;
unsigned long n, num_samples;
pole = 0.9999;
A = (long)(32768.0*(1.0 - pole));
acc = 0;
prev_x = 0;
prev_y = 0;
for (n=0; n<num_samples; n++)
{
    acc   -= prev_x;
    prev_x = (long)x[n]<<15;
    acc   += prev_x;
    acc   -= A*prev_y;
    prev_y = acc>>15;               // quantization happens here
    y[n]   = (short)prev_y;
    // acc has y[n] in upper 17 bits and -e[n] in lower 15 bits
}

DSP Trick: Dealing With Propagating Truncation Errors

Subject: Dealing with propagating truncation errors
From: Ole Wolf
Date: 1999/05/26
Newsgroups: comp.dsp

Name: Dealing with propagating truncation errors.

Category: Algorithmic tricks.

Application: When values are truncated, and the truncated values are used later in the algorithm, ultimately corrupting the output.

Advantages: You get an acceptable output without having to switch to a completely different algorithm, and generally without having to sacrifice too much execution time.

Introduction: When you truncate or round the value of, say, a 32-bit product register to 16 bits, you may wind up getting pesky little rounding artifacts due to the reduced precision. Although it isn't a problem in most applications, sometimes the small rounding errors do propagate through subsequent stages of an algorithm and eventually corrupt the output. Here are a couple of suggestions that can help you get a cleaner output.

The Trick:

  1. Use extended-precision arithmetic, such as larger-width multiplies and add-with-carry operations, but that's computationally expensive. I think this is the most common way to kill those pesky errors that begin to accumulate. Processor support for extended-precision arithmetic includes:
    • add with carry and subtract with borrow
    • multiplication of signed/signed, signed/unsigned, and unsigned/unsigned operands
  2. Create "leakage." This means that at strategic locations in an algorithm you multiply with (1-epsilon) so that a value continuously becomes slightly smaller than it was supposed to be. If you do that in a feedback loop, the errors that are feeding back will fade out.
  3. Dither the signal. You may find instability caused by a limit cycle appearing for zero (i.e., smaller magnitude than one LSB) or constant input. Or, an algorithm may "stall" because an input signal is too weak to excite the variables in an algorithm. Both problems may be eliminated by dithering the variables in the algorithm. You do that by adding white noise to the variables.
  4. In Handbook of Digital Signal Processing: Engineering Applications [Ell87], Douglas F. Elliot discusses "error feedback". As far as I remember, this is a technique where you essentially "save" the truncation error of a previous iteration and add it in the next iteration, effectively introducing some redundancy. This redundancy effectively creates extended precision for you with less computational effort.
  5. If everything gets out of hand, have some error detection mechanism built into the algorithm (say, to check for magnitude of a certain variable) and re-initialize your algorithm by resetting coefficients or what else you can think of resetting.