If you read these posts regularly you will know that any attempt to digitize a music signal - to reduce it to a finite numerical representation - will ultimately result in some degree of quantization error. This is because you cannot represent the waveform with absolute precision. It is a bit like trying to express 1/3 as a decimal (0.3333333 …… ) - the more 3’s you write, the more accurate it is and the smaller the ‘quantization error’ - but the error is still there. This quantization error comprises both noise and distortion. The distortion components are those which are related to the signal itself (mathematically we use the term ‘correlated’), and generally can be held to represent sonic defects. The noise is unrelated to the signal (mathematically ‘uncorrelated’) and represents the sort of background noise that we can in practice “tune out” without it adversely affecting our perception of the sound quality.
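A minimal sketch makes the "correlated" point concrete (the `quantize` helper and the step size are my own illustration, not any particular system):

```python
import math

def quantize(x, step=1.0):
    """Round to the nearest quantization level -- no dither."""
    return step * round(x / step)

n = 64
signal = [2.5 * math.sin(2 * math.pi * k / n) for k in range(n)]
error = [quantize(s) - s for s in signal]

# The error never exceeds half a step...
assert all(abs(e) <= 0.5 for e in error)
# ...and it is completely determined by the signal: quantize the same
# waveform again and the exact same "noise" comes out. That is what
# "correlated with the signal" means -- it is distortion, not noise.
assert error == [quantize(s) - s for s in signal]
```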

The way to eliminate the distortion caused by quantization is, counterintuitively, to add some more noise - though not so much as to simply drown the distortions out. It turns out that if we add just the right amount of noise, it doesn’t so much bury the distortion as cause it to shrink: done right, it shrinks to a level just below the newly-added noise floor. Of course, this noise floor is now slightly higher than before, but the result is perceived to sound better than the lower noise level with the higher distortion. The process of deliberately adding noise is called ‘dither’, and we can mathematically analyze exactly how much noise, and what type of noise, is necessary to accomplish the desired result. The answer is ‘TPDF’ (triangular probability density function) dither - it doesn’t matter if you don’t know what that means - at the level of the Least Significant Bit (LSB). This means that the greater the Bit Depth of your signal, the less noise you have to add to ensure the absence of distortion components in the quantization error.
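Here is a sketch of TPDF dither at work (helper names and signal levels are my own, for illustration): an undithered signal smaller than half an LSB quantizes to dead silence, while a dithered one survives in the statistics.

```python
import math, random

def quantize(x):
    return float(round(x))          # 1-unit steps, nearest level

def tpdf(lsb=1.0):
    # Triangular (TPDF) dither: the sum of two independent uniform variables.
    return random.uniform(-lsb / 2, lsb / 2) + random.uniform(-lsb / 2, lsb / 2)

random.seed(1)
n = 32
sine = [0.4 * math.sin(2 * math.pi * k / n) for k in range(n)]

# Undithered, a signal smaller than half an LSB quantizes to pure silence:
assert all(quantize(s) == 0.0 for s in sine)

# Dithered, the error becomes signal-independent noise with zero mean, so
# averaging many dithered passes recovers the waveform:
trials = 50_000
avg = [sum(quantize(s + tpdf()) for _ in range(trials)) / trials for s in sine]
assert all(abs(a - s) < 0.02 for a, s in zip(avg, sine))
```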

Explaining and understanding exactly why that works is beyond the scope of this post, but I should point out that the analysis leads to some deeper and more profound insights, the implications of which I want to talk about. Essentially, the idea of dither is this: when you digitize an analog signal (or reduce the bit depth of a digital signal - same thing) you are not constrained to always choose the nearest quantization level. Sometimes good things can happen if you instead choose a different quantization level, as we shall see.

One thing that is easy to grasp is the concept of averaging. If you count the number of people who live in a house, the answer is always an integer number. But if you average over several houses, the average number of occupants can be a fractional number - for example 2.59. Yet you will never look in an individual house and see 2.59 people. It is the same with digital audio. By measuring something multiple times, you can get an “average” value, which has more precision than the bit depth with which the values are measured. In digital audio we call this “oversampling”.
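The houses analogy translates directly into code (all numbers illustrative): no single rounded sample can hold a fractional value, but the average of many can.

```python
import random

occupants = [2, 3, 1, 4, 3, 2, 3, 2, 4, 2]     # whole people per house
assert sum(occupants) / len(occupants) == 2.6  # a fractional average

# The same idea with samples: each measurement is rounded to a whole
# number, but a little noise varies the rounding direction, so the
# average recovers precision the individual samples don't have.
random.seed(2)
true_value = 2.59
samples = [round(true_value + random.uniform(-0.5, 0.5)) for _ in range(10_000)]
assert set(samples) <= {2, 3}                  # no house contains 2.59 people
estimate = sum(samples) / len(samples)
assert abs(estimate - true_value) < 0.02       # ...but the average does
```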

Recall also that in order to digitally sample an analog waveform, we need a sample rate which is at least twice the highest frequency present in the waveform. An audio waveform contains many frequencies, ranging from deep bass to high treble, so the sampling frequency must be at least twice that of the highest treble frequencies. Clearly, therefore, the sampling frequency is going to be many, many times higher than what we would need to capture the lower frequencies alone. You could argue, therefore, that the lowest frequencies are highly oversampled, and that the possibility therefore ought to exist to record their content at a precision which, thanks to "averaging", is greater than the nominal bit depth. And you would be right.
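To put rough numbers on that (using the red-book rate as an example):

```python
sample_rate = 44_100                 # red-book rate, chosen for ~20 kHz treble

def nyquist_rate(f):
    """Minimum sample rate needed to capture frequency f (twice f)."""
    return 2 * f

assert sample_rate >= nyquist_rate(20_000)

# A 100 Hz bass note would only need a 200 Hz sample rate on its own,
# so at 44.1 kHz it is oversampled by a factor of about 220:
oversampling = sample_rate / nyquist_rate(100)
assert oversampling == 220.5
```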

Noise shaping takes advantage of the fact that the lower frequencies are inherently over-sampled, and allows us to push the background noise level at these lower frequencies down below what would otherwise be the limit imposed by the fixed bit depth. In fact it even allows us to encode signals below the level of the LSB, right down to that noise floor. You would think that wouldn’t be possible, but it is, because the low frequencies are quite highly oversampled. In effect, you can think of the low frequency information as being encoded by averaging it over a number of samples. In reality it is a lot more complicated than that, but the simplistic picture is essentially correct.

Like playing a flute, actually doing the noise shaping is a lot more difficult than talking about how to do it. A noise shaping circuit (or, in the digital domain, algorithm) is conceptually simple. You take the output of the quantizer and subtract it from its input. The result is the quantization error. You pass that through a carefully designed filter and subtract its output in turn from the original input signal. You are in effect putting the quantization error into a negative feedback loop. In designing such a noise shaper, though, you must not ask it to do the impossible, otherwise it won’t work and will go unstable. What you must do is recognize that only the low frequencies can benefit from noise shaping, so the filter must be a low-pass filter, and pass only the low frequency components of the quantization error through the feedback loop. This negative feedback in effect tries to reduce the quantization error only at those low frequencies.
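The loop described above can be sketched in a few lines. This is a toy first-order version in which the feedback "filter" is reduced to a bare one-sample delay - real shapers use the carefully designed filters the text describes - but it shows the error-feedback structure, and even encodes a sub-LSB sine:

```python
import math

def quantize(x):
    return float(round(x))

def noise_shape(signal):
    # First-order error-feedback loop: subtract the previous sample's
    # quantization error from the current input before quantizing.
    out, err = [], 0.0
    for x in signal:
        y = quantize(x - err)        # feed the error back, negatively
        err = y - (x - err)          # the error actually made this sample
        out.append(y)
    return out

n = 256
sig = [0.3 * math.sin(2 * math.pi * k / n) for k in range(n)]  # sub-LSB sine
shaped = noise_shape(sig)

# Plain rounding silences this signal entirely:
assert all(quantize(s) == 0.0 for s in sig)

# The shaped stream still contains only whole levels...
assert set(shaped) <= {-1.0, 0.0, 1.0}

# ...yet its component at the sine's frequency is still ~0.3: the loop
# has pushed the quantization error away from the low-frequency bin.
re = sum(shaped[k] * math.cos(2 * math.pi * k / n) for k in range(n))
im = sum(shaped[k] * math.sin(2 * math.pi * k / n) for k in range(n))
amplitude = 2 / n * math.hypot(re, im)
assert abs(amplitude - 0.3) < 0.05
```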

But there’s no free lunch. All of those quantization errors can’t just be made to go away. Because of the low-pass filter, the higher frequency components of the quantization error are not subject to the same negative feedback and so the actual quantization error becomes dominated by high frequency components. The low frequency components end up being suppressed at the expense of increases in the high frequency components. This is why it is called “Noise Shaping”. It would be more accurate to refer to it as “Quantization Error Shaping”, but that trips less fluidly off the tongue. What we have done is to select quantization levels that are not necessarily those with the lowest individual quantization error, but as a result have nonetheless ended up with an improved performance.

At this point, a good question to ask might be: just how much can we suppress the quantization error noise? There is an answer to that. It is referred to as the ‘Gerzon–Craven’ limit, after the authors who published the first analysis of the subject in 1989. What Gerzon and Craven showed is that if we plot the quantization noise on a dB scale against frequency on a linear scale, then as we use noise shaping to push the quantization noise floor down at the low frequency end, we plot out a new curve, and an area appears between the old and new curves. At higher frequencies, noise shaping requires us to pull the noise floor up above the existing noise floor, and again an area appears between the old curve and the new one. Gerzon and Craven tell us that the two areas must be equal. Since there is a fundamental limit on how high we can pull up the high frequency noise floor (we can’t pull it up higher than 0dB), it follows that there is a practical limit on how much we can push down the low frequency noise. In practice, too high a degree of noise shaping requires highly aggressive filters, and these bring practical problems of their own which can come to dominate the result.
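In symbols, the equal-areas statement reads roughly as follows (a simplified paraphrase of the result, not Gerzon and Craven's exact formulation; N(f) is the noise floor in dB and f_s the sample rate):

```latex
\int_{0}^{f_s/2} \left[ N_{\text{shaped}}(f) - N_{\text{original}}(f) \right] \, df = 0
```

The integrand is negative over the low frequencies where the floor has been pushed down, and positive over the high frequencies where it has been pulled up, and the two contributions must cancel.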

For a lot of applications, the high frequency area overlaps with the signal bandwidth. A perfect example is 16/44.1 “red book” audio. The available spectrum extends only up to 22.05kHz, while the audio bandwidth is taken to extend up to 20kHz. Any noise shaping done on 16/44.1 audio must therefore introduce audible high frequency noise, and so it must be done - if it is done at all - very judiciously.

There are two very important things to bear in mind about noise shaping. The first is that the high frequency content is crucial to both the low frequency noise suppression and low-level signal encoding. In a real sense, those effects are actually encoded by the high frequency noise itself. If you were to pass the noise-shaped signal through a low-pass filter that cuts out only the high frequency noise, then as soon as you re-quantized the output of the filter to the bit depth of the original signal, all of that information would be lost again.
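That loss is easy to demonstrate (a sketch with illustrative numbers): a sub-LSB level carried by dither noise survives low-pass filtering, but vanishes the moment we requantize back to the original levels.

```python
import random
random.seed(3)

def tpdf():
    return random.uniform(-0.5, 0.5) + random.uniform(-0.5, 0.5)

level = 0.3                           # a steady signal below one LSB
stream = [float(round(level + tpdf())) for _ in range(50_000)]

# The sub-LSB level survives in the stream, carried by the noise:
assert abs(sum(stream) / len(stream) - level) < 0.02

# Low-pass filter (block averaging) and the level reappears per block --
# but requantize those averages back to whole-number levels and every
# one rounds to 0.0. The information is lost again.
block = 200
filtered = [sum(stream[i:i + block]) / block
            for i in range(0, len(stream), block)]
requantized = [float(round(f)) for f in filtered]
assert all(q == 0.0 for q in requantized)
```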

The second thing is that the noise-shaped noise is now part of the signal, and cannot be separated out. This is of greatest importance in applications such as 16/44.1 where the signal and the shaped noise share a part of the spectrum. Every time you add noise-shaped dither to such a signal as part of a processing stage, you end up adding to the high frequency noise. Considering that noise shaping may easily add 20dB of high frequency noise, this is a very important consideration.

All this is fundamental to the design of DSD, which is built upon the foundation of noise shaping. A 1-bit bitstream has a noise floor of nominally -6dB, which is useless for high quality audio. But if we can use noise shaping to push it down to, say, -120dB over the audio bandwidth, then all of a sudden it becomes interesting. To do that, we need an awful lot of high frequency headroom into which to shape all the resultant noise - and because the amount by which the noise floor can be raised is strictly limited, we need something like 1,000kHz of high frequency space in which to shape it all. Enter DSD, which has 1.4MHz available, and practical SDMs can just about be designed to do the job.
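A back-of-envelope check, using the equal-areas picture from earlier and the figures above (all numbers illustrative; a real SDM's noise curve is far from this flat-average caricature):

```python
# Area of noise removed from the audio band, in dB·Hz:
audio_bandwidth_hz = 20_000
noise_push_down_db = 120 - 6          # from -6 dB down to -120 dB
area_removed = noise_push_down_db * audio_bandwidth_hz

# If that area must reappear spread over ~1,000 kHz of ultrasonic space,
# the *average* rise of the noise floor up there is modest:
hf_space_hz = 1_000_000
average_rise_db = area_removed / hf_space_hz
assert round(average_rise_db, 2) == 2.28
```

The point of the sketch is only that the sums are plausible: a huge suppression over a narrow audio band trades against a small average rise over a band fifty times wider.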

If we can double the sample rate of DSD and get what we now refer to as DSD128, or even increase it further to DSD256, DSD512, etc, then we can not only suppress the noise floor across the audio bandwidth, but also well into the ultrasonic region, so that it is totally removed from any audio content. Perhaps this is why those higher flavours of DSD have their adherents.

I want to finish with some comments related to the paragraph above where I talk about how the HF noise is integral to the LF performance gains, and how this applies to DSD. Obviously, I have to strip off the HF noise before I can play the track. But if I can’t do that without regressing to 1-bit audio with a -6dB noise floor, how is it of any practical use? The answer is that the HF content is only crucial while the signal remains in the 1-bit domain. As soon as I free it from the shackles of 1-bit representation, all bets are off. Converting it to analog is one way of releasing those shackles; I can then use an analog filter to strip off the ultrasonic noise. Converting it to a 64-bit digital format is another. In the 64-bit domain, 1 and 0 become 1.0000000000000000E+000 and 0.0000000000000000E+000 respectively, and any quantization errors all of a sudden become vanishingly small. In the 64-bit digital domain I can do all sorts of useful and interesting things, like digitally filter out all the HF noise, which is now superfluous. But if I ever want to return it to the 1-bit domain, I need to go through the whole high-performance SDM once again, which would serve to add all that noise right back in.