Basic statistical concepts for digital signal processing

2022-03-09T17:43:21.473769

information here has been gathered from various sources, most notably the great dspguide.com. Code blocks have been rewritten by me in Ruby, Rust, and Lua and their accuracy has not been tested.

Here we begin to gather some rough ideas on statistics that will be helpful as we dive deeper in DSP.

In electronics the mean is called the DC value. For AC, the signal is referred to from how much it fluctuates from the mean. if the signal is a simple waveform, it can be described simply by its peak to peak amplitude, but most signals have a random nature at which point we use standard deviation.

array = [set of numbers]

MEAN = array.sum() / array.size()

VARIANCE = SIGMA**2 =
    array.map(|e| e - MEAN).sum() ** 2 / array.size() - 1
    // the power of the fluctuation

SIGMA**2 =
    (array.map(|e| e**2).sum() - array.sum()**2 / (array.size()) / (N - 1)

STANDARD_DEVIATION = (SIGMA**2).sqrt() = SIGMA
    // how much a signal fluctuates

TYPICAL_ERROR = SIGMA / N**(1/2)

Standard deviation only measures the AC component of a signal. rms (root-mean-squared) value measures both AC and DC components of a signal

// running stats are leaner computationally
sum = 0
sum_squares = 0
elements = [set of numbers]
RUNNING_STATS = SIGMA**2 = elements.each_with_index{|e,i|
    sum += e
    sum_squares += e**2
    mean = sum / i
    standard_dev = i.eql?(1) ? 0 : (Math.sqrt(sum_squares - sum**2/i)) / (i - 1)
    puts mean
    puts standard_dev
}

In some cases the mean describes what is being measured and the standard deviation represents noise. In these situations we describe the signal to noise ratio which is the mean divided by the standard deviation. Another way to describe this is: coefficient of variation (CV) which is the (standard deviation / mean) * 100%

Statistics is the science of interpreting numerical data, while probability is used in DSP to understand the processes that generate signals. A distinction between the acquired signal and the underlying process is key.

In probability theory the Strong Law of Large Numbers tells us that the error of a signal = 0 as N approaches infinity; so small number of samples will always have a larger error than larger number of samples.

This concept of error is interesting when we think about what calculating a mean for a signal truly means. We are not calculating the mean for the underlying process, just the mean for N sampled signal, and we know this sampled signal will contain statistical noise. This is where the N-1 in the standard deviation comes from, this compensation (compared to simply N) reduces the noise in small sample sizes. As N gets larger N-1 becomes more and more irrelevant (compared to N).

Calculating the mean and standard deviation can be computationally intensive. We can use a histogram to make this calculation more simplistic. To get a histogram we make a table where the index of the table is the number of possible values a sample can have. For an 8bit signal that value will be 256. The y-value of each x index is the number of times that sample value occurs in our signal.

sample histogram table = {
  sample_value1: # occurrences in signal,
  sample_value2: # occurrences in signal,
  etc
}

// in action

H = { table of values }

N = 0
for i=1,#H {
    N += H[i]
}
N = N

MEAN = 0
for i=1,#H {
    MEAN += H[i] * i
}
MEAN = MEAN/N

SD = 0
for i=1,#H {
    SD += H[i] * (i - MEAN)**2
}
SD = SD / (N-1)

pmf (probability mass function; for use on digital signals), pdf (probability density function; for use with analog signals) can be inferred from a histogram, as they are related concepts. If instead of the number of occurrences (y-axis) you replace it with the probability of occurrences: you go from the histogram to the pmf. the sum of values of the histogram is equal the number of samples, the sum of values of the pmf is equal to 1. the sum of the area under the curve of the pdf is equal to 1.

the pdf of a few common signals: square wave: two infinitesimally small spikes representing the two possible values. triangle wave: will have a uniform distribution between its peak to peak values. random noise: so-called normal distribution or gaussian bell curve.
the integral of the pdf (called the cumulative distribution function cdf or psi(x) where x is the standard deviation) is used to find the probability that a signal will be within a certain range.

fn binned_histogram(signal: [f64;25000]) -> [i64;999] {
    // used when number of possible values is greater than number of samples
    // particularly when using floating point numbers

    // choosing the number of bins is balancing act
    // between x-axis and y-axis resolution.

    // let signal have possible values between 0.0 and 10.0

    let bins: [i64;999] = [0;999];
    signal.iter().for_each( |e|
        let bin_number = (e * 100.0) as usize;
        bins[bin_number] += 1;
    );
    bins
}

common distributions

fn raw_exponential_curve(x: f64) -> f64 {
    // random processes tend to have a bell shaped distribution
    (-x.powi(2)).exp()
}

fn normal_distrubtion(x:f64) -> f64 {
    // normalized gaussian distribution
    let num: f64 = (-(x - MEAN).powi(2) /
        (2.0 * STANDARD_DEVIATION.powi(2))).exp();
    let den: f64 = STANDARD_DEVIATION * (2.0 * PI).sqrt();
    num / den
}

Gaussian curves cannot be integrated by elementary methods. So the integral is calculated via numerical integration. You sample the continuous curve and add these samples up simulate integration.

In order to test DSP with signals that resemble real-world signals, we have a need to generate random noise signals that have a Gaussian pdf. The randomness of a standard random number generator has a uniform distribution. We can create less uniform distribution by adding together randomly generating numbers.

-- random number generation
-- needed for creating test signals
-- that resemble the gaussian pdf of real-world signals

function uniform_distribution
    -- mean = 0.5, standard_dev = 1/SQRT(12)
    return math.random
end

function less_uniform_distribution
    -- mean = 1; standard_dev = 1/SQRT(6)
    return math.random + math.random
end

function nearly_gaussian_distribution
    -- mean = 6.0; standard_dev = 1
    sum = math.random
    for _=1,12 do
        sum += math.random
    end
    return sum
end

function seed_to_random_algo(seed)
    -- seed will be a 1 or 0
    -- a,b,c are tuned constants; how are they chosen?
    return (a * seed + b) % c
end

an audio EQ analogy: the accuracy is the frequency, precision is the Q