Samples sequence behaviour: Savitzky-Golay processing.
Introduction.
Nowadays, often and often it's possible listening to several new terms coming from IT world, one of them is: BIG DATA. Some year ago, talking about tera-bytes or peta bytes looked like talking about sci-fi, but, the evolution of hardware technologies brought us to face up huge quantity of data.
When you come across with big data world, one of the problem that you can incurr to, is, quickly understand the "behavoiur" of the data trend. In other words, when managing thousands, millions or billions of data samples, it can be very difficult to look up the key points of data samples sequence (e.g. min, max, trough, peak etc.)
The following article givesto the reader a suggestion to process samples sequence when the problem is: find maxima.
Because it's unuseful reinvent the wheel, and because after the end of WWII a plethora of algorithms were created, I chose the ingenious algorithm of Abraham Savitzky and Marcel J. E. Golay.
Savitzky - Golay.
Giving a glance at wikipedia you'll be able to see (extract)
A SavitzkyGolay filter is a digital filter that can be applied to a set of digital data points for the purpose of smoothing the data, that is, to increase the signal-to-noise ratio without greatly distorting the signal. This is achieved, in a process known as convolution, by fitting successive sub-sets of adjacent data points with a low-degree polynomial by the method of linear least squares. When the data points are equally spaced, an analytical solution to the least-squares equations can be found, in the form of a single set of "convolution coefficients" that can be applied to all data sub-sets, to give estimates of the smoothed signal, (or derivatives of the smoothed signal) at the central point of each sub-set. The method, based on established mathematical procedures, was popularized by Abraham Savitzky and Marcel J. E. Golay who published tables of convolution coefficients for various polynomials and sub-set sizes in 1964. Some errors in the tables have been corrected. The method has been extended for the treatment of 2- and 3-dimensional data.
Savitzky and Golay's paper is one of the most widely cited papers in the journal Analytical Chemistry and is classed by that journal as one of its "10 seminal papers" saying "it can be argued that the dawn of the computer-controlled analytical instrument can be traced to this article”.
Theory.
A more detailed explanation of the algorithm (a mathematical point of view) can be retrieved from the downloaded section.
In practice.
Now, we're aware that there's a faboulus algorithm able to process data and ….what... from now on?
As mentioned above, the Savitzky-Golay algorithm "substitutes" a sequence of samples with a polynomial function that best fits to it. That allows us to immediately know the hot point of the data sequence; How the algo implementation makes it possible? A way to do that is by making the derivative of the polynomial function, so that the user is allowed to know all peaks and trough of the samples range considered.
The application fields of the SG algorithm are unlimited: statistic, mathematic, chemistry (it was born in the chemistry context), physics, telecommunication and so on.
The article aims to show the power of the algorithm, examining a samples sequence of just 2048 elements; Treating thousands, millions or billions of samples it's only hardware matter; in our study case, 2048 samples suffices to demonstrate how to employ it.
The field of application is telecommunication (frequency modulation):
processing samples coming from the E4000-RTL2832U tuner IC of a NooElec NESDR SMART XR, a software defined radio device.
The aim of the algorithm in this case study is finding the maxima of the samples in the 87.5-108 MHz so that the station tuning is made automatically.
Even the C and C++ are old languages the writer is still fascinated from them; therefore the code that processes samples is written in simple C++.
The code is made by three files: sgsmooth.h, sgsmooth.cpp and the main. The first two are the implementation of the Savitsky-Golay algorithm, the last is the orchestrator.
Tipically when a programming language is used in an applied field every sort of elegance is set aside. This is an example. But, on the other side, the reader is going to be focused on the gist of the matter. A spartan effective code is better then an elegant unuseful one.
Let's proceed.
Within the main a samples sequence of 2048 is provided; the struct spectrumPointInfo brings inside, the info about the points, that is, its abscissa, its amplitude and its derivative.
typedef struct spectrumPointInfo {
int x;
double amplitude;
double derivative;
spectrumPointInfo& operator=(spectrumPointInfo _spectrumPointInfo)
{
this->x = _spectrumPointInfo.x;
this->amplitude = _spectrumPointInfo.amplitude;
this->derivative = _spectrumPointInfo.derivative;
return *this;
}
} StructSpectrumPointInfo;
The SG window processed each time, is made by 30 samples, and the polynomial degree has been set to 5. The main is an orchestrator, so, it makes use not only of the SG algorithm, but other minor elaborations are made to achieve the goal; for example, the sequence of samples is processed two times in amplituted, to obtain a smoother final function. The other processing blocks are so simple that the reader won't have difficult to understand them.
It's worth noting the makeAVG function that makes an arithmetical average of the data, pre_elaborate that subtractes from a sequence of samples its average and doCalculateMax that is the real orchestrator.
//make average of all samples
double makeAVG(double* points, int array_lentgth) {
double average = 0.0;
for(int a=0;a
0 ? (points[a]-average) : 0 ;
}
return points;
}
The output of the main will be 4 files:
calc_sgsmooth.txt, giving the points of the smoothed function
calc_sgsderiv.txt, giving the derivatives
max.txt, giving all points sorted for amplitude
max_chosen.txt, giving the best amplitude points
to compile the code, go into the project dir and type:
ivan@beyond-macbook:~/Desktop/sg_ice\-> g++ -o sg_ice main.cpp sgsmooth.cpp -Wall -fPIC -std=c++1
to launch the process, type:
ivan@beyond-macbook:~/Desktop/sg_ice\-> ./sg_ice
Within the companion excel, the reader can see three columns of numbers: the original samples sequence, the sequence of sample elaborated, the derivative sample sequence and on the right the realted charts.
In the download section the reader can retrieve a zip file containing:
- the project
- data_demodulator_comparison_2018_11.xlsx
- savgol.pdf
Savitzky-Golay is one of the innumerable algorithms available from the data processing discipline, but, its semplicity and its ingenious are underestimated and often forgotten; it is a powerfull tool to analyze every sort of data in a digital context.
That's all, thanks you for reading.