Spectral Entropy and Zero Crossing Rate for Endpoint Detection of Speech Signals

tags: Speech signal processing Endpoint detection matlab algorithm

The idea behind the double-threshold method:

First, noise frames (highly random and disordered) have a higher spectral entropy than speech frames, so spectral entropy separates voiced speech from noise, and the voiced speech is kept. What remains at this stage is unvoiced speech mixed with noise. Second, unvoiced speech has a lower short-time zero-crossing rate than the noise segments, so the zero-crossing rate separates unvoiced speech from noise, and the unvoiced speech is kept as well. Together the two steps yield the complete speech segment, which gives the endpoint detection of the speech and allows the noise segments to be removed automatically.

Spectral entropy definition, as implemented in the code: for each frame, the probability of frequency bin k is p(k) = Y(k) / sum_{j=1..N/2} Y(j), and the spectral entropy is H = -sum_{k=1..N/2} p(k) * log2(p(k)). Here Y(k) is the energy of bin k and N is the FFT length.

The program follows the idea of matrix operations. First compute the FFT of the framed matrix and convert the result to dB (in the experiments, working in dB gave a clearly better result), then square it to obtain the matrix Y. Summing the first N/2 rows of each column of Y gives sumY, the denominator of the probability calculation. sumY is then expanded to N rows so that the element-wise division ./ yields the probability matrix P, which is substituted into the spectral entropy formula. The code is as follows:

freq = fft(frame_w, N); % transform each frame from the time domain to the frequency domain
spect = real(10*log10(freq)); % convert the frequency-domain result to dB
Y = spect.*conj(spect); % energy
sumY = sum(Y(1:end/2, :)); % per-frame sum over the first N/2 bins
sumY = sumY(ones(1, N), :); % expand to N rows
P = Y./sumY; % probability of each frequency bin
H = -sum(P(1:end/2, :).*log2(P(1:end/2, :))); % spectral entropy formula
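
To make the normalize-then-sum steps concrete, here is a minimal sketch that applies them to a tiny hand-made energy matrix. The matrix Y_demo and its values are invented for illustration and are not taken from oh.mat, and repmat is used for the row expansion instead of the index trick. The flat (noise-like) column comes out with a higher entropy than the peaked (speech-like) column.

```matlab
% Toy energy matrix: each column is one "frame", each row one frequency bin.
% The values are invented for illustration only.
Y_demo = [1 8;
          1 4;
          1 2;
          1 2];

% Normalize each column so that it sums to 1 (probability per bin).
colSum = sum(Y_demo, 1);
P_demo = Y_demo ./ repmat(colSum, size(Y_demo, 1), 1);

% Spectral entropy of each column, in bits.
H_demo = -sum(P_demo .* log2(P_demo), 1);

disp(H_demo)   % flat column: 2 bits, peaked column: 1.75 bits
```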

Following the double-threshold idea, the detection traverses the frame matrix once. If the spectral entropy of a frame falls below the high spectral entropy threshold, the speech segment may have been entered, and the traversal continues. If the spectral entropy of a frame falls below the low spectral entropy threshold, the speech segment has definitely been entered: record the position of that frame and search from it back toward the start of the signal. The first frame whose short-time zero-crossing rate exceeds the zero-crossing rate threshold is taken as the end of the leading noise; store its position in noiseEnd, leave the inner search, and continue the forward traversal. The trailing noise is handled in the same way. If the spectral entropy of a frame is below the low spectral entropy threshold and that of the next frame is above it, the frame is the last one inside the speech segment. Search from there toward the end of the signal; the first frame whose zero-crossing rate exceeds the threshold is recorded as noiseBegin, and all traversal stops.
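
The search can be sketched on a short synthetic sequence. The entropy values, zero-crossing counts and thresholds below are made-up numbers chosen so that the result is easy to follow by hand, and only the low entropy threshold is used, as a simplification. All variable names are illustrative.

```matlab
% Synthetic per-frame features (invented values, 12 frames).
H_seq  = [9 9 9 8 5 4 4 5 8 9 9 9];          % spectral entropy per frame
zc_seq = [30 30 20 10 5 5 5 5 10 20 30 30];  % zero-crossing count per frame
lowThr = 6;    % assumed low spectral entropy threshold
zcThr  = 24;   % assumed zero-crossing rate threshold

headEnd = 1;                  % last frame of the leading noise
tailBegin = numel(H_seq);     % first frame of the trailing noise
for i = 2:numel(H_seq)-1
    % Entropy falls below the low threshold: speech has started.
    if H_seq(i) < lowThr && H_seq(i-1) > lowThr
        for j = i-1:-1:1      % walk back toward the start
            if zc_seq(j) > zcThr
                headEnd = j;
                break
            end
        end
    end
    % Entropy rises above the low threshold: speech has ended.
    if H_seq(i) < lowThr && H_seq(i+1) > lowThr
        for j = i+1:numel(H_seq)   % walk on toward the end
            if zc_seq(j) > zcThr
                tailBegin = j;
                break
            end
        end
        break
    end
end
fprintf('leading noise ends at frame %d, trailing noise begins at frame %d\n', headEnd, tailBegin);
```

With these numbers, frames 3 to 10 are kept as speech: frames 5 to 8 pass the entropy test, and frames 3, 4, 9 and 10 (high entropy but a low zero-crossing count) are recovered by the zero-crossing test.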

The idea is clear, and the code is not difficult to write.
The complete code follows (for reference only; the voice data used in the experiment is oh.mat, and you may need to adjust the threshold parameters for your own data).
% Ocross: short-time zero-crossing rate
%~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
% Input parameters:
% frame_w framed (windowed) signal matrix, one frame per column
% Output parameters:
% zerocross zero-crossing count of each frame (row vector)
%~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
function [zerocross]=Ocross(frame_w)
[frame_length,frame_number]=size(frame_w);
zerocross=zeros(1,frame_number);
for i=1:frame_number
    u=frame_w(:,i); % take out one frame
    for j=1:frame_length-1
        if u(j)*u(j+1)<0 % adjacent samples differ in sign: a zero crossing
            zerocross(i)=zerocross(i)+1; % count it
        end % end zero-crossing test
    end % end loop over the samples of one frame
end % end loop over frames
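
The double loop can also be written without loops. The following is a sketch of an equivalent vectorized form, not the author's code; frame_demo is a made-up 4-by-2 matrix used only so the counts can be checked by hand.

```matlab
% Made-up test matrix: two frames (columns) of four samples each.
frame_demo = [ 1  1;
              -1  2;
               1 -3;
              -1 -4];

% Multiply each sample by its successor; a negative product marks a sign change.
prod_adjacent = frame_demo(1:end-1, :) .* frame_demo(2:end, :);
zc_demo = sum(prod_adjacent < 0, 1);   % count sign changes per column

disp(zc_demo)   % 3 crossings in the first frame, 1 in the second
```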

% Experiment 4 endpointDetection
% Apr.18 2020
%~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
clear
close all
clc

%~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

%~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
%
load('oh.mat','data');
fs=10000;
N=1024;

width = 3; % Width in inches
height = 3; % Height in inches
alw = 0.75; % AxesLineWidth
fsz = 13; % Fontsize
lw = 1.2; % LineWidth 1.5
msz = 7; % MarkerSize

frame_time=20e-3;
[frame_m,frame_w,frame_length,frame_shift,frame_number]=enframe(data,fs,20e-3,10e-3,'hamming');
% Call the framing and windowing function to obtain the framed matrix, the framed and windowed matrix, the frame length, the frame shift, and the number of frames
timeAxis=(1:frame_number)*10e-3; % frame times in seconds; frames advance by the 10 ms frame shift
zerocross=Ocross(frame_w);
plot(data)
legend('Original speech waveform','Location','best')
xlabel('Time (in samples)');
ylabel('amplitude');

mag=sum(abs(frame_w)); % short-time magnitude of each frame
f=(0:N/2-1)/N*fs; % frequency axis in Hz
freq=fft(frame_w,N); % transform each frame from the time domain to the frequency domain
spect=real(10*log10(freq)); % convert the frequency-domain result to dB
Y=spect.*conj(spect); % energy
sumY=sum(Y(1:end/2,:)); % per-frame sum over the first N/2 bins
sumY=sumY(ones(1,N),:); % expand to N rows
P=Y./sumY; % probability of each frequency bin
H=-sum(P(1:end/2,:).*log2(P(1:end/2,:))); % substitute into the spectral entropy formula
figure();
plot(timeAxis,H)
legend('Spectral entropy over time','Location','best')
xlabel('Time / s');
ylabel('amplitude');
figure()
subplot(211)
plot(timeAxis,mag)
subplot(212)
plot(timeAxis,H)
EntropyHigh=max(H)*0.995; % high spectral entropy threshold
EntropyLow=min(H)*1.06; % low spectral entropy threshold
hold on
plot([timeAxis(1), timeAxis(end)], [EntropyHigh, EntropyHigh], 'r', 'LineWidth',lw, 'MarkerSize', msz);
plot([timeAxis(1), timeAxis(end)], [EntropyLow, EntropyLow], 'g', 'LineWidth',lw, 'MarkerSize', msz);
legend('Spectral entropy','High spectral entropy threshold','Low spectral entropy threshold','Location','best')
xlabel('Time / s');
ylabel('amplitude');
figure()
T=24; % zero-crossing rate threshold
plot(timeAxis,zerocross)
hold on
plot([timeAxis(1), timeAxis(end)], [T, T], 'r', 'LineWidth',lw, 'MarkerSize', msz);
legend('Zero-crossing rate','Threshold','Location','best')
xlabel('Time / s');
ylabel('amplitude');

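figure()
plot(data)
hold on
for i=2:frame_number-1 % skip the first and last frame so that H(i-1) and H(i+1) exist
    if(H(i)<EntropyLow && H(i-1)>EntropyLow) % entropy drops below the low threshold: speech begins
        noiseEnd=i-1;
        for j=noiseEnd:-1:1 % search back toward the start of the signal
            if(zerocross(j)>T)
                noiseEnd=j; % last frame of the leading noise
                break
            end
        end
    end
    if(H(i)<EntropyLow && H(i+1)>EntropyLow) % entropy rises above the low threshold: speech ends
        noiseBegin=i+1;
        for j=noiseBegin:frame_number % search toward the end of the signal
            if(zerocross(j)>T)
                noiseBegin=j; % first frame of the trailing noise
                break
            end
        end
        break
    end
end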
endIndex=round(noiseEnd*(frame_length+1)/2); % convert frame numbers to integer sample indices
beginIndex=round(noiseBegin*(frame_length+1)/2);
noise1=data(1:endIndex);
noise2=data(beginIndex:length(data));
plot(noise1,'r')
hold on
plot(beginIndex:length(data),noise2,'g')
legend('Original speech','Leading noise segment','Trailing noise segment','Location','best')
xlabel('Time (in samples)');
ylabel('amplitude');
figure()
oh_clean=data(endIndex:beginIndex);
plot(oh_clean)
legend('cleanData','Location','best')
xlabel('Time (in samples)');
ylabel('amplitude');
save('oh_clean')
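
The script relies on a custom enframe helper that is not listed here. As a rough idea of what such a helper might do, the sketch below splits a signal into 20 ms frames with a 10 ms shift and applies a Hamming window. It is a hypothetical stand-in, not the author's function: the variable names, the dummy signal, and the choice to drop an incomplete last frame are all assumptions.

```matlab
% Hypothetical stand-in for the framing step (not the author's enframe).
fs_demo = 10000;                       % sampling rate in Hz, as in the script
x_demo = (1:1000)';                    % dummy signal: 0.1 s of samples
len_demo = round(20e-3*fs_demo);       % 200 samples per frame
shift_demo = round(10e-3*fs_demo);     % 100 samples between frame starts
num_demo = floor((numel(x_demo) - len_demo)/shift_demo) + 1;  % complete frames only

% Index matrix: column k holds the sample indices of frame k.
starts = (0:num_demo-1)*shift_demo;
idx = repmat((1:len_demo)', 1, num_demo) + repmat(starts, len_demo, 1);
framed_demo = x_demo(idx);             % framed matrix, one frame per column

% Hamming window written out explicitly so that no toolbox is needed.
n = (0:len_demo-1)';
w = 0.54 - 0.46*cos(2*pi*n/(len_demo-1));
windowed_demo = framed_demo .* repmat(w, 1, num_demo);

disp(size(windowed_demo))   % 200 samples by 9 frames
```

With this layout, frame k starts at sample (k-1)*shift_demo+1, which is how a frame number can be mapped back to a sample position.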
