In this practical, we will work through the following steps.
Objectives
- Preprocess audio data to improve quality and consistency.
- Extract relevant features from speech data.
- Train a Logistic Regression model for classification.
- Evaluate model performance using accuracy and classification metrics.
Techniques Used
Audio Preprocessing
- Trimming: Remove silence from the beginning and end of the audio.
- Normalization: Ensure audio levels are consistent across files.
- Resampling: Convert all files to a consistent sampling rate.
- Padding or Truncating: Ensure all inputs are the same length.
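The padding/truncating step can be sketched in a few lines of NumPy. Here `fix_length` is a hypothetical helper written for illustration (librosa ships a similar utility, `librosa.util.fix_length`):

```python
import numpy as np

def fix_length(y: np.ndarray, target_len: int) -> np.ndarray:
    """Pad with zeros or truncate so every clip has the same length."""
    if len(y) >= target_len:
        return y[:target_len]          # truncate long clips
    pad = target_len - len(y)
    return np.pad(y, (0, pad))         # zero-pad short clips at the end

# Example: force two clips of different lengths to 8 samples each.
short = np.ones(5)
long_ = np.ones(12)
print(fix_length(short, 8).shape, fix_length(long_, 8).shape)  # (8,) (8,)
```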
Feature Extraction
We extract powerful features that capture the characteristics of the speaker's voice.
- Mel-Frequency Cepstral Coefficients: Capture timbral and phonetic content.
- Spectral Centroid: Measures the "center of mass" of frequencies.
- Spectral Rolloff: The frequency below which a set percentage (here 85%) of the energy is contained.
- Zero-Crossing Rate: Counts how often the signal changes sign; higher for noisy or unvoiced sounds.
- RMS Energy: Captures the loudness of the signal.
Model
We use a Logistic Regression model to classify audio based on extracted features. The model is trained on labeled examples of male and female voices.
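A minimal sketch of the training step with scikit-learn. The feature matrix here is random numbers, a stand-in for the real extracted features (MFCC means, centroid, rolloff, and so on):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy stand-in for the real feature matrix: one row per clip,
# one column per extracted feature.
rng = np.random.default_rng(0)
X_male = rng.normal(loc=-1.0, size=(50, 4))
X_female = rng.normal(loc=1.0, size=(50, 4))
X = np.vstack([X_male, X_female])
y = np.array([0] * 50 + [1] * 50)   # 0 = male, 1 = female

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"test accuracy: {clf.score(X_te, y_te):.2f}")
```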
Evaluation
The final model is evaluated using:
- Accuracy
- Precision, Recall, F1-Score
- Confusion Matrix
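All three metrics are available in scikit-learn. The predictions below are made up purely to show the calls:

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical predictions on a 10-clip test set (0 = male, 1 = female).
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 0, 1, 1, 1, 0, 1]

print("accuracy:", accuracy_score(y_true, y_pred))   # 0.8
print(confusion_matrix(y_true, y_pred))              # rows = true, cols = predicted
print(classification_report(y_true, y_pred, target_names=["male", "female"]))
```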
Note: All the techniques and tools used in this practical, such as precision, recall, F1-score, machine learning basics, and EDA, have already been covered in theory. If anything in this topic is unclear, please read the previous article first.
Practical
Install the required pip packages.
![Install Required Pip Files]()
Importing Libraries
![Importing Libraries]()
Experimenting on a Sample
Let's take a sample and use it for experimenting and visualizing.
![Sample]()
![Sample2]()
Audio Processing
1. Load Audio File
![Load Audio File]()
![Audio Wave]()
2. Trim Silence
Removes unnecessary silence from the beginning and end of the audio. This helps eliminate parts of the audio that contain no useful information.
We use the librosa.effects.trim() function with a parameter top_db, which specifies how many decibels below the peak should be considered silence.
"Decibels" is a logarithmic unit used to measure sound level.
- Lower top_db (e.g., 20) is more aggressive: even moderately quiet parts are treated as silence and trimmed.
- Higher top_db (e.g., 60) is stricter: only very quiet parts are trimmed.
We will choose 35, which is roughly in the middle.
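A minimal NumPy sketch of the idea behind librosa.effects.trim(): keep everything between the first and last "loud" sample. (The real librosa function works on frame-wise RMS rather than individual samples; `trim_silence` here is a hypothetical helper.)

```python
import numpy as np

def trim_silence(y: np.ndarray, top_db: float = 35.0) -> np.ndarray:
    """Keep the span between the first and last sample louder than
    (peak - top_db) dB; everything outside is treated as silence."""
    peak = np.max(np.abs(y))
    threshold = peak * 10 ** (-top_db / 20)   # amplitude ratio for -top_db dB
    loud = np.flatnonzero(np.abs(y) > threshold)
    if loud.size == 0:
        return y
    return y[loud[0]:loud[-1] + 1]

# Example: quiet padding around a burst of signal gets removed.
sig = np.concatenate([np.full(100, 1e-4), np.full(50, 0.5), np.full(100, 1e-4)])
print(len(trim_silence(sig, top_db=35)))  # 50
```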
![Trim Silence]()
3. Noise Reduction
Reduces background noise such as hums, hisses, or ambient sounds using filters or noise reduction algorithms.
- High-pass filters remove low-frequency noise.
- The noisereduce library can automatically detect and reduce noise.
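The high-pass idea can be sketched with SciPy's Butterworth filter. The 80 Hz cutoff below is an assumed value chosen to sit under typical speech fundamentals; for automatic noise reduction, the noisereduce library's nr.reduce_noise(y=..., sr=...) is the more hands-off option.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def highpass(y, sr, cutoff_hz=80.0, order=4):
    """4th-order Butterworth high-pass: attenuates hum and rumble below cutoff."""
    b, a = butter(order, cutoff_hz, btype="highpass", fs=sr)
    return filtfilt(b, a, y)   # zero-phase filtering (no time shift)

# Example: a 50 Hz mains hum on top of a 1 kHz tone; the hum gets attenuated.
sr = 16000
t = np.arange(sr) / sr
hum = 0.5 * np.sin(2 * np.pi * 50 * t)
tone = 0.5 * np.sin(2 * np.pi * 1000 * t)
clean = highpass(hum + tone, sr)
```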
![Noise Reduction]()
![Resized]()
4. Normalization
Ensures all audio signals are on the same volume scale by scaling the waveform so its peak is consistent across samples.
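Peak normalization is a one-liner; `peak_normalize` below is a hypothetical helper written for illustration:

```python
import numpy as np

def peak_normalize(y: np.ndarray) -> np.ndarray:
    """Scale the waveform so its loudest sample has magnitude 1.0."""
    peak = np.max(np.abs(y))
    return y / peak if peak > 0 else y

quiet = np.array([0.0, 0.1, -0.2, 0.05])
print(np.max(np.abs(peak_normalize(quiet))))  # 1.0
```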
5. Resampling
Resamples all audio to the same sampling rate (16,000 Hz), which ensures uniformity across the dataset.
Sample rate: how many samples per second are used to represent the audio.
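As a runnable sketch of resampling, here is SciPy's resample_poly used as a stand-in for librosa.resample (both convert between sampling rates; the 44,100 Hz source rate is an assumed example):

```python
import numpy as np
from scipy.signal import resample_poly

# Example: downsample one second of 44,100 Hz audio to 16,000 Hz.
sr_orig, sr_target = 44100, 16000
t = np.arange(sr_orig) / sr_orig
y = np.sin(2 * np.pi * 440 * t)            # a 440 Hz test tone

# resample_poly takes an up/down integer ratio: 16000/44100 = 160/441.
y_16k = resample_poly(y, up=160, down=441)
print(len(y), "->", len(y_16k))            # 44100 -> 16000
```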
Let's look at the audio wave at a more zoomed level.
![Resampling]()
Feature Extraction
Now let's look at various feature extraction techniques.
1. Spectrogram
Shows how the frequencies of the audio signal change over time.
- X-axis: time
- Y-axis: frequency
- Color: amplitude of each frequency at that moment
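A spectrogram is computed with a short-time Fourier transform (STFT). A minimal sketch with scipy.signal.stft (plotting the magnitude, e.g. with matplotlib's pcolormesh, produces the picture):

```python
import numpy as np
from scipy.signal import stft

sr = 16000
t = np.arange(2 * sr) / sr
y = np.sin(2 * np.pi * 440 * t)            # 2 s of a 440 Hz tone

# STFT: split the signal into overlapping windows and FFT each one.
freqs, times, Z = stft(y, fs=sr, nperseg=512)
magnitude = np.abs(Z)                      # amplitude per (frequency, time) bin

# The brightest row should sit near 440 Hz for the whole clip.
peak_freq = freqs[np.argmax(magnitude.mean(axis=1))]
print(f"dominant frequency ~ {peak_freq:.0f} Hz")
```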
![Spectrogram]()
2. Mel Spectrogram
Similar to a regular spectrogram, but the frequency axis is scaled to match how humans hear (the Mel scale).
It focuses more on low to mid frequencies, which are most important for speech and music.
![Mel Spectrogram]()
3. Spectral Centroid
The Spectral Centroid tells us where the "center of mass" of the sound's frequencies is; in other words, how "bright" or "dark" a sound is.
- If most energy is in high frequencies, the centroid is high and the sound is bright or sharp (like cymbals).
- If most energy is in low frequencies, the centroid is low and the sound is dull or bassy (like drums or male voices).
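The centroid is just a magnitude-weighted mean frequency. A clip-level NumPy sketch (librosa's spectral_centroid works frame by frame instead; `spectral_centroid` here is a hypothetical helper):

```python
import numpy as np

def spectral_centroid(y: np.ndarray, sr: int) -> float:
    """Magnitude-weighted mean frequency of the whole clip."""
    mag = np.abs(np.fft.rfft(y))
    freqs = np.fft.rfftfreq(len(y), d=1 / sr)
    return float(np.sum(freqs * mag) / np.sum(mag))

sr = 16000
t = np.arange(sr) / sr
low = np.sin(2 * np.pi * 200 * t)     # bassy tone
high = np.sin(2 * np.pi * 4000 * t)   # bright tone
print(spectral_centroid(low, sr))     # ~200 Hz
print(spectral_centroid(high, sr))    # ~4000 Hz
```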
![Spectral Centroid]()
4. Spectral Rolloff
Tells us the frequency below which 85% of the total spectral energy is contained.
Example
Male Voice (Deep, Low-pitched)
- Most energy is in low frequencies.
- We might reach 85% of the energy by 2000 Hz.
- So the Spectral Rolloff is low.
Female Voice (High-pitched)
- Energy is spread into higher frequencies.
- We may need to go up to 5000 Hz to reach 85% of the energy.
- So the Spectral Rolloff is higher.
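The rolloff computation can be sketched in NumPy as a minimal stand-in for librosa's frame-wise spectral_rolloff (`spectral_rolloff` here is a hypothetical clip-level helper):

```python
import numpy as np

def spectral_rolloff(y: np.ndarray, sr: int, roll_percent: float = 0.85) -> float:
    """Lowest frequency below which `roll_percent` of the spectral energy lies."""
    energy = np.abs(np.fft.rfft(y)) ** 2
    freqs = np.fft.rfftfreq(len(y), d=1 / sr)
    cumulative = np.cumsum(energy)
    idx = np.searchsorted(cumulative, roll_percent * cumulative[-1])
    return float(freqs[idx])

sr = 16000
t = np.arange(sr) / sr
# A "deep voice" stand-in: strong 150 Hz fundamental plus a weak 3 kHz component.
deep = np.sin(2 * np.pi * 150 * t) + 0.1 * np.sin(2 * np.pi * 3000 * t)
print(spectral_rolloff(deep, sr))   # close to 150 Hz: most energy is low
```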
![Spectral Rolloff]()
5. MFCC (Mel-Frequency Cepstral Coefficients)
Captures the overall shape of the audio spectrum in a way that mimics human hearing. Commonly used in speech and music analysis.
![MFCC]()
6. RMS (Root Mean Square Energy)
Captures the energy or loudness of the signal over time. Useful for understanding how powerful the sound is at each frame.
- High RMS values = loud parts (speech, music, noise).
- Low RMS values = silence or quiet parts.
It helps in voice activity detection, emotion recognition, and even trimming silent segments.
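Frame-wise RMS is simple to compute by hand; `frame_rms` below is a hypothetical helper mirroring what librosa.feature.rms does (frame length 2048 and hop 512 are librosa's defaults):

```python
import numpy as np

def frame_rms(y: np.ndarray, frame_len: int = 2048, hop: int = 512) -> np.ndarray:
    """Root-mean-square energy of each overlapping frame."""
    frames = [y[i:i + frame_len]
              for i in range(0, len(y) - frame_len + 1, hop)]
    return np.array([np.sqrt(np.mean(f ** 2)) for f in frames])

# Loud first half, silent second half: RMS drops accordingly.
y = np.concatenate([np.full(8192, 0.5), np.zeros(8192)])
rms = frame_rms(y)
print(rms[0], rms[-1])   # 0.5 for the loud start, 0.0 at the silent end
```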
![RMS]()
![RMS]()
![RMS3]()
Final Preprocessing of Data
Audio preprocessing and loading: the audio files are stored in local Male and Female folders. I used only 9 male and 9 female voice recordings.
![Audio Preprocessing]()
Even though each folder contains only 9 files here, we will use parallel processing to speed up loading; this matters much more for larger datasets.
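A sketch of the parallel-processing idea using the standard library. The file names and the `preprocess` function are hypothetical placeholders; in the real notebook each call would load the file (e.g., with librosa), trim, normalize, resample, and extract features:

```python
from concurrent.futures import ThreadPoolExecutor

def preprocess(path: str) -> dict:
    """Hypothetical per-file pipeline: load, clean, and extract features.
    Here it only fabricates a row so the sketch runs without audio files."""
    label = "female" if "female" in path else "male"
    return {"path": path, "label": label}

male_files = [f"male_folder/m{i}.wav" for i in range(9)]
female_files = [f"female_folder/f{i}.wav" for i in range(9)]

# Run the per-file work concurrently instead of one file at a time.
with ThreadPoolExecutor(max_workers=4) as pool:
    rows = list(pool.map(preprocess, male_files + female_files))

print(len(rows))   # 18 rows, one per audio file
```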
![Folders]()
Now that we have loaded the two folders, we can create our data frame and display the dataset.
![Two folders]()
Conclusion
We have built a model that classifies voice data as male or female. The same approach can be extended by adding other labeled audio data and training new models, for example on animal or bird sounds.