
Gender Classification from Speech Using Machine Learning

In this article, we will work through the following steps.

Objectives

  • Preprocess audio data to improve quality and consistency.
  • Extract relevant features from speech data.
  • Train a Logistic Regression model for classification.
  • Evaluate model performance using accuracy and other classification metrics.

Techniques Used

Audio Preprocessing

  • Trimming: Remove silence from the beginning and end of the audio.
  • Normalization: Ensure volume levels are consistent across all files.
  • Resampling: Convert all files to a consistent sampling rate.
  • Padding or Truncating: Ensure all inputs are the same length.

Feature Extraction

We extract powerful features that capture the characteristics of the speaker's voice.

  • Mel-Frequency Cepstral Coefficients: Capture timbral and phonetic content.
  • Spectral Centroid: Measures the "center of mass" of frequencies.
  • Spectral Rolloff: The frequency below which a set percentage (85% here) of the spectral energy is contained.
  • Zero-Crossing Rate: Counts how often the signal changes sign; higher for noisy or unvoiced sounds.
  • RMS Energy: Captures the loudness of the signal.

Model

We use a Logistic Regression model to classify audio based on extracted features. The model is trained on labeled examples of male and female voices.

Evaluation

The final model is evaluated using:

  • Accuracy
  • Precision, Recall, F1-Score
  • Confusion Matrix

Note: The theory behind the techniques used in this practical (machine learning basics, EDA, precision, recall, F1-score, and so on) was covered earlier. If anything in this topic is unfamiliar, please read the previous article first.

Practical

Install Required Pip Packages


Importing Libraries


Experimenting on a Sample

Let's take a sample and use it for experimenting and visualizing.


Audio Processing

1. Load Audio File


2. Trim Silence

Removes unnecessary silence from the beginning and end of the audio. This helps eliminate parts of the audio that contain no useful information.

We use the librosa.effects.trim() function with the parameter top_db, which specifies how many decibels below the peak should be considered silence.

"Decibels" is a logarithmic unit used to measure sound level.

  • A lower top_db (e.g., 20) is more aggressive: anything quieter than 20 dB below the peak is treated as silence, so even moderately quiet parts are trimmed.
  • A higher top_db (e.g., 60) is stricter about what counts as silence: only very quiet parts are trimmed.

We will choose 35, which is somewhere in the middle.


3. Noise Reduction

Reduces background noise such as hums, hisses, or ambient sounds using filters or noise reduction algorithms.

  • High-pass filters remove low-frequency noise.
  • The noisereduce library can automatically estimate and reduce noise.


4. Normalization

Ensures all audio signals are on the same volume scale by scaling the waveform so its peak is consistent across samples.
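Peak normalization is simple enough to write by hand. The helper below (peak_normalize is a name introduced here for illustration) scales each waveform so its absolute peak is the same across samples:

```python
import numpy as np

def peak_normalize(y: np.ndarray, peak: float = 0.9) -> np.ndarray:
    """Scale the waveform so its absolute peak equals `peak` (0.9 leaves headroom)."""
    max_amp = np.max(np.abs(y))
    if max_amp == 0:          # avoid dividing by zero on pure silence
        return y
    return y * (peak / max_amp)

quiet = 0.1 * np.sin(np.linspace(0, 20, 1000))
loud = peak_normalize(quiet)
print(round(float(np.max(np.abs(loud))), 3))   # -> 0.9
```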

5. Resampling

Resamples all audio to the same sampling rate (16,000 Hz), which ensures uniformity across the dataset.

Sample rate: how many samples per second are used to represent the audio.

Let's look at the audio wave at a more zoomed-in level.


Feature Extraction

Now let's look at various feature extraction techniques.

1. Spectrogram

Shows how the frequencies of the audio signal change over time.

  • X-axis: time
  • Y-axis: frequency
  • Color: amplitude of each frequency at that moment


2. Mel Spectrogram

Similar to a regular spectrogram, but the frequency axis is scaled to match how humans hear (the Mel scale).

It focuses more on low to mid frequencies, which are most important for speech and music.


3. Spectral Centroid

The Spectral Centroid tells us where the "center of mass" of the sound's frequencies is; it shows how "bright" or "dark" a sound is.

  • If most energy is in high frequencies, the centroid is high: the sound is bright or sharp (like cymbals).
  • If most energy is in low frequencies, the centroid is low: the sound is dull or bassy (like drums or male voices).


4. Spectral Rolloff

Tells us the frequency below which 85% of the total spectral energy is contained.

Example

Male Voice (Deep, Low-pitched)

  • Most energy is in low frequencies.
  • We might reach 85% of the energy by 2000 Hz.
  • So the Spectral Rolloff is low.

Female Voice (High-pitched)

  • Energy is spread into higher frequencies.
  • We may need to go up to 5000 Hz to reach 85% of the energy.
  • So the Spectral Rolloff is higher.


5. MFCC (Mel-Frequency Cepstral Coefficients)

Captures the overall shape of the audio spectrum in a way that mimics human hearing. Commonly used in speech and music analysis.


6. RMS (Root Mean Square Energy)

Captures the energy or loudness of the signal over time. Useful for understanding how powerful the sound is at each frame.

  • High RMS values = loud parts (speech, music, noise).
  • Low RMS values = silence or quiet parts.

It helps in voice activity detection, emotion recognition, and even trimming silent segments.


Final Preprocessing of Data

We apply the audio preprocessing pipeline to the data loaded from local male and female folders. I used only 9 male voice files and 9 female voice files.


To speed up loading the two folders (here male_folder contains only 9 files, but this pays off as the dataset grows), we will use parallel processing.

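One way to sketch the parallel loading using only the standard library. The extract_features helper and the folder paths below are placeholders for the article's actual pipeline:

```python
from concurrent.futures import ThreadPoolExecutor

def extract_features(path):
    """Placeholder for the real pipeline: load, trim, denoise, normalize,
    resample, then compute MFCC / centroid / rolloff / ZCR / RMS features."""
    # Real code would call librosa here; we just derive the label from the path.
    label = "male" if path.startswith("male_folder") else "female"
    return {"path": path, "label": label}

# Hypothetical file lists matching the article's two folders of 9 files each.
paths = [f"male_folder/{i}.wav" for i in range(9)] + \
        [f"female_folder/{i}.wav" for i in range(9)]

# Threads work well when librosa's NumPy-heavy code releases the GIL;
# for purely CPU-bound extraction, ProcessPoolExecutor is the alternative.
with ThreadPoolExecutor(max_workers=4) as ex:
    rows = list(ex.map(extract_features, paths))
print(len(rows))   # -> 18
```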

Now that we have loaded our two folders, we can create our DataFrame and inspect the dataset.

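An end-to-end sketch of the DataFrame, Logistic Regression, and evaluation steps. The feature values below are synthetic stand-ins (randomly generated to mimic low-pitched versus high-pitched voices), not the article's real data:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Synthetic stand-in: in the article each row holds the averaged
# MFCC / centroid / rolloff / ZCR / RMS features for one audio file.
rng = np.random.default_rng(42)
n = 200
centroid = np.concatenate([rng.normal(1500, 200, n), rng.normal(2500, 200, n)])
rolloff = np.concatenate([rng.normal(2000, 300, n), rng.normal(4500, 300, n)])
df = pd.DataFrame({"centroid": centroid, "rolloff": rolloff,
                   "label": ["male"] * n + ["female"] * n})

X = df[["centroid", "rolloff"]]
y = (df["label"] == "female").astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = model.predict(X_te)
print("Accuracy:", accuracy_score(y_te, pred))
print(classification_report(y_te, pred, target_names=["male", "female"]))
```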

Conclusion

We have built a model that classifies voice recordings as male or female. The same pipeline can be extended with other data to build similar models, for example for animal or bird sounds.
