FPGA Speaker Recognition + FSK Modem
Verilog · PyTorch
Artix-7 XC7A35T
Apr 2026 to May 2026

FPGA Neural Network
Speaker ID + FSK Modem

A fully hardware-accelerated audio intelligence system on the Basys3 FPGA. Microphone audio passes through a complete signal processing chain: pre-emphasis filtering, Hamming windowing, 512-point FFT, a 26-bin mel filterbank, and DCT to produce 13 MFCC coefficients per frame. A 32-frame sliding window feeds a 2-layer integer neural network for real-time speaker identification. A parallel FSK modem provides LFSR-encrypted inter-FPGA wireless messaging.

Verilog HDL PyTorch Xilinx Artix-7
85%+ Val Accuracy
416 NN Inputs
320 ms Frame Window
70 us Per Inference
// 02

System Architecture

Speak into the microphone. The FPGA processes each 25 ms audio frame through the full signal chain and classifies the speaker every 10 ms. Speaker 1 (Sindhu) is granted access. Speaker 2 (Kaushik) is declined. On a successful identification, the authenticated user can send or receive a message of up to 5 characters over the FSK wireless link. A failed identification triggers a 5-second lockout before a retry is permitted.

Flow: speak into mic -> MFCC pipeline (pre-emph / FFT / mel / DCT) -> NN inference (416 inputs, 32-frame window) -> speaker decision (majority vote).
Speaker 1 (Sindhu): access granted -> proceed to FSK, TX or RX, up to 5 chars.
Speaker 2 (Kaushik): access denied -> 5 s lockout, retry after 5 s.
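The grant / deny / lockout behaviour described above can be sketched in a few lines of Python. This is only an illustrative model, not the RTL: the class name `AccessGate`, the string return values, and the use of a floating-point timestamp are all assumptions made for the example.

```python
LOCKOUT_SECONDS = 5.0  # lockout duration taken from the description above


class AccessGate:
    """Hypothetical model of the access decision with a 5-second lockout."""

    def __init__(self, authorised_speaker=1):
        self.authorised_speaker = authorised_speaker  # Speaker 1 in the text
        self.locked_until = 0.0

    def attempt(self, speaker_id, now):
        """Return 'locked', 'granted' or 'denied' for an attempt at time `now` (seconds)."""
        if now < self.locked_until:
            return "locked"
        if speaker_id == self.authorised_speaker:
            return "granted"
        # A failed identification starts the lockout window.
        self.locked_until = now + LOCKOUT_SECONDS
        return "denied"
```

For example, a denied attempt at t = 0 s blocks every attempt until t = 5 s, after which Speaker 1 is granted access again.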

Hardware Pipeline Overview

Speaker ID Chain
🎙MIC UART RX, 16 kHz
DC + PRE-EMPHASIS alpha = 0.969
HAMMING WINDOW 400 samples, 25 ms
512-PT FFT 256 useful bins
26-BIN MEL FILTERBANK + log magnitude
DCT-II 13 MFCC coefficients
32-FRAME BUFFER 416 coefficients
NN INFERENCE 416 - 16 - 2
SPEAKER ID logit0 vs logit1
FSK Modem Chain
MESSAGE INPUT up to 5 chars, 40 bits
🔐LFSR ENCRYPT seed 0xBEEF, XOR key
FRAME preamble + start + payload
FSK TRANSMIT 1 kHz / 2 kHz tones
WIRELESS AUDIO LINK inter-FPGA
📡IIR ENVELOPE DETECT dual-tone, hysteresis
PREAMBLE SYNC 10101010 + 11111110
🔐LFSR DECRYPT same LFSR sequence
DECODED MESSAGE plaintext output
// 03

Pre-emphasis + Hamming Window

Before any spectral analysis, each audio frame goes through three conditioning stages: DC removal, a first-order high-pass pre-emphasis filter, and a Hamming window applied over 400 samples. Together these stages remove the DC offset, flatten the spectral tilt of speech, and reduce spectral leakage in the downstream FFT.

DC Removal

The UART mic receiver delivers 12-bit unsigned samples in the range [0, 4095]. A fixed offset of 2048 is subtracted to centre the signal around zero before any filtering:

x_dc = sample - 2048; // 12-bit unsigned -> 16-bit signed, zero-centred

Pre-emphasis Filter

Human speech has a natural spectral tilt: higher frequencies carry less energy than lower ones. Pre-emphasis applies a first-order high-pass filter that lifts high-frequency components by roughly 6 dB/octave, compensating for this tilt and improving the conditioning of mel filterbank outputs at higher bands.

Filter Equation

y[n] = x[n] - alpha * x[n-1], where alpha = 124/128 = 0.969

Alpha is approximated in fixed-point as 124/128: the previous sample is multiplied by 124 and then right-shifted by 7 bits, avoiding any division. The output remains 16-bit signed.

// Fixed-point alpha = 124 / 2^7 = 0.96875
coeff_term = x_prev * 124;
alpha_xprev = coeff_term >>> 7; // arithmetic shift, i.e. divide by 128
y_preemph = x_dc - alpha_xprev;
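A small Python model of the same arithmetic is handy for checking RTL output against known samples. The sketch below assumes the delay register resets to zero and relies on Python's `>>` flooring negative values, which matches an arithmetic right shift on signed data; the function name `preemphasis` is made up for the example.

```python
def preemphasis(samples):
    """DC removal followed by y[n] = x[n] - ((124 * x[n-1]) >> 7), integers only."""
    out = []
    x_prev = 0  # assumed reset value of the delay register
    for s in samples:
        x_dc = s - 2048                    # 12-bit unsigned -> zero-centred
        alpha_xprev = (x_prev * 124) >> 7  # multiply by 124, shift right by 7
        out.append(x_dc - alpha_xprev)
        x_prev = x_dc
    return out
```

A constant input settles to a small residual (about 3% of the input level), which is the expected behaviour of a high-pass filter with alpha close to 1.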

Spectral Effect of Pre-emphasis

Figure: frequency response (dB, -12 to +12) over 0 to 8 kHz, comparing the flat response before pre-emphasis with the response after pre-emphasis.

Circular Buffer and Hamming Window

Pre-emphasised samples accumulate in a 400-sample circular buffer backed by block RAM. Every 160 new samples (10 ms), a frame trigger fires and the readout FSM sweeps all 400 samples through the Hamming window, then zero-pads to 512 samples for the FFT.

Frame Parameters

Frame length: 400 samples (25 ms at 16 kHz). Hop size: 160 samples (10 ms). Overlap: 60%. Zero-pad to 512 for FFT.
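The framing arithmetic can be checked with a short Python sketch. It only models the slicing and zero-padding (the Hamming window is left out here), and the generator name `frames` is an assumption for the example.

```python
FRAME_LEN = 400   # samples per frame (25 ms at 16 kHz)
HOP = 160         # new samples between frame triggers (10 ms)
FFT_LEN = 512     # FFT input length after zero-padding


def frames(samples):
    """Yield one zero-padded 512-sample frame for every HOP new samples."""
    for start in range(0, len(samples) - FRAME_LEN + 1, HOP):
        frame = list(samples[start:start + FRAME_LEN])
        yield frame + [0] * (FFT_LEN - FRAME_LEN)


# Fraction of each frame shared with the previous one.
overlap = (FRAME_LEN - HOP) / FRAME_LEN
```

One second of audio (16000 samples) yields 98 complete frames with these parameters.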

Without windowing, the discontinuity at frame edges introduces spurious spectral energy, a phenomenon known as spectral leakage. The Hamming window suppresses this by tapering the signal smoothly to 0.08 at both ends while reaching 1.0 at the centre:

w[n] = 0.54 - 0.46 * cos(2*pi*n / 399), n = 0..399

Stored as Q0.15 unsigned in hamming_coeff.hex (400 entries):
w[0] = 0x0A3D (0.08 * 32767 = 2621)
w[199] = 0x7FFF (1.0 * 32767 = 32767)
w[399] = 0x0A3D (0.08 * 32767 = 2621)

Windowing: windowed[n] = (preemph[n] * hamming[n]) >> 15
Figure: Hamming window w(n) = 0.54 - 0.46*cos(2*pi*n/399) over n = 0..399 (400 samples, 25 ms), peak 1.0, followed by 112 zeros of padding.
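The coefficient values quoted for the window ROM can be regenerated with a few lines of Python. This is a sketch of one plausible generator, assuming round-to-nearest and a scale factor of 32767; the function names are hypothetical and the actual script that produced hamming_coeff.hex is not shown in this write-up.

```python
import math

N = 400  # window length in samples


def hamming_q15(n_points=N):
    """Hamming window scaled to Q0.15 (0..32767), rounded to the nearest integer."""
    return [
        round((0.54 - 0.46 * math.cos(2 * math.pi * n / (n_points - 1))) * 32767)
        for n in range(n_points)
    ]


def apply_window(frame, coeffs):
    """Multiply each sample by its coefficient and drop the 15 fractional bits."""
    return [(x * w) >> 15 for x, w in zip(frame, coeffs)]
```

With these assumptions the first, middle and last entries come out as 0x0A3D, 0x7FFF and 0x0A3D.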
// 04

FFT + Mel Filterbank

512-Point FFT

The 512 windowed-and-padded samples are fed into the Xilinx FFT v9.1 IP core configured for 16-bit fixed-point complex output. Because the input signal is real-valued, the FFT output is conjugate-symmetric: the upper 256 bins are mirror images of the lower 256 and carry no additional information. Only bins 0..255, covering DC to 8 kHz at a 16 kHz sample rate, are forwarded to the magnitude stage.

A Xilinx CORDIC v6.0 IP core converts each complex bin from Cartesian (Re, Im) to polar form. The 16 lower bits of the CORDIC magnitude output are taken as the spectral magnitude for each bin.

Frequency Resolution

Bin width = 16000 / 512 = 31.25 Hz per bin. Useful range: bins 0 to 255, covering 0 to 7968.75 Hz.

26-Bin Mel Filterbank

The mel scale is a perceptual frequency scale that approximates how the human ear resolves pitch. It is approximately linear below 1 kHz and logarithmic above. Triangular filters placed on this scale give each mel band a weighted sum of nearby FFT bins, capturing perceptually relevant spectral shape while discarding fine-grained spectral detail.

The filterbank uses 26 triangular mel-spaced filters over the range 0 to 8 kHz. The filter weights are precomputed and stored in a 256-entry distributed ROM (mel_filter_rom.hex). Each ROM entry encodes the two overlapping mel band indices and the triangular weight for that FFT bin in a packed 26-bit word:

// mel_filter_rom.hex: 256 entries, one per FFT bin 0..255
// Each entry: [mel_bank_lo(5b) | mel_bank_hi(5b) | weight_lo(16b)]
// mel_wt_hi = 0xFFFF - weight_lo (complementary triangular weight)
// Accumulate: mel_energy[lo] += mag * weight_lo
//             mel_energy[hi] += mag * mel_wt_hi
Figure: mel filterbank, 26 triangular filters over 0 to 8 kHz. Filters are narrow at low frequencies and wide at high frequencies: linear spacing below 1 kHz, logarithmic above.
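To make the filter spacing and the ROM word layout concrete, the sketch below computes the 28 edge frequencies that bound 26 triangular filters and packs one ROM word. Two things are assumptions: the mel formula (the common 2595 * log10(1 + f/700) variant, which the write-up does not name) and the bit order of the packed word (fields taken MSB-first in the order listed in the ROM comment).

```python
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_edges_hz(n_filters=26, f_lo=0.0, f_hi=8000.0):
    """Edge frequencies of n_filters triangles, equally spaced on the mel scale."""
    m_lo, m_hi = hz_to_mel(f_lo), hz_to_mel(f_hi)
    step = (m_hi - m_lo) / (n_filters + 1)
    return [mel_to_hz(m_lo + i * step) for i in range(n_filters + 2)]


def pack_rom_word(lo, hi, weight_lo):
    """Pack [mel_bank_lo(5b) | mel_bank_hi(5b) | weight_lo(16b)] into 26 bits."""
    return (lo << 21) | (hi << 16) | weight_lo


def unpack_rom_word(word):
    return (word >> 21) & 0x1F, (word >> 16) & 0x1F, word & 0xFFFF
```

The gap between consecutive edges grows with frequency, which is the "narrow to wide" progression shown in the figure.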

Log Magnitude

After accumulation, each of the 26 mel band energies passes through a log2 approximation built from a leading-zero counter (LZC) and a 64-entry fractional lookup table. The result is in Q8.8 unsigned format, covering the full 32-bit dynamic range. Taking the log is standard in MFCC computation because it converts multiplicative gain variations into additive offsets. The DCT then concentrates those offsets in the lowest-index cepstral coefficient (c[0]), which can be discarded with little effect on speaker discrimination.

// log2 approximation: LZC + fractional LUT
integer_part = 31 - lzc(mel_energy); // floor(log2(x))
mantissa = mel_energy >> (integer_part - 5); // leading one moved to bit 5 (requires integer_part >= 5)
frac_part = frac_lut[mantissa[5:0]]; // 64-entry LUT, 8-bit output
log_mel[k] = {integer_part[4:0], frac_part}; // Q8.8, zero-extended to 16-bit unsigned
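The LZC-plus-LUT idea can be prototyped in Python to see how much error a 64-entry table introduces. Note that this sketch makes its own indexing choice: it drops the leading one and uses the next six bits as the LUT index, which is an assumption and may differ from the bit slicing in the RTL. The LUT contents (256 * log2(1 + i/64), rounded) are also assumed.

```python
import math

# Assumed LUT contents: fractional part of log2 for mantissa 1 + i/64, in 8 bits.
FRAC_LUT = [round(256 * math.log2(1 + i / 64)) for i in range(64)]


def log2_q88(x):
    """Approximate log2(x) in Q8.8 for a positive integer x (0 maps to 0)."""
    if x <= 0:
        return 0
    integer_part = x.bit_length() - 1  # same as 31 - lzc(x) for a 32-bit value
    if integer_part >= 6:
        idx = (x >> (integer_part - 6)) & 0x3F  # six bits below the leading one
    else:
        idx = (x << (6 - integer_part)) & 0x3F
    return (integer_part << 8) | FRAC_LUT[idx]
```

Under these assumptions the worst-case error is a little over 0.02 in log2 units, dominated by truncating the mantissa to six bits.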
// 05

DCT-II and MFCC Coefficients

The final stage applies the Discrete Cosine Transform Type-II to the 26 log-mel energies, producing 13 cepstral coefficients. Adjacent mel filters overlap substantially, so their energies are correlated. The DCT decorrelates these outputs into a compact set of orthogonal coefficients that each describe a distinct aspect of the spectral envelope shape.

DCT-II Formula

c[m] = sum over l=0..25 of log_mel[l] * cos(pi * m * (l + 0.5) / 26), for m = 0..12

The cosine weights are precomputed as Q1.15 signed integers (scaled by 32767) and stored in a 338-entry ROM (dct_cos_rom.hex, 13 x 26 entries). The RTL computes all 13 coefficients sequentially, performing 26 multiply-accumulate operations per coefficient through a single 40-bit signed accumulator to prevent overflow before the final rounding shift.

// 13 passes, 26 MACs each -- dct_cos_rom indexed as (cep_idx * 26) + mel_idx
for cep_idx = 0..12:
    acc_40b = 0;
    for mel_idx = 0..25:
        cos_w = dct_cos_rom[cep_idx * 26 + mel_idx]; // Q1.15 signed
        acc_40b += (log_mel[mel_idx] >> 3) * cos_w; // >>3 prevents overflow
    mfcc[cep_idx] = round(acc_40b) >>> 8; // final scale -> INT32
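A Python reference for the same MAC loop is useful when comparing against ILA captures. The sketch below regenerates a 338-entry cosine table from the DCT-II formula and runs the integer accumulation. The rounding step is an assumption (add half an LSB, then shift), since the exact rounding used in the RTL is not spelled out.

```python
import math

N_MEL, N_CEP = 26, 13

# 13 x 26 cosine table in Q1.15, indexed as cep_idx * 26 + mel_idx.
DCT_ROM = [
    round(32767 * math.cos(math.pi * m * (l + 0.5) / N_MEL))
    for m in range(N_CEP)
    for l in range(N_MEL)
]


def dct_mfcc(log_mel):
    """Integer DCT-II over 26 Q8.8 log-mel values, returning 13 coefficients."""
    mfcc = []
    for cep_idx in range(N_CEP):
        acc = 0
        for mel_idx in range(N_MEL):
            acc += (log_mel[mel_idx] >> 3) * DCT_ROM[cep_idx * N_MEL + mel_idx]
        mfcc.append((acc + 128) >> 8)  # assumed rounding: add half LSB, then shift
    return mfcc
```

Feeding a flat spectrum (all 26 inputs equal) puts essentially all of the energy into coefficient 0, with the other twelve close to zero, which is a quick sanity check on the table.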

Frame to Buffer

Each complete frame produces 13 MFCC coefficients signalled by mfcc_valid pulses, with mfcc_last marking coefficient 12 (the final one). The nn_input_buffer module collects coefficients one by one and on mfcc_last shifts the full frame into a 32-frame sliding window register. Once 32 frames have accumulated, nn_start pulses every subsequent frame.

// nn_input_buffer: 32 x 13 = 416 coefficients, each 32-bit
mfcc_window_flat[13311:0] -- 416 * 32 = 13312 bits total
nn_start -- 1-cycle pulse per new frame once buffer full
buffer_full -- held high after 32nd frame arrives
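The buffering behaviour is easy to model in software. The sketch below is a behavioural stand-in for nn_input_buffer, assuming the oldest frame sits first in the flattened window; the class and method names are hypothetical, and the real module's flattening order may differ.

```python
from collections import deque


class NNInputBuffer:
    """Behavioural model: collect 13 coefficients per frame, keep the last 32 frames."""

    def __init__(self, n_frames=32, n_coeffs=13):
        self.n_frames = n_frames
        self.n_coeffs = n_coeffs
        self.window = deque(maxlen=n_frames)  # oldest frame is dropped automatically
        self.partial = []

    def push(self, coeff, last):
        """Accept one coefficient; return True (nn_start) when a full window is ready."""
        self.partial.append(coeff)
        if not last:
            return False
        self.window.append(self.partial)
        self.partial = []
        return len(self.window) == self.n_frames

    def flat(self):
        """Flattened 416-element window, oldest frame first (assumed ordering)."""
        return [c for frame in self.window for c in frame]
```

The first 31 frames produce no start pulse; from the 32nd frame onwards every completed frame does.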
// 06

Neural Network Inference Engine

The speaker identification network is a 2-layer fully-connected network described entirely in synthesisable Verilog. All arithmetic uses the Artix-7 DSP48 slices in integer mode. No floating-point units are used anywhere in the design.

Architecture
FC(416, 16, ReLU) then FC(16, 2)

416 normalised INT8 inputs. 16 hidden neurons with ReLU. 2 output logits (one per speaker). argmax(logit0, logit1) gives the speaker ID.

Quantisation
W_SCALE=1024, MAC_SHIFT=10

All weights stored as INT16, scaled by 2^10. After MAC accumulation, a right-shift of 10 bits recovers the float-equivalent magnitude before bias addition.

Input Normalisation
Z-score normalisation, INT8 clip

Each of the 416 inputs is z-score normalised using a precomputed mean array and a reciprocal-std array (both stored in ROM). The result is clipped to the INT8 range [-128, 127].

Inference Latency
~7004 cycles = 70 us at 100 MHz

417 norm cycles + 16 x 417 L1 MAC cycles + 2 x 17 L2 cycles + drain/writeback cycles. Comfortably within real-time budget per 10 ms hop.
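As a rough cross-check of the real-time margin, the three cycle counts listed above can be tallied directly. Drain and writeback cycles are left out because their exact number is not given, so this total is an approximation and will not match the quoted cycle count exactly.

```python
CLOCK_HZ = 100_000_000  # 100 MHz system clock

norm_cycles = 417       # normalisation pass
l1_cycles = 16 * 417    # 16 hidden neurons, 417 cycles each
l2_cycles = 2 * 17      # 2 output neurons, 17 cycles each

total_cycles = norm_cycles + l1_cycles + l2_cycles
latency_us = total_cycles / CLOCK_HZ * 1e6

hop_us = 10_000         # one new frame every 10 ms
utilisation = latency_us / hop_us
```

Either way the inference occupies well under 1% of each 10 ms hop.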

Input Normalisation Pipeline

Before the MAC layers, the raw INT32 MFCC values are normalised. For each of the 416 input features, the precomputed mean is subtracted, then the result is multiplied by the precomputed reciprocal of the standard deviation. The reciprocal is stored in Q0.14 format (scaled by 2^14) so the multiply uses only integer DSP48 operations. The product is shifted right by NORM_SHIFT=14 bits and clipped to INT8.

x_norm[i] = clip( ((mfcc_in[i] - x_mean_rom[i]) * recip_rom[i]) >> 14, -128, 127 )
// x_mean_rom: 416 x INT32 (precomputed per-feature mean)
// recip_rom: 416 x 11-bit (precomputed 127/sigma * 2^14, as in the export script; pattern repeats 32x)
// NORM_SHIFT = 14
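The normalisation step can be reproduced in Python for a single feature. This sketch uses the reciprocal scaling from the export script (127 * 2^14 / std), so a value one standard deviation above the mean maps to roughly 127 before clipping. The function names and the example statistics are made up for illustration.

```python
NORM_SHIFT = 14


def make_recip(std):
    """Reciprocal-std ROM entry: round(127 * 2^14 / std)."""
    return round(127.0 * (1 << NORM_SHIFT) / std)


def normalise(mfcc_in, mean, recip):
    """Subtract the mean, multiply by the reciprocal, shift, and clip to INT8."""
    scaled = ((mfcc_in - mean) * recip) >> NORM_SHIFT  # >> floors, like an arithmetic shift
    return max(-128, min(127, scaled))
```

For a hypothetical feature with mean 1000 and std 2000, an input of 2000 (half a standard deviation above the mean) normalises to 63.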

FSM State Machine

// 7-state FSM drives the full inference sequence
S_IDLE   -- wait for nn_start pulse
S_NORM   -- normalise 416 inputs: 3-stage pipeline + 3 drain cycles
S_L1_MAC -- MAC: 416 inputs * weight, 2-stage DSP48 pipeline per neuron
S_L1_WB  -- write ReLU(acc >> MAC_SHIFT) to l1_out[neuron], advance neuron
S_L2_MAC -- MAC: 16 l1_out * w2, same pipeline
S_L2_WB  -- write logit0 or logit1
S_OUTPUT -- pulse valid=1, return to S_IDLE
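The arithmetic that the FSM steps through can be mirrored in a few lines of integer Python. This is a sketch rather than a golden model: it assumes the bias is added after the MAC_SHIFT right-shift (as the quantisation note describes) and that the bias is already expressed in the post-shift scale. The function name and the tie-breaking rule in the argmax are also assumptions.

```python
MAC_SHIFT = 10


def nn_forward(x_norm, w1, b1, w2, b2):
    """Two-layer integer forward pass: FC + ReLU, then FC, then argmax."""
    hidden = []
    for n in range(len(w1)):                           # S_L1_MAC / S_L1_WB
        acc = sum(w * x for w, x in zip(w1[n], x_norm))
        hidden.append(max(0, (acc >> MAC_SHIFT) + b1[n]))
    logits = []
    for k in range(len(w2)):                           # S_L2_MAC / S_L2_WB
        acc = sum(w * h for w, h in zip(w2[k], hidden))
        logits.append((acc >> MAC_SHIFT) + b2[k])
    speaker = 0 if logits[0] >= logits[1] else 1       # ties go to speaker 0 (assumed)
    return logits, speaker
```

With weights of 1024 (1.0 at W_SCALE) on the diagonal, the network simply passes its inputs through, which makes a convenient smoke test.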

Memory Layout

ROM | Entries | Width | Mapping
w1_rom | 6656 (16 x 416) | INT16 | Block RAM
w2_rom | 32 (2 x 16) | INT16 | Distributed LUT
x_mean_rom | 416 | INT32 | Block RAM
recip_rom | 416 | 11-bit | Distributed LUT
// 07

Timing Analysis and Closure

After initial synthesis, Vivado reported catastrophic timing violations concentrated entirely in the neural network module. The root cause was combinatorial paths that exceeded the 10 ns clock period before reaching DSP48 multiply inputs.

Before Fix
TNS: -27,518 ns
WNS -5.1 ns / ~9,751 failing endpoints / all inside nn_inst
After Fix
TNS: 0 ns
WNS positive / zero failing endpoints / full timing closure

Root Cause: Two Critical Paths

Both paths share the same failure mode: a long combinatorial chain arrives at a DSP48 multiply input without being registered first. In this design at 100 MHz, each path through a LUT mux tree or BRAM output into a DSP48 multiply input exceeded the 10 ns budget on its own.

Path 1: Normalisation multiply (before fix)
recip_rom[idx] --> LUT mux decode --> DSP48 input A --> multiply --> norm_clipped --> x_norm[slot] (~14 ns, fails timing)
Path 2: L1 MAC multiply (before fix)
w1_rom[addr] --> BRAM output --> DSP48 input A (~12 ns, fails timing)
x_norm[idx] --> 416-entry LUT mux tree --> DSP48 input B (~11 ns, fails timing)

Fix: 2-Stage Registered Pipelines

The principle is to ensure that both DSP48 multiply inputs come straight from flip-flop outputs before the multiply fires. A registered input contributes only its clock-to-Q delay to the multiply path, so the LUT mux tree and BRAM read latency are absorbed in the preceding cycle.

Normalisation path after fix (3 pipeline stages)
Stage 1: diff + index registered into norm_diff_r, norm_idx_r
Stage 2: recip_rom[norm_idx_r] registered into recip_r (LUT decode absorbed)
Stage 3: norm_diff_r2 * recip_r registered into norm_prod_r (~7 ns, meets timing)
+ 3 drain cycles at end of S_NORM to flush all pipeline stages before S_L1_MAC

L1 MAC path after fix (2 pipeline stages on both DSP inputs)
Stage 1: w1_rom[addr] registered into w1_val_r (BRAM sync read, 1 cycle)
Stage 2: w1_val_r into w1_val_r2; x_norm[idx_br] registered into xn_r (mux tree settles)
Stage 3: w1_val_r2 * xn_r registered into mac_r (~8 ns, meets timing)
mac_idx_r resets to sentinel 9'h1FF rather than 0, preventing false accumulation on the first cycle of S_L1_MAC
// 08

Training and Results

The model was trained in PyTorch on MFCC data captured directly from the FPGA via the Vivado Integrated Logic Analyzer (ILA). Training on ILA data is critical: every training sample has passed through the same RTL signal processing chain as inference, so the training distribution includes all fixed-point rounding artefacts from the hardware.

Data Collection via ILA

A Python host script streams live microphone audio over UART to the FPGA, which runs the full RTL MFCC pipeline. The ILA is configured to capture mfcc_frame_flat[13311:0] (the full 416-coefficient window) whenever nn_start pulses. Each ILA export is a CSV file with one window per row. This ensures every training sample has passed through the exact same FFT, mel filterbank, DCT, and fixed-point arithmetic as inference.

Training Data
2 speakers, 2 sessions each

k_p1.csv, k_p2.csv (Kaushik) and s_p1.csv, s_p2.csv (Sindhu). ILA-captured MFCC windows from the FPGA in live conditions.

Windowing
8-frame windows, stride 1

Training uses 8-frame x 13-coeff = 104-dimensional windows with stride 1 for maximum data augmentation. The FPGA inference uses the full 32-frame window (416 dims).

PyTorch Model
Linear(416,16) ReLU Linear(16,2)

Adam optimiser (lr=1e-3, weight_decay=1e-4), StepLR scheduler (step 50, gamma 0.5), CrossEntropyLoss, 200 epochs, batch size 64.

Validation
80/20 train/val split

A golden model test verifies bit-exact match between PyTorch INT16 arithmetic and the manual fixed-point simulation before writing the hex weight files.
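The stride-1 windowing used to build training examples can be sketched as follows. The function name `make_windows` is hypothetical; it takes a list of 13-coefficient frames and returns one flattened vector per window position.

```python
def make_windows(frames, window_len=8):
    """Slide a window over consecutive frames with stride 1 and flatten each window."""
    return [
        [c for frame in frames[i:i + window_len] for c in frame]
        for i in range(len(frames) - window_len + 1)
    ]
```

A capture of 20 frames yields 13 overlapping 8-frame windows of 104 values each, so stride 1 multiplies the number of training examples obtained from a single recording.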

Quantisation and Export

After training converges in float32, the weights are quantised post-training. The quantisation scheme mirrors the hardware fixed-point format exactly so that inference on the FPGA produces the same classification as the PyTorch integer model.

import numpy as np

# w1_float, w2_float, b1_float and X_std come from the trained model and the training set
# W_SCALE = 1024 = 2^10 (matches MAC_SHIFT=10 in RTL)
w1_int = np.clip(np.round(w1_float * 1024), -32767, 32767).astype(np.int16)
w2_int = np.clip(np.round(w2_float * 1024), -32767, 32767).astype(np.int16)

# Biases scaled by W_SCALE^2 because the hardware accumulates before shifting
b1_int = np.clip(np.round(b1_float * 1024**2), -(2**31), 2**31-1).astype(np.int32)

# Normalisation: recip_rom = round(127.0 * 2^14 / X_std)
recip_vals = np.round(127.0 * 2**14 / X_std).astype(np.int32)

# Write hex files for $readmemh in Verilog
# Cast to uint16 so negative weights print as 16-bit two's complement
np.savetxt("speaker_id_w1_new.hex", w1_int.astype(np.uint16), fmt="%04x")
np.savetxt("speaker_id_w2_new.hex", w2_int.astype(np.uint16), fmt="%04x")
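The mapping between a float weight, its INT16 value, and the four hex digits that $readmemh reads back can be shown without NumPy. The helper names below are hypothetical and exist only to illustrate the two's complement round trip.

```python
W_SCALE = 1024  # 2^10, matches MAC_SHIFT = 10


def quantise_weight(w):
    """Scale a float weight by W_SCALE, round, and clip to the symmetric INT16 range."""
    return max(-32767, min(32767, round(w * W_SCALE)))


def to_hex16(value):
    """Format a signed 16-bit integer as four hex digits (two's complement)."""
    return format(value & 0xFFFF, "04x")


def from_hex16(text):
    """Inverse of to_hex16: interpret four hex digits as a signed 16-bit integer."""
    value = int(text, 16)
    return value - 0x10000 if value & 0x8000 else value
```

A weight of 0.5 becomes 512 and is written as 0200, while -1 is written as ffff and reads back as -1 when the ROM is declared signed.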

Accuracy

Overall Validation Accuracy: 85%+
Speaker 0 (Kaushik): ~87%
Speaker 1 (Sindhu): ~83%

Volume Robustness Issue and Fix

A training vs. inference volume mismatch caused MFCC coeff[0] (log-energy) to dominate classification, making the model always predict one speaker. The fix was to train exclusively on ILA-captured live FPGA output and to drop the four most volume-correlated coefficients (0, 4, 5, 9), using the nine most stable ones instead.
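Dropping the volume-correlated coefficients amounts to a fixed index selection per frame. A minimal sketch, with hypothetical names, assuming coefficients are indexed 0..12 within each frame:

```python
DROPPED = {0, 4, 5, 9}  # volume-correlated coefficients named above
KEPT = [k for k in range(13) if k not in DROPPED]


def select_stable(frame):
    """Keep only the nine coefficients that are not volume-correlated."""
    return [frame[k] for k in KEPT]
```

The retained indices are 1, 2, 3, 6, 7, 8, 10, 11 and 12.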

// 09

FSK Modem

A parallel FSK (Frequency Shift Keying) modem enables encrypted inter-FPGA wireless messaging. The transmitter generates audio tones, the receiver demodulates them via IIR envelope detection, and a frame synchroniser recovers the byte-aligned payload.

Modulation
Binary FSK: 1000 Hz / 2000 Hz

Bit 0: ~1000 Hz (16 samples/cycle at 16 kHz). Bit 1: ~2000 Hz (8 samples/cycle). Each bit lasts 160 samples = 10 ms. Tone waveforms from an 8-entry sine LUT.

Encryption
LFSR XOR, seed 0xBEEF

16-bit LFSR with polynomial taps [15,13,11,0]. The LFSR is run 12 times to generate a 40-bit key, which is XOR'd with the payload before transmission and again on receive to decrypt.

Frame Format
56 bits total, 560 ms

8-bit preamble (10101010) for carrier detect, 8-bit start flag (11111110) for byte alignment, and a 40-bit LFSR-encrypted payload. Total duration: 560 ms at 10 ms per bit.

Receiver
Dual IIR envelope detection

Two independent IIR filters (alpha=1/16) track the rectified energy of the high (2 kHz) and low (1 kHz) tones. The bit decision compares both envelopes against a configurable hysteresis threshold.
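The modulation parameters above can be turned into a small tone generator. This sketch assumes a 4-bit phase accumulator whose top three bits index the 8-entry sine LUT, stepping by 1 for a 0 bit and by 2 for a 1 bit; the accumulator width and the LUT amplitude (127) are assumptions, not details taken from the RTL.

```python
import math

# 8-entry sine LUT; the amplitude of 127 is an assumed value.
SINE_LUT = [round(127 * math.sin(2 * math.pi * k / 8)) for k in range(8)]
SAMPLES_PER_BIT = 160  # 10 ms at 16 kHz


def fsk_modulate(bits):
    """Return audio samples for a bit sequence: bit 0 -> 1 kHz, bit 1 -> 2 kHz."""
    samples = []
    phase = 0  # 4-bit phase accumulator; phase >> 1 indexes the LUT
    for bit in bits:
        step = 2 if bit else 1
        for _ in range(SAMPLES_PER_BIT):
            samples.append(SINE_LUT[phase >> 1])
            phase = (phase + step) & 0xF
    return samples
```

A 1 bit walks the LUT once every 8 samples (2 kHz), while a 0 bit holds each LUT entry for two samples and completes a cycle every 16 samples (1 kHz). Because 160 is a multiple of both periods, every bit starts at phase zero.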

Frame Structure

// Total: 56 bits @ 10 ms/bit = 560 ms per frame
[ 8-bit preamble: 10101010 ] -- carrier detect, edge synchronisation
[ 8-bit start flag: 11111110 ] -- frame alignment marker
[ 40-bit payload: XOR(data, key) ] -- LFSR-encrypted message

// LFSR key generation
seed = 0xBEEF -- 16-bit initial state
taps = [15, 13, 11, 0] -- feedback polynomial
// Run 12 shifts -> concatenate two 16-bit words and the upper 8 bits of a third = 40-bit key
key = {lfsr11, lfsr11, lfsr11[15:8]}
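Frame construction and the XOR round trip can be sketched in Python. Several details here are assumptions because the write-up does not pin them down: the LFSR is modelled as a left-shifting Fibonacci register with the feedback bit formed by XOR-ing the listed taps, the key is built from the state after 12 shifts following the concatenation shown above, bits are sent MSB-first, and short messages are padded with NUL bytes.

```python
SEED = 0xBEEF
TAPS = (15, 13, 11, 0)
PREAMBLE, START_FLAG = 0xAA, 0xFE


def lfsr_step(state):
    """One shift of an assumed left-shifting Fibonacci LFSR."""
    fb = 0
    for t in TAPS:
        fb ^= (state >> t) & 1
    return ((state << 1) | fb) & 0xFFFF


def make_key(seed=SEED, shifts=12):
    s = seed
    for _ in range(shifts):
        s = lfsr_step(s)
    return (s << 24) | (s << 8) | (s >> 8)  # {s, s, s[15:8]} = 40 bits


def to_bits(value, width):
    return [(value >> i) & 1 for i in range(width - 1, -1, -1)]  # MSB first


def build_frame(message, key):
    data = message.encode("ascii")[:5].ljust(5, b"\x00")  # NUL padding is assumed
    payload = int.from_bytes(data, "big") ^ key
    return to_bits(PREAMBLE, 8) + to_bits(START_FLAG, 8) + to_bits(payload, 40)


def decode_payload(bits, key):
    value = 0
    for b in bits:
        value = (value << 1) | b
    return (value ^ key).to_bytes(5, "big").rstrip(b"\x00").decode("ascii")
```

Because XOR is its own inverse, applying the same key on receive recovers the plaintext regardless of the exact LFSR variant, as long as both ends agree on it.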

Receiver FSM and IIR Detection

The receiver runs two IIR envelope filters in parallel. One is tuned to the energy pattern of the 2 kHz tone via consecutive sample differences, and the other targets the 1 kHz tone via 2-sample stride differences. Each filter implements an exponential moving average with alpha ~1/16 as a single right-shift, requiring no division logic.

// IIR envelope: alpha = 1/16 (one shift operation, no division)
env_hi = env_hi - (env_hi >>> 4) + abs(sample[n] - sample[n-1]); // 2 kHz
env_lo = env_lo - (env_lo >>> 4) + abs(sample[n] - sample[n-2]); // 1 kHz

// Bit decision with hysteresis (default: deadzone=15, thresh=300)
if (env_hi > env_lo + env_thresh): bit = 1;
else if (env_lo > env_hi + env_thresh): bit = 0;
else: hold previous bit;

Sensitivity Profiles

Four sensitivity modes are selectable via FPGA switches at runtime: Normal (deadzone=15, thresh=300), High Sensitivity (deadzone=8, thresh=100), Noise Reject (deadzone=30, thresh=800), Debug/Max (deadzone=4, thresh=50). This allows adaptation to different acoustic environments without recompiling.
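The envelope recurrence and the hysteresis decision can be exercised in isolation. The sketch below models only the threshold part of each profile; how the deadzone parameter is applied is not described in this write-up, so it is deliberately left out. The dictionary keys and function names are hypothetical.

```python
# Threshold values per sensitivity profile (deadzone is not modelled here).
PROFILES = {
    "normal": 300,
    "high_sensitivity": 100,
    "noise_reject": 800,
    "debug_max": 50,
}


def update_envelope(env, diff):
    """Leaky integrator: env <- env - env/16 + |diff|, using a 4-bit right shift."""
    return env - (env >> 4) + abs(diff)


def decide_bit(env_hi, env_lo, prev_bit, thresh):
    """Pick the stronger envelope only if it wins by more than thresh; otherwise hold."""
    if env_hi > env_lo + thresh:
        return 1
    if env_lo > env_hi + thresh:
        return 0
    return prev_bit
```

With a constant input difference d, the envelope settles near 16 * d, so the thresholds are effectively compared against sixteen times the average sample difference. The same pair of envelopes can flip the decision under a more sensitive profile: a 200-count lead is held under "normal" but accepted under "high_sensitivity".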

Preamble Synchronisation

The receiver FSM starts in S_IDLE, waiting for a rising edge on the decoded bit stream. In S_HUNT it clocks in bits looking for the start byte pattern (11111110) to establish byte alignment, with a timeout of 45 bit periods to prevent lockup on noise. Once the start byte is detected, S_RECEIVE collects the 40 payload bits, and the decoded data is XOR'd with the same LFSR key to recover the plaintext message.
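The hunt-then-receive sequence can be modelled with an 8-bit shift register. This sketch covers S_HUNT and S_RECEIVE only, and assumes the 45-bit timeout is counted from the first bit clocked in; the function name and the choice to return None on failure are illustrative.

```python
START_FLAG = 0xFE    # 11111110
PAYLOAD_BITS = 40
HUNT_TIMEOUT = 45    # bit periods


def receive_frame(bit_stream):
    """Return the 40 payload bits that follow the start flag, or None on failure."""
    bits = iter(bit_stream)
    shift_reg = 0
    for _ in range(HUNT_TIMEOUT):                      # S_HUNT
        bit = next(bits, None)
        if bit is None:
            return None                                # stream ended while hunting
        shift_reg = ((shift_reg << 1) | bit) & 0xFF
        if shift_reg == START_FLAG:
            break
    else:
        return None                                    # timeout: no start flag seen
    payload = [b for _, b in zip(range(PAYLOAD_BITS), bits)]   # S_RECEIVE
    return payload if len(payload) == PAYLOAD_BITS else None
```

The alternating preamble can never match the start flag on its own, so alignment is only declared once the full 11111110 pattern has been shifted in.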

The design includes forward key ratcheting. Every successfully acknowledged transmission advances the LFSR seed by running it through several more shift iterations, producing a new XOR key for the next frame. Because both FPGAs apply the same deterministic LFSR sequence independently, they stay synchronised without ever exchanging the key explicitly. A replayed frame from a previous exchange carries a stale key and decrypts to garbage, providing replay protection over the wireless link.
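The ratcheting idea can be illustrated with two independent key generators that advance in lockstep. The number of shifts per ratchet step is not stated, so RATCHET_SHIFTS below is a hypothetical value, and the LFSR is again modelled as a left-shifting Fibonacci register with the listed taps XOR-ed into the feedback bit.

```python
RATCHET_SHIFTS = 12  # hypothetical: shifts applied per acknowledged frame
TAPS = (15, 13, 11, 0)


def lfsr_step(state):
    fb = 0
    for t in TAPS:
        fb ^= (state >> t) & 1
    return ((state << 1) | fb) & 0xFFFF


class RatchetingKey:
    """Each side holds its own LFSR and advances it after every acknowledged frame."""

    def __init__(self, seed=0xBEEF):
        self.state = seed
        self.advance()  # initial shifts to derive the first key

    def advance(self, shifts=RATCHET_SHIFTS):
        for _ in range(shifts):
            self.state = lfsr_step(self.state)

    def key(self):
        s = self.state
        return (s << 24) | (s << 8) | (s >> 8)  # 40-bit key from the 16-bit state
```

After both sides advance, a ciphertext captured under the previous key no longer decrypts to the original payload, which is the replay protection described above.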

// 10

Scope for Improvement

The current system is functional but constrained by the experimental hardware and dataset size. Several clear directions exist for making it more robust and accurate.

Hardware

Dedicated microphone module. The laptop UART path introduces non-deterministic latency: USB buffers, OS scheduling, and the USB-to-serial bridge all add variable delay between audio capture and FPGA ingestion. A microphone connected directly to the FPGA GPIO would eliminate this jitter entirely, giving the pipeline a truly synchronous sample stream at a fixed 16 kHz with no dropped or duplicated samples.

Neural Network and Training

More training data. The current model is trained on a limited number of ILA-captured utterances. Increasing the dataset to several hundred labelled frames per speaker, covering different vocal effort levels, ambient noise conditions, and microphone distances, would significantly improve generalisation and reduce the chance of misclassification at the boundary.
More speakers. The current architecture outputs two logits, one per speaker. Extending to N speakers requires only widening the output layer and retraining; the rest of the FPGA datapath is unchanged. A rejection class (none of the above) would also make the system safer for real deployment.

FSK Modem

Error correction. The current frame has no error detection or forward error correction. A CRC-8 over the 40-bit payload would let the receiver detect frames corrupted by acoustic noise on the wireless link, and a Hamming(7,4) code would additionally let it correct single-bit errors.
Forward ratcheting robustness. The LFSR ratchet currently relies on both FPGAs receiving a clean ACK. A missed ACK causes key desynchronisation. The RX FPGA could assert an explicit acknowledgement signal back to the TX once a frame is successfully decoded; both sides only advance the LFSR after that confirmation, keeping the keys in lockstep even across noisy transmissions.
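As an illustration of the error-detection option, a CRC-8 can be appended to the 5-byte payload with very little logic. The polynomial below (0x07, zero initial value, no reflection) is one common choice and is an assumption here; it would extend the frame by 8 bits.

```python
def crc8(data, poly=0x07):
    """Bitwise CRC-8 (polynomial x^8 + x^2 + x + 1, init 0, no reflection)."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def append_crc(payload):
    """Return the payload with its CRC-8 byte appended."""
    return payload + bytes([crc8(payload)])


def check_crc(frame):
    """A frame is accepted when the CRC over payload plus CRC byte is zero."""
    return crc8(frame) == 0
```

Any single flipped bit changes the remainder, so the receiver can reject the frame instead of displaying a corrupted message. Correcting the error would need a code such as Hamming(7,4) instead.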