FPGA Neural Network
Speaker ID + FSK Modem
A fully hardware-accelerated audio intelligence system on the Basys3 FPGA. Microphone audio passes through a complete signal processing chain: pre-emphasis filtering, Hamming windowing, 512-point FFT, a 26-bin mel filterbank, and DCT to produce 13 MFCC coefficients per frame. A 32-frame sliding window feeds a 2-layer integer neural network for real-time speaker identification. A parallel FSK modem provides LFSR-encrypted inter-FPGA wireless messaging.
System Architecture
Speak into the microphone. The FPGA processes each 25 ms audio frame through the full signal chain and classifies the speaker every 10 ms. Speaker 1 (Sindhu) is granted access. Speaker 2 (Kaushik) is declined. On a successful identification, the authenticated user can send or receive a message of up to 5 characters over the FSK wireless link. A failed identification triggers a 5-second lockout before a retry is permitted.
Hardware Pipeline Overview
Pre-emphasis + Hamming Window
Before any spectral analysis, each audio frame goes through three conditioning stages: DC removal, a first-order high-pass pre-emphasis filter, and a Hamming window applied over 400 samples. These stages collectively improve the SNR and spectral resolution of the downstream FFT.
DC Removal
The UART mic receiver delivers 12-bit unsigned samples in the range [0, 4095]. A fixed offset of 2048 is subtracted to centre the signal around zero before any filtering:
x_dc = sample - 2048; // 12-bit unsigned -> 16-bit signed, zero-centred
Pre-emphasis Filter
Human speech has a natural spectral tilt: higher frequencies carry less energy than lower ones. Pre-emphasis applies a first-order high-pass filter that lifts high-frequency components by roughly 6 dB/octave, compensating for this tilt and improving the conditioning of mel filterbank outputs at higher bands.
y[n] = x[n] - alpha * x[n-1], where alpha = 124/128 = 0.969
Alpha is approximated in fixed-point as 124/128, implemented as a multiply by 124 followed by an arithmetic right shift of 7:
// Fixed-point alpha = 124 / 2^7 = 0.96875
coeff_term = x_prev * 124;
alpha_xprev = coeff_term >>> 7;
y_preemph = x_dc - alpha_xprev;
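The DC removal and pre-emphasis steps above can be modelled in a few lines of integer Python. This is a minimal sketch rather than the RTL itself: the function name `preemphasis_fixed` and the reset value of 0 for the `x[n-1]` register are assumptions made for illustration.

```python
def preemphasis_fixed(samples, alpha_num=124, shift=7):
    """Integer model of DC removal followed by y[n] = x[n] - (124 * x[n-1] >> 7)."""
    out = []
    x_prev = 0  # assumed reset value of the x[n-1] register
    for s in samples:
        x_dc = s - 2048                              # 12-bit unsigned -> zero-centred
        alpha_xprev = (x_prev * alpha_num) >> shift  # Python >> on ints is an arithmetic shift
        out.append(x_dc - alpha_xprev)
        x_prev = x_dc
    return out


print(preemphasis_fixed([2176, 2176]))  # a constant input is almost cancelled on the second sample
```

A constant (DC-like) input is attenuated to 1/32 of its value after the first sample, which is the high-pass behaviour the filter is meant to provide.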
Spectral Effect of Pre-emphasis
Circular Buffer and Hamming Window
Pre-emphasised samples accumulate in a 400-sample circular buffer backed by block RAM. Every 160 new samples (10 ms), a frame trigger fires and the readout FSM sweeps all 400 samples through the Hamming window, then zero-pads to 512 samples for the FFT.
Frame length: 400 samples (25 ms at 16 kHz). Hop size: 160 samples (10 ms). Overlap: 60%. Zero-pad to 512 for FFT.
Without windowing, the discontinuity at frame edges introduces spurious spectral energy, a phenomenon known as spectral leakage. The Hamming window suppresses this by tapering the signal smoothly to 0.08 at both ends while reaching 1.0 at the centre:
w[n] = 0.54 - 0.46 * cos(2*pi*n / 399), n = 0..399
Stored as Q0.15 unsigned in hamming_coeff.hex (400 entries):
w[0] = 0x0A3D (0.08 * 32767 = 2621)
w[199] = 0x7FFF (1.0 * 32767 = 32767)
w[399] = 0x0A3D (0.08 * 32767 = 2621)
Windowing: windowed[n] = (preemph[n] * hamming[n]) >> 15
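As a cross-check on the coefficient values listed above, the window table can be regenerated from the formula. The sketch below is illustrative only: `hamming_q15` and `window_frame` are hypothetical helper names, and the rounding mode used to build the original hex file is assumed to be round-to-nearest.

```python
import math


def hamming_q15(n_taps=400):
    """Hamming coefficients scaled by 32767 (Q0.15), one per sample."""
    return [
        round((0.54 - 0.46 * math.cos(2 * math.pi * n / (n_taps - 1))) * 32767)
        for n in range(n_taps)
    ]


def window_frame(frame, coeffs, fft_len=512):
    """Apply the window with a >> 15 rescale, then zero-pad to the FFT length."""
    windowed = [(x * w) >> 15 for x, w in zip(frame, coeffs)]
    return windowed + [0] * (fft_len - len(windowed))


coeffs = hamming_q15()
print(hex(coeffs[0]), hex(coeffs[199]), hex(coeffs[399]))
```

The first, middle and last entries reproduce the 0x0A3D, 0x7FFF and 0x0A3D values quoted for the ROM.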
FFT + Mel Filterbank
512-Point FFT
The 512 windowed-and-padded samples are fed into the Xilinx FFT v9.1 IP core configured for 16-bit fixed-point complex output. Because the input signal is real-valued, the FFT output is conjugate-symmetric: the upper half of the spectrum mirrors the lower half and carries no additional information. Only bins 0..255, covering DC up to just below the 8 kHz Nyquist frequency at a 16 kHz sample rate, are forwarded to the magnitude stage.
A Xilinx CORDIC v6.0 IP core converts each complex bin from Cartesian (Re, Im) to polar form. The 16 lower bits of the CORDIC magnitude output are taken as the spectral magnitude for each bin.
Bin width = 16000 / 512 = 31.25 Hz per bin. Useful range: bins 0 to 255, covering 0 to 7968.75 Hz.
26-Bin Mel Filterbank
The mel scale is a perceptual frequency scale that approximates how the human ear resolves pitch. It is approximately linear below 1 kHz and logarithmic above. Triangular filters placed on this scale give each mel band a weighted sum of nearby FFT bins, capturing perceptually relevant spectral shape while discarding fine-grained spectral detail.
The filterbank uses 26 triangular mel-spaced filters over the range 0 to 8 kHz.
The filter weights are precomputed and stored in a 256-entry distributed ROM:
// mel_filter_rom.hex: 256 entries, one per FFT bin 0..255
// Each entry: [mel_bank_lo(5b) | mel_bank_hi(5b) | weight_lo(16b)]
// weight_hi = 0xFFFF - weight_lo (complementary triangular weight)
// Accumulate: mel_energy[lo] += mag * weight_lo
// mel_energy[hi] += mag * weight_hi
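The ROM-driven accumulation can be sketched in Python as follows. The bit positions used in `unpack_entry` assume the fields are packed MSB-first in the order listed in the comment above; that packing, and both function names, are assumptions for illustration and may not match the actual hex file.

```python
def unpack_entry(entry):
    """Split a 26-bit ROM word into (bank_lo, bank_hi, weight_lo). Field order is assumed."""
    weight_lo = entry & 0xFFFF
    bank_hi = (entry >> 16) & 0x1F
    bank_lo = (entry >> 21) & 0x1F
    return bank_lo, bank_hi, weight_lo


def accumulate_mel(magnitudes, rom, n_banks=26):
    """Each FFT bin contributes to two adjacent mel bands with complementary weights."""
    mel_energy = [0] * n_banks
    for mag, entry in zip(magnitudes, rom):
        lo, hi, w_lo = unpack_entry(entry)
        w_hi = 0xFFFF - w_lo
        mel_energy[lo] += mag * w_lo
        mel_energy[hi] += mag * w_hi
    return mel_energy


entry = (3 << 21) | (4 << 16) | 0x4000   # bin shared between bands 3 and 4
print(accumulate_mel([2], [entry])[3:5])
```

Because the two weights always sum to 0xFFFF, each FFT bin's magnitude is split between exactly two neighbouring triangles, so one ROM read per bin is enough.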
Log Magnitude
After accumulation, each of the 26 mel band energies passes through a log2 approximation built from a leading-zero counter (LZC) and a 64-entry fractional lookup table. The result is in Q8.8 unsigned format, covering the full 32-bit dynamic range. Taking the log is standard in MFCC computation because it converts multiplicative gain variations into additive offsets. An offset that is the same in every band lands entirely in the lowest-index cepstral coefficient c[0] after the DCT, which can be discarded without affecting speaker discrimination.
// log2 approximation: LZC + fractional LUT
integer_part = 31 - lzc(mel_energy); // floor(log2(x))
mantissa = mel_energy >> (integer_part - 5);
frac_part = frac_lut[mantissa[5:0]]; // 64-entry LUT, 8-bit output
log_mel[k] = {3'b000, integer_part[4:0], frac_part}; // Q8.8, zero-padded to 16-bit unsigned
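A software model of the LZC-plus-LUT idea is sketched below. It is an approximation of the approach, not a bit-exact copy of the RTL: the LUT contents are regenerated from `log2(1 + i/64)`, and the sketch indexes the table with the six bits immediately below the leading one, which is an assumption about the bit selection. The names `FRAC_LUT` and `log2_q8_8` are illustrative.

```python
import math

# 64-entry table of the fractional part of log2, 8-bit output (assumed contents)
FRAC_LUT = [round(math.log2(1 + i / 64) * 256) for i in range(64)]


def log2_q8_8(x):
    """Approximate log2(x) in Q8.8 using the leading-one position plus a 6-bit LUT index."""
    if x <= 0:
        return 0
    integer_part = x.bit_length() - 1          # equals 31 - lzc(x) for a 32-bit value
    if integer_part >= 6:
        mantissa = (x >> (integer_part - 6)) & 0x3F
    else:
        mantissa = (x << (6 - integer_part)) & 0x3F
    return (integer_part << 8) | FRAC_LUT[mantissa]


for value in (3, 1000, 123456):
    print(value, log2_q8_8(value) / 256, math.log2(value))
```

With a 6-bit index the worst-case error of this scheme is a few hundredths of a bit, which is small compared with the 8 fractional bits carried forward.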
DCT-II and MFCC Coefficients
The final stage applies the Discrete Cosine Transform Type-II to the 26 log-mel energies, producing 13 cepstral coefficients. Adjacent mel filters overlap substantially, so their energies are correlated. The DCT decorrelates these outputs into a compact set of orthogonal coefficients that each describe a distinct aspect of the spectral envelope shape.
c[m] = sum over l=0..25 of log_mel[l] * cos(pi * m * (l + 0.5) / 26), for m = 0..12
The cosine weights are precomputed as Q1.15 signed integers (scaled by 32767) and stored in a 338-entry ROM (13 x 26 entries):
// 13 passes, 26 MACs each -- dct_cos_rom indexed as (cep_idx * 26) + mel_idx
acc_40b = 0;
for mel_idx = 0..25:
cos_w = dct_cos_rom[cep_idx * 26 + mel_idx]; // Q1.15 signed
acc_40b += (log_mel[mel_idx] >> 3) * cos_w; // >>3 prevents overflow
mfcc[cep_idx] = round(acc_40b) >>> 8; // final scale -> INT32
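The ROM contents and the MAC loop can be reproduced in Python to sanity-check the indexing. This is a sketch under stated assumptions: `build_dct_rom` regenerates the table with round-to-nearest, and `dct_mac` uses a plain truncating `>> 8` in place of the rounding step shown above. Both function names are hypothetical.

```python
import math

N_MEL, N_CEP = 26, 13


def build_dct_rom():
    """338 cosine weights in Q1.15, laid out as (cep_idx * 26) + mel_idx."""
    return [
        round(math.cos(math.pi * m * (l + 0.5) / N_MEL) * 32767)
        for m in range(N_CEP)
        for l in range(N_MEL)
    ]


def dct_mac(log_mel, rom):
    """13 passes of 26 multiply-accumulates, mirroring the loop structure above."""
    mfcc = []
    for cep_idx in range(N_CEP):
        acc = 0
        for mel_idx in range(N_MEL):
            acc += (log_mel[mel_idx] >> 3) * rom[cep_idx * N_MEL + mel_idx]
        mfcc.append(acc >> 8)  # truncating shift; the RTL rounds first
    return mfcc


rom = build_dct_rom()
print(len(rom), dct_mac([800] * N_MEL, rom)[:3])
```

Feeding a flat log-mel spectrum puts essentially all of the energy in coefficient 0 and leaves the others near zero, which is the decorrelating behaviour described above.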
Frame to Buffer
Each complete frame produces 13 MFCC coefficients, which are shifted into the 32-frame input buffer that feeds the neural network:
// nn_input_buffer: 32 x 13 = 416 coefficients, each 32-bit
mfcc_window_flat[13311:0] -- 416 * 32 = 13312 bits total
nn_start -- 1-cycle pulse per new frame once buffer full
buffer_full -- held high after 32nd frame arrives
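The buffer behaviour can be modelled with a fixed-length deque. The class below is a behavioural sketch only: `MfccWindow` is a hypothetical name, and the oldest-frame-first ordering of the flattened window is an assumption that may differ from the bit ordering of `mfcc_window_flat`.

```python
from collections import deque


class MfccWindow:
    """Sliding window of the most recent 32 frames of 13 coefficients."""

    def __init__(self, n_frames=32, n_coeffs=13):
        self.n_coeffs = n_coeffs
        self.frames = deque(maxlen=n_frames)  # oldest frame drops out automatically

    def push(self, coeffs):
        """Store one frame; return True (the nn_start condition) once the window is full."""
        if len(coeffs) != self.n_coeffs:
            raise ValueError("expected one coefficient per cepstral index")
        self.frames.append(list(coeffs))
        return len(self.frames) == self.frames.maxlen

    def flat(self):
        """Flatten to 416 values, oldest frame first (assumed ordering)."""
        return [c for frame in self.frames for c in frame]


win = MfccWindow()
starts = [win.push([i] * 13) for i in range(33)]
print(starts.count(True), len(win.flat()))
```

After the 32nd frame every new frame produces a fresh 416-value window, so inference runs once per 10 ms hop.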
Neural Network Inference Engine
The speaker identification network is a 2-layer fully-connected network described entirely in synthesisable Verilog. All arithmetic uses the Artix-7 DSP48 slices in integer mode. No floating-point units are used anywhere in the design.
416 normalised INT8 inputs. 16 hidden neurons with ReLU. 2 output logits (one per speaker). argmax(logit0, logit1) gives the speaker ID.
All weights stored as INT16, scaled by 2^10. After MAC accumulation, a right-shift of 10 bits recovers the float-equivalent magnitude before bias addition.
Each of the 416 inputs is zero-normalised using a precomputed mean array and a reciprocal-std array (both stored in ROM). Result is clipped to INT8 range [-128, 127].
417 norm cycles + 16 x 417 L1 MAC cycles + 2 x 17 L2 cycles + drain/writeback cycles, roughly 7,100 cycles in total. At 100 MHz that is about 71 µs, well inside the 10 ms hop (1,000,000 cycles).
Input Normalisation Pipeline
Before the MAC layers, the raw INT32 MFCC values are normalised. For each of the 416 input features, the precomputed mean is subtracted, then the result is multiplied by the precomputed reciprocal of the standard deviation. The reciprocal is stored in Q0.14 format (scaled by 2^14) so the multiply uses only integer DSP48 operations. The product is shifted right by NORM_SHIFT=14 bits and clipped to INT8.
x_norm[i] = clip( (mfcc_in[i] - x_mean_rom[i]) * recip_rom[i] >> 14, -128, 127 )
// x_mean_rom: 416 x INT32 (precomputed per-feature mean)
// recip_rom: 416 x 11-bit (precomputed 127/sigma * 2^14 as in the export script, pattern repeats 32x)
// NORM_SHIFT = 14
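The normalisation formula translates directly into integer Python. This is a minimal sketch with a hypothetical function name and made-up test values; the only behaviour it relies on is that `>>` on Python integers is an arithmetic shift, which floors negative products just as `>>>` does on signed values in Verilog.

```python
def normalise(mfcc_in, x_mean, recip, norm_shift=14):
    """Per-feature (x - mean) * recip >> 14, clipped to the INT8 range."""
    out = []
    for x, mean, r in zip(mfcc_in, x_mean, recip):
        scaled = ((x - mean) * r) >> norm_shift  # arithmetic shift, floors negatives
        out.append(max(-128, min(127, scaled)))
    return out


# Illustrative values only, not taken from the real ROMs
print(normalise([1000, -50000, 50000, 1100], [1000, 0, 0, 1000], [512] * 4))
```

Large deviations saturate at the INT8 limits, so a single outlier feature cannot overflow the layer-1 accumulator.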
FSM State Machine
// 7-state FSM drives the full inference sequence
S_IDLE -- wait for nn_start pulse
S_NORM -- normalise 416 inputs: 3-stage pipeline + 3 drain cycles
S_L1_MAC -- MAC: 416 inputs * weight, 2-stage DSP48 pipeline per neuron
S_L1_WB -- write ReLU(acc >> MAC_SHIFT) to l1_out[neuron], advance neuron
S_L2_MAC -- MAC: 16 l1_out * w2, same pipeline
S_L2_WB -- write logit0 or logit1
S_OUTPUT -- pulse valid=1, return to S_IDLE
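Ignoring cycle timing, the arithmetic that the FSM sequences can be written as a short integer forward pass. This is a sketch under explicit assumptions: the bias is added after the `MAC_SHIFT` right-shift and is expressed in the same scale as the shifted accumulator, and ties in the argmax go to speaker 0. The real RTL may order or scale these steps differently. The function names and the tiny 2-input example are illustrative.

```python
MAC_SHIFT = 10  # weights are scaled by 2^10


def dense(inputs, weights, biases, relu):
    """One fully-connected layer: integer MAC, shift, bias, optional ReLU."""
    out = []
    for w_row, b in zip(weights, biases):
        acc = sum(x * w for x, w in zip(inputs, w_row))
        y = (acc >> MAC_SHIFT) + b          # assumed: bias added after the shift
        out.append(max(0, y) if relu else y)
    return out


def infer(x_norm, w1, b1, w2, b2):
    """Two layers followed by argmax over the two logits."""
    hidden = dense(x_norm, w1, b1, relu=True)
    logits = dense(hidden, w2, b2, relu=False)
    speaker = 0 if logits[0] >= logits[1] else 1
    return speaker, logits


# Toy dimensions (2 inputs, 2 hidden neurons); the design uses 416 and 16
w1 = [[1024, 0], [0, 1024]]
w2 = [[1024, 0], [-1024, 0]]
print(infer([10, -20], w1, [0, 0], w2, [0, 5]))
```

The same two functions work unchanged with 416 inputs and 16 hidden neurons, since only the list lengths differ.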
Memory Layout
| ROM | Entries | Width | Mapping |
|---|---|---|---|
| w1_rom | 6656 (16 x 416) | INT16 | Block RAM |
| w2_rom | 32 (2 x 16) | INT16 | Distributed LUT |
| x_mean_rom | 416 | INT32 | Block RAM |
| recip_rom | 416 | 11-bit | Distributed LUT |
Timing Analysis and Closure
After initial synthesis, Vivado reported catastrophic timing violations concentrated entirely in the neural network module. The root cause was combinatorial paths that exceeded the 10 ns clock period before reaching DSP48 multiply inputs.
Root Cause: Two Critical Paths
Both paths share the same failure mode: a long combinatorial chain arrives at a DSP48 multiply input without being registered first. On Artix-7 at 100 MHz, a combinatorial path that runs through a LUT mux tree or a BRAM output and then straight into a DSP48 multiply does not fit in the 10 ns budget.
Fix: 2-Stage Registered Pipelines
The principle is to ensure both DSP48 multiply inputs are driven directly by flip-flop outputs before the multiply fires. A registered input contributes only clock-to-Q and routing delay to the multiply path, so the LUT mux tree and BRAM read latency are absorbed in the preceding cycle.
Training and Results
The model was trained in PyTorch on MFCC data captured directly from the FPGA via the Vivado Integrated Logic Analyzer (ILA). Training on ILA data is critical: every training sample has passed through the same RTL signal processing chain as inference, so the training distribution includes all fixed-point rounding artefacts from the hardware.
Data Collection via ILA
A Python host script streams live microphone audio over UART to the FPGA, which runs the full RTL MFCC pipeline. The ILA is configured to capture the resulting MFCC windows under live conditions. The captures are stored as k_p1.csv, k_p2.csv (Kaushik) and s_p1.csv, s_p2.csv (Sindhu).
Training uses 8-frame x 13-coeff = 104-dimensional windows with stride 1 for maximum data augmentation. The FPGA inference uses the full 32-frame window (416 dims).
Adam optimiser (lr=1e-3, weight_decay=1e-4), StepLR scheduler (step 50, gamma 0.5), CrossEntropyLoss, 200 epochs, batch size 64.
A golden model test verifies bit-exact match between PyTorch INT16 arithmetic and the manual fixed-point simulation before writing the hex weight files.
Quantisation and Export
After training converges in float32, the weights are quantised post-training. The quantisation scheme mirrors the hardware fixed-point format exactly so that inference on the FPGA produces the same classification as the PyTorch integer model.
import numpy as np
# W_SCALE = 1024 = 2^10 (matches MAC_SHIFT=10 in RTL)
w1_int = np.clip(np.round(w1_float * 1024), -32767, 32767).astype(np.int16)
w2_int = np.clip(np.round(w2_float * 1024), -32767, 32767).astype(np.int16)
# Biases scaled by W_SCALE^2 because the hardware accumulates before shifting
b1_int = np.clip(np.round(b1_float * 1024**2), -(2**31), 2**31-1).astype(np.int32)
# Normalisation: recip_rom = round(127.0 * 2^14 / X_std)
recip_vals = np.round(127.0 * 2**14 / X_std).astype(np.int32)
# Write hex files for $readmemh in Verilog.
# Cast to uint16 so negative weights become 4-digit two's-complement hex;
# masking an int16 array with 0xFFFF behaves differently across NumPy versions.
np.savetxt("speaker_id_w1_new.hex", w1_int.astype(np.uint16), fmt="%04x")
np.savetxt("speaker_id_w2_new.hex", w2_int.astype(np.uint16), fmt="%04x")
Accuracy
A training vs. inference volume mismatch caused MFCC coeff[0] (log-energy) to dominate classification, making the model always predict one speaker. The fix was to train exclusively on ILA-captured live FPGA output and to drop the four most volume-correlated coefficients (0, 4, 5, 9), using the nine most stable ones instead.
FSK Modem
A parallel FSK (Frequency Shift Keying) modem enables encrypted inter-FPGA wireless messaging. The transmitter generates audio tones, the receiver demodulates them via IIR envelope detection, and a frame synchroniser recovers the byte-aligned payload.
Bit 0: ~1000 Hz (16 samples/cycle at 16 kHz). Bit 1: ~2000 Hz (8 samples/cycle). Each bit lasts 160 samples = 10 ms. Tone waveforms from an 8-entry sine LUT.
16-bit LFSR with polynomial taps [15,13,11,0]. The LFSR is run 12 times to generate a 40-bit key, which is XOR'd with the payload before transmission and again on receive to decrypt.
8-bit preamble (10101010) for carrier detect, 8-bit start flag (11111110) for byte alignment, and a 40-bit LFSR-encrypted payload. Total duration: 560 ms at 10 ms per bit.
Two independent IIR filters (alpha=1/16) track the rectified energy of the high (2 kHz) and low (1 kHz) tones. The bit decision compares both envelopes using a configurable hysteresis threshold.
Frame Structure
// Total: 56 bits @ 10ms/bit = 560 ms per frame
[ 8-bit preamble: 10101010 ] -- carrier detect, edge synchronisation
[ 8-bit start flag: 11111110 ] -- frame alignment marker
[ 40-bit payload: XOR(data, key) ] -- LFSR-encrypted message
// LFSR key generation
seed = 0xBEEF -- 16-bit initial state
taps = [15, 13, 11, 0] -- feedback polynomial
// Run 12 shifts, then concatenate 16 + 16 + 8 bits of LFSR state = 40-bit key
key = {lfsr11, lfsr11, lfsr11[15:8]}
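The key generation and framing can be sketched in Python as below. Several details are assumptions rather than facts from the RTL: the LFSR is modelled as a left-shifting Fibonacci register whose feedback is the XOR of bits 15, 13, 11 and 0, the state after the 12th shift is the one used in the concatenation, and the payload is padded with spaces to 5 ASCII characters. All function names are hypothetical.

```python
def lfsr_step(state):
    """One shift of a 16-bit Fibonacci LFSR with taps [15, 13, 11, 0] (assumed direction)."""
    fb = ((state >> 15) ^ (state >> 13) ^ (state >> 11) ^ state) & 1
    return ((state << 1) | fb) & 0xFFFF


def make_key(seed=0xBEEF, shifts=12):
    """Run the LFSR, then build the 40-bit key {s, s, s[15:8]}."""
    s = seed
    for _ in range(shifts):
        s = lfsr_step(s)
    return (s << 24) | (s << 8) | (s >> 8)


def encrypt(text, key):
    """XOR up to 5 ASCII characters with the key; applying it twice decrypts."""
    data = int.from_bytes(text.encode("ascii")[:5].ljust(5, b" "), "big")
    return data ^ key


def build_frame(payload40):
    """56-bit frame: preamble 10101010, start flag 11111110, 40-bit payload."""
    return (0xAA << 48) | (0xFE << 40) | payload40


key = make_key()
frame = build_frame(encrypt("HELLO", key))
print(f"{frame:056b}")
```

Because XOR is its own inverse, the receiver recovers the plaintext by applying the same key to the 40 payload bits.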
Receiver FSM and IIR Detection
The receiver runs two IIR envelope filters in parallel. One is tuned to the energy pattern of the 2 kHz tone via consecutive sample differences, and the other targets the 1 kHz tone via 2-sample stride differences. Each filter implements an exponential moving average with alpha ~1/16 as a single right-shift, requiring no division logic.
// IIR envelope: alpha = 1/16 (one shift operation, no division)
env_hi = env_hi - (env_hi >>> 4) + abs(sample[n] - sample[n-1]); // 2 kHz
env_lo = env_lo - (env_lo >>> 4) + abs(sample[n] - sample[n-2]); // 1 kHz
// Bit decision with hysteresis (default: deadzone=15, thresh=300)
if (env_hi > env_lo + env_thresh): bit = 1;
else if (env_lo > env_hi + env_thresh): bit = 0;
else: hold previous bit;
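The two building blocks of the detector, the shift-based envelope filter and the hysteresis comparison, can be modelled separately. This sketch does not attempt to reproduce the full demodulator or the deadzone handling; `iir_envelope` and `decide_bit` are hypothetical names, and the inputs in the example are made-up numbers.

```python
def iir_envelope(diffs, shift=4):
    """Leaky accumulator env = env - (env >> 4) + |d|; returns the envelope after each input."""
    env = 0
    trace = []
    for d in diffs:
        env = env - (env >> shift) + abs(d)
        trace.append(env)
    return trace


def decide_bit(env_hi, env_lo, thresh, prev_bit):
    """Compare the two envelopes; inside the hysteresis band the previous bit is held."""
    if env_hi > env_lo + thresh:
        return 1
    if env_lo > env_hi + thresh:
        return 0
    return prev_bit


print(iir_envelope([100] * 500)[-1])       # settles at 16x the input level
print(decide_bit(1000, 500, 300, prev_bit=0))
```

Note that in this form the envelope settles at 16 times the average input difference, so the threshold values are expressed in that scaled unit rather than in raw sample differences.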
Four sensitivity modes are selectable via FPGA switches at runtime: Normal (deadzone=15, thresh=300), High Sensitivity (deadzone=8, thresh=100), Noise Reject (deadzone=30, thresh=800), Debug/Max (deadzone=4, thresh=50). This allows adaptation to different acoustic environments without recompiling.
Preamble Synchronisation
The receiver FSM starts in S_IDLE, waiting for a rising edge on the decoded bit stream. In S_HUNT it clocks in bits looking for the start byte pattern (11111110) to establish byte alignment, with a timeout of 45 bit periods to prevent lockup on noise. Once the start byte is detected, S_RECEIVE collects the 40 payload bits, and the decoded data is XOR'd with the same LFSR key to recover the plaintext message.
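The hunt-and-receive behaviour can be approximated with a shift register over an already-decoded bit list. This is a simplified sketch: it skips the S_IDLE edge detection, treats the 45-bit timeout as a limit on how many bits are searched for the start flag, and uses the hypothetical name `find_payload`.

```python
def find_payload(bits, start_flag=0b11111110, timeout=45, payload_len=40):
    """Search for the start flag, then return the following 40 bits, or None on failure."""
    shift = 0
    for i, b in enumerate(bits[:timeout]):
        shift = ((shift << 1) | b) & 0xFF          # last 8 bits seen
        if shift == start_flag:
            payload = bits[i + 1 : i + 1 + payload_len]
            return payload if len(payload) == payload_len else None
    return None                                    # timed out without alignment


preamble = [1, 0, 1, 0, 1, 0, 1, 0]
flag = [1, 1, 1, 1, 1, 1, 1, 0]
payload = [1, 0] * 20
print(find_payload(preamble + flag + payload) == payload)
```

The alternating preamble never contains seven consecutive ones, so it cannot be mistaken for the start flag, and a stream of pure noise that never matches simply times out.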
The design includes forward key ratcheting. Every successfully acknowledged transmission advances the LFSR seed by running it through several more shift iterations, producing a new XOR key for the next frame. Because both FPGAs apply the same deterministic LFSR sequence independently, they stay synchronised without ever exchanging the key explicitly. A replayed frame from a previous exchange carries a stale key and decrypts to garbage, providing replay protection over the wireless link.
Scope for Improvement
The current system is functional but constrained by the experimental hardware and dataset size. Several clear directions exist for making it more robust and accurate.