1. Voice Volume Tracker - Parte 2: ML, Alertas y Testing
Continuación del Caso de Estudio Principal
1.1 🤖 4. Machine Learning: Identificación de Voz (Speaker Verification)
1.1.1 4.1 ¿Por qué Speaker Verification?
Problema: Necesitamos diferenciar la voz del usuario de otras voces en el entorno
Solución: Speaker Verification usando embeddings de voz
Alternativas Consideradas:
- Umbral de volumen simple (rechazado: alerta con cualquier voz)
- Filtro por dirección del micrófono (rechazado: no funciona con omnidireccionales)
- Speaker Verification con ML (seleccionado)
1.1.2 4.2 Modelo: SpeechBrain ECAPA-TDNN
Decisión: Usar modelo pre-entrenado SpeechBrain ECAPA-TDNN
Características:
- Arquitectura: ECAPA-TDNN (Emphasized Channel Attention, Propagation and Aggregation Time Delay Neural Network)
- Embedding size: 192 dimensiones
- Precisión: > 98% en VoxCeleb dataset
- Latencia: < 50ms en CPU moderno
Conversión a ONNX:
# scripts/convert_to_onnx.py
import torch
from speechbrain.pretrained import EncoderClassifier
# Cargar modelo pre-entrenado
classifier = EncoderClassifier.from_hparams(
source="speechbrain/spkrec-ecapa-voxceleb",
savedir="pretrained_models/spkrec-ecapa-voxceleb"
)
# Crear input dummy
dummy_input = torch.randn(1, 16000) # 1 segundo de audio a 16kHz
# Exportar a ONNX
torch.onnx.export(
classifier.mods.embedding_model,
dummy_input,
"speechbrain_ecapa.onnx",
input_names=['audio'],
output_names=['embedding'],
dynamic_axes={
'audio': {0: 'batch', 1: 'time'},
'embedding': {0: 'batch'}
}
)
1.1.3 4.3 Fase de Enrollment (Registro de Voz)
Objetivo: Capturar embeddings de la voz del usuario para crear un perfil
Proceso:
1. Usuario abre wizard de enrollment
↓
2. Lee 5 frases diferentes (15 segundos cada una)
↓
3. Sistema extrae embedding de cada frase
↓
4. Promedia los 5 embeddings → Embedding de referencia
↓
5. Guarda embedding encriptado en DB
Implementación:
// VoiceTracker.Infrastructure/ML/EmbeddingExtractor.cs
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;
public class EmbeddingExtractor
{
private readonly InferenceSession _session;
private readonly int _sampleRate = 16000;
public EmbeddingExtractor(string modelPath)
{
_session = new InferenceSession(modelPath);
}
public float[] Extract(float[] audioSamples)
{
// Preparar input tensor
var inputTensor = new DenseTensor<float>(
audioSamples,
new[] { 1, audioSamples.Length }
);
var inputs = new List<NamedOnnxValue>
{
NamedOnnxValue.CreateFromTensor("audio", inputTensor)
};
// Ejecutar inferencia
using var results = _session.Run(inputs);
var embedding = results.First(r => r.Name == "embedding")
.AsEnumerable<float>()
.ToArray();
// Normalizar embedding (L2 norm)
return NormalizeEmbedding(embedding);
}
private float[] NormalizeEmbedding(float[] embedding)
{
var norm = Math.Sqrt(embedding.Sum(x => x * x));
return embedding.Select(x => x / (float)norm).ToArray();
}
}
// VoiceTracker.Domain/UseCases/EnrollVoiceUseCase.cs
public class EnrollVoiceUseCase
{
private readonly IEmbeddingExtractor _extractor;
private readonly IVoiceProfileRepository _repository;
public async Task<VoiceProfile> EnrollAsync(List<float[]> audioSamples)
{
if (audioSamples.Count < 3)
throw new ArgumentException("Need at least 3 samples for enrollment");
// Extraer embeddings de cada muestra
var embeddings = new List<float[]>();
foreach (var sample in audioSamples)
{
var embedding = _extractor.Extract(sample);
embeddings.Add(embedding);
}
// Promediar embeddings
var referenceEmbedding = AverageEmbeddings(embeddings);
// Crear perfil
var profile = new VoiceProfile
{
Id = Guid.NewGuid(),
ReferenceEmbedding = new VoiceEmbedding(referenceEmbedding),
EnrollmentDate = DateTime.UtcNow,
SampleCount = audioSamples.Count
};
await _repository.SaveAsync(profile);
return profile;
}
private float[] AverageEmbeddings(List<float[]> embeddings)
{
var embeddingSize = embeddings[0].Length;
var averaged = new float[embeddingSize];
for (int i = 0; i < embeddingSize; i++)
{
averaged[i] = embeddings.Average(e => e[i]);
}
return averaged;
}
}
1.1.4 4.4 Verificación en Tiempo Real
Objetivo: Comparar embedding actual con embedding de referencia
Métrica: Similitud de coseno
similarity = (A · B) / (||A|| * ||B||)
Umbral: similarity > 0.75 → Es el usuario
Implementación:
// VoiceTracker.Infrastructure/ML/ONNXSpeakerVerifier.cs
public class ONNXSpeakerVerifier : ISpeakerVerifier
{
private readonly EmbeddingExtractor _extractor;
private readonly IVoiceProfileRepository _repository;
private readonly double _similarityThreshold = 0.75;
private VoiceProfile _currentProfile;
public async Task LoadProfileAsync()
{
_currentProfile = await _repository.GetCurrentProfileAsync();
if (_currentProfile == null)
throw new InvalidOperationException("No voice profile found. Please enroll first.");
}
public bool IsUserSpeaking(float[] audioSamples)
{
if (_currentProfile == null)
return false;
// Extraer embedding del audio actual
var currentEmbedding = _extractor.Extract(audioSamples);
// Calcular similitud de coseno
var similarity = CosineSimilarity(
currentEmbedding,
_currentProfile.ReferenceEmbedding.Values
);
return similarity >= _similarityThreshold;
}
private double CosineSimilarity(float[] a, float[] b)
{
if (a.Length != b.Length)
throw new ArgumentException("Embeddings must have same length");
double dotProduct = 0;
double normA = 0;
double normB = 0;
for (int i = 0; i < a.Length; i++)
{
dotProduct += a[i] * b[i];
normA += a[i] * a[i];
normB += b[i] * b[i];
}
return dotProduct / (Math.Sqrt(normA) * Math.Sqrt(normB));
}
}
1.1.5 4.5 Optimización de Performance
Problema: Inferencia ML puede ser costosa en CPU
Soluciones:
- Batching: Procesar múltiples frames juntos
- Caching: No re-calcular si el audio es similar al anterior
- Throttling: Verificar cada 500ms en lugar de cada frame
// VoiceTracker.Infrastructure/ML/OptimizedSpeakerVerifier.cs
public class OptimizedSpeakerVerifier : ISpeakerVerifier
{
private readonly ONNXSpeakerVerifier _verifier;
private readonly TimeSpan _verificationInterval = TimeSpan.FromMilliseconds(500);
private DateTime _lastVerification = DateTime.MinValue;
private bool _lastResult = false;
public bool IsUserSpeaking(float[] audioSamples)
{
var now = DateTime.UtcNow;
// Throttling: solo verificar cada 500ms
if (now - _lastVerification < _verificationInterval)
return _lastResult;
_lastResult = _verifier.IsUserSpeaking(audioSamples);
_lastVerification = now;
return _lastResult;
}
}
Referencia: 20 - Machine Learning
1.2 🚨 5. Sistema de Alertas Visuales
1.2.1 5.1 Notificaciones Toast de Windows
Decisión: Usar WinUI 3 Notifications para alertas discretas
Implementación:
// VoiceTracker.Infrastructure/Alerts/ToastNotificationService.cs
using Microsoft.Toolkit.Uwp.Notifications;
public class ToastNotificationService : IAlertService
{
public void ShowAlert(Decibel currentVolume, Decibel threshold)
{
new ToastContentBuilder()
.AddText("🔊 Volumen Alto Detectado")
.AddText($"Tu volumen: {currentVolume.Value:F1} dB")
.AddText($"Límite: {threshold.Value:F1} dB")
.AddButton(new ToastButton()
.SetContent("Entendido")
.AddArgument("action", "dismiss"))
.Show();
}
public void HideAlert()
{
ToastNotificationManagerCompat.History.Clear();
}
}
1.2.2 5.2 Overlay de Bordes Rojos (DirectX)
Decisión: Crear overlay transparente con bordes rojos usando DirectX
Características:
- Ventana transparente fullscreen
- Siempre en primer plano (topmost)
- No bloquea interacción con otras ventanas
- Animación de fade in/out
Implementación:
// VoiceTracker.Infrastructure/Alerts/ScreenOverlayService.cs
using System.Windows;
using System.Windows.Media;
using System.Windows.Media.Animation;
public class ScreenOverlayWindow : Window
{
private readonly int _borderThickness = 10;
public ScreenOverlayWindow()
{
// Configurar ventana
WindowStyle = WindowStyle.None;
AllowsTransparency = true;
Background = Brushes.Transparent;
Topmost = true;
ShowInTaskbar = false;
// Fullscreen
Left = 0;
Top = 0;
Width = SystemParameters.PrimaryScreenWidth;
Height = SystemParameters.PrimaryScreenHeight;
// Permitir clicks a través de la ventana
var hwnd = new WindowInteropHelper(this).Handle;
SetWindowExTransparent(hwnd);
}
protected override void OnRender(DrawingContext dc)
{
base.OnRender(dc);
// Dibujar bordes rojos
var redBrush = new SolidColorBrush(Colors.Red);
var rect = new Rect(0, 0, Width, Height);
var pen = new Pen(redBrush, _borderThickness);
dc.DrawRectangle(Brushes.Transparent, pen, rect);
}
public void ShowWithAnimation()
{
Opacity = 0;
Show();
var animation = new DoubleAnimation
{
From = 0,
To = 1,
Duration = TimeSpan.FromMilliseconds(300),
EasingFunction = new QuadraticEase()
};
BeginAnimation(OpacityProperty, animation);
}
public void HideWithAnimation()
{
var animation = new DoubleAnimation
{
From = 1,
To = 0,
Duration = TimeSpan.FromMilliseconds(300),
EasingFunction = new QuadraticEase()
};
animation.Completed += (s, e) => Hide();
BeginAnimation(OpacityProperty, animation);
}
[DllImport("user32.dll")]
private static extern int SetWindowLong(IntPtr hWnd, int nIndex, int dwNewLong);
private void SetWindowExTransparent(IntPtr hwnd)
{
const int GWL_EXSTYLE = -20;
const int WS_EX_TRANSPARENT = 0x00000020;
const int WS_EX_LAYERED = 0x00080000;
SetWindowLong(hwnd, GWL_EXSTYLE, WS_EX_LAYERED | WS_EX_TRANSPARENT);
}
}
public class ScreenOverlayService : IAlertService
{
private ScreenOverlayWindow _overlay;
public void ShowAlert(Decibel currentVolume, Decibel threshold)
{
if (_overlay == null)
{
_overlay = new ScreenOverlayWindow();
}
_overlay.ShowWithAnimation();
}
public void HideAlert()
{
_overlay?.HideWithAnimation();
}
}
1.2.3 5.3 Niveles de Severidad
Decisión: 3 niveles de alerta según cuánto se excede el umbral
| Nivel | Exceso | Color | Acción |
|---|---|---|---|
| Advertencia | 0-5 dB | Amarillo | Toast notification |
| Moderado | 5-10 dB | Naranja | Toast + Bordes finos |
| Severo | > 10 dB | Rojo | Toast + Bordes gruesos + Vibración |
// VoiceTracker.Domain/ValueObjects/AlertLevel.cs
public enum AlertLevel
{
None,
Warning,
Moderate,
Severe
}
public static class AlertLevelExtensions
{
public static AlertLevel GetLevel(Decibel current, Decibel threshold)
{
var excess = current.Value - threshold.Value;
if (excess <= 0) return AlertLevel.None;
if (excess <= 5) return AlertLevel.Warning;
if (excess <= 10) return AlertLevel.Moderate;
return AlertLevel.Severe;
}
}
1.3 🧪 6. Testing Exhaustivo
1.3.1 6.1 Unit Tests para Cálculo de Decibeles
// VoiceTracker.Tests/Unit/DecibelCalculatorTests.cs
using Xunit;
using FluentAssertions;
public class DecibelCalculatorTests
{
private readonly DecibelCalculator _calculator = new();
[Fact]
public void Calculate_WithSilence_ReturnsSilenceLevel()
{
// Arrange
var samples = new float[1000]; // All zeros
// Act
var result = _calculator.Calculate(samples);
// Assert
result.Should().Be(Decibel.Silence);
}
[Theory]
[InlineData(0.1, -20.0)] // Aproximadamente
[InlineData(0.5, -6.0)]
[InlineData(1.0, 0.0)]
public void Calculate_WithKnownRMS_ReturnsExpectedDB(double rms, double expectedDb)
{
// Arrange
var samples = Enumerable.Repeat((float)rms, 1000).ToArray();
// Act
var result = _calculator.Calculate(samples);
// Assert
result.Value.Should().BeApproximately(expectedDb, 0.5);
}
}
1.3.2 6.2 Property-Based Testing para Audio
Objetivo: Generar miles de inputs aleatorios y verificar propiedades invariantes
// VoiceTracker.Tests/Property/AudioProcessingProperties.cs
using FsCheck;
using FsCheck.Xunit;
public class AudioProcessingProperties
{
[Property]
public Property DecibelCalculation_ShouldNeverExceedZero()
{
return Prop.ForAll<float[]>(samples =>
{
if (samples == null || samples.Length == 0)
return true;
var calculator = new DecibelCalculator();
var result = calculator.Calculate(samples);
return result.Value <= 0;
});
}
[Property]
public Property VoiceActivityDetection_ShouldBeConsistent()
{
return Prop.ForAll<float[]>(samples =>
{
if (samples == null || samples.Length < 100)
return true;
var vad = new VoiceActivityDetector();
var result1 = vad.DetectVoice(samples);
var result2 = vad.DetectVoice(samples);
return result1 == result2; // Debe ser determinístico
});
}
}
1.3.3 6.3 Integration Tests para FSM
// VoiceTracker.Tests/Integration/FSMIntegrationTests.cs
public class FSMIntegrationTests
{
[Fact]
public async Task FSM_ShouldTransitionCorrectly_WhenVolumeExceedsThreshold()
{
// Arrange
var config = new AppConfig
{
DayThreshold = new Decibel(-20),
NightThreshold = new Decibel(-30)
};
var alertService = new Mock<IAlertService>();
var fsm = new VoiceTrackerFSM(config, alertService.Object);
// Act
var frame = new AudioFrame
{
Samples = GenerateLoudAudio(), // -15 dB
HasAudio = true
};
fsm.ProcessAudioFrame(frame, isUserVoice: true, new Decibel(-15));
// Assert
fsm.CurrentState.Should().Be(SystemState.AlertActive);
alertService.Verify(a => a.ShowAlert(It.IsAny<Decibel>()), Times.Once);
}
private float[] GenerateLoudAudio()
{
// Generar audio con RMS = 0.2 → -14 dBFS
return Enumerable.Repeat(0.2f, 1600).ToArray();
}
}
1.3.4 6.4 Performance Profiling
Objetivo: Asegurar que el consumo de CPU/RAM esté dentro de los límites
// VoiceTracker.Tests/Performance/PerformanceTests.cs
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
[MemoryDiagnoser]
public class AudioProcessingBenchmarks
{
private float[] _audioSamples;
private DecibelCalculator _calculator;
private EmbeddingExtractor _extractor;
[GlobalSetup]
public void Setup()
{
_audioSamples = GenerateRandomAudio(16000); // 1 segundo
_calculator = new DecibelCalculator();
_extractor = new EmbeddingExtractor("speechbrain_ecapa.onnx");
}
[Benchmark]
public Decibel CalculateDecibels()
{
return _calculator.Calculate(_audioSamples);
}
[Benchmark]
public float[] ExtractEmbedding()
{
return _extractor.Extract(_audioSamples);
}
private float[] GenerateRandomAudio(int length)
{
var random = new Random(42);
return Enumerable.Range(0, length)
.Select(_ => (float)(random.NextDouble() * 2 - 1))
.ToArray();
}
}
// Ejecutar: dotnet run -c Release --project VoiceTracker.Tests
Resultados Esperados:
| Method | Mean | Allocated |
| ------------------- | ----------: | ----------: |
| CalculateDecibels | 15.2 μs | 32 B |
| ExtractEmbedding | 45.3 ms | 2.4 KB |
Referencia: 04 - Testing
1. ⬆️ Volver arriba | ⬅️ Parte 1 | ➡️ Parte 3 | 🏠 Casos de Estudio