Work & Data Flow


Knovvu's Speech Recognition (SR) system converts spoken language into text. It uses advanced algorithms and models to transcribe spoken words into written text accurately.

As a first stage, the integration method for using the SR service is decided. The available integration methodologies are described on the Integrations page.

After that, the recognition method is chosen based on the usage scenario. If grammar recognition is used, a grammar file is prepared according to the requirements of the scenario. Otherwise, an appropriate language model is uploaded to use the dictation method of SR. Details of the recognition methodologies are described on the Recognition Measurement page.

Audio Input

Users provide audio input through various sources such as microphones, audio files, or real-time streaming.
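As a minimal sketch of consuming audio input, the example below writes a short synthetic tone to an in-memory WAV "file" with Python's standard `wave` module and reads it back, the same way an uploaded audio file or buffered microphone capture would be consumed. The 16 kHz / 16-bit mono format is an illustrative assumption, not a statement of Knovvu's supported formats.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 16000  # assumed sample rate for this sketch

# Write a 0.1 s, 440 Hz mono sine tone as 16-bit PCM into an in-memory buffer.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)           # mono
    w.setsampwidth(2)           # 16-bit samples
    w.setframerate(SAMPLE_RATE)
    for n in range(SAMPLE_RATE // 10):
        sample = int(12000 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE))
        w.writeframes(struct.pack("<h", sample))

# Read it back as raw PCM bytes, as a recognizer front end would.
buf.seek(0)
with wave.open(buf, "rb") as r:
    pcm = r.readframes(r.getnframes())

print(len(pcm))  # 1600 samples x 2 bytes each
```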

Audio Preprocessing

The incoming audio is preprocessed to enhance quality and remove noise. This may involve techniques like noise reduction, normalization, and segmentation.
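The three techniques named above can be sketched in a few lines. This is an illustrative toy, not Knovvu's actual preprocessing: peak normalization, a crude amplitude noise gate, and segmentation into fixed-length frames.

```python
import math

def preprocess(samples, frame_len=160, noise_floor=0.02):
    # Normalization: scale so the loudest sample has magnitude 1.0.
    peak = max(abs(s) for s in samples) or 1.0
    normalized = [s / peak for s in samples]
    # Noise reduction (toy version): zero out very quiet samples.
    gated = [s if abs(s) > noise_floor else 0.0 for s in normalized]
    # Segmentation: split into non-overlapping frames of frame_len samples.
    return [gated[i:i + frame_len]
            for i in range(0, len(gated) - frame_len + 1, frame_len)]

# 0.1 s of a 440 Hz tone at 16 kHz.
signal = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
frames = preprocess(signal)
print(len(frames), len(frames[0]))  # 10 frames of 160 samples
```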

Feature Extraction

The preprocessed audio is transformed into a format suitable for analysis. Mel-frequency cepstral coefficients (MFCCs) or spectrograms are commonly extracted as features that represent the audio's frequency and time characteristics.
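A spectrogram can be computed with nothing more than windowed FFTs, as in the sketch below (frame and hop sizes are assumed values; real systems typically apply a mel filterbank on top of this to obtain MFCCs).

```python
import numpy as np

def spectrogram(samples, frame_len=400, hop=160):
    # Slide a Hann window over the signal and take the magnitude of the
    # real FFT of each frame: one row of frequency-bin energies per step.
    window = np.hanning(frame_len)
    frames = [samples[i:i + frame_len] * window
              for i in range(0, len(samples) - frame_len + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1))

# One second of a 440 Hz tone at 16 kHz.
t = np.arange(16000) / 16000.0
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (time frames, frequency bins)
```

With a 400-sample frame at 16 kHz, each bin spans 40 Hz, so the 440 Hz tone shows up as a peak in bin 11.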

Acoustic Model

This step is performed only when a hybrid model is used.

The extracted features are fed into an acoustic model. This model has been trained on vast amounts of speech data to recognize phonetic patterns and acoustic characteristics of spoken language.
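The shape of an acoustic model's output can be illustrated with a toy stand-in: a single random linear layer plus softmax that maps each feature frame to a probability distribution over phoneme classes. Real acoustic models are deep networks trained on large speech corpora; the weights, phoneme set, and feature size here are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
PHONEMES = ["sil", "k", "ae", "t"]   # toy phoneme inventory
N_FEATURES = 13                      # e.g. 13 MFCCs per frame (assumed)

weights = rng.normal(size=(N_FEATURES, len(PHONEMES)))  # untrained, random

def acoustic_posteriors(feature_frames):
    # Linear layer followed by a softmax over phoneme classes,
    # yielding one probability distribution per input frame.
    logits = feature_frames @ weights
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

frames = rng.normal(size=(5, N_FEATURES))  # 5 fake feature frames
post = acoustic_posteriors(frames)
print(post.shape)  # (5, 4): 5 frames, 4 phoneme probabilities each
```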

Language Model

Simultaneously, the system employs a language model that predicts the likelihood of word sequences in the given language. This helps refine the transcription by incorporating linguistic context.
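The idea of predicting word-sequence likelihood can be shown with a tiny bigram model estimated from counts over a toy corpus; production language models are far larger n-gram or neural models.

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
bigrams = Counter(zip(corpus, corpus[1:]))  # counts of adjacent word pairs
unigrams = Counter(corpus[:-1])             # counts of words in "prev" position

def bigram_prob(prev, word):
    # Maximum-likelihood estimate of P(word | prev).
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

# "the" occurs 3 times as a preceding word; twice it is followed by "cat".
print(bigram_prob("the", "cat"))
```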

Alignment

This step is performed only when a hybrid model is used.

The acoustic and language models work in tandem to align the recognized phonetic patterns with potential word sequences. This alignment process refines the transcription accuracy.
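Alignment can be sketched as a dynamic-programming search: given per-frame phoneme scores from the acoustic model and a candidate phoneme sequence (here "k ae t" for "cat"), find the monotonic frame-to-phoneme assignment with the highest total score. The scores below are hand-made; real aligners work on lattices of many hypotheses.

```python
def align(frame_scores, sequence):
    # frame_scores[t][p] = score of phoneme p at frame t.
    T, S = len(frame_scores), len(sequence)
    NEG = float("-inf")
    best = [[NEG] * S for _ in range(T)]  # best[t][s]: best score ending at s
    back = [[0] * S for _ in range(T)]    # backpointer to state at t-1
    best[0][0] = frame_scores[0][sequence[0]]
    for t in range(1, T):
        for s in range(S):
            stay = best[t - 1][s]                       # repeat same phoneme
            move = best[t - 1][s - 1] if s > 0 else NEG  # advance to next one
            back[t][s] = s if stay >= move else s - 1
            best[t][s] = max(stay, move) + frame_scores[t][sequence[s]]
    # Trace back from the final phoneme at the last frame.
    path, s = [], S - 1
    for t in range(T - 1, -1, -1):
        path.append(sequence[s])
        s = back[t][s]
    return path[::-1]

scores = [  # 4 frames of hand-made phoneme scores
    {"k": 0.9, "ae": 0.1, "t": 0.0},
    {"k": 0.8, "ae": 0.2, "t": 0.0},
    {"k": 0.1, "ae": 0.9, "t": 0.1},
    {"k": 0.0, "ae": 0.3, "t": 0.7},
]
print(align(scores, ["k", "ae", "t"]))  # ['k', 'k', 'ae', 't']
```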

Decoding

The aligned data goes through a decoding process where the system generates the most probable transcription for the given audio input. This is where the recognized speech is converted into text.
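One common decoding strategy is beam search, which keeps only the top few hypotheses at each step while combining acoustic and language-model scores. The vocabulary, scores, and beam width below are hand-made toy values, not Knovvu's actual decoder.

```python
import math

ACOUSTIC = [  # per time step: candidate word -> acoustic log-probability
    {"recognize": math.log(0.6), "wreck": math.log(0.4)},
    {"speech": math.log(0.5), "a nice beach": math.log(0.5)},
]
LM = {        # toy bigram log-probabilities; <s> marks sentence start
    ("<s>", "recognize"): math.log(0.5), ("<s>", "wreck"): math.log(0.5),
    ("recognize", "speech"): math.log(0.9),
    ("recognize", "a nice beach"): math.log(0.1),
    ("wreck", "speech"): math.log(0.2),
    ("wreck", "a nice beach"): math.log(0.8),
}

def decode(beam_width=2):
    beams = [(["<s>"], 0.0)]  # (word sequence, total log score)
    for step in ACOUSTIC:
        candidates = []
        for words, score in beams:
            for word, acoustic_score in step.items():
                lm_score = LM.get((words[-1], word), math.log(1e-6))
                candidates.append((words + [word],
                                   score + acoustic_score + lm_score))
        # Keep only the beam_width highest-scoring hypotheses.
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams[0][0][1:]  # drop the <s> marker

print(decode())  # ['recognize', 'speech']
```

The language model is what tips the balance here: acoustically, "wreck a nice beach" is nearly as likely as "recognize speech", but the bigram scores favor the latter.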

Post-Processing

The generated text may undergo post-processing to correct any contextual errors or inconsistencies in the transcription. This step may involve grammar correction, punctuation insertion, and homophone disambiguation.
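A minimal post-processing pass might look like the sketch below: a couple of hand-picked, context-based homophone fixes, a terminal punctuation mark, and sentence capitalization. These rules are toy examples, not Knovvu's actual post-processor.

```python
import re

# Hand-picked homophone corrections keyed on surrounding context.
HOMOPHONES = {
    re.compile(r"\btheir is\b"): "there is",
    re.compile(r"\bto dogs\b"): "two dogs",
}

def post_process(text):
    # Homophone disambiguation via the context patterns above.
    for pattern, fix in HOMOPHONES.items():
        text = pattern.sub(fix, text)
    # Punctuation insertion: ensure the sentence ends with a terminal mark.
    text = text.strip()
    if text and not text.endswith((".", "?", "!")):
        text += "."
    # Capitalize the sentence start.
    return text[:1].upper() + text[1:]

print(post_process("their is a cat in the garden"))
# "There is a cat in the garden."
```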

Response Generation

Along with the recognized text, the system returns a confidence score that indicates how reliable the recognition result is. Finally, if Cloud SR is used, the generated response undergoes a normalization process.
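A response carrying a confidence score, plus a toy normalization pass that rewrites spelled-out numbers as digits, can be sketched as follows. The response shape, the 0.5 acceptance threshold, and the tiny number table are illustrative assumptions; real normalization covers far more (dates, currencies, abbreviations, and so on).

```python
# Toy inverse-text-normalization table (illustrative only).
NUMBER_WORDS = {"one": "1", "two": "2", "three": "3", "four": "4", "five": "5"}

def normalize(text):
    # Replace each spelled-out number word with its digit form.
    return " ".join(NUMBER_WORDS.get(w, w) for w in text.split())

# Hypothetical recognition response: hypothesis text plus confidence.
response = {"text": "send five dollars", "confidence": 0.94}
if response["confidence"] >= 0.5:   # act only on confident results
    response["text"] = normalize(response["text"])
print(response["text"])  # "send 5 dollars"
```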

Output

The final transcription is presented as the output to the user. This text representation of the spoken language can be used for various purposes, such as documentation, real-time captions, or voice-enabled applications.

[Figure: SR.png — speech recognition work and data flow diagram]