AI-based Auto-Transcription


Since 2024, INTERACT Premium offers an offline, AI-based auto-transcription module.

After correct installation, this module enables you to auto-transcribe any video with clear audio locally, fully GDPR compliant.

Note: During the installation of INTERACT and the first run of the auto-transcribe module, an internet connection is required.

Start an Automated Transcription

Open your video in INTERACT.

Open an INTERACT data file.

Create a DataSet by clicking the Add Set button.

Click Insert > Multimedia link > To DataSet.

Or...

Right-click inside the corresponding DataSet.

Choose Insert file reference > Link current videos to current DataSet from the context menu.

Make sure you can open your linked video by double-clicking a time stamp in the DataSet; otherwise, the transcription tool is not able to 'find' the video.

Configure Auto-Transcript

Open the configuration dialog with the command Text > Text analysis > Autotranscribe-Whisper.

[Screenshot: the Autotranscribe-Whisper menu command]

The configuration dialog appears:

[Screenshot: the Auto-Transcribe configuration dialog with its transcript export formats]

With the default settings, you already receive a rather good transcription of your video.

Speech and text options

Model - The selected model determines both the quality of the result and the time it takes to complete the transcription.
The base model is a very good compromise.
For a rough index of the spoken words, even the tiny model might be sufficient.
You'll need to test which model works best for your videos and hardware setup.
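
The module is based on the open-source Whisper models, as the menu command's name indicates. If you want to compare models outside of INTERACT before settling on one, a minimal Python sketch with the openai-whisper package does so (the file name is a placeholder):

import whisper

# Load one of the models listed further below; "base" is the default compromise.
model = whisper.load_model("base")

# Transcribe a media file; Whisper extracts the audio track via ffmpeg.
result = model.transcribe("my_video.mp4")

print(result["text"])            # the full transcript as one string
for seg in result["segments"]:   # per-sentence segments with timing
    print(f'{seg["start"]:.2f}s - {seg["end"]:.2f}s: {seg["text"]}')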

Repetitive Transcription passes

This drop-down list offers the following options:

o Skip file and do not create Events - If the video has already been transcribed, nothing happens.

o Overwrite and transcribe again - Previous transcriptions are overwritten and the video is processed again.

o Use existing transcript for creating Events - Previous transcriptions are used to re-create Events in the current data file.

Add transcripts as INTERACT Events - This option ensures the automated creation of INTERACT Events. If you clear this option, you can import the SRT file into INTERACT later.
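
An SRT file is plain text, so an existing transcript can also be inspected or processed outside of INTERACT. A minimal Python sketch that reads such a file into (start, end, text) entries, the same information an Event carries (the srt package and the file name are assumptions, not part of INTERACT):

import srt

# Read the subtitle file produced by the transcription run.
with open("my_video.srt", encoding="utf-8") as f:
    subtitles = list(srt.parse(f.read()))

for sub in subtitles:
    # sub.start and sub.end are datetime.timedelta values; sub.content is the spoken text.
    print(sub.start, sub.end, sub.content)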

Speaker recognition options

Specify the number of speakers in your video to prevent the identification of speakers that are not there.

Note: Speaker recognition only works if the voices are easy to separate. Participants with similar voices will receive the same speaker ID. You need to manually verify the results and may need to change the speaker ID for certain Events.

TIP: If speaker detection is not required, or is difficult due to similar voices, set the Max. Speaker value to 1. This speeds up the transcription routine a lot because all lines receive the same speaker label.
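
INTERACT handles speaker recognition internally. Purely to illustrate how capping the speaker count works in general, here is a sketch using the open-source pyannote.audio library as a stand-in; the model name and access token are assumptions, and this is not INTERACT's implementation:

from pyannote.audio import Pipeline

# Load a pretrained diarization pipeline (requires a Hugging Face access token).
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN",  # placeholder token
)

# num_speakers caps the number of detected speakers, like the Max. Speaker value.
diarization = pipeline("my_video.wav", num_speakers=2)

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")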

Exporting Options

Transcript format - Specifies the file format of the resulting text file. SRT and VTT are specific subtitle formats that can also be imported directly into INTERACT.

Transcript type - Specifies how the Events are created: per sentence or per word. A per-word transcription creates an Event for every single word, which results in accurate timing per word (see the sketch below).

Highlight words on subtitles - Only of interest if you actually plan to use the exported subtitle file for a video, e.g. on YouTube.
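
For the per-word transcript type mentioned above, the underlying Whisper models can attach a time stamp to every single word. A minimal sketch with the openai-whisper package (the file name is a placeholder):

import whisper

model = whisper.load_model("base")

# word_timestamps=True yields start/end times per word instead of per sentence.
result = model.transcribe("my_video.mp4", word_timestamps=True)

for seg in result["segments"]:
    for word in seg["words"]:
        print(f'{word["start"]:.2f}s - {word["end"]:.2f}s: {word["word"]}')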

Output path

Specifies where the transcript file is stored. Creating this file in the same directory as the video makes it easy to find.

Transcription Process

The model you select determines the quality of the transcription.
The better the quality, the longer it takes for the transcription to complete.

IMPORTANT: Larger models require much more computer resources and will not work on every system!

[Screenshot: the Auto-Transcribe progress bar]

The length of a video and the number of spoken words are further important factors for the duration of the task.

Some indications about the duration of the transcription process:

o A 30-second video running on a decent Core i7 CPU takes about 30 seconds when using the base model, but 5 minutes when using the medium model.

o The same 30-second video on a correctly configured GPU takes less than 20 seconds for the medium model and about 3 minutes for the large model (if your GPU offers enough memory).

These are only rough estimates and cannot simply be scaled linearly for longer videos, but they indicate the difference between those three models.
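
Because these figures depend heavily on your hardware, the most reliable numbers come from a quick test on your own machine. A minimal timing sketch with the openai-whisper package (the file name and the model selection are placeholders):

import time
import whisper

# Compare how long each model needs for the same short clip.
for name in ("tiny", "base", "medium"):
    model = whisper.load_model(name)
    t0 = time.perf_counter()
    model.transcribe("my_video.mp4")
    print(f"{name}: {time.perf_counter() - t0:.1f} s")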

Models & Resources

Graphics cards (GPUs) that support CUDA can speed up the transcription process, and larger models require more memory than small ones. The following list indicates the amount of GPU memory that is required per model:

o Model "tiny": 1 GB

o Model "tiny.en": 1 GB

o Model "base": 1 GB

o Model "base.en": 1 GB

o Model "small": 2 GB

o Model "small.en": 2 GB

o Model "medium": 5 GB

o Model "medium.en": 5 GB

o Model "large-v1": 10 GB

o Model "large-v2": 10 GB

o Model "large-v3": 10 GB

o Model "large": 10 GB

o Model "large-v3-turbo": 6 GB

o Model "turbo": 6 GB

Thus, if you select one of the larger models but your computer only has a regular GPU with little memory, transcription will not work.
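
You can check beforehand whether a CUDA-capable GPU is available and how much memory it offers. A sketch using PyTorch, which the Whisper models run on:

import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB of GPU memory")
else:
    print("No CUDA GPU found - transcription will run on the CPU")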