Hi, I'm @ryusei__46.
The content of this article is just what the title suggests. However, when I tried to go a little beyond the basics, I found a lot of missing information along the way, so I decided to share what I learned.
In the past, I created a desktop application called "Transmedia", which takes overseas (mainly English) videos and audio, translates them into Japanese using DeepL, and displays them with subtitles in a dedicated player.
When building this application, I needed robust transcription processing in Python. For that I used the "Whisper" machine learning model released as open source by OpenAI, and a Python library for it is what this article deals with.
Now that I've advertised the application I produced, I'd like to get down to the meat of the article from here.
Confirmation of Execution Environment
My execution environment (hardware) is as follows:
- OS:Windows 11 Pro 64bit
- CPU:AMD Ryzen 9 5900X 12 Core Processor
- GPU:NVIDIA GeForce RTX 2060 SUPER 8G
- Memory:DDR4 32GB 3200MHz
- Storage:NVMe M.2 SSD 1TB Read/Write 7000MB/s
I will run OpenAI's "Whisper" model in the above hardware environment.
To use the "Whisper" model, we will use a Python library called "faster-whisper".
"faster-whisper" is a third-party reimplementation of Whisper that offers better performance than the library officially provided by OpenAI.
Specifically, there are two main advantages: faster processing and lower memory usage.
Processing is roughly 5 times faster than the official library, and memory usage is roughly 3 to 4 times lower.
Unless you have a specific reason not to, it is better to use this library.
Preparing to use the Whisper model
If you have not yet prepared a Python environment for Windows, please refer to the following sites, which are also available for Mac and Linux users.
Now, we will first install the machine learning library "PyTorch", which we will use together with faster-whisper.
Installing PyTorch is where things get a little troublesome.
If your PC does not have a graphics card, processing will be performed by the CPU; in that case, installing PyTorch is easy. You can install it with the following pip command.
pip3 install torch torchvision torchaudio
However, if you want to use a graphics card for processing, you will also need to set up an environment for "CUDA". At the same time, you will also need to install cuDNN. cuDNN is a GPU-accelerated library of deep learning primitives that runs on CUDA.
It is possible to use the Whisper model with the CPU alone, but it takes a very long time with the most accurate "large-v2" size or the "large-v3" size recently released by OpenAI. On my PC, transcription took more than twice the length of the audio itself, and this will vary with CPU performance.
When I tried it with a graphics card, the RTX 2060 SUPER finished in about one-fifth of the audio's length. That is blazing fast by comparison, so if your PC has a graphics card, I strongly recommend setting up an environment where you can use CUDA.
Check the supported versions of CUDA and cuDNN
First, go to the official PyTorch "GET STARTED" page to check the supported CUDA versions.
As of this writing, CUDA is supported up to version 12.1, so we will use version 12.1 this time.
On this page, you can also select the items matching your environment to see the corresponding installation command. Once you have installed CUDA and cuDNN, you will be able to run the pip command shown there to install PyTorch.
Next, open the "Support Matrix - NVIDIA Docs" page and check that the CUDA version, OS, NVIDIA driver version, and GPU architecture you are using are all supported.
For "NVIDIA Driver Version," if you have installed the graphics card driver, a tool called "NVIDIA Control Panel" is installed, so use "Win + Q" on the keyboard to search and launch the application.
"Supported NVIDIA Hardware" is a bit confusing, but it is the name of the architecture used in each graphics card. Here is the "List of Nvidia graphics processing units" where you can see the architecture of the graphics card you are using.
The amount of detailed information on the graphics cards of the past generations is quite huge in this page, but you can use "Ctrl + F" on the keyboard to search the page and find the graphics card you are using. In my case, it is "GeForce RTX 2060 SUPER", so I know that it is classified as "GeForce 20 series" and the name of the architecture is "Turing".
So, for "Supported NVIDIA Hardware", I see that it is "NVIDIA Turing", so there is no problem.
Install Build Tools for Visual Studio
Before installing CUDA, "Build Tools for Visual Studio" must be installed first. Without it, the CUDA Toolkit will not work.
Scroll down to the bottom of this page, and you will see an item titled "All Downloads." Select "Tools for Visual Studio" in that section, and in the accordion that opens, you will find "Build Tools for Visual Studio 2022," so click "Download" to download the installer.
When the installer starts, a screen similar to the one below will appear. Please check the "Desktop Development with C++" checkbox to complete the installation.
Installation of CUDA and cuDNN
Now, we will install the versions of the "CUDA Toolkit" and "cuDNN" that we confirmed above on Windows.
For the CUDA Toolkit, open the "CUDA Toolkit Archive" page and click on "CUDA Toolkit 12.1.0 (February 2023)", since I am using version 12.1.
On the download screen, you can choose the type of installer that best suits your environment.
Once the installer has been downloaded, run it directly to complete the installation.
To confirm that CUDA has been installed, execute the following command in a command prompt or in PowerShell.
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_19:00:59_Pacific_Daylight_Time_2022
Cuda compilation tools, release 12.1, V11.7.64
Build cuda_12.1.r11.7/compiler.31294372_0
Next, install "CUDNN". To install CUDNN, open the "NVIDIA cuDNN" page and click the "Download cuDNN Library" button. If you are downloading CUDNN for the first time, you will need to register an account, so please do so on your own.
On the next screen, check the "I Agree To the Terms of the cuDNN Software License Agreement" checkbox and the version of cuDNN available for download will be displayed. In my case, I selected "Download cuDNN v8.9.6 (November 1st, 2023), for CUDA 12.x" and click "Local Installer for Windows (Zip)" from the installer download links displayed below, Download the installer in a compressed ZIP file.
The CUDA installation path should be "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1". Computing Toolkit\CUDA\v12.1
". Overwrite it here and the CUDNN installation is complete.
Installation of "PyTorch" and "Faster-Whisper
Now that you are ready to use PyTorch with CUDA, open the official PyTorch "GET STARTED" page and copy the installation command that matches your environment.
In my case, it will be "PyTorch Build: Stable (2.1.1)", "Your OS: Windows", "Package: Pip", "Language: Python", and "Compute Platform: CUDA 12.1".
Finally, copy the installation command output in the "Run this Command:" section.
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
You should now be able to use CUDA from PyTorch. To verify that CUDA is enabled, run the following code in Python interactive mode or as a Python script file.
import torch
torch.cuda.is_available() # OK if this prints True
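For a slightly more detailed check, the following sketch (assuming only that PyTorch was installed with the CUDA-enabled wheel as above) also prints the name of the detected GPU and the CUDA and cuDNN versions that PyTorch was built against.
import torch

if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))   # e.g. "NVIDIA GeForce RTX 2060 SUPER"
    print(torch.version.cuda)               # CUDA version the installed PyTorch build targets
    print(torch.backends.cudnn.version())   # cuDNN version PyTorch detects
else:
    print("CUDA is not available")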
The next step is to install "faster-whisper", which can be done by executing the following command.
pip3 install -U faster-whisper
You are now ready to use the Whisper model for transcription.
Transcribing text with "faster-whisper"
If you simply want to transcribe audio, you can use the following code.
from faster_whisper import WhisperModel

model_size = "large-v2"

# Run on GPU with INT8
model = WhisperModel(model_size, device="cuda", compute_type="int8")
# or run on GPU with INT8 + FP16
# model = WhisperModel(model_size, device="cuda", compute_type="int8_float16")
# or run on CPU with INT8
# model = WhisperModel(model_size, device="cpu", compute_type="int8")

fileName = "audio.m4a"

segments, info = model.transcribe(
    fileName, word_timestamps=True,
    initial_prompt="こんにちは、私は山田です。最後まで句読点を付けてください。",
    beam_size=5, language='ja'
)

print("Detected language '%s' with probability %f" % (info.language, info.language_probability))

with open("transcribed.txt", 'w', encoding="utf-8") as f:
    for segment in segments:
        f.write( segment.text + "\n" )
        print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
The following values can be specified for model_size:
- tiny: 39M parameters / faster processing, but less accurate
- base: 74M parameters
- small: 244M parameters / default model
- medium: 769M parameters
- large: 1550M parameters / takes longer to process, but more accurate
- large-v1: 1550M parameters / updated version of the large model
- large-v2: 1550M parameters / was the latest and most accurate model until "large-v3" was released in late 2023
- large-v3: current latest model
The further down the list you go, the higher the accuracy, but the greater the processing load and memory usage. With faster-whisper, even the highly accurate large-v2 requires only about 6 GB of VRAM.
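As one way to put that VRAM figure to use, here is a minimal sketch (assuming PyTorch is installed as above; the 6 GB threshold simply reuses the rough figure mentioned above, and the fallback to "medium" is my own choice) that selects a model size based on the available GPU memory:
import torch
from faster_whisper import WhisperModel

if torch.cuda.is_available():
    # Total VRAM of the first GPU, in gigabytes
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    # Use large-v2 only when roughly 6 GB of VRAM or more is available
    model_size = "large-v2" if vram_gb >= 6 else "medium"
    model = WhisperModel(model_size, device="cuda", compute_type="float16")
else:
    model = WhisperModel("small", device="cpu", compute_type="int8")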
For compute_type, you can choose int8, float16, or float32. This value sets the numerical precision the Whisper model uses: the larger the bit width, the higher the accuracy, but the greater the load.
The word_timestamps parameter determines whether the start and end times of each individual word are also included in the results. The word-level information can then be accessed through segment.words, which is useful when you want to shape the data in detail.
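For example, with word_timestamps=True, the word-level information can be printed like this (a small sketch, assuming a segments result obtained from model.transcribe(...) as above; note that the result is a generator and is consumed as you iterate over it):
for segment in segments:
    for word in segment.words:
        # Each word carries its own start and end time in seconds
        print("[%.2fs -> %.2fs] %s" % (word.start, word.end, word.word))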
The initial_prompt parameter passes an initial prompt to the Whisper model as a hint for the output. In this example, "Hello, my name is Yamada. Please add punctuation all the way to the end." is set to hint that segments should be properly punctuated and broken at natural points.
The language parameter specifies the language of the audio or video file to be processed. The language can be detected automatically if it is not specified, but that incurs some overhead, so it is better to specify it to reduce processing time.
Format and output results to a JSON file
The following code I created will output all data, including word-by-word information.
# transcribe.py
import sys, glob, json, re

print("########## Start of transcription ##########")

try:
    from torch import cuda
    from faster_whisper import WhisperModel
except ImportError:
    sys.stderr.write("The 'faster_whisper' or 'pytorch' module could not be found.\n")
    sys.exit(1)

mediaSourcePath = sys.argv[1]
mediaFilePath = glob.glob( f"{ mediaSourcePath }\\*" )[0]
model_size = sys.argv[2]
useLang = sys.argv[3] if len( sys.argv ) > 3 else None

transcribe_results = []
plasticated_result = []

model = WhisperModel(
    model_size, compute_type="float16",
    device= 'cuda' if cuda.is_available() else 'cpu'
)

segments, info = model.transcribe(
    mediaFilePath, beam_size=5, word_timestamps=True,
    initial_prompt="こんにちは、私は山田です。Hello, I am Yamada.",
    language=useLang
)

print("Detected language '%s' with probability %f" % (info.language, info.language_probability))

for segment in segments:
    transcribe_tmp = { "start": segment.start, "end": segment.end, "subtitle": segment.text, "word_timestamps": [] }
    for word in segment.words:
        transcribe_tmp["word_timestamps"].append({ "start": word[0], "end": word[1], "text": word[2] })
    transcribe_results.append( transcribe_tmp )
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

# Write the raw segment and word data to transcribe.json
with open( f"{ mediaSourcePath }\\transcribe.json", "w", encoding="utf-8" ) as json_file:
    json.dump( transcribe_results, json_file, indent=2, ensure_ascii=False )

# Reassemble the word-level data into sentence-sized segments
if len( glob.glob( f"{ mediaSourcePath }\\transcribe.json" ) ) > 0:
    word_tmp, word_start_tmp, skip_flag = "", 0, False
    for transcribe in transcribe_results:
        for word in transcribe["word_timestamps"]:
            word_tmp += word["text"]
            if not skip_flag:
                word_start_tmp = word["start"]
                skip_flag = True
            if re.search( r"(\.|\?|。)$", word["text"] ):
                plasticated_result.append({
                    "start": word_start_tmp,
                    "end": word["end"], "text": re.sub( r"^\s+", "", word_tmp )
                })
                skip_flag = False
                word_tmp = ""

    with open( f"{ mediaSourcePath }\\plasticated.json", "w", encoding="utf-8" ) as json_file:
        json.dump( plasticated_result, json_file, indent=2, ensure_ascii=False )

    with open( f"{ mediaSourcePath }\\plasticated.plain.txt", mode="w", encoding="utf-8" ) as txt_file:
        for index, plasticated in enumerate( plasticated_result ):
            txt_file.write( str( index ) + ': ' + plasticated['text'] + "\n" )
Pass arguments when executing it on the command line.
$ py transcribe.py [media_source_path] [model_size] [language]
In media_source_path, set the path to the folder in which you have placed the prepared video or audio file. model_size is the name of the model size you want to use, such as base or small. language is the language code of the media to be processed; if nothing is specified, the language is detected automatically.
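For example, assuming the audio or video file has been placed in a folder named media (a hypothetical folder name) and you want to transcribe English audio with the large-v2 model, the call would look like this:
$ py transcribe.py .\media large-v2 en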
When processing is completed, "plasticated.json" and "transcribe.json" are generated in the folder. A sample of the processing result is shown below.
// transcribe.json
[
  {
    "start": 8.540000000000003, // start time of appearance (seconds)
    "end": 10.88, // end time of appearance
    "subtitle": " Introducing Apple Vision Pro.", // transcribed text
    "word_timestamps": [ // further breakdown into per-word information
      {
        "start": 8.540000000000003,
        "end": 9.280000000000001,
        "text": " Introducing"
      },
      {
        "start": 9.280000000000001,
        "end": 10.02,
        "text": " Apple"
      },
      {
        "start": 10.02,
        "end": 10.42,
        "text": " Vision"
      },
      {
        "start": 10.42,
        "end": 10.88,
        "text": " Pro."
      }
    ]
  },
  {
    "start": 12.16,
    "end": 16.0,
    "subtitle": " The era of spatial computing is here.",
    "word_timestamps": [
      {
        "start": 11.870000000000001,
        "end": 12.24,
        "text": " The"
      },
      {
        "start": 12.24,
        "end": 12.62,
        "text": " era"
      },
      {
        "start": 12.62,
        "end": 12.96,
        "text": " of"
      },
      {
        "start": 12.96,
        "end": 13.28,
        "text": " spatial"
      },
      {
        "start": 13.28,
        "end": 13.8,
        "text": " computing"
      },
      {
        "start": 13.8,
        "end": 15.58,
        "text": " is"
      },
      {
        "start": 15.58,
        "end": 16.0,
        "text": " here."
      }
    ]
  },
  // ... the rest is omitted
]
In "plasticated.json", word-by-word data is formatted based on the "transcribe.json" file, and ". (period) in English. (period) in English as a single segment.
// plasticated.json
[
  {
    "start": 8.540000000000003,
    "end": 10.88,
    "text": "Introducing Apple Vision Pro."
  },
  {
    "start": 11.870000000000001,
    "end": 16.0,
    "text": "The era of spatial computing is here."
  },
  {
    "start": 22.57,
    "end": 27.38,
    "text": "When you put on Apple Vision Pro, you see your world and everything in it."
  },
  // ... the rest is omitted
]
I believe the above data can be used flexibly for a variety of applications.
Creation of SRT files (subtitles)
An SRT file (SubRip Subtitle File) is a file that is used to display subtitles along with a video. It is supported by many media players and is used when you want to add subtitles to your video.
Creating these files by hand is tedious, but by using Python and Whisper to automate transcription and SRT file creation, you can greatly reduce the workload and improve work efficiency.
If you just want to show subtitles while playing a video in a media player, simply give the SRT file the same name as the video file and place it in the same directory, and it will be loaded automatically. For more information, see the Microsoft page "Using SRT Files to Display Subtitles During Video Playback".
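For reference, the SRT format itself is simple: each entry consists of a sequence number, a time range written as hh:mm:ss,mmm --> hh:mm:ss,mmm, the subtitle text, and a blank line. Using the sample data shown earlier, the generated entries would look roughly like this:
1
00:00:08,540 --> 00:00:10,880
Introducing Apple Vision Pro.

2
00:00:12,160 --> 00:00:16,000
The era of spatial computing is here.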
The following code can be used to create an SRT file.
# srt-whisper.py
from faster_whisper import WhisperModel
import math, os, sys

def convert_seconds_to_hms(seconds):
    # Convert a time in seconds to the SRT "hh:mm:ss,mmm" notation
    hours, remainder = divmod(seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    milliseconds = math.floor((seconds % 1) * 1000)
    output = f"{int(hours):02}:{int(minutes):02}:{int(seconds):02},{milliseconds:03}"
    return output

media_path = sys.argv[1]
language = sys.argv[2]
model_size = sys.argv[3]

model = WhisperModel(model_size, device="cuda", compute_type="int8")

segments, info = model.transcribe(
    media_path,
    initial_prompt="こんにちは、私は山田です。最後まで句読点を付けてください。",
    beam_size=5, language=language
)

print("Detected language '%s' with probability %f" % (info.language, info.language_probability))

count = 0
# Write to an SRT file with the same name as the media file, in the current directory
srt_path = os.path.splitext(os.path.basename(media_path))[0] + ".srt"
with open(srt_path, 'w', encoding="utf-8") as f:
    for segment in segments:
        count += 1
        duration = f"{convert_seconds_to_hms(segment.start)} --> {convert_seconds_to_hms(segment.end)}\n"
        text = f"{segment.text.lstrip()}\n\n"
        f.write(f"{count}\n{duration}{text}")  # Write the formatted entry to the file
        print(f"{duration}{text}", end='')
The following arguments are passed on the command line:
$ py srt-whisper.py [media_path] [language] [model_size]
This will output an SRT file with the same name as the media file in the current directory.
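For example, assuming a file named audio.m4a containing Japanese audio (the file name is just an illustration), the following generates audio.srt in the current directory:
$ py srt-whisper.py .\audio.m4a ja large-v2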
This concludes the article.
Extra: Introducing a Transcription Tool
Finally, I would like to introduce a transcription web tool that is still in beta and runs as a self-hosted instance of Whisper. Currently, the model size is "small", but you can easily transcribe audio or video files simply by uploading them.
Output formats are SRT, JSON, CSV, and TEXT.
Ultimately, the goal is to support text translation as well, so that foreign videos can be easily dubbed into Japanese.