Converting DVD (VobSub) subtitles to SRT (SubRip) format under Linux
2008, December 30.
This article explains how to convert subtitles extracted from a DVD (VobSub) into a human-readable text format (SubRip) under Linux. The process described below should also apply to Windows users, because AviDemux is available for Windows too.
If you have ever looked at the files stored on a regular video DVD, you will have noticed that it contains files with the VOB extension. These files contain all the data for a movie: audio, video, menus and even the subtitles.
Unfortunately, the subtitles inside such a VOB file are stored in the image-based VobSub format. You can think of these subtitles as separate images, each containing the text to be displayed during a specific part of the movie. Changing the text of the subtitles directly is not practical, as it would require you to repaint each letter in each image by hand.
The way around this issue is to convert the images into text using optical character recognition (OCR) algorithms. Such an algorithm scans the separate subtitle images and tries to recognise the characters in them. This is similar to the procedure used to interpret handwriting on a touch-screen. Keep in mind that OCR algorithms are still error-prone and often misrecognise the characters they encounter. Therefore it is important to always spell-check the results!
One convenient way to convert VobSub subtitles into text-based subtitles on Linux is the built-in OCR feature of the AviDemux application. This feature lets you convert from the VobSub to the SRT (SubRip) format. This is useful because SubRip is the format supported by most media player applications and even by DVD players that are capable of playing MPEG-4 files. Moreover, the result is human readable and requires much less disk space than the VobSub version, so converting the subtitles to SubRip can help them fit on a CD together with a movie file. Note that AviDemux also comes with a function to extract the subtitles from the VOB files of a DVD, which produces the VobSub file itself.
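To illustrate what "human readable" means here, the following minimal sketch shows one SubRip entry and how its timing line can be read with Python. A SubRip entry consists of a running number, a timing line of the form "start --> end", and one or more lines of text, followed by a blank line. The entry text and the helper name to_milliseconds are invented for this example.

```python
import re

# SubRip time stamps look like HH:MM:SS,mmm (comma before the milliseconds).
TIMESTAMP = re.compile(r"(\d+):(\d{2}):(\d{2}),(\d{3})")


def to_milliseconds(stamp):
    """Convert an SRT time stamp such as 00:01:02,500 to milliseconds."""
    hours, minutes, seconds, millis = map(int, TIMESTAMP.fullmatch(stamp).groups())
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


# A single made-up SubRip entry: number, timing line, text.
entry = """\
1
00:01:02,500 --> 00:01:04,000
Hello there!
"""

number, timing, *text = entry.splitlines()
start, end = (part.strip() for part in timing.split("-->"))
print(number, to_milliseconds(start), to_milliseconds(end), text)
# prints: 1 62500 64000 ['Hello there!']
```

Because the whole file is plain text like this, it can be corrected in any text editor, which is not possible with the image-based VobSub format.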
Follow these steps to convert subtitles from VobSub to SubRip format:
Step 1: Start AviDemux and select "Tools / OCR (VobSub -> srt)" from the menu. This will pop up the dialog shown in Figure 1.
Figure 1: Initial OCR dialog of AviDemux
Step 2: Click the "Select .idx file:" button, which will pop up another dialog. In that dialog, click the "Select .idx" button and select the file with the IDX extension of the source VobSub subtitle (not the one with the SUB extension).
Step 3: There might be more than one subtitle stream inside the VobSub file. Use the "Select Language:" combo box to choose the language of the subtitle stream you would like to convert. See Figure 2 for a selected file and language. Click the OK button.
Figure 2: File selection dialog for OCR in AviDemux
Note: Due to copyright issues the filenames used in the example figures are rendered unreadable.
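If you want to check which languages a VobSub file offers before opening it in AviDemux, the IDX file can be inspected directly, since it is a plain text file. The sketch below assumes that each subtitle stream is declared on a line of the form "id: en, index: 0"; the sample content and the function name list_streams are invented for illustration.

```python
import re

# Assumed shape of a stream declaration in a VobSub .idx file,
# for example "id: en, index: 0".
STREAM_LINE = re.compile(r"^id:\s*(?P<lang>[^,\s]+)\s*,\s*index:\s*(?P<index>\d+)")


def list_streams(idx_text):
    """Return (language code, stream index) pairs found in .idx text."""
    streams = []
    for line in idx_text.splitlines():
        match = STREAM_LINE.match(line.strip())
        if match:
            streams.append((match.group("lang"), int(match.group("index"))))
    return streams


# Made-up sample content, for illustration only.
sample_idx = """\
# VobSub index file, v7 (do not modify this line!)
size: 720x576
id: en, index: 0
timestamp: 00:00:05:120, filepos: 000000000
id: de, index: 1
timestamp: 00:00:05:120, filepos: 000001000
"""

print(list_streams(sample_idx))
# prints: [('en', 0), ('de', 1)]
```

For a real file you would read its contents first and pass them to list_streams; the languages reported should correspond to the choices offered in the "Select Language:" combo box.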
Step 4: The OCR procedure takes place in the dialog that appears next, which is titled "Mini OCR". You will see black images with white text on them. These are the images that are extracted from the VobSub file and will be scanned for characters to produce the text-based SubRip subtitle. See Figure 3 for the interface of this dialog.
Figure 3: Performing the OCR process in AviDemux
Each detected character is displayed in the "Bitmap" section and you are asked to enter the character that you see in the extracted portion of the current image. The image may contain more than one character. In that case, enter every character that you recognise in the image. If you are not sure which character is displayed in the bitmap, check the text already recognised, which is shown to the right of it. The "Bitmap" contains the next character for the "Current Glyph Text:".
Each character has to be entered only once and will be recognised automatically the next time it is encountered in any of the images of the current VobSub subtitle. Therefore, the further you get in the OCR procedure, the fewer characters you will have to enter yourself. This makes the conversion from image to text very fast, but on the other hand it can lead to falsely recognised characters, which you will notice as spelling errors in the resulting SubRip file.
Step 5: After the OCR process has finished, you will be offered the option to save the recognised characters into a GlyphSet file (Figure 4). Such a file can be used to speed up future OCR procedures by supplying information about already identified characters. The dialog in Figure 1 contains an option to use such a file when opening the VobSub IDX file. Once again, keep in mind that falsely identified characters in a GlyphSet will result in spelling errors for each occurrence of the affected characters.
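The "enter once, reuse afterwards" behaviour can be pictured as a lookup table from glyph bitmaps to text. The following is only a toy sketch of that idea and not AviDemux's actual implementation; the function recognise, the tiny 3x3 bitmaps and the simulated user are all invented for illustration.

```python
def recognise(glyph_bitmaps, glyph_set, ask_user):
    """Turn a sequence of glyph bitmaps into text.

    glyph_set maps a bitmap (here a tuple of row strings) to the text
    that was entered for it. ask_user is called only for bitmaps that
    are not in glyph_set yet.
    """
    text = []
    for bitmap in glyph_bitmaps:
        if bitmap not in glyph_set:
            glyph_set[bitmap] = ask_user(bitmap)
        text.append(glyph_set[bitmap])
    return "".join(text)


# Toy bitmaps standing in for glyphs cut out of a subtitle image.
GLYPH_H = ("#.#", "###", "#.#")
GLYPH_I = ("###", ".#.", "###")

prompts = []  # records every time the "user" had to type something


def fake_user(bitmap):
    """Simulated user who always answers correctly."""
    prompts.append(bitmap)
    return {GLYPH_H: "H", GLYPH_I: "I"}[bitmap]


glyph_set = {}
first = recognise([GLYPH_H, GLYPH_I], glyph_set, fake_user)
second = recognise([GLYPH_H, GLYPH_I, GLYPH_H, GLYPH_I], glyph_set, fake_user)
print(first, second, len(prompts))
# prints: HI HIHI 2
```

The second call needs no input at all because both glyphs are already known. The same mechanism explains the drawback: a single wrong answer stored in the table is repeated for every later occurrence of that bitmap.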
Figure 4: Dialog for saving the GlyphSet
Step 6: Now that your SubRip text file has been created, open it and look for spelling errors. Figure 5 shows the result of an OCR process, and you can see that the file has several errors in it. For example, each text line has a leading space, which is unnecessary and should be removed to save disk space and ensure correct display in your media player. Entries 32, 33 and 34 could be merged into a single entry, as their timing is continuous and they have the same content. This, however, requires every following subtitle entry to be renumbered. The entry numbered 35 starts with a lowercase "l" instead of a capital "I". This kind of character recognition error is very common and should also be fixed in this final step.
Figure 5: Example result of the OCR process
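The mechanical parts of this clean-up (removing leading spaces, merging repeated entries and renumbering) can be scripted. The following is a minimal sketch, assuming a well-formed SubRip file in which each entry consists of a number line, a timing line and at least one text line; the function names and the sample subtitles are invented for illustration and are not the content of Figure 5.

```python
import re


def parse_srt(text):
    """Split SubRip text into (start, end, body) tuples.

    Leading and trailing spaces are stripped from every text line.
    Blocks without a timing line or without text are skipped.
    """
    entries = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        if len(lines) < 3 or "-->" not in lines[1]:
            continue
        start, end = (part.strip() for part in lines[1].split("-->"))
        body = "\n".join(line.strip() for line in lines[2:])
        entries.append((start, end, body))
    return entries


def merge_repeats(entries):
    """Merge neighbours that have identical text and back-to-back timing."""
    merged = []
    for start, end, body in entries:
        if merged and merged[-1][2] == body and merged[-1][1] == start:
            merged[-1] = (merged[-1][0], end, body)
        else:
            merged.append((start, end, body))
    return merged


def format_srt(entries):
    """Write entries back out as SubRip text, numbered from 1."""
    blocks = []
    for number, (start, end, body) in enumerate(entries, start=1):
        blocks.append(f"{number}\n{start} --> {end}\n{body}\n")
    return "\n".join(blocks)


# Made-up sample with leading spaces and three repeated entries.
sample = """\
1
00:02:10,000 --> 00:02:11,000
 Wait for me!

2
00:02:11,000 --> 00:02:12,000
 Wait for me!

3
00:02:12,000 --> 00:02:13,500
 Wait for me!

4
00:02:15,000 --> 00:02:17,000
 l am coming.
"""

cleaned = format_srt(merge_repeats(parse_srt(sample)))
print(cleaned)
```

For the sample above, the three repeated entries become one entry running from 00:02:10,000 to 00:02:13,500, and the last entry is renumbered to 2. The misrecognised "l" is left untouched on purpose: spelling errors still have to be corrected by hand.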
Best practice is to read through the subtitle entries and check them one by one. This is the only way to ensure a perfect result, which should be everyone's goal when converting between subtitle formats.
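A script cannot replace this manual check, but it can point you at likely candidates. The sketch below flags lines containing a lone lowercase "l", which in English subtitles is almost always a misrecognised capital "I". It assumes English text; the pattern, the function name suspicious_lines and the sample are invented for illustration.

```python
import re

# A lowercase "l" standing on its own, as in "l am" or "l'll".
LONE_L = re.compile(r"\bl\b")


def suspicious_lines(srt_text):
    """Return (line number, line) pairs that contain a lone lowercase 'l'."""
    hits = []
    for number, line in enumerate(srt_text.splitlines(), start=1):
        if LONE_L.search(line):
            hits.append((number, line))
    return hits


# Made-up sample, for illustration only.
sample = """\
1
00:02:15,000 --> 00:02:17,000
l am coming.

2
00:02:18,000 --> 00:02:20,000
l'll be late.

3
00:02:21,000 --> 00:02:23,000
All is well.
"""

for number, line in suspicious_lines(sample):
    print(number, line)
# prints:
# 3 l am coming.
# 7 l'll be late.
```

The reported numbers are line numbers in the file, not subtitle entry numbers, so you can jump straight to them in a text editor.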