[GH-ISSUE #695] Add video subtitles/transcript to sonic full-text search index #3456

Open
opened 2026-03-14 23:01:47 +03:00 by kerem · 0 comments
Owner

Originally created by @pirate on GitHub (Apr 8, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/695

We currently download video subtitles automatically whenever they're available (using youtubedl in the media extractor), but we only submit the page content (description, comments, etc.) to the full-text search index.

We should add support for converting .srt subtitle files to plain text and then feeding that text into the sonic full-text search index.

TODO:

  • add a post-processing step to the media extractor to convert any subtitles to a plain-text transcript.txt file (see below)
  • add the transcript.txt to the list of index_texts returned in the ArchiveResult to be indexed by sonic
import re

def main():
    # Read the subtitle file line by line
    with open("sample.srt", "r", encoding="utf-8") as file:
        lines = file.readlines()

    text = ''
    for line in lines:
        # Skip cue numbers, timestamp lines, and blank lines; keep only the spoken text
        if re.search('^[0-9]+$', line) is None and re.search('^[0-9]{2}:[0-9]{2}:[0-9]{2}', line) is None and re.search('^$', line) is None:
            text += ' ' + line.rstrip('\n')
    text = text.lstrip()
    print(text)

if __name__ == '__main__':
    main()
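As a rough sketch of how the first TODO item might look as a reusable post-processing step, the same filtering can be wrapped in a function that takes SRT text and a helper that writes transcript.txt. The names `srt_to_text` and `write_transcript` are hypothetical and not part of ArchiveBox. Stripping inline tags and collapsing repeated lines are assumptions about what typical downloaded subtitles contain, not requirements stated in this issue.

```python
import re
from pathlib import Path

# Matches SRT cue timing lines such as "00:00:01,000 --> 00:00:04,000".
TIMING_RE = re.compile(r'^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->')
# Matches simple inline markup such as <i>...</i> (assumed to be unwanted in the index).
TAG_RE = re.compile(r'</?[a-zA-Z][^>]*>')


def srt_to_text(srt: str) -> str:
    """Return the spoken text of an SRT document as one plain-text string."""
    parts = []
    for line in srt.splitlines():
        line = line.lstrip('\ufeff').strip()
        # Skip blank lines, cue numbers, and timing lines.
        # Limitation: a caption consisting only of digits is also dropped.
        if not line or line.isdigit() or TIMING_RE.match(line):
            continue
        line = TAG_RE.sub('', line).strip()
        # Skip a line that exactly repeats the previous kept line.
        if line and (not parts or parts[-1] != line):
            parts.append(line)
    return ' '.join(parts)


def write_transcript(srt_path, out_path='transcript.txt'):
    """Hypothetical helper: convert one .srt file and write the transcript."""
    srt = Path(srt_path).read_text(encoding='utf-8', errors='replace')
    out = Path(out_path)
    out.write_text(srt_to_text(srt) + '\n', encoding='utf-8')
    return out
```

The path returned by `write_transcript` is what the second TODO item would then add to `index_texts`. How that list is populated depends on the extractor code and is not shown here.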