Video: Guidelines for Editing Auto-Generated Captions

Captions are the text of auditory information in a video, including words and non-speech sounds. Closed captions do not appear under a video by default but can be turned on by the viewer.

At the University of Minnesota, all uploaded videos should include captions that are:

  • accurate
  • complete
  • well-placed 

After uploading videos to a platform with automatically generated captions (e.g., YouTube, Kaltura, VoiceThread), follow this guide to proofread and edit your captions.

Editing Closed Captions Guidelines

Fix mistakes in spelling, add missing words, and fix punctuation

Automatically generated captions often miss some speech sounds and misinterpret what the speaker is saying.

  • Ensure that all spoken words are correct and accurate.
  • Do not paraphrase or censor what the speaker is saying.

Adjust captions to align with the audio

  • Ensure each block of caption text is on-screen for between 1.5 and 6 seconds.
  • Generally, use no more than two lines in each block of text.
    • If a speaker identifier is needed, it can go on a third line.
  • Consider how phrases break across lines.
    • Make caption lines short and easy to read.
    • Aim for five to six words per line, or about 32 characters per line.
    • Break long caption lines into two shorter lines. Consider:
      • Individual word length: some words are longer than others.
      • Sentence cadence: make sure the sentence break is at a logical point where speech normally pauses.

Inappropriate (too long): 
She said I could order popcorn at the movie theatre.

Inappropriate (unnatural break):
She said I could order
popcorn at the movie theatre.

Appropriate:
She said I could order popcorn
at the movie theatre.
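
The timing and line-length guidelines above can also be checked automatically before a manual review. The following Python sketch is only an illustration: the function names check_block and suggest_break, and the idea of passing start and end times in seconds, are assumptions made for this example and are not part of any captioning platform. The suggested break is a simple greedy wrap at 32 characters, so a person still needs to confirm that each break falls at a natural pause.

```python
import textwrap

# Limits taken from the guidelines above.
MIN_SECONDS = 1.5   # shortest time a block should stay on screen
MAX_SECONDS = 6.0   # longest time a block should stay on screen
MAX_LINES = 2       # general limit; a speaker identifier may need a third line
MAX_CHARS = 32      # approximate limit per line


def check_block(start, end, text):
    """Return a list of guideline problems for one caption block.

    start and end are times in seconds (hypothetical inputs for this sketch);
    text is the caption text, with lines separated by newline characters.
    """
    problems = []
    duration = end - start
    if duration < MIN_SECONDS:
        problems.append(f"on screen {duration:.1f}s, under {MIN_SECONDS}s")
    elif duration > MAX_SECONDS:
        problems.append(f"on screen {duration:.1f}s, over {MAX_SECONDS}s")

    lines = text.splitlines()
    if len(lines) > MAX_LINES:
        problems.append(f"{len(lines)} lines, more than {MAX_LINES}")
    for line in lines:
        if len(line) > MAX_CHARS:
            problems.append(f"line has {len(line)} characters: {line!r}")
    return problems


def suggest_break(text):
    """Greedily wrap text at MAX_CHARS; cadence still needs a human check."""
    return textwrap.wrap(text, width=MAX_CHARS)


sentence = "She said I could order popcorn at the movie theatre."
print(check_block(0.0, 4.0, sentence))   # reports one over-long line
print(suggest_break(sentence))
# ['She said I could order popcorn', 'at the movie theatre.']
```

For this sentence the greedy wrap happens to match the appropriate example above, but that will not always be the case, so treat the output as a starting point for editing.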

Include speakers, non-speech sounds, and other auditory information


  • If there is more than one speaker, add speaker identifiers.
  • If it is unclear who is speaking, add speaker identifiers.
  • If the speaker's name is known, label it in parentheses.
  • If there is back-and-forth conversation between speakers, give each speaker their own block of text.
        Put the pumpkin on the table.

        Can you hand me the carving knife?

  • If names are unknown, use generic labels.
        Turn to page 394.
  • If it's clear who's speaking on screen, they do not need to be identified.
    • Use an angle bracket to identify the speaker.
      >Put the pumpkin on the table.
    • Use a double angle bracket if the speaker changes.
      >>Can you hand me the carving knife?


  • Put non-speech sounds in brackets on their own line.
        [Applause with cheering]
  • Include a description of the sound's source.
        [Plane passing overhead]
  • Omit the sound's source if the source is visible on-screen.


  • Use objective words to describe music.
        [intense percussive music]
  • Caption lyrics verbatim.
  • Caption the performer and song title, if known.
        [Prince singing "Sometimes It Snows in April"]

Review the video with captions turned on

Check the quality and accuracy of caption lines and non-speech sounds in your video.
