Video materials require learners to manage concurrent verbal and pictorial processing. To facilitate second language (L2) learners’ video comprehension, the amount of presented information should thus be compatible with learners’ finite cognitive capacity. In light of this, the current study explored whether a reduction in multimodal comprehension scaffolding would lead to greater L2 comprehension gains when viewing captioned videos and, if so, which type of reduction (verbal vs. nonverbal) is more beneficial. A total of 62 L2 learners of English were randomly assigned to one of four viewing conditions: (1) full captions + animation, (2) full captions + static key frames, (3) partial captions + animation, and (4) partial captions + static key frames. They then completed a comprehension test and a cognitive load questionnaire. The results showed that after viewing the video with reduced nonverbal visual information (static key frames), the participants performed well across all aspects of comprehension. However, their local comprehension (extraction of details) was particularly enhanced after viewing a key-framed video with full captions. Notably, this gain in local comprehension was less evident after viewing animated video content with full captions. The qualitative data also revealed that although animation may provide a perceptually stimulating viewing experience, its transient nature most likely taxed the participants’ attention, thus impairing their comprehension outcomes. These findings underscore the benefit of a reduction in nonverbal input and the interplay between verbal and nonverbal input. The findings are discussed in relation to the use of verbal and nonverbal input for different pedagogical purposes.