This study investigated how semantically relevant auditory information affects the reading of subtitles, and whether such effects are modulated by concurrent video content. Thirty-four native Chinese speakers with English as their second language watched videos with English subtitles under six conditions formed by crossing the type of audio (Chinese/L1 audio vs. English/L2 audio vs. no audio) with the presence or absence of video content. Global eye-movement analyses showed that participants tended to rely less on the subtitles with Chinese or English audio than with no audio, and these effects of audio were more pronounced when video content was present. Lexical processing of the subtitles was not modulated by the audio. However, Chinese audio, which presumably obviated the need to read the subtitles, resulted in more superficial post-lexical processing of the subtitles relative to either English audio or no audio. In contrast, English audio accentuated post-lexical processing of the subtitles compared with Chinese audio or no audio, indicating that participants might have used the English audio to support subtitle reading (or vice versa) and thus engaged in deeper processing of the subtitles. These findings suggest that, in multimodal reading situations, eye movements are controlled not only by the processing difficulty associated with properties of words (e.g., their frequency and length) but also by metacognitive strategies involved in monitoring comprehension and its online modulation by different information sources.