When they process sentences, language comprehenders activate perceptual and motor representations of described scenes. On the “immersed experiencer” account, comprehenders engage motor and perceptual systems to create experiences that someone participating in the described scene would have. We tested two predictions of this view. First, the distance of mentioned objects from the protagonist of a described scene should produce perceptual correlates in mental simulations. And second, mental simulation of perceptual features should be multimodal, like actual perception of such features. In Experiment 1, we found that language about objects at different distances modulated the size of visually simulated objects. In Experiment 2, we found a similar effect for volume in the auditory modality. These experiments lend support to the view that language-driven mental simulation encodes experiencer-specific spatial details. The fact that we obtained similar simulation effects for two different modalities—audition and vision—confirms the multimodal nature of mental simulations during language understanding.