Previous research has shown that language-specific features play a guiding role in how children develop the expression of events in speech and gesture. This study adopts a multimodal approach and examines Mandarin Chinese, a language that features context use and verb serialization. Forty children (four to seven years old) and ten adults were asked to describe fourteen video stimuli depicting different types of causal events involving changes of location or state. Participants’ speech was segmented into clauses, and co-occurring gestures were analyzed in relation to causation. The results show that the older the children, the more they used contextual clauses that contribute meaning to event descriptions. It was not until the age of six that children used adult-like structures, namely single gestures that represent causing actions and are aligned with verb serializations in single clauses. We discuss the implications of these findings for the guiding role of language specificity in multimodal language development.