The rapid advancement of large language models (LLMs) has enabled their integration into a wide range of scientific disciplines. This article introduces HydroLLM-Benchmark, a comprehensive benchmark dataset designed to evaluate recent LLMs in the hydrology domain. Drawing on a collection of research articles and hydrology textbooks, we generated hydrology-specific questions in several formats, including true/false, multiple-choice, open-ended, and fill-in-the-blank. These questions provide a robust foundation for evaluating how well state-of-the-art LLMs, including GPT-4o-mini, Llama3:8B, and Llama3.1:70B, address domain-specific queries. Our evaluation framework employs accuracy metrics for objective question types and cosine-similarity measures for subjective responses, ensuring a thorough assessment of the models’ proficiency in understanding and responding to hydrological content. The results underscore both the capabilities and the limitations of artificial intelligence (AI)-driven tools within this specialized field, providing valuable insights for future research and the development of educational resources. By introducing HydroLLM-Benchmark, this study contributes a vital resource to the growing body of work on domain-specific AI applications and demonstrates the potential of LLMs to support complex, field-specific tasks in hydrology.
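To make the two scoring paths concrete, the sketch below illustrates exact-match accuracy for objective items and embedding cosine similarity for open-ended answers. It is a minimal sketch only: the embedding model (`all-MiniLM-L6-v2` via the sentence-transformers library) and the case-insensitive exact-match rule are illustrative assumptions, not the configuration used in this study.

```python
# Minimal sketch of the two scoring paths described above.
# Assumptions (not from the paper): sentence-transformers with the
# "all-MiniLM-L6-v2" embedding model; case-insensitive exact matching
# for objective items.
import numpy as np
from sentence_transformers import SentenceTransformer

_embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model


def objective_accuracy(predictions, gold_answers):
    """Fraction of true/false, multiple-choice, or fill-in-the-blank
    items answered exactly correctly (case-insensitive)."""
    correct = sum(
        p.strip().lower() == g.strip().lower()
        for p, g in zip(predictions, gold_answers)
    )
    return correct / len(gold_answers)


def answer_similarity(model_answer, reference_answer):
    """Cosine similarity between embeddings of a model's open-ended
    answer and the reference answer."""
    vecs = _embedder.encode([model_answer, reference_answer])
    a, b = vecs[0], vecs[1]
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Under this setup, objective question types are reported as a single accuracy score, while each open-ended response receives a similarity score in [-1, 1] against its reference answer.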