Introduction
Canola (Brassica napus L.) has gained global significance as a valuable oilseed crop that is widely cultivated because of its high-quality oil and protein-rich by-products (Neik et al. 2020). Canola is the second major oilseed crop globally after soybean, with increasing world demand and production (Tu et al. 2024). Its versatility as a source of edible oil, animal feed, and biofuel contributes to canola’s pivotal role in food security and renewable energy sectors (Tileuberdi et al. 2022). Canola exports have increased in recent decades and are expected to expand by 40% by 2025 (Tiwari et al. 2020).
Since 1996, canola production in Iran has grown consistently in the international oilseed marketplace (Spörl et al. 2022). The increasing demand for sustainable agriculture highlights the necessity of efficient canola cultivation practices, promoting its resilience to environmental stressors and optimizing yield (Majidian et al. 2024). A notable challenge in canola production is weed management, because weeds can significantly reduce crop yield and quality by competing for resources such as nutrients, water, and sunlight (Hassan et al. 2023). This threat not only affects grain production and yield but may also compromise the quality of canola oil, underscoring the urgent need for the agricultural sector to explore innovative practices and technologies to mitigate this challenge (Walia and Kumar 2023). A critical component in the formulation of effective management plans is a comprehensive understanding of weed flora and its geographic distribution. Such knowledge facilitates the application of herbicides and development of other appropriate management techniques (Krähmer et al. 2020; Nath et al. 2024).
Several weed species have been recognized for their significant effects on canola yield and cultivation. The management and control of these weeds are crucial for maintaining the productivity and profitability of canola crops (Asaduzzaman et al. 2020). Wild oat (Avena fatua L.), belonging to the Gramineae family, is one of the most dominant weeds in canola and is currently found in approximately 50 countries globally (Matsuhashi et al. 2021). Some studies have shown that A. fatua can significantly reduce crop yield, highlighting its severe impact on agricultural productivity (Tang et al. 2024). Moreover, A. fatua presents a significant challenge because of its substantial resistance to herbicides, which increases control efforts and poses an ongoing threat to canola cultivation (Onkokesung et al. 2022).
Geographic information systems (GIS) are among the most effective and precise tools for producing weed distribution maps. These systems leverage advanced technologies to accurately identify areas infested by weeds, thereby facilitating targeted management approaches (Mohan and Giridhar 2022). Detailed species distribution and habitat suitability modeling (HSM) enabled by GIS technology plays a critical role in environmental management by providing in-depth assessments of the interactions between different species and their environments.
In recent years, machine learning algorithms have emerged as powerful tools for modeling the habitat suitability of weeds based on environmental variables (Rather et al. 2020). By analyzing data on soil composition, climate conditions, and other ecological factors, machine learning models can predict the likelihood of weed proliferation in specific areas (Bi et al. 2024). These insights can aid in preemptive weed management strategies tailored to environmental conditions, thereby enhancing the precision of crop management practices (Akhter et al. 2020).
The integration of machine learning techniques (MLTs) into HSM represents a cutting-edge approach that enhances the prediction and understanding of geographic distribution (Beery et al. 2021). By utilizing the power of algorithms and computational models, machine learning can analyze complex environmental and biological data to identify patterns and relationships that influence the presence or absence of species across different fields (Jeon et al. 2023). The use of habitat suitability as a measure for assessing the risk of weed infestation has increased globally (Hartl et al. 2024). HSM has been increasingly employed to identify areas that are potentially vulnerable to various weed species over extensive geographic areas (Qazi et al. 2023; Wang et al. 2023). Schartel et al. (2021) determined the habitat suitability of eight exotic species that were invasive in Baja California and assessed their distribution and invasion risk. Wan and Wang (2019) evaluated the compatibility of habitats for 10 dangerous weed species and proposed a strategy for mitigating the risks posed by these weed species by modifying prevention and control methods.
Several studies have used MLTs, such as support vector machine (SVM), random forest (RF), boosted regression trees (BRTs), classification and regression trees (CARTs), generalized additive models, and generalized linear models to predict species distribution and for HSM (Gholami et al. 2021; Mondal and Bhat 2021). RF is an ensemble learning method that uses multiple decision trees to improve prediction accuracy and is well suited to assessing habitat suitability by evaluating diverse environmental variables (Renjana et al. 2024). Environmental research widely employs the SVM framework, rooted in statistical learning theory. Although SVM demonstrates significant utility, its effectiveness in modeling habitats that favor the growth of specific plant species remains an area of ongoing investigation (Tazikeh et al. 2022). The BRT model combines the principles of boosting, an MLT, with regression trees to create a powerful predictive model (Salditt et al. 2023). Models such as RF, SVM, and BRT have gained prominence in predicting natural events and hazards because of their simplicity and efficacy (Berhane et al. 2021; Hasan et al. 2024; Hasannejadasl et al. 2023). However, the use of these models to assess habitat suitability for weed species in canola fields remains relatively underexplored in the scientific literature.
This study set forth two primary aims to address the key challenges in canola farming within the Fars Province of Iran. First, we sought to identify and document the predominant weed species affecting canola cultivation across the region, thereby contributing essential data to local agronomic research. Second, we implemented and compared three advanced modeling approaches—RF, SVM, and BRT—to predict habitat suitability for the identified dominant weed species. The assessment of influential environmental factors facilitated by the Boruta algorithm further enhances the model interpretability and ecological insight. Additionally, the selection of the optimal model based on the receiver operating characteristic (ROC) curve and area under the curve (AUC) maximizes predictive accuracy, pioneering the application of these MLTs in weed habitat modeling. These aims collectively address a significant research gap, offering foundational knowledge that can improve precision in weed management strategies, reduce yield losses, and promote sustainable canola production.
Materials and Methods
Study Area
This investigation was performed in the southwestern region of the Fars Province, Iran, in 2023 (Figure 1). The research region is situated between 27.26°N and 30.41°N and between 51.49°E and 54.48°E. According to the FAO (2024), Iran has expanded its canola cultivation significantly, reaching a total of around 200,000 ha. This growth is part of the country’s efforts to boost self-sufficiency in oilseed production, with regions like Fars Province playing a crucial role. Geographic analyses based on topographic maps show that Fars Province encompasses both mountainous terrain and plains. The province is also distinguished by its climatic diversity, with the four seasons exerting distinct effects on regional flora. This variation in climate is largely attributed to the varied elevation, ranging from 182 to 3,183 m above sea level. The Fars Province has an average annual rainfall of 315 mm and an average annual temperature of 15 °C (Kheiri et al. 2024). The average slope of the Fars Province is 7°, which is particularly favorable for canola cultivation.

Figure 1. The research region as situated in Iran’s Fars Province (left). Topographic map of research region showing locations for training and validation datasets (right).
Methodology
This research followed a five-stage methodology: (1) data collection; (2) preparation of influential factors; (3) HSM using three models: RF, SVM, and BRT; (4) evaluation of models and selection of the best model; and (5) variable importance analysis, as illustrated in Figure 2.

Figure 2. An Avena fatua habitat suitability mapping flowchart. AUC, area under the curve; BRT, boosted regression tree; DEM, digital elevation model; EC, electrical conductivity; RF, random forest; ROC, receiver operating characteristic; SVM, support vector machine.
Data Collection and Sampling
In the present study, sampling was conducted in 114 canola fields in 28 different counties of the Fars Province, based on the cultivation area of this crop. Some studies have demonstrated that the presence of weeds at the 6- to 8-leaf growth stage significantly reduces canola yield (Bečka et al. 2021). Chao et al. (2023) stated that weed competition during the critical period of autumn canola growth can reduce plant performance by more than 10%, so canola should be kept weed-free during this period. Therefore, sampling was carried out during the winter season in 2023, when canola is in the 6- to 8-leaf growth stage. Sampling was conducted using a 0.25-m² quadrat along a W-shaped path in each field, based on the cultivation area (Fried et al. 2022), in each county of Fars Province (Table 1; Figure 3).
Table 1. Fields surveyed in each county of Fars Province.


Figure 3. Spatial distribution of canola and weed sampling.
In addition to weed sampling, the geographic coordinates of each farm (latitude and longitude) were determined using a GPS device. After weeds were collected from the canola fields, they were identified to genus and species and counted. Soil samples from each point were transferred to the laboratory to determine the chemical and physical properties of the soil in each canola field. Based on Equations 1–5, the frequency (%), uniformity (%), mean field density (plants/m²), and abundance index of different species (Thomas 1985) were evaluated in Fars Province:

where n is the number of fields visited, Yi is the presence or absence of species k in field i, and Fk is the frequency of species k, that is, the percentage of surveyed fields in which it occurred. The following formula was used to obtain the uniformity index for species k (Uk):

where Xij indicates the presence or absence of species k in the ith quadrat and jth field, with n fields and m quadrats.

In Equation 3, m is the number of quadrats thrown and Zj is the number of plants in quadrat j. Dki is the density (number of plants per square meter) of species k in field i.

In Equation 4, n is the number of fields visited, Dki is the density (number of plants per square meter) of species k in field i, and MDSK is the mean density of species k.

Finally, Equation 5 was used to determine the abundance index of the weeds, which served as the measure of dominance. Using this equation, the frequency (Fk), field uniformity (Uk), and mean density of species k (MDSK) were combined to determine the predominant weed species.
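The five indices described above can be written out as follows. This is a sketch built from the symbol definitions in the text and the usual forms associated with Thomas (1985), not necessarily the exact expressions used in the study. In particular, dividing by the sampled area A in Equation 3 and combining the three terms as a plain sum in Equation 5 are assumptions, as is the symbol AI_k; MDSK is written here as MDS_k.

```latex
% Requires amsmath. Indices follow the text: X_{ij} uses i for the quadrat
% and j for the field; Z_j uses j for the quadrat.
\begin{align}
F_k &= \frac{\sum_{i=1}^{n} Y_i}{n} \times 100 \tag{1}\\
U_k &= \frac{\sum_{j=1}^{n}\sum_{i=1}^{m} X_{ij}}{m\,n} \times 100 \tag{2}\\
% Assumption: counts are divided by the total sampled area, m quadrats of A each.
D_{ki} &= \frac{\sum_{j=1}^{m} Z_j}{m\,A}, \qquad A = 0.25~\mathrm{m^2} \tag{3}\\
\mathrm{MDS}_k &= \frac{\sum_{i=1}^{n} D_{ki}}{n} \tag{4}\\
% Assumption: the three quantities are combined as a simple sum.
\mathrm{AI}_k &= F_k + U_k + \mathrm{MDS}_k \tag{5}
\end{align}
```

As a quick check of Equation 1 under these definitions, a species found in 57 of 100 visited fields would have F_k = 57%.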
Important Factors
In general, for HSM, it is necessary to identify the factors that affect weed growth and development. For example, some studies have demonstrated that environmental factors, including topography, soil chemical and physical properties, road development, temperature, and rainfall, can affect weed distribution (Jehangir et al. 2024). Twelve layers considered to affect the growth and development of weed species were used as influencing factors: elevation, slope degree, slope aspect, plan curvature, distance from rivers, mean annual precipitation, mean annual temperature, pH, electrical conductivity (EC), and soil clay, silt, and sand percentages. These 12 layers were then converted to 30-m resolution for subsequent analyses (Kabiri et al. 2022) in ArcGIS v. 10.8.1 (https://www.esri.com/en-us/arcgis/products/arcgis-desktop/overview). The annual mean temperature and rainfall data were gathered from 29 meteorological organizations in the counties of Fars Province. The data were then converted to a point map using ArcGIS v. 10.8.1 software. The point map was interpolated across the study area into temperature and rainfall maps at 30-m resolution with the inverse distance weighting (IDW) algorithm (Figure 4A and 4B).
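The interpolation step can be illustrated with a minimal inverse distance weighting sketch in Python. It is not the ArcGIS implementation used in the study: the function name idw, the exponent of 2 (a common default; the study does not state the value used), and the three station records are all assumptions made for illustration.

```python
import math


def idw(stations, x, y, power=2.0):
    """Inverse distance weighted estimate at location (x, y).

    stations: list of (sx, sy, value) tuples, for example station
    coordinates with mean annual temperature.
    power: distance exponent (assumed; not reported in the text).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for sx, sy, value in stations:
        distance = math.hypot(x - sx, y - sy)
        if distance == 0.0:
            # The query point coincides with a station: return its value.
            return value
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


# Hypothetical stations (x, y in km; value in degrees C), illustration only.
stations = [(0.0, 0.0, 14.0), (10.0, 0.0, 18.0), (0.0, 10.0, 16.0)]
midpoint = idw(stations, 5.0, 0.0)
```

Evaluating such a function at the center of every 30-m cell would give a raster comparable in spirit to the temperature and precipitation layers; nearer stations dominate each estimate, and the result always lies between the minimum and maximum station values.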

Figure 4. Important layers, including: (A) mean annual temperature, (B) mean annual precipitation, (C) sand percent, (D) silt percent, (E) clay percent, (F) electrical conductivity (EC), (G) pH, (H) elevation/digital elevation model, (I) slope degree, (J) slope aspect, (K) plan curvature, and (L) distance from rivers.


In total, 189 soil samples were collected at a depth of 30 cm. A hydrometer was used to determine the physical characteristics of the soil, such as the amounts of sand, silt, and clay (Feng et al. 2024). A pH meter and a conductivity meter were used to measure the pH and EC of the soil, respectively. Sand, silt, clay, pH, and EC layers were also converted into raster maps with 30-m resolution (Figure 4C–G). A digital elevation model (DEM) of Fars Province was used to derive elevation, slope degree, slope aspect, and plan curvature at 30-m resolution (Figure 4H–K). Topographic maps at a scale of 1:25,000 were used to create a raster map of the distance from rivers to assess the impact of rivers on habitat suitability (Figure 4L).
RF
RF is a supervised learning method developed by Breiman (2001) and consists of an ensemble of decision trees used for both classification and regression tasks. The RF model operates by constructing multiple trees during training and outputting the mode of the classes for classification or the mean prediction for regression. This approach, which enhances model robustness and accuracy, is particularly effective for complex data, making it highly suitable for HSM (Talhami et al. 2024).
In this study, the key parameters for RF, such as n_estimators (number of trees in the forest), max_depth (maximum depth of each tree), and min_samples_split (minimum number of samples required to split a node), were optimized. We used grid search cross-validation to tune these parameters, with n_estimators ranging from 100 to 500, max_depth from 10 to 50, and min_samples_split varied to identify the optimal value. The performance of the model was evaluated using accuracy, F1 score, and AUC/ROC metrics, providing both threshold-specific and threshold-independent assessments. The RF model was implemented using the randomForest package in R (https://cran.r-project.org/web/packages/randomForest/index.htm), which facilitates parameter tuning and cross-validation.
SVM
SVM, introduced by Vapnik (1997), is a nonparametric statistical method that does not assume any particular distribution of the dataset. SVM is effective for high-dimensional data with a relatively small number of samples, making it suitable for species distribution modeling (Kumar et al. 2024).
For our SVM model, the key parameters included C (penalty parameter) and the kernel type (linear, polynomial, sigmoid, or radial basis function). The C parameter was tuned over a range of 0.1 to 10 on a log scale to balance the margin and misclassification tolerance, while the kernel was selected based on model performance. Accuracy, F1 score, and AUC/ROC metrics were used for model evaluation, emphasizing precision and recall owing to potential class imbalance. The SVM model was implemented using the e1071 package in R (https://cran.r-project.org/web/packages/e1071/index.html), which provides comprehensive support for parameter optimization and evaluation.
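The search space described above can be enumerated with a short Python sketch. It only builds the candidate grid and does not fit any model. The five log-spaced C values are an assumption (the text gives the range but not the number of steps), the helper name log_grid is hypothetical, and the dictionary keys and kernel labels follow the argument names of e1071's svm function (cost, kernel, with kernel options linear, polynomial, radial, and sigmoid).

```python
import math
from itertools import product


def log_grid(low, high, n):
    """Return n values evenly spaced on a log10 scale from low to high."""
    lo, hi = math.log10(low), math.log10(high)
    return [10 ** (lo + (hi - lo) * i / (n - 1)) for i in range(n)]


# Assumed resolution: five candidate penalties between 0.1 and 10.
C_VALUES = log_grid(0.1, 10.0, 5)

# Kernel labels as named in e1071's svm().
KERNELS = ["linear", "polynomial", "radial", "sigmoid"]

# Every (cost, kernel) combination that a grid search would evaluate.
GRID = [{"cost": c, "kernel": k} for c, k in product(C_VALUES, KERNELS)]
```

Each entry of GRID would then be scored by cross-validation, and the combination with the best validation metric retained.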
BRT
A BRT is an ensemble method that combines the predictions of several weak learners into a stronger overall model (Alnahit et al. 2022). It uses the CART framework to iteratively add trees that correct the errors made by previous ones; its main tuning parameters are learning_rate (the shrinkage applied to each tree's contribution) and n_estimators (the number of trees).
For this study, learning_rate was tuned over a range of 0.01 to 0.1 and n_estimators up to 500 trees. We also tuned max_depth to control tree complexity and prevent overfitting. The model was evaluated using accuracy, F1 score, and AUC/ROC metrics to provide a robust assessment of the predictive accuracy across thresholds.
We implemented BRT using the gbm package in R (https://cran.r-project.org/web/packages/gbm/index.html), which facilitates parameter tuning and cross-validation, including early stopping based on AUC/ROC performance.
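The boosting idea, in which each new tree is fitted to the errors left by the trees before it, can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the gbm implementation: it uses one predictor, depth-1 trees (stumps), and squared-error loss, whereas a presence/absence model in gbm would normally use a Bernoulli loss. The function names and the toy data are assumptions.

```python
def fit_stump(x, residuals):
    """Find the single split on x that best fits the residuals (least squares)."""
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def boost(x, y, n_estimators=50, learning_rate=0.1):
    """Fit stumps sequentially, each one to the current residuals."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_estimators):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        threshold, left_mean, right_mean = fit_stump(x, residuals)
        stumps.append((threshold, left_mean, right_mean))
        # Shrink each stump's correction by the learning rate.
        predictions = [
            pi + learning_rate * (left_mean if xi <= threshold else right_mean)
            for xi, pi in zip(x, predictions)
        ]
    return base, stumps


def predict(base, stumps, xi, learning_rate=0.1):
    """Sum the base value and the shrunken contribution of every stump."""
    return base + learning_rate * sum(
        left if xi <= threshold else right for threshold, left, right in stumps
    )


# Toy data: hypothetical predictor values and presence (1) / absence (0).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 0, 0, 1, 1, 1, 1]
base, stumps = boost(x, y, n_estimators=50, learning_rate=0.1)
```

A smaller learning_rate makes each tree's correction more cautious, which is why it is usually tuned together with the number of trees.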
Boruta Algorithm
A critical component of this research involves evaluating the importance of variables in spatial modeling for habitat suitability to guide optimal management strategies (López-Torres et al. 2023). The Boruta algorithm was chosen for this purpose because it effectively identifies influential variables by leveraging the RF model’s capacity for variable selection (Li et al. 2023). The algorithm operates by iteratively comparing the importance of the actual features with that of shadow features, which are shuffled copies of the original features, thus distinguishing truly important predictors from noise (Xiao et al. 2024).
For the implementation, we used the Boruta package in R (https://cran.r-project.org/web/packages/Boruta/index.html). The key parameters included maxRuns, set to 500 to ensure sufficient iterations for stable results, and doTrace, set to 2 for detailed output during the execution of the algorithm. The maxRuns parameter influences the stability and reliability of the variable importance ranking. Higher values provide more robust assessments by allowing more comparisons across iterations. Additionally, we used a P-value threshold of 0.05 to statistically identify significant variables.
The Boruta algorithm outputs three categories of variables: confirmed, tentative, and rejected (Han et al. 2022). This categorization helps refine the selection process by confirming variables with a statistically meaningful impact on habitat suitability while excluding non-informative features (Wang et al. 2022). The results of the Boruta algorithm provide a clearly ranked list of predictor variables crucial for understanding and managing habitat suitability patterns across different regions (Prasad et al. 2022). The variable importance derived from Boruta was instrumental in identifying which factors were most relevant in weed HSM, thereby guiding targeted management strategies.
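The shadow-feature comparison at the heart of Boruta can be sketched in plain Python. To keep the example self-contained, absolute Pearson correlation with the target stands in for the random forest importance score that Boruta actually uses, and the statistical test that turns hit counts into confirmed, tentative, or rejected labels is omitted. All names and the synthetic data are assumptions.

```python
import random


def abs_corr(a, b):
    """Absolute Pearson correlation (stand-in for RF importance)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return abs(cov) / (var_a * var_b) ** 0.5


def shadow_hits(features, target, runs=50, seed=0):
    """Count how often each feature outscores the best shadow feature."""
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(runs):
        shadow_scores = []
        for values in features.values():
            shadow = values[:]       # copy the real feature
            rng.shuffle(shadow)      # permute it to break any link to the target
            shadow_scores.append(abs_corr(shadow, target))
        best_shadow = max(shadow_scores)
        for name, values in features.items():
            if abs_corr(values, target) > best_shadow:
                hits[name] += 1
    return hits


# Synthetic data: one informative predictor and one pure-noise predictor.
data_rng = random.Random(1)
target = [data_rng.random() for _ in range(200)]
features = {
    "signal": [t + 0.05 * data_rng.random() for t in target],
    "noise": [data_rng.random() for _ in range(200)],
}
hits = shadow_hits(features, target)
```

A feature that beats the best shadow in nearly every run is a candidate for confirmation, while one that wins only about as often as chance would allow is a candidate for rejection.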
Accuracy of Models
In HSM, where the goal is to forecast the presence or absence of a species in various locations, ROC and AUC metrics are essential tools for assessing model performance (Jamali et al. 2024). For this purpose, 70% of the presence data of the dominant weed were used in the modeling process, while the remaining 30% of the data were used for validation and to evaluate the model’s predictive accuracy.
In this study, the 70:30 split between training and validation datasets was selected based on its established utility in predictive modeling and its practical alignment with the dataset size. This ratio is widely used in ecological and machine learning applications as a standard practice (Fielding and Bell 1997), balancing the competing requirements of sufficient data for model training and a reasonable subset for validation. The chosen split limits the risk of undetected overfitting by allowing model performance to be evaluated on an independent dataset.
Given the dataset size, this split is well suited to supporting reliable parameter estimation and assessment of predictive accuracy. Ecological modeling often operates with limited datasets because of challenges such as field collection constraints and environmental variability (Elith et al. 2006). While larger datasets are ideal, the 70:30 split makes effective use of the available data to produce statistically sound results, consistent with studies in similar contexts (Hameed and Alamgir 2022).
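A random 70:30 partition of presence records can be sketched as follows. The function name, the fixed seed, and the 100 placeholder record IDs are assumptions for illustration; the text does not say how the split was randomized.

```python
import random


def split_presence(records, train_fraction=0.7, seed=42):
    """Shuffle the records and split them into training and validation sets."""
    shuffled = records[:]                    # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)    # fixed seed for a reproducible split
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical presence-point IDs.
points = list(range(100))
train, valid = split_presence(points)
```

Fixing the seed keeps the same points in the validation set across model runs, so that RF, SVM, and BRT are compared on identical data.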
The ROC curve and AUC are widely used metrics for assessing the accuracy of prediction models. The ROC curve is a graphical representation of how well a classification model performs: it plots the true-positive rate (TPR), or sensitivity, against the false-positive rate (FPR), or 1 − specificity, across different threshold values (Muschelli 2020). The TPR, represented on the y axis, indicates the proportion of real positives correctly identified by the model, while the FPR, shown on the x axis, represents the proportion of real negatives that are incorrectly classified as positives (Carrington et al. 2022). The AUC summarizes the ROC curve as a single aggregate performance metric across all potential classification thresholds (Saha et al. 2023; Verbakel et al. 2020). The AUC value ranges from 0 to 1 and is classified into five performance categories: 0.5 to 0.6 (poor), 0.6 to 0.7 (moderate), 0.7 to 0.8 (good), 0.8 to 0.9 (very good), and 0.9 to 1.0 (excellent) (Table 2). In this study, the ROC-AUC was used to evaluate the RF, BRT, and SVM models using SPSS software v. 26 (http://www.ibm.com).
Table 2. The receiver operating characteristic (ROC) curve classification (Richardson et al. 2024).
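The construction of the ROC curve and the AUC can be sketched in Python as follows. This is an illustration only (the study used SPSS): the function names, the toy labels and scores, and the handling of values that fall exactly on a class boundary are assumptions.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs obtained by lowering the threshold over all scores."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for l, s in zip(labels, scores) if s >= threshold and l == 1)
        fp = sum(1 for l, s in zip(labels, scores) if s >= threshold and l == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


def auc_class(value):
    """Map an AUC to the five categories quoted in the text.

    Assumption: a value on a boundary goes to the higher class, and
    values below 0.5 are also reported as poor.
    """
    for upper, name in [(0.6, "poor"), (0.7, "moderate"),
                        (0.8, "good"), (0.9, "very good")]:
        if value < upper:
            return name
    return "excellent"


# Hypothetical validation data: presence (1) / absence (0) and model scores.
labels = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
area = auc(roc_points(labels, scores))
```

In this toy case one absence point is scored above one presence point, so 15 of the 16 presence-absence pairs are ranked correctly and the area is 15/16.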

Collinearity Test of Effective Factors
The collinearity test of the influencing factors is a crucial technique in statistical analysis used to diagnose the extent of multicollinearity among independent variables within a regression model (Barman et al. 2024). To quantitatively assess multicollinearity, the variance inflation factor (VIF) and tolerance indices were used. These metrics offer insight into the degree of linear association between one independent variable and the remaining independent variables in the model. A VIF of 5 or above (or, under a more lenient criterion, 10 or above) is generally regarded as indicating a problematic level of multicollinearity, because the variance of the estimated regression coefficient is inflated by that factor as a result of its linear relationship with other variables (Cheng et al. 2022). Tolerance is the proportion of the variance of an independent variable that cannot be accounted for by the other independent variables. Hence, a lower tolerance value indicates a greater overlap of explanatory information among variables, signifying a potential multicollinearity issue. Typically, a tolerance value of less than 0.20 or 0.10 is considered indicative of significant multicollinearity (Negash and Alelgn 2022).
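The two diagnostics are linked through the coefficient of determination obtained when predictor j is regressed on all other predictors. The standard definitions, written with the assumed symbols R_j^2 and Tol_j, show why the paired thresholds quoted above correspond to each other:

```latex
\begin{align}
\mathrm{VIF}_j &= \frac{1}{1 - R_j^{2}}, &
\mathrm{Tol}_j &= 1 - R_j^{2} = \frac{1}{\mathrm{VIF}_j} \\[4pt]
R_j^{2} = 0.80 &\;\Rightarrow\; \mathrm{VIF}_j = 5, & \mathrm{Tol}_j &= 0.20 \\
R_j^{2} = 0.90 &\;\Rightarrow\; \mathrm{VIF}_j = 10, & \mathrm{Tol}_j &= 0.10
\end{align}
```

In other words, a tolerance threshold of 0.20 is the same criterion as a VIF threshold of 5, and 0.10 is the same as 10.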
Results and Discussion
Determining the Dominant Weed
Frequency percentages of genera and species were used to assess the dominant weeds. The initial findings from the sampling process indicated that A. fatua was the most prevalent weed species in the sampled areas. In total, 32 weed species were identified, with A. fatua being the primary dominant species, with a frequency of 58.48% (Table 3). This indicates the importance of A. fatua in terms of its abundance and ecological influence in the studied environments.
Table 3. Frequency (%) of weeds in canola fields.

Multicollinearity Test
Table 4 shows the collinearity between the factors affecting the species distribution modeling of A. fatua in the study area. Based on these findings, the tolerance coefficient was not less than 0.1 and the VIF was not 5 or greater for any of the indices; therefore, there was no collinearity between the indices used. Had multicollinearity been present among the independent variables, parameter estimates and tests of statistical significance would have been compromised (Rovetta 2023). This would have reduced the accuracy of the spatial analysis, especially in RF, BRT, and SVM modeling.
Table 4. Variance inflation factor (VIF).

a DEM, digital elevation model.
MLTs
The final maps of the RF, SVM, and BRT models were divided into four classes to determine the suitability of the A. fatua habitats (Figure 5A–C).

Figure 5. Habitat suitability maps of Avena fatua based on (A) random forest (RF), (B) boosted regression tree (BRT), and (C) support vector machine (SVM).
RF Algorithm
According to the RF model, the low, moderate, high, and very high suitability classes covered 66.56%, 16.35%, 11.71%, and 5.38% of the study area, respectively (Table 5). In addition, the RF model map showed that the northern, northwestern, central, eastern, western, southeastern, and southwestern regions of the study area had the highest habitat suitability for A. fatua, although some central areas had low habitat suitability (Figure 5A). However, the northeast and parts of the center were not affected by this weed invasion.
Table 5. Areas of habitat suitability classes for all applied models.

BRT Algorithm
The habitat suitability map of A. fatua created using BRT showed that the low, moderate, high, and very high suitability classes covered 56.65%, 26.93%, 11.71%, and 4.55% of the study area, respectively (Table 5). The spatial pattern of habitat suitability for A. fatua across counties under the BRT model matched that of the RF model (Figure 5B), indicating that the two models performed similarly in predicting the habitat suitability of this weed.
SVM Algorithm
The SVM model yielded a different classification: the moderate (37.89%) and low (37.89%) classes covered the largest relative areas, followed by the high (19.81%) and very high (6.74%) classes (Table 5). The suitability map of the SVM model showed that parts of the northern, northwestern, and southern study areas had greater habitat suitability for A. fatua (Figure 5C). In this model, small portions of the research area (west, northwest, southeast, east, and north) had the highest habitat suitability, most notably in the east. In addition, counties in the southwest, southeast, and a large portion of the center of the research area had low habitat suitability.
Evaluation of Algorithms
In this study, the models were evaluated using the ROC curve and AUC. According to the ROC curve, all three models were highly accurate, with RF ranking first, followed by BRT and SVM (Figure 6). The AUC values confirmed the accuracy of the RF (0.99), BRT (0.97), and SVM (0.96) models (Table 6). Huang et al. (2021) reported that AUC values of 0.5 to 0.6 are poor, 0.6 to 0.7 moderate, 0.7 to 0.8 good, 0.8 to 0.9 very good, and 0.9 to 1 excellent. Therefore, the RF, BRT, and SVM models all performed excellently in this study.

Figure 6. The receiver operating characteristic (ROC) curve for evaluating algorithms. BRT, boosted regression tree; RF, random forest; SVM, support vector machine.
Table 6. Area under the curve (AUC).

Importance of Variables
In this study, the relevance of the variables was evaluated using the Boruta algorithm to determine the most influential factors in the analysis. The results demonstrated that slope, plan curvature, clay, temperature, and silt had the greatest impact on the modeling of A. fatua habitat suitability (Table 7). Differences in land slope across the terrain may have affected the growth and expansion of A. fatua, because slope has a profound effect on vegetation dispersal patterns. One of its important effects is on soil moisture. For example, south-facing slopes subjected to higher solar irradiance typically exhibit reduced soil moisture levels, constraining plant growth.
Table 7. Examining the significance of variables using the Boruta algorithm.

a DEM, digital elevation model.
Practical Implications and Conclusion
This study highlights the practical applications of machine learning algorithms, including RF, SVM, and BRT, for modeling the habitat suitability of A. fatua in canola fields. Each algorithm brings unique advantages to understanding weed distribution, which is crucial for devising sustainable and site-specific management strategies to mitigate the detrimental effects of A. fatua on crop productivity. By leveraging the strengths of these models, this research provides actionable insights that align with contemporary agricultural goals of improving efficiency while minimizing environmental impacts.
The RF model emerged as the most effective algorithm, achieving the highest AUC (0.99) in predicting habitat suitability. This model was instrumental in identifying key environmental predictors, such as slope, soil texture, and plan curvature, that significantly influence A. fatua distribution. Its embedded feature selection capabilities not only enhanced interpretability but also allowed for the refinement of management practices in heterogeneous agricultural landscapes. Studies by Kang et al. (Reference Kang, Kim and Park2022) and Melash et al. (Reference Melash, Bogale, Migbaru, Chakilu, Percze, Ábrahám and Mengistu2023) further validate the efficacy of RF in handling complex ecological datasets with numerous interacting variables. Additionally, RF’s ensemble approach ensures model stability and robustness to outliers, making it particularly suitable for field-based ecological studies characterized by high variability in environmental conditions.
SVM also demonstrated its utility in analyzing high-dimensional datasets, with an AUC of 0.96. This algorithm excelled in differentiating between habitat suitability classes, providing detailed ecological niche maps that are indispensable for spatially targeted weed management. The ability of SVM to handle complex interactions among environmental variables has been documented in recent works, including those of Akhtar et al. (Reference Akhtar, Tanveer and Arshad2024) and O’Neill et al. (Reference O’Neill, Khalid, Spink and Thorpe2023). These studies emphasize the importance of SVM in addressing challenges posed by diverse agroecological conditions, where precision in habitat differentiation directly impacts the effectiveness of weed control measures.
The BRT model, with an AUC of 0.97, effectively captured nonlinear relationships between A. fatua occurrence and predictor variables. This capacity for addressing nonlinearity is particularly significant in weed science, where ecological interactions are rarely linear. The ensemble-based nature of BRT enhances its prediction precision, a feature corroborated by studies such as those of Montoya-Jiménez et al. (Reference Montoya-Jiménez, Valdez-Lazalde, Ángeles-Perez, De Los Santos-Posadas and Cruz-Cárdenas2022) and Kumari et al. (Reference Kumari, Kotiyal, Singh, Kumar, Kumar, Malik and Singh2024). By integrating BRT into HSM, this study adds to the growing body of evidence supporting its applicability in managing invasive species in agricultural systems.
Although the 70:30 training–validation split provides an efficient framework for ecological modeling, the dataset size remains a potential limitation of this study (Garcés et al. Reference Garcés, Baumeister, Mason, Chatham, Holiga, Dukart, Jones, Banaschewski, Baron-Cohen, Bölte, Buitelaar, Durston, Oranje, Persico and Beckmann2022). Smaller datasets inherently constrain the ability to capture rare patterns and subtle environmental interactions, which could impact model generalizability (Yu et al. Reference Yu, Sun, Chen, Reynolds, Chaudhary and Batmanghelich2024). However, this study operates within the boundaries of a case study approach, wherein the primary goal is to explore and demonstrate a method’s applicability rather than achieve universal generalizability. To address this limitation, the dataset size and split were carefully chosen to balance robustness in model training and reliable validation. Previous studies have demonstrated that even smaller datasets can yield valuable insights when the modeling methodology is rigorous (Wisz et al. Reference Wisz, Hijmans, Li, Peterson, Graham and Guisan2008). Additionally, the model’s performance metrics, assessed using cross-validation, support the inference that the chosen split is sufficient for the study’s aims. Future research could address this limitation by expanding the dataset through additional sampling or leveraging synthetic data-generation techniques to augment the dataset size. Nevertheless, for a case study framework, this approach aligns well with established methodologies, and the results provide meaningful insights into the ecological processes under investigation.
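A 70:30 training-validation split of the kind mentioned above can be sketched as follows. The sketch stratifies by class so that presence and absence records keep the same proportions in both subsets; whether the study stratified its split is not stated, so this is an assumption, as are the function name, the seed, and the toy labels.

```python
import random
from typing import List, Sequence, Tuple


def stratified_split(
    labels: Sequence[int],
    train_frac: float = 0.7,
    seed: int = 42,
) -> Tuple[List[int], List[int]]:
    """Return (training indices, validation indices), split within each class."""
    if not 0.0 < train_frac < 1.0:
        raise ValueError("train_frac must lie strictly between 0 and 1")
    rng = random.Random(seed)
    train: List[int] = []
    valid: List[int] = []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)  # random assignment within the class
        cut = round(train_frac * len(idx))
        train.extend(idx[:cut])
        valid.extend(idx[cut:])
    return sorted(train), sorted(valid)


if __name__ == "__main__":
    # Hypothetical toy labels: 50 presence (1) and 50 absence (0) records.
    toy_labels = [1] * 50 + [0] * 50
    train_idx, valid_idx = stratified_split(toy_labels)
    print(len(train_idx), len(valid_idx))
```

Fixing the seed makes the split reproducible, which matters when a small dataset makes the validation metrics sensitive to which records fall in each subset.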
The complementary strengths of RF, SVM, and BRT underscore their collective utility in ecological modeling. RF and BRT were particularly effective in assessing feature importance, while SVM provided the highest resolution in classification tasks. This integrated approach offers a more comprehensive understanding of A. fatua habitat suitability and enables the creation of nuanced maps tailored to specific regional conditions. Such detailed mapping provides a critical basis for targeted interventions, ensuring that management resources are deployed efficiently and effectively in areas at high risk of weed invasion.
The practical implications of this study extend beyond theoretical modeling. By generating habitat suitability maps, this research equips agricultural practitioners with precise tools for implementing site-specific weed management strategies. This targeted approach not only minimizes herbicide usage but also supports environmentally conscious practices that align with the principles of sustainable agriculture. Topographic factors, such as slope and aspect, emerged as pivotal predictors, corroborating findings from Yang et al. (Reference Yang, Zhang, Zhang, Bidegain, Dong, Hu, Li, Zhang and Guo2023) and Vykydalová et al. (Reference Vykydalová, Barroso, Děkanovský, Neoralová, Lumbantobing and Winkler2024) that highlight the role of microclimatic conditions in shaping weed distribution. Similarly, the influence of soil texture and temperature on habitat suitability aligns with broader ecological studies, such as those by Dastres et al. (Reference Dastres, Jahangiri, Edalat, Zamani, Amiri and Pourghasemi2023) and Yao et al. (Reference Yao, Nan, Li, Li, Liang and Zhao2023), emphasizing the adaptive strategies of A. fatua in diverse agroecological contexts.
While the study showcases the effectiveness of RF, SVM, and BRT, it also acknowledges limitations inherent to these models. The accuracy of predictions is influenced by data quality and representativeness, as highlighted in recent works by Hasan et al. (Reference Hasan, Roy, Talha, Ferdous and Nasher2024) and Xu et al. (Reference Xu, Liang, Hahn, Zhao, Lo, Haller, Sobkowiak, Chitwood, Colijn, Cohen, Rhee, Messer, Wells, Clark and Kim2024). Algorithmic biases, environmental variability, and scalability challenges further underscore the need for continuous refinement of these methods. For instance, temporal and spatial changes in environmental conditions may reduce the reliability of predictions over time, necessitating the development of more adaptive and scalable modeling frameworks. Future research should focus on addressing these limitations to enhance the robustness and generalizability of machine learning applications in weed science.
In conclusion, this research advances the field of weed science by demonstrating the potential of machine learning models to improve habitat suitability predictions for dominant weeds like A. fatua. By integrating ecological, agronomic, and computational insights, the study lays a foundation for the development of sustainable, data-driven weed management strategies. The findings not only highlight the efficacy of RF, BRT, and SVM in ecological modeling but also provide a road map for their broader application in addressing challenges associated with agricultural sustainability and biodiversity conservation.
Funding statement
This research was funded by the Research Council of Shiraz University, grant/award no.: 98GCU1M75346.
Competing interests
No competing interests have been declared.