Introduction
Non-clinical aspects of life, such as social, environmental, behavioral, psychological, and economic factors, play significant roles in shaping patient health and health outcomes. These are broadly studied as Social Determinants of Health, which the World Health Organization defines as “conditions in which people are born, grow, live, work and age” and “fundamental drivers of [health]” [1]. We refer to the entirety of these non-clinical or social factors as a patient’s “sociome.” Including sociome datasets in research is often a burdensome data problem, both in finding and in integrating disparate datasets: clinical patient data have to be integrated with other data sources to characterize a patient’s life outside of their clinical interactions. Due to the diversity of data sources and file types that sociome research has to consider, key bottlenecks in scaling such research to large patient populations include data integration [Reference Mazilu, Paton, Konstantinou and Fernandes2], data harmonization [Reference Krasikov and Legner3], uneven data quality [Reference Batini, Cappiello, Francalanci and Maurino4], and statistical modeling of multimodal datasets [Reference Storås, Strümke, Riegler and Halvorsen5]. Consequently, studies often focus on one factor, a composite index, or a set of highly related factors [Reference Patel, Hall, Broadhurst, Smith, Schultz and Foong6], where potentially crucial nuances and interactions among factors can be lost.
Here, we report on the design and implementation of the Sociome Data Commons (SDC). Leveraging the expertise of the Pediatric Cancer Data Commons in collecting, harmonizing, and sharing data [Reference Volchenboum, Cox, Heath, Resnick, Cohn and Grossman7], we created a repository of pre-harmonized, geospatial sociome datasets that can be used in concert with clinical data to predict a variety of outcomes. To this end, we:
- Assembled and integrated publicly available geocoded datasets about social, environmental, behavioral, psychological, and economic exposures.
- Developed a data governance framework using a structured, standardized metadata model that conforms to FAIR (findable, accessible, interoperable, reusable) [Reference Wilkinson, Dumontier and Aalbersberg8] principles.
- Established a statistical methodology for analyzing sociome datasets of varying scope and quality, and for scaling and sustaining such analysis over large populations, environments, and diverse data sources.
To evaluate the SDC, we performed a pilot use case to identify sociome factors associated with pediatric asthma exacerbations on the South Side of Chicago. Pediatric asthma was selected as it is a community priority [9], has well-documented social disparities [Reference Dircksen and Prachand10], and is known to have many sociome influences, including housing and environmental conditions [Reference Grant, Croce and Matsui11,Reference Sullivan and Thakur12]. In addition, clinical factors alone or models with limited variables have lacked sufficient predictive power for asthma outcomes [Reference Patel, Hall, Broadhurst, Smith, Schultz and Foong6,Reference Khanam, Gao and Adamko13,Reference Kothalawala, Kadalayil and Weiss14].
Materials and Methods
In this manuscript, we use the following terms as defined below:
- Sociome Data Commons (SDC): A cloud-based repository of datasets characterizing a variety of local social, environmental, behavioral, psychological, and economic exposures.
- Metadata: Standardized descriptions of the content, quality, ownership, lineage, and scope of a dataset in the SDC.
- Model: A statistical or machine learning analysis that associates factors in the SDC to an outcome, often derived from the electronic health record (EHR) or a clinical study.
- Data Governance: The overall management and control of the assets in SDC, encompassing the policies, procedures, and frameworks that ensure data quality, accessibility to researchers, ethics, and privacy throughout its lifecycle.
- Generalizability: 1. The ability of the SDC to store and serve multiple types of data and multiple types of models. 2. Purposefully, only generalizable data is included in the SDC, whether probability-based survey data, modeled environmental metrics, direct measurements, or surveillance records.
- Sustainability: The software infrastructure and organizational processes that govern the SDC will persist beyond initial pilot studies.
The SDC is intended to be a persistent platform that can be meaningfully engaged by researchers across disciplines. It is built with the guiding principles described in Fig. 1. Each of the guiding principles is elaborated in the supplement.
Software Implementation
The SDC team has assembled and integrated diverse datasets into a simple, well-documented, interoperable format. Researchers can access these datasets directly from code through an application programming interface (API) or via an interactive website (Fig. 1). The datasets are categorized with a structured, standardized metadata model that conforms to FAIR principles. These sociome datasets can be pulled into a protected enclave where they can be joined to clinical data (protected health information [PHI] or a limited data set). Deployments behind an appropriate firewall can simplify privacy and security requirements for PHI, while a cloud-based multi-tenant solution could facilitate larger-scale collaborative research projects. Each dataset is documented with metadata describing its scope, quality, and units of measure. The project uses a Python toolkit (which will be made available under an open-source license) to aid with common data integration and harmonization steps that researchers using the data might encounter. Researchers can identify the types of sociome factors they wish to investigate and easily build an integrated profile for a certain region.
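To make the idea of a per-dataset metadata record concrete, the following is a minimal sketch of how such an entry might be represented in Python. It is not the SDC's actual schema or API: the class name, field names, the 1 to 5 bounds on the quality score, and the example values are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DatasetMetadata:
    """Hypothetical metadata entry; every field name here is illustrative."""
    name: str
    category: str                 # e.g. "environmental exposures"
    geographic_level: str         # e.g. "census tract"
    quality_score: int            # assumed to run from 1 (low) to 5 (high)
    source: str                   # ownership / provenance
    units: dict = field(default_factory=dict)  # variable name -> unit of measure

    def __post_init__(self):
        # Reject scores outside the assumed 1-5 scale.
        if not 1 <= self.quality_score <= 5:
            raise ValueError("quality_score must be between 1 and 5")


# Made-up example values, not an entry from the SDC.
entry = DatasetMetadata(
    name="example_dataset",
    category="environmental exposures",
    geographic_level="census tract",
    quality_score=4,
    source="example agency",
    units={"tree_cover": "percent"},
)
```

Validating the record at construction time is one way to keep malformed metadata out of a repository; a production system would more likely enforce this in a database schema or a schema-validation layer.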
Governance and Sustainability
The SDC establishes standards for dataset quality, dataset inclusion, metadata annotation, and data access. These standards will help promote trust in the included data and any derived conclusions. Details are included in the supplement. Novel contributions are presented in Table 1.
Asthma Pilot Methodology
We conducted a pilot use case of the SDC for demonstration and to test workflows, beginning with a period of discovery. The pilot investigated sociome factors potentially related to pediatric asthma exacerbations. Clinical data were extracted from the University of Chicago Hospital’s EHR for all pediatric visits (age < 18). Data management and analysis of the extracted data occurred mainly in Python [Reference Van Rossum and Drake15]. All clinical data (address history, demographics, diagnoses, and encounters) were stored and analyzed on University of Chicago HIPAA-compliant compute and storage infrastructure. Some potentially important clinical data, including allergy testing, asthma control test score, and overweight status, were not available at the time of this pilot. This study was approved by the University of Chicago BSD IRB, #21-1920, and a waiver of consent was granted for this retrospective study.
Geocoding
To adhere to privacy requirements for PHI, an on-site geocoder was preferred. To test robustness across available geocoder platforms (both cloud-based and on-premises), we geocoded 1,000 randomly selected, publicly available Chicago addresses [16] as well as systematic misspellings of the address of Chicago’s City Hall (a public landmark). We tested batch geocoding with Decentralized Geomarker Assessment for Multi-Site Studies (DeGAUSS, locally hosted geocoding software) [Reference Brokamp, Wolfe, Lingren, Harley and Ryan17] against industry standards: OpenStreetMap, GoogleV3 (both via GeoPy [18]), and the Census geocoder [19]. DeGAUSS performed as well as GoogleV3 and the Census geocoder, and all three outperformed OpenStreetMap. Details are in the supplement.
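The exact misspelling scheme is described only in the supplement, so the following is a hypothetical sketch of one way to generate systematic misspellings of a street name for such a robustness test. The function name, the choice of single-character deletions and adjacent swaps, and the address template are assumptions, not the authors' procedure.

```python
def misspellings(text):
    """Single-edit variants of text: one character deleted, or two
    adjacent (different) characters swapped."""
    variants = set()
    for i in range(len(text)):
        variants.add(text[:i] + text[i + 1:])  # delete character i
        if i + 1 < len(text) and text[i] != text[i + 1]:
            # swap characters i and i+1
            variants.add(text[:i] + text[i + 1] + text[i] + text[i + 2:])
    variants.discard(text)
    return sorted(variants)


street = "LaSalle"  # street name of Chicago's City Hall address
street_variants = misspellings(street)
# Illustrative address template for feeding the variants to a geocoder.
test_addresses = [f"121 N {v} St, Chicago, IL 60602" for v in street_variants]
```

Each generated address could then be submitted to every geocoder under test, and the returned coordinates compared against the known location of the landmark.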
Missing Data
The level of missingness in the clinical data was low: insurance was 5% missing, race/ethnicity 2%, and gender < 1%. Missingness was resolved in two phases. First, each patient’s records were sorted by date, and values were filled forward and backward within that patient. Second, remaining missingness (2% for race/ethnicity, < 1% for insurance and gender) was resolved with multiple imputation [Reference Patel, Hall, Broadhurst, Smith, Schultz and Foong6,20].
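The first phase (filling forward and backward within each patient, in date order) can be sketched as follows. This is an illustrative standard-library version under assumed record and field names (`patient_id`, `date`, `insurance`), not the authors' code; the second phase, multiple imputation, is not shown.

```python
from collections import defaultdict


def fill_within_patient(rows, key):
    """Fill missing values (None) of `key` forward, then backward, within
    each patient's date-ordered visits. Rows are modified in place."""
    by_patient = defaultdict(list)
    for row in rows:
        by_patient[row["patient_id"]].append(row)
    for visits in by_patient.values():
        visits.sort(key=lambda r: r["date"])  # ISO dates sort correctly as text
        last = None
        for r in visits:                      # forward fill
            if r[key] is None:
                r[key] = last
            else:
                last = r[key]
        nxt = None
        for r in reversed(visits):            # backward fill
            if r[key] is None:
                r[key] = nxt
            else:
                nxt = r[key]
    return rows


# Made-up records for illustration.
rows = [
    {"patient_id": 1, "date": "2018-03-01", "insurance": None},
    {"patient_id": 1, "date": "2018-01-15", "insurance": "Medicaid"},
    {"patient_id": 1, "date": "2017-06-02", "insurance": None},
    {"patient_id": 2, "date": "2019-02-10", "insurance": None},
]
filled = fill_within_patient(rows, "insurance")
```

Patient 2 has no observed value at any visit, so the field stays missing and would be left for the imputation phase.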
Outcome Definition
Visits for asthma and asthma exacerbations were identified by the encounter text description including “asthma” and “exacerbation,” respectively. Using text captured 2,010 more asthma encounters (out of 3.3 million total visits, 2006–2021) than using ICD codes alone. Both terms were required in the visit text for a visit to qualify as an acute asthma exacerbation.
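A minimal sketch of this text-based rule is shown below. The function name and the case-insensitive substring matching are assumptions; the source does not state how the text was normalized.

```python
def classify_visit(description):
    """Return (is_asthma, is_exacerbation) from an encounter description.

    An exacerbation requires both "asthma" and "exacerbation" in the text.
    """
    text = description.lower()
    is_asthma = "asthma" in text
    is_exacerbation = is_asthma and "exacerbation" in text
    return is_asthma, is_exacerbation


# Made-up encounter descriptions for illustration.
examples = [
    "Asthma with acute exacerbation",
    "Mild intermittent asthma",
    "COPD exacerbation",
]
labels = [classify_visit(d) for d in examples]
```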
Spatial Clustering
Spatial clustering is based on the assumption that “location matters”: events near each other are often more related than events far apart. Clustering combines geographic areas (here, census tracts) to maximize similarity among the census tracts within each cluster and to maximize dissimilarity between clusters. This clustering is performed to find meaningful spatial commonalities. Further, clustering reduces the size of large location datasets. For example, over 300 census tracts on the South Side of Chicago can be reduced to fewer than 10 spatial clusters. Different clusters can have different characteristics, summary descriptions, and, especially of interest here, risk profiles.
Clustering itself is an unsupervised learning problem in that the researcher does not know which areas will be grouped together (though the researcher does need to decide on the number of clusters). We used the skater algorithm [Reference AssunÇão, Neves, Câmara and Da Costa Freitas21] in R [22], which builds a connectivity graph whose nodes are the tracts’ central points and whose edges join tracts with contiguous borders. The “cost” of each edge is established from the dissimilarity between neighboring tracts, and edges with greater costs are pruned. Statistical significance is assessed with Moran’s I statistic, which measures spatial autocorrelation, that is, the degree to which observations at neighboring locations resemble one another [Reference Zhou, Lin, Shekhar and Xiong23].
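For readers unfamiliar with Moran's I, the statistic itself can be computed as in the sketch below, using the standard definition I = (n / W) · Σᵢⱼ wᵢⱼ (xᵢ − x̄)(xⱼ − x̄) / Σᵢ (xᵢ − x̄)², where W is the sum of all weights. The four-tract layout and binary contiguity weights are made up for illustration; the study itself used R, and the significance test (typically a permutation or normal approximation) is not shown.

```python
def morans_i(values, weights):
    """Moran's I for `values` given a square spatial weights matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_total = sum(sum(row) for row in weights)
    cross = sum(
        weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n)
    )
    denom = sum(d * d for d in dev)
    return (n / w_total) * cross / denom


# Four hypothetical tracts in a row; 1 marks a shared border.
w = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
smooth = morans_i([1, 2, 3, 4], w)        # neighbors similar: positive
alternating = morans_i([1, 4, 1, 4], w)   # neighbors dissimilar: negative
```

Values near +1 indicate that neighboring tracts have similar values, values near −1 indicate a checkerboard pattern, and values near 0 indicate no spatial structure.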
Spatial clustering was conducted with exacerbation visits as a proportion of all asthma visits and only for tracts with at least 10 visits. The exacerbation percentage was then mean-centered and standardized. Since we did not know how many clusters would be meaningful, we output five different spatial clustering variables. These variables had different numbers of cluster assignments – between three and seven – and were designed to be tested in our later model. That is, the model would determine whether three total clusters or up to seven total were more important (described below).
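The clustering input described above (exacerbation share per tract, restricted to tracts with at least 10 visits, then mean-centered and standardized) can be sketched as follows. The tract identifiers and counts are made up, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def standardized_exacerbation_share(tract_counts, min_visits=10):
    """tract_counts maps tract id -> (exacerbation visits, asthma visits).

    Returns z-scores of the exacerbation share for tracts that meet the
    minimum visit count.
    """
    shares = {
        tract: exac / total
        for tract, (exac, total) in tract_counts.items()
        if total >= min_visits
    }
    values = list(shares.values())
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumed)
    return {tract: (s - mu) / sigma for tract, s in shares.items()}


counts = {"A": (4, 10), "B": (12, 20), "C": (1, 5)}  # hypothetical tracts
z = standardized_exacerbation_share(counts)
```

Tract "C" is dropped because it has fewer than 10 asthma visits; the remaining shares (0.4 and 0.6) standardize to roughly −1 and +1.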
Sociome Data Commons
For SDC datasets, a breadth of high-quality data types were chosen for this pilot study. All data were consistently aggregated at the census tract level to match the included American Community Survey (ACS) 2015–2019 planning database [24] and to avoid the modifiable areal unit problem of using different geographic levels in the same study [Reference Wong, Janelle, Warf and Hansen25]. As available, data were reduced to years 2017–2019 to match the clinical data. Chicago crime [26] was characterized via FBI guidelines [27] as violent or not and rates for all crime, violent crime, and homicide (as a subset of violent crime) were created. Rates were also created for Chicago building code violations [26]. ChiVes, the Chicago data collaborative and community mapping application, assembled a novel dataset including tree cover, biodiversity, summer PM (particulate matter) 2.5 estimates, traffic levels, and housing cost burden, among others [28]. We developed a housing dataset, including building age and repair condition, sourced from the Cook County Tax Assessor’s Office [29]. The Environmental Protection Agency (EPA)’s Environmental Justice mapping and screening tool (EJ Screen) data was also included [30].
We selected variables in the ACS planning database [24] that are relative to the population (census tract averages, percentages, or median values). ACS average and percentage variables with a range of less than 10% were excluded, as data need to vary to reveal meaningful differences. Census tract race/ethnicity distributions were also excluded to prevent overfitting on race. Reciprocal measures (e.g., percentage of male and female) were reduced to one variable.
To account for poverty and still identify other components of neighborhood conditions, we collapsed all ACS poverty-related variables together. Correlations were assessed relative to the percentage of persons living below the poverty level, and the inclusion threshold was set at an absolute correlation of 0.5. Both Pearson (linear relationships) and Spearman (monotonic relationships) methods were used to maximize inclusion. Principal component analysis (PCA), which combines variables through a linear orthogonal transformation, was then used for feature reduction of the poverty-correlated variables (hereafter “poverty PCA”) [Reference Pedregosa, Varoquaux and Gramfort31,32].
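The inclusion rule (keep a variable if either its Pearson or its Spearman correlation with the poverty percentage reaches 0.5 in absolute value) can be sketched in plain Python as below. The variable names and values are invented for illustration, and this is not the authors' code.

```python
from math import sqrt


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    # Spearman correlation is the Pearson correlation of the ranks.
    return pearson(ranks(x), ranks(y))


def poverty_correlated(poverty, candidates, threshold=0.5):
    """Names of candidates whose |Pearson| or |Spearman| meets the threshold."""
    return [
        name
        for name, col in candidates.items()
        if max(abs(pearson(poverty, col)), abs(spearman(poverty, col))) >= threshold
    ]


poverty = [5, 10, 20, 40, 80]             # hypothetical tract percentages
candidates = {
    "median_rent_rank": [5, 4, 3, 2, 1],  # falls steadily as poverty rises
    "unrelated": [2, 5, 1, 4, 3],
}
selected = poverty_correlated(poverty, candidates)
```

Taking the larger of the two coefficients is what makes the rule inclusive: a variable with a monotonic but curved relationship to poverty can pass on Spearman even when its Pearson coefficient falls short.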
Model
Data were modeled on the visit level and restricted to asthma visits. To avoid a UChicago-specific EHR artifact that occurred in late 2016 as well as COVID-19 pandemic complications, data were restricted to visits from 2017 to 2019. Because of sparse data outside of the South Side of Chicago, this pilot was limited to census tracts on the South Side (Table 2 and Fig. 2). Patient race/ethnicity and insurance variables were excluded: race/ethnicity to prevent overfitting on race and to permit later bias testing, and insurance status because public health insurance is a proxy for poverty. All five spatial cluster variations were included to determine which provided the most information to the model.
One nonlinear machine learning algorithm, a boosted decision tree [Reference Chen and Guestrin33], was piloted. Decision trees are non-parametric and do not make assumptions about the form of the data; they can manage correlated data, handle high dimensionality, and tolerate missingness. Boosted trees are an ensemble meta-algorithm that iteratively combines weak decision tree learners into a strong one, such that each new tree improves upon the previous ones. The strengths of decision tree-based modeling give the researcher the ability to include data elements not previously identified as risks in the literature, allowing for discovery of novel influences upon an outcome. Diverse datasets (including factors known to be related to asthma in the literature and, purposefully, elements that are not in the literature) were included in the model to allow for novel factor discovery. After this pilot period of discovery, more specific modeling will be undertaken.
Decision tree models allow more flexibility and use of more complex datasets but also risk overfitting to the noise or randomness in the data. To assess overfitting, we split the data into train and test sets. The model is designed (“fit”) on train data only, and then solely “run” on the test data to predict outcomes. Model prediction accuracy, defined below, is compared between train and test datasets to determine if the model is overfitting on the initial train data.
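The train/test comparison can be sketched as follows. The helper names, the 30% hold-out default, and the fixed seed are illustrative choices rather than the authors' code, and the integers stand in for visit records.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of rows and hold out test_fraction of them."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def overfit_gap(train_accuracy, test_accuracy):
    """Positive values mean the model does better on data it has seen."""
    return train_accuracy - test_accuracy


visits = list(range(10))  # stand-ins for visit records
train, test = train_test_split(visits)
```

A large positive gap suggests overfitting. Note that a random split at the visit level can place visits from the same patient in both sets; whether the pilot grouped visits by patient when splitting is not stated in the text.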
The outcomes were slightly imbalanced in that there were 2.4 times as many routine asthma visits as exacerbation visits; this can affect prediction accuracy, especially for the minority class (exacerbations). The XGBoost model allows for rebalancing by adjusting class weights as a model hyperparameter. Other model hyperparameters were optimized with Hyperopt [Reference Bergstra, Yamins and Cox34] and manually adjusted after fit assessment with a 70/30 train/test split; with the final hyperparameters, shuffled 5-fold cross-validation was conducted.
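As a rough illustration of these two steps, the sketch below computes a negative-to-positive class ratio and generates shuffled 5-fold index splits. XGBoost exposes a `scale_pos_weight` parameter that is commonly set to this ratio, but the text does not say which weighting parameter or value the authors used, so treat that mapping as an assumption. The label counts are made up to reproduce the reported 2.4 ratio.

```python
import random


def negative_to_positive_ratio(labels):
    """Number of negatives (0) per positive (1)."""
    positives = sum(labels)
    return (len(labels) - positives) / positives


def shuffled_kfold(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for k shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, folds[i]


labels = [1] * 10 + [0] * 24          # hypothetical: 24 routine, 10 exacerbation
weight = negative_to_positive_ratio(labels)
splits = list(shuffled_kfold(len(labels)))
```

Each index appears in exactly one test fold, so every visit is used once for evaluation and four times for fitting.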
Evaluation
Metrics center on comparing model predictions to actual values. “Positive” indicates the outcome of interest, here, an asthma exacerbation, while “negative” indicates no exacerbation. “True” indicates that the actual outcome in the data matches the model-predicted outcome, while “false” indicates a mismatch between real and predicted outcomes. The metrics are: true positive (TP); true negative (TN); false positive (FP); and false negative (FN). Accuracy is the proportion of correct predictions ((TP + TN) / (TP + FP + TN + FN)). Test accuracy is the accuracy of the model on only the test dataset. Recall is the proportion of actual positives identified correctly (TP / (TP + FN)).
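The definitions above translate directly into code. In this sketch, 1 denotes an exacerbation and 0 a routine visit; the example labels are made up.

```python
def confusion_counts(actual, predicted):
    """Return (TP, TN, FP, FN) for binary labels where 1 is positive."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == 1 and p == 1 for a, p in pairs)
    tn = sum(a == 0 and p == 0 for a, p in pairs)
    fp = sum(a == 0 and p == 1 for a, p in pairs)
    fn = sum(a == 1 and p == 0 for a, p in pairs)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + fp + tn + fn)


def recall(tp, fn):
    return tp / (tp + fn)


actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 1, 1]
tp, tn, fp, fn = confusion_counts(actual, predicted)
acc = accuracy(tp, tn, fp, fn)   # 5 correct out of 8
rec = recall(tp, fn)             # 2 of 3 exacerbations found
```

With imbalanced outcomes, accuracy alone can look acceptable while many exacerbations are missed, which is why recall for the positive class is reported alongside it.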
Variable importances were determined by gain, which is the relative improvement in accuracy contributed by a particular feature. Gain provides a ranking of all variables in the model to indicate which are most important. It is vital to note that decision trees are not regression lines. If two variables are correlated, the decision tree will rank as more important whichever variable provides better accuracy. For example, we included five different spatial clustering variables to test which one the model ranked highest, that is, which of the five provided the most gain. This served as variable selection for spatial clusters for future efforts.
A baseline model included clinical and ACS data only. This was followed by a model with all SDC datasets and clusters to determine any improvement in predictive power. Further, the model would decide which of the five cluster variations provided the most gain to the model and, additionally, if clusters were at all important relative to other variables.
Protocol Testing
Two data-driven inclusion protocols were tested as future optional tools for users. Each used lasso-regularized logistic regression [Reference Seabold and Perktold35] and added SDC datasets one by one to the clinical data (serially, not cumulatively). The first protocol assessed the inclusion of full datasets via the Akaike information criterion (AIC; the lower the value, the better the model quality) [Reference Cavanaugh and Neath36]. The second protocol used lasso for variable selection: each variable that reached statistical significance (p < .05) was included. Lasso regularization shrinks coefficients toward zero, which limits overfitting, and is a useful technique for variable selection. (See the supplement.)
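The first protocol's comparison can be illustrated with the standard definition AIC = 2k − 2 ln(L̂), where k is the number of estimated parameters and L̂ the maximized likelihood. The sketch below ranks candidate dataset additions by AIC; the dataset names, log-likelihoods, and parameter counts are invented purely to show the mechanics and are not results from the study.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion; lower is better."""
    return 2 * n_params - 2 * log_likelihood


def rank_by_aic(fits):
    """fits maps a model label -> (log_likelihood, n_params)."""
    return sorted(fits, key=lambda name: aic(*fits[name]))


# Hypothetical fits: clinical data alone, and with one dataset added at a time.
fits = {
    "clinical only": (-520.0, 4),
    "clinical + housing": (-505.0, 9),
    "clinical + crime": (-518.0, 7),
}
ranking = rank_by_aic(fits)
```

In this made-up example the housing dataset lowers AIC relative to clinical data alone, while the crime dataset adds parameters without enough improvement in fit and so ranks last.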
Challenges
Challenges for this pilot include clinical data availability and the relatively low predictive power of variables (“signal”) with the high variability of individual patients (“noise”). There is also a sampling bias in that the University of Chicago sees a distinct patient population – demographically, socio-economically, and geographically – which likely does not generalize to other populations.
Results
Sociome Data Commons
We report initial technical metrics for the SDC.
SDC Scope and Quality
The initial data repository consists of 22 total datasets and 375 individual metadata entries documenting their quality, provenance, and scope. Dataset content categories include environmental exposures (n = 16), public safety (n = 3), demographics (n = 2), access and mobility (n = 2), property (n = 1), and economic activity (n = 1). The geographic levels include street addresses (n = 14), census tract (n = 7), census block (n = 1), and latitude/longitude points (n = 3).
In the initial pilot use case, only high-quality datasets were included. All 22 datasets to date have a data quality score of four or higher (see the supplement), and three of the datasets have a score of five. Furthermore, we found that datasets had varying geographic granularities. The fine-grained data, i.e., data at a finer precision than census tract, were in varying formats and scope. Harmonizing and aligning these data with the other datasets using our software toolkit requires significant effort. In all, the datasets include 1,906 total variables.
Usability Metrics
We evaluated the incremental cost of adding new datasets to the SDC by measuring the time required by a data engineering intern. These datasets had already been assessed by the team for relevance and quality. The metrics measure the time needed to use the automated harmonization software to reformat the data and enter the metadata into the commons. Over 13 randomly selected datasets, the average time to find a dataset and add it to the SDC was 25 minutes, with a wide standard deviation of 21 minutes; this does not include data cleaning or any integration activities.
CONSORT and Clinical Trial Ethos
We leveraged the deep expertise of the clinicians on the SDC team to develop a dataset inclusion pipeline that resembles a clinical trial flowchart [37]. Datasets are assessed through a review process and progress through a number of stages until inclusion. Fig. 3 shows the current status of this inclusion process.
Asthma Pilot Results
Clinical data
Using DeGAUSS, 93% of all clinical data were successfully geocoded from patient address to latitude and longitude, census block, and census tract. The other 7% of addresses were post office boxes (0.4%), non-address text (0.2%), and “imprecise” (6.9%), where “the address was geocoded, but results were suppressed because the precision was intersection, zip, or city and/or the score was less than 0.5 [Reference Brokamp38].” Table 2 is restricted to asthma visits among residents of Chicago between 2017 and 2019.
The UChicago pediatric asthma patient population is 77% Medicaid and 89% non-Hispanic Black (Table 2.A). Given the hospital’s location and consequent patient catchment, restricting the data to the South Side of Chicago captured most asthma visits (Table 2.B). However, population changes from this restriction included reductions in non-Hispanic Asian/Mideast Indian (78%) and White (81%) patients, while 97% of non-Hispanic Black and 92% of Hispanic patients remained, reflecting the segregation by race/ethnicity in Chicago. Moving between census tracts was not common, with a median of 1 census tract per patient (range 1–3).
Spatial clustering of census tract exacerbations as percentage of asthma visits was significant (Moran’s I = 0.5958, p < .0001) and also showed a dominant and geographically large cluster 1 where 39% of asthma visits were for exacerbations (Fig. 3c). In contrast, only 18% of visits were for exacerbations in cluster 5. The University of Chicago hospital itself resides within cluster 1, though the cluster extends far to the south and west (Fig. 3c). Restricting clinical data to patients within cluster 1, most asthma visits were retained. However, important changes include a dramatic reduction in the South Side Hispanic population (31%, Table 2.C).
Sociome Data Commons datasets
For the poverty PCA, Spearman added four additional variables not included in the Pearson results, while Spearman alone would have excluded eight identified from Pearson. Twenty-nine variables correlated with percent below the poverty level at |0.5| or above (see the supplement) and, along with percent below the poverty level, were mean-centered and standardized [Reference Pedregosa, Varoquaux and Gramfort31]. These variables were reduced to one PCA loading, which explained 56% of variance.
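To show what "one component explaining a share of variance" means computationally, the sketch below standardizes a small matrix and extracts the first principal component with a singular value decomposition. It uses NumPy rather than the library cited by the authors, the function name is hypothetical, and the two-column data are made up (their correlation is 0.8, so the first component explains (1 + 0.8) / 2 = 90% of the variance).

```python
import numpy as np


def first_component(X):
    """Standardize columns, then return (scores, explained_variance_ratio)
    for the first principal component."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # mean-center and standardize
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    ratio = S[0] ** 2 / np.sum(S ** 2)         # share of total variance
    scores = Z @ Vt[0]                          # one value per row (tract)
    return scores, ratio


# Two hypothetical poverty-correlated variables measured on four tracts.
X = np.array([[1.0, 1.0], [2.0, 3.0], [3.0, 2.0], [4.0, 4.0]])
scores, ratio = first_component(X)
```

The resulting `scores` vector is what would enter a downstream model as a single poverty feature in place of the original correlated columns.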
Model
A baseline boosted tree was run with only clinical and ACS data (which included the poverty PCA) and reached a test accuracy of 58%. With all datasets for the entire South Side, accuracy increased to 62%, with recall for class 1 (an asthma exacerbation) at 65%.
Feature importance for the South Side (Fig. 4a) revealed clear outliers for gain around 21 and again at 28 (Fig. 4c). Most features were at the lowest end of gain. Spatial clustering appeared twice in the top features. Seven clusters and the average age of housing provided the most gain in accuracy, followed by age at visit, proximity to Superfund pollution sites (marked for decontamination by the EPA), median rent, and the violent crime rate. The proportion of residential housing, urban flood susceptibility, and three spatial clusters followed in the top 10 variables.
Given the geographic dominance of spatial cluster 1 and its high proportion of asthma exacerbations, we ran a model only on those patients residing within that cluster. Cross-validated accuracy was 57%, accuracy was 61%, and recall for class 1 (an asthma exacerbation) was 60%.
For cluster 1, there is a clear grouping of top feature importances at gain above 25; otherwise, features accumulate around a gain of 8 (Fig. 4d). The average age of housing units, the percentage of those under age 19 with no health insurance, the visit month, and the patient’s age at visit were dominant variables in the model (Fig. 4b). These were followed by a second grouping of features in the top 10: those 65 and over with no health insurance, housing cost burden, foreign-born residents, median cost of rent, the poverty PCA, and several variables indicating a lack of health insurance.
Variable importances changed between the model including all of the South Side (Fig. 4a) and the model restricted to cluster 1 (Fig. 4b). Housing age (Fig. 5a) remained a top variable, as did patient age and median rent. However, moving from the full South Side to only cluster 1, housing cost burden (Fig. 5b) appeared in the top variables while urban flood susceptibility (Fig. 5c), the violent crime rate (Fig. 5d), and proximity to Superfund sites (Fig. 5e) left the top 10 important variables. Instead, lack of health insurance variables and the poverty PCA (Fig. 5f) gained importance.
Conclusion
Platform Discussion
The SDC has been purposefully designed to support and encourage extension of the platform into new data sets as well as the continued development, refinement, and adoption of SDC standards for dataset quality, dataset inclusion, metadata annotation, and data access/governance. The asthma pilot has served as the first driver use case for the SDC. Additional projects will be selected, in part for their ability to exercise and grow the capacity of the SDC to meet its ambitious goals. The purpose of this study is twofold: (1) to further the understanding of sociome factors in a variety of pathologies and (2) to understand principles of sustainable data commons design.
The first purpose has parallels to other similarly-intentioned ongoing efforts. The National Neighborhood Data Archive (NaNDA) at the University of Michigan [39], the Health Equity Explorer (H2E) at Boston Medical Center [Reference Adams, Gasman, Beccia and Cabral40], Exposomics from the University of Utah [41], the City Health Dashboard from New York University [Reference Betro, Breslin and Chen42], PopHR at the University of Tennessee [Reference Shaban-Nejad, Lavigne, Okhmatovskaia and Buckeridge43], and SDOH and Place [Reference Kolak, Bhatt, Park, Padrón and Molefe44] are among many groups aggregating important data expressly to enable researchers’ use of sociome factors.
We believe the second purpose – to understand the principles of sustainable data commons design – is unique to this project. The SDC is studying the design and implementation of such a data commons as a scientific problem including establishing quantitative metrics for assessing data quality, ease of use, researcher adoption, and sustainability. This manuscript contributes a framework for evaluating such criteria and we believe this to be an important contribution to this research area.
This pilot and manuscript have been tailored to Chicago, but we are collecting national datasets and can build additional location-specific datamarts as needed. However, local understanding provides critical context for properly using the data, and our group has deep knowledge of Chicago and its available datasets. Key informant interviews with community members and other content experts are underway, and the Institute for Translational Medicine’s Community and Collaboration core is currently conducting community outreach activities. Notably, not all cities have tools such as the City of Chicago’s Data Portal [26].
In ongoing work, the team plans to make the entire SDC infrastructure publicly available and open source. It will include the software artifacts designed as a part of the project, such as the data harmonization and analysis code. It will further open-source the management infrastructure of how to host and serve these datasets. Furthermore, we will release policy templates for data governance and quality assurance. Once we finish integration of datasets, the SDC will serve as a reference implementation for data commons across a variety of social contexts of health research problems.
Analysis Discussion
The pilot use case reported above was helpful in providing direction for our future SDC and analytic efforts. Spatial clustering of exacerbations, housing conditions, proximity to Superfund sites, violent crime, urban flood susceptibility, lack of health insurance, and the poverty PCA all contributed importantly to prediction of asthma exacerbation. Some of these reflect current literature on risks for asthma, such as housing conditions and violence [Reference Bryant-Stephens, Strane, Robinson, Bhambhani and Kenyon45,Reference Landeo-Gutierrez, Forno, Miller and Celedón46]. Further, housing variables appeared in the top 20 variables, such as the building violation rate and mobile home percentage (see the supplement).
Of course, these housing quality indicators are only proxy estimates for each individual patient’s actual housing conditions and exposures, such as indoor air quality, first, second, or third-hand smoke exposure, and mold or pest exposure [Reference Kang, McCreery and Azimi47–Reference Tiotiu, Novakova and Nedeva50]. Given a subset of patients’ actual housing and indoor air quality, we could work to identify which, if any, SDC datasets could serve as the best proxy [Reference Bozigar, Connolly and Legler51]. The addition of personal exposure data might increase predictive accuracy, and the extent to which this occurs would inform how well (or not) generalized survey data like the sociome datasets perform in their stead.
Proximity to Superfund sites merits further exploration of these and other pollution sites such as landfills and risk management plan sites. Exploring patient-level distance to these, rather than census tract estimates, is a next step. Notably, known risks such as PM2.5 exposure and traffic proximity [Reference Grant and Wood48] did not appear in the top 10 variables, though they are in the top 20 (see the supplement).
Other findings, such as the violent crime rate and lack of insurance, also replicate the literature [Reference Landeo-Gutierrez, Forno, Miller and Celedón46,Reference Hasegawa, Stoll, Ahn, Kysia, Sullivan and Camargo52]. Rarely seen findings [Reference Larson, Gronlund and Thompson53] to be further explored include urban flood susceptibility, which could indicate poorer housing quality and perhaps susceptibility to damp housing and mold growth. Further exploration of flooding is needed.
In this pilot, spatial clustering proved to be important. Model stratification was only conducted for the dominant cluster 1. A comparison of top variable importances demonstrates the promise of providing geography-specific risks (Fig. 4), though further work is needed to clarify the reasons for the cluster differences, as well as a comparison model for the full South Side excluding clustering variables. Still, future work exploring cluster-specific models could inform geographically tailored interventions.
Limitations
This study suffers from several data set limitations. For example, some probability surveys (such as NHANES [54]) contain important sociome data but are not publicly available at the census tract level.
The models resulted in rather weak signals. While we anticipated that the importance might lie in an aggregation of multiple weak signals, predictive improvement is still needed. By choosing just a few datasets, we might not have yet included the most important datasets. We anticipate that broadening the range of sociome factors in an expanded SDC (which we are currently building) may increase model performance.
The University of Chicago catchment area does not generalize to all pediatric asthma patients in Chicago. Varied data are needed to appreciate differences, and we need EHR data from other metropolitan Chicago health systems to provide greater patient heterogeneity. Efforts to expand the data are underway with our partners from the Institute for Translational Medicine [55].
The spatial clustering of exacerbations might be affected by proximity to the UChicago hospital, as exacerbations are often urgent or emergency events. However, cluster 1 does extend far to the south and southwest of the hospital. Adding other hospitals’ data should clarify clustering of exacerbations against hospital proximity.
Analysis needs to be expanded. We treated the pilot data as cross-sectional; future efforts will use longitudinal methods to correct for the resulting temporal bias. Including both temporality-based regression discontinuity and high-dimensional mediation analyses will also enable causal exploration. Patient-level, rather than visit-level, analysis (as a multilevel generalized estimating equation) might also increase predictive performance, as many of the asthma exacerbation visits could have come from a particular subset of patients. By matching patients to specific locations in our housing dataset, we can move to sub-census tract metrics. Future efforts should include patient-specific distance to the nearest hospital and also investigate asthma exacerbations in more detail, for example, whether a visit to the emergency department was needed. Modeling itself needs to be expanded. With machine learning, multiple models can be combined, leveraging the models’ strengths and balancing their respective deficiencies.
For this pilot study, we were missing data on many traditional clinical phenotypes (disease severity, atopic status, comorbidities) and all biological (genetic, epigenetic) information about the pediatric asthma patients included in this study. As a next step, we will include the asthma control test, obtained via the structured flow sheet. Asthma exacerbations will also be defined via medication prescriptions, though pharmacy dispensing information is not available. Phenotypes (atopic/non-atopic, obesity-related, etc.) will be assigned. If we are able to obtain biomarker data for select patients, that could allow identification of gene-environment interactions.
The users were restricted to the project team, with insights and guidance provided by the team. A robust governance framework will be crafted. In addition to the data governance discussed here, we will address appropriate use, responsible use, and community impact, including a flexible regulatory module to meet varied and changing requirements. We will activate a multi-disciplinary work group composed of technical, research, and legal experts, ethicists, and community representatives.
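The text does not specify how models would be combined. One common option is soft voting, in which each model’s predicted probability for a visit is averaged, optionally with weights. The sketch below illustrates that idea under this assumption; the function name `soft_vote` and the example numbers are hypothetical.

```python
def soft_vote(prob_lists, weights=None):
    """Combine per-model predicted probabilities by weighted averaging.

    `prob_lists` holds one list of probabilities per model, all of equal
    length (one entry per observation). `weights` optionally gives one
    non-negative weight per model; equal weights are used by default.
    """
    n_models = len(prob_lists)
    if n_models == 0:
        raise ValueError("at least one model is required")
    if weights is None:
        weights = [1.0] * n_models
    if len(weights) != n_models:
        raise ValueError("need exactly one weight per model")
    n_obs = len(prob_lists[0])
    if any(len(p) != n_obs for p in prob_lists):
        raise ValueError("all models must score the same observations")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Weighted mean of the models' probabilities for each observation.
    return [
        sum(w * p[i] for w, p in zip(weights, prob_lists)) / total
        for i in range(n_obs)
    ]


if __name__ == "__main__":
    # Two hypothetical models scoring the same two visits.
    print(soft_vote([[0.2, 0.8], [0.4, 0.6]]))
```

Weights could, for instance, be set from each model’s validation performance, so that a stronger model contributes more to the combined estimate.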
In sum, the pilot study reported here demonstrates promise for future analyses of the complex interactions of the sociome and clinical health factors using the SDC. We expect that its further development, including accounting for dataset quality metrics, will facilitate the accounting for sociome factors in a wide range of clinical research topics, from analysis of response to cancer immunotherapy to pandemic preparedness [Reference Badalov, Blackler and Scharf56], to common complex diseases like diabetes mellitus [Reference Egede, Campbell, Walker and Linde57], and more.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/cts.2023.670.
Acknowledgments
The authors thank Lila Midyett and Krish Modi for their help reviewing the literature.
Author contributions
BB: Software, Data Curation; AB: Writing – Original Draft; SB: Writing – Review & Editing; BC: Data Curation; CC: Resources; KB: Project Administration; EC: Conceptualization; ED: Writing – Review & Editing; BF: Writing – Review & Editing; JJ: Methodology, Investigation, Resources, Data Curation; HZ: Resources; NK: Resources; SK: Conceptualization, Methodology, Software, Validation, Formal Analysis, Investigation, Data Curation, Writing – Original Draft, Writing – Review & Editing, Supervision; KL: Data Curation; KL: Software, Data Curation, Writing – Original Draft; SM: Data Curation; DM: Conceptualization; SN: Writing – Original Draft; JO: Conceptualization, Writing – Review & Editing; CS: Writing – Original Draft, Writing – Review & Editing; MS: Data Curation, Writing – Review & Editing; AS: Conceptualization, Investigation, Supervision, Writing – Original Draft, Writing – Review & Editing; JS: Conceptualization, Writing – Review & Editing; ST: Methodology, Software, Formal Analysis, Investigation, Data Curation, Writing – Original Draft, Writing – Review & Editing, Visualization; SV: Conceptualization, Resources, Writing – Original Draft, Writing – Review & Editing.
Funding statement
This work was supported by NIH award number UL1TR002389-07.
Competing interests
S.N. served on an advisory board for Avillion/Astra Zeneca, receives royalties from Wolters-Kluwer and Springer, and research funding from NIH and Asthma Allergy Foundation of America.
C.S. served on advisory boards for Gilead and Merck, receives royalties from Wolters-Kluwer, and research funding from CDC.
A.S. holds voluntary positions in the American Medical Informatics Association and is an equity investor in healthcare companies and other industries.
J.S. reports a potential financial interest in PulmOne Advanced Medical Devices, Ltd, Israel, and research grant funding from NIH, NSF, and Respiratory Health Association of Metropolitan Chicago.
S.V. is co-founder and Chief Medical Officer for Litmus Health, Inc., and receives consulting royalties from CVS Accordant.