Introduction
With the advent of electronic health records and medical devices, modern longitudinal studies typically feature long sequences of observed data. Variables observed over time on study subjects can provide insight into the clinical course of a given biological system or disease, but often through noisy realizations of the true underlying longitudinal process that exhibit natural variation both between- and within-subjects over time. Here, we consider continuous glucose-monitoring (CGM) data where the daily glucose curves were repeatedly observed for each subject along with the duration of the study, and we consider these observations as repeated functional data (FD).
Characterizing and monitoring CGM data have potential value for the assessment of outcomes in clinical studies [Reference Beck, Calhoun and Kollman1]. Maintaining glucose control is essential for individuals with diabetes mellitus, particularly with respect to the development of comorbidities and during pregnancy. Variability in glucose has been studied for decades as a proxy for diabetes control, but research studies and clinical decision-making are typically based on summary measures [Reference Service2]. The most commonly reported summary measures are standard deviation, coefficient of variation, and mean amplitude of glycemic excursion, all of which have been used as clinical indicators for years. With more frequent monitoring using CGM, the percentage of time spent within the target range for a given patient is also utilized for care. Although these summary statistics provide a measure for variation around the mean and duration of time within healthy glycemic ranges, the underlying longitudinal structure, the mean response function, and natural variation over time are ignored [Reference Szczesniak3, Reference Frøslie4]. This approach ignores additional information that such FD offers [Reference Law5] and forces clinicians to rely on summary statistics that are prone to measurement error [Reference Barrett6]. Obtaining additional information, such as an indication of glucose levels across certain time points, measures of change, rate of change, and variability, and making it available for clinical application has the potential to advance clinical decision-making around optimizing glucose control [Reference Law5].
Moreover, summary measures generally result in misleading outcomes if portions of the longitudinal data are missing for a given individual [Reference Everitt7]. Missingness in CGM data can arise due to sensor failure, calibration error, or other reasons. These settings produce unequal numbers of repeated measurements and mistimed measurements, both between-subjects as well as within an individual subject recorded over a long duration of time as in multiple CGM sessions (see Fig. 1). In the statistics literature, having such irregular spacing over time is referred to as sparse longitudinal data. The number of observations per individual can vary considerably. Failing to account for sparse longitudinal data through appropriate estimation methods will lead to biased results [Reference Little and Rubin8]. In the clinical context, there is an abundance of data; however, the data are distributed in a sparse fashion over time.
Core and novel FD techniques [Reference Ramsay and Silverman9] may be useful tools to address issues that are ignored by traditional methods that report simple summary statistics from CGM data. The first goal is to identify phenotypes of type 1 diabetes based on glucose curves. To accomplish this, goal clustering approaches are needed that account for sparse longitudinal patterns of glucose levels. Clustering curve data with strong temporal correlation, like that of longitudinal glucose levels in type 1 diabetes, can be accomplished using the scores from FD analysis technique known as functional principal components analysis for sparse longitudinal data (FPCA) [Reference James, Hastie and Sugar10].
There are other methods that can be utilized to cluster glucose curves such as latent trajectory classification [Reference Szczerbinski11], hierarchical clustering [Reference Kahkoska12], K-means clustering [Reference Bian13], spectral clustering [Reference Hall14], and deep learning-based clustering [Reference de Jong15]. As previously demonstrated [Reference Frøslie4], FPCA will produce similar information to existing techniques when there is an equal number of measurements for each curve and the curves are measured at regular time points across individuals. However, this is not the typical case for CGM data. In this particular CGM application, we employ FPCA as our clustering approach for historical and methodological reasons. FPCA has been effective in clustering tracings from CGM and oral glucose tolerance test (OGTT) data in the literature [Reference Szczesniak3, Reference Frøslie4]. FPCA works by extracting key modes of variation from glucose level trajectories. Furthermore, the FPCA approach provides individual predictions of smoothed curves of nonlinear decline/increase of glucose. The sparse FPCA approach that we employ accommodates incomplete longitudinal data in the form of missing at random (known as MAR) [Reference Szczesniak3], and it accounts for temporal correlation and use of profiles with varying numbers of glucose values. Thus, this methodology is well suited to CGM data analyses, where numbers of values and trajectories of glucose change over time. FPCA is useful for dimension reduction by yielding functional principal components (FPCs) scores that represent modes of variation, and we then utilize these components to identify phenotypes of glycemic variability that ultimately will lead to improved clinical action. In addition to the traditional FPCA method, we characterize the daily specific glucose curves and their secular evolution using the double FPCA approach described previously [Reference Chen and Muller16].
In prior work [Reference Frøslie4], FPCA was performed to extract shape information of OGTT curves from 974 healthy pregnant women in their first trimester, which was not identified by simple summary measures. The obtained information (FPC scores) discriminated between women with and without gestational diabetes later in pregnancy. Additionally, the FPC scores in the first trimester were associated with large-for-gestational-age (LGA) birth, while summary measures suggested there was no association. Other previous work [Reference Szczesniak3] characterized the timing and degree of variability in glucose from 147 women with type 1 diabetes who had repeated monitoring over the course of gestation. The glucose profiles were clustered into three subgroups of high, moderate, or low heterogeneity, relative to the overall mean response. These clusters, referred to as phenotypes, were associated with clinical characteristics of the cohort at the beginning of pregnancy, longitudinal changes in maternal glycohemoglobin (hbA1c), and weight and pregnancy-related outcomes.
The second goal of the current study is to examine the risk of glycemic excursions throughout the monitoring period in real time. Repeated measures of glucose levels from CGM are clearly correlated over time within a subject, and a traditional random intercept model [Reference Laird and Ware17] has been commonly considered to fit longitudinal data. Various semiparametric mixed-effects models have been used to fit glucose trajectories [Reference Vandyke18]. Further, a semiparametric mixed-effects model with penalized regression splines was considered to provide smooth estimates of the longitudinal glycemic profiles for blood glucose data during gestation [Reference Gupta19]. Recently, proposed machine learning and time series techniques for predicting glucose levels have been extensively discussed [Reference Rodríguez-Rodríguez20]. These authors later compared support vector machine (SVM), random forest (RF), and autoregressive integrated moving average (ARIMA) models for forecasting glucose level and concluded that the prediction model developed using the RF method was the most accurate of those considered. Techniques such as recurrent neural network (RNN) [Reference Li21] and artificial neural network (ANN) [Reference Pérez-Gandía22] have been also used for predicting glucose trajectories. Training RNN or ANN models can be computationally expensive. Most RNN models are not able to model sparse and irregularly sampled sequential data [Reference Ngufor23] like that in our CGM application. RNN models automatically learn features with higher complexity and representations, but the learning capacity of ANN is limited due to the model complexity since these models are mostly implemented in fewer than three layers [Reference Li21]. These layers are neurons within the neural network that process a set of input features or the output of those neurons.
To produce predictions while preserving clinical interpretability, we adapt a nonstationary Gaussian linear mixed-effects model (LME) that is preferable over the traditional random slope-intercept mixed-effects models, especially when the data have long follow-up sequences to fit the glycemic profiles and predict excursions. In line with our prediction goal, this model provides a framework for estimating the real-time risk of hypo-hyperglycemic based on target functions that we define in Sect. 3.3. Unlike most of the machine learning methods, our chosen approach does not require model training; hence, it is computationally quite efficient. The proposed model yields predictive probabilities that are interpretable for risk assessment, and the traditional parameter estimates provide useful information about how different covariates/features are associated with the response variable. Additionally, this prediction model allows further analysis of the derivative of the fitted curves (e.g., computing rate of change for glucose trajectory) which could be useful if one is interested in obtaining rapid decline/increase in glucose. By contrast, the weights from machine learning prediction techniques, which serve as the primary decision-making information, are not directly interpretable without performing transformations that require additional assumptions [Reference Connolly24]. Methods stemming from our chosen approach have been mostly utilized for monitoring in clinical trials to compute the predictive probability of success given interim data [Reference Dmitrienko and Wang25]. Recently, predictive probabilities have been expanded in the context of monitoring progression toward renal failure [Reference Diggle, Sousa and Asar26] and rapid progression of lung function in cystic fibrosis [Reference Szczesniak27].
The next section gives the details of the CGM data. Section 3 provides the statistical methods used in our work. Section 4 presents the results for our FDA and stochastic model applications. We conclude in Sect. 5 with a discussion of the statistical methods used in this work. The computer code for our analyses is provided in Supplemental materials.
Description of CGM Data
We downloaded data from the Jaeb Center for Health Research from the Juvenile Diabetes Research Foundation (JDRF) CGM Study [28]. The goal of this randomized trial was to compare unblinded real-time CGM (RT-CGM) to blinded collection (Control). The analysis cohort used for this study consisted of patients with a clinical diagnosis of type 1 diabetes who had used daily insulin therapy for at least 1 year, were at least 8 years of age, had glycohemoglobin A1C (HbA1C) less than or equal to 10% and insulin regimen involving either use of an insulin pump or multiple daily injections of insulin and had been stable for the last 2 months prior to randomization. The approval of the Institutional Review Board at Cincinnati Children’s Hospital Medical Center is not required since the data is publicly available at http://publicfiles.jaeb.org/jdrfapp/dataset/RT_CGM_Randomized_Clinical_Trial.zip.
The analysis cohort includes a total of 443 (232 RT-CGM; 211 Controls) participants [28] and we focus on the primary cohort’s 7-day CGM sessions at the first week. Of the 443, 55% of participants are female, 94% of them are White, and mean age (range) is 25 (8–72) years. The glucose level of individuals is reported every 5 min between clock time 0:00 and 24:00 over a week, which yielded a total of 610,823 measurements. The average number of CGM measurements per subject is 204/day and 1379/week, indicating an overall missing data rate of 31.6% (assuming 5-min observations over 7 days). The daily curves of glucose data are repeatedly collected for each subject as the study progresses from the beginning to the end. The daily glucose curve was repeatedly observed for each subject across the duration of the study. This data is considered to be repeated FD. There is substantial variation both between individuals and within a given individual over time (see Fig. 1).
Methods
In this section, we describe the approaches undertaken to complete the two goals of our study – clustering and prediction. Although these two goals are related, we pursue each goal independently. For the first goal, we used FPCA to extract the modes of temporal variation between glucose curves (Sect. 3.1). We then used the two-stage FPCA method to characterize the daily specific glucose curves and their secular evolution (Sect. 3.2). To address the second goal, we fit a Gaussian LME to predict periods of glycemic excursions by using CGM data (Sect. 3.3).
FPC Analysis
FPCA for sparse longitudinal data was used to extract phenotypes of variation from glucose level trajectories. Similar to traditional principal component analysis (PCA), FPCA utilizes linear combinations of a small number of features to maximize variance across data. FPCA achieves this by extracting the common temporal characteristics of a set of curves. This approach was previously performed to identify phenotypes of type 1 diabetes in pregnancy by finger stick data [Reference Szczesniak3]; similar steps were taken for the observed glucose levels in this study. Briefly to perform the sparse FPCA, we started by choosing a suitable basis function for representing the eigenfunctions using cubic B-splines. We used a routine from the R package “fpca” to implement the restricted maximum likelihood estimation through a Newton–Raphson procedure, and estimate the FPCs from the CGM data. This approach addressed both the selection of the number of basis functions, as well as the dimension of the process (i.e., number of nonzero eigenvalues) used in the model by minimizing an approximation of the leave-one-curve-out cross-validation score. We implemented the algorithm for each subject’s monitored glucose collection of longitudinal data to obtain smooth individual functions across study duration. Cubic B-splines were specified with equally spaced knots. The candidate models had different settings of the number of basis functions for the eigenfunctions (M) and the number of nonzero eigenvalues used in the model (r). We examined combinations of M = (4, 5, 6) and r = (2, 3, 4), and report results from the model with the best (i.e., the smallest) cross-validation score. This procedure for selecting basis functions and eigenfunctions has been used in prior studies [Reference Szczesniak3, Reference Szczesniak29]. After fitting the FPCA model (by trying all the combinations of M and R), the final selected model with convergence was M = 6 and r = 4, which showed four FPCs. Then the scores from the FPC were used to classify CGM sensor tracings as used for previous work in glucose monitoring [Reference Szczesniak3, Reference Frøslie4] and other studies [Reference Szczesniak29] using the first and third quartiles (Q1 and Q3, respectively).
Two-stage FPCA
In this part, we consider the CGM data observed during the first week of the study. The two-stage FPCA method is capable of incorporating the nested design of the data for a whole week. The traditional FPCA cannot accommodate nested data; thus, we used this method to assess variation for a single day from each subject’s profile. The goal of the two-stage FPCA was to characterize the daily specific glucose curves and their secular evolution over a week using the double FPCA approach.
The double FPCA [Reference Yao, Muller and Wang30] procedure is an extension of the conventional FPCA method. It provides a decomposition of the total variation into the variation within the repeatedly observed functions and the variation between these random functions as the second component. The method has several appealing advantages over the traditional FPCA methods. First, it relies on mild assumptions [Reference Yao, Muller and Wang30]. Previous models for repeated FD rely on a general hierarchical structure. The multiple functions observed for each subject are modeled as a multilevel ANOVA design, relying on the additive assumption [Reference Greven31]. In contrast, the two-stage FPCA approach assumes the functions are smoothly changing over the times at which the repeated functions are recorded, which leads to a nonparametric model under minimal assumptions. Second, the results provide the variation within the repeatedly observed functions as one component and the variation between these random functions the second component. It’s easy to interpret the patient’s glucose trend within a 24-h routine, as well as the evolution of the glucose levels over a long period. Third, the method adopts a local-linear smoother approach [Reference Yao, Muller and Wang30] to estimate mean and covariance kernel, thus it is applicable to both dense and sparse observations. Even though our data contain individuals with very sparse observations, such as Fig. 1 (the second plot in the first row), we can still effectively estimate the mean surface by borrowing strength from the entire sample.
Real-time Prediction of Glycemic Excursions
The goal of this section is to develop a reliable prediction model that can accurately predict glycemic levels based on past data collected from the CGM sensor and real-time risk of hypo-hyperglycemic for individuals with type 1 diabetes. Below, we provide details of the Gaussian LME with nonstationary covariance that we used for modeling the trajectories of glucose levels and predicting real-time risk of hypo- and hyperglycemic excursions. This model allows for estimating risk based on target functions, which we specified as glycemic excursions. Previously, this model has been used in studies of renal failure and cystic fibrosis [Reference Szczesniak27]. This model captures between and within patient heterogeneities by a random intercept and stochastic process. The main effects are linearly included in model (1); however, the model is able to predict nonlinear curves. It was shown to outperform traditional random slope-intercept models when data has long sequences of repeated measurements [Reference Asar, Bolin, Diggle and Wallin32, Reference Taylor-Robinson33] which is the case for our CGM data. Some model details are provided below.
We let ${Y_{ij\;}}$ denote the glucose measurements for the ${i^{th}}$ patient taken at time point $\;{t_{ij}}$ , where ${t_{ij}}\;$ is time of measurement since midnight based on clock time (in hours) and $\;i = 1, \ldots ,N;j = 1, \ldots ,{n_i}$ .
The nonstationary LME has the following form:
where ${\bi{{X}}}_i}\left( {{t_{ij}}} \right)$ is the design matrix that includes covariates and α is the corresponding parameter vector. Between-patient heterogeneity is estimated with a random intercept term ${U_i}$ , where ${U_i}_~{{N}}}\left( {0,\;{\omega ^2}} \right)$ . The terms ${Z_{ij}}$ are independent, identically distributed as Gaussian with $N\left( {0,\;{\tau ^2}} \right)$ and represent measurement error. The term ${W_i}\left( {{t_{ij}}} \right)$ represents realizations from the zero-mean (which represents change in a patient’s glucose level over time that cannot be explained by the linear regressions), continuous-time integrated Brownian motion process such that ${W_i}\left( t \right) =\int \nolimits_0^t {B_i}\left( v \right)dv$ , where ${B_i}\left( v \right)$ is the rate of change in glucose level at time $v$ depicted as Brownian motion and ${B_i}\left( 0 \right) = 0$ . The integrated Brownian motion process is nonstationary and follows a Gaussian distribution with covariance function for time points (hours) $s$ and $t$ :
This enables greater flexibility compared to traditional models, in terms of the shape of realizations that have been used to characterize variation in glucose levels.
We construct the target function to predict real-time risk of hypo- and hyperglycemic excursions by using model (1). To predict real-time risk of hypo- and hyperglycemic excursions for ${i^{th}}$ patient at time $\;{t_{ik}}$ , we utilize the predicted glucose level ${Y_i}\left( {{t_{ik}}} \right)$ of that patient at that time point;
We let the covariate history up to a given time $t$ of each patient be represented as $\;{{\bi {\cal H}}}}_i}\left( t \right) = \left\{ {{\bi{X}}_i},\left( {{t_{ij}},{y_{ij}}} \right):{t_{ij}} \le t} \right\}$ . Based on this history, we can build a predictive probability distribution for ${Y_i}\left( {{t_{ik}}} \right){\rm{\;}}\;$ being below or above given thresholds at time $\;{t_{ik}}$ ;
where ${\delta _{\bf{1}}{\rm{\;}}$ is the threshold (in mg/dL) for identifying hypoglycemia and $p_i^{{\rm{hypo}}}\left( {{t_{ik}}} \right)$ is the predicted risk of hypoglycemia.
Similarly, the risk of hyperglycemic excursions at time ${t_{ik}}$ can be obtained with the following predictive probability distribution
where ${\delta _2}{\rm{\;}}$ is the threshold (in mg/dL) for identifying hyperglycemia and $p_i^{{\rm{hyper}}}\left( {{t_{ik}}} \right)$ is the predicted risk of hyperglycemia.
We implemented the model (1) and obtained parameter estimates of the model for our CGM data by using the ngme package in R [Reference Asar, Bolin, Diggle and Wallin32]. Prediction performance of model (1) was assessed with predictive accuracy metrics: root-mean-square-error (RMSE) and mean absolute error (MAE). Predicted probabilities (3–4) are also computed by using R and the computer code is available in the Supplemental file.
Results
FPCA Results
Based on FPCA implementation, we found that the first FPC explained 48% of the total proportion of the variation in the CGM readings. The first FPC classifies how each individual patient’s glucose level trajectory differs from the mean trajectory. Focusing on the first two FPCs (FPC1–FPC2), which accounted for 73% of the total variation, we created clusters of the trajectories (of patients) according to the first and third quartiles (denoted as Q1 and Q3) of the FPC1 and FPC2 scores. This grouping of individuals across FPC1 and FPC2 quartiles results in nine clusters corresponding to CGM tracings. These classifications are presented in Fig. 2, which illustrates that there are clusters of patients who tend to exhibit hyperglycemic excursions more frequently than the overall cohort (top row). Extreme smoothed values were verified as being similar to the observed CGM values (e.g., top row, third plot from left: the most extreme profile had a peak value of 568 mg/dL (31.5 mmol/L)). Clusters of patients tended to have steadier, normal glucose levels (e.g., middle row, second plot from left); other clusters had lower glucose levels (e.g., bottom row).
Two-Stage FPCA Results
The observed glucose level for a given day was assumed to correspond to a random process, which quantifies glucose level as a function of time of the day. This is considered as the first step of the FPCA. Then we study the subject-specific changes of the functional relationships as weekly time progresses from the beginning of the week to the end of the week as the second step of the FPCA.
The fitted mean surface is visualized in Fig. 3A, which reflects the smoothed CGM tracings (glucose levels) for 10 representative patients from the data, noting higher glucose levels on weekends, compared to weekdays. A sharp increase is seen around Saturday. Fig. 3B provides the first harmonic which provides the “modes of variation” of the repeated functions sampled on a specific day, is seen to be quite heterogeneous across days of the week. The values below zero in Fig. 3B suggests that profiles tended to be lower, compared to the overall mean CGM profile, during nocturnal hours and on weekends. The second harmonic (Fig. 3C) illustrates higher order variation in the CGM profiles occurred at nighttime throughout the week, although variation was higher on weekends.
Real-Time Prediction Modeling Results
By performing our dynamic FD model, we model the glucose level trajectories in CGM data and provide real-time risk of glycemic excursions. In this part, for illustration, we only consider the data observed on the second day (randomly selected) of first week of the CGM data since the proposed prediction model cannot accommodate nested random effects. We discuss this as a potential extension in concluding remarks. We included time (hours) and group variable related to the clinical trial (1 if subject is from the control group; 0 otherwise) as covariates in model (1), since we do not have information on other features that can be included in the model as covariate such as meals, exercise, insulin regimen. Although the main goal is prediction, the coefficient estimates of intercept, hour, and group variables are 170, −0.632, and −6.663, respectively. Additionally, the maximum likelihood estimates of the covariance parameters indicated large between-patient heterogeneity ( ${\widehat \omega ^2} = 2984)$ and residual variance $\left( {{\widehat\tau }^2} = 11.14} \right)$ ; estimated variance for the integrated Brownian motion process was ${\widehat \sigma ^2} = 0.0052$ .
Graphs on the left panel of Fig. 4 provide observed and predicted glucose levels with 95% CI for a 62-year-old White female from the control group with height 160 cm and weight 68 kg, an 8-year-old White female from the control group with height 140 cm and weight 32.8 kg, and a 41-year-old White male from the RT-CGM group with height 168 cm and weight 79 kg, (respectively, from top to bottom). Our prediction model is capable of capturing the observed glucose curves (Fig. 4, left panel). The predicted glucose trajectories are quite similar to the observed glucose trajectories of these patients with small mean predictor of MAE and RMSE: 4.8 (SD = 0.45) and 7.2 (SD = 2); 4 (SD = 0.33) and 6.4 (SD = 2.9); 4.31 (SD = 0.4) and 5.9 (SD = 1.7), respectively (SD: standard deviation; units of both MAE and RMSE are mg/dL). We additionally report overall RMSE and MAE to compare the predictive performance of our model with the traditional random intercept-and-slopes model [Reference Laird and Ware17]. Overall RMSE and MAE for our model were 6.3 and 4.1. By contrast, overall RMSE and MAE are 46.4 and 34.8 for the traditional model, which exceeds estimated values from our prediction model.
The graphs on the right panel of Fig. 4 present the predicted risk of hypo- and hyperglycemic excursions for the same three subjects mentioned above. We used ${\delta _1} = 60{\rm{\;mg}}/{\rm{dL\;}}\left( {3.33{\rm{\;mmol}}/{\rm{L}}} \right){\rm{\;}}$ and ${\delta _2} = 200{\rm{\;mg}}/{\rm{dL}}$ $(11.1{\rm{\;mmol}}/{\rm{L}}$ ) for identifying hypoglycemia and hyperglycemia, respectively. The first patient was at high risk of hyperglycemia for the whole day except 6–8 am and 4–12 am. Her risk of hypoglycemia was relatively low except for around 9–12 am in the evening. The second patient was at high risk of hyperglycemia in the afternoon, followed by decreased risk between 6 and 10 pm. Her risk of hypoglycemia was quite high around 9 am in the morning. The third patient was at high risk of hyperglycemia in the afternoon from 1 to 5.30 pm and right before midnight from 11 to 12 am. His risk of hypoglycemia was quite low except between 9 and 10 pm in the evening.
Conclusions
Some studies have shown that having glycemic fluctuations is a deterministic factor for hypo- and hyperglycemic excursions. Recently, CGM systems have emerged as an effective technology with an ability to monitor glucose trends over time. Involving large amounts of irregularly observed data, CGM systems provide information, every 5 min, enabling the capture of frequency of fluctuations regarding blood glucose levels [Reference Chen34], so that efficiently utilizing CGM data would shed light on variation in blood glucose trends over time, which is frequently more difficult to measure. Hence, advanced statistical tools, as we have proposed here, are needed to efficiently study CGM data.
We first presented FD analysis tools for sparse longitudinal patterns of medical-monitoring data to classify glucose curves to identify phenotypes of type 1 diabetes. We considered the temporal information from the CGM FD, and have classified the fitted glucose curves into different clusters. The scores provided by this analysis can be used to examine a range of phenotypes as shown in Fig. 2. The limitation of this method was that it is not accounting for the nested design of the CGM data that was based on daily CGM data observed within a 24-h period. Hence, we then performed two-stage FPCA which incorporates the nested design of the CGM data and enables classification of the GM curves that were repeatedly measured over a week. Performing two-stage FPCA allowed us to compare the glucose curves observed on different days of a week; it was observed that the glucose levels are higher on weekends, compared to weekdays. Further, the glucose profiles tended to be lower, compared to the overall mean CGM profile, during nighttime and on weekends. Moreover, higher order fluctuation in the CGM profiles occurred at nighttime during the week, although variation was higher on weekends which agrees with previous findings [Reference Monzon35, Reference Raj36].
Although the two-stage FPCA has clear advantages over the traditional FPCA approach, we illustrate the utility of each method in CGM analysis for our clustering goal. Our rationale is that, although both approaches have made important but limited appearances in the diabetes literature, the traditional approach has been more frequently applied. However, CGM studies tend to consist of multiple days for a given subject; therefore, our study examined the findings and utility of both traditional FPCA and the two-stage approach. Further, regular FPCA can be used when one is interested in analyzing only one-day data to cluster daily trajectories (or one week if one is clustering the weekly trajectories). However, the two-stage approach assumes CGM recordings follow a nested design. Although our study goals did not involve comparing the FPCA-based clustering method with other approaches, it may be a worthwhile empirical study to examine our chosen approach alongside the aforementioned statistical and machine learning techniques for clustering.
We addressed our second goal by utilizing a statistical method for predicting and detecting real-time risk of hypo- and hyperglycemic excursions by using long, irregularly observed time series, tailoring the approach to type 1 diabetes. Based on the results in Sect. 4.3, we see that the prediction model that we implemented has very low prediction errors (e.g., low RMSE and MAE), especially compared to the errors obtained with the traditional random intercepts-and-slopes model. The predictive performance of our model is not compared with other previously employed techniques in literature; however, this also represents an important area for future empirical investigation. Additionally, by visually inspecting the excursion plots, we can conclude that the predicted real-time risks of hypo-hyperglycemic are highly accurate (the predicted probabilities are in gray band (the area where probabilities ≥ 0.80 or 80%) when the glucose level is over 200 mg/dL (11.1 mmol/L) and below 60 mg/dL (3.33 mmol/L) for hyperglycemia and hypoglycemia, respectively, in the graphs presented in the right panel of Fig. 4). This paper also shows how estimated change/variability in an individual’s glucose trajectory can be used to establish both clinically meaningful and patient-specific thresholds that, when coupled with probabilistic predictive inference provide a useful medical-monitoring tool. Therefore, increasing patients’ and clinicians’ abilities to take timely actions in a more accurate manner.
This is the first study to apply the two-stage FPCA method to CGM data to identify the daily specific glucose curves and their evolution over a week. Similarly, this is the first study to propose different target functions based on glucose thresholds, tailored to hypo and hyperglycemic. Our novel applications have particular relevance to pregnancy outcome. Specifically, our group has recently published using self-glucose-monitoring data to predict premature delivery in women with type 1 diabetes using a joint modeling approach. Using CGM data would most likely yield even better prediction to a very important obstetrical problem [Reference Gupta37].
There are several limitations to the methods provided in our work. Although the original experiment collected daily measurements of glucose over multiple weeks, we only used part of the data for our analysis due to lack of ability to account for the nested design of the whole data (as explained in the previous sections, we considered only 1 day’s data for the procedures provided in Sects. 3.1 and 3.3 and used just one week of data for the method provided in Sect. 3.2). Our analyses did not include covariate adjustment since these additional features of the study cohort were not accessible. As aforementioned, we acquired these data from a previously completed study in which there are limited details on the reasons for missing data. However, missingness in CGM data is common and can occur due to various reasons, such as intermittent sensor errors, sensor compression, and user errors [Reference Fonda, Lewis and Vigersky38]. The most common reason for missing data, however, is patient non-compliance in wearing the monitor. Most trials require a minimum number of days to wear the CGM sensor and a minimum number of hours of glucose values, including night-time values. Thus, missing data is inherent in any analysis of CGM data [39].
There are various new developments that could be undertaken based on our application and findings. An important advancement could be extending the prediction model to incorporate the typically nested design of the CGM data measured over weeks/months to improve the prediction accuracy of glucose levels and real-time risk of hypo-hyperglycemic excursions; which would help in better understanding and interpretation of variability in glucose tracings. Although other longitudinal models can accommodate nested designs, the stochastic processes being used in these models typically do not produce the same degree of predictive accuracy [Reference Diggle, Sousa and Asar26]. Further, improved accuracy of prediction of real-time risk would provide a timely warning of severe hyperglycemia or hypoglycemia. Additional future work could be adjusting the FPCA and the prediction models presented in Sect. 3 to incorporate covariates, when available.
Supplementary material
To view supplementary material for this article, please visit https://doi.org/10.1017/cts.2020.545.
Acknowldgements
This work was supported by the National Institutes of Health (NIH) under Grants R01 DK109956, K25 HL125954, and R01 HL141286; the Cystic Fibrosis Foundation (CFF) under Grant CLANCY15R0. The authors thank the Jaeb Center for Health Research (JCHR) for making available the data to conduct this study. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH, CFF, or JCHR.
Disclosures
The authors have no conflict of interest to declare.