1. Introduction
In recent years, event data have become ubiquitous in the social sciences. For instance, interpersonal structures are examined using face-to-face interactions (Elmer & Stadtfeld, 2020). At the same time, political event data are employed to study and predict the occurrence and intensity of armed conflict (Fjelde & Hultman, 2014; Blair & Sambanis, 2020; Dorff et al., 2020). Butts (2008a) introduced the Relational Event Model (REM) to study such relational event data. In comparison to standard network data of durable relations observed at specific time points, relational events describe instantaneous actions or, put differently, interactions at a fine-grained temporal resolution (Borgatti et al., 2009).
However, problems regarding the reliability of event data arise in some contexts. While data gathered from, for example, direct observations (Tranmer et al., 2015) or parliamentary records (Malang et al., 2019) should prove unproblematic in this regard, other data collection methods may be prone to spurious events, that is, events that are recorded but did not actually occur as such. For instance, data collection on face-to-face interactions relies on different types of sociometric badges (Eagle & Pentland, 2006) for which a recent study reports a false-discovery rate of the event identification of around 20 $\%$ when compared to video-coded data (Elmer et al., 2019). Political event data on armed conflict, in contrast, are generally collected via automated or human coding of news and social media reporting (Kauffmann, 2020). Spurious events may arise in this context if reports of fighting are wrong, as may be the case for propaganda reasons or due to reporters’ reliance on rumors, or when fighting took place between different belligerents than those named. Such issues are especially prevalent in machine-coded conflict data where both false-positive and false-discovery rates of over 60 $\%$ have been reported (King & Lowe, 2003; Jäger, 2018). However, even human-coded data suffer from this problem (Dawkins, 2020; Weidmann, 2015).
This discussion suggests that specific types of event data can include unknown quantities of spurious events, which may influence the substantive results obtained from models such as the REM (Butts, 2008a) or the Dynamic Actor-Oriented Model (Stadtfeld et al., 2017; Stadtfeld, 2012). We thus propose a Relational Event Model with Spurious Events (REMSE) as a method that allows researchers to study relational events from potentially error-prone contexts or data collection methods. Moreover, this tool can assess whether spurious events are observed under a particular model specification and, more importantly, whether they influence the substantive results. The REMSE can thus serve as a straightforward robustness check in situations where the researcher, due to their substantive knowledge, suspects that there are spurious observations and wants to investigate whether they distort their empirical results.
We take a counting process point of view where some increments of the dyadic counting processes are true events, while others may be attributed to spurious events, that is, exist due to measurement error. This decomposition results in two different intensities governing the two respective types of events. The spurious events are described by a spurious-event intensity that we specify independently of the true-event intensity. We present the model under the assumption that the spurious events are purely random. Therefore, we can model the respective intensity solely as a constant term. However, more complex scenarios involving the specification of exogenous and endogenous covariates for the spurious-event intensity are also possible. In general, however, we are primarily interested in studying what factors drive the intensity of true events. We model this intensity following Butts (2008a), but the methodology is extendable to other model types such as those of Stadtfeld et al. (2017), Vu et al. (2015), DuBois et al. (2013), Perry & Wolfe (2013), or Lerner et al. (2021).
This article is structured as follows: We begin in Section 2 by introducing our methodology. In particular, we lay out the general framework to study relational event data proposed by Butts (2008a) in Section 2.1 and introduce an extension to this framework, the REMSE, to correct for the presence of spurious events in the remainder of Section 2. Through a simulation study in Section 3, we investigate the performance of our proposed estimator when spurious events are correctly specified and when they are nonexistent. We then apply the proposed model in Section 4 to analyze fighting incidents in the Syrian civil war as well as social interaction data from a college campus. A discussion of possible implications and extensions for the analysis of events concludes the article in Section 5.
2. A Relational Event Model with Spurious Events
2.1 Modeling framework for relational events
We denote observed events in an event stream $\mathcal{E} = \left \{e_1, \ldots, e_M \right \}$ of $M$ elements. Each object $e \in \mathcal{E}$ consists of a tuple encoding the information of an event. In particular, we denote the two actors of an event by $a(e)$ and $b(e)$ and the time of the event with $t(e)$. For simplicity of notation, we omit the argument $e$ for $a()$ and $b()$ when no ambiguity exists and write $a_m$ for $a(e_m)$, $b_m$ for $b(e_m)$, and $t_m$ for $t(e_m) \;\forall \; m \in \{1, \ldots, M\}$. Stemming from our application cases, we mainly focus on undirected events in this article; hence, the events $e = (a,b,t)$ and $\tilde{e} = (b,a,t)$ are equivalent in our framework. Note, however, that the proposed method also generalizes to the directed case. We denote the set of actor-tuples between which events can possibly occur by $\mathcal{R}$, where, for simplicity, we assume that $\mathcal{R}$ is time-constant.
Following Perry & Wolfe (2013) and Vu et al. (2011a), we assume that the events in $\mathcal{E}$ are generated by an inhomogeneous matrix-valued counting process
$$\mathbf{N}(t) = \big (N_{ab}(t) \mid (a,b) \in \mathcal{R}\big ), \tag{1}$$
which, in our case, is assumed to be a matrix-valued Poisson process (see Daley & Vere-Jones, 2008, for an introduction to stochastic processes). Without loss of generality, we assume that $\textbf{N}(t)$ is observed during the temporal interval $\mathcal{T}$, starting at $t = 0$. The cells of (1) count how often all possible dyadic events have occurred between time $0$ and $t$; hence, $\textbf{N}(t)$ can be conceived as a standard social network adjacency matrix with integer-valued cell entries (Butts, 2008b). For instance, $N_{ab}(t)$ indicates how often actors $a$ and $b$ have interacted in the time interval $[0,t]$. Therefore, observing event $e = (a,b,t)$ constitutes an increase in $N_{ab}(t)$ at time point $t$, that is, $N_{ab}(t - h) + 1 = N_{ab}(t)$ for $h \rightarrow 0$. We denote with $\boldsymbol{\lambda }(t)$ the matrix-valued intensity of process $\mathbf{N}(t)$. Based on this intensity function, we can characterize the instantaneous probability of a unit increase in a specific dimension of $\textbf{N}(t)$ at time point $t$ (Daley & Vere-Jones, 2008). We parameterize $\boldsymbol{\lambda }(t)$ conditional on the history of the processes, $\mathcal{H}(t)$, which may also include additional exogenous covariates. Hence, $\mathcal{H}(t) = (\textbf{N}(u), X(u) \mid u < t)$, where $X(t)$ is some covariate process to be specified later. Note that we opt for a rather general characterization of Poisson processes, including stochastic intensities that explicitly depend on previous events. We define the intensity function at the tie-level:
$$\lambda _{ab}(t \mid \mathcal{H}(t), \vartheta ) = \lambda _0(t, \alpha ) \exp \{\theta ^\top s_{ab}(\mathcal{H}(t))\} \quad \text{for } (a,b) \in \mathcal{R}, \tag{2}$$
where $\vartheta = (\alpha ^\top, \theta ^\top )^\top = \text{vec}(\alpha,\theta )$ is defined with the help of a dyadic operator $\text{vec}(\cdot, \cdot )$ that stacks two vectors and $\lambda _0(t,\alpha )$ is the baseline intensity characterized by coefficients $\alpha$, while the parameters $\theta$ weight the statistics computed by $s_{ab}(\mathcal{H}(t))$, which is the function of sufficient statistics. Based on $s_{ab}(\mathcal{H}(t))$, we can formulate endogenous effects, which are calculated from $(N(u) \mid u < t)$, exogenous variables calculated from $(X(u) \mid u < t)$, or a combination of the two, which results in complex dependencies between the observed events. Examples of endogenous effects for undirected events include degree-related statistics like the absolute difference of the degrees of actors $a$ and $b$ or hyperdyadic effects, for example, investigating how triadic closure influences the observed events. In our first application case, exogenous factors include a dummy variable indicating whether groups $a$ and $b$ share an ethno-religious identity. Alternatively, one may incorporate continuous covariates, for example, the absolute geographic distance between groups $a$ and $b$.
We give graphical representations of possible endogenous effects in Figure 1 and provide their mathematical formulations together with a general summary in Appendix A. When comparing the structures in the first row with the ones in the second row in Figure 1, the respective sufficient statistic of the event indicated by the dotted line differs by one unit. If all other covariates are fixed, its intensity thus changes by the multiplicative factor $\exp \{\theta _{endo}\}$, where $\theta _{endo}$ is the respective parameter of the statistic. The interpretation of the coefficients is, therefore, closely related to the interpretation of relative risk models (Kalbfleisch & Prentice, 2002).
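The following sketch illustrates how endogenous statistics of an undirected dyad could be computed from the event history and how a one-unit change in a statistic scales the intensity by $\exp \{\theta _{endo}\}$. The function names and the concrete definitions of the statistics (degree as the number of past events of an actor, triangle as the number of common past partners) are assumptions made for illustration; the exact definitions are those of Appendix A, which may differ.

```python
import math
from collections import defaultdict


def endogenous_stats(history, a, b):
    """Illustrative statistics for the undirected dyad {a, b}.

    `history` is a list of past events (actor, actor, time).
    The definitions below are assumptions, not the Appendix A formulas.
    """
    counts = defaultdict(int)     # past events per dyad
    neighbors = defaultdict(set)  # past interaction partners per actor
    degree = defaultdict(int)     # past events per actor
    for u, v, _t in history:
        counts[frozenset((u, v))] += 1
        neighbors[u].add(v)
        neighbors[v].add(u)
        degree[u] += 1
        degree[v] += 1
    return {
        "repetition_count": counts[frozenset((a, b))],
        "degree_abs_diff": abs(degree[a] - degree[b]),
        "triangle": len(neighbors[a] & neighbors[b]),
    }


def intensity(baseline, theta, stats):
    """Log-linear intensity: baseline * exp(theta' s)."""
    return baseline * math.exp(sum(theta[k] * stats[k] for k in theta))


history = [(1, 2, 0.5), (2, 3, 1.0), (1, 2, 1.5)]
stats_13 = endogenous_stats(history, 1, 3)
theta = {"repetition_count": 0.2, "degree_abs_diff": -0.1, "triangle": 0.5}
lam_13 = intensity(1.0, theta, stats_13)
```

Raising the triangle statistic of the dyad by one while holding the other statistics fixed multiplies `lam_13` by `math.exp(theta["triangle"])`, which is the relative-risk reading described above.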
Previous studies propose multiple options to model the baseline intensity $\lambda _0(t)$. Vu et al. (2011a, 2011b) follow a semiparametric approach akin to the proportional hazard model by Cox (1972), while Butts (2008a) assumes a constant baseline intensity. We follow Etezadi-Amoli & Ciampi (1987) by setting $\lambda _0(t, \alpha ) = \exp \{f(t, \alpha )\}$, with $f(t, \alpha )$ being a smooth function in time parameterized by B-splines (de Boor, 2001):
$$f(t, \alpha ) = \sum _{k} \alpha _k B_k(t) = \alpha ^\top B(t), \tag{3}$$
where $B_{k}(t)$ denotes the $k$th B-spline basis function weighted by coefficient $\alpha _k$. To ensure a smooth fit of $f(t, \alpha )$, we impose a penalty (or regularization) on $\alpha$ which is formulated through the a priori structure
where $\gamma$ is a hyperparameter controlling the level of smoothing and $\mathbf{S}$ is a penalty matrix that penalizes the differences of coefficients corresponding to adjacent basis functions as proposed by Eilers & Marx (1996). We ensure identifiability of the smooth baseline intensity by incorporating a sum-to-zero constraint and refer to Ruppert et al. (2003) and Wood (2017) for further details on penalized spline smoothing. Given this notation, we can simplify (2):
$$\lambda _{ab}(t \mid \mathcal{H}(t), \vartheta ) = \exp \{\vartheta ^\top \mathcal{X}_{ab}(\mathcal{H}(t),t)\}, \tag{5}$$
with $\mathcal{X}_{ab}(\mathcal{H}(t),t) = \text{vec}(B(t), s_{ab}(\mathcal{H}(t)))$.
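As a minimal sketch of the penalized-spline baseline, the code below evaluates a B-spline basis with the Cox-de Boor recursion and builds a difference penalty matrix in the spirit of Eilers & Marx (1996). The equidistant knot layout, the cubic degree, the second-order differences, and all function names are illustrative assumptions; the sum-to-zero constraint and the estimation of $\gamma$ are not shown.

```python
import numpy as np


def bspline_basis(t, knots, degree):
    """Evaluate all B-spline basis functions at points t (Cox-de Boor)."""
    t = np.asarray(t, dtype=float)
    # Degree 0: indicators of the half-open knot intervals.
    B = np.array(
        [(t >= knots[i]) & (t < knots[i + 1]) for i in range(len(knots) - 1)],
        dtype=float,
    ).T
    for d in range(1, degree + 1):
        n_basis = len(knots) - d - 1
        B_new = np.zeros((len(t), n_basis))
        for i in range(n_basis):
            left = knots[i + d] - knots[i]
            right = knots[i + d + 1] - knots[i + 1]
            if left > 0:
                B_new[:, i] += (t - knots[i]) / left * B[:, i]
            if right > 0:
                B_new[:, i] += (knots[i + d + 1] - t) / right * B[:, i + 1]
        B = B_new
    return B


def difference_penalty(n_coef, order=2):
    """Penalty matrix S = D'D on differences of adjacent coefficients."""
    D = np.diff(np.eye(n_coef), n=order, axis=0)
    return D.T @ D


# Equidistant knots on [0, 10], extended by three knots on each side.
knots = np.arange(-3, 9) * 2.0
times = np.linspace(0.0, 9.9, 50)
B = bspline_basis(times, knots, degree=3)   # shape (50, 8)
S = difference_penalty(B.shape[1])

alpha = np.linspace(-1.0, 1.0, B.shape[1])  # arbitrary example coefficients
baseline = np.exp(B @ alpha)                # lambda_0(t, alpha) = exp{f(t, alpha)}
penalty = alpha @ S @ alpha                 # zero for a linear coefficient sequence
```

Because the example coefficients are linear in their index, their second-order differences vanish and the penalty is numerically zero; wigglier coefficient sequences are penalized more strongly.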
2.2 Accounting for spurious relational events
Given the discussion in the introduction, we may conclude that some increments of $\mathbf{N}(t)$ are true events, while others stem from spurious events. Spurious events can occur because of coding errors during machine- or human-based data collection. To account for such erroneous data points, we introduce the REMSE.
First, we decompose the observed Poisson process into two separate matrix-valued Poisson processes, that is, $\mathbf{N}(t) = \mathbf{N}_{0}(t) + \mathbf{N}_1(t) \; \forall \; t \in \mathcal{T}$. On the dyadic level, $N_{ab,1}(t)$ denotes the number of true events between actors $a$ and $b$ until $t$, and $N_{ab,0}(t)$ the number of events that are spurious. Assuming that $N_{ab}(t)$ is a Poisson process, we can apply the so-called thinning property, stating that two separate processes that sum up to a Poisson process are also Poisson processes (Daley & Vere-Jones, 2008). A graphical illustration of the three introduced counting processes, $N_{ab,0}(t), \;N_{ab,1}(t),$ and $N_{ab}(t)$, is given in Figure 2. In this illustrative example, we observe four events at times $t_1, \;t_2, \;t_3,$ and $t_4$, although only the first and third constitute true events, while the second and fourth are spurious. Therefore, the counting process $N_{ab}(t)$ jumps at all times of an event, yet $N_{ab,1}(t)$ does so only at $t_1$ and $t_3$. Conversely, $N_{ab,0}(t)$ increases at $t_2$ and $t_4$.
The counting processes $\mathbf{N}_{0}(t)$ and $\mathbf{N}_1(t)$ are characterized by the dyadic intensities $\lambda _{ab,0}(t|\mathcal{H}_{0}(t), \vartheta _0)$ and $\lambda _{ab,1}(t|\mathcal{H}_{1}(t), \vartheta _1)$, where we respectively denote the history of all spurious and true processes by $\mathcal{H}_{0}(t)$ and $\mathcal{H}_{1}(t)$. This can also be perceived as a competing risks setting, where events can either be caused by the true-event or spurious-event intensity (Gelfand et al., 2000). To make the estimation of $\theta _0$ and $\theta _1$ feasible and identifiable (Heckman & Honoré, 1989), we assume that both intensities are independent of one another, which means that their correlation is fully accounted for by the covariates. Building on the superpositioning property of Poisson processes, the specification of those two intensity functions also defines the intensity of the observed counting process $N_{ab}(t)$. In particular, $\lambda _{ab}(t|\mathcal{H}(t), \vartheta )= \lambda _{ab,0}(t|\mathcal{H}_0(t), \vartheta _0) + \lambda _{ab,1}(t|\mathcal{H}_1(t), \vartheta _1)$ holds (Daley & Vere-Jones, 2008).
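The superposition property can be checked numerically in the simplest possible case. The sketch below assumes two independent homogeneous Poisson processes on a single dyad with made-up constant intensities; it is a toy illustration rather than the model of this article, in which the true-event intensity depends on the history.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical constant intensities on one dyad and an observation window.
lam_spurious, lam_true, horizon = 1.0, 4.0, 1.0
n_rep = 20_000

# Independent counts of spurious and true events in [0, horizon].
n0 = rng.poisson(lam_spurious * horizon, size=n_rep)
n1 = rng.poisson(lam_true * horizon, size=n_rep)

# The observed process counts both kinds of events.
n_obs = n0 + n1

# Poisson with rate lam_spurious + lam_true: mean and variance agree.
obs_mean, obs_var = n_obs.mean(), n_obs.var()

# Competing-risks reading: share of observed events that are true.
share_true = n1.sum() / n_obs.sum()
```

With these values the empirical mean and variance of `n_obs` are both close to 5, and `share_true` is close to $4 / (1 + 4) = 0.8$, the ratio of the true-event intensity to the summed intensity.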
The true-event intensity $\lambda _{ab,1}(t|\mathcal{H}_{1}(t), \vartheta _1)$ drives the counting process of true events $\mathbf{N}_{1}(t)$ and only depends on the history of true events. This assumption is reasonable since, if erroneous events were mixed together with true events, the covariates computed for actors $a$ and $b$ at time $t$ through $s_{ab}(\mathcal{H}(t))$ would be confounded and could no longer be interpreted in any consistent manner. We specify $\lambda _{ab,1}(t|\mathcal{H}_1(t), \vartheta _1)$ in line with (2) at the dyadic level by:
At the same time, the spurious-event intensity $\lambda _{ab,0}(t|\mathcal{H}_{0}(t), \vartheta _0)$ determines the type of measurement error generating spurious events. One may consider the spurious-event process as an overall noise level with a constant intensity. This leads to the following setting:
$$\lambda _{ab,0}(t \mid \mathcal{H}_{0}(t), \vartheta _0) = \exp \{\alpha _0\}. \tag{7}$$
The error structure, that is, the intensity of the spurious-event process, can be made more complex, but to ensure identifiability, $\lambda _{ab,0}(t|\mathcal{H}_{0}(t), \vartheta _0)$ cannot depend on the same covariates as $\lambda _{ab,1}(t| \mathcal{H}_1(t), \vartheta _1)$. We return to the discussion of this point below and focus on model (7) for the moment.
2.3 Posterior inference via data augmentation
To draw inference on $\vartheta = \text{vec}(\vartheta _0, \vartheta _1)$, we employ an empirical Bayes approach. Specifically, we will sample from the posterior of $\vartheta$ given the observed data. Our approach is thereby comparable to the estimation of standard mixture (Diebolt & Robert, 1994) and latent competing risk models (Gelfand et al., 2000).
For our proposed method, the observed data are the event stream of all events $\mathcal{E}$ regardless of being a real or a spurious event. To adequately estimate the model formulated in Section 2, we lack information on whether a given event is spurious or not. We denote this formally as a latent indicator variable $z(e)$ for event $e \in \mathcal{E}$ :
$$z(e) = \begin{cases} 1, & \text{if event } e \text{ is a true event}, \\ 0, & \text{if event } e \text{ is a spurious event}. \end{cases}$$
We write $z = (z(e_1), \ldots, z(e_M))$ to refer to the latent indicators of all events and use $z_m$ to shorten $z(e_m)$. Given this notation, we can apply the data augmentation algorithm developed in Tanner & Wong (1987) to sample from the joint posterior distribution of $(Z,\vartheta )$ by iterating between the I Step (Imputation) and P Step (Posterior) defined as:
This iterative scheme generates a sequence that (under mild conditions) converges to draws from the joint posterior of $(\vartheta,Z)$ and is a particular case of a Gibbs sampler. Each iteration consists of an Imputation and a Posterior step, resembling the Expectation and Maximization steps of the EM algorithm (Dempster et al., 1977). Note, however, that Tanner & Wong (1987) proposed this method with multiple imputations in each I Step and a mixture of all imputed complete-data posteriors in the P Step. We follow Little & Rubin (2002) and Diebolt & Robert (1994) by performing one draw of $Z$ and $\vartheta$ in every iteration, which is a specific case of data augmentation. As Noghrehchi et al. (2021) argue, this approach is closely related to the stochastic EM algorithm (Celeux et al., 1996). The main difference between the two approaches is that in our P Step, the current parameters are sampled from the complete-data posterior in the data augmentation algorithm and not fixed at its mean as in Celeux et al. (1996). Consequently, the data augmentation algorithm is a proper multiple imputation procedure (MI, Rubin, 1987), while the stochastic EM algorithm is improper MI (see Noghrehchi et al., 2021). We choose the data augmentation algorithm over the stochastic EM algorithm because Rubin’s combination rule to get approximate standard errors can only be applied to proper MI procedures (Noghrehchi et al., 2021).
In what follows, we give details and derivations on the I and P Steps and then exploit MI to combine a relatively small number of draws from the posterior to obtain point and interval estimates for $\vartheta$ .
Imputation step: To acquire samples from $Z= (Z_1, \ldots, Z_M)$ conditional on $\mathcal{E}$ and $\vartheta$, we first decompose the joint density by repeatedly applying Bayes' theorem:
The distribution of $z_m$ conditional on $ z_1, \ldots, z_{m-1},\vartheta$ and $\mathcal{E}$ is:
Note that the information of $z_1, \ldots, z_{m-1}$ and $\mathcal{E}$ allows us to calculate $\mathcal{H}_1(t_m)$ as well as $\mathcal{H}_0(t_m)$. By iteratively applying (9) and plugging in $\vartheta ^{(d)}$ for $\vartheta$, we can draw samples in the I Step of $Z= (Z_1, \ldots, Z_M)$ through a sequential design that sweeps once from $Z_1$ to $Z_M$. The mathematical derivation of (9) is provided in Appendix B.
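The sequential sweep of the I Step can be sketched as follows. The sketch assumes that the conditional probability of an event being true takes the usual competing-risks form, namely the true-event intensity divided by the sum of both intensities at the event time, and it uses a deliberately small model in which the true-event intensity depends only on a constant and on the number of previous true events on the same dyad. The function name, the parameter names, and this reduced specification are hypothetical.

```python
import math
import random
from collections import defaultdict


def imputation_step(events, alpha0, alpha1, theta_rep, rng):
    """One sweep from z_1 to z_M for events given as (a, b, t), sorted by t.

    Assumed toy model:
      spurious intensity: exp(alpha0)
      true intensity:     exp(alpha1 + theta_rep * #previous TRUE events on dyad)
    """
    true_counts = defaultdict(int)  # history of true events only
    z = []
    for a, b, _t in events:
        dyad = frozenset((a, b))
        lam_true = math.exp(alpha1 + theta_rep * true_counts[dyad])
        lam_spurious = math.exp(alpha0)
        p_true = lam_true / (lam_true + lam_spurious)
        z_m = 1 if rng.random() < p_true else 0
        z.append(z_m)
        if z_m == 1:
            # Only events imputed as true enter the true-event history.
            true_counts[dyad] += 1
    return z


events = [(1, 2, 0.4), (2, 3, 0.9), (1, 2, 1.3), (1, 3, 2.0)]
z_draw = imputation_step(events, alpha0=-1.0, alpha1=0.0, theta_rep=0.5,
                         rng=random.Random(42))
```

The draw of each indicator depends on the indicators drawn before it, because only events imputed as true update the history used for later events; this is why a single ordered sweep is needed instead of independent draws.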
Posterior step: As already stated, we assume that the true-event and spurious-event intensities are independent. Hence, the sampling from the complete-data posteriors of $\vartheta _0$ and $\vartheta _1$ can be carried out independently. In the ensuing section, we therefore only show how to sample from $\vartheta _1| z, \mathcal{E}$, but sampling from $\vartheta _0| z, \mathcal{E}$ is possible in the same manner. To derive this posterior, we begin by showing that the likelihood of $\mathcal{E}$ and $z$ with parameter $\vartheta _1$ is the likelihood of the counting process $\mathbf{N}_1(t)$, which resembles a Poisson regression. Subsequently, we state all priors to derive the desired complete-data posterior.
Given a general $z$ sampled in the previous I Step and $\mathcal{E}$ , we reconstruct a unique complete path of $\mathbf{N}_1(t)$ by setting
where $\mathbb{I}(\cdot )$ is an indicator function. The corresponding likelihood of $\mathbf{N}_1(t)$ results from the property that any element-wise increments of the counting process between any times $s$ and $t$ with $t > s$ and arbitrary actors $a$ and $b$ with $(a,b) \in \mathcal{R}$ are Poisson distributed:
The integral in (11) is approximated through simple rectangular approximation between the observed event times to keep the numerical effort feasible, so that the distributional assumption simplifies to:
We specify the priors for $\alpha _1$ and $\theta _1$ separately and independently of one another. The prior for $\alpha _1$ was already stated in (4). Through a restricted maximum likelihood approach, we estimate the corresponding hyperparameter $\gamma _1$ such that it maximizes the marginal likelihood of $z$ and $\mathcal{E}$ given $\gamma _1$ (for additional information on this estimation procedure and general empirical Bayes theory for penalized splines, see Wood, 2011, 2020). Regarding the linear coefficients $\theta _1$, we assume flat priors, that is, $p(\theta _1) \propto k$, indicating no prior knowledge.
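The rectangular approximation of the integrated intensity described above turns the complete-data likelihood into that of a Poisson regression in which the length of each inter-event interval acts as an exposure. The sketch below evaluates such a log-likelihood for given increments, interval lengths, and intensities; the function name is hypothetical, and the intensity is simply assumed to be constant over each interval and strictly positive.

```python
import math


def poisson_loglik(increments, durations, intensities):
    """Sum of Poisson log-densities with means duration * intensity.

    increments  : observed increments of N_{ab,1} per interval and dyad
    durations   : lengths of the intervals between consecutive event times
    intensities : intensity assumed constant over each interval (must be > 0)
    """
    loglik = 0.0
    for y, delta, lam in zip(increments, durations, intensities):
        mu = delta * lam  # rectangle: interval length times intensity
        loglik += y * math.log(mu) - mu - math.lgamma(y + 1)
    return loglik


# Two dyad-interval records: no event in the first, one event in the second.
ll = poisson_loglik(increments=[0, 1],
                    durations=[1.0, 2.0],
                    intensities=[0.3, 0.5])
```

In generalized additive model software, this structure corresponds to a Poisson response with a log link and the logarithm of the interval length as an offset.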
In the last step, we apply Wood’s (2006) result that for large samples, the posterior distribution of $\vartheta _1$ under likelihoods resulting from distributions belonging to the exponential family, such as the Poisson distribution in (12), can be approximated through:
Here, $\hat{\vartheta }_1$ denotes the penalized maximum likelihood estimator resulting from (12) with the extended penalty matrix $\tilde{ \mathbf{S}}_1$ defined by
with $\mathbf{O}_{p \times q} \in \mathbb{R}^{p\times q}$ for $p,q \in \mathbb{N}$ being a matrix filled with zeroes and $\mathbf{S}_1$ defined in accordance with (4). For $\vartheta _1 = \text{vec}(\alpha _1, \theta _1),$ let $p$ be the length of $\alpha _1$ and $q$ of $\theta _1$ . The penalized likelihood is then given by:
which is equivalent to a generalized additive model; hence, we refer to Wood (2017) for a thorough treatment of the computational methods needed to find $\hat{\vartheta }_1$. The variance matrix in (13) has the following structure:
Values for $\gamma _1$ and $\hat{\vartheta }_1$ can be extracted from the estimation procedure to maximize (14) with respect to $\vartheta _1$ , while $\mathcal{X}_1 \in \mathbb{R}^{(M |\mathcal{R}|)\times (p + q)}$ is a matrix whose rows are given by $\mathcal{X}_{ab,1}(\mathcal{H}_1(t_m),t_{m-1})$ as defined in (6) for $m \in \{ 1, \ldots, M\}$ and $(a,b) \in \mathcal{R}$ . Similarly, $\mathbf{W}_1 = \text{diag}\big (\lambda _{ab,1}(t| \mathcal{H}_1(t), \vartheta _1);$ $t \in \{ t_1, \ldots, t_M\}, (a,b) \in \mathcal{R}\big )$ is a diagonal matrix.
For the P Step, we now plug in $z^{(d)}$ for $z$ in (13) to obtain $\hat{\vartheta }_1$ and $\mathbf{V}_1$ by carrying out the corresponding complete-case analysis. In the case where no spurious events exist, the complete estimation can be carried out in a single P Step. In Algorithm 1, we summarize how to generate a sequence of random variables according to the data augmentation algorithm.
Multiple imputation: One could use the data augmentation algorithm to get a large number of samples from the joint posterior of $(\vartheta, Z)$ to calculate empirical percentiles for obtaining any type of interval estimate. However, in our case this endeavor would be very time-consuming and even infeasible. To circumvent this, Rubin (1976) proposed multiple imputation as a method to approximate the posterior mean and variance. Conveniently, the method is especially successful when the complete-data posterior is multivariate normal as is the case in (13); thus, only a small number of draws is needed to obtain good approximations (Little & Rubin, 2002). To be specific, we apply the laws of iterated expectation and total variance:
$$\mathbb{E}(\vartheta \mid \mathcal{E}) = \mathbb{E}\big (\mathbb{E}(\vartheta \mid \mathcal{E}, z) \mid \mathcal{E}\big ) \tag{15}$$
$$\text{Var}(\vartheta \mid \mathcal{E}) = \mathbb{E}\big (\text{Var}(\vartheta \mid \mathcal{E}, z) \mid \mathcal{E}\big ) + \text{Var}\big (\mathbb{E}(\vartheta \mid \mathcal{E}, z) \mid \mathcal{E}\big ) \tag{16}$$
Next, we approximate (15) and (16) using a Monte Carlo quadrature with $K$ samples from the posterior obtained via the data augmentation scheme summarized in Algorithm 1 after a burn-in period of $D$ iterations:
where $\hat{\vartheta }^{(k)} = \text{vec}\left (\hat{\vartheta }^{(k)}_0, \hat{\vartheta }^{(k)}_1\right )$ encompasses the complete-data posterior means from the $k$th sample and $\mathbf{V}^{(k)} = \text{diag} \big ( \mathbf{V}^{(k)}_0, \mathbf{V}^{(k)}_1\big )$ is composed of the corresponding variances defined in (13). We can thus construct point and interval estimates from relatively few draws of the posterior based on a multivariate normal reference distribution (Little & Rubin, 2002).
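Combining the $K$ complete-data results can be sketched with Rubin's combination rules in their textbook form: the pooled mean is the average of the complete-data means, and the pooled variance adds the average within-imputation variance to the between-imputation variance inflated by $(1 + 1/K)$. Whether this matches (17) and (18) term by term, including the finite-$K$ correction, is an assumption; the function name is hypothetical.

```python
import numpy as np


def pool_rubin(estimates, covariances):
    """Pool K complete-data estimates with Rubin's rules (textbook form).

    estimates   : array-like of shape (K, p), complete-data posterior means
    covariances : array-like of shape (K, p, p), complete-data variances
    """
    est = np.asarray(estimates, dtype=float)
    cov = np.asarray(covariances, dtype=float)
    K = est.shape[0]
    pooled_mean = est.mean(axis=0)
    within = cov.mean(axis=0)                   # average within-imputation variance
    centered = est - pooled_mean
    between = centered.T @ centered / (K - 1)   # between-imputation variance
    total = within + (1.0 + 1.0 / K) * between
    return pooled_mean, total


# Two imputations of a two-dimensional coefficient vector (made-up numbers).
estimates = [[1.0, 2.0], [3.0, 2.0]]
covariances = [0.5 * np.eye(2), 0.5 * np.eye(2)]
pooled_mean, pooled_var = pool_rubin(estimates, covariances)

# Approximate 95% intervals from the normal reference distribution.
half_width = 1.959964 * np.sqrt(np.diag(pooled_var))
interval = np.column_stack([pooled_mean - half_width, pooled_mean + half_width])
```

In this example the first coefficient varies across imputations and therefore receives a wider interval than the second, whose uncertainty comes from the within-imputation variance alone.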
3. Simulation study
We conduct a simulation study to explore the performance of the REMSE compared to a REM, which assumes no spurious events, in two different scenarios, including a regime where measurement error is correctly specified in the REMSE and one where spurious events are instead nonexistent.
Simulation design: In $S = 1{,}000$ runs, we simulate event data between $n = 40$ actors under known true and spurious intensity functions in each example. For exogenous covariates, we generate categorical and continuous actor-specific covariates, transformed to the dyad level by checking for equivalence in the categorical case and computing the absolute difference for the continuous information. Generally, we simulate both counting processes $\mathbf{N}_1(t)$ and $\mathbf{N}_0(t)$ separately and stop once $|\mathcal{E}_1| = 500$.
The data-generating process for true events is identical in each case and given by:
where we draw the continuous exogenous covariate (cont.) from a standard Gaussian distribution and the categorical exogenous covariate (cat.) from a categorical random variable with seven possible outcomes, all with the same probability. Mathematical definitions of the endogenous and exogenous statistics are given in Appendix A. In contrast, the spurious-event intensity differs across regimes to result in correctly specified (DG 1) and nonexistent (DG 2) measurement errors:
Given these intensities, we follow DuBois et al. (2013) to sample the events.
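One generic way to sample events from given intensities is a Gillespie-type scheme: draw an exponential waiting time from the summed intensity, then pick the event proportionally to the individual intensities. The sketch below is not claimed to be the procedure of DuBois et al. (2013); it also simulates true and spurious events jointly instead of separately, which is a simplification relying on the independence of the two processes. The reduced true-event model (a constant plus a repetition count over true events), the function name, and all parameter values are hypothetical.

```python
import itertools
import math
import random


def simulate_events(n_actors, n_true, alpha0, alpha1, theta_rep, rng):
    """Simulate (a, b, t, z) until n_true true events (z = 1) have occurred."""
    dyads = list(itertools.combinations(range(n_actors), 2))
    true_counts = {d: 0 for d in dyads}
    lam_spurious_total = math.exp(alpha0) * len(dyads)
    events, t, n_true_so_far = [], 0.0, 0
    while n_true_so_far < n_true:
        # True-event intensities depend on the history of true events only.
        lam_true = [math.exp(alpha1 + theta_rep * true_counts[d]) for d in dyads]
        total = sum(lam_true) + lam_spurious_total
        t += rng.expovariate(total)  # waiting time to the next event
        weights = lam_true + [lam_spurious_total]
        idx = rng.choices(range(len(weights)), weights=weights)[0]
        if idx < len(dyads):
            dyad, z = dyads[idx], 1
            true_counts[dyad] += 1
            n_true_so_far += 1
        else:
            # Constant spurious intensity: every dyad is equally likely.
            dyad, z = rng.choice(dyads), 0
        events.append((dyad[0], dyad[1], t, z))
    return events


sim = simulate_events(n_actors=5, n_true=20, alpha0=0.0, alpha1=0.0,
                      theta_rep=0.1, rng=random.Random(2))
share_spurious = 1.0 - sum(e[3] for e in sim) / len(sim)
```

The stopping rule mirrors the design above in that simulation ends once the target number of true events is reached, so the total number of observed events varies with the spurious-event intensity.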
Although the method is estimated in a Bayesian framework, we can still assess the frequentist properties of the estimates of the REMSE and REM. In particular, the average point estimate (AVE), the root-mean-squared error (RMSE), and the coverage probabilities (CP) are presented in Table 1. The AVE of a specific coefficient is the average over the posterior means in each run:
where $\bar{\vartheta }_t$ is the posterior mean (17) of the $t$th simulation run. To check for the average variance of the error in each run, we further report the RMSEs of estimating the coefficient vector $\vartheta$:
where $\vartheta$ is the ground truth coefficient vector defined above. Finally, we assess the adequacy of the uncertainty quantification by computing the percentage of runs in which the real parameter lies within the confidence intervals based on a multivariate normal posterior with mean and variance given in (17) and (18). According to standard statistical theory for interval estimates, this coverage probability should be around $95\%$ (Casella & Berger, 2001).
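The three summary measures can be computed from the per-run results as in the sketch below. It reports all quantities coefficient by coefficient and uses marginal normal intervals; whether the RMSE in Table 1 is computed per coefficient or for the whole vector is not settled by the text here, so this choice, like the function name, is an assumption.

```python
import numpy as np


def simulation_summary(estimates, std_errors, truth, z_crit=1.959964):
    """AVE, RMSE, and coverage per coefficient across simulation runs.

    estimates  : shape (S, p), point estimate of each run
    std_errors : shape (S, p), standard errors of each run
    truth      : shape (p,), data-generating coefficients
    """
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    truth = np.asarray(truth, dtype=float)
    ave = est.mean(axis=0)
    rmse = np.sqrt(((est - truth) ** 2).mean(axis=0))
    covered = np.abs(est - truth) <= z_crit * se  # truth inside the interval?
    cp = covered.mean(axis=0)
    return ave, rmse, cp


# Two runs, one coefficient (made-up numbers).
ave, rmse, cp = simulation_summary(
    estimates=[[1.0], [3.0]],
    std_errors=[[1.0], [0.1]],
    truth=[2.0],
)
```

In this toy input the average estimate equals the truth while the RMSE is 1 and only one of the two intervals covers the truth, which shows why the three measures are reported side by side.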
Results: DG 1 shows how the estimators behave if the true and false intensities are correctly specified. The results in Table 1 suggest that the REMSE can recover the coefficients from the simulation. On the other hand, strongly biased estimates are obtained in the REM, where not only the average estimates are biased, but we also observe high RMSEs and violated CP.
In the second simulation, we assess the performance of the spurious event model when it is superfluous. In particular, we investigate what happens when there are no spurious events in the data, that is, all events are real, and the intensity of $N_{ab,0}(t)$ is zero in DG 2. Unsurprisingly, the REM allows for valid and unbiased inference under this regime. Our stochastic estimation algorithm nevertheless proves to be robust: in most runs, the simulated events at some point consisted only of true events. In other words, the REMSE can detect the spurious events correctly and is unbiased if none occur in the observed data.
For both DG 1 and DG 2, the PFE estimated by the REMSE closely matches the observed one whereas the REM, by constraining it to zero, severely underestimates the PFE in DG 1. In sum, the simulation study thus offers evidence that the REMSE increases our ability to model relational event data in the presence of measurement error while being equivalent to a standard REM when spurious events do not exist in the data.
4. Application
Next, we apply the REMSE to two real-world data sets motivated by the types of event data discussed in the introduction, namely human-coded conflict events in the Syrian civil war and co-location event data generated from the Bluetooth devices of students in a university dorm. Information on the data sources, observational periods, and numbers of actors and events is summarized in Table 2. Following the above presentation, we focus on modeling the true-event intensity of the REMSE and limit the spurious-event intensity to the constant term. Covariates are thus only specified for the true-event intensity. In our applications, the samples drawn according to Algorithm 1 converged to a stationary distribution within the first 30 iterations. To obtain the reported point and interval estimates via MI, we sampled 30 additional draws. Due to space restrictions, we keep our discussions of the substantive background and results of both applications comparatively short.
4.1 Conflict events in the Syrian civil war
In the first example, we model conflict events between different belligerents as driven by both exogenous covariates and endogenous network mechanisms. The exogenous covariates are selected based on the literature on inter-rebel conflict. We thus include dummy variables indicating whether two actors share a common ethno-religious identity or receive material support from the same external sponsor, as these factors have previously been found to reduce the risk of conflict (Popovic, 2018; Gade et al., 2019). Additionally, we include binary indicators of two actors being both state forces or both rebel groups, as conflict may be less likely in the former but more likely in the latter case (Dorff et al., 2020).
Furthermore, we model endogenous processes in the formation of the conflict event network and consider four statistics for this purpose. First, we account for repeated fighting between two actors by including both the count of their previous interactions and a binary indicator of repetition, which takes the value 1 if that count is at least 1. We use this additional endogenous covariate as a conflict onset arguably comprises much more information than subsequent fighting. Second, we include the absolute difference in the degrees of $a$ and $b$ to capture whether actors with a high extent of previous activity are prone to engage each other or, instead, tend to fight less established groups to pre-empt their rise to power. Finally, we model hyperdyadic dependencies by including a triangle statistic that captures the combat network’s tendency towards triadic closure.
Given that fighting should be a relatively obvious event, one may wonder why conflict event data may include spurious observations. This is because no common data collection effort on armed conflict can rely on direct observation; all instead use news and social media reporting. Spurious events thus occur when these sources report fighting which did not actually take place as such. In armed conflict, this can happen for multiple reasons. For instance, pro-government media may falsely report that state security forces engaged with and defeated rebel combatants to boost morale and convince audiences that the government is winning. Social media channels aligned with a specific rebel faction may similarly claim victories by its own forces or, less obviously, battles where a rival faction fought and suffered defeat against another group. In war-time settings, journalists may also be unable or unwilling to enter conflict areas and thus base their reporting on local contacts, rumors, or hearsay. Finally, spurious observations may arise here when reported fighting occurred but was attributed to the wrong belligerent faction at some point in the data collection process. From a substantive perspective, it is thus advisable to check for the influence of spurious events when analyzing these data.
Table 3 accordingly presents the results of an REM and the REMSE. Beginning with the exogenous covariates, belligerents are found to be less likely to fight each other when they share an ethno-religious identity or receive resources from the same external sponsor. In contrast, there is no support for the idea that state forces exhibit less fighting among each other than against rebels in this type of internationalized civil war, whereas different rebel groups are more likely to engage in combat against one another. Furthermore, we find evidence that endogenous processes affect conflict event incidence. The binary repetition indicator exhibits the strongest effect across all covariates, implying that two actors are more likely to fight each other if they have done so in the past. As indicated by the positive coefficient of the repetition count, the dyadic intensity further increases the more they have previously fought with one another. The absolute degree difference also exhibits a positive effect, meaning that fighting is more likely between groups with different levels of previous activity. And finally, the triangle statistic’s positive coefficient suggests that even in a fighting network, triadic closure exists. This may suggest that belligerents engage in multilateral conflict, attacking the enemy of their enemy, in order to preserve the existing balance of capabilities or change it in their favor (Pischedda, Reference Pischedda2018).
This discussion holds for the results of both the REM and the REMSE. Their point estimates are generally quite similar in this application, suggesting that spurious events do not substantively affect empirical results in this case. That being said, there are two noticeable differences between the two models. First, the coefficient estimate for the binary indicator of belligerents having fought before differs between the two models. In the REM, it implies a multiplicative change of $\exp \{4.911\}=135.775$ while for the REMSE, it is estimated at $\exp \{5.059\}=157.433$ . While both models thus identify this effect to be positive and significant, it is found to be stronger when spurious events are accounted for. Second, the two models differ in how precise they deem estimates to be. This difference is clearest in their respective Z values, which are always farther away from zero for the REM than for the REMSE. As a whole, these results nonetheless show that spurious events have an overall small influence on substantive results in this application. The samples from the latent indicators $z$ also indicate that only approximately 1 $\%$ of the observations, about 50 events, are on average classified as spurious events. These findings offer reassurance for the increasing use of event data to study armed conflict.
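The multiplicative effects quoted above follow directly from exponentiating the two coefficients. The short check below reproduces that arithmetic; the variable names are chosen here for illustration only.

```python
import math

# Coefficients of the binary repetition indicator as quoted in the text.
coef_rem, coef_remse = 4.911, 5.059

effect_rem = math.exp(coef_rem)          # multiplicative change under the REM
effect_remse = math.exp(coef_remse)      # multiplicative change under the REMSE
ratio = math.exp(coef_remse - coef_rem)  # REMSE effect relative to REM effect

print(f"REM:   {effect_rem:.3f}")
print(f"REMSE: {effect_remse:.3f}")
print(f"ratio: {ratio:.3f}")
```

The ratio of roughly 1.16 shows that the estimated repetition effect is about 16 percent larger once spurious events are accounted for.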
4.2 Co-location events in university housing
In our second application, we use a subset of the co-location data collected by Madan et al. (Reference Madan, Cebrian, Moturu, Farrahi and Pentland2012) to model when students within an American university dorm interact with each other. These interactions are deduced from regular (every 6 minutes) scans of proximity via the Bluetooth signals of students’ mobile phones. Madan et al. (Reference Madan, Cebrian, Moturu, Farrahi and Pentland2012) used questionnaires to collect a host of information from the participating students. This information allows us to account for both structural and more personal exogenous predictors of social interaction. We thus include binary indicators of whether two students are in the same year of college or live on the same floor of the dorm to account for the expected homophily of social interactions (McPherson et al., Reference McPherson, Smith-Lovin and Cook2001). In addition, we incorporate whether two actors consider each other close friends (see footnote 4). Given that the data were collected around a highly salient political event, the 2008 US presidential election, we also incorporate a dummy variable to measure whether they share the same presidential preference and a variable measuring their similarity in terms of interest in politics (Butters & Hare, Reference Butters and Hare2020). In addition, we include the same endogenous network statistics here as in Section 4.1. These covariates allow us to capture the intuitions that individuals tend to socialize with people whom they have interacted with before, who differ from them in popularity, and with whom they share more common friends (Rivera et al., Reference Rivera, Soderstrom and Uzzi2010). Compared to the first application, sources of spurious events are more evident here, as students may be physically close to and even face each other without actually interacting, for example, when riding an elevator, queuing in a store, or studying in a common space.
We present the results in Table 4. Beginning with the exogenous covariates, we find that the observed interactions tend to be homophilous in that students have social encounters with people they live together with, consider their friends, and share a political opinion with. In contrast, neither a common year of college nor a similar level of political interest is found to have a statistically significant effect on student interactions. At the same time, these results indicate that the social encounters are affected by endogenous processes. A previous true event between two students is found to be the main driver of the corresponding intensity, with a very strong and positive effect. Individuals who have socialized before are thus more likely to socialize again, an effect that, as indicated by the repetition count, increases with the number of previous interactions. Turning to the other endogenous covariates, the result for absolute degree difference suggests that students $a$ and $b$ are more likely to engage with each other the more their levels of previous activity differ, suggesting that, for example, popular individuals attract attention from less popular ones. As is usual for most social networks (Newman & Park, Reference Newman and Park2003), the triangle statistic is positive, meaning that students “socialize” with the friends of their friends.
As in the first application, the REM and REMSE results presented in Table 4 are closely comparable but also show some differences. Again, the effect estimate for binary repetition, at $\exp \{2.715\}=15.105$ , is higher in the REMSE than in the REM ( $\exp \{2.615\}=13.667$ ) while Z values and confidence intervals obtained in the REM are substantially smaller than in the REMSE. In the co-location data too, the results are thus not driven by the presence of spurious events, but accounting for these observations does affect results to some, albeit rather negligible, extent. This is the case even though the average percentage of spurious events here is comparatively high at 3 $\%$ . That leaving out the corresponding 81 events yielded similar estimates may indicate that spurious events were mainly observed at the periphery of the interaction network and hardly affected the behavior in the network’s core. More generally, these results may assuage concerns over sensor data reliability (see Elmer et al., Reference Elmer, Chaitanya, Purwar and Stadtfeld2019).
5. Discussion
In summary, this paper extends the relational event framework to handle spurious events. In doing so, it offers applied researchers analyzing instantaneous interaction data a useful tool to explicitly account for measurement errors induced by spurious events or to investigate the robustness of their results against this type of error. Our proposed method controls for one explicit measurement error, namely that induced by spurious events. The simulation study showed that our approach can detect such false events and even yield correct results if they are not present. Still, we want to emphasize that numerous other types of measurement error, which we disregard in this article, may be present when one analyzes relational events. For instance, true events may be missing. These false negatives, for example, unreported conflict events between different belligerents, are difficult to tackle because of a lack of information.
We explicitly recommend the use of the REMSE as a method for checking robustness. When substantive knowledge suggests the presence of spurious events, the REMSE can be used to assess whether REM results hold when accounting for them. Spurious events may be common in datasets which come from sensors or are coded from journalistic sources, as discussed above, and more generally seem credibly present in data that are based on secondary sources instead of direct observation. Spurious events also occur, and may possibly be more influential, in data where relations are directed; the model we introduce accordingly also generalizes to directed event data. Especially for politically contentious data, where some events may be openly claimed to be false, the REMSE offers a possibility to adjudicate whether overall findings depend on such contested observations. But also where the data content is non-political, it is advisable to check how common and influential false observations are. We provide replication code implementing the REMSE for this purpose.
When specifying the REMSE, two aspects require caution so that identifiability is ensured. First, given we know which events are spurious, our model simplifies to a competing risk model; thus, the identifiability issues discussed in Heckman & Honoré (Reference Heckman and Honoré1989) or Tsiatis (Reference Tsiatis1975) apply. For this reason, we presented our model under the assumption of independence between the true-event and spurious-event intensities. Second, the particular specification of the covariates might also affect the identifiability of the model. This may occur when one assumes complex dependencies of spurious events and exogenous covariates are unavailable, or the prior information about the coefficients is too weak. For the model specification employed in this article, this is not an issue due to the simple form of the spurious-event intensity as long as at least one exogenous or endogenous term has a nonzero effect on the true-event intensity. For more complex models, one may use multiple starting values of the data augmentation algorithm or formulate more informative priors for $\theta _1$ and possibly $\theta _0$ .
Our latent variable methodology can also be extended beyond the approach presented here. A straightforward refinement along the lines of Stadtfeld & Block (Reference Stadtfeld and Block2017) would be to include windowed effects, that is, endogenous statistics that only use the history within a specific duration into the past, or exogenous covariates calculated from networks other than the one modeled. The first modification could also be extended to separable models as proposed in Fritz et al. (Reference Fritz, Thurner and Kauermann2021). A relatively simplistic version of the latter type of covariate was incorporated in Section 4.2 to account for common friendships, but more complex covariates are possible. This might be helpful, for instance, when we observe proximity and e-mail events between the same group of actors. Moreover, with minor adaptations, the proposed estimation methodology could handle some of the exogenous or endogenous covariates having nonlinear effects on the intensities.
Finally, the framing of the simultaneous counting processes may be modified and their number extended. To better understand the opportunities our model framework entails, it is instructive to perceive the proposed model as an extension to the latent competing risk model of Gelfand et al. (Reference Gelfand, Ghosh, Christiansen, Soumerai and McLaughlin2000) with two competing risks. For time-to-event data, one could thus employ an egocentric version (see footnote 5) of our model for model-based clustering of general duration times, which could prove to be a valuable tool for medical applications. Alternatively, our proposed methodology could be conceived as a general tool to correct for additive measurement errors in count data and be extended to spatial data analysis for use in settings such as those described in Raleigh et al. (Reference Raleigh, Linke, Hegre and Karlsen2010).
Acknowledgments
We thank the anonymous reviewers and Carter Butts for their careful reading and constructive comments. The authors gratefully acknowledge support from the German Federal Ministry of Education and Research (BMBF) under Grant No. 01IS18036A and the German Research Foundation (DFG) for the project TH 697/11-1: Arms Races in the Interwar Period 1919-1939. Global Structures of Weapons Transfers and Destabilization.
Competing interests
None.
A. Definition of undirected network statistics
As REMs for undirected events are so far sparse in the literature, there are no standard statistics that are commonly used (one exception being Bauer et al., Reference Bauer, Harhoff and Kauermann2021). Thus we define all statistics based on prior substantive research (Rivera et al., Reference Rivera, Soderstrom and Uzzi2010; Wasserman & Faust, Reference Wasserman and Faust1994) and undirected statistics used for modeling static networks (Robins et al., Reference Robins, Snijders, Wang, Handcock and Pattison2007). Generally, nondirected statistics have to be invariant to swapping the positions of actor $a$ and $b$ . For the following mathematical definitions, we denote the set of all actors by $\mathcal{A}$ .
For degree-related statistics, we include the absolute difference of the degrees of actors $a$ and $b$ :
where $t^-$ is the point-in-time just before $t$ . Alternatively, one might also employ other bivariate functions of the degrees as long as they are invariant to swapping $a$ and $b$ , such as the sum of degrees. When simultaneously using different forms of degree-related statistics, collinearities between the respective covariates might severely impede the interpretation.
To capture past dyadic behavior, one can include $N_{ab}(t^-)$ directly as a covariate. Since the first event often constitutes a more meaningful action than any further observed events between the actors $a$ and $b$ , we additionally include a binary covariate to indicate whether the respective actors ever interacted before, leading to the following endogenous statistics:
Hyperdyadic statistics in the undirected regime are defined as any type of triadic closure, where actor $a$ is connected to an entity that is also connected to actor $b$ :
Finally, actor-specific exogenous statistics can also be used to model the intensities introduced in this article. We denote arbitrary continuous covariates by $x_{a,\;cont} \; \forall \; a \in \mathcal{A}$ . On the one hand, we may include a measure for the similarity or dissimilarity for the covariate through:
For multivariate covariates, such as location, we only need to substitute the absolute value with any given metric, for example, Euclidean distance. In other cases, it might be expected that high levels of a continuous covariate result in higher or lower intensities of an event:
Which type of statistic should be used depends on the application case and the hypotheses to be tested. Categorical covariates, which we denote by $x_{a,cat} \; \forall \; a \in \mathcal{A}$ , can also be used to parameterize the intensity by checking for equivalence of two actor-specific observations of the variable:
Besides actor-specific covariates, exogenous networks or matrices, such as $x_{Network} \in \mathbb{R}^{|\mathcal{A}|\times |\mathcal{A}|}$ , can also be incorporated as dyadic covariates in our framework:
where $x_{Network, ab}$ is the entry of the $a$ th row and $b$ th column of the matrix $x_{Network}$ . Extensions to time-varying networks are straightforward when perceiving changes to them as exogenous to the modeled events (Stadtfeld & Block, Reference Stadtfeld and Block2017).
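One possible formalization consistent with the verbal definitions in this appendix is sketched below. The statistic symbols $s_{ab,\cdot}$ and the indicator function $\mathbb{I}(\cdot)$ are hypothetical notation introduced here for illustration. The sketch further assumes that degree counts distinct past partners, that the triangle statistic counts shared past partners, that the main effect of a continuous covariate enters as a sum, and that $x_{Network}$ is symmetric; none of these choices is confirmed by the text.

```latex
\begin{align*}
s_{ab,\text{degree}}(t) &= \Bigl|\, \sum_{h \in \mathcal{A}} \mathbb{I}\bigl(N_{ah}(t^-) > 0\bigr)
  - \sum_{h \in \mathcal{A}} \mathbb{I}\bigl(N_{bh}(t^-) > 0\bigr) \Bigr| \\
s_{ab,\text{count}}(t) &= N_{ab}(t^-) \\
s_{ab,\text{repetition}}(t) &= \mathbb{I}\bigl(N_{ab}(t^-) > 0\bigr) \\
s_{ab,\text{triangle}}(t) &= \sum_{h \in \mathcal{A} \setminus \{a,b\}}
  \mathbb{I}\bigl(N_{ah}(t^-) > 0\bigr)\, \mathbb{I}\bigl(N_{bh}(t^-) > 0\bigr) \\
s_{ab,\text{similarity}} &= \bigl| x_{a,\;cont} - x_{b,\;cont} \bigr| \\
s_{ab,\text{level}} &= x_{a,\;cont} + x_{b,\;cont} \\
s_{ab,\text{match}} &= \mathbb{I}\bigl(x_{a,cat} = x_{b,cat}\bigr) \\
s_{ab,\text{dyadic}} &= x_{Network, ab}
\end{align*}
```

Each expression is unchanged when $a$ and $b$ are swapped, which is the invariance required of statistics for undirected events.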
B. Mathematical Derivation of (9)
For $m\in \{1, \ldots, M\}$ , let $Y_{a_{m}b_{m},1}(t_{m}) = N_{a_{m}b_{m},1} (t_{m}) - N_{a_{m}b_{m},1}(t_{m-1})$ be the increments of the latent counting process of true events between the time points $t_{m-1}$ and $t_{m}$ , where we additionally define $t_0 = 0$ without loss of generality. We observe $\mathcal{E}$ ; hence, we can reconstruct the respective increment $Y_{a_{m}b_{m}}(t_{m}) = N_{a_{m}b_{m}}(t_{m}) - N_{a_{m}b_{m}}(t_{m-1}) = Y_{a_{m}b_{m},0}(t_{m}) + Y_{a_{m}b_{m},1}(t_{m})$ , where $ Y_{a_{m}b_{m},0}(t_{m})$ is the increment of the spurious-event counting process. The second equality holds since by design the sum of increments of the processes counting the true and false events is the increment of the observed counting process, that is $N_{ab}(t) = N_{ab,0}(t)+N_{ab,1}(t)$ . To sample from $Z_{m} | z_1, \ldots, z_{m-1},\mathcal{E}$ , note that $Z_{m} = Y_{a_{m}b_{m},1}(t_{m}) | Y_{a_{m}b_{m}}(t_{m})$ holds. Heuristically, this means that if we know that one of the two thinned counting processes jumps at time $t_{m}$ , the probability of the jump being attributed to $N_{a_{m}b_{m},1} (t)$ is the probability that the $m$ th event is a true event. For the increments of the involved counting processes, we can then use the properties of Poisson processes and the fact that the intensities are piecewise constant between event times to derive the following distributional assumptions $\forall \; m = 1, \ldots, M$ :
where we set $\delta _m = t_{m}- t_{m-1}$ . We can now directly compute the probability of $Z_{m} = 1| z_1, \ldots, z_{m-1},\mathcal{E}, \vartheta$ :
In the last row we plug in (B1) and (B2) for the probabilities in the numerators and (B3) in the denominator to prove claim (9). The calculation for $p(z_{m} = 0| z_1, \ldots, z_{m-1},\mathcal{E}, \vartheta )$ is almost identical to the one shown here.
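The structure of this calculation can be written out under an assumed shorthand. Here $Y_{m,1}$, $Y_{m,0}$, and $Y_{m}$ abbreviate the increments $Y_{a_{m}b_{m},1}(t_{m})$, $Y_{a_{m}b_{m},0}(t_{m})$, and $Y_{a_{m}b_{m}}(t_{m})$, while $\lambda_{m,1}$ and $\lambda_{m,0}$ are hypothetical abbreviations for the true-event and spurious-event intensities of the dyad $(a_{m}, b_{m})$ on the interval ending at $t_{m}$. All quantities are conditional on $z_1, \ldots, z_{m-1}, \mathcal{E}, \vartheta$, and the two thinned processes are taken to be independent. This is a sketch of the argument, not a reproduction of the original display equations.

```latex
\begin{align*}
Y_{m,1} &\sim \text{Pois}\bigl(\delta_m \lambda_{m,1}\bigr), \qquad
Y_{m,0} \sim \text{Pois}\bigl(\delta_m \lambda_{m,0}\bigr), \qquad
Y_{m} \sim \text{Pois}\bigl(\delta_m (\lambda_{m,0} + \lambda_{m,1})\bigr), \\[4pt]
p(z_m = 1 \mid z_1, \ldots, z_{m-1}, \mathcal{E}, \vartheta)
  &= P\bigl(Y_{m,1} = 1 \mid Y_{m} = 1\bigr)
   = \frac{P\bigl(Y_{m,1} = 1\bigr)\, P\bigl(Y_{m,0} = 0\bigr)}{P\bigl(Y_{m} = 1\bigr)} \\[4pt]
  &= \frac{\delta_m \lambda_{m,1}\, e^{-\delta_m \lambda_{m,1}}\, e^{-\delta_m \lambda_{m,0}}}
          {\delta_m (\lambda_{m,0} + \lambda_{m,1})\, e^{-\delta_m (\lambda_{m,0} + \lambda_{m,1})}}
   = \frac{\lambda_{m,1}}{\lambda_{m,0} + \lambda_{m,1}}.
\end{align*}
```

The interval length $\delta_m$ and the exponential terms cancel, so the probability that the $m$ th event is a true event depends only on the relative size of the two intensities. Swapping the roles of $\lambda_{m,1}$ and $\lambda_{m,0}$ in the numerator gives the corresponding expression for $z_m = 0$.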