
An unsupervised information-theoretic approach to identifying formulaic clusters in textual data

Published online by Cambridge University Press:  19 September 2025

Gideon Yoffe*
Affiliation:
Department of Statistics and Data Science, The Hebrew University of Jerusalem, Jerusalem, Israel
Yair Segev
Affiliation:
Faculty of Theology, Carl von Ossietzky Universität Oldenburg, Oldenburg, Germany
Barak Sober
Affiliation:
Department of Statistics and Data Science, The Hebrew University of Jerusalem, Jerusalem, Israel
*Corresponding author: Gideon Yoffe; Email: gideon.yoffe@mail.huji.ac.il

Abstract

Texts, whether literary or historical, exhibit structural and stylistic patterns shaped by their purpose, authorship and cultural context. Formulaic texts, which are characterized by repetition and constrained expression, tend to differ in their information content (as defined by Shannon) compared to more dynamic compositions. Identifying such patterns in historical documents, particularly multi-author texts like the Hebrew Bible, provides insights into their origins, purpose and transmission. This study aims to identify formulaic clusters – sections exhibiting systematic repetition and structural constraints – by analyzing recurring phrases, syntactic structures and stylistic markers. However, distinguishing formulaic from non-formulaic elements in an unsupervised manner presents a computational challenge, especially in high-dimensional and sample-poor data sets where patterns must be inferred without predefined labels.

To address this, we develop an information-theoretic algorithm leveraging weighted self-information distributions to detect structured patterns in text. Our approach directly models variations in sample-wise self-information to identify formulaicity. By extending classical discrete self-information measures with a continuous formulation based on differential self-information in multivariate Gaussian distributions, our method remains applicable across different types of textual representations, including neural embeddings under Gaussian priors.

Applied to hypothesized authorial divisions in the Hebrew Bible, our approach successfully isolates stylistic layers, providing a quantitative framework for textual stratification. This method enhances our ability to analyze compositional patterns, offering deeper insights into the literary and cultural evolution of texts shaped by complex authorship and editorial processes.

Information

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Open Practices
Open data
Copyright
© The Author(s), 2025. Published by Cambridge University Press

Plain Language Summary

This work presents a method for identifying formulaic clusters – segments of text that rely heavily on repetition, structural regularity or stylistic constraint. Such clusters can be found in a wide range of contexts, including religious texts, technical manuals, oral traditions and institutional writing, where language is shaped by convention, genre or the need for consistency. Detecting these patterns can offer insight into how texts were composed, transmitted and shaped by cultural forces over time.

Our approach is grounded in information theory and focuses on the concept of self-information, which measures how predictable or unpredictable a text unit is with respect to the corpus with which it is associated. Formulaic segments tend to be more predictable and, therefore, carry higher self-information. By measuring the distribution of self-information across a corpus, we identify areas where language use is more constrained, indicating a higher degree of formulaicity.

Benchmarking against standard clustering methods on synthetic and real data confirms the method’s robustness, particularly in high-dimensional, low-sample contexts. It outperforms conventional models in isolating internally consistent structures, without requiring strong distributional assumptions.

We apply our method to three books of the Hebrew Bible: Genesis, Exodus and Leviticus. In Genesis, we focus on the distinction between genealogical lists and surrounding narrative material. The genealogical sections are known for their rigid, repetitive structure, and our method effectively isolates them as highly formulaic. In Exodus, we examine the division between Priestly (P) and non-Priestly texts, with the Priestly material comprising legal and cultic instructions. The method reliably separates these layers in line with traditional source-critical scholarship. In Leviticus, we test whether the method can distinguish between the Priestly source and the Holiness code (H), two stylistically distinct strata. The results show that subtle differences in recurring expressions and structural regularity allow for a clear separation between them under appropriate conditions.

This work offers a flexible, information-theoretic framework for analyzing the internal structure of texts. It is well suited to the study of layered, edited or composite documents, where patterns of repetition and constraint signal deeper organizational principles. The method is broadly applicable to literary, historical, legal or technical corpora, wherever formulaic language plays a role in shaping textual form.

Introduction

An intriguing challenge in textual analysis is distinguishing formulaic layers – those characterized by recurring patterns, motifs or stylistic conventions – from non-formulaic ones. In this context, a formulaic element is defined as a word or a sequence thereof, syntactic or grammatical structures, or thematic patterns that recur systematically within a corpus, either due to conventionalized linguistic usage, genre-specific constraints or cultural transmission mechanisms. Formulaic expressions can function as structural scaffolds within texts, shaping their composition and reflecting underlying conventions of specific literary traditions, genres or modes of authorship (e.g., Paquot and Granger 2012; Read and Nation 2008; Wood 2019). Beyond their structural significance, analyzing formulaic patterns provides insights into the dynamic interplay between creativity and constraint in language, illuminating the socio-cultural and cognitive forces that have influenced textual transmission and stylistic evolution (e.g., Jensen 1980; Magoun Jr 1953).

The importance of formulaicity is particularly pronounced in historical texts, where composition layers often reflect complex authorship and redaction processes. Many historical documents, created and transmitted over centuries, are products of collective memory and institutional influence. As such, these texts often contain formulaic structures that encode cultural norms, narrative conventions, and theological or political ideologies (e.g., Knohl 2007; Polak 2017). Understanding these layers can reveal how texts were constructed, transmitted and adapted over time (e.g., Coleman 2019; Gitay 1980; Polak 2006).

Among historical texts, the Hebrew Bible provides a compelling case study. As a composite text shaped by centuries of authorship, redaction and transmission, the Hebrew Bible embodies a rich interplay of formulaic and non-formulaic elements. Its stylistic diversity reflects contributions from different schools of authorship, genres and literary motifs, ranging from tightly structured legal codes to more fluid poetic and narrative traditions (Shectman and Baden 2009; Stipp 2017; Wärters 1976). Detection and analysis of formulaic patterns within the Hebrew Bible offer a pathway to better understanding its compositional history, shedding light on how diverse voices and traditions coalesced into the canonical form.

Understanding the distinction between formulaic and non-formulaic layers requires rigorous methodologies to quantify stylistic and structural patterns across literary feature sets. One promising metric for addressing these challenges is entropy, a concept from information theory that measures the unpredictability or randomness of a dataset. Initially introduced by Rudolf Clausius in 1865 as a measure of energy dispersal in a system, and subsequently formalized statistically by Shannon (1948), entropy has found applications in diverse fields, including thermodynamics, cryptography and data compression. In the context of textual analysis, entropy provides a framework for quantifying the variability and regularity inherent in linguistic structures. Formally, given a discrete probability distribution $P = \{p_1, p_2, \dots , p_n\}$ over n possible elements, Shannon entropy is defined as

(1) $$ \begin{align} H(P) = -\sum_{i=1}^{n} p_i \log p_i ,\end{align} $$

where $p_i$ represents the probability of the ith element. Higher entropy values indicate greater variability, while lower entropy values correspond to constrained, repetitive structures, making entropy a powerful tool for distinguishing formulaic clusters from more unpredictable distributions.

The relevance of entropy to textual analysis lies in its ability to capture the degree of predictability of literary or linguistic features within a sequence, such as a sentence, paragraph or chapter. Texts with high formulaicity, such as legal codes or liturgical compositions, are likely to exhibit low entropy, such that text units within them resemble one another, for example, in vocabulary or grammatical structure. Conversely, more creative or extemporaneous texts often exhibit higher entropy, reflecting their greater unpredictability. Information-based metrics, therefore, serve as a natural tool for distinguishing between formulaic and non-formulaic elements within texts (e.g., Church and Hanks 1990).
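
To make Eq. (1) concrete, the following minimal Python sketch (illustrative only, and not the implementation used in this study) computes the Shannon entropy of the word distributions of two toy snippets; the repetitive, formulaic snippet yields a lower value than the more varied one.

import math
from collections import Counter

def shannon_entropy(tokens):
    # empirical word distribution, then H(P) = -sum p_i log2 p_i, as in Eq. (1)
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

formulaic = "these are the generations of adam these are the generations of noah".split()
varied = "in the beginning god created the heavens and the earth and light".split()

print(shannon_entropy(formulaic))  # lower: few word types, heavy repetition
print(shannon_entropy(varied))     # higher: more varied vocabulary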

In textual contexts, entropy is closely related to the concept of perplexity – the exponential of the average self-information of the features in a sequence – which is widely used in language modeling and computational linguistics. Perplexity measures how well a probabilistic model predicts a sequence of words, with lower perplexity indicating a better fit. Formally, perplexity is defined as the exponential of the average negative log probability of the observed sequence:

(2) $$ \begin{align} \text{Perplexity} = \exp\left(-\frac{1}{N} \sum_{i=1}^N \log P(w_i \mid w_{i-1}, w_{i-2}, \dots)\right), \end{align} $$

where $w_i$ represents the ith word in the sequence, and $P(w_i \mid w_{i-1}, w_{i-2}, \dots )$ denotes the conditional probability of $w_i$ given its preceding context. Perplexity has proven instrumental in analyzing textual corpora from various linguistic angles (e.g., Gamallo, Campos, and Alegria 2017; Huang and Hansen 2007; Klakow and Peters 2002). In the context of formulaic analysis, perplexity can reveal how tightly a text adheres to specific linguistic conventions (e.g., Kurzynski 2023).
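
As a worked illustration of Eq. (2), the short sketch below computes perplexity from per-token conditional probabilities; the probabilities are invented values for a hypothetical five-word sequence, not the output of any particular language model.

import math

token_probs = [0.20, 0.05, 0.30, 0.10, 0.25]  # P(w_i | context), illustrative only
avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(avg_neg_log_prob)       # Eq. (2)
print(round(perplexity, 2))                   # lower values indicate a more predictable sequence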

The task of identifying formulaic layers is inherently complex, as clustering formulaic elements within a corpus involves evaluating an exponentially large set of possible ways to divide the text, as each partition represents a different possible grouping of elements that exhibit formulaicity. This combinatorial challenge necessitates algorithmic approaches that efficiently approximate optimal solutions. One common strategy involves modeling linguistic feature distributions using probabilistic frameworks, with Gaussian-based methods being among the most widely applied.

For instance, Gaussian mixture models (GMMs) (Vlassis and Likas 2002) assume that the dataset can be represented as a weighted sum of Gaussian components, enabling efficient clustering through iterative optimization of parameters, such as means, covariance matrices and mixing weights. Similarly, cross-entropy clustering (CEC) (Tabor and Spurek 2014) explicitly incorporates entropy as a criterion to partition data, refining clusters based on their statistical coherence. These methods, while effective in many contexts, rely on assumptions about the underlying data distribution, which may not always align with the structural and stylistic non-Gaussian variability inherent in textual corpora.

Recent advances in representation learning, including the contrastive language–image pretraining model (CLIP; Radford et al. 2021) and bidirectional encoder representations from transformers (BERT; Devlin et al. 2019), have introduced powerful mechanisms for unsupervised clustering and alignment in high-dimensional semantic spaces. These models often optimize mutual information across paired or contextualized inputs, effectively capturing latent structural relationships without requiring explicit probabilistic assumptions. Building on these architectures, recent work has proposed information-theoretic evaluation metrics computed over embedding spaces to quantify diversity and structural complexity. One such class of methods includes entropy measures based on Rényi kernel entropy (RKE), which estimate entropy non-parametrically using similarity kernels to capture the dispersion of representations in high-dimensional spaces (Giraldo, Rao, and Principe 2014; Principe 2010). These approaches are increasingly used to assess representational diversity and clustering compactness in generative models and learned embeddings. Many of these entropy-based methods derive from or build upon neural embedding models such as variational autoencoders (VAEs), which assume that latent features follow a Gaussian distribution in a continuous space (Dilokthanakul et al. 2016; Li et al. 2020; Yang et al. 2019). VAEs excel in capturing the underlying structure of data by encoding inputs into compact latent representations governed by a Gaussian prior (Casale et al. 2018), which enables smooth interpolations and the generation of new data points resembling the original distribution.

While these approaches provide useful frameworks for clustering and representation learning, their reliance on Gaussian assumptions and metrics that require the estimation of covariance presents notable challenges. In textual data, where features often exhibit categorical, skewed or multimodal distributions, the Gaussian prior may not accurately capture structural variability. Moreover, many of these methods are highly sensitive to the sample-to-dimension ratio (Altonji and Segal 1996; Yao, Zheng, and Bai 2015). When the sample size is small relative to the feature space, covariance estimation becomes unstable, leading to poorly conditioned matrices or singularities (e.g., Ashurbekova et al. 2021). Regularization techniques, such as adding small positive constants to the diagonal of covariance matrices, can partially mitigate these issues but introduce biases that require careful tuning (Bickel and Levina 2008). This bias arises because regularization alters the estimated eigenvalues of the covariance matrix, effectively shrinking them toward a predetermined value, which can distort the true structure of the data and influence clustering or classification outcomes.

Crucially, while these techniques are effective in high-resource, general-purpose NLP settings, their applicability to low-resource or singular historical corpora remains limited. In domains such as ancient languages, where pretrained embeddings are unreliable or unavailable, and where interpretability is essential for source-critical and literary-historical inquiry, alternative approaches are required.

These challenges are particularly pronounced in textual analysis, where statistical significance often necessitates dividing texts into relatively large units, such as paragraphs or entire chapters. While this improves the reliability of individual measurements, it also limits the number of available samples. At the same time, textual data often exhibit high dimensionality due to feature-rich representations, such as frequent words, n-grams or syntactic constructions, leading to a severe imbalance between the number of samples and features. This issue is especially prevalent in stylometric analysis, where subtle stylistic differences are captured through sparse, high-dimensional feature spaces (Stamatatos 2009). Given these constraints, clustering methods that rely on covariance estimation or strong parametric assumptions struggle to produce stable and interpretable results.

To address these limitations, we introduce an entropy-driven soft clustering framework that identifies formulaic structures based on the log probability (log p) of data points within a given distribution. The choice of − log p as our core metric stems from fundamental principles in information theory: the information content of an observation x, given a probability distribution $p(x)$ , is defined as $-\log p(x)$ , known as its self-information. This quantity directly captures the degree of predictability of x – highly predictable elements (i.e., those frequently occurring in structured, formulaic segments) have higher probabilities and thus lower self-information values, whereas less structured, unpredictable elements exhibit lower probabilities and correspondingly higher self-information values.

By leveraging this information-theoretic perspective, our method provides a natural and rigorous approach to distinguishing formulaic from non-formulaic structures. Our approach evaluates the information content of each data point within its feature space, allowing for a more granular and structurally sensitive identification of formulaic patterns. This circumvents the instability issues associated with covariance estimation in high-dimensional spaces while ensuring robustness to sparsity and multimodal distributions. By directly quantifying the predictability of textual elements, our method provides a principled, data-driven alternative to traditional clustering approaches that struggle with complex linguistic structures.

Our framework integrates soft clustering weights (Peters et al. 2013), which adaptively re-weight samples based on their self-information contributions, mitigating the influence of sparse or noisy data points. This probabilistic approach ensures robust partitioning of text segments by capturing both localized structural patterns and broader stylistic trends. Unlike traditional clustering methods that assume specific data distributions, such as GMMs, our information-based approach remains effective across a wide range of feature distributions, given an adequate choice of the score function, including multimodal and categorical datasets. Furthermore, by avoiding direct covariance inversion, our method enhances numerical stability and reduces susceptibility to the curse of dimensionality. Through this paradigm, we offer a flexible, data-driven solution for identifying formulaic structures in textual corpora while maintaining robustness in challenging high-dimensional settings.

To accommodate different types of textual feature representations, we consider two complementary formulations of the method: one for discrete data and one for continuous data. The discrete formulation operates directly on binary or count-based features, avoiding any form of covariance estimation. In contrast, the continuous formulation, used for real-valued vector representations, such as embeddings, does involve estimating a soft covariance matrix. While this introduces some of the challenges seen in classical approaches, our method confines the estimation to weighted subsets of the data associated with candidate clusters exhibiting internal coherence. This restriction improves numerical stability by avoiding cross-cluster covariance terms, which are often the most ill-conditioned. Moreover, in both formulations, the objective function remains grounded in self-information, prioritizing structural predictability over explicit density modeling. This unifying information-theoretic perspective allows the method to remain robust across data modalities while still benefiting from statistical sensitivity to underlying stylistic regularities.

The article is structured as follows: In the “Soft Clustering Algorithm” section, we introduce the mathematical foundations of the information-based soft clustering framework, defining the self-information distribution and formulating an optimization scheme for distinguishing formulaic from non-formulaic textual components. In the “Benchmarking” section, we benchmark the method against existing clustering techniques on synthetic datasets, assessing its ability to classify structured and unstructured elements under controlled conditions. In the “Application to Textual Data: The Priestly Source(s) in the Pentateuch” section, we apply the framework to the analysis of the biblical corpus, evaluating its capacity to differentiate hypothesized literary strata based on stylistic and formulaic properties. Finally, in the “Discussion” section, we conclude.

Soft clustering algorithm

The following sections establish the mathematical foundations of the proposed information-based soft clustering framework and formulate an optimization scheme for distinguishing formulaic from non-formulaic text layers. In this framework, “soft clustering” refers to the relaxation techniques employed during optimization to avoid indeterminate solutions (e.g., Royer 2017). This approach follows established clustering methodologies, where constraints are relaxed to transform discrete, hard assignment problems into continuous optimization problems, making them more tractable. An example of this is semidefinite relaxation in graph clustering, which reformulates discrete clustering problems as continuous ones, enabling more efficient optimization. By utilizing such relaxation techniques, our framework can more effectively identify structured patterns in text while maintaining flexibility in handling ambiguity in cluster assignments.

Self-information as a measure for identifying formulaic structures

The use of self-information stems from its foundational role in information theory, where the information content of an observation x is defined as

(3) $$ \begin{align} I(x) = -\log p(x), \end{align} $$

which quantifies how predictable or surprising an observation is. More predictable elements (those that occur frequently in structured, formulaic segments) have higher probabilities and thus lower − log p(x) values, while less predictable elements exhibit lower probabilities and higher − log p(x) values.

In the context of textual analysis, formulaic structures, by their nature, are repetitive and highly predictable, leading to higher − log p(x) values for formulaic segments. Our method directly measures the degree of structural predictability of each text segment, providing a granular, data-driven approach to distinguishing formulaic from non-formulaic patterns. Unlike existing information-based clustering methods, which rely on global variability measures, our approach evaluates the local predictability of individual elements, ensuring sensitivity to complex and varied linguistic structures.

Layman explanation: Self-information provides a measure of how surprising or expected a given unit of text is, based on how likely it is to appear. Frequent and structurally constrained expressions are more predictable and therefore contribute higher self-information values. By measuring this quantity locally for each segment, our method identifies areas of the text that exhibit heightened regularity or constraint, which are often associated with formulaic language.

Clustering approach

For a two-cluster problem, clustering methods aim to partition a dataset $X = \{x_1, x_2, \dots , x_n\}$ into two disjoint subsets, or clusters, based on a given similarity measure. Traditional hard clustering assigns each data point $x_i$ to exactly one of the two clusters $\mathcal {C}_1$ or $\mathcal {C}_2$ such that

(4) $$ \begin{align} x_i \in \mathcal{C}_j \quad \text{where} \quad \mathcal{C}_1 \cup \mathcal{C}_2 = X, \quad \mathcal{C}_1 \cap \mathcal{C}_2 = \emptyset. \end{align} $$

Soft clustering relaxes this constraint by allowing each data point $x_i$ to have a continuous membership weight $s_i \in [0,1]$ , representing its degree of association with one of the two clusters. Instead of assigning $x_i$ to a single cluster, we define

(5) $$ \begin{align} s_i \in [0,1], \quad s_i &= 0 \text{ for full membership in } \mathcal{C}_1,\nonumber\\s_i &= 1 \text{ for full membership in } \mathcal{C}_2. \end{align} $$

Thus, each data point has a soft assignment, where

$$\begin{align*}\begin{cases} x_i \in \mathcal{C}_1, & \text{if } s_i = 0, \\ x_i \in \mathcal{C}_2, & \text{if } s_i = 1, \\ x_i \in \mathcal{C}_1 \text{ and } \mathcal{C}_2 \text{ with weight } (1 - s_i, s_i), & \text{if } 0 < s_i < 1. \end{cases} \end{align*}$$

Our clustering approach employs a relaxation strategy within the optimization process, allowing for continuous rather than binary sample assignments. At each step of optimization, we iteratively update the probabilistic membership of each sample in the formulaic class, refining its classification based on structural predictability. This dynamic adjustment enhances the stability of the optimization, preventing indeterminate solutions while maintaining flexibility in high-dimensional or noisy datasets. However, this relaxation is not a final conclusion; rather, it is an intermediate step that ensures a more robust and well-conditioned optimization process, at the end of which each sample is assigned an integer label through a thresholding process.

Our method utilizes self-information to quantify structural predictability in text. Unlike hard clustering, which enforces discrete partitioning, we assign probabilistic memberships based on information content. Since formulaic and non-formulaic text layers differ in their information-theoretic properties – formulaic sections exhibiting higher predictability and lower self-information – our optimization scheme ensures a stable, information-guided classification.

Layman explanation: Rather than forcing each unit of text into one cluster or the other, our method allows for partial or uncertain membership between the two. This flexibility is important in cases where text units may contain a mix of formulaic and non-formulaic features. During optimization, each segment is evaluated based on how structurally predictable it is. Highly predictable text units lean toward the formulaic cluster, while less predictable ones lean toward the non-formulaic cluster. This “soft” classification makes the method more stable and better suited for analyzing noisy or stylistically mixed texts.

Clustering formalism for discrete categorical data

To determine optimal cluster assignments, we minimize the cross-entropy between the empirical data distribution $\mathcal {P}_{\text {data}}(x)$ and the model distribution $\mathcal {P}_{\text {model}}(x)$ . The empirical distribution $\mathcal {P}_{\text {data}}(x)$ represents the observed frequencies of different textual elements in the dataset, while $\mathcal {P}_{\text {model}}(x)$ is derived from the soft clustering assignments, representing the likelihood of each sample belonging to a given cluster. Minimizing cross-entropy ensures that $\mathcal {P}_{\text {model}}(x)$ closely approximates the observed data, allowing the model to generalize the underlying structure while maintaining probabilistic flexibility.

Layman explanation: The goal is to ensure that the feature distributions within each cluster align well with the actual data. Cross-entropy serves as a standard measure of divergence between observed and modeled probabilities. Minimizing it improves the model’s ability to describe how features are distributed in the data.

Formally, cross-entropy is defined as

(6) $$ \begin{align} H(\mathcal{P}_{\text{data}}, \mathcal{P}_{\text{model}}) &= - \mathbb{E}_{\mathcal{P}_{\text{data}}} \left[ \log \mathcal{P}_{\text{model}}(x) \right]\nonumber\\&= - \sum_{x} \mathcal{P}_{\text{data}}(x) \log \mathcal{P}_{\text{model}}(x). \end{align} $$

We adopt a Bernoulli distribution-based model to characterize cluster assignments, as it provides a simple yet effective representation of feature occurrences in categorical data. This assumption is particularly relevant for textual analysis, where features – such as word occurrences or syntactic markers – tend to follow distinct distributions in formulaic and non-formulaic clusters. Given a soft clustering assignment, we estimate the model probability of a given word empirically by computing its weighted (i.e., soft) frequency within a cluster.

Layman explanation: The Bernoulli model treats each feature as either present or absent in a given segment of text. This is well-suited to binary or count-based feature representations. We estimate the frequency of each feature within a cluster by aggregating across samples, weighted by how strongly they belong to the cluster.

Specifically, the soft probability of the jth word appearing in a given cluster is estimated as

(7) $$ \begin{align} f(w_j | \vec{s}) = \frac{\sum_{i=1}^{n} s_i x_{ij}}{\sum_{i=1}^{n} s_i}, \quad p(w_j | \vec{s}) = \frac{f(w_j | \vec{s})}{\sum_{w' \in \chi} f(w' | \vec{s})}, \end{align} $$

where $\chi $ is the set of unique categorical features satisfying $w \in \chi $ , and $p(w | \vec {s})$ represents the probability of category w given a weight sequence $\vec {s}$ . The variable $x_i$ denotes the feature vector of the ith sample. Under the Bernoulli assumption, $x_{ij} = 1$ if the jth feature is present in the sample and $x_{ij} = 0$ otherwise. In the cumulative case, $x_{ij}$ instead represents the number of times the jth feature appears in the sample. This formulation allows us to compute empirical word distributions within each cluster while incorporating soft clustering assignments as weighting factors.

Layman explanation: This step produces a probability distribution over features for each cluster. These distributions capture what makes each cluster statistically distinct, based on how often specific features appear in text segments that are likely to belong to that cluster.
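
The following numpy sketch illustrates Eq. (7) on a toy binary sample-by-feature matrix; the matrix and weights are invented for illustration and do not come from this study.

import numpy as np

X = np.array([[1, 0, 1],
              [1, 0, 0],
              [0, 1, 1]], dtype=float)   # n = 3 samples, d = 3 features
s = np.array([0.9, 0.8, 0.1])            # soft membership in the candidate cluster

f = (s @ X) / s.sum()                    # weighted frequency f(w_j | s)
p = f / f.sum()                          # normalized probability p(w_j | s)
print(f, p)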

With these estimated word probabilities, we can compute the likelihood of a complete message (e.g., a verse) given a cluster. Under the Bernoulli assumption, the probability of a sample $x_i$ conditioned on its cluster assignment is

(8) $$ \begin{align} P(x_i | \vec{s}) = \prod_{j=1}^{m} p(w_j | \vec{s})^{x_{ij}} (1 - p(w_j | \vec{s}))^{1 - x_{ij}}, \end{align} $$
(9) $$ \begin{align} P(x_i | 1 - \vec{s}) = \prod_{j=1}^{m} p(w_j | 1 - \vec{s})^{x_{ij}} (1 - p(w_j | 1 - \vec{s}))^{1 - x_{ij}}. \end{align} $$

Here, $p(w_j | \vec {s})$ and $p(w_j | 1 - \vec {s})$ represent the soft Bernoulli probabilities of the jth feature being present in each cluster. These probabilities are derived directly from the empirical word distributions calculated in Eq. (7).

Layman explanation: These expressions compute how likely a full set of features is for a given sample under each cluster model. If a text segment contains many features that are common in one cluster but rare in the other, the value will reflect that asymmetry.

Given this formulation, our score function naturally takes the form of the negative log-likelihood under a Bernoulli mixture model:

(10) $$ \begin{align} S(\vec{s}, X) = -\sum_{i=1}^{n} \left( s_i \cdot \log P(x_i | \vec{s}) + (1 - s_i) \cdot \log P(x_i | 1 - \vec{s}) \right), \end{align} $$

where $\vec {s} \in \mathbb {R}^n$ is the weight vector representing the soft assignment of samples to clusters, and $X \in \mathbb {R}^{n \times d}$ is the binary feature matrix of n samples with d features.

Layman explanation: The score function evaluates how well the current cluster assignments explain the data. The algorithm updates these assignments to minimize the score, which corresponds to improving the model’s alignment with the observed feature patterns.
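
A hedged Python sketch of Eqs. (8)–(10) is given below; the probability clipping is our own numerical safeguard and is not prescribed by the text.

import numpy as np

def cluster_probs(X, s, eps=1e-9):
    # soft feature probabilities p(w_j | s), Eq. (7); clipping avoids log(0)
    f = (s @ X) / s.sum()
    return np.clip(f / f.sum(), eps, 1 - eps)

def bernoulli_loglik(X, p):
    # log P(x_i | cluster) for every sample, Eqs. (8)-(9)
    return X @ np.log(p) + (1 - X) @ np.log(1 - p)

def score(s, X):
    # Eq. (10): soft-weighted negative log-likelihood under both clusters
    ll1 = bernoulli_loglik(X, cluster_probs(X, s))
    ll2 = bernoulli_loglik(X, cluster_probs(X, 1 - s))
    return -np.sum(s * ll1 + (1 - s) * ll2)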

To accommodate non-binary data, we extend the Bernoulli model to a Binomial formulation, where the number of trials per sample is determined by the total feature occurrences in that sample:

(11) $$ \begin{align} P(x_i | \vec{s}) = \prod_{j=1}^{m} \binom{n_i}{x_{ij}} p(w_j | \vec{s})^{x_{ij}} (1 - p(w_j | \vec{s}))^{n_i - x_{ij}}, \end{align} $$
(12) $$ \begin{align} P(x_i | 1 - \vec{s}) = \prod_{j=1}^{m} \binom{n_i}{x_{ij}} p(w_j | 1 - \vec{s})^{x_{ij}} (1 - p(w_j | 1 - \vec{s}))^{n_i - x_{ij}}, \end{align} $$

where $n_i = \sum _{j=1}^{m} x_{ij}$ represents the total count of present features in sample $x_i$. This ensures that the likelihood formulation remains valid when dealing with count-based or real-valued feature representations.

Layman explanation: When features can appear multiple times in a sample – for example, when counting how many times a word appears in a text unit rather than merely whether it appears, as in the binary case – the Binomial model accounts for these counts. This extension preserves the logic of the model while improving its fit to data with elements that are counted repeatedly.

Finally, the soft weighted log-likelihood of the dataset under the Binomial mixture model, which serves as the basis for our clustering optimization, is given by

(13) $$ \begin{align} \log P(X | \vec{s}) = \sum_{i=1}^{n} \sum_{j=1}^{m} \left[ \log \binom{n_i}{x_{ij}} + x_{ij} \log p(w_j | \vec{s}) + (n_i - x_{ij}) \log (1 - p(w_j | \vec{s})) \right]. \end{align} $$

This formulation connects back to the cross-entropy minimization framework introduced in Eq. (6), ensuring that cluster assignments reflect the probabilistic structure of the data.

Layman explanation: This final expression combines all the elements described above into a unified scoring objective. The clustering process then proceeds by adjusting the sample weights to find the configuration that most efficiently captures the structural differences in the data.
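
The Binomial extension of Eqs. (11)–(13) can be sketched as follows; the counts and weights are illustrative, and the use of scipy's binom.logpmf is our convenience choice rather than a detail specified by the study.

import numpy as np
from scipy.stats import binom

def binomial_loglik(X_counts, s, eps=1e-9):
    f = (s @ X_counts) / s.sum()
    p = np.clip(f / f.sum(), eps, 1 - eps)          # p(w_j | s), Eq. (7)
    n_i = X_counts.sum(axis=1, keepdims=True)       # trials per sample, n_i
    # per-sample sum over features of log Binom(x_ij; n_i, p_j), Eqs. (11)-(13)
    return binom.logpmf(X_counts, n_i, p).sum(axis=1)

X_counts = np.array([[3, 0, 1], [2, 1, 0], [0, 4, 2]])
s = np.array([0.9, 0.7, 0.2])
print(binomial_loglik(X_counts, s))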

While this formulation provides a principled probabilistic basis for clustering, it does not explicitly account for competing structural signals that may emerge from different generative processes within the data. However, assessing the presence and dominance of such alternative signals is beyond the scope of this work, as our primary focus is on the probabilistic clustering framework itself. That said, the relative influence of entropy-based clustering versus clustering driven by feature distributions is inherently dependent on dataset-specific parameters, such as the baseline activation probability of features and the presence of structured formulaic dimensions. A heuristic analysis of this interplay is provided in Appendix C, where we illustrate how these factors affect clustering outcomes and discuss the need for future work to develop a predictive framework for determining when each signal will dominate.

Clustering formalism for continuous multivariate Gaussian data

In the case of continuous data, applying a Bernoulli model is not feasible, as such data does not inherently exhibit discrete binary properties. Instead, we adopt an information-based formalization, which provides a natural and mathematically elegant means of characterizing the structure of high-dimensional continuous distributions. Specifically, for data modeled as multivariate Gaussians, entropy offers a well-parameterized measure of dispersion and structure, making it a suitable alternative for defining soft cluster assignments.

Layman explanation: In the discrete case, each text segment is represented by binary indicators, capturing the presence or absence of individual features (e.g., words or n-grams) in a text unit. However, this representation does not extend to settings where features are continuous-valued, such as neural embeddings or normalized frequency vectors. In these cases, one must model how the values themselves vary across samples rather than simply whether a feature occurs. The multivariate Gaussian distribution provides a natural and mathematically grounded way to describe such continuous variation. Within this framework, entropy reflects how concentrated or dispersed the distribution is, offering a principled measure of structural regularity in continuous feature space.

Formally, in a d-dimensional space, the entropy of a multivariate Gaussian distribution is given by

(14) $$ \begin{align} H(X) = \frac{d}{2}(1 + \ln(2\pi)) + \frac{1}{2} \ln |\Sigma|, \end{align} $$

where $\Sigma $ is the covariance matrix, and $|\Sigma |$ denotes its determinant.

To incorporate weights in our soft clustering framework, we define a weighted mean and covariance matrix:

(15) $$ \begin{align} \mu_w = \frac{\sum_{i=1}^n s_i x_i}{\sum_{i=1}^n s_i}, \quad \Sigma_w = \frac{\sum_{i=1}^n s_i (x_i - \mu_w)(x_i - \mu_w)^\top}{\sum_{i=1}^n s_i}, \end{align} $$

where $\{s_i\}$ are the weights assigned to samples. To enhance numerical stability in cases where the covariance matrix may be ill-conditioned, a regularization term is applied:

(16) $$ \begin{align} \Sigma_w \gets \Sigma_w + \epsilon I, \end{align} $$

where $\epsilon $ is a small positive constant, and I is the identity matrix of rank d. Using this regularized covariance, the total entropy of the weighted dataset is computed as

(17) $$ \begin{align} H(X) = \frac{d}{2}(1 + \ln(2\pi)) + \frac{1}{2} \ln |\Sigma_w|. \end{align} $$
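
Eqs. (15)–(17) translate into a short numpy routine; the sketch below is illustrative, with the regularization constant chosen arbitrarily.

import numpy as np

def weighted_gaussian_entropy(X, s, eps=1e-6):
    d = X.shape[1]
    mu = (s @ X) / s.sum()                                   # weighted mean, Eq. (15)
    Xc = X - mu
    cov = (Xc.T * s) @ Xc / s.sum()                          # weighted covariance, Eq. (15)
    cov += eps * np.eye(d)                                   # regularization, Eq. (16)
    sign, logdet = np.linalg.slogdet(cov)
    return 0.5 * d * (1 + np.log(2 * np.pi)) + 0.5 * logdet  # entropy, Eq. (17)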

This expression provides a global measure of the dataset’s uncertainty based on its covariance structure. However, in the context of soft clustering, we require a measure that quantifies the contribution of individual samples to the overall uncertainty. Since entropy characterizes unpredictability, a natural way to approximate the uncertainty associated with a specific sample $x_{i'}$ is by considering its likelihood under the dataset’s distribution.

For a multivariate Gaussian, the log-likelihood of a sample is

(18) $$ \begin{align} \log P(x_{i'}|\vec{s}) = -\frac{d}{2} \log(2\pi) - \frac{1}{2} \log |\Sigma_w| - \frac{1}{2} (x_{i'} - \mu_w)^\top \Sigma_w^{-1} (x_{i'} - \mu_w). \end{align} $$

This decomposition separates the likelihood into three terms: (i) a normalization constant, (ii) a term encoding the dataset’s covariance structure, and (iii) a term measuring a sample’s deviation from the mean under the Mahalanobis metric. Subsequently, the self-information of a sample is

(19) $$ \begin{align} I(x_{i'}|\vec{s}) = -\log p(x_{i'}) = \frac{d}{2} \log(2\pi) + \frac{1}{2} \log |\Sigma_w| + \frac{1}{2} D_M(x_{i'}|\vec{s}), \end{align} $$

where $D_M(x_{i'}|\vec {s}) = (x_{i'} - \mu _w)^\top \Sigma _w^{-1} (x_{i'} - \mu _w)$ is the Mahalanobis distance given the weighting by $\vec {s}$ (see Appendix A).

Layman explanation: A sample’s self-information increases the further it lies from the mean of the data, especially if it does so along directions where the data varies less. This distance, measured by the Mahalanobis metric, provides a principled way of quantifying how typical or atypical a given point is relative to the distribution of other points.

This equation directly relates a sample’s entropy contribution to its likelihood under the dataset’s distribution.

The Mahalanobis distance quantifies how far a sample deviates from the mean, making it a natural proxy for self-information. Samples close to $\mu _w$ have higher likelihoods and lower entropy, while those further away contribute greater uncertainty. This formulation provides a probabilistic approach to clustering, where samples are distinguished based on their deviation from the global covariance structure.
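
A minimal sketch of the per-sample self-information in Eq. (19) follows; computing the Mahalanobis term via a linear solve rather than an explicit matrix inverse is an implementation choice of ours, not a claim about the authors' code.

import numpy as np

def self_information(X, s, eps=1e-6):
    n, d = X.shape
    mu = (s @ X) / s.sum()
    Xc = X - mu
    cov = (Xc.T * s) @ Xc / s.sum() + eps * np.eye(d)        # weighted, regularized covariance
    _, logdet = np.linalg.slogdet(cov)
    # Mahalanobis distance D_M(x_i | s) for every sample, via a linear solve
    maha = np.einsum("ij,ij->i", Xc, np.linalg.solve(cov, Xc.T).T)
    return 0.5 * d * np.log(2 * np.pi) + 0.5 * logdet + 0.5 * maha   # Eq. (19)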

Score function for cluster identification: To identify structured (formulaic) clusters, we define a score function that optimizes cluster coherence while ensuring a sufficient number of samples:

(20) $$ \begin{align} S(I(X|\vec{s})) = \text{std}(I(X|\vec{s})) \cdot \text{avg}(I(X|\vec{s})) \cdot \frac{1}{||\vec{s}||}, \end{align} $$

where $I(X|\vec {s})$ is the soft distribution of self-information values (negative log-likelihood), and $\vec {s}$ is the weight vector. The terms $\text {std}$ and $\text {avg}$ represent the empirical standard deviation and mean of $I(X|\vec {s})$ , respectively.

Layman explanation: The score function favors clusters that are internally consistent (low variability), strongly predictable (low average self-information), and include enough samples to be meaningful. These three criteria together help isolate regions of the data that exhibit formulaic regularity.

This score combines the standard deviation, the mean and the inverse of the sample weights’ norm to quantify cluster coherence by measuring variability and predictability. The optimization minimizes variance while maintaining sufficient samples, distinguishing the formulaic cluster. Here, several differences from cross-entropy-based clustering (see the “Clustering Formalism for Discrete Categorical Data” section) are noteworthy: (1) Gaussian-based cross-entropy (e.g., Tabor and Spurek 2014) relies on full covariance estimation, which becomes unstable when the sample size is small relative to the feature dimensionality, leading to singular covariance matrices. Our approach mitigates this issue by refining the score function to enhance the separability of the formulaic cluster. By focusing on isolating this cluster, we effectively reduce reliance on cross-cluster covariance terms, which are typically the most ill-conditioned. This ensures that covariance estimation is primarily constrained to a more compact and homogeneous subset of the data, where the conditioning of the covariance matrix remains more stable. (2) In Gaussian data, formulaic clusters have low self-information due to high predictability. Low entropy in these clusters corresponds to high likelihood, yielding lower negative log-likelihood. In contrast, in discrete categorical data, formulaic clusters are associated with higher self-information due to sparse but predictable occurrences of categories. Thus, Gaussian formulaic clusters exhibit low self-information, while discrete ones exhibit high self-information.
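
The score in Eq. (20) then reduces to a one-line combination of these quantities; the sketch below assumes the self_information helper from the previous sketch.

import numpy as np

def continuous_score(s, X):
    I = self_information(X, s)                        # per-sample self-information, Eq. (19)
    return I.std() * I.mean() * (1.0 / np.linalg.norm(s))   # Eq. (20)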

Optimization scheme

The algorithm optimizes weights $\vec {s} \in \mathbb {R}^n$ to isolate the self-information distribution of the structured or formulaic layer within the data. The optimization scheme is summarized in Algorithm 1. While our method is flexible and conceptually transparent, it is also more computationally intensive than conventional iterative clustering algorithms, such as k-means or expectation–maximization (EM). Those methods benefit from closed-form updates for centroids or covariance matrices, leading to fast convergence in low-dimensional, well-behaved data. In contrast, our approach involves nonlinear optimization over a continuous weight vector. This added cost reflects the complexity of modeling self-information and structural predictability, rather than geometric proximity or global variance. In scenarios where sample sizes are large and feature dimensionality is low, iterative methods may indeed be more efficient and comparably effective. However, this is rarely the case in historical or literary corpora, which tend to exhibit sparse, high-dimensional feature spaces and relatively small sample counts. It is precisely these conditions that motivated the design of an information-theoretic clustering approach tailored to such data.
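
Since Algorithm 1 is not reproduced here, the following hedged sketch illustrates one way such an optimization could be set up: continuous weights in [0, 1] are optimized against a score function (the discrete score of Eq. (10) or the continuous score of Eq. (20)) and then thresholded into hard labels. The choice of L-BFGS-B, the random restarts and the 0.5 threshold are our own illustrative assumptions, not prescriptions of the paper.

import numpy as np
from scipy.optimize import minimize

def cluster(X, score_fn, n_restarts=5, seed=0):
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):                      # random restarts for robustness
        s0 = rng.uniform(0.25, 0.75, size=len(X))    # initial soft assignments
        res = minimize(score_fn, s0, args=(X,),
                       method="L-BFGS-B", bounds=[(0.0, 1.0)] * len(X))
        if best is None or res.fun < best.fun:
            best = res
    return (best.x > 0.5).astype(int), best.x        # hard labels, soft weights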

Benchmarking

Clustering benchmarking on categorical data

To evaluate the resolving power of our algorithm on one-hot encoded data, we design an experiment that generates synthetic datasets with features reflective of textual embeddings. These datasets simulate challenges, such as sparsity, feature dependencies and formulaic activation patterns.

The dataset is generated as follows: For each cluster, we create n samples with dimensionality d, where features are binary, representing the activation (1) or absence (0) of specific elements. One cluster, referred to as the uniform cluster, has an equal small activation probability p across all dimensions, resulting in sparse, randomly distributed features. The other cluster, referred to as the formulaic cluster, increases the activation probability for a subset of $d_{\mathrm {{form}}}$ dimensions by a bias factor f, while the remaining dimensions maintain the uniform activation probability, such that $p_{\text {form}} = p + f$ .

To introduce feature interdependencies within the formulaic cluster, we enforce correlations between pairs of dimensions. Specifically, for each sample in the formulaic cluster, m dimensions are randomly selected from the formulaic subset, and the activation state of one dimension is set to match the other. This process ensures that certain features are not independent but instead exhibit a structured co-occurrence pattern. Such interdependencies mimic the stylistic or structural dependencies commonly found in textual data, where the presence of one linguistic feature often implies the presence of another.

However, it is important to acknowledge the inherent difficulty of modeling textual data. Texts are complex, with nuanced structures and dependencies that cannot be fully captured by simplistic binary features or synthetic generation processes. Despite these limitations, the purpose of this benchmark is not to replicate textual data exactly but to demonstrate the ability of our algorithm to solve a difficult nonlinear problem – optimizing the distinction between sparse, discrete datasets that differ in entropy. This serves as proof of the algorithm’s robustness in handling challenging clustering problems.
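
A hedged reconstruction of this generator is sketched below; parameter names follow the text (p, f, the formulaic subset and the pairing parameter m), while the specific values and the exact pairing scheme are illustrative assumptions.

import numpy as np

def make_dataset(n=100, d=50, p=0.05, f=0.3, d_form=10, m=3, seed=0):
    rng = np.random.default_rng(seed)
    uniform = (rng.random((n, d)) < p).astype(int)            # uniform cluster
    probs = np.full(d, p)
    probs[:d_form] = p + f                                    # biased formulaic dimensions
    formulaic = (rng.random((n, d)) < probs).astype(int)
    for row in formulaic:                                     # enforce co-occurrence structure
        pairs = rng.choice(d_form, size=(m, 2), replace=True)
        for a, b in pairs:
            row[b] = row[a]                                   # copy one dimension's state to another
    X = np.vstack([uniform, formulaic])
    y = np.array([0] * n + [1] * n)
    return X, y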

We consider three nominal clustering algorithms that represent different clustering paradigms: (1) GMM, which assumes data is generated from a mixture of Gaussian distributions, making it suitable for probabilistic clustering in continuous spaces, (2) k-means (Hartigan and Wong 1979), which partitions data into clusters based on minimizing intra-cluster variance, making it effective for Euclidean distance-based separation, and (3) DBSCAN (Ester et al. 1996), which identifies clusters as dense regions of points separated by low-density areas, making it robust to arbitrary-shaped clusters but sensitive to parameter tuning. The results of this experiment, shown in Figure 1, demonstrate the superior performance of our algorithm in resolving formulaic clusters in sparse categorical encoded datasets whose sample-to-dimension ratio is small.

Figure 1. Classification results for the benchmarking experiment on discrete categorical one-hot encoded data described in the “Clustering Benchmarking on Categorical Data” section. The test datasets included 100 samples of (equally sized) formulaic and non-formulaic classes, of 200 (top panel), 50 (middle panel) and 20 (bottom panel) dimensions, with varying base (p) and formulaic ($p_{\text {form}}$) feature-activation probabilities, and varying fractions of formulaic dimensions in the formulaic class. The colored areas represent one-standard-deviation intervals derived from 100 simulations.

The classification accuracy is measured using the Matthews correlation coefficient (MCC), given by

(21) $$ \begin{align} \text{MCC} = \frac{\text{TP} \times \text{TN} - \text{FP} \times \text{FN}}{\sqrt{(\text{TP} + \text{FP})(\text{TP} + \text{FN})(\text{TN} + \text{FP})(\text{TN} + \text{FN})}}, \end{align} $$

where T (F) stands for true (false) and P (N) stands for positive (negative), evaluated by comparing the predicted label sequence against the ground-truth labels. We subsequently normalize the original metric, which spans $\text {MCC}\in [-1, 1]$ , to a percentage within the range of 50%–100%, where 50% indicates an arbitrary overlap between the output and true label sequences, and 100% indicates perfect overlap.
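
For illustration, the metric can be computed as follows; this is a sketch, in which the rescaling formula is our reading of the text and the label-flip step reflects the fact that cluster labels are interchangeable.

import numpy as np
from sklearn.metrics import matthews_corrcoef

def mcc_percent(y_true, y_pred):
    y_pred = np.asarray(y_pred)
    mcc = matthews_corrcoef(y_true, y_pred)
    # cluster labels are interchangeable, so keep the better orientation
    mcc = max(mcc, matthews_corrcoef(y_true, 1 - y_pred))
    return 50.0 + 50.0 * mcc                    # maps MCC in [0, 1] onto 50%-100%

print(mcc_percent([0, 0, 1, 1], [0, 1, 1, 1]))  # about 78.9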

As shown in Figure 1, our method consistently outperforms traditional clustering approaches in the sparse low sample-size-to-dimensions ratio case by effectively distinguishing structured formulaic patterns even in highly sparse discrete encoded data.

Clustering benchmarking on multivariate Gaussian data

We perform a series of classification experiments to demonstrate the classification power of our algorithm compared to well-established classification routines designed for multivariate Gaussian data. These methods are: (1) Differential-Entropy (Davis and Dhillon 2006), (2) Gaussian Mixture EM (GMM), (3) Regularized EM (Reg. EM) (Houdouin, Ollila, and Pascal 2023), and (4) CEC (Tabor and Spurek 2014). Additionally, we test the resolving power of the $L_2$ norm distribution, incorporating it into an optimization scheme similar to the one described for the self-information distribution; the $L_2$ norm is relevant in high-dimensional settings, where the concentration of measure ensures that it becomes a distinguishing feature (independent of the number of samples).

We consider a baseline number of dimensions of $d = 50$ and evaluate performance across varying sample sizes per class. Each class is drawn from a multivariate Gaussian distribution, with both distributions centered at zero and defined by covariance matrices $\Sigma _1$ and $\Sigma _2$ . The eigenvalues of these covariance matrices are set to be uniformly $\lambda _1 = 10$ and $\lambda _2 = 30$ , respectively.

To introduce noise, we modify several eigenvalues of $\Sigma _1$ to match those of $\Sigma _2$ . The noise level is defined as the relative proportion of eigenvalues in $\Sigma _1$ replaced with corresponding eigenvalues of $\Sigma _2$ . In Figure 2, we present the results of these experiments, showcasing the classification performance of each method under varying sample sizes, numbers of dimensions, and noise levels.
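
A hedged sketch of this data-generation procedure follows; the diagonal eigenvalue construction is our simplification of the described setup, and the noise fraction is illustrative.

import numpy as np

def make_gaussian_classes(n=100, d=50, lam1=10.0, lam2=30.0, noise=0.2, seed=0):
    rng = np.random.default_rng(seed)
    eig1 = np.full(d, lam1)
    eig1[: int(noise * d)] = lam2          # replace a fraction of Sigma_1's eigenvalues with Sigma_2's
    eig2 = np.full(d, lam2)
    X1 = rng.multivariate_normal(np.zeros(d), np.diag(eig1), size=n)
    X2 = rng.multivariate_normal(np.zeros(d), np.diag(eig2), size=n)
    return np.vstack([X1, X2]), np.array([0] * n + [1] * n)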

Figure 2. Classification results of the experiment described in the “Clustering Benchmarking on Multivariate Gaussian Data” section, for varying sample sizes and numbers of dimensions of multivariate Gaussian classes of varying entropy. Upper panel: Varying sample sizes for $d = 50$ . Bottom panel: Varying sample sizes for $d = 10$ . The colored areas represent one-standard-deviation intervals, derived from 100 simulations.

We find that our classification approach benefits both from the sample size and the number of dimensions and exhibits a considerably weaker sensitivity to a small sample-size-to-dimensions ratio than other classification routines that leverage the multivariate Gaussian assumption. This is because estimating covariance matrices in high-dimensional, small-sample settings is ill-conditioned, leading to unstable classification. Our method circumvents this issue by leveraging auto-entropy distributions rather than direct covariance estimation. Specifically, as shown in Appendix A, self-information-based classification avoids reliance on explicit covariance inversion by computing the soft auto-entropy of individual samples with respect to the weighted dataset entropy.

This formulation improves numerical stability and mitigates singularity issues that arise when the sample size is small relative to the number of dimensions, as illustrated in Figure 2.

Application to textual data: The priestly source(s) in the Pentateuch

The composition of the Pentateuch, or Torah, has been a longstanding subject of critical scholarship, with various theories proposed to account for its linguistic diversity, stylistic inconsistencies, narrative discrepancies, and theological tensions. One of the most widely discussed models is the Documentary Hypothesis, which suggests that the Torah is a composite text formed through the combination of multiple sources, each reflecting different historical, ideological, or scribal traditions (Holzinger 1893; Kuenen and Wicksteed 1886; Wellhausen 1885). Among the sources reconstructed by this framework are the Yahwist (J), the Elohist (E), the Deuteronomist (D), and the Priestly (P) source. While the precise delineation of these sources remains debated, with alternative models proposing different groupings or rejecting the hypothesis altogether (Albertz 2018), the existence of the P material is still widely acknowledged, owing to its distinctive style and content and relative internal consistency (e.g., Albertz 2018; Haran 1981; Holzinger 1893; Kuenen and Wicksteed 1886). The texts typically classified as P exhibit certain recurring characteristics, including highly structured formulations, genealogical lists, legal material and a focus on ritual and cultic concerns (e.g., Albertz 2018; Vink 1969). Scholars have further proposed that P itself is not monolithic but contains sub-strata, most notably distinguishing between the preceding (or succeeding) Priestly Source (P) and a later Holiness School (H) (e.g., Ross 1997), which is thought to emphasize purity laws and ethical concerns (e.g., Klostermann 1907; Knohl 2007). However, the extent to which these strata represent distinct compositional layers rather than later redactional activity is an ongoing debate, and no consensus exists regarding their precise historical development. Given these complexities, our study does not assume the validity of any particular compositional model but instead seeks to evaluate whether quantitative, entropy-based methods can identify distinct stylistic patterns in the Pentateuch better than previous computerized characterization attempts (Dershowitz et al. 2015; Faigenbaum-Golovin et al. 2025; Yoffe et al. 2023). If such patterns emerge, this may lend support to certain divisions proposed in source-critical scholarship. If they do not, it may suggest that the distinction between the various traditions, particularly between P and non-P, or P and H, is more fluid than often assumed. We selected P as the main test case for our clustering algorithm due to its highly structured and repetitive nature, which contrasts with the more vivid and colorful style of other Pentateuchal traditions. Because our approach leverages entropy-based clustering to detect formulaic patterns, it provides an opportunity to evaluate whether the hypothesized Priestly texts exhibit distinct formulaic stylistic tendencies that set them apart from surrounding material.
Additionally, if a similar distinction between P and H exists (as proposed by Knohl (2007)), our method may provide an independent and complementary means of quantifying and assessing it. Notably, there is a broad literary consensus that P constitutes a stylistically distinct body of text, often considered a literary outlier due to its rigid structure, formulaic expressions and characteristic lexical choices (e.g., Bühler et al. 2024; Smith 1996). Here, we consider three hypothesized partitions of the three books in the Pentateuch for evaluation:

  • Genesis: We test the distinction between genealogical lists and the surrounding narrative material. The great majority of the lists are attributed to P due to their stereotyped phrasing (e.g., the Toledot formula) and emphasis on lineage continuity (e.g., Long, Tengström, and Tengstrom Reference Long, Tengström and Tengstrom1984; Plum Reference Plum1989). However, a few texts belong to non-P. Hence, the focus in this case is on matters of genre and form in the light of entropy rather than source division. By isolating these passages, we assess whether their self-information properties diverge significantly from the surrounding material. In a second stage, the possibility of an internal division between P and non-P within the genealogical lists remains to be examined.

  • Exodus: We analyze the division between Priestly (P) and non-Priestly (non-P) material. Priestly passages in Exodus largely consist of legal and cultic instructions, particularly concerning the construction of the Tabernacle, whereas non-Priestly sections retain a more narrative-driven style (Römer Reference Römer, Shectman and Baden2009; Schmid Reference Schmid, Levy, Schneider and Propp2015). This partition has been previously explored (Bühler et al. Reference Bühler, Yoffe, Römer, Sober, Finkelstein, Piasetzky and Dershowitz2024; Yoffe et al. Reference Yoffe, Bühler, Dershowitz, Römer, Piasetzky, Finkelstein and Sober2023), making it a useful benchmark for gauging the performance of our method against previous results.

  • Leviticus: We investigate whether a distinction can be quantitatively detected between the Priestly (P) and Holiness (H) materials. While both exhibit high degrees of formulaicity, the Holiness texts are often argued to introduce distinctive stylistic and theological elements (Knohl Reference Knohl2007). If our algorithm can recover a meaningful distinction, this may provide additional empirical support for the hypothesized P/H division.

Each of these test cases represents a distinct type of contrast in structural predictability. In Genesis, the distinction between genealogical lists and surrounding narrative material constitutes the most pronounced formulaicity contrast, as the genealogies are highly repetitive and rigid in structure. In Exodus, the contrast between Priestly and non-Priestly material also exhibits a clear difference, particularly in sections related to cultic legislation and Tabernacle construction, which display strongly regularized patterns (Bühler et al. Reference Bühler, Yoffe, Römer, Sober, Finkelstein, Piasetzky and Dershowitz2024; Yoffe et al. Reference Yoffe, Bühler, Dershowitz, Römer, Piasetzky, Finkelstein and Sober2023). The P/H distinction in Leviticus had not previously been analyzed using computerized methods. Its inclusion here allows us to assess whether subtle stylistic differences, such as those associated with the Holiness corpus, manifest in statistically measurable differences in self-information. Together, these examples are chosen to gauge the method’s ability to quantify varying degrees of formulaicity across multiple strata of the biblical text.

Experimental setup

Digital biblical corpus and annotation by experts

We use a digital corpus of the Masoretic variant of the Hebrew Bible in (biblical) Hebrew, which is a version of the Leningrad Codex made freely available by STEPBible.Footnote 2 This dataset comes parsed with full morphological and semantic tags for all words, prefixes and suffixes. In this work, we consider only the morphological (i.e., grammatical) representation of the text, as word-based representations may introduce additional variability due to synonymy and context-dependent meanings. By focusing on morphological features, we ensure greater consistency in capturing structural patterns relevant to authorship and textual stratification.

For each hypothesized partition, we obtain expert annotations provided by biblical scholars specializing in source-critical analysis. These annotations serve as reference classifications, delineating the hypothesized textual divisions. The expert-labeled datasets allow us to evaluate the extent to which our clustering algorithm aligns with established scholarly hypotheses and assess its ability to recover traditionally proposed literary strata in an unsupervised manner.

While this study primarily employs morphological features, due to their interpretive clarity and availability in the STEPBible corpus, the method is not restricted to such representations. In settings where morphological annotation is unavailable, surface word forms can also be used. However, surface-level representations present certain challenges when evaluating formulaicity, as variation in inflection, cliticization or syntactic context may obscure structural repetition. Morphological normalization helps reduce this variability, revealing underlying patterns of grammatical constraint. Nonetheless, there are textual domains where surface-level lexical repetition may be more meaningful.

Embedding

We consider a cumulative one-hot-encoded (count-based) representation of the corpus $D$, such that $D \in \mathbb{N}^{n \times |\chi|}$, where $n$ is the number of text units, $\chi$ is the set of all unique n-grams in the corpus and $d_{ij}$ represents the number of occurrences of the $j$th feature in the $i$th text unit.

We consider a parameter space of embedding possibilities whose permutations we fully explore. Specifically, each book is embedded according to a combination of several parameters across a grid of all permutations thereof: $n$ and $\ell$, spanning the following ranges: $n \in \{1, 2, 3, 4, 5\}$ for word n-grams, and $\ell \in \{2, 3, 4, 6, 8, 10, 12, 14, 18, 22, 24, 26, 28\}$, which defines the number of (overlapping) consecutive verses over which features are aggregated. This windowing approach allows the model to capture local textual dependencies by smoothing over short-range fluctuations, helping to account for stylistic continuity while preserving meaningful segment-level distinctions. To further balance feature richness and sparsity, we restrict our analysis to the $f \in \{100, 300, 500, \text{all}\}$ most frequent features in each embedding configuration. This ensures that selected features are statistically meaningful while preventing noise from rare or idiosyncratic terms that may introduce spurious clustering patterns. By systematically varying the feature set size, we evaluate the robustness of our method across different levels of vocabulary complexity.
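For concreteness, the following is a minimal sketch of how such a windowed n-gram count embedding could be assembled in plain Python; the function and parameter names are illustrative rather than taken from our released implementation.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams (as tuples) over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def windowed_counts(verses, n=2, ell=4, f=300):
    """Count n-grams over overlapping windows of `ell` consecutive verses.

    `verses` is a list of token lists (e.g., morphological tags per verse).
    Returns a (windows x features) count matrix and the retained feature list.
    """
    # One Counter per overlapping window of ell verses (stride 1).
    window_counters = []
    for start in range(len(verses) - ell + 1):
        tokens = [tok for verse in verses[start:start + ell] for tok in verse]
        window_counters.append(Counter(ngrams(tokens, n)))

    # Keep only the f most frequent n-grams across the whole book ("all" keeps everything).
    total = Counter()
    for c in window_counters:
        total.update(c)
    features = list(total) if f == "all" else [g for g, _ in total.most_common(f)]

    # Assemble the count matrix D (rows: windows, columns: retained n-grams).
    D = [[c.get(g, 0) for g in features] for c in window_counters]
    return D, features
```

A grid over all combinations of n, $\ell$ and f then amounts to calling such a routine once per configuration.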

This scope is motivated by the desire to exhaustively cover potential feature spaces while ensuring that the extracted features remain both meaningful and sufficiently frequent within the biblical corpus, thereby avoiding excessive sparsity due to the limited length of biblical texts (e.g., Antonia, Craig, and Elliott Reference Antonia, Craig and Elliott2014). Given the relatively small size of the corpus compared to modern datasets, this balance is particularly crucial in detecting meaningful linguistic patterns while maintaining statistical robustness.

We opt for traditional, count-based embeddings, operating under the Binomial score function (Eqs. 11 and 12), to ensure interpretability, particularly given the highly structured and constrained nature of biblical Hebrew. While modern deep learning techniques, such as transformer-based models, have revolutionized text representation in many languages (e.g., Shmidman et al. Reference Shmidman, Guedalia, Shmidman, Shmidman, Handel and Koppel2022), no robust pre-trained language model for biblical Hebrew currently exists. As a result, neural embeddings would require extensive domain-specific pretraining (e.g., Huertas-Tato, Martín, and Camacho Reference Huertas-Tato, Martín and Camacho2023), making their applicability to this study uncertain. By employing interpretable feature representations, we ensure that clustering results remain transparent and analyzable within the context of existing linguistic and source-critical scholarship.

Figure 3. Clustering results for the book of Genesis across different parameter combinations, evaluated against expert annotations distinguishing between the main textual body and genealogical lists traditionally attributed to P. Results are shown for our cross-information-based clustering method (left) and k-means (right). Top panel: The 20 feature combinations that yield the highest MCC scores, indicating the strongest agreement with expert annotations. Bottom panel: Distribution of MCC scores across all parameter combinations, sorted into discrete performance intervals.

To further clarify the effect of parameter settings, we note that varying the n-gram size directly affects both sparsity and entropy in the feature space. Larger n-grams tend to emphasize rigid phraseology and reduce overlap across segments, making them useful for identifying formulaic repetitions. Smaller n-grams increase coverage but may conflate stylistically distinct phrases. Likewise, broader running windows ( $\ell $ ) smooth over local variation and can help reveal segmental regularities, while narrower windows are more sensitive to fine-grained shifts. Feature count thresholds (f) balance interpretability and statistical stability by filtering out rare or idiosyncratic features. These parameters jointly determine which aspects of formulaicity are foregrounded and how they align with the internal segmentation of the text.
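As a toy illustration of how n-gram size affects sparsity (made-up tokens, not corpus data), one can observe how the number of distinct n-gram types grows, and the number of repeated types shrinks, as n increases:

```python
from collections import Counter

toy = "the priest shall burn it on the altar and the priest shall take it".split()
for n in range(1, 5):
    counts = Counter(tuple(toy[i:i + n]) for i in range(len(toy) - n + 1))
    repeated = sum(1 for c in counts.values() if c > 1)
    # Larger n: more distinct types, fewer repeated ones -> a sparser count matrix.
    print(n, len(counts), repeated)
```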

Comparison with previous work

There is very limited prior work on unsupervised clustering approaches for exploring hypothesized divisions of biblical texts. Most previous studies have relied on supervised classification methods, where textual divisions were predefined, and models were trained using manually curated feature sets (Dershowitz et al. Reference Dershowitz, Akiva, Koppel and Dershowitz2015; Radday and Shore Reference Radday and Shore1985). These approaches often rely on cherry-picked linguistic or stylistic markers that align with existing scholarly expectations, making the claim of statistical significance somewhat circular. Since the feature selection process is influenced by prior assumptions about textual divisions, such methods do not provide independent verification of whether distinct literary strata emerge naturally from the data.

Figure 4. Clustering results for the P/non-P partition in the book of Exodus, similar to Figure 3.

Figure 5. Clustering results for the P/H partition in the book of Leviticus, similar to Figure 3.

By avoiding reliance on predefined features and instead allowing clusters to emerge based on intrinsic textual properties, we aim to assess whether computational methods can independently recover traditionally hypothesized partitions. Due to the lack of prior unsupervised studies, we resort to comparing our results with previous supervised approaches, evaluating whether our method aligns with existing classifications or suggests alternative structuring. In addition, we employ k-means clustering, previously utilized in unsupervised clustering of biblical corpora (e.g., Yoffe et al. Reference Yoffe, Bühler, Dershowitz, Römer, Piasetzky, Finkelstein and Sober2023), as a baseline method to provide a comparative reference for our information-based approach. We also performed this analysis using GMM clustering for completeness, whose results are nearly identical to those of k-means (similarly to what was observed on synthetic data; see Figure 1) and can be found in Appendix D.
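For reference, the baselines can be run in a few lines; the sketch below assumes scikit-learn and a count matrix such as the one sketched above, with two clusters and illustrative settings (e.g., a diagonal GMM covariance) rather than the exact configurations used in the cited studies.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def baseline_labels(D, k=2, seed=0):
    """k-means and GMM baseline cluster assignments for a (windows x features) matrix."""
    X = np.asarray(D, dtype=float)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    gmm = GaussianMixture(n_components=k, covariance_type="diag", random_state=seed).fit(X)
    return km.labels_, gmm.predict(X)
```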

Results

The results of our clustering analysis for the Genesis, Exodus and Leviticus hypothesized partitions are presented in Figures 3–5, respectively. These figures compare the performance of our information-based clustering method with k-means in identifying the hypothesized partitions of the text. To assess clustering accuracy, we categorize MCC scores into discrete performance intervals: 50%–74%, 75%–84%, 85%–89%, 90%–95%, and 96%–100%, which provide a benchmark for evaluating the reliability of different parameter settings. In the case of the genealogical lists in Genesis and the P/non-P partition in Exodus, our approach resulted in a substantially larger number of parameter combinations yielding MCC scores within the two highest MCC intervals (85%–89%, 90%–95%) compared to k-means.
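Because cluster labels are arbitrary, agreement with the expert annotation is scored against the better of the two possible label assignments before binning. A minimal sketch of this evaluation, assuming scikit-learn and binary labels coded as 0/1, is:

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

def mcc_score(expert, predicted):
    """MCC between expert labels and cluster labels, maximized over label swapping."""
    expert, predicted = np.asarray(expert), np.asarray(predicted)
    return max(matthews_corrcoef(expert, predicted),
               matthews_corrcoef(expert, 1 - predicted))

def mcc_interval(score):
    """Bin an MCC score into the performance intervals reported in the figures."""
    pct = round(100 * score)
    for low, high in [(96, 100), (90, 95), (85, 89), (75, 84), (50, 74)]:
        if low <= pct <= high:
            return f"{low}%-{high}%"
    return "below 50%"
```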

While each individual parameter combination in our grid (defined by n, $\ell $ and f) may yield results of interpretive interest, the primary objective of this study is to introduce and validate the clustering framework rather than to evaluate any specific configuration in detail. Accordingly, we assess method performance by analyzing the distribution of MCC scores across the entire parameter space. If a textual partition genuinely reflects differences in structural predictability, then this signal should emerge robustly across a wide range of embedding configurations, not just under fine-tuned settings.

Conversely, when only a limited subset of parameter configurations yields strong agreement with a hypothesized division, this suggests that certain levels of segmentation and feature granularity are particularly well suited to capturing certain formulaic patterns present in the data. Rather than treating this as a sign of feature selection artifacts, we interpret such behavior as evidence that distinct formulaic patterns may emerge more clearly at specific scales of representation. Some clusters, especially those linked to genre, register or discourse framing, may only become statistically salient when, for example, the n-gram size and running window align with their internal structure. In this view, variation in clustering performance is not a sign of instability. Rather, it reflects both methodological sensitivity and the heterogeneous nature of formulaicity across textual layers.

This perspective is especially relevant for corpora where no predefined labels or gold standards exist. In such cases, exploring the consistency or divergence of outputs across parameter settings can help identify stable signals and locate boundaries where stylistic coherence shifts. Disagreements between parameter configurations are not necessarily failures of the method. Instead, they offer interpretive entry points into the structure of the data and the scale at which formulaic patterns operate.

Figure 6. Distinctive n-grams extracted from the formulaic cluster for two parameter combinations, capturing 30% of the variance (see Section 2.8 in Yoffe et al. (Reference Yoffe, Bühler, Dershowitz, Römer, Piasetzky, Finkelstein and Sober2023)). Left panel: Clustering of morphologically-represented Leviticus using $\ell = 28$ , $n = 4$ and $f = \text {all}$ , achieving an MCC score of 94%. Right panel: Similar to the left panel but with $\ell = 20$ , $n = 2$ and $f = 300$ , achieving an MCC score of 93%. n-grams discussed in the “Formulaic Structure and Parameter Sensitivity in the P/H Partition of Leviticus” section as examples for H- and P-associated features are outlined in red. The insets display the self-information distributions of both clusters, with blue and orange representing the non-formulaic and formulaic clusters, respectively.

In the case of the P/H partition in Leviticus, our method yielded more parameter combinations in the highest MCC interval by roughly a factor of three, but k-means yielded slightly more MCC scores in the highest two intervals overall. This can be attributed to the fact that not all parameter combinations produce partitions with strong auto-entropy contrasts. Unlike our information-based approach, which depends on isolating structural predictability differences, k-means clusters based on direct feature-space separations, allowing it to identify partitions even when distinctions arise from lexical or thematic variation rather than underlying structural consistency. Because k-means optimizes intra-cluster similarity without requiring an information-theoretic distinction, it remains more robust across a wider range of parameter settings, whereas information-based clustering excels only when the chosen parameters successfully capture fundamental differences in textual predictability. In the “Formulaic Structure and Parameter Sensitivity in the P/H Partition of Leviticus” section, we conduct an in-depth analysis of the results for the P/H partition in Leviticus and discuss their implications.

Formulaic structure and parameter sensitivity in the P/H partition of Leviticus

Among the three case studies, the Leviticus P/H division shows the most variation in performance across parameter settings. The contrasts between genealogical lists and narrative passages in Genesis, and between Priestly and non-Priestly material in Exodus, lead to many strong results clustered in the 85%–89% and 90%–95% MCC intervals. This suggests that the stylistic differences in those cases are captured consistently across a wide range of feature combinations. In contrast, the P/H division in Leviticus produces more extreme outcomes. It yields more high-performing results in the 90%–95% range than the k-means-based partitions, but also many more low-performing results below 75%. This behavior stands in contrast to the k-means baseline, which produces fewer strong results but also fewer weak ones. These patterns suggest that the formulaic differences between P and H are more sensitive to how the text is segmented and represented. In the analysis that follows, we explore how different parameter settings lead to shifts in which passages are classified as more internally regular.

In Figure 6, we illustrate how varying the parameter choices, specifically using 2-grams versus 4-grams with different running-window widths, affects the identification of formulaic patterns in the text. These differences in parameterization lead to both hypothesized classes (P and H) being classified as the formulaic cluster under different settings. In the first case, we apply the parameter combination that yields the highest MCC score of 94% (see Figure 5). Here, the formulaic cluster is characterized by 4-grams such as “I am the LORD thy God” and “[Speak to] the Israelites and say [to them],” which are prosaic refrains strongly associated with H (Knohl Reference Knohl2007). Notably, in this setting, H is classified as the formulaic cluster, while P is identified as the non-formulaic one. The first 4-gram serves as a theological and moral refrain, reinforcing divine authority and the covenantal obligations of Israel; it often punctuates commandments, emphasizing that obedience is not merely a legal or social requirement but a direct response to God’s sovereignty (e.g., Lev 19:2, 19:4, 19:10). The second introduces divine laws and instructions, repeatedly marking the transition between legislative units in Leviticus (e.g., Lev 17:2, 18:1, 19:1), which constitute the P-associated corpus. Notably, the identification of a running-window width of 28 is not coincidental; it corresponds to the average length of the priestly legislative units (Knohl Reference Knohl2007). This choice of parameters ($n = 4$, $\ell = 28$, $f = \text{all}$) distinguishes the more formulaic nature of H (only in the case of 4-grams) from the P material, whose structural organization aligns with the larger 28-verse window.

In the case of the 2-gram-based clustering results (the third-highest MCC score, with $n = 2$, $\ell = 20$ and $f = 300$), we demonstrate that the distinctive features of the formulaic cluster are explicitly associated with the priestly legislative units. These include recurring phrases such as “the priest,” which frequently appear in ritual contexts, delineating the priestly role in sacrificial procedures, purity regulations and atonement rites (e.g., Lev 4:5, 6:7, 16:32). Similarly, “the altar” is central to the sacrificial system, often marking legal prescriptions concerning offerings and the sanctity of cultic spaces (e.g., Lev 1:5, 6:9, 8:15). These formulaic elements reflect the institutional and hierarchical focus of P, where priestly duties and sacrificial procedures are codified with precise and repetitive language. The strong association of these 2-grams with priestly legislative discourse underscores the structured, procedural nature of P, distinguishing it from the moral and theological refrains characteristic of H.

These results underscore that the identification of formulaic structure is not fixed to a single representation but emerges differently depending on the linguistic and structural scales chosen. In the case discussed here, distinct sets of formulaic features become salient under different n-gram and segmentation settings, yielding competing clusterings where either P or H is classified as more formulaic. This divergence does not undermine the method, but rather illustrates its strength: it allows researchers to test how formulaicity is distributed under different structural assumptions, exposing layered regularities that vary in grammatical, semantic, and discourse scale. The ability to generate high-confidence classifications under some parameterizations, and not others, suggests that formulaicity itself is not an absolute property, but a perspective-sensitive feature of textual organization.

Discussion

In this study, we introduced an information-theoretic soft clustering framework designed to identify structured patterns within textual data. Our approach leverages self-information as a statistical indicator of structural regularities, enabling a systematic exploration of linguistic and stylistic consistency across different types of corpora. By treating clustering as a probabilistic process rather than enforcing hard boundaries, our method accommodates uncertainty in textual attributions, making it particularly suited for cases where transitions between sources or styles are gradual rather than abrupt.

We validated our method on both Gaussian-simulated and categorical datasets to assess its effectiveness across different data structures. Benchmarking against traditional clustering techniques, including k-means and various GMMs, demonstrated that our approach provides a more stable classification of structured and unstructured components, especially in cases where the sample-to-dimension ratio is small. The information-based constraint further improved cluster separability, ensuring that formulaic structures were identified with higher precision.

The Gaussian simulations allowed us to evaluate performance under controlled conditions, where we modeled clusters using multivariate Gaussian distributions with varying covariance structures. Our method proved effective in correctly classifying clusters despite high-dimensional covariance estimation challenges, particularly in settings where the number of observations was small relative to the number of features. Unlike conventional clustering approaches that rely on full covariance estimation, which can be unstable in low-sample regimes, our information-based approach remains robust by leveraging structural differences rather than density estimation alone.

The categorical data experiments, in turn, tested the robustness of our framework in realistic scenarios where feature dependencies, sparsity and small sample sizes of high-dimensional datasets introduce significant challenges to clustering. These experiments demonstrated that our method remains effective even when conventional clustering approaches struggle due to the high-dimensional and discrete nature of the data. Our model successfully captured co-occurrence structures within the data, identifying clusters with statistically significant variations in information-based distributions, making it particularly suitable for structured linguistic data where underlying dependencies are not easily captured by Euclidean distance-based methods.

Applying this method to the biblical corpus, we analyzed key textual partitions to determine whether the identified clusters align with traditional source-critical hypotheses. Our approach consistently yielded the highest classification accuracy across various parameter configurations, surpassing k-means in the number of high-confidence classifications. In Genesis and Exodus, our method produced a significantly larger number of classifications within the two highest MCC intervals (85%–89% and 90%–95%) compared to k-means, indicating its robustness in capturing the structural properties of these partitions. In Leviticus, while k-means exhibited more high-MCC classifications overall, our approach yielded approximately twice the number of classifications in the highest MCC interval (90%–95%), demonstrating its effectiveness in isolating the most formulaic structures within the text.

These findings suggest that information-theoretic clustering offers a promising avenue for unsupervised textual analysis, providing an independent measure of structural consistency that does not rely on predefined linguistic features. The stability of our method across different corpora highlights its potential applicability beyond biblical studies, particularly in fields where latent structures within textual datasets require tractable and unbiased identification.

More broadly, our findings demonstrate that structural predictability, quantified through sample-wise self-information, captures a core dimension of textual organization that aligns with, and in some cases clarifies, traditional source-critical hypotheses. In the case of computerized source criticism of the Hebrew Bible, prior computational studies have shown that blocks such as P and non-P differ in feature distributions (Dershowitz et al. Reference Dershowitz, Akiva, Koppel and Dershowitz2015; Faigenbaum-Golovin et al. Reference Faigenbaum-Golovin, Kipnis, Bühler, Piasetzky, Römer and Finkelstein2025; Yoffe et al. Reference Yoffe, Bühler, Dershowitz, Römer, Piasetzky, Finkelstein and Sober2023). Here, we show that they also differ in their degree of formulaicity. This distinction is especially significant when made explicit through statistical modeling, as it formalizes what has often been an implicit intuition in biblical exegesis: that certain textual strata are not only lexically distinct but also stylistically more rigid, repetitive or constrained. Crucially, our results further reveal that the expression of formulaicity depends on the feature set used; different feature configurations can yield different interpretations of which passages are most internally consistent. This variability underscores the interpretive value of entropy-based methods: rather than producing a single fixed partition, they allow scholars to examine how different linguistic scales foreground different dimensions of textual structure, enabling a more nuanced and data-grounded exploration of compositional layers.

This makes entropy-based clustering particularly well suited for historical texts with compositional depth, where multiple stylistic logics may operate simultaneously. In such settings, soft clustering based on self-information variation provides a principled way to detect internal regularities without relying on predefined labels or explicit boundaries. Exploring the parameter space, across different n-gram granularities, segmentation schemes and feature types, becomes an interpretive tool for probing how distinct textual regimes express constraint, formulaicity or stylistic coherence.

For example, the distinction between Early and Late Biblical Hebrew is often expressed in systematic lexical and idiomatic shifts, making diachronic strata promising candidates for unsupervised clustering based on word-level features (Hurvitz Reference Hurvitz1968, Reference Hurvitz2014). In the Deuteronomistic history, recurring thematic formulas and narrative structures may yield internally coherent segments whose predictability aligns with redactional layers (Peckham Reference Peckham2019). Rabbinic legal corpora, such as the Mishnah, show compositional depth through repeated conditional constructions and formulaic attributions that vary across tractates or schools (Fraade Reference Fraade1991; Neusner Reference Neusner1988). In such cases, entropy-based clustering may help identify stylistic layers that are not organized by topic or subject matter, but by the underlying patterns of discourse structure and phrasing. Similarly, in oral-derived corpora like the Old Norse sagas, genealogical prologues, episodic transitions and set-piece scenes create localized structural regularities that could be identified by entropy-guided segmentation (Byock Reference Byock1984; Clover Reference Clover1982). In all these contexts, the method can be used to test source-critical hypotheses, uncover latent boundaries or evaluate competing assumptions about where and how regularity is embedded in the textual fabric.

In future work, we will extend our framework by incorporating additional embedding techniques, such as tf-idf representations, neural embeddings and transformer-based contextualized representations. These approaches will allow us to assess the model’s performance across a broader range of textual features and evaluate its adaptability to modern NLP methodologies. By integrating embeddings that capture both statistical word importance and deep semantic relationships, we aim to refine the clustering process and further improve the detection of latent structural patterns in diverse textual corpora.

Acknowledgements

We acknowledge Prof. Israel Knohl for useful discussions and the anonymous referees for improving the quality of this manuscript.

Data availability statement

All textual data are available online and are referenced herein. An example of our code can be found at https://github.com/YoffeG/cross_entropy_clustering.

Author contributions

Conceptualization: G.Y.; Methodology: G.Y. and B.S.; Software: G.Y.; Validation: G.Y.; Writing – Original Draft: G.Y.; Writing – Review and Editing: All authors.

Competing interests

The authors declare none.

Appendix A. Self-information in multivariate Gaussian distributions

For a continuous random variable X with probability density function $p(x)$ , the self-information of a realization x is given by

(A1) $$ \begin{align} I(x) = -\log p(x), \end{align} $$

which quantifies the surprise or information content of an observation x. Lower values correspond to more probable events, and higher values correspond to less probable events.

A.1.1 Self-information of a multivariate Gaussian distribution

Let $X \sim \mathcal {N}(\mu , \Sigma )$ be a d-dimensional multivariate Gaussian random variable with mean $\mu \in \mathbb {R}^d$ and covariance matrix $\Sigma \in \mathbb {R}^{d \times d}$ , where the probability density function is given by

(A2) $$ \begin{align} p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^\top \Sigma^{-1}(x-\mu)\right). \end{align} $$

Substituting this expression for $p(x)$ into the definition of self-information:

(A3) $$ \begin{align} I(x) &= -\log p(x) \nonumber \\ &= -\left[ -\frac{d}{2} \log(2\pi) - \frac{1}{2} \log |\Sigma| - \frac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu) \right] \nonumber \\ &= \frac{d}{2} \log (2\pi) + \frac{1}{2} \log |\Sigma| + \frac{1}{2} D_M(x), \end{align} $$

where $D_M(x) = (x-\mu )^\top \Sigma ^{-1} (x-\mu )$ is the Mahalanobis distance. This expression shows that the self-information of a sample from a multivariate Gaussian distribution consists of three terms:

  • A constant term $\frac {d}{2} \log (2\pi )$ , which depends only on the dimensionality d.

  • A global covariance term $\frac {1}{2} \log |\Sigma |$ , which depends on the determinant of $\Sigma $ .

  • A local deviation term $\frac {1}{2} D_M(x)$ , which quantifies how far x is from the mean $\mu $ , normalized by the covariance structure.

A.1.2 Resolving power of self-information and covariance estimation

To compare two multivariate Gaussian distributions, let $X_1 \sim \mathcal {N}(\mu _1, \Sigma _1)$ and $X_2 \sim \mathcal {N}(\mu _2, \Sigma _2)$ . The difference in expected self-information between these two distributions is given by

(A4) $$ \begin{align} \Delta I = \mathbb{E}[I(x_2)] - \mathbb{E}[I(x_1)]. \end{align} $$

Using the previous result for $I(x)$ , we can express the expected self-information as

(A5) $$ \begin{align} \mathbb{E}[I(x)] &= \frac{d}{2} \log(2\pi) + \frac{1}{2} \log |\Sigma| + \frac{1}{2} \mathbb{E}[D_M(x)]. \end{align} $$

Since $\mathbb {E}[D_M(x)] = d$ (the expected Mahalanobis distance for a Gaussian distribution; see Appendix B), we simplify this to

(A6) $$ \begin{align} \mathbb{E}[I(x)] = \frac{d}{2} \log(2\pi) + \frac{1}{2} \log |\Sigma| + \frac{d}{2}. \end{align} $$

Now, to compute the difference in expected self-information between the two distributions, we have

(A7) $$ \begin{align} \Delta I &= \left[ \frac{d}{2} \log(2\pi) + \frac{1}{2} \log |\Sigma_2| + \frac{d}{2} \right] - \left[ \frac{d}{2} \log(2\pi) + \frac{1}{2} \log |\Sigma_1| + \frac{d}{2} \right] \nonumber \\ &= \frac{1}{2} \log \frac{|\Sigma_2|}{|\Sigma_1|}. \end{align} $$

Since the determinant of a matrix $\Sigma _k$ is the product of its eigenvalues, $|\Sigma _k| = \prod _{i=1}^d \lambda _{k,i}$ , where $\lambda _{k,i}$ are the eigenvalues of $\Sigma _k$ , we can rewrite the difference as

(A8) $$ \begin{align} \Delta I = \frac{1}{2} \sum_{i=1}^d \ln \frac{\lambda_{2,i}}{\lambda_{1,i}}. \end{align} $$

If the eigenvalue ratios $\lambda _{2,i} / \lambda _{1,i}$ are approximately constant across dimensions, say $\lambda _{2,i} / \lambda _{1,i} \approx C$ , then

(A9) $$ \begin{align} \Delta I \approx \frac{d}{2} \ln C. \end{align} $$

Thus, the resolving power of self-information increases with the dimensionality d, allowing for a better distinction between two distributions.
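A short numerical sketch (assuming NumPy and SciPy; the dimension and covariances are illustrative) can verify Eq. (A3) against the log-density and illustrate that the empirical gap in mean self-information matches Eq. (A9) when the eigenvalue ratio is constant:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
d = 10
mu = np.zeros(d)
Sigma1, Sigma2 = np.eye(d), 2.0 * np.eye(d)   # eigenvalue ratio C = 2 in every dimension

def self_information(x, mu, Sigma):
    """Eq. (A3): d/2*log(2*pi) + 1/2*log|Sigma| + 1/2*Mahalanobis distance."""
    diff = x - mu
    maha = diff @ np.linalg.solve(Sigma, diff)
    return 0.5 * (len(mu) * np.log(2 * np.pi) + np.linalg.slogdet(Sigma)[1] + maha)

# Sanity check: I(x) equals -log p(x).
x = rng.normal(size=d)
assert np.isclose(self_information(x, mu, Sigma1),
                  -multivariate_normal(mu, Sigma1).logpdf(x))

# Empirical Delta I approaches (d/2) * ln(2), i.e., Eq. (A9) with C = 2.
X1 = rng.multivariate_normal(mu, Sigma1, size=20000)
X2 = rng.multivariate_normal(mu, Sigma2, size=20000)
delta_I = (np.mean([self_information(x, mu, Sigma2) for x in X2])
           - np.mean([self_information(x, mu, Sigma1) for x in X1]))
print(delta_I, 0.5 * d * np.log(2))
```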

A.1.3 Fluctuations and finite sample effects

In practice, the covariance matrix $\Sigma $ is estimated from a finite sample:

(A10) $$ \begin{align} \hat{\Sigma} = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^\top. \end{align} $$

The eigenvalues $\hat {\lambda }_i$ of the estimated covariance matrix $\hat {\Sigma }$ fluctuate around the true eigenvalues $\lambda _i$ :

(A11) $$ \begin{align} \hat{\lambda}_i = \lambda_i + \Delta \lambda_i. \end{align} $$

Using a first-order Taylor expansion for $\ln \hat {\lambda }_i$ around $\lambda _i$ , we approximate the logarithm of the estimated eigenvalue $\hat {\lambda }_i$ as

(A12) $$ \begin{align} \ln \hat{\lambda}_i \approx \ln \lambda_i + \frac{\Delta \lambda_i}{\lambda_i}, \end{align} $$

where $\Delta \lambda _i = \hat {\lambda }_i - \lambda _i$ is the fluctuation between the estimated eigenvalue $\hat {\lambda }_i$ and the true eigenvalue $\lambda _i$ .

Now, to compute the variance of $\ln \hat {\lambda }_i$ , we use the fact that for small fluctuations $\Delta \lambda _i$ , the variance of $\ln \hat {\lambda }_i$ is approximated by the variance of the term $\frac {\Delta \lambda _i}{\lambda _i}$ . Thus, we write the variance as

(A13) $$ \begin{align} \text{Var}(\ln \hat{\lambda}_i) = \text{Var}\left( \frac{\Delta \lambda_i}{\lambda_i} \right). \end{align} $$

Since $\lambda _i$ is a constant, we can factor it out of the variance expression:

(A14) $$ \begin{align} \text{Var}(\ln \hat{\lambda}_i) = \frac{1}{\lambda_i^2} \text{Var}(\Delta \lambda_i). \end{align} $$

This shows that the variance of the logarithm of the eigenvalue is inversely proportional to the square of the true eigenvalue $\lambda _i$ , and is proportional to the variance of the fluctuation $\Delta \lambda _i$ . For a sample size n, the fluctuation $\Delta \lambda _i = \hat {\lambda }_i - \lambda _i$ arises from the estimation of the eigenvalue from a finite sample. According to the Central Limit Theorem, as n increases, the sample eigenvalue $\hat {\lambda }_i$ becomes increasingly concentrated around the true eigenvalue $\lambda _i$ , with the fluctuation $\Delta \lambda _i$ decreasing as n grows. Specifically, the variance of $\Delta \lambda _i$ scales inversely with n, reflecting the improved precision of the estimate as the number of samples increases, expressed as

(A15) $$ \begin{align} \text{Var}(\Delta \lambda_i) \sim \frac{1}{n}. \end{align} $$

The variance of self-information. The self-information $I(x)$ involves a term $\log |\Sigma |$ , which is the logarithm of the determinant of the covariance matrix $\Sigma $ . The determinant $|\Sigma |$ is the product of the eigenvalues $\lambda _i$ of $\Sigma $ :

$$\begin{align*}|\Sigma| = \prod_{i=1}^d \lambda_i. \end{align*}$$

Taking the logarithm of $|\Sigma |$ , we get

$$\begin{align*}\log |\Sigma| = \sum_{i=1}^d \log \lambda_i. \end{align*}$$

Thus, the term $\log |\Sigma |$ contributes a factor of d because the sum involves d eigenvalues.

Now, considering the Mahalanobis distance $D_M(x) = (x - \mu )^\top \Sigma ^{-1} (x - \mu )$ , which is influenced by the eigenvalues of $\Sigma $ , the variance of the Mahalanobis distance is proportional to the eigenvalue fluctuations. Using the result from the previous sections that the variance of the logarithm of eigenvalues scales as $\frac {1}{n \lambda _i^2}$ , we see that the overall variance of self-information, which combines these effects, scales as

$$\begin{align*}\text{Var}(I(x)) \sim \frac{d}{4n \lambda^2}. \end{align*}$$
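The following sketch (assuming NumPy; all values illustrative) makes the finite-sample effect concrete: across repeated estimations, the variance of the global covariance term $\frac {1}{2} \log |\hat {\Sigma }|$ shrinks roughly in proportion to $1/n$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, lam = 5, 2.0                       # dimension and true (isotropic) eigenvalue
Sigma = lam * np.eye(d)

for n in (50, 200, 800):
    # Repeatedly estimate the covariance from n samples and record 1/2 * log|Sigma_hat|.
    half_logdets = []
    for _ in range(500):
        X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
        half_logdets.append(0.5 * np.linalg.slogdet(np.cov(X, rowvar=False))[1])
    print(n, np.var(half_logdets))    # shrinks roughly like 1/n as n grows
```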

A.1.4 Impact on resolving power and signal-to-noise ratio (SNR)

The resolving power for distinguishing between two distributions using self-information is given by

(A16) $$ \begin{align} \text{Resolving Power} = \frac{\Delta I}{\sqrt{\text{Var}(I(x_1)) + \text{Var}(I(x_2))}}. \end{align} $$

Substituting our earlier results:

(A17) $$ \begin{align} \text{Resolving Power} \sim \sqrt{n d}. \end{align} $$

Figure A1 illustrates this result through empirical validation.

Figure A1. Dependence of the resolving power of self-information on sample size and dimensionality. (a) Linear dependence of variance on dimension. (b) Linear dependence of variance on $1/n$ . (c) Linear dependence of $\Delta I$ on dimension.

Appendix B. Expected Mahalanobis distance

Let $X \sim \mathcal {N}(\mu , \Sigma )$ be a d-dimensional Gaussian random variable with mean vector $\mu $ and covariance matrix $\Sigma $ . The Mahalanobis distance of a realization x is defined as

$$\begin{align*}D_M(x) = (x - \mu)^\top \Sigma^{-1} (x - \mu). \end{align*}$$

The quantity $D_M(x)$ measures how far the vector x is from the mean $\mu $ , normalized by the covariance matrix $\Sigma $ .

Since x is drawn from a Gaussian distribution, we can express $D_M(x)$ as the sum of squared standard normal variables:

$$\begin{align*}D_M(x) = \sum_{i=1}^d Z_i^2, \end{align*}$$

where $Z_i \sim \mathcal {N}(0, 1)$ are independent standard normal random variables, each representing the standardized components of the vector $x - \mu $ under the covariance structure defined by $\Sigma $ .

The expectation of the squared standard normal variables is

$$\begin{align*}\mathbb{E}[Z_i^2] = 1 \quad \text{for each} \quad i = 1, 2, \dots, d. \end{align*}$$

Therefore, the expected Mahalanobis distance is the sum of the expectations for each component:

$$\begin{align*}\mathbb{E}[D_M(x)] = \mathbb{E}\left[ \sum_{i=1}^d Z_i^2 \right] = \sum_{i=1}^d \mathbb{E}[Z_i^2] = d. \end{align*}$$

Thus, the expected value of the Mahalanobis distance for a sample from a multivariate Gaussian distribution is d, which depends solely on the dimensionality of the distribution:

$$\begin{align*}\mathbb{E}[D_M(x)] = d. \end{align*}$$
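A brief Monte Carlo sketch (assuming NumPy; the covariance below is an arbitrary positive-definite example) confirms that the sample mean of $D_M(x)$ is close to d:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 6
mu = rng.normal(size=d)
A = rng.normal(size=(d, d))
Sigma = A @ A.T + d * np.eye(d)       # arbitrary positive-definite covariance

X = rng.multivariate_normal(mu, Sigma, size=50000)
diffs = X - mu
maha = np.einsum("ij,jk,ik->i", diffs, np.linalg.inv(Sigma), diffs)
print(maha.mean())                    # close to d = 6
```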

Appendix C. Distinguishing entropy-based and feature distribution-based clusters

While the previous categorical benchmark (§2) demonstrated the effectiveness of our approach in distinguishing formulaic from non-formulaic structures in sparse, high-dimensional settings, an additional challenge arises when two clusters exhibit identical entropy yet differ in their feature distributions. In real-world textual corpora, stylistic or formulaic patterns are not always characterized by large entropy contrasts but may instead be defined by shifts in the distribution of certain linguistic elements. If our method is to serve as a robust tool for identifying latent structures in textual data, it must be capable of distinguishing between two cases: (1) when formulaic patterns manifest as an overall reduction in entropy, and (2) when distinct compositional layers exist despite maintaining equivalent entropy levels.

C.1 Experimental setup

To investigate this, we construct a controlled dataset in which two classes share identical overall entropy but differ in their feature distributions. We then introduce a formulaic subset by modifying the activation probabilities of a selected fraction of features. This experiment tests whether the self-information-based clustering framework correctly resolves compositional differences when entropy alone does not provide a discriminative signal. By evaluating the model’s classification performance in both the entropy-based and feature-distribution-based scenarios, we assess the extent to which our approach captures deeper structural regularities beyond simple variance constraints.

Feature probability distributions. Given a binary feature space $X \in \{0,1\}^{n \times d}$ with n samples and d dimensions, we construct two distinct classes, A and B, while ensuring that their overall entropy remains identical.

Initially, all dimensions are assigned a base activation probability p. To create distinct but entropy-balanced distributions, half of the dimensions are randomly selected and modified as follows:

  • For the selected dimensions, class A retains the original activation probability p, while class B receives a reduced probability of $0.1p$ .

  • The remaining half of the dimensions follow the inverse assignment: class B retains the original activation probability p, while class A receives $0.1p$ .

Formally, let $S_{\text {high}} \subset \{1, \dots , d\}$ , $|S_{\text {high}}| = d/2$ be a randomly selected subset of dimensions. The activation probabilities for each feature in each class are then:

$$\begin{align*}p_A(j) = \begin{cases} p, & j \in S_{\text{high}}, \\ 0.1p, & j \notin S_{\text{high}}, \end{cases} \quad p_B(j) = \begin{cases} 0.1p, & j \in S_{\text{high}}, \\ p, & j \notin S_{\text{high}}. \end{cases} \end{align*}$$

Samples for each class are then drawn from independent Bernoulli distributions:

$$\begin{align*}X_A(j) \sim \text{Bern}(p_A(j)), \quad X_B(j) \sim \text{Bern}(p_B(j)). \end{align*}$$

Since both classes contain the same number of dimensions at probability p and $0.1p$ , their total entropy remains identical:

$$\begin{align*}H(X_A) \approx H(X_B). \end{align*}$$

Formulaic cluster construction. To introduce a formulaic cluster, a subset of each class ( $n/2$ samples) is modified by increasing the activation probability of a selected fraction $\rho $ of dimensions. These dimensions, referred to as formulaic dimensions, are chosen randomly from the total feature set.

The activation probabilities of these formulaic dimensions are modified as follows:

$$\begin{align*}p_A'(j) = \begin{cases} p_A(j) + \Delta p, & j \in S_{\text{form}}, \\ p_A(j), & \text{otherwise}, \end{cases} \quad p_B'(j) = \begin{cases} p_B(j) + \Delta p, & j \in S_{\text{form}}, \\ p_B(j), & \text{otherwise}, \end{cases} \end{align*}$$

where $S_{\text {form}} \subset \{1, \dots , d\}$ is a randomly selected subset of dimensions with size $\rho d$ .

Labeling. Each sample is labeled according to two independent classification axes:

  • Feature distribution label: $Y_{\text {dist}} \in \{0,1\}$ where 0 corresponds to class A and 1 to class B.

  • Formulaicity label: $Y_{\text {form}} \in \{0,1\}$ where 0 corresponds to original samples and 1 to the formulaic subset.

This formulation allows evaluation of whether the clustering framework differentiates entropy-driven structure from feature-distribution-based separation.
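As a concrete illustration, the following is a minimal sketch of such a generator (assuming NumPy; the default parameter values are illustrative and not those used in the reported experiments):

```python
import numpy as np

def make_dataset(n=400, d=200, p=0.3, delta_p=0.4, rho=0.1, seed=0):
    """Entropy-balanced two-class binary data with a formulaic subset.

    Classes A and B swap which half of the dimensions is active with probability
    p versus 0.1*p (equal entropy, different feature distributions); half of the
    samples in each class have a random fraction rho of dimensions boosted by
    delta_p, forming the formulaic subset.
    """
    rng = np.random.default_rng(seed)
    high = rng.choice(d, d // 2, replace=False)            # S_high
    p_A = np.full(d, 0.1 * p); p_A[high] = p
    p_B = np.full(d, p);       p_B[high] = 0.1 * p
    form = rng.choice(d, int(rho * d), replace=False)      # S_form

    rows, y_dist, y_form = [], [], []
    for label, probs in ((0, p_A), (1, p_B)):
        for is_form in (0, 1):
            q = probs.copy()
            if is_form:
                q[form] = np.clip(q[form] + delta_p, 0.0, 1.0)
            rows.append(rng.random((n // 4, d)) < q)        # Bernoulli draws
            y_dist += [label] * (n // 4)
            y_form += [is_form] * (n // 4)
    return np.vstack(rows).astype(int), np.array(y_dist), np.array(y_form)
```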

C.2 Experimental results

To assess the capacity of the information-theoretic clustering framework to distinguish between entropy-based and feature-distribution-based structures, we analyze clustering performance as a function of key dataset parameters. Specifically, we vary the baseline activation probability of features ( $p_{\text {feature}}$ ), the activation probability of formulaic dimensions ( $p_{\text {formulaic}}$ ), and the fraction of dimensions designated as formulaic ( $d_{\text {formulaic}}$ ).

In Figure C1, we illustrate the comparison between the efficacy of feature-distribution-based clustering and formulaic clustering under these conditions. While this analysis provides insights into when each signal tends to dominate, it remains a heuristic demonstration rather than a quantitative predictive framework. We do not establish precise conditions under which one clustering paradigm will prevail over the other; developing such predictive models is an objective for future work. Instead, we illustrate that each approach becomes dominant in different parameter regimes, reinforcing the need for a nuanced understanding of structural differentiation in clustering applications.

Figure C1. Clustering performance (MCC) as a function of key dataset parameters. Orange represents formulaic clustering; blue represents feature-distribution-based clustering. Upper panel: MCC vs. the fraction of formulaic dimensions $d_{\text {formulaic}}$, with fixed $p_{\text {formulaic}} = 0.5$ and varying $p_{\text {feature}} \in \{0.1, 0.3, 0.5\}$. Middle panel: MCC vs. the baseline feature activation probability $p_{\text {feature}}$, with fixed $d_{\text {formulaic}} = 0.1$ and varying $p_{\text {formulaic}} \in \{0.8, 0.5, 0.1\}$. Lower panel: MCC vs. the formulaic activation probability $p_{\text {formulaic}}$, with fixed $d_{\text {formulaic}} = 0.3$ and varying $p_{\text {feature}} \in \{0.1, 0.3, 0.5\}$.

As seen in Figure C1, the relative strength of the two structural signals affects the extent to which each clustering method is effective. Formulaic clustering exhibits higher performance when the formulaic subset of dimensions is strongly expressed, suggesting that structural repetition enhances separability. Conversely, feature-distribution-based clustering improves when global differences in feature distributions are more pronounced, indicating that the method is primarily sensitive to shifts in overall distributional structure rather than localized regularities.

These findings suggest that the clustering framework captures distinct statistical signatures depending on the dominance of formulaic constraints or feature-distributional differences. Further investigation is required to formalize the interaction between these two effects and to assess the generality of these results across different types of data. Future work will explore whether hybrid approaches incorporating both signals can improve robustness in cases where neither signal is dominant.

Appendix D. GMM clustering results

In Figure D1, we illustrate the classification results using the Gaussian Mixture Model (GMM) algorithm as applied to the three biblical texts, as discussed in §2.

Figure D1. Clustering results for the hypothesized partitions of the books of Genesis, Exodus, and Leviticus (left to right, respectively), using GMM clustering, similar to Figure 3.

Footnotes

This article was awarded the Open Data badge for transparent practices. See the Data availability statement for details.

References

Albertz, Rainer. 2018. “The Recent Discussion on the Formation of the Pentateuch/Hexateuch.” Hebrew Studies 59: 65–92.Google Scholar
Altonji, Joseph G., and Segal, Lewis M.. 1996. “Small-Sample Bias in GMM Estimation of Covariance Structures.” Journal of Business & Economic Statistics 14, no. 3: 353–66.Google Scholar
Antonia, Antonia, Craig, Hugh, and Elliott, Jack. 2014. “Language Chunking, Data Sparseness, and the Value of a Long Marker List: Explorations with Word n-Grams and Authorial Attribution.” Literary and Linguistic Computing 29, no. 2: 147–63.Google Scholar
Ashurbekova, Karina, Usseglio-Carleve, Antoine, Forbes, Florence, and Achard, Sophie. 2021. “Optimal Shrinkage for Robust Covariance Matrix Estimators in a Small Sample Size Setting.” Preprint.Google Scholar
Bickel, Peter J., and Levina, Elizaveta. 2008. “Regularized Estimation of Large Covariance Matrices.” Annals of Statistics 36, no. 1: 199–227. https://doi.org/10.1214/009053607000000758.Google Scholar
Bühler, Axel, Yoffe, Gideon, Römer, Thomas, Sober, Barak, Finkelstein, Israel, Piasetzky, Eli, and Dershowitz, Nachum. 2024. “Exploring the Stylistic Uniqueness of the Priestly Source in Genesis and Exodus Through a Statistical/Computational Lens.” Zeitschrift für die alttestamentliche Wissenschaft 136, no. 2: 165–90.Google Scholar
Byock, Jesse L. 1984. “Saga Form, Oral Prehistory, and the Icelandic Social Context.” New Literary History 16, no. 1: 153–73.Google Scholar
Casale, Francesco Paolo, Dalca, Adrian, Saglietti, Luca, Listgarten, Jennifer, and Fusi, Nicolo. 2018. “Gaussian Process Prior Variational Autoencoders.” In Advances in Neural Information Processing Systems 31 (NeurIPS 2018), edited by Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R.. Red Hook, NY: Curran Associates.Google Scholar
Church, Kenneth, and Hanks, Patrick. 1990. “Word Association Norms, Mutual Information, and Lexicography.” Computational Linguistics 16, no. 1: 22–9.Google Scholar
Clover, Carol J. 1982. The Medieval Saga. Ithaca, NY: Cornell University Press.Google Scholar
Coleman, Stephen. 2019. “The Psalmic Oral Formula Revisited: A Cognitive-Performative Approach.” Biblical Interpretation 27, no. 2: 186–207.Google Scholar
Davis, Jason, and Dhillon, Inderjit. 2006. “Differential Entropic Clustering of Multivariate Gaussians.” In Advances in Neural Information Processing Systems 19 (NIPS 2006), edited by Schölkopf, Bernhard, Platt, John, and Hofmann, Thomas. Cambridge, MA: MIT Press.Google Scholar
Dershowitz, Idan, Akiva, Navot, Koppel, Moshe, and Dershowitz, Nachum. 2015. “Computerized Source Criticism of Biblical Texts.” Journal of Biblical Literature 134, no. 2: 253–71.Google Scholar
Devlin, Jacob, Chang, Ming-Wei, Lee, Kenton, and Toutanova, Kristina. 2019. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–86. Minneapolis, MN: Association for Computational Linguistics.Google Scholar
Dilokthanakul, Nat, Mediano, Pedro A. M., Garnelo, Marta, Lee, Matthew C. H., Salimbeni, Hugh, Arulkumaran, Kai, and Shanahan, Murray. 2016. “Deep Unsupervised Clustering with Gaussian Mixture Variational Autoencoders.” arXiv Preprint, arXiv:1611.02648.Google Scholar
Ester, Martin, Kriegel, Hans-Peter, Sander, Jörg, and Xiaowei, Xu. 1996. “Density-Based Spatial Clustering of Applications with Noise.” In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96), edited by Simoudis, Evangelos, Han, Jiawei, and Fayyad, Usama M.. Menlo Park, CA: AAAI Press.Google Scholar
Faigenbaum-Golovin, Shira, Kipnis, Alon, Bühler, Axel, Piasetzky, Eli, Römer, Thomas, and Finkelstein, Israel. 2025. “Critical Biblical Studies via Word Frequency Analysis: Unveiling Text Authorship.” PLoS One 20, no. 6: e0322905.Google Scholar
Fraade, Steven D. 1991. From Tradition to Commentary: Torah and its Interpretation in the Midrash Sifre to Deuteronomy. Vol. 73. Albany, NY: State University of New York Press.Google Scholar
Gamallo, Pablo, Campos, José Ramom Pichel, and Alegria, Iñaki. 2017. “A Perplexity-Based Method for Similar Languages Discrimination.” In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), edited by Nakov, Preslav, Zampieri, Marcos, Ljubešić, Nikola, Tiedemann, Jörg, Malmasi, Shevin, and Ali, Ahmed. Valencia, Spain: Association for Computational Linguistics.Google Scholar
Giraldo, Luis Gonzalo Sanchez, Rao, Murali, and Principe, Jose C. 2014. “Measures of Entropy From Data Using Infinitely Divisible Kernels.” IEEE Transactions on Information Theory 61, no. 1: 535–48.Google Scholar
Gitay, Yehoshua. 1980. Tradition and Interpretation: A Study of the Use and Application of Formulaic Language in the So-Called Ebed YHWH-Psalms. Jerusalem: The Hebrew University Magnes Press.Google Scholar
Haran, Mehahem. 1981. “Behind the Scenes of History: Determining the Date of the Priestly Source.” Journal of Biblical Literature 100, no. 3: 321–33.Google Scholar
Hartigan, John A., and Wong, Manchek A., 1979. “A k-Means Clustering Algorithm.” Applied Statistics 28, no. 1: 100–8.Google Scholar
Holzinger, H. 1893. Einleitung in den Hexateuch. Vol. 1. Tübingen: Mohr Siebeck.Google Scholar
Houdouin, Pierre, Ollila, Esa, and Pascal, Frédéric. 2023. “Regularized EM Algorithm.” In ICASSP 2023 – IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), edited by ICASSP 2023 Organizing Committee, 1–5. Piscataway, NJ: IEEE.Google Scholar
Huang, Rongqing, and Hansen, John H. L.. 2007. “Dialect Classification on Printed Text Using Perplexity Measure and Conditional Random Fields.” In 2007 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’07), vol. 4, edited by ICASSP 2007 Organizing Committee, IV–993. Piscataway, NJ: IEEE.Google Scholar
Huertas-Tato, Javier, Martín, Alejandro, and Camacho, David. 2023. “Understanding Writing Style in Social Media with a Supervised Contrastively Pre-Trained Transformer.” Preprint, eprint: 2310.11081. https://doi.org/10.48550/arXiv.2310.11081.Google Scholar
Hurvitz, A. 1968. “The Chronological Significance of Aramaisms in Biblical Hebrew.” Israel Exploration Journal 18: 234–40.Google Scholar
Hurvitz, Avi. 2014. A Concise Lexicon of Late Biblical Hebrew: Linguistic Innovations in the Writings of the Second Temple Period. Leiden and Boston: Brill.Google Scholar
Jensen, Minna Skafte. 1980. The Homeric Question and the Oral-Formulaic Theory. Vol. 20. Copenhagen: Museum Tusculanum Press.Google Scholar
Klakow, Dietrich, and Peters, Jochen. 2002. “Testing the Correlation of Word Error Rate and Perplexity.” Speech Communication 38, nos. 1–2: 19–28.Google Scholar
Klostermann, August. 1907. Der Pentateuch: Beiträge zu seinem Verständnis und seiner Entstehungsgeschichte. Leipzig: G. Böhme.Google Scholar
Knohl, Israel. 2007. The Sanctuary of Silence: The Priestly Torah and the Holiness School. Winona Lake, IN: Eisenbrauns.Google Scholar
Kuenen, Abraham, and Wicksteed, Philip Henry. 1886. An Historico-Critical Inquiry into the Origin and Composition of the Hexateuch (Pentateuch and book of Joshua). London: Macmillan.Google Scholar
Kurzynski, Maciej. 2023. “The Stylometry of Maoism: Quantifying the Language of Mao Zedong.” In Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages, edited by Hämäläinen, Mika, Öhman, Emily, Pirinen, Flammie, Alnajjar, Khalid, Miyagawa, So, Bizzoni, Yuri, Partanen, Niko, and Rueter, Jack. Tokyo, Japan: Association for Computational Linguistics.Google Scholar
Li, Zuchao, Wang, Rui, Chen, Kehai, Utiyama, Masso, Sumita, Eiichiro, Zhang, Zhuosheng, and Zhao, Hai. 2020. “Data-Dependent Gaussian Prior Objective for Language Generation.” In International Conference on Learning Representations (ICLR 2020). Addis Ababa, Ethiopia (virtual): OpenReview.net.Google Scholar
Long, Burke O., Tengström, Sven, and Tengstrom, Sven. 1984. Die Toledotformel und die literarische Struktur der priesterlichen Erweiterungsschicht im Pentateuch. Uppsala: Uppsala Universitet.Google Scholar
Magoun, Francis P. Jr, 1953. “Oral-Formulaic Character of Anglo-Saxon Narrative Poetry.” Speculum 28, no. 3: 446–67.Google Scholar
Neusner, Jacob. 1988. The Mishnah: an introduction. London: Bloomsbury Publishing PLC.Google Scholar
Paquot, Magali, and Granger, Sylviane. 2012. “Formulaic Language in Learner Corpora.” Annual Review of Applied Linguistics 32: 130–49.Google Scholar
Peckham, Brian. 2019. The Composition of the Deuteronomistic History. Vol. 35. Leiden: Brill.Google Scholar
Peters, Georg, Crespo, Fernando, Lingras, Pawan, and Weber, Richard. 2013. “Soft Clustering–Fuzzy and Rough Approaches and their Extensions and Derivatives.” International Journal of Approximate Reasoning 54 (2): 307–22.Google Scholar
Plum, Karin Friis. 1989. “Genealogy as Theology.” Scandinavian Journal of the Old Testament 3, no. 1: 66–92.Google Scholar
Polak, Frank H. 2006. “Linguistic and Stylistic Aspects of Epic Formulae in Ancient Semitic Poetry and Biblical Narrative.” In Biblical Hebrew in Its Northwest Semitic Setting, edited by Steven E. Fassberg and Avi Hurvitz, 285–304. Jerusalem: The Hebrew University Magnes Press; Winona Lake, IN: Eisenbrauns.Google Scholar
Polak, Frank H. 2017. “Syntactic-Stylistic Aspects of the So-Called ‘Priestly’ Work in the Torah.” In Le-ma‘an Ziony: Essays in Honor of Ziony Zevit, edited by Greenspahn, Frederick E. and Rendsburg, Gary A.. Eugene, OR: Wipf and Stock Publishers.Google Scholar
Principe, Jose C. 2010. Information Theoretic Learning: Renyi’s Entropy and Kernel Perspectives. New York: Springer Science & Business Media.Google Scholar
Radday, Yehuda T., and Shore, Haim. 1985. Genesis: An Authorship Study in Computer-Assisted Statistical Linguistics. Analecta Biblica 103. Rome: Biblical Institute Press (Pontifical Biblical Institute).Google Scholar
Radford, Alec, Kim, Jong Wook, Hallacy, Chris, Ramesh, Aditya, Goh, Gabriel, Agarwal, Sandhini, Sastry, Girish, Askell, Amanda, Mishkin, Pamela, Clark, Jack, et al. 2021. “Learning Transferable Visual Models from Natural Language Supervision.” In Proceedings of the 38th International Conference on Machine Learning (ICML 2021), edited by Meila, Marina and Zhang, Tong. PMLR 139. Virtual (Vienna, Austria): PMLR.Google Scholar
Read, John, and Nation, I. S. P.. 2008. “Measurement of Formulaic Sequences.” In Formulaic Sequences: Acquisition, Processing and Use, edited by Wray, Roberta, 2335. Amsterdam/Philadelphia: John Benjamins.Google Scholar
Römer, Thomas. 2009. “The Exodus Narrative according to the Priestly Document.” In The Strata of the Priestly Writings: Contemporary Debate and Future Directions, edited by Shectman, Sarah and Baden, Joel S.. Zürich: Theologischer Verlag Zürich.Google Scholar
Ross, Jerome Clayton. 1997. The Composition of the Holiness Code (Lev. 17-26). Pittsburgh, PA: University of Pittsburgh.Google Scholar
Royer, Martin. 2017. “Adaptive Clustering through Semidefinite Programming.” In Advances in Neural Information Processing Systems 30 (NIPS 2017), edited by Guyon, I., , U. von, Luxburg, ,Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R.. Red Hook, NY: Curran Associates.Google Scholar
Schmid, Konrad. 2015. “Distinguishing the World of the Exodus Narrative from the World of Its Narrators: The Question of the Priestly Exodus Account in Its Historical Setting.” In Israel’s Exodus in Transdisciplinary Perspective: Text, Archaeology, Culture, and Geoscience, edited by Levy, Thomas E., Schneider, Thomas, and Propp, William H. C.. Cham: Springer.Google Scholar
Shannon, Claude Elwood. 1948. “A Mathematical Theory of Communication.” The Bell System Technical Journal 27, no. 3: 379423.Google Scholar
Shectman, Sarah, and Baden, Joel S. 2009. The Strata of the Priestly Writings: Contemporary Debate and Future Directions. Vol. 95. Zürich: Theologischer Verlag Zürich.Google Scholar
Shmidman, Avi, Guedalia, Joshua, Shmidman, Shaltiel, Shmidman, Cheyn Shmuel, Handel, Eli, and Koppel, Moshe. 2022. “Introducing BEREL: BERT Embeddings for Rabbinic-Encoded Language.” Preprint, arXiv:2208.01875.Google Scholar
Smith, Mark S. 1996. “The Literary Arrangement of the Priestly Redaction of Exodus: A Preliminary Investigation.” The Catholic Biblical Quarterly 58, no. 1: 2550.Google Scholar
Stamatatos, Efstathios. 2009. “A Survey of Modern Authorship Attribution Methods.” Journal of the American Society for Information Science and Technology 60, no. 3: 538–56.Google Scholar
Stipp, Hermann-Josef. 2017. “Formulaic Language and the Formation of the Book of Jeremiah.” In Jeremiah’s Scriptures: Production, Reception, Interaction, and Transformation, edited by Najman, Hindy and Schmid, Konrad, 145–65. Leiden/Boston: Brill.Google Scholar
Tabor, Jacek, and Spurek, Przemyslaw. 2014. “Cross-Entropy Clustering.” Pattern Recognition 47, no. 9: 3046–59.Google Scholar
Vink, J. G. 1969. “The Date and Origin of the Priestly Code in the Old Testament.” In The Priestly Code and Seven Other Studies, edited by Deissler, Alfons (series editor, Vetus Testamentum Supplements), 1144. Leiden: Brill.Google Scholar
Vlassis, Nikos, and Likas, Aristidis. 2002. “A Greedy EM Algorithm for Gaussian Mixture Learning.” Neural Processing Letters 15: 7787.Google Scholar
Watters, William R. 1976. Formula Criticism and the Poetry of the Old Testament. Leiden: Brill.
Wellhausen, Julius. 1885. Prolegomena to the History of Israel: with a Reprint of the Article Israel from the Encyclopaedia Britannica. Edinburgh: A. & C. Black.
Wood, David. 2019. “Classifying and Identifying Formulaic Language.” In The Routledge Handbook of Vocabulary Studies, edited by Webb, Stuart and Nation, Paul, 30–45. London/New York: Routledge.
Yang, Linxiao, Cheung, Ngai-Man, Li, Jiaying, and Fang, Jun. 2019. “Deep Clustering by Gaussian Mixture Variational Autoencoders with Graph Embedding.” In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV 2019). Seoul, Korea: IEEE Computer Society Conference Publishing Services.
Yao, Jianfeng, Zheng, Shurong, and Bai, Zhidong. 2015. Sample Covariance Matrices and High-Dimensional Data Analysis. New York: Cambridge University Press.
Yoffe, Gideon, Bühler, Axel, Dershowitz, Nachum, Römer, Thomas, Piasetzky, Eli, Finkelstein, Israel, and Sober, Barak. 2023. “A Statistical Exploration of Text Partition into Constituents: The Case of the Priestly Source in the Books of Genesis and Exodus.” In Findings of the Association for Computational Linguistics: ACL 2023, edited by Association for Computational Linguistics Program Committee, 1918–40. Toronto: Association for Computational Linguistics. https://aclanthology.org/2023.findings-acl.121.

Figure 1. Classification results for the benchmarking experiment on categorical one-hot encoded data described in the “Clustering Benchmarking on Categorical Data” section. The test datasets included 100 samples of the (equally sized) formulaic and non-formulaic classes, with 200 (top panel), 50 (middle panel), and 20 (bottom panel) dimensions, for varying values of the base feature activation probability ($p$), the formulaic feature activation probability ($p_{\text{form}}$), and the fraction of formulaic dimensions in the formulaic class. The colored areas represent one-standard-deviation intervals derived from 100 simulations.
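
For readers who wish to reproduce a setup of this kind, the following is a minimal sketch of how such a benchmark could be generated, under the simplifying assumption that features are independent binary activations; the function and parameter names (make_categorical_benchmark, frac_form, etc.) are illustrative and are not taken from the authors’ code.

```python
import numpy as np

def make_categorical_benchmark(n_samples=100, d=200, p=0.3, p_form=0.8,
                               frac_form=0.3, seed=0):
    """Generate equally sized formulaic and non-formulaic classes of binary features.

    Every feature is a Bernoulli activation: the non-formulaic class uses the base
    probability p throughout, while the formulaic class raises a fraction frac_form
    of its dimensions to the formulaic activation probability p_form.
    (Illustrative sketch only; not the authors' implementation.)
    """
    rng = np.random.default_rng(seed)
    non_formulaic = rng.binomial(1, p, size=(n_samples, d))
    formulaic = rng.binomial(1, p, size=(n_samples, d))
    n_form = int(frac_form * d)
    formulaic[:, :n_form] = rng.binomial(1, p_form, size=(n_samples, n_form))
    X = np.vstack([non_formulaic, formulaic])
    y = np.array([0] * n_samples + [1] * n_samples)  # 1 marks the formulaic class
    return X, y

X, y = make_categorical_benchmark()
```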

Figure 2. Classification results of the experiment described in the “Clustering Benchmarking on Multivariate Gaussian Data” section, for varying sample sizes and numbers of dimensions of multivariate Gaussian classes of differing entropy. Upper panel: Varying sample sizes for $d = 50$. Bottom panel: Varying sample sizes for $d = 10$. The colored areas represent one-standard-deviation intervals, derived from 100 simulations.
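
A comparable sketch for this experiment, again with illustrative names and settings: two multivariate Gaussian classes whose differential entropy differs only through the covariance scale, clustered with an off-the-shelf Gaussian mixture model as a stand-in for the method under study, and scored with MCC.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.metrics import matthews_corrcoef

def make_gaussian_classes(n_samples=100, d=50, sigma_low=1.0, sigma_high=2.0, seed=0):
    """Two zero-mean isotropic Gaussian classes; their differential entropy,
    0.5 * log((2*pi*e)**d * det(Sigma)), differs only via the covariance scale.
    The scales chosen here are assumptions, not the paper's settings."""
    rng = np.random.default_rng(seed)
    X = np.vstack([
        rng.normal(0.0, sigma_low, size=(n_samples, d)),   # lower-entropy class
        rng.normal(0.0, sigma_high, size=(n_samples, d)),  # higher-entropy class
    ])
    y = np.array([0] * n_samples + [1] * n_samples)
    return X, y

X, y = make_gaussian_classes()
labels = GaussianMixture(n_components=2, random_state=0).fit_predict(X)
# Cluster labels are arbitrary, so report the better of the two possible alignments.
mcc = max(matthews_corrcoef(y, labels), matthews_corrcoef(y, 1 - labels))
print(f"MCC = {mcc:.2f}")
```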

Figure 3. Clustering results for the book of Genesis across different parameter combinations, evaluated against expert annotations distinguishing between the main textual body and genealogical lists traditionally attributed to P. Results are shown for our cross-information-based clustering method (left) and k-means (right). Top panel: The 20 feature combinations that yield the highest MCC scores, indicating the strongest agreement with expert annotations. Bottom panel: Distribution of MCC scores across all parameter combinations, sorted into discrete performance intervals.

Figure 4. Clustering results for the P/non-P partition in the book of Exodus, similar to Figure 3.

Figure 5. Clustering results for the P/H partition in the book of Leviticus, similar to Figure 3.

Figure 6. Distinctive n-grams extracted from the formulaic cluster for two parameter combinations, capturing 30% of the variance (see Section 2.8 in Yoffe et al. (2023)). Left panel: Clustering of morphologically-represented Leviticus using $\ell = 28$, $n = 4$ and $f = \text{all}$, achieving an MCC score of 94%. Right panel: Similar to the left panel but with $\ell = 20$, $n = 2$ and $f = 300$, achieving an MCC score of 93%. n-grams discussed in the “Formulaic Structure and Parameter Sensitivity in the P/H Partition of Leviticus” section as examples of H- and P-associated features are outlined in red. The insets display the self-information distributions of both clusters, with blue and orange representing the non-formulaic and formulaic clusters, respectively.
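
The insets can be approximated, for intuition, by a simple plug-in estimate of sample-wise self-information; the sketch below assumes an independent-Bernoulli model over binary n-gram features and is not the weighted formulation used in the paper.

```python
import numpy as np

def sample_self_information(X, eps=1e-6):
    """Self-information of each row of a binary feature matrix X under an
    independent-Bernoulli model with probabilities estimated from X itself.
    Returns the negative log-likelihood per sample (an illustrative proxy
    for the quantity shown in the insets)."""
    p = np.clip(X.mean(axis=0), eps, 1 - eps)  # empirical activation probabilities
    return -(X * np.log(p) + (1 - X) * np.log(1 - p)).sum(axis=1)

# Repetitive, formulaic samples tend to be highly predictable under the model
# and therefore concentrate at lower self-information values.
```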

Figure A1. Dependence of the resolving power of self-information on sample size and dimensionality. (a) Linear dependence of variance on dimension. (b) Linear dependence of variance on $1/n$. (c) Linear dependence of $\Delta I$ on dimension.
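
For orientation, the linear trends in panels (a) and (c) are consistent with the behavior of Gaussian self-information under the idealized assumption of exactly known parameters: for $x \sim \mathcal{N}(\mu, \Sigma)$, the differential self-information is $I(x) = \frac{1}{2}(x-\mu)^{\top}\Sigma^{-1}(x-\mu) + \frac{1}{2}\log\big((2\pi)^{d}|\Sigma|\big)$, and since the quadratic form follows a $\chi^2_d$ distribution, $\operatorname{Var}[I(x)] = d/2$ grows linearly with the dimension $d$; similarly, the gap in mean self-information between two isotropic Gaussians of different scale, $\Delta I = \frac{d}{2}\log(\sigma_2^2/\sigma_1^2)$, is linear in $d$. The additional $1/n$ dependence in panel (b) plausibly reflects the extra variability introduced by estimating $\mu$ and $\Sigma$ from a finite sample of size $n$.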

Figure C1. Clustering performance (MCC) as a function of key dataset parameters. Orange represents formulaic clustering; blue represents feature-distribution-based clustering. Upper panel: MCC vs. fraction of formulaic dimensions $d_{\text{form}}$, with fixed $p_{\text{form}} = 0.5$ and varying $p_{\text{feature}} \in \{0.1, 0.3, 0.5\}$. Middle panel: MCC vs. baseline feature activation probability $p_{\text{feature}}$, with fixed $d_{\text{form}} = 0.1$ and varying $p_{\text{form}} \in \{0.8, 0.5, 0.1\}$. Lower panel: MCC vs. formulaic activation probability $p_{\text{form}}$, with fixed $d_{\text{form}} = 0.3$ and varying $p_{\text{feature}} \in \{0.1, 0.3, 0.5\}$.

Figure D1. Clustering results for the hypothesized partitions of the books of Genesis, Exodus, and Leviticus (left to right, respectively), using GMM clustering, similar to Figure 3.
