Wednesday, January 29, 2020

Joey McCollum: Introducing the open-cbgm library

The following is a guest post from Joey McCollum. Joey is a research associate at Virginia Tech, a co-editor of the Solid Rock Greek New Testament (with Stephen Brown), and one of the translators behind Max and Moritz in Biblical Greek (with Brent Niedergall, Dave Massa, and Steve Young). I’m very happy to share with you his independent work to produce an editable version of the CBGM. The changes he introduces are worth discussing and I hope that conversation can begin here on the ETC blog.

1. Introduction & Goals

The open-cbgm library is an open-source software implementation of the Coherence-Based Genealogical Method (CBGM). In this guest post, I’d like to highlight how the open-cbgm library has accomplished the following objectives.

When I began the project back in October 2019, I had a few goals in mind. First, I wanted it to be open source, so that others could use the CBGM from end-to-end independently, study the code to understand what’s going on “under the hood,” or copy and modify the code to suit their own needs. Second, I wanted the library to fit into existing workflows with other text-critical tools. Third, I wanted to implement features that other textual critics have expressed an interest in seeing in the CBGM. And finally, I wanted the library to be fast—specifically, fast enough to handle the daunting task of constructing a complete global stemma for a book of the New Testament.

2. How It Works

Regarding the first goal, I’m pleased to say that the software is now freely accessible on GitHub. It works on Linux, Mac, and Windows computers. Platform-specific instructions on installing and using it are available on the GitHub page.

Towards the second goal, the open-cbgm library works with inputs in the Text Encoding Initiative (TEI) XML format, a digital humanities standard used by transcription and collation tools developed by the Institute of Textual Scholarship and Electronic Editing (ITSEE) and supported in the INTF’s Virtual Manuscript Room (VMR) workspace. The TEI guidelines offer natural ways to encode lists of witnesses, variation units, and collation data, and the TEI graph-related elements lend themselves well to representing local stemmata of variant readings (see Fig. 1).

Figure 1: Representation of a local stemma in TEI XML. The “directed” graph type indicates that specified edges are one-directional. The “node” elements correspond to readings, and the “arc” elements to proposed genealogical relationships between prior and posterior readings. Note the inclusion of subvariants (the defective reading cf) and ambiguous readings (zw-b/d).

The idea is that with minimal modification (the addition of local stemmata to variation units), the output of existing tools could serve as the input to the open-cbgm library. To my knowledge, no one else has encoded local stemmata using TEI XML, so my hope is that the practice will catch on. It seems more convenient, consistent, and compliant with known standards to have all of the input data in one place.
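As a rough illustration of the encoding described for Fig. 1, the sketch below parses a small TEI-style graph into readings and edges. The XML fragment is reconstructed from the prose of this post, and its exact attribute layout (`n`, `from`, `to`) is an assumption rather than the library's documented schema; `parse_local_stemma` is a hypothetical helper, not part of open-cbgm.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI fragment reconstructed from the prose description;
# the attribute layout is an assumption, not the library's schema.
LOCAL_STEMMA_XML = """
<graph type="directed">
  <node n="a"/>
  <node n="b"/>
  <node n="c"/>
  <node n="cf"/>
  <node n="d"/>
  <node n="e"/>
  <node n="f"/>
  <node n="zw-b/d"/>
  <arc from="a" to="b"/>
  <arc from="a" to="c"/>
  <arc from="c" to="cf"/>
  <arc from="c" to="d"/>
  <arc from="b" to="zw-b/d"/>
  <arc from="d" to="zw-b/d"/>
</graph>
"""


def parse_local_stemma(xml_text):
    """Return (readings, edges) from a TEI-style <graph> element.

    Each edge is a (prior, posterior) pair of reading labels.
    """
    graph = ET.fromstring(xml_text.strip())
    readings = [node.get("n") for node in graph.findall("node")]
    edges = [(arc.get("from"), arc.get("to")) for arc in graph.findall("arc")]
    return readings, edges


readings, edges = parse_local_stemma(LOCAL_STEMMA_XML)

# Readings that appear in no edge have no stated source or descendants.
isolated = [r for r in readings if all(r not in edge for edge in edges)]
print(isolated)
```

In this reconstruction, the readings with unclear origin are simply the nodes that no arc touches, which is why no placeholder parent is needed.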

3. Features

3.1 Adjustments to Local Stemmata

The source code of the open-cbgm library is free to be copied and modified by anyone, and the local stemmata generated by the library can be changed by anyone with access to the input XML file. These two features alone would suffice to meet the third goal, as probably the most desired feature in an implementation of the CBGM among scholars is customizability. But being able to change the topology of local stemmata is just the beginning. To demonstrate some noteworthy features that have been implemented, let’s start with the local stemma in Fig. 2.

Figure 2: Local stemma whose XML representation was shown in Fig. 1. Readings e and f are isolated because the ECM editors were unclear on their origin. In the INTF’s Genealogical Queries tool, they are displayed as children of a “?” placeholder reading. Since this notation is potentially misleading (the readings may not have a common parent), open-cbgm uses a simpler and more functional notation.

Notice that the defective and ambiguous readings included in the XML are also included in this local stemma, with edges directed to them. In practice, we may want to treat ambiguous readings as lacunae, ignoring them for the purposes of genealogical comparison. Alternatively, we might prefer to treat such readings as agreeing with any of their “parent” readings. The open-cbgm library can handle either case.

Figure 3: Local stemma with ambiguous readings dropped

Figure 4: Local stemma with ambiguous readings treated as indistinct from their parents

Suppose we’ve dropped the ambiguous reading from the stemma, and now we want to ignore the defective subvariation of reading cf. The open-cbgm library can treat subvariations of different types, such as orthographic or defective, as trivial, eliminating any disagreement with their parent readings.

Figure 5: Local stemma with ambiguous readings dropped and defective readings treated as indistinct from their parents.
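The two adjustments behind Figs. 3 and 5 can be sketched in a few lines. This is only an illustration under stated assumptions: the edge list is reconstructed from the prose, the `READING_TYPES` mapping and both function names are hypothetical, and the merge step assumes each trivial reading has exactly one parent.

```python
# Edges as (prior, posterior) pairs, reconstructed from the prose.
EDGES = [
    ("a", "b"), ("a", "c"), ("c", "cf"), ("c", "d"),
    ("b", "zw-b/d"), ("d", "zw-b/d"),
]
# Hypothetical reading-type labels for the non-substantive readings.
READING_TYPES = {"cf": "defective", "zw-b/d": "ambiguous"}


def drop_readings(edges, types, dropped_type):
    """Remove every reading of the given type, along with its edges."""
    dropped = {r for r, t in types.items() if t == dropped_type}
    return [(p, q) for p, q in edges if p not in dropped and q not in dropped]


def merge_trivial(edges, types, trivial_type):
    """Collapse readings of the given type into their parent reading.

    Assumes each trivial reading has a single parent. Returns the new
    edge list and an alias map from collapsed reading to parent.
    """
    alias = {q: p for p, q in edges if types.get(q) == trivial_type}
    merged = []
    for p, q in edges:
        p, q = alias.get(p, p), alias.get(q, q)
        if p != q and (p, q) not in merged:
            merged.append((p, q))
    return merged, alias


no_ambiguous = drop_readings(EDGES, READING_TYPES, "ambiguous")
merged, alias = merge_trivial(no_ambiguous, READING_TYPES, "defective")
print(merged)  # [('a', 'b'), ('a', 'c'), ('c', 'd')]
print(alias)   # {'cf': 'c'}
```

With the alias map in hand, a witness reading cf would be compared as if it read c, so the dittography no longer counts as a disagreement.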

3.2 Textual Flow Strength

In his dissertation on the CBGM, Andrew Edmondson proposed the idea of highlighting flow strength in textual flow diagrams. The idea is that the textual flow between two witnesses is strong if the first witness predominantly has readings prior to those of the second witness, and weak otherwise. Weak textual flow between two witnesses means that their genealogical relationship could easily be reversed if relationships between readings in just a few local stemmata are changed. This is useful to textual critics because it helps them know when they should or shouldn’t revise a local stemma based on the results of a textual flow diagram. The open-cbgm library supports both classic and flow strength-formatted versions of all three types of textual flow diagrams (see Figs. 6 and 7).

Figure 6: Classic version of a “coherence in variant passages” textual flow diagram

Figure 7: Flow strength-formatted version of a “coherence in variant passages” textual flow diagram
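To make the notion of flow strength concrete, here is one plausible way to quantify it from the counts of prior and posterior readings between two witnesses. Both functions are illustrative assumptions; they are not taken from Edmondson's dissertation or from the open-cbgm source, which may define strength differently.

```python
def flow_strength(prior, posterior):
    """Share of directed disagreements where the first witness is prior.

    Values near 1.0 suggest strong flow from the first witness to the
    second; values near 0.5 suggest the direction could easily flip.
    Returns None when the witnesses have no directed disagreements.
    """
    directed = prior + posterior
    if directed == 0:
        return None
    return prior / directed


def changes_to_reverse(prior, posterior):
    """Fewest local stemma reversals that would flip the flow direction.

    Each reversal turns one 'prior' passage into a 'posterior' one.
    Returns 0 if the flow does not run from the first witness at all.
    """
    if prior <= posterior:
        return 0
    return (prior - posterior) // 2 + 1


# Hypothetical counts for a strong and a weak relationship.
print(flow_strength(12, 2), changes_to_reverse(12, 2))  # six reversals needed
print(flow_strength(6, 5), changes_to_reverse(6, 5))    # one reversal suffices
```

Under this measure, the second pair is exactly the kind of relationship a critic should hesitate to build on, since revising a single local stemma would reverse it.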

3.3 Weighted Local Genealogies

Another desired feature is the ability to weigh agreements. While the open-cbgm library does not support this feature directly, it does offer a more nuanced option: it allows us to weigh textual changes as represented by the edges of local stemmata. In 3 John 1:13/24–26, we might consider the transpositions from reading a (σοι γραφειν) to reading b (γραφειν σοι) and from c (σοι γραψαι) to d (γραψαι σοι) to be less significant or more common than the change in tense from a to c, and the dittography leading from c to cf (σοι σοι γραψαι) as almost trivial. We can assign weights to these changes in the XML file to reflect our judgment (see Fig. 8), and the open-cbgm library will take these values into account for its calculations. It can also render the local stemma graph with the weights included for convenience (see Fig. 9).

Figure 8: Representation of a local stemma with weighted edges in TEI XML. Per the TEI guidelines, the “label” elements contain the weights we want associated with particular edges. Since we’ll be dropping the ambiguous reading zw-b/d, we have not bothered to assign weights to the edges leading to it.

Figure 9: Local stemma with ambiguous readings dropped and edge weights displayed

But how, and why, would edge weights figure into the calculations of the open-cbgm library? As it turns out, paths between readings in local stemmata, and the lengths of those paths, are key ingredients in the open-cbgm library’s recipe for constructing the global stemma. I’ll explain below.

Let’s start by considering the local stemma for 3 John 1:13/24–26. We can see that reading c is prior to reading d. Reading a is prior to reading c, so it is also prior to reading d—but more distantly. In an unweighted local stemma like that of Fig. 3, we would say that one change takes place from reading c to reading d, while two changes take place from reading a to reading d. In a weighted local stemma like that in Fig. 9, the same idea would hold, but with the weights of changes rather than their counts. Similarly, for a pair of witnesses, we would express the genealogical distance from the ancestor to the descendant as the sum of the distances between their readings at all passages where the ancestor’s reading is prior to the descendant’s. In other words, genealogical distance between witnesses measures how much change occurred between an earlier textual state and a later one. 
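The distance calculation described above can be sketched as a shortest-path search over a weighted local stemma, summed over passages. The weights below are hypothetical stand-ins (the values in Fig. 8 are not reproduced here), and the function names and the witness labels W1 and W2 are illustrative rather than the library's own.

```python
import heapq

# Hypothetical edge weights for 3 John 1:13/24-26.
WEIGHTED_EDGES = {
    ("a", "b"): 0.5,   # transposition
    ("a", "c"): 1.0,   # change in tense
    ("c", "d"): 0.5,   # transposition
    ("c", "cf"): 0.1,  # dittography
}


def reading_distance(edges, source, target):
    """Length of the cheapest directed path from source to target.

    Returns 0.0 for identical readings and None if source is not
    prior to target.
    """
    adjacency = {}
    for (prior, posterior), weight in edges.items():
        adjacency.setdefault(prior, []).append((posterior, weight))
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, reading = heapq.heappop(queue)
        if reading == target:
            return dist
        if dist > best.get(reading, float("inf")):
            continue  # stale queue entry
        for nxt, weight in adjacency.get(reading, []):
            candidate = dist + weight
            if candidate < best.get(nxt, float("inf")):
                best[nxt] = candidate
                heapq.heappush(queue, (candidate, nxt))
    return None


def witness_distance(passages, ancestor, descendant):
    """Sum reading distances over passages where the ancestor is prior."""
    total = 0.0
    for edges, readings in passages:
        dist = reading_distance(edges, readings[ancestor], readings[descendant])
        if dist is not None:
            total += dist
    return total


print(reading_distance(WEIGHTED_EDGES, "c", "d"))  # 0.5
print(reading_distance(WEIGHTED_EDGES, "a", "d"))  # 1.5
print(reading_distance(WEIGHTED_EDGES, "d", "a"))  # None

# Three hypothetical passages that reuse the same local stemma.
passages = [
    (WEIGHTED_EDGES, {"W1": "a", "W2": "d"}),  # W1 prior, distance 1.5
    (WEIGHTED_EDGES, {"W1": "c", "W2": "c"}),  # agreement, distance 0.0
    (WEIGHTED_EDGES, {"W1": "b", "W2": "a"}),  # W1 not prior, ignored
]
print(witness_distance(passages, "W1", "W2"))  # 1.5
```

Setting every weight to 1 recovers the unweighted case, where the distance from a to d is simply the two changes counted in the text.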

The value of this concept is that it allows us to speak more broadly of which readings “explain” others for the purpose of finding stemmatic ancestors. In other implementations of the CBGM, a reading (say, reading d from the above example) can only be explained by agreement or by descent from a parent reading (reading c), but not by descent from a further ancestor reading (reading a). This constraint is intended to isolate stemmatic ancestors that are genealogically close to a witness, but in sections of the global stemma where the extant textual tradition is sparse—a scenario observed in real data—it can also create a situation where a witness has readings that none of its potential ancestors can explain.

Gerd Mink was aware of this possibility, and he proposed adding intermediary nodes to the global stemma so that such witnesses can have feasible substemmata. While this solves the problem, it is clearly an ad hoc solution, and one that unnecessarily creates the appearance of contamination where common ancestry is a more parsimonious explanation.

By contrast, if we replace the hard constraint on the definition of explained readings with the soft penalty function in the form of genealogical distance, then we can avoid this problem altogether, and we can account for gaps in the extant tradition using “long branches” in the global stemma rather than additional nodes. This modification simplifies the process of global stemma construction and makes its results more consistent with those of other phylogenetic approaches. Perhaps more importantly, in the few cases where its decisions in the global stemma might differ from those of previous implementations, the open-cbgm library is flexible enough to help us replicate decisions made according to the rules of those implementations.
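The difference between the hard constraint and the relaxed rule comes down to whether only a direct parent, or any ancestor, may explain a reading. A minimal sketch follows, with hypothetical function names and the a, c, d chain from the example above.

```python
# The chain a -> c -> d from the local stemma discussed above.
EDGES = [("a", "c"), ("c", "d")]


def explains_strict(edges, prior, posterior):
    """Hard constraint: agreement or direct parent only."""
    return prior == posterior or (prior, posterior) in edges


def explains_relaxed(edges, prior, posterior):
    """Relaxed rule: agreement or descent from any ancestor reading."""
    stack, seen = [prior], set()
    while stack:
        reading = stack.pop()
        if reading == posterior:
            return True
        if reading in seen:
            continue
        seen.add(reading)
        stack.extend(q for p, q in edges if p == reading)
    return False


print(explains_strict(EDGES, "c", "d"))   # True
print(explains_strict(EDGES, "a", "d"))   # False
print(explains_relaxed(EDGES, "a", "d"))  # True
print(explains_relaxed(EDGES, "d", "a"))  # False
```

Under the relaxed rule, a witness reading a can still serve as a stemmatic ancestor for a witness reading d; the missing intermediate state shows up as a longer genealogical distance rather than as an extra node.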

3.4 Faster Optimization for Substemmata

With regard to the goal of speed, the open-cbgm library benefits from a number of optimizations and algorithmic tricks. Following Edmondson’s Python implementation of the CBGM, the library uses a SQLite database to store genealogical comparisons between witnesses, making their calculation (a one- to two-minute process in 3 John) one-time work. This streamlines more common tasks like finding a witness’s relatives at a variation unit, finding candidates for a witness’s stemmatic ancestors, and printing out a variety of graphs. The library encodes genealogical relationships between pairs of witnesses as bitmaps, making their storage in memory compact and leveraging hardware optimizations for operations involving them.
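The bitmap idea can be illustrated with plain Python integers used as bit sets, one bit per variation unit. This is only a sketch of the general technique: the variable names and the eight sample units are hypothetical, and the library's own data structures are not shown here.

```python
def to_bitmap(flags):
    """Pack a sequence of truthy/falsy flags into an integer bit set."""
    bitmap = 0
    for index, flag in enumerate(flags):
        if flag:
            bitmap |= 1 << index
    return bitmap


def popcount(bitmap):
    """Number of set bits, i.e. the number of variation units flagged."""
    return bin(bitmap).count("1")


# Hypothetical comparison of one witness pair over eight variation units.
extant = to_bitmap([1, 1, 1, 1, 1, 1, 0, 1])      # both witnesses extant
agreements = to_bitmap([1, 1, 0, 0, 0, 1, 0, 0])  # same reading
prior = to_bitmap([0, 0, 1, 0, 1, 0, 0, 0])       # first witness is prior
posterior = to_bitmap([0, 0, 0, 0, 0, 0, 0, 1])   # first witness is posterior

# Units where both are extant but none of the relations above holds.
unclear = extant & ~(agreements | prior | posterior)

print(popcount(extant), popcount(agreements), popcount(prior),
      popcount(posterior), popcount(unclear))  # 7 3 2 1 1
```

Every count needed for a comparison table then reduces to a bitwise operation followed by a population count, which is the kind of operation hardware handles very quickly.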

These techniques are critical for substemma optimization, or selecting the best ancestors for each witness in the global stemma. Until now, no one has been able to construct a complete global stemma for an entire book of the New Testament (although Gerd Mink, Peter Gurry, and Andrew Edmondson have produced partial global stemmata for different corpora using different subsets of witnesses).

To my knowledge, there is no prescribed algorithm for this task, only a set of guiding principles: the stemmatic ancestors of a witness must explain every reading of that witness; on the basis of parsimony, a solution involving fewer stemmatic ancestors is generally better than a solution involving more; on the basis of faithful copying, a solution that agrees more with the readings of the witness is better than one that agrees less. 

As I have described above, I have interpreted the first guiding principle more broadly, and as a result, my approach to substemma optimization in the open-cbgm library differs somewhat from the approaches of other implementations. In the open-cbgm library, a witness’s stemmatic ancestors still need to explain all of its readings, but a reading can be explained by agreement or by descent from any of its ancestor readings.

In place of the constraints enforced by other implementations, the open-cbgm library treats genealogical distance between stemmatic ancestors and descendants as a cost function to be minimized in substemma optimization. Formulated this way, substemma optimization is reduced to what is known in computer science as a weighted set cover problem. This class of problems happens to have heuristics that yield fast and exact solutions in practice. Thanks to these heuristics, the open-cbgm library can find the lowest-cost substemmata for a given witness in a fraction of a second and construct a complete global stemma for 3 John in under ten seconds.
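To show what the weighted set cover formulation looks like, the sketch below finds a lowest-cost set of ancestors by exhaustive search. This brute-force approach is exponential in the number of candidates and is not the library's heuristic, which is not described here; the witness names, passage numbers, and costs are all hypothetical.

```python
from itertools import combinations


def best_substemma(passages, candidates):
    """Lowest-cost set of candidates whose readings explain every passage.

    `candidates` maps a witness name to (explained passages, cost), where
    cost stands in for genealogical distance. Smaller sets are tried
    first, so ties in cost go to the solution with fewer ancestors.
    """
    best_set, best_cost = None, None
    names = sorted(candidates)
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            covered = set().union(*(candidates[n][0] for n in combo))
            if not passages <= covered:
                continue  # some reading is left unexplained
            cost = sum(candidates[n][1] for n in combo)
            if best_cost is None or cost < best_cost:
                best_set, best_cost = combo, cost
    return best_set, best_cost


# Hypothetical witness with five passages and four potential ancestors.
passages = {1, 2, 3, 4, 5}
candidates = {
    "A": ({1, 2, 3, 4, 5}, 9.0),  # explains everything, but distantly
    "X": ({1, 2, 3}, 2.0),
    "Y": ({3, 4, 5}, 2.5),
    "Z": ({4, 5}, 4.0),
}
print(best_substemma(passages, candidates))  # (('X', 'Y'), 4.5)
```

Here the two close relatives X and Y together beat the single distant ancestor A, which reflects the trade-off between parsimony and faithful copying in the guiding principles above.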

A preliminary version of the complete global stemma for 3 John appears in Fig. 10. The open-cbgm library has an option to format edges as dotted, dashed, or solid depending on levels of agreement between stemmatic ancestors and descendants.

Figure 10: Global stemma for 3 John with edges formatted based on agreement levels

Why is A not the only root of the stemma? For two reasons.

First, one of the orphaned witnesses, GA 365, is fragmentary, and where it’s not lacunose, it agrees completely with A. As a result, no other witness can be its ancestor. To handle this problem, the open-cbgm library offers preprocessing settings for excluding fragmentary witnesses.

Second, any reading with an unclear source will leave the highest-priority witness supporting it without a feasible set of stemmatic ancestors, because no potential ancestor’s reading can explain it. In 3 John 1:13/24–26, GA 0142 and 61 are the highest-priority witnesses to the isolated readings e and f; both are orphans in the global stemma. Preferably, we would solve this by deciding the sources of all of the readings in question in their respective local stemmata, but as a quick fix, we can just remove the variation units containing them from consideration. This gives us the connected global stemma in Fig. 11.

Figure 11: The same global stemma, but with fragmentary witnesses and variation units with unclear reading sources excluded

4. Conclusion

The open-cbgm library has fulfilled all of the goals I had in mind when I started designing it, but I’ll be the first to say that there is room for improvement. For one thing, all of its modules are run entirely from the command line, so a more user-friendly interface would be ideal. In the meantime, I’ve endeavored to provide ample documentation on the GitHub page in the hope that users from any background will be able to use the CBGM independently and comprehensively.

Special thanks to Brent Niedergall for looking over drafts of this post and offering helpful comments.


  1. Thanks for this, Joey. I'm really happy to see an open source tool made available (with documentation!). It's good news for text critics and critics of the CBGM whose chief complaint is that the CBGM is a "black box in Münster." Peter and Tommy have helped people to understand what the CBGM does and how its practitioners use it, but access to the software itself has always been limited.

    1. Thanks, David! It was a pleasure to work on the software, and I'm glad it's been received well so far! One of the other things that's great about open-source code is that it encourages feedback about errors in the code and differences in methodology. If you end up using the library and have any questions or issues with what you see in the code, feel free to reach out to me or (in the case of a bug in the code or a feature request) add an issue on the project's GitHub page!

  2. << Gerd Mink was aware of this possibility, and he proposed adding intermediary nodes to the global stemma so that such witnesses can have feasible substemmata. While this solves the problem, it is clearly an ad hoc solution, and one that unnecessarily creates the appearance of contamination where common ancestry is a more parsimonious explanation. >>

    Could you expand on that a bit?

    1. Sure! For reference, Mink discusses this issue in "Problems of a Highly Contaminated Tradition: the New Testament: Stemmata of Variants as a Source of a Genealogy for Witnesses," pp. 59–63, and Edmondson briefly covers the same topic in pp. 139–140 of his dissertation.

      Assume we have two witnesses B and C with a common ancestor A and that more scribal change has occurred from A to C than from A to B. So in general, A has more prior readings than B, and A and B both have more prior readings than C. According to the rules of the CBGM, A would be a potential ancestor to both B and C, and B would be a potential ancestor of C. C would not be any other witness's potential ancestor, as it has the most posterior readings.

      Now let's suppose that at one variation unit, A has an early reading, which we'll denote a. In this same variation unit, C has a reading b that's posterior to reading a, and B has a reading c that's posterior to reading b. So we have one variation unit where the genealogy of readings goes against the predominant trend. According to Mink's formulation of the rules, reading a can explain reading b, and reading b can explain reading c, but reading a can't explain reading c. This leaves us without a way to explain reading c in witness B using only B's set of potential ancestors (which, remember, consists only of witness A). So we don't have a feasible set of ancestors to assign to B in the global stemma.

      Mink's solution is to add an intermediary node to the global stemma whose textual support consists solely of reading b in the problematic variation unit. This way, A and the intermediary node can cover all the readings found in B and all the readings found in C, and we've bridged a gap caused by a lack of extant evidence to states of the text between witnesses A and B and between A and C. This sort of thing typically happens near the top of the global stemma, because evidence of the earliest states of the text is more sparse.

      My issue with this solution is that you could avoid adding anything to the global stemma if you just let reading a explain reading c, knowing that the change to reading b in between likely happened, but wasn't preserved in the manuscript tradition. It seems much less complicated to say that B and C derived from A independently, which was our underlying assumption in the first place. This conclusion is reflected in my implementation choices behind the open-cbgm library.

      (Sorry this was so long; illustrations would help a lot for this explanation, but I can't put them in a comment. I hope this helps, but if you still have questions, please let me know!)

    2. Thank you, James, for asking for clarification on Gerd Mink's proposal. And thank you, Joey, for the expanded clarification, i.e., "two witnesses B and C with a common ancestor A" (assuming more scribal change from A to C than from A to B). I had to write out those A-a, B-c, and C-b relationships, and re-read several times, in order to understand the proposal. I do see how you "don't have a feasible set of ancestors to assign to B in the global stemma." More dialogue will be much appreciated.