Thursday, June 18, 2009

A mathematical conundrum

According to Ehrman (Misquoting Jesus, p. 84), John Mill (1707) examined readings in 100 Greek mss, and concluded there were 30,000 variants. Ehrman notes that we now have 57 times (his figures) the number of witnesses available to Mill (p. 88) and that there are between 200,000 and 400,000 variants (p. 89), which, incidentally, is more variants than there are words in the NT (p. 90).

Why should there be approximately 57 times more witnesses and only (very) approximately 10 times more variants? Were Mill's figures bloated? Or do subsequently discovered mss reveal ever lower proportions of previously unknown variants? What role do different estimates of the versions play in this?


  1. "Or do subsequently discovered mss reveal ever lower proportions of previously unknown variants?"

    Isn't it expected that the answer to this should be yes, purely on mathematical grounds? The more manuscripts we collate, the more of the variants within them will be previously known, and the fewer will be previously unknown. As the number of manuscripts increases, the number of new variants found in each one should gradually decrease.

    Here's an analogous situation that is also possibly familiar to biblical scholars. Suppose we are counting the total number of words in a given ancient author's vocabulary. We do this by going page by page through his extant works. On our first page we will count each word on its first occurrence. The second page will turn up more words, but probably not as many as the first page, since they will have some words in common that will have already been counted. The third page will turn up yet fewer new words. As we proceed through hundreds of pages the total number of words we have identified in his vocabulary will continue to increase, but at ever decreasing rates of new words/page.
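
The page-by-page vocabulary count can be illustrated with a small simulation. This is only a sketch: the vocabulary size, page length, and uniform word choice are all invented assumptions (a real author's word frequencies are heavily skewed), and the function name `new_words_per_page` is hypothetical.

```python
import random

def new_words_per_page(vocab_size=2000, pages=200, words_per_page=100, seed=0):
    """Count how many previously unseen words each successive page contributes.

    Assumption: every word on every page is drawn uniformly at random from a
    fixed vocabulary. This is a toy model, not a claim about any real author.
    """
    rng = random.Random(seed)
    seen = set()
    new_counts = []
    for _ in range(pages):
        page = [rng.randrange(vocab_size) for _ in range(words_per_page)]
        before = len(seen)
        seen.update(page)
        new_counts.append(len(seen) - before)
    return new_counts, len(seen)

counts, total = new_words_per_page()
# Average number of new words per page in a few ten-page windows.
for start in (0, 10, 50, 150):
    block = counts[start:start + 10]
    print(f"pages {start + 1:>3}-{start + 10:>3}: {sum(block) / 10:.1f} new words per page")
print("distinct words found:", total)
```

Under these assumptions the running total keeps growing while the per-page yield of new words falls away, which is the pattern the comment describes for variants per manuscript.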

  2. Eric,
    Thanks for your comments. Are there extrapolations that one could make?

  3. As the previous comment suggested, this is intuitively to be expected. But there is also a branch of mathematics that tries to quantify this sort of situation. It is called capture theory, because it was originally developed by those capturing animals. (If we capture, tag and release animals, we find that we recapture many animals. Can we estimate the number of uncaptured animals from the proportion that are recaptured? The answer is yes.) But the theory is now used in many different fields - I used it in my studies in criminology. (Given that a certain number of criminals are recaptured, how many criminals are there that have never been captured?)

    As an example of this situation: If there are n variants in all the manuscripts ever written, and each has a fixed and independent probability p of being in a given manuscript (an obviously false assumption) then (mathematical details skipped) after finding m manuscripts we can expect to have found n-n(1-p)^m variants. With two data points (m=100, 30,000 variants; m=5700, 300,000 variants (assuming Mill and Ehrman define "variant" in the same way, which I doubt)) we can then solve the equations to find n and p:
    so there are only 753 variants left to find! Well, I said that the assumption was not very realistic.
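
The solved values of n and p are missing from the comment as preserved here. A numerical solve of the two equations, under the commenter's own admittedly unrealistic model and his two data points, can be sketched as follows. The function name `solve_capture_model` and the use of bisection are choices made for this illustration, not the commenter's method.

```python
def solve_capture_model(m1, v1, m2, v2, tol=1e-12):
    """Solve v = n * (1 - (1 - p)**m) for n and p from two (m, v) data points.

    Writing q = (1 - p)**m1, the ratio v2 / v1 equals
    (1 - q**(m2 / m1)) / (1 - q), which increases with q on (0, 1),
    so bisection finds the single root (requires 1 < v2/v1 < m2/m1).
    """
    ratio = m2 / m1
    target = v2 / v1
    lo, hi = 0.0, 1.0 - 1e-9
    while hi - lo > tol:
        q = (lo + hi) / 2
        if (1 - q ** ratio) / (1 - q) < target:
            lo = q
        else:
            hi = q
    q = (lo + hi) / 2
    n = v1 / (1 - q)          # total variants in the model
    p = 1 - q ** (1 / m1)     # per-manuscript probability of each variant
    return n, p

# Data points from the comment: 100 mss -> 30,000 variants; 5,700 mss -> 300,000.
n, p = solve_capture_model(100, 30_000, 5_700, 300_000)
remaining = n - 300_000
print(f"n = {n:,.0f}, p = {p:.5f}, still unfound = {remaining:,.0f}")
```

With these inputs the unfound remainder comes out at roughly 750, in line with the 753 quoted in the comment. The result is very sensitive to the two input figures, which is part of why the commenter treats it as a toy.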

  4. Richard - Fascinating, thanks for that.

  5. Richard, I really enjoyed your comment, but as a non-mathematician I struggle with the concept that one could calculate how many variants there ever have been without knowing how many NT manuscripts there ever have been.

    Am I committing a basic logical fallacy? I would be happy to be instructed if I am.

  6. The fact that we discover fewer new variants as we find more manuscripts is intuitive; the fact that we can count uncaptured variants (animals, criminals) on the basis of how many and how often variants appear and are repeated is not intuitive, but mathematically it works. Of course, it all depends on your assumptions (like equal capture probability, or that the number of captures for each variant has an exponential distribution).

    Wikipedia has a brief article about this at
    but it only describes the simplest situation in which this technique is used.
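
For readers unfamiliar with the idea, the simplest two-sample capture-recapture estimate (commonly called the Lincoln-Petersen estimator) can be sketched as below. Whether this is the exact case the linked article covered is an assumption, since the link is not preserved here, and the numbers in the usage line are invented for illustration.

```python
def lincoln_petersen(marked_first, caught_second, recaptured):
    """Estimate a population size from two capture rounds.

    Assumes a closed population and an equal chance of capture for every
    individual, so the marked share of the second catch mirrors the marked
    share of the whole population: recaptured / caught_second ~ marked_first / N.
    """
    if recaptured == 0:
        raise ValueError("no recaptures: the estimate is undefined")
    return marked_first * caught_second / recaptured

# Hypothetical figures: tag 100 animals, later catch 80, of which 20 carry tags.
print(lincoln_petersen(100, 80, 20))  # 400.0
```

A quarter of the second catch is tagged, so the 100 tagged animals are taken to be about a quarter of the population. The equal-capture assumption is exactly the one the comment flags as doing the heavy lifting.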

  7. [as I was writing this, Richard Wilson posted his comment - which I saw after I finished mine]

    Concerning the total number of MSS available for study, that is an ongoing process as discussions on this blog indicate (see posts about Dan Wallace's work for example). Probably it is safe to say that there are over 5700 NT MSS available with both continuous text and lectionaries included in that count.

    I note the number of MSS in order to preface a comment concerning the 200,000 to 400,000 variants. First, where did Ehrman get his count of variants? I recently heard that Ehrman was asked where the count came from, and he noted to one scholar that he got it from him. Then another scholar extrapolated, based on a recent publication, that there could even be up to 800,000 variants. (This is what I heard, so if the scholars wish to name themselves they may.) The main point being, just how did someone come up with the count of 400,000 variants?

    Second, I believe that Ehrman's comparison of 400,000 variants (if that is accurate) being greater than the number of words in the NT is an invalid comparison. The 400,000 variants are not from a single copy of the NT, rather they are from the entire tradition of NT MS transmissions. In other words, the 400,000 variants must be compared to the sum total of ALL the words contained in the NT text of over 5700 MSS. When Ehrman was in New Orleans for a conference there, I was able to ask him if he had compared the 400,000 variants to the number of words in the NT MSS tradition as a whole. His response was, "Hmm, I haven't thought about that."

    Just for perspective, allow me to propose the following figures. If one is able to compile enough text from the 5700 MSS to form the equivalent of 300 complete NTs (I believe that 300 is a low number considering the total number of MSS available), and if the total number of words in the NT is about 138,000 (a quick search for * in the GNT in Accordance), then the total number of words for 300 NTs would be 41,400,000. This would mean that a total of 400,000 variants accounts for 0.966% of the total words in the NT MSS tradition. Yes, I know that each variant can contain more than one word, but even if there are as many as 2,000,000 words involved, the percentage compared to the words in 300 NTs is still only 4.83%.

    I know that the figures I proposed for perspective are speculative. I do, however, believe that they make the point that the number of variants should be compared to the number of words in all the NT MSS and not just to one NT as Ehrman likes to propose.

    Any comments as to another estimate of the total number of words in the NT MSS tradition? They will be welcome.
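
The percentages in this comment can be re-checked with a few lines of arithmetic. The inputs are the commenter's own speculative figures (300 NT equivalents, 138,000 words each); the names `WORDS_PER_NT`, `EQUIVALENT_NTS`, and `share_percent` are made up for this sketch.

```python
WORDS_PER_NT = 138_000    # commenter's word count for the GNT
EQUIVALENT_NTS = 300      # commenter's deliberately low guess
total_words = WORDS_PER_NT * EQUIVALENT_NTS

def share_percent(words_involved, total=total_words):
    """Express a number of affected words as a percentage of all copied words."""
    return 100 * words_involved / total

print(total_words)                         # 41400000
print(round(share_percent(400_000), 3))    # 0.966
print(round(share_percent(2_000_000), 2))  # 4.83
```

Changing `EQUIVALENT_NTS` shows how strongly the percentage depends on the estimate of how much text survives, which is the open question the comment ends on.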

  8. PJ,
    I seemed to recall that an article had been mentioned here before that made a similar claim to what Richard is saying here. After a little googling I turned it up:

    Be sure to check the embedded link to a BBC article by Cisne as well as the link Dr. Head provides in his comment. It looks to me like Dr. Cisne is talking about using the same basic statistical concepts that Richard mentioned.

  9. The mathematical theory is probably similar, but he has applied it in a different way: capturing manuscripts over time to estimate the number of manuscripts, rather than capturing variants across manuscripts to estimate the number of variants.

  10. Dr. Ehrman got his 400,000 variants from Dan Wallace. Here are the numbers/stats I put together after my now infamous claim that 43 percent of the verses of the NT are attested by the second century. This 43 percent cannot be historically or "paleographically" refuted (see my earlier questions directed toward Dr. Peter Head in a previous post related to this subject matter).

    NT Stats:

    138,000 words in GNT

    5,700 GNT mss

    2.6 million pages of GNT mss (per Dan Wallace)

    125 avg words per page (my estimate)

    325 million total number of words in all GNT mss

    400,000 variants

    By the way, someone said that what needs to be done is to relate the 400,000 variants to the 325,000,000 words. This actually turns out to be incorrect, since the 400,000 variants are UNIQUE variants, whereas the 325 million words are not unique words.

    Brett Williams

  11. Questions:

    Why don't you scholars add the Greek quotations from the Greek Church Fathers, if you are looking for attestation to a particular reading? (I can see how difficult it would be to determine a correct reading based on a non-Greek quotation.)

    Why don't you scholars add quotations found in the Greek and Latin (and any other language) Church Fathers in order to determine the number of verses attested in the second century?

    I guess I'm saying all your numbers and statistics seem skewed to me by omitting these other valuable sources. These appear to be restricted numbers but without any benefit to restricting them.

  12. Good point Roger.

    Here's another similar argument I use on that issue.

    Suppose, hypothetically, that tomorrow we uncover a treasure trove of 100 previously unknown complete manuscripts of the Gospel of John, all dating to around 200. Suppose further that these manuscripts all agree exactly with the NA27 text of John except that each manuscript has a single previously unknown variant at some point, and that each of these previously unknown variants occurs at a different point in the text. Therefore, the total number of variants now known for the Gospel of John will have increased by 100. However, the result of such a find would clearly not decrease our confidence in the NA27 text of the Gospel of John. Rather, it would increase it tremendously.

  13. PJW,

    That's not quite what Ehrman wrote. Mill did not just compare MSS, but also patristic citations, and early versions, such as the Gothic and Bohairic.

    Another factor is that Mill's text was a complete New Testament text, while the vast majority of MSS are not complete NT MSS; some contain only one or a few books and some are fragmentary.

    So, saying that we have 57 times the number of witnesses that Mill used is not the same as saying that we have 57 times the amount of evidence that Mill used.

    One might similarly say that we have collected 57 times more antelope than Mill. But lots of our antelope are merely the remains of antelope which were devoured by predators.

    So in addition to the principle drawn from the "Animal Recapture" analogy, when we ask, "Why should there be approximately 57 times more witnesses and only (very) approximately 10 times more variants?" another factor is that many of the new witnesses are fragmentary.

    Yours in Christ,

    James Snapp, Jr.

  14. Brett Williams,

    Thank you for your response. I have a question. Since the 400,000 variants are extant among the 2.6 million pages of GNT MSS and therefore are contained within the 325 million words (your estimate) contained therein, and since they exist nowhere else, with what should they be compared? The fact that the variants are unique does not alter the fact of where they are extant - they are extant in the MSS themselves and thus derived from the entire NT MS tradition and not merely one copy (138,000 words).

    I believe that the issue here is not the numbers, but how the comparisons are made. If one compares 400,000 variants to 138,000 words in the NT, the result is a terrible copy process. If, however, one compares the 400,000 variants to the 325 million words (or even my lesser 41,400,000), then the result is a very accurate process: only 400,000 variants out of 325 million words is impressively accurate.

    I look forward to your perspective on this. Again, thank you for your response.

    Steven Whatley

  15. Brett,

    I am not so sure that comparing unique variants to non-unique words isn't a valid comparison. Think of it as 325,000,000 opportunities to introduce a new variant (copying error). You can't simply take all non-unique variants and make the comparison because you don't know how many of those non-unique variants were simply copies of earlier errors.

  16. I meant to add thanks for the contributions so far. It has been very interesting.

  17. I'm now wondering how you define 'unique variant.'

    You are aware that 400,000 is not the TOTAL NUMBER of variants, right? The TOTAL NUMBER of variants is in the millions.

    Hence, you cannot divide UNIQUE variants by TOTAL NUMBER of words in mss. The TOTAL NUMBER of variants (millions) is significantly reduced to yield the UNIQUE variants (400k).

    If you want to see some relationship between words and variants, you would have to divide apples by apples, right? In other words, you would have to divide UNIQUE by UNIQUE or TOTAL by TOTAL.

    Have I overlooked the obvious?

    Brett Williams

  18. Brett,

    To me yes you have overlooked the obvious.

    Suppose I take a manuscript, copy it, and make one error. Then 100 others take my copy of the manuscript, copy it again, and make no new errors but accurately copy my error. Is that now 101 variants or is it 1?

    I think the answer is 1.

  19. In all the quasi-mathematical discussion over hypotheticals, Paul hits the nail squarely on the head:

    "You have overlooked the obvious....the answer is 1."

    To put it more clearly:

    If 100 MSS share 10 variant readings, there remain only 10 variant readings, and not 1000 (apples).

    If the same 100 MSS each possess 10 variant readings unique to each MS, then, yes, 1000 variant readings do exist (oranges).

    Equally, if the same 100 MSS each possess 10 distinctive orthographic differences (whether irrelevant itacisms, movable nu, spelling with doubled letters, etc.) or nonsense readings, some people seem to add these to the total as though they were actual variant readings, when in fact they are not (strawberries).

    Much of the statistical extrapolation seems to claim totals based more upon apples and strawberries being counted as though they were oranges rather than being recognized for what they really are.

    In the end, far fewer variant readings exist than some people extravagantly claim.
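
The two ways of counting that are being contrasted in these comments (distinct readings versus every occurrence of a reading) can be made concrete with a short sketch. The data are invented to mirror the hypotheticals in the thread, and `count_variants` is a name made up for this illustration.

```python
def count_variants(manuscripts):
    """Return (unique readings, total attestations).

    `manuscripts` is a list of sets; each set holds labels for the variant
    readings found in one manuscript. Identical labels mean the same reading.
    """
    unique = set().union(*manuscripts) if manuscripts else set()
    total = sum(len(ms) for ms in manuscripts)
    return len(unique), total

# 100 MSS that all share the same 10 readings.
shared = [{f"v{i}" for i in range(10)} for _ in range(100)]
# 100 MSS with 10 readings each, none shared with any other MS.
distinct = [{f"ms{m}-v{i}" for i in range(10)} for m in range(100)]
# One copying error, faithfully reproduced by 100 later copyists.
inherited = [{"my-error"} for _ in range(101)]

print(count_variants(shared))     # (10, 1000)
print(count_variants(distinct))   # (1000, 1000)
print(count_variants(inherited))  # (1, 101)
```

The sketch only separates the two tallies; it says nothing about which readings are merely orthographic, and it cannot tell an inherited error from the same error made independently twice.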

  20. "In the end, far fewer variant readings exist than some people extravagantly claim."

    Out of curiosity, Maurice, about where would you peg it? Do you have an estimate? (Or guesstimate?)

  21. Paul:

    I am trying to figure out what is the best way for us to continue our discussion. How about the following?

    Dr. Tommy Wasserman (TW) has done some very thorough work on the letter of Jude.

    Here are some stats on Jude:

    1. Jude contains 461 words
    2. TW consulted about 560 extant mss of Jude (wow!!!)
    3. TW found 1,271 'variants'

    To help me understand where I overlooked the obvious, would you mind drawing some conclusions from the above data, especially about textual variants, and then I will respond.

    My email address is if that is a better place for us to discuss this issue.


  22. Brett,

    I am not a Greek nor a textual scholar. I am just an interested reader and approaching this from a purely logical position.

    My initial post was primarily me thinking out loud. It seems logical to me that some non-unique variants are merely a perpetuation of an original scribal error. If that is so then the problem is not as simple as comparing unique variants with unique words and non-unique variants with non-unique words. I realise that perpetuating errors probably does not explain all the cases. But in how many cases is it likely that two different scribes would independently make the same copying error?

    This is not something that I think we can resolve because I do not think we have enough data, nor do I ever expect we will. I was merely suggesting that Steven Whatley's arithmetic might not be that unreasonable.

    Do I have a point or am I way off? Is it possible that some variants are merely accurate copies of an earlier error?


    BTW: If any offense was taken to my opening sentence "To me yes you have overlooked the obvious" I apologise unreservedly. It was a poor attempt at a witty riposte to your post.

  23. Assuming Jude to be representative of the NT textual tradition, we would have under 360,000 variants across the whole NT. However, if I remember rightly, Aland and Aland calculated that Jude had more variation than any other NT book.
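
The extrapolation from Jude can be written out as a simple proportion. This sketch scales by word count using the Jude figures given earlier in the thread; the commenter does not say which NT word total was used, so both totals below are assumptions (138,000 is the figure quoted elsewhere in the thread, 130,000 is a purely hypothetical alternative).

```python
JUDE_WORDS = 461        # from the stats quoted in the thread
JUDE_VARIANTS = 1_271   # from the stats quoted in the thread

def extrapolate(nt_words):
    """Scale Jude's variants-per-word rate up to a whole-NT word count."""
    return JUDE_VARIANTS / JUDE_WORDS * nt_words

for nt_words in (138_000, 130_000):
    print(f"{nt_words:,} words -> about {extrapolate(nt_words):,.0f} variants")
```

On this word-based scaling, the 138,000-word total gives a little over 380,000, while a total nearer 130,000 gives a figure just under 360,000, so the estimate moves noticeably with the assumed length of the NT.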

  24. PJW:
    Aland and Aland calculated that Jude had more variation than any other NT book.

    That's interesting, as Maurice Robinson has said that the PA has more variation than any other 12-verse passage. Calculating the 12 verses in 2143 mss of John 7-8 versus 25 verses in 560 mss of Jude (all you mathematicians out there), which has more weighted variants per verse?

  25. DB:

    Variants per verse...

    I think we need to know the number of variants in the PA of John 7-8. We already know the number of variants for Jude.

  26. Paul, I think you are right to want to make a distinction between repeated variants and unique ones, though I do wonder how you discern the difference between a repeated variant and one that has simply arisen more than once, independently, by coincidence. Especially when dealing with errors arising from common scribal mechanical mistakes (e.g. homoioteleuton), the premise of such errors is that different, unrelated scribes could easily fall into the same mistake. Should such variants be counted separately, since their origin was unrelated, or collectively, since they are the same reading? In Munster, the CBGM project has dealt with this issue in a very interesting manner under the rubric of "connectivity"; I'd refer you to their online presentation of that on their website.

  27. Ryan,

    Thanks for that post. I agree that we cannot tell how many copying errors were made independently and how many are as a result of perpetuating earlier errors. That was part of my point. I simply wanted to point out that it is not as easy as saying we can only compare unique variants with unique words in the NT.

    I don't think anyone has the time or inclination to determine how many independently introduced errors there are.

    Where Ehrman should be challenged is his pre-supposition that if God inspired the text of scripture then He must necessarily have inspired the scribes who copied the text. No statistics will get past that assumption. Or should I say presumption?

  28. Where Ehrman should be challenged is his pre-supposition that if God inspired the text of scripture then He must necessarily have inspired the scribes who copied the text.

    It doesn't matter if there are 50,000 or 250,000 variants, does it? I don't really understand why this question comes up so often.

    The majority of the variants are orthographical. The second largest group are just (clearly identifiable) errors of various kinds.
    What is left is a comparatively small group of variants, perhaps a few thousand, that are worth discussing and contemplating, IMHO.

  29. WW wrote:
    Where Ehrman should be challenged is his pre-supposition that if God inspired the text of scripture then He must necessarily have inspired the scribes who copied the text.

    I think you are exactly right. I can't think of one scholar who would agree with Ehrman's presupposition, or even find it academically interesting.

    I think the reason that Ehrman is getting any attention at all is because of the impact he is having on the masses who have no TC training. The variants become somewhat important because they support his contention that the Bible is full of errors. We've all heard that accusation against the Bible for years, but now a scholar has "confirmed" this objection to the reliability of the Bible, which now rests comfortably in the mind of the masses and skeptics.

  30. Richard Wilson wrote: so there are only 753 variants left to find!

    I haven't yet seen a clear statement that all these calculations are estimates. Even if the assumptions mentioned by RW (fixed and independent probability for each event) are correct, the result of the calculation is only an expected value.
    Some more calculations – still based on the assumption of fixed and independent probabilities – could determine what the expected error is. (Unfortunately, I have no access to my statistics books just now.)

    The assumption “independent probabilities” is linked to the later discussion about (non)uniqueness of variants: once a variant has been introduced, subsequent copyists will, in most cases, stick with it. So the copies of a variant are not independent! Therefore, to make any sense out of the calculations, we have to count the unique variants.
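
The "expected error" mentioned above can be sketched for the toy model itself. If each of n variants is found independently with the same probability, the number found after m manuscripts is binomial, which gives both an expected value and a standard deviation. The values of n and p below are approximate solutions of the two equations in the earlier comment, obtained numerically for this illustration rather than taken from the thread, and `found_mean_and_sd` is a hypothetical name.

```python
import math

def found_mean_and_sd(n, p, m):
    """Mean and standard deviation of the number of variants found.

    Assumes each of the n variants independently has probability p of
    appearing in any one manuscript, so after m manuscripts a given variant
    has been seen with probability f = 1 - (1 - p)**m, and the count of
    variants seen is Binomial(n, f).
    """
    f = 1 - (1 - p) ** m
    return n * f, math.sqrt(n * f * (1 - f))

# Approximate model solution: n about 300,753 and p about 0.00105 (assumed here).
mean, sd = found_mean_and_sd(n=300_753, p=0.00105, m=5_700)
print(f"expected found: {mean:,.0f}, standard deviation: {sd:.1f}")
```

This spread of a few dozen variants describes only the randomness inside the model once n and p are fixed. It does not capture the much larger uncertainty from the model's assumptions or from the rough input counts, and it presupposes that unique variants are what is being counted.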