Saturday, June 13, 2020

Optimal Substemmata Now Available for Acts

15
Just today I noticed that the CBGM for Acts has added substemmata to its repertoire. This is really exciting. When I produced the portion of the global stemma for the Catholic Letters in my thesis, Klaus Wachtel had to personally provide me substemmata using software on a very old Mac. And the data were not publicly available. The ability to do this online means that, using my faster approach explained here (pp. 167–168) and here (p. 102), one could probably put together a global stemma for Acts without too much trouble. I should say that I can’t find any documentation on the CBGM site for this new feature. If anyone has more details about it, please let me know.

15 comments

  1. The developer documentation at https://cceh.github.io/ntg/ provides some help pages on the interface; the relevant page for substemma optimization is https://cceh.github.io/ntg/server_dev.html#module-server.set_cover. The source code is accessible at https://github.com/cceh/ntg, if you'd like to see what's going on under the hood.

    From what I can make of the code in https://github.com/cceh/ntg/blob/master/server/set_cover.py, the Genealogical Queries substemma optimization routine casts the task as a set cover problem, which, if I recall correctly, is essentially how you described it in your introduction to the CBGM. One function in the script, set_cover_json, attempts to approximate the minimum set cover (i.e., the smallest set of witnesses that explain all readings) using the following greedy heuristic:
    1. select the witness that agrees the most the with witness in question;
    2. remove any passages that are now explained from consideration;
    3. reiterate steps 1 and 2 with this process on the remaining set of unexplained passages until every passage is explained.

    A separate function, _optimal_substemma, takes the target witness and a selection of candidate stemmatic ancestors and performs a brute-force comparison of all combinations of the specified potential ancestors, starting with the smallest (1-witness) combinations and proceeding to the largest (up to 12 witnesses), then outputs the results of the comparison to the user.

    The script also appears to implement a custom scoring scheme for substemmata, although I'm not sure if it is ever used. A substemma's score is 10 times the number of passages in the main witness that it explains by agreement, plus 5 times the number of passages in the main witness that it explains by posteriority.

    ReplyDelete
    Replies
    1. Isn't the set cover problem in NP-hard?

      Delete
    2. It is. A fast, exact algorithm for set cover would be a theoretical breakthrough, but the algorithms used in the CCeH code are either approximate and fast or exact and slow.

      If m denotes the number of extant passages in the witness whose substemma we want to optimize and n denotes the number of potential ancestors to consider, the greedy heuristic implemented in the set_cover_json function is guaranteed to find a cover at most log m times the size of the minimum set cover. The procedure is efficient (i.e., polynomial-time in m and n), but it is only an approximation and not an exact algorithm.

      The _optimal_substemma function solves the set cover problem exactly, but not efficiently. If n potential witnesses are specified, then a brute-force comparison would check up to 2^n - 1 combinations, which scales exponentially rather than polynomially in n. I assume that's why the developer(s) hardcoded a limit of 12 on the value of n.

      Delete
    3. Thanks for the explanation. To clarify, the exact solution mentioned here is only exact up to n=12, after which point it also becomes an approximation, right?

      Delete
    4. As far as I can tell, the greedy heuristic that produces an approximate solution isn't used at all. (I could be wrong, as I was just searching within the set_cover.py script, so if anyone knows otherwise, please correct me!) And now that I look at the script again, it appears that the limit to covers involving 12 or fewer potential ancestors is only enforced in the greedy heuristic. Sorry for missing that!

      This means that the exact solution should always be exact, regardless of how many potential ancestors the user specifies, but it will also take a very long time if many ancestors are specified. For n specified potential ancestors, it considers all 2^n - 1 combinations of them. The procedure will therefore enumerate all set covers (i.e., valid substemmata) involving those potential ancestors, but it will take an exponential number of steps to do so.

      Delete
  2. It is a bit unfortunate that this feature became available for public use although it is still work in progress. Now that the genie has prematurely left the bottle I can only state that what you get by clicking on the stemma icon in a list of potential ancestors is not an optimal substemma but a list of candidates for such a substemma. It will never be possible to produce an optimal substemma or even a global stemma by just clicking a button. For a description of the relevant procedures see Gerd Mink’s Introductory Presentation, p. 475ff .

    ReplyDelete
    Replies
    1. Klaus, good catch. I should have noted that.

      Delete
    2. And sorry I jumped the gun in my excitement.

      Delete
  3. This is a major concern that I have with the CBGM as a whole. Supposedly, the CBGM using computer analysis can identify the order of textual flow between manuscripts, but somehow it can’t be used in other areas, even when it appears that it does. Which is it? Can the CBGM deal with contamination or not? Can it identify the flow between manuscripts or not? If it can identify a list of candidates for an optimal su stemma, why can’t it then be used to distinguish between them like it supposedly can with texts? Simple straightforward answers to questions like this would help me and maybe others see the advantages of the CBGM beyond sorting!

    Tim

    ReplyDelete
    Replies
    1. Sub stemma😎
      One additional thought, if PG, who co-authored a book On CBGM and has written papers, maybe even a dissertation on it, can be mislead that this was an optimal sub stemma, is it any wonder that many have described the CBGM as a black box?
      Tim

      Delete
    2. Timothy, while we wait a more qualified answer, I hope you don't mind if I answer your questions as I understand them, and then whoever responds can mediate between our differing understandings (this is TC after all!).

      My understanding is that that the human editors make all of the directional decisions. And I understand that the the human editors are the ones who optimize the substemma through a labor intensive and iterative process that uses the CBGM to keep track of decisions. It seems that PG simply expected that if the feature was live, then it must have meant that the editors had completed the optimization by hand.

      I think Klaus Wachtel's point--that optimal substemma can never be produced by "clicking a button"--is true of the CBGM in general. It is a tool that helps to establish the fact of relationships and potential relationships, but the human editors must work through all of the data point by point to slowly reveal things such as directional relationships, contamination, and optimal substemma.

      Does the CBGM deal with contamination? I thought the standard answer was, "It's be best tool we've got for that, but still imperfect."

      Can the CBGM identify flow between manuscripts? I think the answer is that there is no computer script that performs this task, but rather the human editor working iteratively. In the case of Acts, it seems that this process is not complete.

      "If it can identify a list of candidates for an optimal su stemma, why can’t it then be used to distinguish between them like it supposedly can with texts?" I thought the answer to this was that the human editors CAN use the CBGM to help optimize the substemma, but this information is either not publicly available yet, or the process has not been completed.

      I also assume that even if the editorial work necessary for optimizing the substemma, it must take a lot of work to represent the data in a user-friendly way on their website.

      I hope that someone will jump in and correct any misconceptions of mine.

      Delete
    3. David,
      First, thanks for your input, I appreciate you! Second, you seem to be conveying a lot more human interaction into the process than anyone else. I understand there is human decisions, but mainly in the initial decisions 😎
      Again, thanks!
      Tim

      Delete
    4. Timothy, David is right.

      To your points, yes, the CBGM reports on the textual flow between witnesses. That flow is determined by the editors' own textual decisions at places of variation. This flow, combined with the overall agreement between witnesses (pregenealogical coherence) is what determines potential ancestors.

      I don't know what you mean by it being used in "other areas."

      Yes, the CBGM can help deal with contamination though I argue it does not quite solve the problem. (Probably nothing can completely.)

      I don't know what you mean by the CBGM "distinguishing" between substemmata. But if you mean why can't it determine the optimal substemma, I argue that it can though not, as Klaus says, with the click of a button. I should note that Mink is uncomfortable with my approach to producing the global stemma because it (mostly) removes the editor from that part of the process. In my view, the time savings is worth it for the heuristic value.

      Delete
    5. PG,
      So if David is right than humans make all the directional decisions? Really, then what is the CBGM doing when it does pre-genealogical coherence?
      Tim

      Delete
    6. Pregenealogical coherence is directionless. That's what the pre- means. It's nothing more than bare agreement between witnesses. Genealogical coherence then builds on pregenealogical by adding the data from the editors own decisions.

      Delete