Notebook

<2022-02-18 Fri>

I wrote a short script to split the manifest for the Kennan Papers (MC076) into individual containers (1276 in number) and generate a graph of named entities for each. This is a resource-intensive process (each master TIFF must be downloaded, run through OCR, and processed with SpaCy); using my laptop connected to the Internet via a standard home FIOS service, I was able to process 501 in three days.
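
The per-page pipeline is roughly the following sketch (the image URL stands in for the IIIF service URL read from each canvas; the helper name is illustrative):

import io

import pytesseract
import requests
import spacy
from PIL import Image

nlp = spacy.load("en_core_web_lg")

def entities_for_page(image_url):
    # Fetch the master image for one canvas
    img = Image.open(io.BytesIO(requests.get(image_url).content))
    # OCR the page, then run SpaCy's NER over the raw text
    doc = nlp(pytesseract.image_to_string(img))
    return [(ent.text, ent.label_) for ent in doc.ents]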

I then loaded all 501 graphs into a local instance of GraphDB-Free running on my laptop, resulting in approximately 40 million statements.

Below are some exploratory SPARQL queries. Recall that named entities are represented as Symbolic Objects (Appellations, when all is said and done) which have been recorded as Inscriptions on IIIF Canvases.

How many pages are we talking about?

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix ecrm: <http://erlangen-crm.org/200717/> 
prefix entity: <https://figgy.princeton.edu/concerns/entities/> 
prefix etype: <https://figgy.princeton.edu/concerns/adam/> 
prefix inscription: <https://figgy.princeton.edu/concerns/inscriptions/> 
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
select distinct ?canvas { 
  ?something ecrm:P128i_is_carried_by ?canvas .
}

SPARQL returns 26,917 results: there are about 27,000 pages in this sample.
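
The same number can be had directly with COUNT, and these exploratory queries can be scripted against the local GraphDB endpoint; a sketch (the repository name "kennan" is a stand-in):

from SPARQLWrapper import SPARQLWrapper, JSON

# GraphDB exposes each repository as a SPARQL endpoint
endpoint = SPARQLWrapper("http://localhost:7200/repositories/kennan")
endpoint.setQuery("""
    prefix ecrm: <http://erlangen-crm.org/200717/>
    select (count(distinct ?canvas) as ?pages)
    where { ?something ecrm:P128i_is_carried_by ?canvas . }
""")
endpoint.setReturnFormat(JSON)
print(endpoint.query().convert()["results"]["bindings"][0]["pages"]["value"])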

How many named entities did SpaCy recognize?

prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix ecrm: <http://erlangen-crm.org/200717/> 
prefix entity: <https://figgy.princeton.edu/concerns/entities/> 
prefix etype: <https://figgy.princeton.edu/concerns/adam/> 
prefix inscription: <https://figgy.princeton.edu/concerns/inscriptions/> 
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> 

select ?inscription where { ?inscription a ecrm:E34_Inscription .} 

This query returns 738,696 results in less than 0.1 seconds. That is how many “hits” SpaCy recorded, but it isn’t a very useful number on its own: how many of these are distinct names?

prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix ecrm: <http://erlangen-crm.org/200717/> 
prefix entity: <https://figgy.princeton.edu/concerns/entities/> 
prefix etype: <https://figgy.princeton.edu/concerns/adam/> 
prefix inscription: <https://figgy.princeton.edu/concerns/inscriptions/> 
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
select distinct ?name { 
  ?inscription a ecrm:E34_Inscription .
  ?inscription ecrm:P106_is_composed_of ?entity .
  ?entity ecrm:P190_has_symbolic_content ?name .
}

This query returns 254,699 distinct strings that SpaCy identified as named entities.

How many names of people are there?

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix ecrm: <http://erlangen-crm.org/200717/> 
prefix entity: <https://figgy.princeton.edu/concerns/entities/> 
prefix etype: <https://figgy.princeton.edu/concerns/adam/> 
prefix inscription: <https://figgy.princeton.edu/concerns/inscriptions/> 
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
select distinct ?name { 
  ?inscription a ecrm:E34_Inscription ; ecrm:E55_Type etype:PERSON .
  ?inscription ecrm:P106_is_composed_of ?entity .
  ?entity ecrm:P190_has_symbolic_content ?name .
}

These are the strings SpaCy, running in naive mode over dirty OCR, identified as the names of persons. SPARQL returns an astounding 95,982 results: almost 96,000 distinct names of people. Not good.

But what do these strings actually look like? The first dozen or so are promising:

?name
Reith
Jimmy Carter
Reith Lectures
Reagan
Wilson
Bill Casey
Ronald Reagan
Kennedy
Robert Gates
Gorbachev
"Ronald Reagan\n"
McNamara
Buddenbrooks

On first glance, this isn’t a bad result; SpaCy picked out strings that are clearly names of one kind or another, and its classification of these names as names of persons is good, with a few exceptions: Reith Lectures is almost certainly the name of an event, not of a person. Buddenbrooks is harder to determine without context: it is probably the title of Thomas Mann’s novel, but it could be someone’s last name. More problematic, for a different reason, are the multiple appearances of Ronald Reagan in this list. We can be fairly sure Reagan and Ronald Reagan are the same person (though they might not be), but Ronald Reagan and “Ronald Reagan\n” are certainly the same. SpaCy’s tokenizer failed to strip the trailing newline from the second, and as a result its named-entity recognizer treated it as a separate name. This looks like a weakness in SpaCy, perhaps in our configuration (or lack of configuration), and we should flag it for further investigation.
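
Until that is fixed, a normalization pass before counting distinct names would at least fold these whitespace variants together; a minimal sketch:

import re

def normalize(name):
    # Collapse runs of whitespace (including newlines) and trim the ends,
    # so "Ronald Reagan\n" and "Ronald Reagan" become one string
    return re.sub(r"\s+", " ", name).strip()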

There is something else to note here. Ronald Reagan and “Ronald Reagan\n” are orthographical variants of the same name, while Reagan is not, though all three refer (almost certainly) to the 40th President of the United States. That is, all three refer to the same named entity even though there are two names. Our application is not interested in names (or appellations, as they are called in CIDOC-CRM) but in those named entities, so our tools must help investigators (the archivists, in this case) weed and winnow these names and assign them to identifiable entities.

Of course, for our purposes this repetition may not be a problem: our application favors recall over precision, so we’re more concerned with not missing names than we are with picking up variants. The sheer number of names, though, could create challenges. Here are all the instances of Kissinger in this partial data set (the numbers are line numbers in the output file):

   60:Henry Kissinger
   63:Kissinger
  144:HENRY KISSINGER
 3777:Henry A.Kissinger
 3779:Henry A. Kissinger
 3785:Henry A. Kissinger's
 6271:Henry Kissinger's
 9881:Robert H. Bork Henry Kissinger Paul W. McCracken Harry
10072:"Henry Kissinger\n"
10097:Henry Kissinger’s
10222:Nixon-Kissinger
11018:"Henry\n\nKissinger"
11138:"Kissinger pro-\n"
11143:"Henry\nKissinger's"
14237:KISSINGER
14270:"Kissinger |\n"
14353:Henry A. Kissinger Lectures
21995:"Henry\nKissinger"
22740:"Henry A.\nKissinger"
30219:H. Kissinger
30237:ALFRED M. GRUENTHER HENRY A. KISSINGER
30468:A. Kissinger
30501:"Kissinger\n"
34353:Henmry Kissinger
39728:Henry A. Kissinger Theodore M. Hesburgh
39963:"Henry A. Kissinger Richard L. Gelb\n"
40166:"Henry\nA. Kissinger"
42573:Kissinger's-
64109:Messrs Kissinger
64573:Henry kissinger
94259:Henry Kissinger eine
94593:"H. Kissinger\n"
94700:Henry A. Kissinger - Vertreter eines

Filtering SpaCy’s candidates into actual named entities (there are seven people intermingled in these strings) will likely require a mixture of human and machine labor.

<2022-02-19 Sat>

There are not 96,000 distinct names in this sample, even though it is a sample of 27,000 pages. This is one of the places where using uncorrected (“dirty”) OCR hampers our endeavors. Past that fortuitous group at the top of the list, the entries become very dirty indeed:

D. Signature
"Jerzy\n"
ieeiier rrr iri rir
"Wee\n"
Wdiinad Pugh
William Peters
E. List
James E. Doyle
"Fe es ee ee eee\n"
New Yor
ak ae
sald
Wolff
Li mucn
juirice Greenbaum
AL VK
MAURICE C. GREENBAUM
L. KAT
Madison Ave
Svetlana

There are a number of options to consider here.

  1. Pre-filter the pages. We know that some of the pages are too dirty to yield any recognizable text. (The purple mimeographs are an example, as, of course, are hand-written pages, drawings, poor-quality photocopies, and so on.) If we had a way to detect those, we could skip trying to find named entities in a sea of garbage.
  2. Train a better model.
  3. Use tools like OpenRefine to clean the data by hand.

A combination of techniques will probably be required.

<2022-02-22 Tue>

Some simple regular-expression-based filtering whittles the list down from 96,000 to 72,000. Clustering with OpenRefine will also be powerful.
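
The filter is just a handful of heuristics; a sketch (the rules below are illustrative, not the exact ones used):

import re

def plausible_name(s):
    # No letters at all: certainly OCR debris
    if not re.search(r"[A-Za-z]", s):
        return False
    # Digits and heavy punctuation rarely occur in real names
    if re.search(r"[0-9@#%_=<>|]", s):
        return False
    # Real names contain at least one capital letter
    if s == s.lower():
        return False
    return True

names = ["Henry Kissinger", "ak ae", "2329 Princeton", "Wolff"]
print([n for n in names if plausible_name(n)])  # ['Henry Kissinger', 'Wolff']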

Clustering is a technique commonly used in natural language processing. It entails finding groups of strings that are similar to one another, using various algorithms to calculate similarity. For example, George Kennan and George Kennen are very similar, because they differ by only one letter; with our data, we can say with great confidence that instances of the string George Kennen should be corrected to be George Kennan, thus reducing the number of name strings from two to one.

Other comparisons are not so straightforward. Suppose we are comparing F. L. Smith with F. T. Smith: are these two distinct people, or is one of these strings a mis-spelling of the other? Sometimes, if we know our data, we can make a good guess: John P. Kennedy is almost certainly John F. Kennedy. In other cases, we cannot tell without looking at the original context.
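
The intuition can be made concrete with a similarity ratio (Python’s stdlib difflib here, purely for illustration; OpenRefine uses its own metrics):

from difflib import SequenceMatcher

def similarity(a, b):
    # Proportion of matching characters, from 0.0 to 1.0
    return SequenceMatcher(None, a, b).ratio()

print(similarity("George Kennan", "George Kennen"))  # ~0.92: almost certainly one name
print(similarity("F. L. Smith", "F. T. Smith"))      # ~0.91: similar, yet possibly two people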

OpenRefine lets you apply half a dozen different clustering algorithms, each of which uses a different heuristic to calculate similarity. In practice, one applies each of them successively; for our experiment so far, I’ve just used the key-collision algorithms, which bring the list down to about 22,000 entries.
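
Key collision reduces every string to a key and clusters the strings that collide; a minimal version of OpenRefine’s documented fingerprint keying:

import string
from collections import defaultdict

def fingerprint(s):
    # Lowercase and strip punctuation, then sort the unique tokens,
    # so word order and stray punctuation don't matter
    s = s.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(s.split())))

clusters = defaultdict(list)
for name in ["Kissinger, Henry", "Henry Kissinger", "HENRY KISSINGER"]:
    clusters[fingerprint(name)].append(name)
# all three collide on the key "henry kissinger"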

<2022-03-02 Wed>

After another round with OpenRefine, we’re down to about 22,000 name candidates. I’ve started to keep a few snapshot lists in a Google Spreadsheet.

The results, so far, are disappointing. Clustering is a very effective technique, often used in text processing, but it does take time and human labor. At this stage, in a production context, one would probably assign a student (with an archivist to consult with) to perform more painstaking iterations over the data to winnow out partial names and mis-recognized strings and produce a working list of names.

Some observations:

  • there are many German words and phrases in this list. I suspect the two-capitalized-words-in-a-row heuristic is responsible for these; I will do some research to see if there are standard techniques to handle this problem, which must be a common one.
  • during these clustering/merging steps with OpenRefine, we’ve lost context: the string-by-string links back to canvases. There will be ways to recover those links, but they will require more overhead than we want to spend now.

<2022-03-03 Thu>

OpenRefine’s clustering algorithms are indeed powerful, but there is simply too much kruft in this data set: nonsensical strings and whatnot. Let’s see if we can improve SpaCy’s NER model to give us more accurate results to start with.

I’m using Prodigy, a companion to SpaCy, developed by the same company. Prodigy is an annotation tool that uses machine learning to train data models. It isn’t free, but I have a research license.

We’ll begin by gathering training data. I haven’t been keeping the OCR output but we can do that easily enough. In fact, we’ll use SpaCy to generate data sets in one of SpaCy’s preferred data formats. And we’ll extend our object models to include metadata about the collection, the container, and the page.

Here’s an example of some training data in jsonl format:

{"text": "Lhe As for the rest of the Soviet Union: the situation that prevails there is both dreadful and dangerous.", "meta": {"Date Created": ["1991 February 3"], "Extent": ["1 folder"], "Identifier": ["ark:/88435/d504rt661"], "Title": ["\"If the Kremlin Can't Rule,\" Op-Ed about the Baltics, The Washington Post "], "Creator": ["Kennan, George F. (George Frost), 1904-2005."], "Language": ["English"], "Publisher": ["Kennan, George F. (George Frost), 1904-2005."], "Portion Note": ["entire component, excluding the C Section of the Washington Post, Feb 3, 1991"], "Container": ["Box 294, Folder 4"], "Rendered Holding Location": ["Mudd Manuscript Library"], "Member Of Collections": ["George F. Kennan Papers MC076"]}}
{"text": "If it is true, as it appears to be, that the supply of consumers’ goods to the larger cities cannot be assured without the wholehearted collaboration of the party apparatus and the armed units in the great rural hinterland of the country, then one could understand why Gorbachev has felt himself compelled to reach back at this time for the support of those institutions.", "meta": {"Date Created": ["1991 February 3"], "Extent": ["1 folder"], "Identifier": ["ark:/88435/d504rt661"], "Title": ["\"If the Kremlin Can't Rule,\" Op-Ed about the Baltics, The Washington Post "], "Creator": ["Kennan, George F. (George Frost), 1904-2005."], "Language": ["English"], "Publisher": ["Kennan, George F. (George Frost), 1904-2005."], "Portion Note": ["entire component, excluding the C Section of the Washington Post, Feb 3, 1991"], "Container": ["Box 294, Folder 4"], "Rendered Holding Location": ["Mudd Manuscript Library"], "Member Of Collections": ["George F. Kennan Papers MC076"]}}

Let’s try training on some of this data.

prodigy ner.manual ner_cold_war_papers blank:en ~/Desktop/training2/ea9a223d-e23c-4d86-894a-4164902ffc3b.jsonl --label PERSON

Nice resource on training SpaCy models: https://www.youtube.com/channel/UC5vr5PwcXiKX_-6NTteAlXw

<2022-03-04 Fri> Review

What have we accomplished so far?

  • We have developed software that enables us to build, in an unattended fashion, datasets of candidate named entities from pages, containers, and entire collections, based on Figgy’s IIIF manifests.
  • We have developed a data model that enables us to represent this (meta)data as annotations to IIIF canvases, thereby integrating it with Figgy’s underlying data model and the IIIF software base (viewers, annotation servers) already developed by ITMS.
  • We have begun to analyze the data that results from naive applications of NLP software.

Unsurprisingly, the brute-force naive approach we’ve applied so far is unsatisfactory: it produces too much noise. How can we improve these results so that we can produce a useful set of infrequent names?

Be smarter about what you look at.
Our tools naively process every page in the collection. Some of that data may not be useful or relevant (drafts of published works; newspaper clippings; handwritten notes, which cannot yet be processed with OCR; other ephemera). In reality, an archivist would pre-select the components of the collection that are most amenable to this kind of analysis.

We also apply NER to the OCR output without checking on its quality: if we could throw out pages that were poorly recognized (again, hand-written materials; mimeographs; other bad originals), we might improve our overall NER: less garbage in, less garbage out.

Take smaller bites.
Archival collections are naturally sub-divided into thematically related components and sub-components. We are likely to get better results if we use those subdivisions to our advantage: to make hand-correction tractable, and to train models iteratively.

Next Steps

  • Filter out poor OCR. Use confidence thresholds produced by Tesseract. Unfortunately, that means we can’t use the OCR already produced by Figgy.
  • Be selective in what we process. Use the Collection’s Indexes to produce training data. Concentrate on the Correspondence series.
  • Some containers might be amenable to image cleanup to improve OCR.
  • Augment our training set with more patterns. Will & Alexis have provided some name lists to help train our model, but we can expand that training set using some common NLP techniques.

<2022-03-07 Mon>

Correspondence is a good set to work with. Correspondence usually has lots of names; the names will likely vary by correspondent (the social network formed by names mentioned in correspondence would probably be interesting); and there’s a lot of it in the Kennan Papers. We’ll start with subseries 1A, because much of it has been digitized.

Series 1, Subseries 1A: Permanent Correspondence, 1918-2004

There are 658 files in subseries 1A, including an index:

  • Index of permanent files, undated

    This index is an excellent data set for training; we’ll look at that in a minute. But first, let’s work on making the base data (the OCR output) better.

    OCR engines (like Tesseract) can produce plain-text output, but they can usually do much more. We’ve seen how Tesseract can serialize the text it recognizes as hOCR or ALTO, but it can also generate a detailed table of data as output, including confidence scores for each word and each block of text it discovers. A confidence score is a measure of how confident the engine is that it has recognized a word (or block, or even character) correctly. We know now, from experience, that if the OCR is poor, the NER will be poor, so if we can filter out text that has been badly OCR’d, our NER accuracy should improve.

    Deciding where to set the threshold may require some trial and error. Based on some research, it looks like setting the cutoff somewhere between 97.5 and 98.5 is common in real-world applications. Let’s try both ends and see what happens.

    It turns out those numbers don’t work at the block level; too many blocks get rejected. Something closer to 55 seems to be in the right range, but this may not be the best way; perhaps it will be better to filter at the word level.
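
    A word-level pass is easy to sketch with pytesseract, whose image_to_data call reports a per-word confidence (the threshold below is a placeholder to tune):

    import pytesseract
    from pytesseract import Output

    def confident_words(image_path, threshold=95):
        # image_to_data returns parallel lists; 'conf' is -1 for
        # non-word rows (pages, blocks, lines), so those drop out
        data = pytesseract.image_to_data(image_path, output_type=Output.DICT)
        return [w for w, c in zip(data["text"], data["conf"])
                if float(c) >= threshold and w.strip()]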

<2022-03-08 Tue>

After some trial and error, we have a version of our software that filters out bad OCR. (For now, I’m using a word-level threshold of 95.) And it is looking much better. SpaCy tagged 1,024 distinct names after running the program over half a dozen folders of correspondence; here’s a sample:

Kennan
George Kennan
Schuster
Nelson
Gaylord Nelson
Simon
Directo Nelson Suite
George Kennan Papers
Stalin
Urban
GREENBAUM
Grantor
Alan
Singh
Greenbaum
Svetlana
Vincent
Greenbaum Kennan
Peters
Herrman
"2329 Princeton"
Zajac Consul
Wes
Trustees Assignment

I am sure doing a bit of training on the SpaCy model will make this even better, but this is something we can work with.

<2022-03-09 Wed>

Lots of fussing with image-processing code today. OpenCV was causing some logic errors and seemed like overkill; Pillow’s TIFF plugin seems to be buggy. It turns out pytesseract can read images from a file on its own, so that’s what we’re doing. Running a long job now to process the entire correspondence subseries.

<2022-03-10 Thu>

While waiting for the NER job to complete, it’s time to think about the next stage: what to do with this metadata? Here’s what we have so far:

inscription:47m5R9JfBPi8YkpX975FiF a ecrm:E34_Inscription ;
    ecrm:E55_Type etype:PERSON ;
    ecrm:P106_is_composed_of entity:4UrFk3unCBXYpgza3Fiy7t ;
    ecrm:P128i_is_carried_by <https://figgy.princeton.edu/concern/scanned_resources/49067c79-6915-4492-bd75-3554f0010ee3/manifest/canvas/0089725a-195f-42a4-8bfb-c44cf7f182d1> .

entity:4UrFk3unCBXYpgza3Fiy7t a ecrm:E90_Symbolic_Object ;
    rdfs:label "DEAN BROWN" ;
    ecrm:P190_has_symbolic_content "DEAN BROWN" .

We need to work on the ontology a bit.

E34 Inscription

rdfs:comment “Scope note:

This class comprises recognisable, short texts attached to instances of E24 Physical Human-Made Thing.

The transcription of the text can be documented in a note by P3 has note: E62 String. The alphabet used can be documented by P2 has type: E55 Type. This class does not intend to describe the idiosyncratic characteristics of an individual physical embodiment of an inscription, but the underlying prototype. The physical embodiment is modelled in the CIDOC CRM as instances of E24 Physical Human-Made Thing.

The relationship of a physical copy of a book to the text it contains is modelled using E18 Physical Thing P128 carries E33 Linguistic Object.

Examples:

  • "keep off the grass" on a sign stuck in the lawn of the quad of Balliol College
  • The text published in Corpus Inscriptionum Latinarum V 895
  • Kilroy was here

Since an inscription is a text, it cannot have a type PERSON; I invented the etype PERSON to capture SpaCy’s classification.

Minor point: The use of the term entity to refer to a Symbolic Object is confusing. I was trying to avoid committing to any classification of the string, but that might be overly cautious. At the very least, the namespace should be called symbols, but we can probably use SpaCy’s classification to enable us to say that the symbol is an E41 Appellation:

Instances of E41 Appellation may be used to identify any instance of E1 CRM Entity and sometimes are characteristic for instances of more specific subclasses E1 CRM Entity, such as for instances of E52 Time-Span (for instance “dates”), E39 Actor, E53 Place or E28 Conceptual Object. Postal addresses and E-mail addresses are characteristic examples of identifiers used by services transporting things between clients.

The Appellation has symbolic content, which is the string.

So an Inscription can be located on a canvas, and it may be composed of an Appellation. And, ultimately, the Appellation may P1 identify an Actor, who may be an E21 Person.

The Appellation may be incorrectly recognized by the OCR, in which case it may be corrected; or the Appellation may be a misspelling, in which case it should be preserved.

@prefix ecrm: <http://erlangen-crm.org/200717/> .
@prefix appellation: <https://figgy.princeton.edu/concerns/appellations/> .
@prefix inscription: <https://figgy.princeton.edu/concerns/inscriptions/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .


inscription:47m5R9JfBPi8YkpX975FiF a ecrm:E34_Inscription ;
                                   ecrm:P106_is_composed_of appellation:4UrFk3unCBXYpgza3Fiy7t ;
                                   ecrm:P128i_is_carried_by <https://figgy.princeton.edu/concern/scanned_resources/49067c79-6915-4492-bd75-3554f0010ee3/manifest/canvas/0089725a-195f-42a4-8bfb-c44cf7f182d1> .

appellation:4UrFk3unCBXYpgza3Fiy7t a ecrm:E41_Appellation ;
                              rdfs:label "George Kennan" ;
                              ecrm:P190_has_symbolic_content "George Kennan" ;
                              ecrm:P1i_identifies actor:xyz .

actor:xyz a ecrm:E21_Person ;
          skos:prefLabel "Kennan, George Frost, 1904-2005" ;
          owl:sameAs <http://viaf.org/viaf/66477608> .

<2022-03-11 Fri>

This is nice!

from rdflib import Graph

manifest = 'https://figgy.princeton.edu/concern/scanned_resources/2a701cb1-33d4-4112-bf5d-65123e8aa8e7/manifest'
g = Graph()
g.parse(manifest, format='json-ld')
g.serialize(destination="2a701cb1-33d4-4112-bf5d-65123e8aa8e7.ttl")

So we can actually include the manifests in our graph.

We can load them like this:

curl -X POST -H 'Content-Type: application/ld+json' \
     -d '{"name": "https://figgy.princeton.edu/concern/scanned_resources/670a732c-f578-4bfe-99f2-f59e5d5220f5/manifest","context": "","replaceGraphs": [],"baseURI": null,"forceSerial": false,"type": "url","format": "application/ld+json","data": "https://figgy.princeton.edu/concern/scanned_resources/670a732c-f578-4bfe-99f2-f59e5d5220f5/manifest","parserSettings": {"preserveBNodeIds": false,"failOnUnknownDataTypes": false,"verifyDataTypeValues": false,"normalizeDataTypeValues": false,"failOnUnknownLanguageTags": false,"verifyLanguageTags": true,"normalizeLanguageTags": false,"stopOnError": true},"requestIdHeadersToForward": null}'\
    http://localhost:7200/rest/data/import/upload/manifests/url

I’ve written a Python module to do this.
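
The module is essentially a thin wrapper around that REST call; a sketch (repository name taken from the curl example):

import requests

def load_manifest(manifest_url, repo="manifests"):
    # Ask GraphDB to fetch the manifest itself and import it as JSON-LD,
    # mirroring the curl invocation above
    payload = {
        "name": manifest_url,
        "type": "url",
        "format": "application/ld+json",
        "data": manifest_url,
    }
    r = requests.post(
        f"http://localhost:7200/rest/data/import/upload/{repo}/url",
        json=payload,
    )
    r.raise_for_status()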

<2022-03-15 Tue>

By incorporating manifests into our graph, we can use the web annotation model to connect our names and named entities to canvases in a IIIF-compliant manner.

https://iiif.io/api/extension/text-granularity/context.json

<<annotation_example>>

@prefix iiif_prezi: <http://iiif.io/api/presentation/3#> .
@prefix oa: <http://www.w3.org/ns/oa#> .
@prefix gran: <http://iiif.io/api/extension/text-granularity#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix sc: <http://iiif.io/api/presentation/2#> .

<some_annotation> a oa:Annotation ;
                  gran:textGranularity gran:word ;
                  oa:hasTarget [ a oa:SpecificResource ;
                                 oa:hasSource <canvas> ;
                                 oa:hasSelector [ a oa:FragmentSelector ;
                                                  dcterms:conformsTo <spec> ;
                                                  rdf:value "x,y,w,h" ] ] ;
                  oa:hasBody <an_inscription> ;
                  oa:motivatedBy sc:supplementing .

These annotations can be stored in our graph; they could also be stored in a separate annotation server.

Mapping appellations to named entities

This remains a major challenge. Even after filtering the OCR, naive SpaCy still returns lots of false positives. We need to step back and think about this.

There are several sorts of error:

  • errors of omission. SpaCy didn’t recognize a string as a name, when it should have.
  • errors of inclusion. SpaCy recognized a string as a name, when it shouldn’t have.
  • OCR errors. SpaCy recognized a string as a name, but the name contains OCR errors: Kennon for Kennan, for example.
  • partial recognition. SpaCy recognizes, say, a set of initials as a name. Errors of this kind absolutely require a person to see the context in order to resolve them.

<2022-03-22 Tue>

I’ve processed the entire correspondence subseries. Here are some results.

prefix ecrm: <http://erlangen-crm.org/200717/> 
prefix etype: <https://figgy.princeton.edu/concerns/adam/> 

select (count(*) as ?count) ?name 
where { 
  ?s a ecrm:E41_Appellation .
  ?s ecrm:E55_Type etype:PERSON .
  ?s ecrm:P190_has_symbolic_content ?name .
}
group by ?name
order by ?count

This gives us a frequency list of about 51,000 distinct un-normalized names. It’s time to look at the shape of this data. The frequency ranges from 13,031 (“Kennan”) to 1, and the data have a classic “long tail” distribution: 80% of the names occur only once or twice. Most of those are errors: bad OCR; strings mis-characterized as names. Even if some of these are real names, do they fulfill our criteria of interestingness?

For our purposes, we are not interested in the very largest frequencies either, because these names are well known and already accounted for in the finding aid:

13031	Kennan
4237	George Kennan
2813	Stalin
2546	George
1260	George Kennan Papers
1053	Princeton
1040	Reagan
1025	Acheson
917	Hitler
907	Annelise
700	Marshall
658	Lenin
630	Kennedy
604	Truman
597	Khrushchev
531	Roosevelt
520	Churchill
499	Johnson
480	Janet Smith

(Janet Smith was Kennan’s secretary for a number of years, so her name appears frequently in his correspondence; she has an entire folder in the correspondence subseries.)

We will have to experiment with filtering both ends. On the high end, at what point do we start to encounter names that are not already accounted for in the finding aid? At the low end, how many times must a name appear in the collection to be considered notable? That judgment, of course, is highly subjective and context-specific, but it seems safe to chop off that very, very long tail and drop any name (or possible name) that occurs fewer than three times. That brings the number of names down to 5,500: still formidable, but manageable.
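
Applying both cutoffs to the exported frequency list is straightforward; a sketch (the file name and the high-end cutoff are placeholders):

import csv

# names.tsv: <count><TAB><name>, as exported from the query above (hypothetical file)
with open("names.tsv") as f:
    rows = [(int(count), name) for count, name in csv.reader(f, delimiter="\t")]

# Drop the long tail (fewer than 3 occurrences) and the famous names at the
# top; 500 is a guess to be checked against the finding aid
review = [(c, n) for c, n in rows if 3 <= c <= 500]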

<2022-03-31 Thu>

Still ahead: being able to see a named entity in context. Create an inverted index from appellations to inscriptions.

<2022-04-05 Tue>

There are several directions to pursue now.

  • names in context: mapping name strings to regions on canvases so archivists can see the context in which the name appears.
  • names -> named entities.
  • duplicate string merging. In our model, an Inscription (marks on a canvas) composes an Appellation; an Appellation is a conceptual object that has symbolic content. Each Inscription is unique, but two Inscriptions may comprise the same symbolic content: two occurrences of the string “George Kennan”, for example. What is the relationship between Appellations with the same symbolic content?

Here is an excerpt from the scope note for Appellation:

This class comprises signs, either meaningful or not, or arrangements of signs following a specific syntax, that are used or can be used to refer to and identify a specific instance of some class or category within a certain context.

Instances of E41 Appellation do not identify things by their meaning, even if they happen to have one, but instead by convention, tradition, or agreement. Instances of E41 Appellation are cultural constructs; as such, they have a context, a history, and a use in time and space by some group of users. A given instance of E41 Appellation can have alternative forms, i.e., other instances of E41 Appellation that are always regarded as equivalent independent from the thing it denotes.

Different languages may use different appellations for the same thing, such as the names of major cities. Some appellations may be formulated using a valid noun phrase of a particular language. In these cases, the respective instances of E41 Appellation should also be declared as instances of E33 Linguistic Object. Then the language using the appellation can be declared with the property P72 has language: E56 Language.

Instances of E41 Appellation may be used to identify any instance of E1 CRM Entity and sometimes are characteristic for instances of more specific subclasses E1 CRM Entity, such as for instances of E52 Time-Span (for instance “dates”), E39 Actor, E53 Place or E28 Conceptual Object. Postal addresses and E-mail addresses are characteristic examples of identifiers used by services transporting things between clients.

E41 Appellation should not be confused with the act of naming something. Cf. E15 Identifier Assignment

Examples:

  • "Martin"
  • “Aquae Sulis Minerva”
  • "the Merchant of Venice" (E35)
  • "Spigelia marilandica (L.) L." [not the species, just the name] (Hershberger, Jenkins and Robacker, 2015)
  • "information science" [not the science itself, but the name through which we refer to it in an English-speaking context]
  • “安” [Chinese "an", meaning "peace"]
  • “6°5’29”N 45°12’13”W” (example of spatial coordinate)
  • “Black queen’s bishop 4” [chess coordinate] (example of spatial coordinate)
  • “19-MAR-1922” (example of date)
  • “+41 22 418 5571” (example of contact point)
  • "[email protected]" (example of contact point)
  • “CH-1211, Genève” (example of place appellation)
  • “1-29-3 Otsuka, Bunkyo-ku, Tokyo, 121, Japan” (example of address)
  • “the poop deck of H.M.S Victory” (example of section definition)
  • “the Venus de Milo’s left buttock” (example of section definition)

I may have modeled this incorrectly. Here is the scope note for Inscription:

This class comprises recognisable, short texts attached to instances of E24 Physical Human-Made Thing.

The transcription of the text can be documented in a note by P3 has note: E62 String. The alphabet used can be documented by P2 has type: E55 Type. This class does not intend to describe the idiosyncratic characteristics of an individual physical embodiment of an inscription, but the underlying prototype. The physical embodiment is modelled in the CIDOC CRM as instances of E24 Physical Human-Made Thing.

The relationship of a physical copy of a book to the text it contains is modelled using E18 Physical Thing. P128 carries (is carried by): E33 Linguistic Object.

Examples:

  • "keep off the grass" on a sign stuck in the lawn of the quad of Balliol College
  • The text published in Corpus Inscriptionum Latinarum V 895
  • Kilroy was here

E34_Inscription is a subclass both of E33_Linguistic_Object and E36_Visual_Item; it therefore has the properties of both of those classes.

This, I think, is the proper modeling:

@prefix appellation: <https://figgy.princeton.edu/concerns/appellations/> .
@prefix ecrm: <http://erlangen-crm.org/200717/> .
@prefix etype: <https://figgy.princeton.edu/concerns/adam/> .
@prefix inscription: <https://figgy.princeton.edu/concerns/inscriptions/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

inscription:1 a ecrm:E34_Inscription ;
              ecrm:P128i_is_carried_by <canvasURI> ;
              ecrm:P190_has_symbolic_content "George Kemon" .

appellation:1 a ecrm:E41_Appellation ;
              ecrm:P48_has_preferred_identifier "George Kennan" .

inscription:1 ecrm:P106_is_composed_of appellation:1 .

The inscription is the ink-marks on the page. The bit-mapped image of it produced through photography is an interpretation of that inscription. (That bit-map snippet might be the content of a note, but we’re not concerned with it out of context.) The transcription is produced by the OCR engine; SpaCy looks at the transcription and associates it with an appellation (i.e., it says the inscription is composed of the appellation).

Here is where I erred. An appellation has no symbolic content; it is an abstraction. It has a label for convenience, and it may have one or more representations. For example, Geo. Kennan and George Kennan are the same appellation; George F. Kennan is a different appellation. This is, I think, where the real power of RDF comes in: one is not confined to using a single ontology to describe classes or instances. We might use one or more statements with rdfs:label or skos:altLabel (see https://www.w3.org/2012/09/odrl/semantic/draft/doco/skos_altLabel.html)

appellation:1 a ecrm:E41_Appellation ;
              skos:prefLabel "George Kennan" ;
              skos:altLabel "Geo. Kennan" .

appellation:2 a ecrm:E41_Appellation ;
              skos:prefLabel "George F. Kennan" .

appellation:3 a ecrm:E41_Appellation ;
              skos:prefLabel "Kennan" .

appellation:4 a ecrm:E41_Appellation ;
              skos:prefLabel "George Frost Kennan" .

inscription:1 a ecrm:E34_Inscription ;
              ecrm:P128i_is_carried_by <canvasURI> ;
              ecrm:P190_has_symbolic_content "George Kemon" .

inscription:1 ecrm:P106_is_composed_of appellation:1 .

Here we say that there are four different names, and that inscription:1 is an inscription of appellation:1, even though it is misspelled (or mis-OCR’d).

While correct, I think, this is also verbose and cumbersome, and overkill for our purposes. We are interested in appellations – names – we don’t care how they are spelled. All those appellations could be rolled into one:

appellation:1 a ecrm:E41_Appellation ;
              skos:prefLabel "Kennan, George F. 1904-2005" ;
              skos:altLabel "George F. Kennan" ;
              skos:altLabel "George Kennan" ;
              skos:altLabel "Geo. Kennan" ;
              skos:altLabel "Kennan" ;
              skos:altLabel "George Frost Kennan" .

inscription:1 a ecrm:E34_Inscription ;
              ecrm:P128i_is_carried_by <canvasURI> ;
              ecrm:P190_has_symbolic_content "George Kemon" .

inscription:1 ecrm:P106_is_composed_of appellation:1 .

NOW we are getting closer to VIAF and other name authorities. Because what we really want to say is:

appellation:1 a ecrm:E41_Appellation;
              rdfs:label "George Kennan" ;
              owl:sameAs <http://viaf.org/viaf/66477608> .

wd:Q156058 a ecrm:E21_Person ;
           ecrm:P1_is_identified_by appellation:1 .

We are using SpaCy’s Named Entity Recognizer to classify string-patterns. SpaCy tags tokens (or token sequences) as, for example, PERSON. But SpaCy’s tags simply label the inscriptions as what it calls named entities. We want to model SpaCy’s labels explicitly, using SKOS, and then apply those classifications to the inscriptions.

@prefix spacy: <http://someplace.com> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

spacy:entities a skos:Collection ;
               skos:prefLabel "SpaCy entity labels" ;
               skos:member spacy:person_entity ,
                           spacy:org_entity .
# etc.

inscription:1 a ecrm:E34_Inscription ;
              ecrm:P128i_is_carried_by <canvasURI> ;
              ecrm:P190_has_symbolic_content "George Kemon" .

inscription:1 a spacy:person_entity .

NB: Don’t confuse SpaCy’s use of the term named entity with ours. SpaCy classifies strings; we map those strings to names, and the names to entities.

<2022-04-06 Wed> Names in Context

Let’s think about names in context.

As we’ve seen in an earlier annotation_example, we can generate web annotations from our semantic data; these annotations are RDF statements like any other.

Annotation Servers

After some research, I’ve discovered that we do not need an external annotation server package, at least not for our prototype. Glen Robson’s Simple Annotation Server seems to be the only viable tool out there, and it is not terribly viable: difficult to install; buggy; not compatible with Mirador 3; etc. More importantly, it is overkill for our prototype; we don’t need most of its functionality.

What we need is a far simpler service:

  • Given some symbolic content, show me all the inscriptions in which it appears.
  • Given an inscription, show me the page on which it appears.

Unfortunately, the annotation capabilities of IIIF have not been well developed. While our model is designed to make it easy to link annotations to canvases, there is no simple solution for doing what we want to do in a IIIF viewer. For our first proof-of-concept, I’m going to implement something simpler.
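
Both lookups are a single SPARQL pattern each, so the service can begin as two small functions against the local endpoint (endpoint name is a stand-in; the string interpolation does no escaping, so this is a sketch only):

from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "http://localhost:7200/repositories/kennan"  # hypothetical

def _select(query):
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery("prefix ecrm: <http://erlangen-crm.org/200717/>\n" + query)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()["results"]["bindings"]

def inscriptions_for(content):
    # All inscriptions whose symbolic content matches the given string
    return _select(
        'select ?i where { ?i ecrm:P190_has_symbolic_content "%s" . }' % content)

def canvas_for(inscription_uri):
    # The canvas (page) that carries a given inscription
    return _select(
        "select ?canvas where { <%s> ecrm:P128i_is_carried_by ?canvas . }"
        % inscription_uri)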

<2022-04-14 Thu>

Here’s a nice SPARQL query that uses our new model to select a range of personal names:

prefix ecrm: <http://erlangen-crm.org/200717/> 

select  ?name (count(?name) as ?nameCount)
where {
  ?i a ecrm:E34_Inscription .
  ?i ecrm:E55_Type "PERSON" .
  ?i ecrm:P190_has_symbolic_content ?name .
}
group by ?name
having (?nameCount < 100 && ?nameCount > 3)
order by DESC(?nameCount)

Now let’s see if we can get to the page images on which these names appear.

prefix oa: <http://www.w3.org/ns/oa#>
prefix sc: <http://iiif.io/api/presentation/2#>
prefix dctypes: <http://purl.org/dc/dcmitype/>
select *
where
{
  ?image a dctypes:Image
}

In IIIF, images are annotations on a canvas:

prefix oa: <http://www.w3.org/ns/oa#>
prefix sc: <http://iiif.io/api/presentation/2#>
select distinct ?annotation where { 
        ?annotation a oa:Annotation; 
             oa:motivatedBy sc:painting
}

prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix sc: <http://iiif.io/api/presentation/2#>
select distinct ?a where { 
  ?s a sc:Canvas .
  ?s sc:hasImageAnnotations ?lst .
  ?lst rdf:first ?a
}

prefix oa: <http://www.w3.org/ns/oa#>
prefix sc: <http://iiif.io/api/presentation/2#>
prefix dctypes: <http://purl.org/dc/dcmitype/>
select *
where
{
  ?annotation a oa:Annotation .
  ?annotation oa:hasBody ?image .
  ?annotation oa:hasTarget ?canvas .
}

<2022-04-19 Tue>

Similarly, an E34_Inscription, as an interpretation of a region on a canvas, can be linked to the canvas via an annotation:

prefix ecrm: <http://erlangen-crm.org/200717/>
prefix oa: <http://www.w3.org/ns/oa#>
prefix sc: <http://iiif.io/api/presentation/2#>
prefix dctypes: <http://purl.org/dc/dcmitype/>
select *
where
{
  ?annotation a oa:Annotation ;
  oa:hasBody ?image ;
  oa:hasTarget ?canvas .
  ?inscription ecrm:P128i_is_carried_by ?canvas ;
  ecrm:P190_has_symbolic_content ?content .
}

Or, more succinctly:

prefix ecrm: <http://erlangen-crm.org/200717/>
prefix oa: <http://www.w3.org/ns/oa#>
prefix sc: <http://iiif.io/api/presentation/2#>
select ?name ?image ?inscription
where
{
  ?inscription ecrm:P128i_is_carried_by ?canvas ;
  ecrm:E55_Type "PERSON" ;
  ecrm:P190_has_symbolic_content ?name .

  ?annotation oa:hasTarget ?canvas ;
  oa:hasBody ?image .
}

<2022-04-28 Thu>

I’m writing some scripts to generate data sets that Will can use to browse names in context. It will be helpful to be able to include some metadata in this data: the container, the page, etc. All this information is in the Manifest, but it is a little tedious to extract using SPARQL.

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix ecrm: <http://erlangen-crm.org/200717/>
prefix oa: <http://www.w3.org/ns/oa#>
prefix sc: <http://iiif.io/api/presentation/2#>
select ?m
where
{
  ?seq rdf:first ?canvas .
  ?m sc:hasSequences ?seq .
  ?m a sc:Manifest .
  ?m sc:metadataLabels ?labels .
}

There is no easy way to crawl the IIIF graph to find the Manifest associated with a Canvas, because the canvases are stored in lists, which are represented as blank nodes organized into linked lists.

A more idiomatic representation would not use lists.

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix ecrm: <http://erlangen-crm.org/200717/>
prefix oa: <http://www.w3.org/ns/oa#>
prefix sc: <http://iiif.io/api/presentation/2#>
select  ?manifest ?canvas
where
{
  ?seq a sc:Sequence .
  ?seq sc:hasCanvases ?canvasList .
  ?canvasList rdf:first ?canvas .
  ?seqList rdf:first ?seq .
  ?manifest sc:hasSequences ?seqList .
}

The standard idiom for traversing an RDF list is a property path:

{ :list rdf:rest*/rdf:first ?element }

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix ecrm: <http://erlangen-crm.org/200717/>
prefix oa: <http://www.w3.org/ns/oa#>
prefix sc: <http://iiif.io/api/presentation/2#>
select *
where
{
  ?m a sc:Manifest .
  ?m sc:hasSequences ?seq . 
  ?seq rdf:first ?first .
  ?first sc:hasCanvases ?list .
  ?list rdf:rest ?canvas .
}

This doesn’t produce the full list:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix ecrm: <http://erlangen-crm.org/200717/>
prefix oa: <http://www.w3.org/ns/oa#>
prefix sc: <http://iiif.io/api/presentation/2#>
select *
where
{
  ?canvas a sc:Canvas .
  ?list rdf:first ?canvas .
  ?seq sc:hasCanvases ?list .
  ?seqList rdf:first ?seq .
  ?manifest sc:hasSequences ?seqList .
}

This seems to do the trick:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix ecrm: <http://erlangen-crm.org/200717/>
prefix oa: <http://www.w3.org/ns/oa#>
prefix sc: <http://iiif.io/api/presentation/2#>
select distinct ?manifest
where
{
  ?list rdf:first <https://figgy.princeton.edu/concern/scanned_resources/2a701cb1-33d4-4112-bf5d-65123e8aa8e7/manifest/canvas/08429a7b-9b7b-4ed7-ae28-09883386dbc2> .
  ?seq sc:hasCanvases ?list .
  ?seqList rdf:first ?seq .
  ?manifest sc:hasSequences ?seqList .
}

How about metadata?

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix ecrm: <http://erlangen-crm.org/200717/>
prefix oa: <http://www.w3.org/ns/oa#>
prefix sc: <http://iiif.io/api/presentation/2#>
select *
where
{
  ?m a sc:Manifest .
  ?element rdfs:label "Title" .
  ?element rdf:value ?v .
}

SPARQL doesn’t support recursion, so you have to use a property path to traverse the linked list:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix sc: <http://iiif.io/api/presentation/2#>
select distinct ?v
where
{
  ?m a sc:Manifest .
  ?m sc:metadataLabels ?list .
  ?element rdfs:label "Title" .
  ?list rdf:rest*/rdf:first ?element .
  ?element rdf:value ?v .
}

So to get the title of a manifest, you’d do this:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix sc: <http://iiif.io/api/presentation/2#>
select distinct ?title
where
{
  ?list rdf:first <https://figgy.princeton.edu/concern/scanned_resources/2a701cb1-33d4-4112-bf5d-65123e8aa8e7/manifest/canvas/08429a7b-9b7b-4ed7-ae28-09883386dbc2> .
  ?seq sc:hasCanvases ?list .
  ?seqList rdf:first ?seq .
  ?manifest sc:hasSequences ?seqList .
  ?manifest sc:metadataLabels ?metadata .
  ?element rdfs:label "Title" .
  ?metadata rdf:rest*/rdf:first ?element .
  ?element rdf:value ?title .
}

<2022-04-29 Fri>

Working with more sophisticated SPARQL now.

PREFIX oa: <http://www.w3.org/ns/oa#>
prefix ecrm: <http://erlangen-crm.org/200717/>
select ?name ?canvas ?image
where {
  ?inscription ecrm:P190_has_symbolic_content ?name ;
  ecrm:P128i_is_carried_by ?canvas .
  
  optional { ?annotation oa:hasTarget ?canvas ;
    oa:hasBody ?image . }

  {
    select  ?name (count(?name) as ?nameCount)
    where {
      ?i a ecrm:E34_Inscription ;
      ecrm:E55_Type "PERSON" ;
      ecrm:P190_has_symbolic_content ?name .
    }
    group by ?name
    having (?nameCount < 50 && ?nameCount > 48)
    order by DESC(?nameCount)
  }
}
order by ?name

<2022-05-01 Sun>

I’m still learning SPARQL. This query looks straightforward enough, but it generated 37,470,551 results until I tweaked a few things.

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix ecrm: <http://erlangen-crm.org/200717/>
prefix oa: <http://www.w3.org/ns/oa#>
prefix sc: <http://iiif.io/api/presentation/2#>
SELECT distinct ?canvas ?page ?title ?container ?image ?manifest
WHERE
{
  ?canvas rdfs:label ?page .
  ?annotation oa:hasTarget ?canvas ;
  oa:hasBody ?image .
  {
    SELECT distinct ?manifest ?canvas ?title ?container
    WHERE
    {
      ?manifest sc:hasSequences/rdf:first/sc:hasCanvases/rdf:rest*/rdf:first ?canvas .
      {
        SELECT distinct ?manifest ?title ?container
        where
        {
          ?manifest a sc:Manifest .
          ?manifest sc:metadataLabels ?list .
          ?title_element rdfs:label "Title" .
          ?list rdf:rest*/rdf:first ?title_element .
          ?title_element rdf:value ?title.
          ?container_element rdfs:label "Container" .
          ?list rdf:rest*/rdf:first ?container_element .
          ?container_element rdf:value ?container.
        }
      }
    }
  }
}

<2022-05-02 Mon>

Another round with our data model.

We want to quickly move away from inscriptions and focus on names (appellations). How to automate relating inscriptions to appellations?

One method is a SPARQL CONSTRUCT statement: for each inscription, create an appellation whose preferred label is the inscription’s content string, and link the two; appellations that share a label can then be merged.

In a second pass, an archivist works with the appellations: one can subsume another (a directed merge).

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
prefix ecrm: <http://erlangen-crm.org/200717/>
Construct {
  _:v a ecrm:E41_Appellation ;
      skos:prefLabel ?name .
  ?i ecrm:P106_is_composed_of _:v .
}
WHERE {
  ?i a ecrm:E34_Inscription ;
     ecrm:P190_has_symbolic_content ?name .
}

<2022-05-04 Wed>

A viable strategy for reviewing names is emerging.

  • Use SPARQL to extract inscriptions (the strings labeled PERSON by the NER process) within a particular frequency range (e.g., those occurring less than 50 times but more than 20 times).
  • Join those with a table of canvas metadata to create records containing the inscription, an image of the page on which it occurs, and information about the container in which it is found.
  • Load these records into OpenRefine. OpenRefine finally has good documentation; see the OpenRefine User Manual.

Now comes the interesting part.

  • Use OpenRefine to duplicate the inscription-content column; call the new column “appellation”. The appellation field is where the action happens; the inscription field is left unchanged, so there is always a link back to the text as it appears on the page.
  • Use OpenRefine’s tools to wrangle the appellations. Create a text facet on the appellation column; remove facet groups that are clearly not names; perform bulk edits; etc. Most importantly, use the faceting feature to quickly resolve ambiguous and partial names: Alexis is expanded, upon inspection in context, into Alexis de Tocqueville and Alexis Johnson.
  • Use OpenRefine’s reconciliation feature to map the appellations to known entities in one or more authority databases.
  • Export the OpenRefine data to a file, and use that file to create new RDF statements for the knowledge base.

OpenRefine now has extensive support for integrating with Wikibase instances, and there is lots of documentation for working not only with Wikidata but directly with local Wikibase instances. Setting up a secure network instance of Wikibase will require some planning and some help from the dev-ops team, but I’m going to experiment with an entirely local setup.

<2022-05-05 Thu>

MediaWiki has made it much easier to set up instances of Wikibase: http://learningwikibase.com/install-wikibase/

An ER diagram

Here’s an ER diagram modeling our domain; I don’t think it’s as useful as the RDF modeling above.

erDiagram
  CANVAS ||--o{ INSCRIPTION : carries
  INSCRIPTION ||--|| CONTENT : "has symbolic content"
  INSCRIPTION ||--|| CANVAS : "is carried by"
  APPELLATION ||--|{ INSCRIPTION : "is depicted by"
  INSCRIPTION }|--|{ APPELLATION : depicts
  APPELLATION ||--|{ PERSON : identifies
  PERSON ||--|{ APPELLATION : "is identified by"   

The OpenRefine documentation discusses working with local services: https://github.com/OpenRefine/OpenRefine/wiki/Reconcilable-Data-Sources#local-services

There are also instructions in the OpenRefine Manual.

To upload to Wikibase: https://docs.openrefine.org/manual/wikibase/uploading

To create new items, see the “Creating new items” section of the OpenRefine documentation.

The Wikiversity Data Upload Pipeline is a step-by-step introduction to the process and, I think, a must-read for our group.

The pipeline they describe entails developing a schema; we already have the beginning of one in our use of CRM.

See https://blogs.tib.eu/wp/tib/2022/03/16/examining-wikidata-and-wikibase-in-the-context-of-research-data-management-applications/

While it is currently not possible to directly use a standard schema or vocabulary in your Wikibase instance (something that the Wikibase community are working to change), you can still take guidance from standards with regards to what properties are typically associated with a specific domain, and what vocabularies can be used to constrain the values of a dataset. Following standards will make querying and reusing the data in the future easier.

For each cell, you can manually “Create new item,” which will take the cell’s original value and apply it, as though it is a match. This will not become a dark blue link, because at this time there is nothing to link to: it is a draft entity stored only in your project. You can use this feature to prepare these entries for eventual upload to an editable service such as Wikibase, but most services do not yet support this feature.

The latest instructions seem to be here: https://www.mediawiki.org/wiki/Wikibase/Docker.

  • Install Docker. I re-installed Docker Desktop on my laptop.

    The install isn’t going smoothly.

<2022-05-17 Tue>

Let’s continue trying to install Wikibase.

This time, following Jim Hahn’s instructions, installation went smoothly.

Let’s review some goals here. We want to create a local authority knowledge base in which archivists and others can manage named entities.

  • from OpenRefine, an archivist should be able to use a reconciliation service to link names with known entities in, say, VIAF.
  • for entities not in VIAF, the archivist should be able to create new entities in the local knowledge base.

<2022-05-24 Tue>

After researching the issue, it seems that Wikibase isn’t the way we want to go for the moment: it supports creating a Wikidata-style ontology, but it doesn’t support importing statements from other ontologies (like CRM).

I’m going to exercise the workflow I’ve been developing up to this point by running it over another portion of the Kennan Papers, as suggested by Alexis: the writings; specifically the Diaries. And I am going to do it on the ruby-office1 machine, so I can tighten the code and the deployment.

<2022-05-25 Wed>

Note that it is not easy to get a Collection-level manifest.

Deployment Notes

There are two components to this system: adam, a Python package for creating named-entity graphs, and a graph database. The system is being developed with Ontotext’s GraphDB-Free, which is easy to install and use.

Adam requires Python 3.9+. It also requires Tesseract; on Linux: sudo apt install tesseract-ocr -y

  • Clone the repository.
  • Run git fetch, then git checkout code to get the code branch.
  • Install Poetry.
  • Run poetry install.
  • Install SpaCy language models: poetry run python -m spacy download en_core_web_lg

code cleanup

Cleaning up the repo.

  • Moving, temporarily, the kennan/ directory of manifests to ~/Desktop to get it out of the way; I don’t remember what it’s for.