Thinking about names

Matthew Milner: June 5, 2016 at 00:10

Building parsers for Nanohistory has involved quite a bit of thinking about what's in a name. I'm going to leave aside organizations, places, and things, and outline how I've approached the issue for prosopographical data.

Let's get some basics out of the way first - just so we're clear on what's going on. A "name" is essentially a label; whether it is attached to anything in particular is secondary in practical terms. In short, a name is a descriptor for something that can exist in reality or be completely fictive. This is important to note because, unlike a cataloguer in a repository, historical scholars are interested in the movement and shaping of identities: names are the lynchpins, but do not necessarily need to be associated with an *actual* human being who has lived, breathed, had emotions and experiences, and passed away. In this respect a name is a construct, and whether someone was alive, or is a fictitious individual in a novel or comic book, is a moot point for Nanohistory. There's a record; there's a name. It's the job of the scholar to know and discern what that means, but from the point of view of data, they're the same.

Of course these issues aren't new. Data scientists and cataloguers have faced them head on when creating new ontologies, namespaces, or reference models like TEI, FOAF, Schema.Org, CIDOC-CRM, and MARC21. It makes sense to list their solutions:

TEI

[See http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-persName.html]

  • addName: (additional name) contains an additional name component, such as a nickname, epithet, or alias, or any other descriptive phrase used within a personal name
  • affiliation: contains an informal description of a person's present or past affiliation with some organization, for example an employer or sponsor.
  • forename: contains a forename, given or baptismal name.
  • genName: (generational name component) contains a name component used to distinguish otherwise similar names on the basis of the relative ages or generations of the persons named.
    <persName>
     <surname>Pitt</surname>
     <genName>the Younger</genName>
    </persName>
    
  • nameLink: (name link) contains a connecting phrase or link used within a name but not regarded as part of it, such as van der or of.
  • orgName: (organization name) contains an organizational name.
  • persName: (personal name) contains a proper noun or proper-noun phrase referring to a person, possibly including one or more of the person's forenames, surnames, honorifics, added names, etc.
    <persName>
     <forename>Edward</forename>
     <forename>George</forename>
     <surname type="linked">Bulwer-Lytton</surname>, <roleName>Baron Lytton of
     <placeName>Knebworth</placeName>
     </roleName>
    </persName>
    
  • roleName: contains a name component which indicates that the referent has a particular role or position in society, such as an official title or rank.
  • surname: contains a family (inherited) name, as opposed to a given, baptismal, or nick name.
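To see how these TEI components can be pulled apart by a parser, here is a minimal Python sketch that reads the Bulwer-Lytton example quoted above. The function name `name_components` is my own, and the snippet has no XML namespace; a real TEI document declares the TEI namespace, so the tag handling would need adjusting.

```python
import xml.etree.ElementTree as ET

# The Bulwer-Lytton example from the TEI guidelines, as quoted above.
TEI_SNIPPET = """
<persName>
 <forename>Edward</forename>
 <forename>George</forename>
 <surname type="linked">Bulwer-Lytton</surname>, <roleName>Baron Lytton of
 <placeName>Knebworth</placeName>
 </roleName>
</persName>
"""


def name_components(xml_text):
    """Return (element name, normalised text) pairs for each child of persName."""
    root = ET.fromstring(xml_text.strip())
    parts = []
    for child in root:
        # itertext() also gathers text from nested elements (placeName inside
        # roleName); split/join collapses line breaks and indentation.
        text = " ".join("".join(child.itertext()).split())
        parts.append((child.tag, text))
    return parts


components = name_components(TEI_SNIPPET)
for tag, text in components:
    print(f"{tag}: {text}")
```

Note that the two forenames come back as separate entries, and that the place name is folded into the role name because it is nested inside it.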

FOAF (Friend of a Friend)

[See http://xmlns.com/foaf/spec/#term_Person]

The person class is not as robust as TEI, but has been widely used for a number of years in Linked Open Data contexts, and served as the basis for thinking through RDF-based semantic web processes and schemes.

  • The Person class represents people. Something is a Person if it is a person. We don't nitpic about whether they're alive, dead, real, or imaginary. The Person class is a sub-class of the Agent class, since all people are considered 'agents' in FOAF.
  • familyName: The family name of some person. The familyName property is provided (alongside givenName) for use when describing parts of people's names. Although these concepts do not capture the full range of personal naming styles found world-wide, they are commonly used and have some value.
  • firstName: The first name of a person. The firstName property is provided (alongside lastName) as a mechanism to support legacy data that cannot be easily interpreted in terms of the (otherwise preferred) familyName and givenName properties. The concepts of 'first' and 'last' names do not work well across cultural and linguistic boundaries; however they are widely used in addressbooks and databases.
  • lastName: The last name of a person. The lastName property is provided (alongside firstName) as a mechanism to support legacy data that cannot be easily interpreted in terms of the (otherwise preferred) familyName and givenName properties. The concepts of 'first' and 'last' names do not work well across cultural and linguistic boundaries; however they are widely used in addressbooks and databases.
  • givenName: The given name of some person. The givenName property is provided (alongside familyName) for use when describing parts of people's names. Although these concepts do not capture the full range of personal naming styles found world-wide, they are commonly used and have some value.
  • name: A name for some thing. FOAF provides some other naming constructs. While foaf:name does not explicitly represent name substructure (family vs given etc.) it does provide a basic level of interoperability.
  • surname (Archaic): The surname of some person.
  • nick: A short informal nickname characterising an agent (includes login identifiers, IRC and other chat nicknames). The nick property relates a Person to a short (often abbreviated) nickname, such as those use in IRC chat, online accounts, and computer logins. This property is necessarily vague, because it does not indicate any particular naming control authority, and so cannot distinguish a person's login from their (possibly various) IRC nicknames or other similar identifiers. However it has some utility, since many people use the same string (or slight variants) across a variety of such environments.
  • title: Title (Mr, Mrs, Ms, Dr. etc). This property is a candidate for deprecation in favour of 'honorificPrefix' following Portable Contacts usage.

Schema.Org: Person

[See https://schema.org/Person]

  • additionalName: An additional name for a Person, can be used for a middle name.
  • affiliation: An organization that this person is affiliated with. For example, a school/university, a club, or a team.
  • familyName: Family name. In the U.S., the last name of an Person. This can be used along with givenName instead of the name property.
  • givenName: Given name. In the U.S., the first name of a Person. This can be used along with familyName instead of the name property.
  • honorificPrefix: An honorific prefix preceding a Person's name such as Dr/Mrs/Mr.
  • honorificSuffix: An honorific suffix following a Person's name such as M.D./PhD/MSCSW.
  • jobTitle: The job title of the person (for example, Financial Manager).
  • memberOf: An Organization (or ProgramMembership) to which this Person or Organization belongs. Inverse property: member.
  • name (from thing): The name of the item.

MARC21

[See https://www.loc.gov/marc/authority/ad100.html]

Established personal name used in a name, name/title, or extended subject heading in established heading records, or an unestablished personal name used in these types of headings in a traced or an untraced reference record.

  • First Indicator: Type of personal name entry element
    0 - Forename
    1 - Surname
    3 - Family name
  • Second Indicator: Undefined
    # - Undefined
Subfield Codes
  • $a Personal name (NR)
  • $b Numeration (NR)
  • $c Titles and other words associated with a name (R)
  • $d Dates associated with a name (NR)
  • $e Relator term (R)
  • $f Date of a work (NR)
  • $g Miscellaneous information (R)
  • $h Medium (NR)
  • $j Attribution qualifier (R)
  • $k Form subheading (R)
  • $l Language of a work (NR)
  • $m Medium of performance for music (R)
  • $n Number of part/section of a work (R)
  • $o Arranged statement for music (NR)
  • $p Name of part/section of a work (R)
  • $q Fuller form of name (NR)
  • $r Key for music (NR)
  • $s Version (NR)
  • $t Title of a work (NR)
  • $v Form subdivision (R)
  • $x General subdivision (R)
  • $y Chronological subdivision (R)
  • $z Geographic subdivision (R)
  • $6 Linkage (NR)
  • $8 Field link and sequence number (R)
Examples
  • 100 0#$aManya K'Omalowete a Djonga,$d1950-
  • 100 1#$aMeyer
  • 100 1#$aJones, James E.,$cJr.
  • 100 1#$aSoares, A. J.$q(António José)
  • 100 1#$aCasadesus, Henri Gustave,$d1870-1947.$tConcertos,$mvioloncello, orchestra,$rC minor
  • 100 0#$aAmerican,$cpseud.
  • 100 0#$aE.S.,$cMeister,$d15th cent.,$jFollower of
  • 100 1#$aReynolds, Joshua,$cSir,$d1723-1792,$jPupil of
  • 100 0#$aGustaf$bV,$cKing of Sweden,$d1858-1950
  • 100 1#$aAppleton, Victor,$cII
  • 100 1#$aSalisbury, James Cecil,$cEarl of,$dd. 1683
  • 100 1#$aBrown, John,$d1800-1859,$edefendant
  • 100 1#$aWagner, Richard,$d1813-1883.$tOuvertüre.$hSound recording
  • 100 0#$aClaudius$q(Claudius Ceccon)
  • 100 1#$aShakespeare, William,$d1564-1616$xCriticism and interpretation$xHistory$y18th century
  • 100 0#$aFrederick$bII,$cHoly Roman Emperor,$d1194-1250$xHomes and haunts$zItaly
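The examples above pack indicators and subfields into a single line. As a rough sketch of how that line can be taken apart, the Python below splits one of the examples into its pieces. It assumes the display form used here, with "$" as the subfield delimiter and "#" for a blank indicator; binary MARC records use control characters instead, and the function name `parse_marc_100` is hypothetical.

```python
import re


def parse_marc_100(line):
    """Split a display-form MARC 100 line into indicators and subfields."""
    tag, indicators, body = line[:3], line[4:6], line[6:]
    if tag != "100":
        raise ValueError("not a 100 field")
    # Each subfield starts with "$" followed by a one-character code.
    subfields = re.findall(r"\$(\w)([^$]*)", body)
    return {
        "first_indicator": indicators[0],
        "second_indicator": indicators[1],
        # Trailing punctuation (commas, full stops) is kept as catalogued.
        "subfields": [(code, value.strip()) for code, value in subfields],
    }


record = parse_marc_100("100 0#$aGustaf$bV,$cKing of Sweden,$d1858-1950")
for code, value in record["subfields"]:
    print(code, value)
```

The first indicator "0" tells us the entry element is a forename, which is why "Gustaf" sits alone in $a while the numeration and title follow in $b and $c.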

CIDOC-CRM

[See http://www.cidoc-crm.org/rdfs/cidoc_crm_v6.2.1-draft-b-2015October.rdfs]

  • E21 Person: This class comprises real persons who live or are assumed to have lived. Legendary figures that may have existed, such as Ulysses and King Arthur, fall into this class if the documentation refers to them as historical figures. In cases where doubt exists as to whether several persons are in fact identical, multiple instances can be created and linked to indicate their relationship. The CRM does not propose a specific form to support reasoning about possible identity.
  • E39 Actor: This class comprises people, either individually or in groups, who have the potential to perform intentional actions of kinds for which someone may be held responsible. The CRM does not attempt to model the inadvertent actions of such actors. Individual people should be documented as instances of E21 Person, whereas groups should be documented as instances of either E74 Group or its subclass E40 Legal Body.
  • E82 Actor Appellation: This class comprises any sort of name, number, code or symbol characteristically used to identify an E39 Actor. An E39 Actor will typically have more than one E82 Actor Appellation, and instances of E82 Actor Appellation in turn may have alternative representations. The distinction between corporate and personal names, which is particularly important in library applications, should be made by explicitly linking the E82 Actor Appellation to an instance of either E21 Person or E74 Group/E40 Legal Body. If this is not possible, the distinction can be made through the use of the P2 has type mechanism.
  • P2 has type (is type of): This property allows sub typing of CRM entities - a form of specialisation - through the use of a terminological hierarchy, or thesaurus. The CRM is intended to focus on the high-level entities and relationships needed to describe data structures. Consequently, it does not specialise entities any further than is required for this immediate purpose. However, entities in the isA hierarchy of the CRM may be specialised into any number of sub entities, which can be defined in the E55 Type hierarchy. E51 Contact Point, for example, may be specialised into "e-mail address", "telephone number", "post office box", "URL" etc. none of which figures explicitly in the CRM hierarchy. Sub typing obviously requires consistency between the meaning of the terms assigned and the more general intent of the CRM entity in question.
  • P131 is identified by (identifies): This property identifies a name used specifically to identify an E39 Actor. This property is a specialisation of P1 is identified by (identifies).

Bringing it all together

Of these existing approaches, Schema.Org and FOAF come closest to what an historical scholar might actually find useful, simply because they are rather simplistic and because they are constructed with RDF and XML in mind. They're made for sharing, and think of a person as a 'thing' or entity with attributes. That said, MARC21 and the CIDOC-CRM are much more extensive and offer considerably more possibilities. TEI lies somewhere in between the two: it is expressed in XML and inherits much from its SGML origins, but it also owes much to the world of cataloguing. Of the five, CIDOC-CRM is likely the most flexible (it was designed to be so), as it appears to allow both for the definition of types for attributes and for the storage of string data. Yet it doesn't distinguish between name parts like the others do.

While each of these solutions works, they present problems when it comes to thinking about the ways names operate with regard to social and cultural relations for collaborative historical scholarship. There needs to be a happy middle ground between the full flexibility of CIDOC-CRM and the simplicity of FOAF. When it comes to thinking about names and how they express and capture social relationships, we need an approach that allows for tracing of the core name elements expressed in TEI, FOAF, Schema.Org, and MARC21. A surname or family name captures one of the most important concepts - that identity is grounded in familial and kin lines, and that other associations proper to naming are often secondary. TEI, FOAF and Schema.Org capture this nicely; MARC21 attempts to do so, while CIDOC-CRM simply doesn't.

There's quite a bit of flexibility within TEI's approach to personal names. From a structural point of view, however, when it comes to thinking about the interrelation of the components, we find little distinction between types of roles, their association with organizations, and nicknames, epithets etc. within the name element itself. Because TEI referencing is internal to a single XML document, there's no real necessity to build larger relationships within the document should an encoder choose not to. It's about what encoding is necessary for the text. MARC21 is by far the most robust, and is built to encode personal as well as other forms of data, like those connected to authorship, as well as organizations etc., in a single line of text. Understanding MARC21's structure is important for those using Open Data simply because of its dominance in library science and cataloguing contexts. And finally, despite its richness, the CIDOC-CRM notes this about a person: "The CRM does not propose a specific form to support reasoning about possible identity."

In short, we have text strings on the one hand, as name components, and on the other, the ability to make links in TEI, FOAF, and Schema.Org to other entities like places and organizations, but no effective means for thinking about the temporality of name components, and what those associations mean to the entities from which they derive. Yes, James I was King of England, but he was also King of Scotland, and both kingdoms, as polities and societies, had kings. Despite James's own theories on monarchy and identity, the kingdoms existed before and after his reign: kingship is therefore an attribute of the social entity of the kingdom. His son lost his head because those in the kingdom thought that this was no longer the case, one might say.

And so the issues become much more complex when we start thinking of names as temporalized and contextualized, and of the degree to which a name is connected to an individual person or derived from their association with other entities. We can think of this in a number of ways. Perhaps the easiest is to suppose that John Edward Smith is an individual we want to document from a Western context. John is born and given a name at birth. He might keep this name for the duration of his life, or he might end up changing it, at which point, although he remains the same person, in terms of documentary evidence an historical scholar would have to assert or argue that the change had in fact taken place. All we have from the evidence is text with two separate names. So John Edward Smith could become anything - George Jones, Tommy Lassene, etc. In this respect the degree to which someone embraces a new coherent name determines its importance. These would be pseudonyms or other identities, which in the case of Nanohistory need to be documented as separate individuals and linked as 'sameAs'. It becomes more nuanced when we think of pen names in the same kind of way - Samuel Clemens and Mark Twain are separate persons in terms of documentary evidence, but the same physical person. Both were cultivated by the same person, but are essentially separate entities. The question becomes to what extent such identities are co-terminous over time. A pen name is obviously co-terminous to a certain degree with the real person. But in the case of an actual name change, perhaps it isn't. John Edward Smith might have ceased to exist when he renamed himself George Jones, in which case perhaps the proper relationship is 'became', not 'sameAs'.
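The difference between 'sameAs' and 'became' can be made concrete with a small Python sketch. This is not Nanohistory's actual schema: the class names, the `asserted_by` field, and the rule that 'sameAs' reads in both directions while 'became' reads only forwards are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NameRecord:
    """One documented name; each is treated as its own entity."""
    label: str


@dataclass(frozen=True)
class Assertion:
    """A scholarly claim linking two name records."""
    subject: NameRecord
    relation: str      # 'sameAs' or 'became' (illustrative vocabulary)
    obj: NameRecord
    asserted_by: str   # hypothetical field: who makes the claim


clemens = NameRecord("Samuel Clemens")
twain = NameRecord("Mark Twain")
smith = NameRecord("John Edward Smith")
jones = NameRecord("George Jones")

assertions = [
    Assertion(clemens, "sameAs", twain, asserted_by="example scholar"),
    Assertion(smith, "became", jones, asserted_by="example scholar"),
]


def linked_names(record, assertions, relation):
    """Names linked to `record`; 'sameAs' is symmetric, 'became' is one-way."""
    found = []
    for a in assertions:
        if a.relation != relation:
            continue
        if a.subject == record:
            found.append(a.obj.label)
        elif relation == "sameAs" and a.obj == record:
            found.append(a.subject.label)
    return found


print(linked_names(twain, assertions, "sameAs"))
print(linked_names(smith, assertions, "became"))
print(linked_names(jones, assertions, "became"))
```

Starting from Mark Twain we reach Samuel Clemens, because the two identities coexist; starting from George Jones we do not travel back to John Edward Smith, because on this reading the earlier identity has ended.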

The usual situation, though, is for someone like John Edward Smith to acquire additional components to his name. Perhaps these serve to distinguish him from other John Edward Smiths, like his father, so he'd be John Edward Smith, Jr. Which was too awkward, so his family just called him Jack for short. Maybe his father is an aristocrat, or John Edward himself, 'the younger' we could say (using an epithet), gets knighted on his 18th birthday, becoming Sir John Edward. And maybe later, at 19, he gets to use one of his father's lesser titles, the Viscount of Sherwick. By the time he's 28 or so, he's managed to earn a PhD in particle physics. And so by 30, the simple John Edward Smith of his birth is now Sir John Edward Smith, Jr., PhD, the Viscount of Sherwick, who is Jack to his mother and brothers. And for his father, the Sr. epithet appears because there's now a junior to contend with. But before Jack's birth, he was just John Edward too. The point is that throughout all of this John Edward Smith persists, and the others are additions, or secondary names that shape the growing complexity of Jack's identity. They're contextual and temporal in character, and might not even be coterminous with John Edward's actual life span. Say his father is attainted and loses his peerages: there goes the Viscount Sherwick. Or when his father dies, John Edward (Jack, that is) becomes the Duke of Gowerton. The Viscount Sherwick isn't lost to him, but it is superseded by the greater title of Duke. Yet John Edward persists.
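Jack's accumulating name can be sketched as a core name plus components that each carry a start (and possibly an end). The Python below uses the ages given above as the time axis; the class, the field names, and the assumption that 'Jr.' and 'Jack' apply from birth are mine, not something the sources fix.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NameComponent:
    text: str
    kind: str                     # 'prefix', 'suffix', 'title' or 'nickname'
    from_age: int                 # age at which the component starts to apply
    to_age: Optional[int] = None  # None means it still applies


CORE_NAME = "John Edward Smith"

components = [
    NameComponent("Jr.", "suffix", from_age=0),      # assumed from birth
    NameComponent("Jack", "nickname", from_age=0),   # assumed from birth
    NameComponent("Sir", "prefix", from_age=18),
    NameComponent("Viscount of Sherwick", "title", from_age=19),
    NameComponent("PhD", "suffix", from_age=28),
]


def name_at(age):
    """Build the formal name from the components active at a given age."""
    active = [
        c for c in components
        if c.from_age <= age and (c.to_age is None or age < c.to_age)
    ]
    prefixes = [c.text for c in active if c.kind == "prefix"]
    suffixes = [c.text for c in active if c.kind == "suffix"]
    titles = [c.text for c in active if c.kind == "title"]
    name = " ".join(prefixes + [CORE_NAME])
    for extra in suffixes:
        name += f", {extra}"
    for title in titles:
        name += f", the {title}"
    return name


for age in (10, 18, 30):
    print(age, name_at(age))
```

The core name never changes; only the set of active components does. Losing the viscountcy through an attainder would be a matter of setting `to_age` on that one component.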

The reality is that each of these secondary components falls into one of two categories - alternative names or aliases, and secondary or attribute data that properly describes or belongs to John Edward on account of his actions (such as his PhD) or that is given to him by his family or his community, such as Sir and his peerage. While the Sir and Dr. are honorific prefixes, just as secondary as Jr., the peerage is something slightly different.

And why recount all of this? If we're going to build suitable network data, we need to be aware of entities like titles and positions, and how they are often flattened and aggregated by cataloguers in order to allow for disambiguation among a pool of agents. At the same time, simply providing a string of data, or failing to see something like a degree or a title as a form of relationship with an organization (like a community, country, or institution), can create problems for the data itself. In many cases we find occupations serving this role, despite the obvious problems when we think of how to model an agent. A person is defined by their names, and yes, 'the Ironmonger' might be a bona fide name, but it's a certain kind of name - an alias or an epithet - for someone who normally has a forename and a surname in a Western context. The solution I've opted for with NanoHistory is the following:

NanoHistory

Person
  • Forename: A given or individual name, or a single alias phrase, or a single-word name.
  • Middlenames: where appropriate given names other than the first or forename.
  • Surname: family name or last name, including locative prepositions (of, van, de, der, etc.)
  • Prefix: any kind of honorific, such as Mrs., Dr., The Rt. Hon., Sir, Lady, Sri, Sister, Pere, etc.
  • Suffix: by far the most flexible category, comprising honorifics for degrees (MA, PhD, Dipl., etc.), epithets ('The Ironmonger', 'The Younger'), generations or numerics (Jr., Sr., III), or nicknames ('Jack', 'Tommy', etc.). Epithets can also include phrases that look like titles and contain aristocratic ranks, but which have no association with a polity or organization and are in fact nicknames ('Lord of the Blue Sea', 'Red Baron').
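As a rough illustration of the five Person fields, here is a Python sketch of a record and one way of turning it into a display string. Holding prefixes and suffixes as lists, and the ordering rule in `display_name`, are assumptions made for the example rather than a description of how Nanohistory stores or renders names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Person:
    forename: str                                          # or a single-word name
    middlenames: List[str] = field(default_factory=list)
    surname: str = ""                                      # includes 'van', 'de', etc.
    prefix: List[str] = field(default_factory=list)        # honorifics
    suffix: List[str] = field(default_factory=list)        # degrees, epithets, Jr., nicknames

    def display_name(self):
        """Prefixes and name parts joined by spaces, then comma-separated suffixes."""
        parts = self.prefix + [self.forename] + self.middlenames
        if self.surname:
            parts.append(self.surname)
        return ", ".join([" ".join(parts)] + self.suffix)


jack = Person(
    forename="John",
    middlenames=["Edward"],
    surname="Smith",
    prefix=["Sir"],
    suffix=["Jr.", "PhD"],
)
beethoven = Person(forename="Ludwig", surname="van Beethoven")
voltaire = Person(forename="Voltaire")  # single-word name held in Forename

for person in (jack, beethoven, voltaire):
    print(person.display_name())
```

Titles such as the Viscount of Sherwick are deliberately absent here, since they are treated separately as secondary data.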
Secondary Data
  • Orthographic Variants: Nanohistory permits the documentation of orthographic variants for canonical names of individuals. These should be phonetically related to the canonical name, but can also be used for different transliterations. They do not constitute a new person.
  • Titles: Titles are properly attributes or events connected to societies or organizations. They can be aristocratic or not, like the King of France or the Bishop of Esztergom. Since they are technically events, they can have start and stop dates like other events. They can also be offices or ranks within organizations like the military or a corporation.
Other Identities

Rather than collapsing distinct names into single identities, Nanohistory allows for the preservation of distinct identities and views their collapse or binding as a matter of scholarly assertion properly designated as a 'sameAs' or 'isAlias' event.

  • Aliases & Pseudonyms: These are properly distinct separate people in Nanohistory. Our classic example is Mark Twain.
  • Occupations: Occupations are documented as events using terms - John Edward Smith is a printer. There might well be a title called the Printer of Rexton, but that would be in relation to an organization, and might not necessarily be an occupation or profession. Because these are documented as events, we can assign start and end dates for the occupation.
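To illustrate titles and occupations as events, the Python sketch below records the kingship example from earlier as events tied to a kingdom, alongside an undated occupation. The `Event` class, the relation terms 'holdsTitle' and 'hasOccupation', and the use of plain years are hypothetical choices for the example, not Nanohistory's actual vocabulary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    person: str
    relation: str                       # 'holdsTitle' or 'hasOccupation' (illustrative)
    term: str                           # the title or occupation
    organization: Optional[str] = None  # titles point at a polity or organization
    start: Optional[int] = None         # year; None means undocumented
    end: Optional[int] = None


events = [
    Event("James Stuart", "holdsTitle", "King", "Kingdom of Scotland", 1567, 1625),
    Event("James Stuart", "holdsTitle", "King", "Kingdom of England", 1603, 1625),
    Event("Charles Stuart", "holdsTitle", "King", "Kingdom of England", 1625, 1649),
    Event("John Edward Smith", "hasOccupation", "printer"),
]


def holders(term, organization, year):
    """People documented as holding `term` for `organization` in `year`."""
    return [
        e.person for e in events
        if e.term == term
        and e.organization == organization
        and e.start is not None and e.end is not None
        and e.start <= year <= e.end
    ]


print(holders("King", "Kingdom of Scotland", 1590))
print(holders("King", "Kingdom of England", 1590))
print(holders("King", "Kingdom of England", 1640))
```

Because the title hangs off the kingdom rather than the person, asking who was King of England in 1590 correctly returns nobody from this small dataset, even though James was already a king elsewhere.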