The Orlando British Women's Writing Dataset Release 1: Biography and Bibliography

Introduction

This dataset provides a rich set of linked open data representing women's literary history from the beginnings to the present, concentrated on writing in English in the British Isles but with tentacles out to other languages, literary traditions, and parts of the world. It emerges from two sources: the ongoing experiments in literary history conducted by The Orlando Project, whose textbase has been published by Cambridge University Press since 2006, with regular updates and additions, as Orlando: Women's Writing in the British Isles from the Beginnings to the Present; and the Canadian Writing Research Collaboratory's work in Linked Open Data as a means of enabling digital scholarship and collaboration in the humanities.

The Orlando Textbase is a semi-structured collection of biocritical entries providing detailed information on the lives and writing of more than 1400 writers, with accompanying literary, social, and political materials that provide context for its representation of literary history. It does not contain digitized versions of primary texts.

The aim of extracting linked data from Orlando's textbase is to make the data accessible in new ways for discovery, querying, analysis, and visualization; to promote interlinking between Orlando and other related materials on the web; and to experiment with the potential of Linked Open Data technologies to support knowledge production and dissemination in the humanities.

This is a first release of the dataset, which will be augmented and refined over time. This release is focused on the internal linking of biographical information using the CWRC ontology, with selective linking out to other ontology terms and other linked data entities. All of Orlando's bibliographical data, linked to Orlando authors, is also included in this release.

Research Questions

The ontology contains a number of competency questions. The entire preamble will be helpful in understanding the ontology and its potential uses for elucidating the data. Note that it exists in French as well as English. One key difference between this data and some other datasets is the large number of variables: consulting the ontology, and the related ontologies whose namespaces appear at the top of it, in relation to the data is key to understanding the dataset.

For the purpose of this competition, we suggest some more focused questions (see below) that the current dataset is positioned to illuminate. These could be approached by visualizations in a range of formats including charts, network graphs, geospatial maps, heat maps, trend visualization, and infographics.

An example of a chart approach: For the research question, "What is the relevance of family size to religious affiliation, socioeconomic status, or other factors?", produce a series of charts (bar, line, pie charts, etc.) that answer this question. Make the process interactive so that the user can select the factors for a multi-dimensional visualization.

An example of a network graph approach: what kinds of biographical networks connect British women writers to each other and to other writers? How extensive are kinship networks as opposed to networks based on political or religious affiliation?

An example of a map approach: Can you map out all the geographic information relevant to a person in Orlando? Can you also map out the person and all the people that they are connected to in Orlando? Does this map information change by time period, ethnicity, religion, etc.? How do the countries associated with geographical heritage change over time?

An example of a heat map approach: Taking one of the research questions provided below such as "Does social identity become more diverse over time?" can you represent this as a series of heat maps that plot out social identity (religion, ethnicity, etc.) over time? This could even be in the form of a video such as https://www.flickr.com/photos/150411108@N06/43350961005/#.

An example of trend analysis: the research question, "How is the number of publications related to the number of children a woman writer had?" could be illustrated using dynamic graphs, examples of which can be seen in videos of Hans Rosling's Gapminder visualizations such as https://www.youtube.com/watch?v=jbkSRLYSojo.

An example of an infographics approach: If you Google Emily Brontë, an infographic such as the one shown here is provided.

[Image: Emily Brontë infographic]

Using the Orlando data, can you produce a more detailed infographic of the author? Can you then expand this to have additional infographics of people who are connected to her and who also appear in the Orlando data? Can you find information in the Orlando data and the ontologies that would produce a much different infographic than Google's? What would an infographic of a larger group, such as all writers from a particular historical period, or all writers of a given genre of literature, look like?

Sample Queries

  1. What is the relevance of family size to religious affiliation, socioeconomic status, or other factors?
  2. Does social identity become more diverse over time? Are hybrid cultural forms more common than they were previously? Where are the clusters of hybridity and what are the outliers?
  3. What do bibliometric visualizations reveal about the publishing networks in the bibliographical data?
  4. Does the rise and fall of various genres over time within the works of Orlando authors correlate with scholarly accounts of when those genres rose and fell?
  5. Which authors are the most isolated in the network, and what factors (for instance, social class, place of publication) seem to be associated with being an outlier?
  6. Can number of publications be related to the number of children a woman writer had?
  7. Who are the most connected authors in the textbase (most points of contact with other authors) and what are the most common types of connections, in terms of either specific relationships or the contexts in which they occur? Do the types of connections tracked in the dataset change over time?
  8. What can a visualization reveal both about the structure of the ontology and the data associated with different components of it? For instance, can it reflect the number and types of instances associated with different components of the ontology, such as by showing the shifting proportions of different Biographical Contexts such as religion, politics, and sexuality as they occur in entries over time?

Linked Open Data

For more information on Linked Open Data, see Appendix A.

Location

Datasets

Bio+Bibl
  Description: Entire set of triples (contains all those listed below)
  Format and location:
    rdf/xml: http://sparql.cwrc.ca/storage/Orlando_Bio_and_Biblio_Version_1-0.rdf [276.7M]
    ttl: http://sparql.cwrc.ca/storage/Orlando_Bio_and_Biblio_Version_1-0.ttl [189.8M]
    Graph: http://sparql.cwrc.ca/db/BiographyBibliographyV1
  Triples: 3321746

Biography
  Description: Most biographical information from the Orlando dataset (details below)
  Format and location:
    rdf/xml: http://sparql.cwrc.ca/storage/Orlando_Biography_Subset_Version_1-0.rdf [105.3M]
    ttl: http://sparql.cwrc.ca/storage/Orlando_Biography_Subset_Version_1-0.ttl [65.3M]
    Graph: http://sparql.cwrc.ca/db/BiographyV1
  Triples: 1081674

CulturalForms
  Description: A subset of Biography focused on social identities
  Format and location:
    rdf/xml: http://sparql.cwrc.ca/storage/Orlando_Cultural_Forms_Subset_Version_1-0.rdf [19.4M]
    ttl: http://sparql.cwrc.ca/storage/Orlando_Cultural_Forms_Subset_Version_1-0.ttl [11M]
    Graph: http://sparql.cwrc.ca/db/CulturalFormsV1
  Triples: 192004

Bibliography
  Description: Standard bibliographic metadata using the Bibframe ontology
  Format and location:
    rdf/xml: http://sparql.cwrc.ca/storage/Orlando_Bibliography_Subset_Version_1-0.rdf [171.4M]
    ttl: http://sparql.cwrc.ca/storage/Orlando_Bibliography_Subset_Version_1-0.ttl [126.3M]
    Graph: http://sparql.cwrc.ca/db/BibliographyV1
  Triples: 2240072

Ontologies

CWRC
  Description: The main ontology
  Format: rdf/xml, ttl, n-triples
  Location:
    http://sparql.cwrc.ca/ontologies/cwrc.html
    http://sparql.cwrc.ca/ontologies/cwrc.rdf
    http://sparql.cwrc.ca/ontologies/cwrc.ttl
    http://sparql.cwrc.ca/ontologies/cwrc.nt
  Triples: 17772

CWRC genre ontology
  Description: Literary genres
  Location:
    http://sparql.cwrc.ca/ontologies/genre.html
    http://sparql.cwrc.ca/ontologies/genre.rdf
    http://sparql.cwrc.ca/ontologies/genre.ttl
    http://sparql.cwrc.ca/ontologies/genre.nt
  Triples: 8572

Illnesses and injuries ontology
  Description: Illnesses and injury classification based on ICD
  Location:
    http://sparql.cwrc.ca/ontologies/ii.html
    http://sparql.cwrc.ca/ontologies/ii.rdf
    http://sparql.cwrc.ca/ontologies/ii.ttl
    http://sparql.cwrc.ca/ontologies/ii.nt
  Triples: 887

Provenance

The datasets have been derived from the text base of The Orlando Project which “explores and harnesses the power of digital tools and methods to advance feminist literary scholarship” (http://www.artsrn.ualberta.ca/orlando/).

The dataset is based on the Orlando textbase, which comprises 1413 entries (1117 British women writers, 178 male writers, 177 other women writers—listed twice if their nationality shifted); 13,794 free-standing and 28,807 embedded chronology entries; 29,479 bibliographical listings; 2,749,854 tags; 9,038,958 words (exclusive of tags).

The Orlando dataset is XML-encoded text that employs several schemas that encode aspects of the Project's born-digital entries and contextualizing information on the lives and works of writers, which aim to elucidate the conditions of production and reception of their texts, as well as the features of the texts themselves, from a recuperative critical perspective that considers gender and other intersecting social forces to be a salient factor in literary history.

The textbase employs several XML schemas:

The text encoded with these schemas has provided the basis for the extraction process that has produced the dataset provided here.

Data extraction and transformation

The data extraction was guided by the CWRC ontologies, which were created on the basis of the XML tagset and the existing Orlando Project data. They leverage a number of existing ontologies, but also create a number of new relationships and numerous data instances.

The CWRC ontologies can be found at http://sparql.cwrc.ca/ from which they can be retrieved in various formats.

The separate ontologies are:

The code used in extraction can be found at https://github.com/cwrc/RDF-extraction

The basic methodology for producing the RDF from the Orlando XML is to extract relationships from selected XPath locations in the document. The correspondence between the XML document and the RDF relationships extracted for this dataset can be seen by comparing the XML file (here) and the resulting RDF file (here).
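
The general idea of extracting relationships from selected XPath locations can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the project's actual extraction code (see the repository linked above). The XML snippet, the element and attribute names (ENTRY, FAMILY, MEMBER, NAME), the example.org namespaces, and the predicate names are all hypothetical and do not reproduce the Orlando schemas or the CWRC ontology.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry: element names and attributes are invented for
# illustration and do not reproduce the Orlando schemas.
SAMPLE_XML = """<ENTRY id="writer_annie">
  <FAMILY>
    <MEMBER relation="brother"><NAME standard="Writer, John"/></MEMBER>
    <MEMBER relation="mother"><NAME standard="Writer, Mary"/></MEMBER>
  </FAMILY>
</ENTRY>"""

BASE = "http://example.org/data/"      # placeholder namespace
ONT = "http://example.org/ontology#"   # placeholder namespace


def to_uri(standard_name):
    """Turn 'Writer, John' into a placeholder URI ending in Writer_John."""
    return BASE + standard_name.replace(", ", "_").replace(" ", "_")


def extract_family_triples(xml_text):
    """Visit selected XPath locations and build (subject, predicate, object) tuples."""
    root = ET.fromstring(xml_text)
    subject = BASE + root.get("id")
    triples = []
    # ElementTree supports a limited XPath subset, which is enough here.
    for member in root.findall("./FAMILY/MEMBER"):
        name = member.find("./NAME")
        if name is None:
            continue  # no tagged name to link to, so assert nothing
        predicate = ONT + "has" + member.get("relation").capitalize()
        triples.append((subject, predicate, to_uri(name.get("standard"))))
    return triples


def to_ntriples(triples):
    """Serialize URI-only triples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


print(to_ntriples(extract_family_triples(SAMPLE_XML)))
```

Each tagged family member yields one triple whose subject is the entry's writer. A real extraction has to decide, tag by tag, whether a name found at a given location is central or tangential to the relationship, which is where most of the difficulty lies.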

If very specific questions remain, detailed information on extraction decisions is found here.

Data subsets

This draft dataset is available as a whole or in three subsets:

  1. Biography, containing:
    1. Birth and Death - includes birth and death dates for writers for whom these are known, in some cases including birth order within family and cause of death;
    2. Cultural identities - contains information on the social identities associated with writers, ranging from language, religion, social class, race, colour, or ethnicity to nationality. Such identities shift both historically and at times within writers’ lives;
    3. Family relations - information on the family members, including spouses, of writers, and at times information related to their occupations;
    4. Education - includes links to instructors, schools, subjects of study, and credentials earned;
    5. Friends - information on ties ranging from loose associations through close and enduring friendships to intimate relationships, both erotic and non-erotic;
    6. Leisure and Society - information on social activities;
    7. Occupations - covers both significant activities of and jobs held by writers;
    8. Political Affiliation - information on writers’ political activities including their affiliations with particular groups or organizations and their degrees of involvement;
    9. Spatial activities - information on writers’ residences, visits, travels, and migration to particular locations. Spatial data coordinates are granular to the level of settlement only; that is, they do not distinguish between different locations in the same place, such as London;
    10. Violence - information on writers’ experiences of violence on a range of scales;
    11. Wealth - information concerning writers’ poverty, income, and wealth;
    12. Health - information on writers’ physical and mental health and illnesses;
    13. General biographical materials that don’t align with the specific categories.
  2. Bibliography
    1. Standard bibliographic data about works published by the authors whose lives are described in the dataset, plus all works referenced in the Orlando textbase.
    2. Genre classification for the texts by women writers who have entries in Orlando.

Precision and Accuracy

This is an "open world" dataset, which is to say that the absence of an assertion does not indicate that the assertion is untrue.

The relationships between people represented in this dataset are based on inference from the XML. Results of the scripts have been read against key documents to ensure that the assertions that have been created in RDF are generally reasonable, based on the markup, and the scripts adjusted in response. However, not all results could be checked by human beings, and it is important to remember that the XML relationships were created by human beings producing discursive accounts of literary history and therefore seeking to tag notable features of the material they were writing, without awareness that this extraction would later take place. The result is that at times this dataset will produce misleading assertions based on how the data is extracted.

The most common cases are where a person mentioned within an XML tag used to create a relationship is tangentially rather than centrally involved in the relationship associated with that tag. For example, the discussion of Virginia Woolf's novel Orlando contains a mention of Woolf's diary, and that mention is tagged as a genre; as a result, Orlando is identified as a diary as well as a biography, drama, fiction, novel, history, masque, and mock form. Another example is the name of a commentator or a contemporary witness: if it is mentioned and tagged in the XML within the prose of a tag, it may be extracted and create a false assertion of a relationship between the subject of an Orlando entry and the commentator or witness, when in fact they were named in the document as a source of information about the relationship. We have worked to eliminate such false assertions where we can do so systematically during the data extraction process, and plan to work towards more sophisticated means of eliminating them, for example by detecting asserted personal relationships between people whose lives did not overlap.
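
As an illustration of the kind of check a user of the data could apply in the meantime, the following Python sketch flags asserted relationships between people whose known dates rule out any overlap in their lives. It is one possible approach under stated assumptions, not a description of the project's extraction scripts: the names, dates, and the friendOf relation are invented, and dates are reduced to years.

```python
from typing import NamedTuple, Optional


class Lifespan(NamedTuple):
    birth: Optional[int]  # year, or None when unknown
    death: Optional[int]  # year, or None when unknown


def lives_overlap(a, b):
    """Return False only when the known dates rule out any overlap.

    Missing dates are treated as unknown (open world), so they never
    cause a relationship to be rejected.
    """
    if a.death is not None and b.birth is not None and a.death < b.birth:
        return False
    if b.death is not None and a.birth is not None and b.death < a.birth:
        return False
    return True


def flag_implausible(relationships, lifespans):
    """Yield (person, relation, other) tuples whose parties never coexisted."""
    unknown = Lifespan(None, None)
    for person, relation, other in relationships:
        if not lives_overlap(lifespans.get(person, unknown),
                             lifespans.get(other, unknown)):
            yield (person, relation, other)


# Invented people and dates, for illustration only.
LIFESPANS = {
    "Writer_A": Lifespan(1640, 1689),
    "Writer_B": Lifespan(1882, 1941),
    "Writer_C": Lifespan(1892, None),
}
RELATIONSHIPS = [
    ("Writer_B", "friendOf", "Writer_C"),  # lives overlap
    ("Writer_B", "friendOf", "Writer_A"),  # lives centuries apart
    ("Writer_B", "friendOf", "Writer_D"),  # no dates known, so kept
]

for triple in flag_implausible(RELATIONSHIPS, LIFESPANS):
    print("check manually:", triple)
```

A check like this only makes sense for personal relationships. Connections such as literary influence can legitimately link people who never lived at the same time, so flagged triples are best reviewed rather than deleted automatically.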

However, these and other factors also mean that not every relationship indicated by the XML has necessarily been extracted, particularly if such extraction would lead to a substantial number of false assertions in addition to true ones. The extraction scripts are available for consultation here.

Omissions

This dataset is not in any sense comprehensive, given that it is based both on historical sources full of gaps and selectivity with respect to the inclusion of particular details. The markup from which the data is extracted is also on the more interpretive end of the spectrum, meaning that there are inconsistencies in application, even though encoders receive extensive training and all markup has been reviewed by a senior scholar.

In addition, there are some specific omissions from this draft dataset. The following major components of the Orlando data are not yet present:

  1. Personal name variants, including pseudonyms;
  2. Occupations for family members (not yet linked to ontology instances);
  3. Events related to birth, death, family, and interpersonal contexts, plus thousands of freestanding events that include information on other writers and historical contexts;
  4. Orlando literary relations, apart from those represented in the bibliographic data;
  5. Scholarly notes and citation information (found at present only in the full Orlando textbase).

Other omissions are related to the provenance of the dataset and the priorities of the Orlando Project itself. Specific genre information is present only for the works of women writers with entries and not for all bibliographic records. In general, information on men writers and writers from outside Britain is less full than information on British women writers.

This is not yet a 5-star LOD dataset. Still to come are more links out to other LOD identifiers, plus dereferenceable URIs for the writers themselves, who are currently signalled with placeholder identifiers in a format based on their standard names. Thus, although http://cwrc.ca/cwrcdata/Abdy_Maria appears to be a URL, it will not resolve to a webpage and the user will get a 404 error. In addition, there are “blank nodes” with identifiers in a format such as "N59254ff7a0fa44a6b36eb878ea08b7a0" that occur in the Web Annotation component of the data. These are generated in order to link each annotation/context to the string of Turtle for the triples associated with that annotation, so that a system can make that link in the future. For many purposes, it may be best just to filter those out.
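
Filtering out these blank nodes can be done with a simple pattern match. The Python sketch below assumes the triples have already been loaded as (subject, predicate, object) string tuples and that blank node identifiers follow the shape quoted above (the letter N followed by 32 hexadecimal digits, optionally with the "_:" prefix used in Turtle and N-Triples). The predicates and the label in the sample data are placeholders.

```python
import re

# "N" followed by 32 hex digits, optionally written with the "_:"
# blank-node prefix used in N-Triples and Turtle.
BLANK_ID = re.compile(r"^(_:)?N[0-9a-f]{32}$")


def is_blank(term):
    """True when a term looks like one of the generated blank node identifiers."""
    return bool(BLANK_ID.match(term))


def drop_blank_node_triples(triples):
    """Keep only triples whose subject and object are not blank nodes."""
    return [(s, p, o) for s, p, o in triples
            if not is_blank(s) and not is_blank(o)]


# Illustrative triples; the ex: predicates are placeholders.
sample = [
    ("http://cwrc.ca/cwrcdata/Abdy_Maria", "ex:hasContext",
     "_:N59254ff7a0fa44a6b36eb878ea08b7a0"),
    ("_:N59254ff7a0fa44a6b36eb878ea08b7a0", "ex:value", "some text"),
    ("http://cwrc.ca/cwrcdata/Abdy_Maria", "ex:label", "Maria Abdy"),
]

print(drop_blank_node_triples(sample))
```

Note that this removes the annotation links themselves, so it is suitable for analyses of the biographical assertions but not for work that needs to trace an assertion back to its discursive context.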

Key decisions and strategies

Anonymous nodes

The dataset contains technically anonymous nodes, that is, nodes that do not have labels and are not dereferenceable. Their identifiers come in forms such as: N296d21fa528149a3a49ae7099a6e37fa.

Nodes such as "Annie Writer's brother" have been created when we know something about the person even though we don't know their name. This means people exist in the ontology who do not have URIs, and therefore the same person may be separately defined multiple times.

Natural language processing

NLP has been employed judiciously to complement the XML in cases where there was obvious benefit and we could expect a high level of accuracy. For instance, the high degree of overlap in vocabulary between terms used in the <CAUSEOFDEATH> tag and words occurring in the <HEALTH> tag allowed us to create assertions about health factors that were not supported by the markup.

Regularization, disambiguation, and linking of data

Regularization was available in the dataset only for personal names and organizations. In other areas, such as for religions, it has been achieved through the ontology, wherein the "skos:altLabel" property indicates terms that have been grouped together.
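
To show how grouping through "skos:altLabel" can be used in practice, here is a minimal Python sketch that builds a lookup from labels to ontology instances and uses it to regularize a source string. The triples are invented for illustration; the ex: instances and their labels are placeholders rather than actual CWRC ontology terms.

```python
SKOS_PREF = "skos:prefLabel"
SKOS_ALT = "skos:altLabel"

# Hypothetical ontology triples; the ex: instances are placeholders,
# not actual CWRC terms.
ONTOLOGY_TRIPLES = [
    ("ex:Quaker", SKOS_PREF, "Quaker"),
    ("ex:Quaker", SKOS_ALT, "Society of Friends"),
    ("ex:Quaker", SKOS_ALT, "Friends"),
    ("ex:Methodist", SKOS_PREF, "Methodist"),
    ("ex:Methodist", SKOS_ALT, "Wesleyan"),
]


def build_label_index(triples):
    """Map every preferred or alternative label (case-insensitive) to its instance."""
    index = {}
    for subject, predicate, label in triples:
        if predicate in (SKOS_PREF, SKOS_ALT):
            index[label.casefold()] = subject
    return index


def regularize(term, index):
    """Return the instance a source string maps to, or None if unmatched."""
    return index.get(term.strip().casefold())


index = build_label_index(ONTOLOGY_TRIPLES)
print(regularize("society of friends", index))
print(regularize("Unitarian", index))
```

Strings that return None here correspond to the idiosyncratic cases that have been left as plain text rather than linked to an instance.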

In the case of place names, the results of automated matching (of the combination of settlement name and its associated region and geo-political unit) have been reviewed for accuracy, so geospatial coordinates should be accurate down to the level of the settlement or populated place. We use Geonames for most place identifiers, supplemented by Getty placenames where needed. These matches are made with a view to making the data mappable. However, we recognize that historical shifts and political contestation may make the labels associated with particular past and present places problematic.

While it is highly desirable to convert strings of text to things, in the sense of defined, dereferenceable entities, this is not possible or indeed desirable in all cases, given the state of the source data. Where it has been possible to link to an external ontology or to regularize vocabulary within the data and create LOD instances of terms within the CWRC ontology, this dataset has done so. However, in some cases the data has been so idiosyncratic or heterogeneous that this has not been done. This is often the case for a few instances within a larger subset of instances that have been regularized, such as religion. In the case of education, there were many outliers that have not been created as linked data instances. There are thousands of occupations, which are at present represented as strings, but a subset of common occupations grouping terms together is forthcoming.

Notable features

Events provide content with temporal and geospatial locators amenable to timelines or mapping. We draw on the Simple Event Model, extending it to indicate separately from date values themselves the degree of certainty associated with an event's date, and providing typing of events specific to our domain. Not all Contexts have events at present, although all will eventually: at present they are concentrated on Cultural Forms, Health, Violence, Wealth, Leisure and Other Life Events. Note that although birth and death events are not present in this dataset, there are birth and death dates for authors that can be used as a basis for modeling temporal change.

Contexts use the Web Annotation data model to provide links from particular entities to their discursive contexts. "MotivatedBy:identifying" annotations indicate all entities mentioned in a particular discursive context, so one can group mentions of a certain person or persons by Context type. "MotivatedBy:describing" annotations include as annotation bodies the writer who is being described; associated RDF triples are included as plain text bodies of these annotations with identifiers in a form such as "N59254ff7a0fa44a6b36eb878ea08b7a0". These plaintext values can be used to identify the exact triple that is being described by the data. During processing it is safe to use these strings as valid Turtle or N-Triples.

Spatial data is often but not always (e.g. place of death) regularized to Geonames or the Getty Thesaurus of Geographic Names identifiers.

Bibliography objects use the BIBFRAME RDF schema supplemented by a couple of terms from schema.org.

License

This dataset is made available under a CC-BY-NC license. If you make use of this dataset, we would appreciate being informed of this at cwrc@ualberta.ca.

Contributors

to the production of this dataset:

Jeffery Antoniuk. University of Alberta. Orlando Project programmer and systems analyst. Assisted in preparation and extraction of dataset.

Susan Brown. University of Guelph. Project lead. Produced extraction guidelines and guided process. https://www.uoguelph.ca/~sbrown/

Joel Cummings. Responsible for overseeing and contributing to technical work on the ontology and making key design decisions to support extraction from a variety of sources.

Jasmine Drudge-Willson. University of Guelph. Research Assistant. Responsible for researching and modeling structural and theoretical aspects of external ontologies, and how they relate to the aims of the CWRC ontology.

Abigel Lemak. University of Guelph. PhD student in Literary Studies. Overall project management, as well as drafting ontology terms and definitions, particularly those that deal with Cultural Forms.

Kim Martin. University of Guelph. Helped with project management. Produced sample triples for testing accuracy of extraction.

Alliyya Mo. University of Guelph. Wrote the extraction scripts for cultural form extraction and much of the rest of the biography data. Minted instance data for cultural forms, genres, religions, political affiliations. Responsible also for much ontology refinement reflecting the extraction process.

Michaela Rye. University of Guelph. Undergraduate research assistant. Responsible for tag cleanup as well as disambiguating and drafting ontology terms, particularly those that deal with Occupations.

Gurjap Singh. University of Guelph. Co-op student. Responsible for extraction of birth, death, and family data from Orlando data. Created script to combine triples from the separate scripts. Queried Geonames API to get URIs for locations in Orlando.

Thomas Smith. University of Guelph. Undergraduate research assistant. Responsible for tag cleanup as well as disambiguating and drafting ontology terms, particularly those that deal with geospatial data, educational awards, literary awards, and Occupations.

Deborah Stacey. University of Guelph. Associate professor in the School of Computer Science. Helped coordinate the process. Wrote scripts for extracting cause of death and health triples. http://ontology.socs.uoguelph.ca/

See also the Orlando Project credits and the CWRC Ontology Credits

Appendix A

Various tools exist to handle and produce Linked Open Data (LOD) in its various formats.

CWRC uses an OWL RDF ontology schema, which means data is represented in triple format (subject, predicate, object). The RDF is then exported to one of several formats, each of which expresses the RDF in its own syntax.

To transform and handle RDF, several libraries are available depending on your language of choice. Here are our recommendations:

Each of these libraries allows one to load the RDF data and traverse it as a graph, which is useful for identifying links between data. More comprehensive libraries such as Apache Jena also allow one to query the dataset and perform inference from within an application.
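
To make the idea of traversing triples as a graph concrete, the following Python sketch uses only the standard library rather than any particular RDF library. It indexes a handful of triples as an adjacency map and finds the shortest chain of connections between two people. The triples and the ex: names are invented, and treating every relationship as undirected is a simplifying assumption.

```python
from collections import defaultdict, deque

# Invented triples; subjects, predicates, and objects are placeholders.
TRIPLES = [
    ("ex:Writer_A", "ex:friendOf", "ex:Writer_B"),
    ("ex:Writer_B", "ex:siblingOf", "ex:Writer_C"),
    ("ex:Writer_C", "ex:studentOf", "ex:Writer_D"),
    ("ex:Writer_E", "ex:friendOf", "ex:Writer_F"),
]


def build_graph(triples):
    """Index triples as an undirected adjacency map keyed by node."""
    graph = defaultdict(list)
    for s, p, o in triples:
        graph[s].append((p, o))
        graph[o].append((p, s))
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search: return the nodes linking start to goal, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for _, neighbour in graph.get(path[-1], []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


graph = build_graph(TRIPLES)
print(shortest_path(graph, "ex:Writer_A", "ex:Writer_D"))
print(shortest_path(graph, "ex:Writer_A", "ex:Writer_F"))
```

A dedicated RDF library handles parsing, namespaces, and literals, which this sketch ignores; the traversal logic, however, is essentially the same once the triples are in memory.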

A very full set of recommended resources on LOD has been gathered by the folks at Islandora's CLAW project, and can be found here.

Querying using the SPARQL Endpoint

We have also provided each ontology within a Graph so that it can be queried using the SPARQL endpoint. This requires knowledge of SPARQL, but it is powerful as an inspection and generation tool.

As an example, to get information from the bibliography graph (here, places and the values attached to them), we provide this query, which you can paste into the SPARQL box:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX bf: <http://id.loc.gov/ontologies/bibframe/>

SELECT * WHERE {
  # Restrict the query to the bibliography graph
  GRAPH <http://sparql.cwrc.ca/db/BibliographyV1> {
    # Find place nodes and the values attached to them
    ?sub bf:place ?obj .
    ?obj rdf:value ?o .
  }
}
LIMIT 10
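
The same query can also be sent from a program instead of being pasted into the SPARQL box. The Python sketch below uses only the standard library and follows the general SPARQL protocol convention of passing the query in a "query" parameter and asking for JSON results. The endpoint address is an assumption and may need to be replaced with the address the service actually uses; run_query is a hypothetical helper name.

```python
import json
import urllib.parse
import urllib.request

# ASSUMPTION: placeholder endpoint address; substitute the address that
# actually accepts SPARQL protocol requests.
ENDPOINT = "http://sparql.cwrc.ca/sparql"

QUERY = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX bf: <http://id.loc.gov/ontologies/bibframe/>

SELECT * WHERE {
  GRAPH <http://sparql.cwrc.ca/db/BibliographyV1> {
    ?sub bf:place ?obj .
    ?obj rdf:value ?o .
  }
}
LIMIT 10
"""


def build_request(query, endpoint=ENDPOINT):
    """Build a GET request that carries the query in the 'query' parameter."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def parse_bindings(payload):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    rows = json.loads(payload)["results"]["bindings"]
    return [{var: cell["value"] for var, cell in row.items()} for row in rows]


def run_query(query, endpoint=ENDPOINT):
    """Send the query over the network and return the parsed rows."""
    request = build_request(query, endpoint)
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_bindings(response.read().decode("utf-8"))
```

Calling run_query(QUERY) would return up to ten rows, each a dictionary with the keys sub, obj, and o, provided the endpoint address is correct and the service returns JSON results.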