[01-29-2022] Release 1.52. See the 1.51 release for more details.
This dataset provides a rich set of linked open data representing women's literary history from the beginnings to the present, concentrated on writing in English in the British Isles but with tentacles out to other languages, literary traditions, and parts of the world. It emerges from the ongoing experiments in literary history being conducted by The Orlando Project, whose textbase is published and regularly updated and augmented as Orlando: Women's Writing in the British Isles from the Beginnings to the Present by Cambridge University Press since 2006, and from the Canadian Writing Research Collaboratory's work in Linked Open Data as a means of enabling digital scholarship and collaboration in the humanities.
The Orlando Textbase is a semi-structured collection of biocritical entries providing detailed information on the lives and writing of more than 1400 writers, with accompanying literary, social, and political materials that provide context for its representation of literary history. It does not contain digitized versions of primary texts.
The aim of extracting linked data from Orlando's textbase is to make the data accessible in new ways to discovery, querying, analysis, and visualization; to promote interlinking between Orlando and other related materials on the web; and to experiment with the potential of Linked Open Data technologies to support knowledge production and dissemination in the humanities.
This is a second release of the dataset, which is continuously being augmented and refined. This release is focused on the internal linking of biographical information using the CWRC ontology, with selective linking out to other ontology terms and other linked data entities. All of Orlando's bibliographical data, linked to Orlando authors, is also included in this release.
The ontology contains a number of competency questions. The entire preamble will be helpful in understanding the ontology and its potential uses for elucidating the data. Note that it exists in French as well as English. One key difference between this data and some other datasets is the large number of variables: consulting the ontology, and the related ontologies whose namespaces appear at the top of it, in relation to the data is key to understanding the dataset.
For the purpose of this competition, we suggest some more focused questions (see below) that the current dataset is positioned to illuminate. These could be approached by visualizations in a range of formats including charts, network graphs, geospatial maps, heat maps, trend visualization, and infographics.
An example of a chart approach: For the research question, "What is the relevance of family size to religious affiliation, socioeconomic status, or other factors?", produce a series of charts (bar, line, pie charts, etc.) that answer this question. Make the process interactive so that the user can select the factors for a multi-dimensional visualization.
An example of a network graph approach: what kinds of biographical networks connect British women writers to each other and to other writers? How extensive are kinship networks as opposed to networks based on political or religious affiliation?
An example of a map approach: Can you map out all the geographic information relevant to a person in Orlando? Can you also map out the person and all the people that they are connected to in Orlando? Does this map information change over time period, with ethnicity, religion, etc.? How do the countries associated with geographical heritage change over time?
An example of a heat map approach: Taking one of the research questions provided below such as "Does social identity become more diverse over time?" can you represent this as a series of heat maps that plot out social identity (religion, ethnicity, etc.) over time? This could even be in the form of a video such as Temperature Anomalies by Country 1880-2017.
An example of trend analysis: the research question, "How is the number of publications related to the number of children a woman writer had?" could be illustrated using dynamic graphs, examples of which can be seen in videos of Hans Rosling's Gapminder visualizations such as Hans Rosling's 200 Countries, 200 Years, 4 Minutes - The Joy of Stats - BBC Four.
An example of an infographics approach: If you Google Emily Brontë, an infographic such as the one shown here is provided.
Using the Orlando data, can you produce a more detailed infographic of the author? Can you then expand this to have additional infographics of people who are connected to her and who also appear in the Orlando data? Can you find information in the Orlando data and the ontologies that would produce a much different infographic than Google's? What would an infographic of a larger group, such as all writers from a particular historical period, or all writers of a given genre of literature, look like?
For more information on Linked Open Data, see Appendix A.
Name: Biography+Bibliographic Context Centric
Description: Entire set of triples including ontologies (contains all those listed below)
RDF/XML: http://sparql.cwrc.ca/data/orlando.rdf [276.7M]
TTL: http://sparql.cwrc.ca/data/orlando.ttl [189.8M]
Graph: http://sparql.cwrc.ca/data/orlando
Triples: 3870906

Name: Biography
Description: Most biographical information from the Orlando dataset
RDF/XML: http://sparql.cwrc.ca/data/orlando/biography.rdf [105.3M]
TTL: http://sparql.cwrc.ca/data/orlando/biography.ttl [65.3M]
Graph: http://sparql.cwrc.ca/data/orlando/biography
Triples: 1081674

Name: CulturalForms
Description: A subset of Biography focused on social identities
RDF/XML: http://sparql.cwrc.ca/data/orlando/culturalforms.rdf [19.4M]
TTL: http://sparql.cwrc.ca/data/orlando/culturalforms.ttl [11M]
Graph: http://sparql.cwrc.ca/data/orlando/culturalforms
Triples: 192004

Name: Bibliography
Description: Standard bibliographic metadata using the Bibframe ontology
RDF/XML: http://sparql.cwrc.ca/data/orlando/bibliography.rdf [171.4M]
TTL: http://sparql.cwrc.ca/data/orlando/bibliography.ttl [126.3M]
Graph: http://sparql.cwrc.ca/data/orlando/bibliography
Triples: 2240072

Name: Biography Simple Triples
RDF/XML: http://sparql.cwrc.ca/data/orlando/simple.rdf [276.7M]
TTL: http://sparql.cwrc.ca/data/orlando/simple.ttl [189.8M]
Graph: http://sparql.cwrc.ca/data/orlando/simple
Triples: 170557

Name: CWRC
Description: The main ontology
Documentation: http://sparql.cwrc.ca/ontologies/cwrc.html
RDF/XML: http://sparql.cwrc.ca/ontologies/cwrc.rdf
TTL: http://sparql.cwrc.ca/ontologies/cwrc.ttl
N-Triples: http://sparql.cwrc.ca/ontologies/cwrc.nt
Triples: 17772

Name: CWRC genre ontology
Description: Literary genres
Documentation: http://sparql.cwrc.ca/ontologies/genre.html
RDF/XML: http://sparql.cwrc.ca/ontologies/genre.rdf
TTL: http://sparql.cwrc.ca/ontologies/genre.ttl
N-Triples: http://sparql.cwrc.ca/ontologies/genre.nt
Triples: 8572

Name: Illnesses and injuries ontology
Description: Illnesses and injury classification based on ICD
Documentation: http://sparql.cwrc.ca/ontologies/ii.html
RDF/XML: http://sparql.cwrc.ca/ontologies/ii.rdf
TTL: http://sparql.cwrc.ca/ontologies/ii.ttl
N-Triples: http://sparql.cwrc.ca/ontologies/ii.nt
Triples: 887
The datasets have been derived from the text base of The Orlando Project which “explores and harnesses the power of digital tools and methods to advance feminist literary scholarship” (http://www.artsrn.ualberta.ca/orlando/).
The dataset is based on the Orlando textbase, which comprises 1413 entries (1117 British women writers, 178 male writers, 177 other women writers—listed twice if their nationality shifted); 13,794 free-standing and 28,807 embedded chronology entries; 29,479 bibliographical listings; 2,749,854 tags; 9,038,958 words (exclusive of tags).
The Orlando dataset is XML-encoded text. Its schemas encode aspects of the Project's born-digital entries and contextualizing information on the lives and works of writers. The entries aim to elucidate the conditions of production and reception of the writers' texts, as well as the features of the texts themselves, from a recuperative critical perspective that considers gender and other intersecting social forces to be salient factors in literary history.
The textbase employs several XML schemas:
The text encoded with these schemas has provided the basis for the extraction process that has produced the dataset provided here.
The data extraction was guided by the CWRC ontologies, which were created on the basis of the XML tagset and the existing Orlando Project data. They leverage a number of existing ontologies, but also create a number of new relationships and numerous data instances.
The CWRC ontologies can be found at http://sparql.cwrc.ca/ from which they can be retrieved in various formats.
The separate ontologies are:
The code used in extraction can be found at https://github.com/cwrc/RDF-extraction
The basic methodology for producing the RDF from the Orlando XML is to extract relationships from selected XPath locations in the document.
If very specific questions remain, detailed information on extraction decisions is found here.
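The XPath-based approach described above can be illustrated with a minimal sketch. This is not the project's extraction code (which lives in the repository linked above). The XML fragment, the `TERM` tag, the `ex:` prefix, and the predicate names are all invented for illustration; only the tag names `CAUSEOFDEATH` and `HEALTH` come from this document.

```python
import xml.etree.ElementTree as ET

# Invented fragment: only CAUSEOFDEATH and HEALTH are real Orlando tag names
# mentioned in this document; everything else here is hypothetical.
SAMPLE_XML = """
<ENTRY id="Example_Writer">
  <BIOGRAPHY>
    <HEALTH>She suffered from <TERM>tuberculosis</TERM> for years.</HEALTH>
    <DEATH><CAUSEOFDEATH>tuberculosis</CAUSEOFDEATH></DEATH>
  </BIOGRAPHY>
</ENTRY>
"""

# Hypothetical mapping from an XPath location to the predicate it produces.
XPATH_TO_PREDICATE = {
    ".//DEATH/CAUSEOFDEATH": "ex:hasCauseOfDeath",
    ".//HEALTH/TERM": "ex:hasHealthIssue",
}


def extract_triples(xml_text):
    """Return (subject, predicate, object) tuples for each matched location."""
    root = ET.fromstring(xml_text)
    subject = "ex:" + root.attrib["id"]
    triples = []
    for xpath, predicate in XPATH_TO_PREDICATE.items():
        for node in root.findall(xpath):
            text = "".join(node.itertext()).strip()
            if text:
                triples.append((subject, predicate, text))
    return triples


triples = extract_triples(SAMPLE_XML)
for triple in triples:
    print(triple)
```

Because each assertion is tied to a tag location rather than to the meaning of the surrounding prose, anything tagged inside a matched location is extracted, which is the source of the misleading assertions discussed later in this document.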
This draft dataset is available as a whole or in three subsets:
This is an "open world" dataset, which is to say that the absence of an assertion does not indicate that the assertion is untrue.
The relationships between people represented in this dataset are based on inference from the XML. Results of the scripts have been read against key documents to ensure that the assertions that have been created in RDF are generally reasonable, based on the markup, and the scripts adjusted in response. However, not all results could be checked by human beings, and it is important to remember that the XML relationships were created by human beings producing discursive accounts of literary history and therefore seeking to tag notable features of the material they were writing, without awareness that this extraction would later take place. The result is that at times this dataset will produce misleading assertions based on how the data is extracted.
The most common cases are where a person mentioned within an XML tag used to create a relationship is tangentially rather than centrally involved in the relationship associated with that tag. For example, the discussion of Virginia Woolf's novel Orlando contains a mention of Woolf's diary, and that mention is tagged as a genre; as a result, Orlando is identified as a diary as well as a biography, drama, fiction, novel, history, masque, and mock form. Another example is that the name of a commentator or a contemporary witness, if mentioned and tagged in the XML within the prose of a tag, may be extracted and create a false assertion of a relationship between the subject of an Orlando entry and the commentator or witness, whereas they were being named in the document as a source of information about the relationship. We have worked to eliminate such false assertions where we can do so systematically during the data extraction process, and plan to work towards more sophisticated means of eliminating others, such as assertions of personal relationships between people whose lives did not overlap.
However, these and other factors also mean that not every relationship indicated by the XML has necessarily been extracted, particularly if such extraction would lead to a substantial number of false assertions in addition to true ones. The extraction scripts are available for consultation here.
This dataset is not in any sense comprehensive, given that it is based both on historical sources full of gaps and selectivity with respect to the inclusion of particular details. The markup from which the data is extracted is also on the more interpretive end of the spectrum, meaning that there are inconsistencies in application, even though encoders receive extensive training and all markup has been reviewed by a senior scholar.
In addition, there are some specific omissions from this draft dataset. The following major components of the Orlando data are not yet present:
Other omissions are related to the provenance of the dataset and the priorities of the Orlando Project itself. Specific genre information is present only for the works of women writers with entries and not for all bibliographic records. In general, information on men writers and writers from outside Britain is less full than information on British women writers.
This is not yet a 5-star LODset. Still to come are more links out to other LOD identifiers, plus dereferenceable URIs for the writers themselves, who are currently signalled with placeholder identifiers in a format based on their standard names. Thus, although http://cwrc.ca/cwrcdata/Abdy_Maria appears to be a URL, it will not resolve to a webpage and the user will get a 404 error. In addition there are “blank nodes” in the format such as "N59254ff7a0fa44a6b36eb878ea08b7a0" that occur in the Web Annotation component of the data. These are generated in order to link each annotation/context to the string of turtle for the triples that is associated with that annotation, so that a system can make that link in the future. For many purposes, it may be best just to filter those out.
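As a minimal sketch of the filtering suggested above, the following drops any triple whose subject or object looks like one of these generated identifiers. It assumes triples are already available as plain string tuples and that the identifiers always follow the pattern of the quoted example ("N" plus 32 lowercase hexadecimal digits). The predicate names and the `ex:` prefix are placeholders.

```python
import re

# Assumed pattern, based only on the single example identifier quoted above.
BLANK_NODE = re.compile(r"^N[0-9a-f]{32}$")


def is_blank_node(term):
    """True if the term looks like a generated blank-node identifier."""
    return bool(BLANK_NODE.match(term))


def drop_blank_node_triples(triples):
    """Keep only triples whose subject and object are not blank nodes."""
    return [
        (s, p, o)
        for s, p, o in triples
        if not (is_blank_node(s) or is_blank_node(o))
    ]


# Hypothetical sample triples for illustration.
sample = [
    ("http://cwrc.ca/cwrcdata/Abdy_Maria", "ex:hasContext", "ex:context1"),
    ("ex:context1", "ex:hasTarget", "N59254ff7a0fa44a6b36eb878ea08b7a0"),
]
kept = drop_blank_node_triples(sample)
print(len(kept))  # 1
```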
NLP has been employed judiciously to complement the XML in cases where there was obvious benefit and we could expect a high level of accuracy. For instance, the high degree of overlap in vocabulary between terms used in the <CAUSEOFDEATH> tag and words occurring in the <HEALTH> tag allowed us to create assertions about health factors that were not supported by the markup.
Regularization was available in the dataset only for personal names and organizations. In other areas, such as for religions, it has been achieved through the ontology, wherein the skos:altLabel property indicates terms that have been grouped together.
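One way to use this grouping is to build a lookup from every label to the term it belongs to, so that variant strings resolve to one entity. The sketch below assumes label triples have already been read from the ontology into plain tuples. The `ex:Quaker` term and its labels are invented examples, not taken from the CWRC ontology.

```python
def build_label_index(label_triples):
    """Map each preferred or alternative label (lowercased) to its term."""
    index = {}
    for term, predicate, label in label_triples:
        if predicate in ("skos:prefLabel", "skos:altLabel"):
            index[label.lower()] = term
    return index


# Invented example; real terms and labels are defined in the CWRC ontology.
labels = [
    ("ex:Quaker", "skos:prefLabel", "Quaker"),
    ("ex:Quaker", "skos:altLabel", "Society of Friends"),
    ("ex:Quaker", "skos:altLabel", "Friends"),
]
index = build_label_index(labels)
print(index["society of friends"])  # ex:Quaker
```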
In the case of place names, the results of automated matching (of the combination of settlement name and its associated region and geo-political unit) have been reviewed for accuracy, so geospatial coordinates should be accurate for places down to the level of the settlement or populated place. We use Geonames for most place identifiers, supplemented by Getty placenames where needed. These matches are made with a view to making the data mappable. However, we recognize that historical shifts and political contestation may make the labels associated with particular past and present places problematic.
While it is highly desirable to convert strings of text to things, in the sense of defined, dereferenceable entities, this is not possible or indeed desirable in all cases, given the state of the source data. Where it has been possible to link to an external ontology or to regularize vocabulary within the data and create LOD instances of terms within the CWRC ontology, this dataset has done so. However, in some cases the data has been so idiosyncratic or heterogeneous that this has not been done. This is often the case for a few instances within a larger subset of instances that have been regularized, such as religion. In the case of education, there were many outliers that have not been created as linked data instances. There are thousands of occupations, which are at present represented as strings, but a subset of common occupations grouping terms together is forthcoming.
Events provide content with temporal and geospatial locators amenable to timelines or mapping. We draw on the Simple Event Model, extending it to indicate separately from date values themselves the degree of certainty associated with an event's date, and providing typing of events specific to our domain. Not all Contexts have events at present, although all will eventually: at present they are concentrated on Cultural Forms, Health, Violence, Wealth, Leisure and Other Life Events. Note that although birth and death events are not present in this dataset, there are birth and death dates for authors that can be used as a basis for modeling temporal change.
Contexts use the Web Annotation data model to provide links from particular entities to their discursive contexts. Annotations with an oa:identifying motivation indicate all entities mentioned in a particular discursive context, so one can group mentions of a certain person or persons by Context type. Annotations with an oa:describing motivation include as annotation bodies the writer who is being described.
Spatial data is often but not always (e.g. place of death) regularized to Geonames or the Getty Thesaurus of Geographic Names identifiers.
Bibliography objects use the BIBFRAME RDF schema supplemented by a couple of terms from schema.org.
This dataset is made available under a CC-BY-NC license. If you make use of this dataset, we would appreciate being informed of this at cwrc@ualberta.ca.
Contributors to the production of this dataset:
Jeffery Antoniuk. University of Alberta. Orlando Project programmer and systems analyst. Assisted in preparation and extraction of dataset.
Susan Brown. University of Guelph. Project lead. Produced extraction guidelines and guided process. https://www.uoguelph.ca/~sbrown/
Joel Cummings. Responsible for overseeing and contributing to technical work on the ontology and making key design decisions to support extraction from a variety of sources.
Jasmine Drudge-Willson. University of Guelph. Research Assistant. Responsible for researching and modeling structural and theoretical aspects of external ontologies, and how they relate to the aims of the CWRC ontology.
Hannah Stewart. University of Guelph. Research Assistant. Responsible for definition refinements within the CWRC/GENRE ontologies and supporting ontology modelling.
Abigel Lemak. University of Guelph. PhD student in Literary Studies. Overall project management, as well as drafting ontology terms and definitions, particularly those that deal with Cultural Forms.
Kim Martin. University of Guelph. Helped with project management. Produced sample triples for testing accuracy of extraction.
Alliyya Mo. University of Guelph. Wrote the extraction scripts for cultural form extraction and much of the rest of the biography data. Minted instance data for cultural forms, genres, religions, political affiliations. Responsible also for much ontology refinement reflecting the extraction process.
Michaela Rye. University of Guelph. Undergraduate research assistant. Responsible for tag cleanup as well as disambiguating and drafting ontology terms, particularly those that deal with Occupations.
Gurjap Singh. University of Guelph. Co-op student. Responsible for initial extraction of birth, death, and family data from Orlando data. Queried Geonames API to get URIs for locations in Orlando.
Thomas Smith. University of Guelph. Undergraduate research assistant. Responsible for tag cleanup as well as disambiguating and drafting ontology terms, particularly those that deal with geospatial data, educational awards, literary awards, and Occupations.
Deborah Stacey. University of Guelph. Associate professor in the School of Computer Science. Helped coordinate the process. Wrote scripts for extracting cause of death and health triples. http://ontology.socs.uoguelph.ca/
See also the Orlando Project credits and the CWRC Ontology Credits.
Various tools exist to handle and produce Linked Open Data (LOD) in its various formats.
CWRC uses an OWL RDF ontology schema, which means data is represented in triple format (subject, predicate, object). The RDF is then exported to one of several serialization formats, each of which expresses the same triples in its own syntax.
Several libraries are available for transforming and handling RDF, depending on your language of choice. Here are our recommendations:
Each of these libraries lets you load the RDF data and traverse it as a graph, which is useful for identifying links between data. More comprehensive libraries like Apache Jena also let you run queries and inference over the dataset from within your application.
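To make the idea of traversing triples as a graph concrete, here is a small standard-library sketch. It is not a substitute for a real RDF library: the line pattern is a simplified reading of the N-Triples format and will not handle every valid file. The sample data and `example.org` IRIs are invented.

```python
import re
from collections import defaultdict

# Simplified N-Triples line pattern: IRI or blank-node subject, IRI predicate,
# and everything up to the closing " ." as the object. Not a full parser.
NT_LINE = re.compile(r"^(<[^>]*>|_:\S+)\s+(<[^>]*>)\s+(.+?)\s*\.\s*$")


def parse_ntriples(text):
    """Parse simple N-Triples text into (subject, predicate, object) tuples."""
    triples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = NT_LINE.match(line)
        if match:
            triples.append(match.groups())
    return triples


def build_graph(triples):
    """Index outgoing edges by subject: subject -> [(predicate, object)]."""
    graph = defaultdict(list)
    for s, p, o in triples:
        graph[s].append((p, o))
    return graph


def reachable(graph, start):
    """Return all IRI nodes reachable from start by following object IRIs."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(o for _, o in graph.get(node, []) if o.startswith("<"))
    return seen


# Invented sample data.
SAMPLE = """
<http://example.org/a> <http://example.org/knows> <http://example.org/b> .
<http://example.org/b> <http://example.org/name> "B" .
"""
nt_triples = parse_ntriples(SAMPLE)
nodes = reachable(build_graph(nt_triples), "<http://example.org/a>")
print(sorted(nodes))
```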
A very full set of recommended resources on LOD has been gathered by the folks at Islandora's CLAW project, and can be found here.
We have also provided each ontology within a Graph so that it can be queried using the SPARQL endpoint. This requires knowledge of SPARQL but is powerful as an inspection and generation tool.
As an example, to get information about bibliography classes, we provide this query, which you can paste into the SPARQL box:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX bf: <http://id.loc.gov/ontologies/bibframe/>
SELECT * WHERE {
  GRAPH <http://sparql.cwrc.ca/db/BibliographyV1> {
    ?sub bf:place ?obj .
    ?obj rdf:value ?o .
  }
}
LIMIT 10
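The same query can also be sent programmatically. The sketch below uses only the Python standard library and the standard SPARQL protocol conventions (a `query` parameter and the SPARQL JSON results format). The endpoint address is a placeholder, since this document does not state the exact endpoint URL; substitute the real one before calling `run_query`. The sample result is invented to show the shape of the response.

```python
import json
import urllib.parse
import urllib.request

# Placeholder: replace with the actual SPARQL endpoint URL (an assumption here).
ENDPOINT = "http://example.org/sparql"

QUERY = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX bf: <http://id.loc.gov/ontologies/bibframe/>
SELECT * WHERE {
  GRAPH <http://sparql.cwrc.ca/db/BibliographyV1> {
    ?sub bf:place ?obj .
    ?obj rdf:value ?o .
  }
}
LIMIT 10
"""


def build_request_url(endpoint, query):
    """Encode the query as a GET request URL per the SPARQL protocol."""
    return endpoint + "?" + urllib.parse.urlencode({"query": query})


def rows_from_results(results):
    """Flatten a SPARQL JSON results document into a list of plain dicts."""
    variables = results["head"]["vars"]
    return [
        {var: binding[var]["value"] for var in variables if var in binding}
        for binding in results["results"]["bindings"]
    ]


def run_query(endpoint, query):
    """Send the query and return rows; requires network access."""
    request = urllib.request.Request(
        build_request_url(endpoint, query),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request) as response:
        return rows_from_results(json.load(response))


# Invented sample response, used here instead of a live request.
sample_results = {
    "head": {"vars": ["sub", "obj", "o"]},
    "results": {
        "bindings": [
            {
                "sub": {"type": "uri", "value": "http://example.org/work1"},
                "obj": {"type": "bnode", "value": "b0"},
                "o": {"type": "literal", "value": "London"},
            }
        ]
    },
}
rows = rows_from_results(sample_results)
print(rows)
```

Calling `run_query(ENDPOINT, QUERY)` with the real endpoint would return rows in the same shape as `rows` above.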