Text Dataset

This page describes the structure and fields of the text XML file. Please note that all rights in this dataset are reserved; it is provided here for non-commercial, personal use only.

Structure

The file has a single <root> element. Inside it, one <source> element holds the <creation_date> of the XML file and the <source_url> the file originated from. The <source> element is followed by multiple <record> elements, one for each record in the file. The fields in each <record> element are listed under Field Details below.
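
As an illustration of this layout, here is a minimal Python sketch that reads the <source> details and counts the <record> elements using the standard library's xml.etree.ElementTree. The sample document, its date format and its URL are invented for the example and are not taken from the real file; the function name summarise is also hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature file following the layout described above.
# The date format, URL and irn values are assumptions for illustration.
SAMPLE = """<root>
  <source>
    <creation_date>2024-01-01</creation_date>
    <source_url>https://example.org/text.xml</source_url>
  </source>
  <record><irn>1</irn></record>
  <record><irn>2</irn></record>
</root>"""


def summarise(xml_text):
    """Return the source details and the number of records in the file."""
    root = ET.fromstring(xml_text)
    source = root.find("source")
    return {
        "creation_date": source.findtext("creation_date"),
        "source_url": source.findtext("source_url"),
        "record_count": len(root.findall("record")),
    }


summary = summarise(SAMPLE)
print(summary)
```

For a file on disk, ET.parse(path).getroot() could replace ET.fromstring; for a very large file, ET.iterparse may be a better fit because it avoids loading the whole document at once.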

Field Details

<irn> - a unique numeric identifier used by the PrideNZ content management system
<pridenz_url> - URL of the PrideNZ page containing the source material
<atl_ref> - Alexander Turnbull Library reference (if item is deposited with ATL)
<atl_url> - URL of Alexander Turnbull Library record (if item is deposited with ATL)
<title> - title of material, contained within a CDATA section to accommodate special characters
<usage> - important rights and usage information
<voices> - voices identified in this media item. Names are delimited by a semicolon
<interviewer> - person(s) identified as the interviewer (if any)
<tags> - tags that have been manually associated with this media item. Tags are delimited by a semicolon, and contained within a CDATA section to accommodate special characters
<date> - date of production (sometimes this may be approximate)
<year> - year of production
<location> - location of production
<type> - can be: "computer generated text", "transcript" or "hansard"
<text_generation_date> - date the computer generated text was created
<text> - the actual content of the transcript, computer generated text or Hansard, contained within a CDATA section to accommodate special characters
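
To show how these fields might be read in practice, the following Python sketch parses a single <record> and splits the semicolon-delimited <voices> and <tags> fields into lists. The sample record and all of its values are invented, and the helper names split_list and parse_record are hypothetical. Only a subset of the fields is shown. A standard XML parser returns the content of a CDATA section as ordinary text, so no special handling is needed for <title>, <tags> or <text>.

```python
import xml.etree.ElementTree as ET

# Hypothetical record using a subset of the fields listed above.
# All values are invented for illustration.
SAMPLE_RECORD = """<record>
  <irn>1234</irn>
  <title><![CDATA[Example title & more]]></title>
  <voices>Person A; Person B</voices>
  <tags><![CDATA[history; rights & law]]></tags>
  <year>2020</year>
  <type>transcript</type>
  <text><![CDATA[Example <b>text</b> content.]]></text>
</record>"""


def split_list(value):
    """Split a semicolon-delimited field into trimmed, non-empty items."""
    if not value:
        return []
    return [item.strip() for item in value.split(";") if item.strip()]


def parse_record(record):
    """Convert one <record> element into a plain dictionary."""
    return {
        "irn": record.findtext("irn"),
        "title": record.findtext("title"),
        "voices": split_list(record.findtext("voices")),
        "tags": split_list(record.findtext("tags")),
        "year": record.findtext("year"),
        "type": record.findtext("type"),
        "text": record.findtext("text"),
    }


parsed = parse_record(ET.fromstring(SAMPLE_RECORD))
print(parsed["tags"])
```

Fields that are absent from a record come back as None (or as an empty list for the delimited fields), which suits optional fields such as <atl_ref> and <interviewer>.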