Converting transcription text files for Nyingarn

Existing transcription text files need to be converted to TEI XML format before you upload them to the Nyingarn Workspace. A number of tools can do this, including https://pandoc.org/ and https://transpect.github.io/. Please reach out to our team if you need any guidance.

Transcriptions prepared in Microsoft Word or similar may contain headings and page breaks that obstruct the conversion process. These unnecessary elements need to be removed, and the page numbers styled so that the code in the Workspace can recognise them. The instructions below walk through converting a Word document to TEI for Nyingarn ingestion using TEIgarage, an online conversion service. You can also watch our tutorial video.

Before starting, decide on the manuscript item name. In Nyingarn, an item name must be a unique identifier and can contain letters, numbers, and underscores. Each page name is the item name followed by a hyphen and a sequence number, for example Bates34-001, Bates34-002, Bates34-003. For more naming instructions, see our support page.
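If you want to prepare the list of page names in advance, the naming rule can be sketched in a few lines of Python. This is only an illustration: the function name page_names is hypothetical, and the three-digit zero padding is an assumption taken from the Bates34-001 example rather than a documented Workspace rule.

```python
import re

# Assumption: an item name contains only letters, numbers and underscores.
ITEM_NAME = re.compile(r"[A-Za-z0-9_]+")

def page_names(item_name: str, page_count: int) -> list[str]:
    """Return page names such as Bates34-001 for pages 1..page_count."""
    if not ITEM_NAME.fullmatch(item_name):
        raise ValueError(f"invalid item name: {item_name!r}")
    # Assumption: sequence numbers are zero-padded to three digits.
    return [f"{item_name}-{n:03d}" for n in range(1, page_count + 1)]

print(page_names("Bates34", 3))
# ['Bates34-001', 'Bates34-002', 'Bates34-003']
```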

Naming your pages in sequence

Open the Microsoft Word document (.docx). Name each page of your transcription with the Nyingarn item name and sequence number. For example, on the first page of the transcription, type Bates34-001 at the top or bottom of the page. Each page of the transcription should correspond to the same page of the original manuscript (here, page 1).
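For long transcriptions it is easy to skip a number. The sketch below shows one way to check for gaps, assuming you have copied the document text into a Python string (for example from a plain-text copy of the transcription). The function name missing_pages is hypothetical, and the three-digit sequence number is an assumption based on the example above.

```python
import re

def missing_pages(text: str, item_name: str) -> list[int]:
    """Return sequence numbers absent between 1 and the highest number found."""
    # Match the item name, a hyphen, then exactly three digits.
    pattern = re.compile(re.escape(item_name) + r"-(\d{3})\b")
    found = {int(m.group(1)) for m in pattern.finditer(text)}
    if not found:
        return []
    return [n for n in range(1, max(found) + 1) if n not in found]

sample = "Bates34-001 first page Bates34-002 second page Bates34-004 fourth page"
print(missing_pages(sample, "Bates34"))
# [3]
```

An empty list means every number from 1 up to the highest page name found is present.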

Create a Style for Page Numbers

Step 1 Click Styles in the Home toolbar.
Step 2 Click the A+ button at the bottom of the Styles window to create a new style.


Step 3 Name the style Page.
Step 4 For Style type, choose Character from the dropdown menu.
Step 5 Choose a style colour. A colour other than black will help you recognise the change in your document.

Page is now a standard style in your documents.

Apply Page style to every page name (e.g. Bates34-001) in the document.

Step 6 The Find and Replace function is helpful for bulk changes. Type Bates34-??? into the Find what: box. Each ? is a wildcard that matches any single character, so ??? matches the three-digit sequence number and finds the entire page name.
Note: Make sure Use wildcards is ticked.
Step 7 Click the cursor in the Replace with: field, then click More to expand the Find and Replace options if they are not already displayed.
Step 8 Next click the Format button.
Step 9 Click the Style option and choose the style ‘Page’. Then click Replace All to apply the style to every page name.
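To see what a pattern like Bates34-??? will and will not match, you can try it outside Word. The sketch below uses Python's fnmatch module, which is not Word's search engine but happens to share the meaning of ? (any single character). The helper name matches is hypothetical.

```python
from fnmatch import fnmatchcase

def matches(candidate: str, pattern: str = "Bates34-???") -> bool:
    """True if the whole candidate fits the pattern; ? stands for any one character."""
    return fnmatchcase(candidate, pattern)

for candidate in ["Bates34-001", "Bates34-12", "Bates34-abc", "Bates35-001"]:
    print(candidate, matches(candidate))
# Bates34-001 True
# Bates34-12 False
# Bates34-abc True
# Bates35-001 False
```

Note that ? is not limited to digits, so text such as Bates34-abc would also be found. It is worth checking the results before replacing everything.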

Now that the page names are correct and styled with the ‘Page’ style, remove the page breaks.

Remove Page Breaks

Step 10 Open Find and Replace, click in the Find what: field, then click ‘Special’.
Step 11 Choose ‘Manual Page Break’. This adds the symbol ^m to the Find what: field (see screenshot below). Leave the Replace with: field blank, then click Replace All to delete the page breaks.
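If you would like to confirm that no manual page breaks remain after saving, a .docx file can be inspected directly, because it is a zip archive whose main text lives in word/document.xml. The sketch below counts manual page breaks, which Word stores as w:br elements with w:type="page". The function name count_page_breaks is hypothetical, and the code assumes the usual w: namespace prefix that Word writes.

```python
import re
import zipfile

# A manual page break appears in word/document.xml as <w:br w:type="page"/>.
PAGE_BREAK = re.compile(r'<w:br\b[^>]*w:type="page"[^>]*/>')

def count_page_breaks(docx_file) -> int:
    """Count manual page breaks in a .docx (a file path or a file-like object)."""
    with zipfile.ZipFile(docx_file) as archive:
        xml = archive.read("word/document.xml").decode("utf-8")
    return len(PAGE_BREAK.findall(xml))

# Example call (file name is illustrative):
# count_page_breaks("Bates34-tei.docx")
```

A result of 0 suggests all manual page breaks were removed. Other kinds of break, such as section breaks, are not counted by this sketch.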

Final Steps – Save and convert the document

Step 12 Save the Microsoft Word .docx file using the naming convention for the Nyingarn Workspace e.g. Bates34-tei.docx.
Step 13 Convert the .docx file to TEI XML using TEIgarage: https://teigarage.tei-c.org/#. TEIgarage will ask you to select the type of document you want to convert.
Step 14 Choose Documents. Choose Convert from: Microsoft (.docx), Convert to: TEI P5 XML Document.
Step 15 The next window will ask you to select the file for conversion. Click the Choose File button to browse, find, and upload your file.
Step 16 The converted file downloads automatically, ready to be ingested into the Nyingarn Workspace. It should be named, for example, Bates34-tei.xml.
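Before uploading, you may want a quick sanity check of the downloaded file. The sketch below, using only Python's standard library, confirms that the XML is well-formed, that its root element is TEI in the TEI namespace, and lists the page names found in the text. The function name check_tei is hypothetical, and the sketch makes no assumption about how the ‘Page’ style is encoded in the output; it only searches the text content.

```python
import re
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

def check_tei(xml_data, item_name: str) -> list[str]:
    """Parse TEI XML and return the page names found in its text, in document order."""
    root = ET.fromstring(xml_data)  # raises ParseError if the XML is not well-formed
    if root.tag != f"{{{TEI_NS}}}TEI":
        raise ValueError(f"unexpected root element: {root.tag}")
    text = "".join(root.itertext())
    # Assumption: page names end in a three-digit sequence number.
    return re.findall(re.escape(item_name) + r"-\d{3}", text)

sample = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    "<p>Bates34-001 first page</p><p>Bates34-002 second page</p>"
    "</body></text></TEI>"
)
print(check_tei(sample, "Bates34"))
# ['Bates34-001', 'Bates34-002']
```

For a real file, read its bytes first, for example with pathlib.Path("Bates34-tei.xml").read_bytes(), and pass them to check_tei.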