Nyingarn Architecture

The following information is provisional. Many of the assumptions below still need to be tested against the applications and services referenced, so the details are subject to change.

Nyingarn can be broadly viewed as an active working space and a repository for interaction, preservation and dissemination.

  1. The working space must (these are in no particular order):
    • be user friendly;
    • identify that community approval has been sought and received;
      • for consideration: how to handle the case where the user does not yet have approval;
    • be able to mint new items;
    • be able to load items from the repository for further action;
    • be able to support running operations on data in a modular fashion;
    • be able to push content to the repository for preservation and dissemination.
  2. The repository component must:
    • enable discovery of content by type, keyword and structured data elements (people, places, topics, languages, etc.);
    • be based upon OCFL, with RO-Crate as the description standard.
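
As an illustration of the RO-Crate requirement, the following is a minimal sketch of how an item's description could be assembled as an RO-Crate 1.1 metadata document. The function name, its parameters and the example values are hypothetical and are not part of Nyingarn; only the overall document shape (the context, the metadata descriptor and the root dataset) follows the RO-Crate 1.1 specification.

```python
import json


def build_ro_crate_metadata(name, description, languages=(), keywords=()):
    """Return a minimal RO-Crate 1.1 metadata document as a dict.

    Hypothetical helper: the choice of properties is an assumption,
    not the Nyingarn metadata profile.
    """
    # Root dataset: describes the item as a whole.
    root = {
        "@id": "./",
        "@type": "Dataset",
        "name": name,
        "description": description,
    }
    graph = [
        # Metadata descriptor required by RO-Crate 1.1.
        {
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        root,
    ]
    if keywords:
        root["keywords"] = list(keywords)
    if languages:
        # Each language becomes its own entity, referenced from the root.
        root["inLanguage"] = [{"@id": f"#language-{code}"} for code in languages]
        for code in languages:
            graph.append(
                {"@id": f"#language-{code}", "@type": "Language", "name": code}
            )
    return {"@context": "https://w3id.org/ro/crate/1.1/context", "@graph": graph}


def write_ro_crate_metadata(crate, path="ro-crate-metadata.json"):
    """Serialise the metadata document to disk as JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(crate, handle, indent=2)
```

Describing people, places, topics and languages as separate entities in the graph is what would let the repository index them as structured data rather than free text.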

A schematic of the architecture follows. It is still evolving and may not be strictly correct; the requirements above are authoritative.

Working space

The working space will consist of a custom-built application that guides the user through manuscript preparation. This application will help users OCR their images, mediate their use of From The Page for further transcription and correction, and help them preserve their data in the repository.

A sequence diagram illustrating the user flow in the working space follows. It is still evolving and may not be strictly correct; the requirements above are authoritative.

From The Page

A From The Page service has been commissioned and is available at: https://fromthepage.nyingarn.net.

Scriptable / Batch Processing

The working space will be designed to accommodate modular data processing steps, including image OCR and, potentially, entity and feature extraction using natural language processing tools.
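
One way to picture modular processing is as a list of independent steps applied to an item in turn. The sketch below is illustrative only: the Item class, the run_pipeline function and the two placeholder steps are hypothetical names and do not reflect the actual design of the working space.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable


@dataclass
class Item:
    """Hypothetical stand-in for a manuscript item in the working space."""
    identifier: str
    data: Dict[str, str] = field(default_factory=dict)


# A step takes an item and returns the (possibly updated) item.
Step = Callable[[Item], Item]


def run_pipeline(item: Item, steps: Iterable[Step]) -> Item:
    """Apply each processing step to the item, in order."""
    for step in steps:
        item = step(item)
    return item


def placeholder_ocr(item: Item) -> Item:
    """Stand-in for an OCR step; a real step would call an OCR engine."""
    item.data["text"] = "placeholder transcription"
    return item


def count_words(item: Item) -> Item:
    """Stand-in for a later analysis step that depends on the OCR text."""
    item.data["word_count"] = str(len(item.data.get("text", "").split()))
    return item
```

Because each step shares the same signature, steps such as entity extraction could be added, removed or reordered without changing the code that runs them.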

  • A tool to batch process images using Tesseract OCR is available from https://github.com/CoEDL/nyingarn-tesseract-processor. It is designed to be used standalone by more technical users to ease the process of performing OCR on a batch of images. This capability will be integrated into the working space. The tool produces:
    • a zip file containing the images and, for each image, the OCR data in a text file named after the image;
    • hOCR, ALTO and box outputs.
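
To make the zip output concrete, here is a minimal sketch of packaging a folder of images together with one text file per image. It is not the implementation of nyingarn-tesseract-processor: the function names, the list of image suffixes and the archive layout are assumptions, and the default OCR function assumes the tesseract command-line program is installed and on the PATH.

```python
import subprocess
import zipfile
from pathlib import Path

# Assumed set of image types; the real tool may accept others.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def tesseract_text(image: Path) -> str:
    """Run the tesseract CLI on one image and return the recognised text."""
    result = subprocess.run(
        ["tesseract", str(image), "stdout"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def package_batch(image_dir, zip_path, ocr=tesseract_text):
    """Zip every image in image_dir alongside a .txt file named after it.

    The ocr argument is any callable mapping an image path to text, so a
    different engine (or a stub) can be swapped in.
    """
    image_dir = Path(image_dir)
    images = sorted(
        p for p in image_dir.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES
    )
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for image in images:
            archive.write(image, arcname=image.name)
            # page1.png -> page1.txt
            archive.writestr(image.with_suffix(".txt").name, ocr(image))
    return [p.name for p in images]
```

Passing the OCR function in as a parameter keeps the packaging logic independent of the OCR engine, which fits the modular approach described above.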

Working storage

To remain compatible with the AWS public cloud and to simplify scaling up the deployment, storage will be designed around an S3 API-compatible object storage service provided by MinIO (https://min.io/).
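
The practical benefit of an S3-compatible API is that the same client code can target MinIO or AWS by changing only its configuration. The following is a sketch under stated assumptions: the environment variable names (S3_ENDPOINT_URL and so on) and function names are hypothetical, and the third-party boto3 library is assumed to be the client.

```python
import os


def s3_client_settings(env=os.environ):
    """Build keyword arguments for boto3.client("s3", ...).

    The variable names are illustrative. Leaving S3_ENDPOINT_URL unset
    yields None, which makes boto3 use the default AWS endpoint.
    """
    return {
        "endpoint_url": env.get("S3_ENDPOINT_URL"),
        "aws_access_key_id": env["S3_ACCESS_KEY"],
        "aws_secret_access_key": env["S3_SECRET_KEY"],
        "region_name": env.get("S3_REGION", "us-east-1"),
    }


def make_s3_client(env=os.environ):
    """Create an S3 client for MinIO or AWS, depending on configuration."""
    import boto3  # third-party; imported lazily so the settings helper has no dependencies

    return boto3.client("s3", **s3_client_settings(env))
```

For a local MinIO instance, S3_ENDPOINT_URL might be set to something like http://localhost:9000; for AWS it would simply be left unset.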

Repository

not yet written

Code, Licensing, Principles

All code will be released under the GPLv3 license and will be available from either the CoEDL GitHub organisation or the ArkistoPlatform GitHub organisation.

As much as possible, existing tools and services will be integrated, with development focused on providing a coherent user experience around those services. Tools and processes will also be modular and containerised wherever possible, in order to support reproducible deployment and ease of use by others.