Documentation
Data Acquisition and Data Management
Gwenlyn Tiedemann and Dr. Anne Luther

Date of document publication: 2022-11-07

Data Acquisition and Documentation

The data-acquisition process was a core part of the project. Identifying institutions, finding objects that fit the scope of the project, building partnerships with institutions and managing the data transfer were only part of its complexity. This documentation provides an outline of the processes established by the Digital Benin team to bring together the data shown on the platform.

Rather than working with ready-made project-management software, we custom-built several bases for the specific processes of this project. Data in the cultural-institution context is specific to the field, and the transfer and acquisition processes for Digital Benin therefore took into account the social dimensions in institutions (e.g., staff, time and data production), the technical dimension (e.g., diverse software, export formats and transfer methods) and the wider context of the request (e.g., restitution debates, publishing data online, licensing). The tight timeline for the project was strongly impacted by the pandemic, with institutional shutdowns, furloughs and position changes three months into the project. We dealt with these influences by reacting flexibly to changes and unexpected progress. Before starting the data acquisition, we built the following progress-tracking and documentation platforms:

  • Project Tracker – Gives an overview of milestones, deliverables and a project calendar; it can be accessed by everyone and is managed by the team catalyst.
  • Data Processing – Includes overviews that communicate the data acquisition progress to the team and are synchronised for the research and development process, which was also documented in these overviews in full detail.
  • Data Acquisition Documentation – Tracks data acquisition with details about contacts, email content, data formats and dates of data received.

All three platforms grew with the project’s progress and development on several levels. The core project approach is that data acquisition leads to the research required for the database development of the digital platform. The aim of the acquisition was to receive data pertaining to Benin objects (see parameters) that our partner institutions produce and have historically produced and stored in their databases or published online. The outcome of the data acquisition is a repository containing the received data that is used for the research and development of the database for the digital platform. It was important for the project to carefully consider the data that we asked institutions to provide. It was already ambitious to receive the data outlined below; asking for data outside these parameters would have led to significant delays that the project could not afford in the two-year scope.

Prior to establishing contact with the institutions, we defined key terms within our project team for clear communication with the institutions:

  • Field titles: Metadata that describes what information (value) we find in a field.
  • Preliminary Data: Information about objects, images and any data that can be shared promptly.
  • Finalised Data: Data that institutions know they can share with us in the course of our project; often this included provenance data, images of objects that had missing or historical photographs, exhibition histories, and so forth.
  • Surplus Data: Photographic archives, research material, archival materials.

With these definitions, we set out the following parameters for the scope of the data acquisition:

We are looking primarily for objects created for or used within Benin Kingdom up to 1897, including plaques, commemorative heads, carved ivory tusks, figures, shrine objects like bells, pendants, staffs and so forth. As this is not always a clearly defined corpus, we are also considering any objects related to Benin that were created and/or circulated from the African continent to Europe or America before 1930.

Indeed, we would like to consider for inclusion any artworks or objects that tell relevant stories, add crucial or insightful information to the core object set or reveal a broader picture, such as artworks and information that represent the colonial period of present-day Nigeria, the raid on Benin Kingdom and its aftermath:

  • Associated archival and/or historical materials such as photographs, postcards, newspapers, field notes, and so forth, particularly within the same aforementioned time period.
  • Bini-Portuguese ivories/objects (though some were made by Owo carvers) and related arts.
  • Related Udo-style artworks and non-urban art forms, such as masquerades, that may give us a broader picture of the time period, migrations, and performances within the city.

Our plan for the data collection was to receive the preliminary or complete data by the end of October 2021, one year after the start date of the project. Any additional media files, archival material or new photographs could be added until 1 May 2022. This schedule was made to be flexible since communication with some institutions needed more time, with the result that we received data from some institutions as late as August 2022.

For the data acquisition, we defined the resources and tools required prior to communicating with the institutions: an email client; data transfer services (e.g., WeTransfer); team communication, organisation and management software; the command line; web and desktop tools capable of accessing API endpoints and programmatically retrieving their payloads (e.g., Postman, Google Chrome, Python, JavaScript); and version control and documentation tools for programmatic, versioned storage.
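
As an illustration of the programmatic retrieval, the following is a minimal Python sketch for querying a public API and storing its payloads unchanged; the endpoint URL and object IDs are placeholders rather than those of any partner institution.

    import json
    from pathlib import Path

    import requests

    # Placeholder endpoint and object IDs -- the real queries depended on the
    # object IDs communicated by each institution's curators.
    API_ENDPOINT = "https://example.org/api/objects/{object_id}"
    OBJECT_IDS = ["1897-123-4", "1897-123-5"]

    # Store each payload unchanged so the institution's own structure is preserved.
    out_dir = Path("raw_data/example_institution/public_api")
    out_dir.mkdir(parents=True, exist_ok=True)

    for object_id in OBJECT_IDS:
        response = requests.get(API_ENDPOINT.format(object_id=object_id), timeout=30)
        response.raise_for_status()
        out_file = out_dir / f"{object_id}.json"
        out_file.write_text(json.dumps(response.json(), indent=2, ensure_ascii=False))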

Once these elements of the data transfer were understood internally, the following steps were taken one month into the project:

  • Introduction: Contact, introduction and conversation with the institutional partners, including sharing the press release, website, the official invitation to participate in the project and a letter of intent for the institutions.
  • Conversation: Introduction of the project, building partnerships, and asking for data that institutions generated about Benin objects (outlined in parameters above). With this conversation, staff was identified for curatorial and content questions and the transfer of data.
  • Preliminary Data Collection: Contact with the identified staff in the institutions to initiate data transfer (i.e., data export, email, internal API, public API).
  • Data Review: The review process starts (see Object Research & Reviews).
  • Data Sharing Agreement: Institutions that needed a data sharing agreement received an appendix with identified fields and media. The agreement was sent to the institutions with a clear deadline for receiving the data. It was signed by both the MARKK and the institution.
  • Data Transfer: Data was received for objects that were reviewed and communicated as falling within the scope of the project. Often the preliminary data received was the data we were already working with, unless institutions added fields or we asked for missing data. In the case of a public API from the institution, we were in close contact with the curators of the institutions, who gave us object IDs for the queries. In the rare case of internal API access, we were in contact with the institution’s technical staff to collect data.
  • Data Stored, Documented and Organised: Data acquisition progress was communicated to the team with the outlined internal software. It was necessary for acquired data to be stored immediately and programmatically and categorised clearly. Time was dedicated to ensuring that any confusion, questions or requests the team had about the data or data acquisition process were prioritised.

The development of the database and schema for interaction with the data on the digital platform was dependent on the data acquisition. In order to connect the accumulated data and build a tool that allows comparison, interaction and cataloguing, a connection layer for metadata fields was built that can identify cataloguing patterns within and across collections. The development of the metadata linking relied on insights gained by studying these patterns, drawing both on the conversations with the institutions and on the received data itself.

Data Processing Documentation

The backbone of the entire project is the objects’ metadata. The task of the data engineer was to process all the data received from the institutions to make it usable and accessible for the team’s research as well as the platform publication. Gwenlyn Tiedemann joined the project mid-2021 and took over the data processing after a change in the team at the beginning of the project. During the period of this team change, which cost the project several months, Krystelle Denis and Alex Horak, the technical lead and development team of the platform, respectively, had to take on this additional complex task alongside their planned duties. They made an exemplary extra effort to ensure that this project would not be delayed.

At the beginning of the project, the team defined the basic premises and the project-internal recording and storage of data for further use. These are explained here, followed by the problem-solving processes prompted by new data coming into the project and the flexibility required to hold on to the basic premises outlined at the beginning. Multiple agile adjustments were necessary in the data processing.

The documentation on data acquisition outlines the ways Dr. Anne Luther corresponded with the institutions to obtain the object-related data. In addition, data engineer Gwenlyn Tiedemann was in direct contact with the institutions when complex technical issues arose. The data acquired from the institutions was divided into four categories: object metadata, object images, research material and surplus/archival material. The research material was given directly to the project researchers; the other data categories went into the data pipeline the data engineer built to process them.

In addition to the data received directly from the institutions, data was obtained from public or internal APIs or retrieved from the collections available online with consent from the partnering institutions. We distinguish here between internal data – sent directly by the institutions, even if partly as API exports – and public data – retrieved by Digital Benin from openly accessible collections or through public APIs. The basic premise was that internal data took precedence over public data when the internal data provided more information. This meant we showed internal data to avoid repetition and only added multiple datasets to the platform for an institution when the internal and publicly available datasets differed significantly in the information they provided about the object records.
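
The sketch below illustrates this precedence rule as a simple heuristic; in practice the comparison was made case by case by the team rather than automatically, and the function names and threshold shown here are purely hypothetical.

    # A simplified heuristic for the precedence rule, assuming both versions of a
    # record are available as dictionaries. In the project the comparison was a
    # curatorial judgement, not an automatic one.

    def filled_fields(record: dict) -> int:
        """Count fields that actually carry information."""
        return sum(1 for value in record.values() if value not in (None, "", [], {}))

    def choose_source(internal: dict, public: dict, review_threshold: int = 5) -> str:
        """Default to the internal record; flag the object for manual review when
        the public record carries substantially more information."""
        if filled_fields(public) - filled_fields(internal) >= review_threshold:
            return "review"    # candidate for showing both datasets on the platform
        return "internal"      # otherwise show internal data only, avoiding repetition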

Cultural data is complex and multilayered, which is also reflected in the metadata received. The data is in many different formats and structures, which will be discussed further below. Fundamentally, the recording of the data and its structuring can provide information about how the institutions perceive their objects – and how much information they are willing to share about them – and show the historical understanding (Turner, 2020) and technical possibilities to record object documentation locally in each institution (see Loukissas, 2019, p.15). This resulted in the most important basic premise: data structures and field titles were not changed, in order to preserve each institution’s logic of collection cataloguing. This created a challenge in regard to the diverse data types, structures and complexity of the metadata – namely, it made data comparability more complex, especially in the project’s internal metadata linking (see Digital Benin Project Introduction and Catalogue) as well as in the use of data in the project team’s research. The complexity of the data diversity became the biggest challenge in the data processing and provided the opportunity to create a new approach in publishing and researching data from multiple sources in the museum context.

Metadata Processing

One hundred and thirty-one institutions are included on the platform, and since we obtained both public and internal data from some, we are showing 153 datasets in total. Over the course of the data acquisition process, however, we received data from a total of 138 institutions and 417 datasets as individual files with metadata in a wide variety of formats and structures. An additional 178 datasets were publicly retrieved. The file formats ranged from Word documents and PDFs to Excel spreadsheets and XML and JSON files, and sometimes just email content. This sum and variation of datasets results from the multiple changes that the institutions made to their data over the course of the two-year project by submitting new files or by sending emails and notes with additional or revised information. This meant a lot of changes, additions, and the merging or removing of datasets or object records. Furthermore, the project internally decided to remove personal information such as staff names or donor addresses from the data, as well as insurance values and specific storage locations of the objects, for legal and security reasons and to avoid harm.
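
As a hedged illustration of this redaction step, the following sketch drops columns whose titles suggest such information; the patterns shown are hypothetical, since in practice the relevant field titles varied per institution and were identified manually.

    import pandas as pd

    # Hypothetical patterns: the relevant field titles varied per institution and
    # were identified manually during processing.
    SENSITIVE_PATTERNS = ("staff", "donor", "insurance", "storage location")

    def redact(table: pd.DataFrame) -> pd.DataFrame:
        """Drop columns whose titles suggest personal, financial or
        security-relevant information before the data enters the repository."""
        to_drop = [column for column in table.columns
                   if any(pattern in column.lower() for pattern in SENSITIVE_PATTERNS)]
        return table.drop(columns=to_drop)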

The main task was to bring this heterogeneous dataset to a comparative level. At the beginning of the project, it was decided that the metadata would be stored in a simple table structure, with only one table created per dataset. The table representation enabled low-threshold data accessibility and usability for the team and a fast turnaround in processing within the limited time available. The premise was that the project’s research would develop inductively from the data, so creating access to the data came first. These tables were then made available to the b/t development team via the project-internal software tool for further use and research on the emerging platform. The challenge of this approach became apparent with the multi-tiered and complex data layers in some datasets: we had to find a way to display the multi-layered data structures and make them quickly accessible without compromising them. A solution was found in autumn 2021. By parsing the individual data paths within the structures and embedding them in the table headers, it was possible to preserve the structures even though the information was presented in a flat table. When pulling the data for the platform, b/t can correctly represent it as multi-layered by converting it back into a JSON structure. Owing to the time lost to the unforeseeable team changes at the beginning of the project, the developers and the project catalyst decided to display the relational data structures that we received from some institutions in a flat manner. The data structures obtained through this method allow further development in the representation of multi-layered data.
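
To illustrate the path-based approach, the following is a minimal Python sketch of how a nested record might be flattened into path-titled columns and later converted back; the separator and the handling of lists are assumptions of this sketch, not the project’s exact implementation.

    # Flattening a nested record into path-titled columns and restoring it later.

    def flatten(record, parent_key="", sep="."):
        """Turn a nested structure into a flat dict keyed by data paths, e.g.
        {"production": {"place": "Benin City"}} -> {"production.place": "Benin City"}."""
        items = {}
        if isinstance(record, dict):
            for key, value in record.items():
                new_key = f"{parent_key}{sep}{key}" if parent_key else key
                items.update(flatten(value, new_key, sep))
        elif isinstance(record, list):
            for index, value in enumerate(record):
                items.update(flatten(value, f"{parent_key}{sep}{index}", sep))
        else:
            items[parent_key] = record
        return items

    def unflatten(flat, sep="."):
        """Rebuild the nested structure from the path-based column headers.
        (List indices are kept as string keys in this simplified version.)"""
        nested = {}
        for path, value in flat.items():
            keys = path.split(sep)
            target = nested
            for key in keys[:-1]:
                target = target.setdefault(key, {})
            target[keys[-1]] = value
        return nested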

Only rarely was the data received from the institutions usable without further processing. We did not require the data to be provided in Excel spreadsheets, because we wanted to receive it with a fast turnaround and asked institutions for the most convenient export option on their side. Common database software such as TMS makes it difficult for institutions to export data as a spreadsheet. Institutions were therefore encouraged to send the data any way possible – the essential point being that they provide us with metadata as quickly as possible in view of the short project duration.

Processing the data required it to be machine-readable, which meant Word documents and PDFs had to be transcribed and processed for the development of the platform. One example of a major processing challenge was when an institution exported multiple datasets relating to the same objects from different parts of its internal system. These datasets had to be connected and merged so that we could display information for one record drawn from multiple exports. This was particularly difficult when different file formats or structures were submitted, as these first had to be brought into one format and to one level. Some institutions transferred data as multiple Excel spreadsheets; after consulting their technical staff, we found out that this had originally been multi-layered data. In these cases the data arrived in several tables, which first had to be connected and merged on a multi-layered level on our side before being converted into the path-based table structure described above.
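
As a simplified illustration of this merging step, the sketch below joins two hypothetical exports from one institution on a shared object identifier using pandas; the file and column names are placeholders, not those of any actual export.

    import pandas as pd

    # Hypothetical file and column names; in practice each export first had to be
    # made machine-readable and brought to the same level.
    core = pd.read_excel("export_objects.xlsx")           # e.g. titles, materials, dates
    provenance = pd.read_excel("export_provenance.xlsx")  # e.g. provenance entries

    # Join the exports on the shared object identifier so that one record can show
    # information drawn from several exports; field titles remain unchanged.
    merged = core.merge(provenance, on="object ID", how="outer")
    merged.to_csv("institution_merged.csv", index=False)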

An advantage of receiving several data tables per institution was the possibility of identifying errors or missing objects in the data by comparing the tables with one another as well as with the data accessible online. As mentioned earlier, for some institutions both internal and public data needed to be compared. Thus, alongside preparing and processing the data to make it available for research and the platform, researching and examining the data was also a big part of the data engineer’s work. Several years of experience in working with and processing complex, heterogeneous cultural datasets, in creating databases and data structures, and as a researcher in art history were also important for identifying errors, understanding the complex data and entering into conversations and knowledge exchanges with the institutions.

It was above all the experience with databases and their structures that made the most laborious work on the data possible at all. At times, it was an immense amount of work to process exports from large historical databases and remove so-called metadata clutter that did not relate to the objects’ documentation. In these complex cases, the data could only be transformed into the table structure after an intensive investigation of the data structures to make them comprehensible.

Object-Image Mapping

The object images were an especially important component of the object review. Most of the images were sent at a late stage or remain incomplete. Some objects were only photographed at the project’s request.

The images needed to be mapped to objects because they were sent separately from the datasets – in some instances from an entirely different institutional database. For this object-image mapping, a matching point is necessary. This was usually already chosen by the institutions in the form of naming the image files after the corresponding object ID. For example, fields such as accession number, PRN, ID or object no. all referred to the object ID and were consequently linked to the Digital Benin-authored metadata field ‘object ID’. However, object IDs are not recorded in a standardised way: when they contained dots or special characters, or when special characters were omitted or replaced in the image filename by the institution, difficulties arose in the image-mapping process, preventing direct matching with the object ID from the dataset. To prevent this conflict, a glossary of regex rules was created. The image files were renamed based on these rules, and the object IDs were subject to them during mapping. Among other things, dots were replaced by hyphens and spaces by underscores. Through these rules, the filenames and IDs were standardised so that no conflicts could occur during matching, without distorting the identifier significantly. This meant that all image filenames had to be edited, since errors could also occur in the naming on the institutional side which only became apparent during machine processing. Even one space too many in a filename creates problems when trying to match images with object IDs.
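
The sketch below condenses this idea into a small, hypothetical rule set in Python; the actual glossary was more extensive and grew with the filename and ID variants encountered across institutions, and the catch-all rule here is an assumption.

    import re

    # A condensed, hypothetical version of the regex glossary.
    NORMALISATION_RULES = [
        (re.compile(r"\."), "-"),      # dots were replaced by hyphens
        (re.compile(r"\s+"), "_"),     # spaces (including accidental doubles) by underscores
        (re.compile(r"[^\w\-]"), ""),  # assumed catch-all: drop remaining special characters
    ]

    def normalise(identifier: str) -> str:
        """Apply the same rules to object IDs and image filename stems so that both
        sides of the mapping arrive at an identical, comparable form."""
        result = identifier.strip()
        for pattern, replacement in NORMALISATION_RULES:
            result = pattern.sub(replacement, result)
        return result

    # Hypothetical example: the object ID '1897.123 a' and the filename stem
    # '1897-123_a' both normalise to '1897-123_a' and can therefore be matched.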

Moreover, in some cases, images submitted later were named after an ID from the dataset that differed from the one that had been sent to us earlier and used to map the previous images. Manual inspection of the image files and assignments was necessary not only in these cases, but also in cases where the image files were named not after an ID from the dataset but perhaps after their own individual IDs or the designation given by the camera that took the image. In some cases, institutions assigned names such as IMG_1 or IMG_2 to the images, which made it impossible for us to match the images with the object data provided without an indicator in the dataset. A back-and-forth with the institutions was necessary to determine which image belonged to which object. On other occasions, and only when the institutions transferred a small number of images or assignable objects, it was possible to match them via differentiated searches in the metadata and to manually rename or assign them. In some instances, it was also necessary to double-check the object’s assignment. With large quantities of images, however, manual assignment and renaming was not possible or would have exceeded the time frame, and so was only undertaken in concrete cases of mismatching. In these cases, the image metadata was helpful: some institutions sent metadata about the object images that assigned image filenames to objects. This, again, meant connecting different metadata and adding the image names to the object metadata. Sometimes the image names were already in the object metadata, but often they had to be supplemented for images sent subsequently.

Since objects do not always have only one image, ideally the institutions would provide the basic object-image mapping identifiers and mark the order of the images. When the latter did not occur, it was necessary to add sequence naming to the image files or to set the sequence of the image filenames in the object metadata. This required manually going through each image file to set or adjust the order. Unfortunately, this was not possible in all cases – for instance, in large image sets where the images belonging to an object had no recognisable or sortable name structure, or where the image names had already been included in the object metadata by the institution without specifying an order. Apart from these few exceptions, in which adjusting the image order was not possible within the available time, the order was set as follows: first came an overall representation of the object; second, representations from several perspectives; third, detailed photographs; and lastly, other historical photographs, exhibition photographs and catalogue cards.

Surplus/Archival Data

In addition to the metadata and images, we received what we initially called ‘surplus’ data, which consisted of a variety of documents, such as articles, archival files, correspondence, exhibition photographs, historical photographs, catalogue cards and inventory books. Over fifty institutions sent additional files, which we viewed and attempted to organise in order to filter relevant data, among other things to populate the archive and to share content relevant to the project research with the team. The articles were sorted as freely accessible files or stored and referenced in Zotero. It was decided not to put the historical photographs on the platform owing to the lack of time to review them: the violent and harmful content shown in the photographs requires extensive consultation before they can be displayed publicly online. The catalogue cards and exhibition photographs were added at the object level, since some institutions had already done this themselves and, in some cases, because these images were the only ones available that showed the object. This also allowed more immediate access to the object-related information. Some reports, wall labels and catalogue texts were also included in the objects’ metadata. The remaining files, including archival files, correspondence, inventory books, illustrations, reports of various kinds and metadata of archival materials, were brought into a Digital Benin archive, which is currently being reviewed and has become a focus of the project’s extension. The archive is only preliminary work for the following extension project, in which the archival materials relevant to historical Benin object research will be further explored.

To summarise, Digital Benin shows the current state of data structures, standards and vocabulary across 131 institutions in twenty countries. Because the data collected is so diverse, it is impossible to have a universal solution in data processing without reducing or deleting data. In each case, solutions for finding, processing, merging and transforming data were necessary to preserve and respect the data transferred by each institution. Since each institution had to be examined individually, a processing scheme could not be applied universally. Without close collaboration within the team – especially between the technical lead, data processing and research teams in the elaboration and execution of the workflows and data pipelines – the platform could not have represented the data according to the premises outlined above.