§1. Introduction
§1.1. This publication summarises the results of the discussions at the Initiative for Digital Cuneiform Studies (IDCS) workshop "Von analog zu digital. Konzeptionen der Keilschriftforschung im 21. Jahrhundert am Beispiel administrativer Urkunden", held at the Johannes Gutenberg-Universität Mainz on February 26 and 27, 2021. The workshop was initiated to discuss future perspectives of young researchers in Assyriology, computational linguistics, Digital Humanities and Computer Science who would like to work with digital cuneiform artefacts. It centred especially on the question of how the transformation of a traditional edition of cuneiform texts into a digital scholarly edition can be supported. This article presents an elaborated version of the results of the discussion sessions and will, wherever appropriate, refer to the contributions made at the workshop, which can be found on the workshop's blog page[1].
§1.2. As in many other fields of the humanities, the digitalisation of Assyriology is becoming increasingly important, and more and more digital cuneiform research projects are emerging (see for example the CDLI[2], ORACC[3], MOCCI[4], HPM[5], ETCSL[6], Archibab[7], MTAAC[8], ANEE[9], Achemenet[10], et cetera)[11]. However, the authors' personal experience has shown that for young researchers without any prior knowledge of the Digital Humanities, it can be challenging to find an entry point into Digital Assyriology. Equally, for computer scientists interested in working with cuneiform artefacts, it is necessary to understand the complexity of the material and the workflow of Assyriologists. To fully exploit the potential of digitalisation, with advantages for both Assyriology and Computer Science, close cooperation seems necessary (Maiocchi 2021, p. 125). To that end, the first step towards a digitised Assyriology is to make the cuneiform resources available in a machine-readable format, i.e. in digital editions. In the long run, digital research methods from, e.g., Computational Linguistics are likely to be profitable for the field of Digital Assyriology, but they require correctly formatted and possibly enriched datasets. Assyriologists and computer scientists can support each other in creating appropriate datasets for various research communities and thereby foster research in their respective areas. Therefore, a well-planned data curation and digital scholarly edition process is needed to create a win-win scenario. Based on the discussions of our workshop, we intend to illustrate the necessities, challenges and advantages of a digitised Assyriological workflow. The article thus intends to give a first overview and primarily addresses (young) researchers from both fields of research.
§1.3. The next chapter of the article therefore deals generally with both the challenges and the opportunities of the digital scholarly edition. The following chapters three to five follow the workflow of an edition process. In the first part of each chapter, we reflect on the tasks and challenges Assyriologists face in their non-digital workflow; in the second part, we examine how the Assyriological workflow can be transformed into a digital form. The sixth chapter highlights the technical infrastructures necessary for preserving and archiving a created digital edition, and the seventh chapter presents the (possible) role of machine learning in Assyriological research. In the last chapter, we explore possible conversions of the traditional workflow towards a digital one and its implications for people and technological requirements.
§2. Opportunities and Challenges of the Digital Scholarly Edition
§2.1. Scholarly editions as a medium of scientific discourse are an established means of making texts available for the scientific community. Sahle (2016, 23) defines a scholarly edition as follows:
A scholarly edition is the critical representation of historical documents.
§2.2. In general, digital editions follow the same principles as traditional editions regarding their content, methods and structure. However, some additional technical aspects have to be considered: How should the user interface for accessing the digital edition be structured? In which format are the texts encoded? Are the underlying data made available, and if so, in which formats? How and where are the data archived in the long term? Under which licence does the work have to be published in order to make it usable for scientific research? Further requirements are given in existing guidelines (Fischer 2014; Association 2011; Forschungsgemeinschaft 2015b). Digital scholarly editions, while already used in many scholarly disciplines, provide new opportunities (Gabler 2010) but also pose new challenges on which the respective scholarly community needs to form an opinion and which it has to integrate into its daily practice. In particular, digital scholarly editions, as is the case with many means of digitisation, allow a collaborative creation, discussion and improvement of their content by a potentially large user base of experts from all over the world at any time. Technologies that allow for such collaborative efforts need to be adjusted to provide an overall added value for the scientific community. In contrast to an edition published as a traditional publication, e.g. a book, a digital edition provides distinct advantages which may also further the scientific discourse about the published content. First of all, digital publications provide improved accessibility of research for both the scientific community and the public. Under the terms of open-access publication, the research results can be accessed freely and in full by anyone interested. Public perception in particular should not be neglected, since research is to a large degree financed by the public, and small subjects like Assyriology are not particularly well known and often struggle to legitimise themselves[12]. Another advantage is the possibility of keeping research up to date over a longer period of time. While in traditional workflows addenda and corrigenda have to be published separately, a digital edition can be versioned and thus updated. Furthermore, the publication of cuneiform texts in repositories allows more rapid feedback and discussion about the contents of a digital edition. Moreover, a digitised text or text corpus allows differentiated search queries for specific lemmas, their contexts or entire lines of text. In larger text corpora, further search parameters such as genre, period and region can be applied. Another advantage lies in the possibility of easily including other kinds of media, such as images or 3D scans, in the text edition, which is only partially possible in classical analogue publications. Furthermore, it is not only possible to give a representation of the cuneiform object but also to go a step further and annotate transliterations, images, 3D scans or renderings. An example would be the possibility of presenting detailed grammatical analyses for any verbal form (as an annotation on the text edition) or of designating every cuneiform sign on a given object (as an image annotation). Annotations like these provide fast access to information of interest for Assyriological research and teaching and form the basis for more advanced digital methods like Natural Language Processing (NLP) or machine learning applications (cf. Hajo 2002).
Last but not least, the digital processing of cuneiform sources is the basic requirement for the application of advanced digital methods such as network analysis (Pagé-Perron 2018) and methods of machine learning (cf. §7).
§2.3. Besides the aforementioned opportunities of creating an accurate digital scholarly edition, a suitable data format needs to be selected for storing the digital edition. Formats such as the ASCII Transliteration Format (ATF)[13] (see Figure 3) or the Text Encoding Initiative (TEI) format are established means of representing transliterations in Assyriology; they have recently been accompanied by newer representations such as the JSON Transliteration Format (JTF)[14], which also allow links to other linked data resources in the digital Assyriological community. Different formats may provide distinct advantages for different research communities. The Nino-Cunei project[15] provides access to cuneiform corpora via Jupyter notebooks (Kluyver et al. 2016), making it easy to create statistics both for data scientists and for (future) data-science-aware Assyriologists. Even though this kind of representation is not yet common among Assyriologists, we think that easy access to statistics, to different visualisations of cuneiform texts and to machine-readable versions thereof can provide new opportunities for an easier presentation of, and access to, the data science methods used in digital Assyriology.
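To make the notebook-style analysis mentioned above concrete, the following Python sketch parses a short, invented ATF-like fragment and counts sign occurrences. The P-number and the text lines are placeholders, and a real workflow would rely on the parsing utilities of the respective corpus project rather than on this toy tokeniser.

```python
import re
from collections import Counter

# Invented ATF-like fragment; the P-number and the content are placeholders.
atf = """&P000001 = Example 001
#atf: lang sux
@tablet
@obverse
1. 1(disz) udu
2. ba-usz2
3. ki ab-ba-sa6-ga-ta
"""

sign_counts = Counter()
for line in atf.splitlines():
    # Keep only numbered transliteration lines such as "1. 1(disz) udu".
    if not re.match(r"^\d+\.", line):
        continue
    content = line.split(".", 1)[1]
    for word in content.split():
        # Naive tokenisation: split words into sign values at hyphens.
        sign_counts.update(word.split("-"))

print(sign_counts.most_common(5))
```

Statistics of this kind (sign frequencies, line counts, word lists) are the sort of output that a notebook-based corpus interface can expose to both Assyriologists and data scientists.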
§2.4. Another issue to be solved in every digital edition project is the versioning of digital edition drafts, first of all for the researcher creating the digital edition and, in the long run, for resolving possible disputes among scholars in the field. Digital technologies allow for very rapid responses, collaborative work and the joint interpretation of cuneiform texts, which should be taken advantage of to further the scientific discourse in the field of Assyriology. Especially the representation of the scientific discourse itself, in the sense of the option to dispute or comment on certain text passages, could be an interesting and previously unfeasible form of discussion.
§2.5. At the same time, researchers should be empowered to use digital tools to sharpen discussions not only about textual contents but also about digital representations of cuneiform tablets such as photos and 3D models. Annotations on both of these media have become technologically feasible, can be shared and connected to other data on the web, and can also be beneficial to the machine learning and computational linguistics communities.
§2.6. However, despite the new possibilities brought forward by technological advancements, it needs to be noted that embracing these new technologies requires a significant amount of technical expertise, which cannot be expected of the average Assyriologist. Therefore, it is of tremendous importance that infrastructures provide many of these services to the Assyriological and other research communities in human-accessible ways, but also in machine-accessible ways using meaningful, standardized Application Programming Interfaces (APIs). Moreover, infrastructures could act as data brokers, i.e. enable the connection of data between infrastructures to foster synergies between already existing data silos. Examples of such synergies might be the creation of joint dictionaries, maps of findspots of classified texts across data silos (Rattenborg 2019) and the grouping of texts and archaeological objects that are similar semantically or by other measures.
§2.7. These possibilities are not necessarily a matter of technologies but rather a matter of the community of Assyriologists to determine.
§2.8. In the following, we would like to point out these new possibilities by following the path of a cuneiform tablet from initial examination to final publication.
§3. The Archaeological Object and Object Curation
§3.1. At the beginning of each scholarly edition process stands the cuneiform text as an archaeological object (tablet, cylinder, etc.). The excavated objects are measured immediately after their discovery and assigned a find number that documents the excavation context (findspot, layer, planum, room). Subsequently, the objects are documented, i.e. photographed, drawn if necessary, described and inserted into databases. The excavation results are usually published in analogue excavation reports. Unlike in the period up to the beginning of the 20th century, the objects now remain in the country of their discovery after the excavation is completed and are transferred to a local museum. To fully understand and interpret a cuneiform text, the archaeological record (Lucas 2012) therefore has to be considered. Specific information such as the provenience of the archaeological object, its exact findspot and the find context (i.e. other objects in the surrounding area) is crucial to fully understand the content of a cuneiform text and its Sitz im Leben. In digital formats, this information appears as metadata, which may be included in the respective file, may be provided as an external file and may include more specific metadata once it is hosted in a repository. One example is the Cuneiform Site Index (Rattenborg 2019), which highlights specific kinds of metadata about the findspots of cuneiform tablets. Relevant metadata for cuneiform text editions include e.g.[16] the following (a minimal machine-readable sketch follows the list):
- collection/owner
- museum number
- excavation number
- internal CDLI number
- provenience
- findspot (area, layer, room)
- period
- object type
- material
- measurement
- language
- script
- genre + sub-genre
- primary publication
- secondary publications
- author of transliteration (e.g. ATF, etc.)
- author of translation
- remarks
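As referenced above, here is a minimal sketch of how such a metadata record could be captured in a machine-readable way, written as a plain Python data class. The field selection loosely mirrors the list above; it is not a fixed standard, and the example values are invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TabletMetadata:
    """Illustrative metadata record for a cuneiform text edition.

    The fields mirror the list above; in practice they would need to be
    mapped onto CDLI/ORACC fields or a standard such as CIDOC CRM.
    """
    collection: str
    museum_number: str
    provenience: Optional[str] = None
    excavation_number: Optional[str] = None
    findspot: Optional[str] = None          # area, layer, room
    period: Optional[str] = None
    object_type: Optional[str] = None
    material: Optional[str] = None
    measurements: Optional[str] = None
    language: Optional[str] = None
    script: Optional[str] = None
    genre: Optional[str] = None
    primary_publication: Optional[str] = None
    secondary_publications: List[str] = field(default_factory=list)
    author_of_transliteration: Optional[str] = None
    author_of_translation: Optional[str] = None
    remarks: Optional[str] = None

# Invented example values, for illustration only.
record = TabletMetadata(
    collection="Example Museum",
    museum_number="EM 0001",
    provenience="Umma (mod. Tell Jokha)",
    period="Ur III (ca. 2100-2000 BC)",
    language="Sumerian",
    genre="administrative",
)
print(record.museum_number, record.period)
```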
§3.2. In general, the metadata relevant for a publication also depend on the genre of the text(s) and can be modified and extended accordingly. In the case of letters, for example, the sender and the addressee can be listed; in the case of administrative texts, further information about sealings and dates can be given.
§3.3. How can this information now be organised and processed in a digital edition? How can the digital edition be combined with information about the archaeological object relevant to other research communities such as archaeology, cultural heritage management or the linked data community?
§3.4. To find an answer to these questions, it can be advantageous to look at related work concerning data standards in both the linked data and the archaeology community. For archaeology and cultural heritage management, standards like the CIDOC Conceptual Reference Model (CIDOC-CRM) by the Comité international pour la documentation (CIDOC) (Cidoc 2003), CRM digital (CRMdig) (Doerr and Theodoridou 2014) and Lightweight Information Describing Objects (LIDO) (Coburn et al. 2010) offer established vocabularies for modelling metadata. They allow modelling the history of an archaeological object from excavation and object curation at a museum to the eventual publication, including the involved persons, rightsholders and further metadata. CIDOC CRM provides many extensions that may be used to describe, e.g., a textual representation or a single glyph on a cuneiform tablet as a digital object, so that other software may reuse it. Similarly, it is possible to describe images or other points of interest which may be present on a given archaeological object. Digitally enhanced descriptions may serve as a basis for the Assyriologist to better understand details in the archaeological context, which in turn may be used to interpret the textual content. For computer scientists, this kind of information can be harvested and brought into context with other existing information to enhance tasks such as machine learning. In this context, it might be useful to compare the information covered by the CIDOC-CRM standard to the information provided by the CDLI and ORACC, as excerpted above. Are these standards complementary, and which information could be appended to the metadata provided by CDLI and ORACC? If this investigation shows that specific data are missing, this may provide an opportunity to define new CIDOC CRM extensions or more specialised versions of vocabularies for Assyriologists.
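As a minimal sketch of such CIDOC CRM modelling, the following Python snippet uses the rdflib library to describe a tablet as a physical object with an identifier and a material. The CRM namespace URI and the class/property names follow the commonly used conventions of recent CRM versions but should be checked against the official specification, and the example URIs and values are invented.

```python
from rdflib import Graph, Namespace, Literal, RDF, RDFS

# Namespaces; the CRM namespace URI and the example base are assumptions.
CRM = Namespace("http://www.cidoc-crm.org/cidoc-crm/")
EX = Namespace("https://example.org/cuneiform/")

g = Graph()
g.bind("crm", CRM)

tablet = EX["tablet/EM0001"]          # invented identifier
museum_no = EX["identifier/EM0001"]

# The tablet as a physical, human-made object (E22).
g.add((tablet, RDF.type, CRM["E22_Human-Made_Object"]))
g.add((tablet, RDFS.label, Literal("Administrative tablet, Ur III (example)")))

# Museum number modelled as an E42 Identifier attached via P1.
g.add((museum_no, RDF.type, CRM["E42_Identifier"]))
g.add((museum_no, RDFS.label, Literal("EM 0001")))
g.add((tablet, CRM["P1_is_identified_by"], museum_no))

# Material modelled as an E57 Material attached via P45.
clay = EX["material/clay"]
g.add((clay, RDF.type, CRM["E57_Material"]))
g.add((clay, RDFS.label, Literal("clay")))
g.add((tablet, CRM["P45_consists_of"], clay))

print(g.serialize(format="turtle"))
```

Descriptions of this kind can be harvested and merged with other linked data, which is precisely what makes them attractive as a bridge between Assyriological metadata and other research communities.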
§4. 3D Scanning and 2D Images
§4.1. To make a cuneiform object available to the (scientific) public, images of the tablet have to be created. Since subsequent damage to or loss of a cuneiform object cannot be ruled out, the creation of images also offers the possibility of preserving the information given on the cuneiform object for both the scientific community and the public. To this day, autographs (redrawings of the tablet) and photos are the primary graphical representations of cuneiform objects. A photo has the advantage of showing the original object but depends on the quality and the lighting conditions under which it was taken. Autographs, on the other hand, always represent an interpretation by the editor but often offer better readability than a photo. Therefore, the (re-)editing of a text ideally includes a collation of the text(s) as a further step in the editing process. This involves examining the tablet again in its original state to minimise possible errors in transliteration and autography. With the introduction of 3D scanning, it is also possible to produce 3D scans (see Figure 1) of the tablets, which combine the advantages of both previous forms of publication, as they correspond to the original object. 3D scans are becoming increasingly widespread (Mara 2019; e.g. the DFG project "Die digitale Edition der Keilschrifttexte aus Haft Tappeh (Iran)"[17]) but are not yet common practice in all Assyriological projects. 3D scans furthermore offer the opportunity to automatically create 2D pictures called renderings, which can be used in traditional publication formats.
§4.2. The advantage of renderings compared to photographs is that they can be graphically revised to ensure the best possible readability of the depicted cuneiform signs. Additionally, renderings are uniformly aligned and automatically generated from given 3D models, which is an advantage for automated processing applications in the machine learning domain.
§4.3. When it comes to 3D scans, the requirements of Assyriologists and computer scientists start to differ: for philologists, the resolution of a 3D scan needs to be precise enough to recognise the cuneiform signs in all their details. The case is different if the 3D scan is used in a machine learning setting that intends to recognise cuneiform signs. Here, the accuracy of the algorithm used for cuneiform sign recognition depends on training data of a certain quality, whose requirements may deviate from the expectations of the Assyriological community. This means that the quality of a 3D scan needs to be assessed depending on the use case for which the scan is intended. To enable this, 3D scans need to document the necessary parameters of their creation process in a machine-readable format. Currently, this is possible in a software-dependent scenario in customised formats or via access to the scanning software using APIs, but no standardized format exists to capture the creation process of a given 3D model as a whole. For 2D images, standards such as the Exchangeable Image File Format (EXIF) (Tachibanaya 2001) and the Extensible Metadata Platform (XMP) (Ball and Darlington 2007) provide means of documenting the creation process of 2D images and renderings. Since the first 3D scans of cuneiform tablets were conducted (Hahn et al. 2006), various research projects have embraced the scanning of cuneiform tablets for digital preservation. The HeiCuBeDa 3D benchmark dataset (Mara 2019) is one example of a dataset comprising 3D scans and metadata, used for various classification tasks by the machine learning community, and it documents an important collection of cuneiform tablets. HeiCuBeDa, among other projects, also set a precedent for hosting 3D scans appropriately, from which we can infer certain prerequisites that we consider important:
- Traceability and comprehensibility of the creation process and purpose of the 3D scans and 2D images and possible 3D renderings.
- Making these parameters visible in the data repository which the artefacts are hosted in.
- Accessibility of 3D objects and 2D images using established web services such as the International Image Interoperability Framework (IIIF) (Snydman, Robert Sanderson, and T. Cramer 2015), so that they may be easily reused in other research contexts (see the sketch after this list).
- Assignment of unique IDs, e.g. Digital Object Identifier (DOI) (Paskin 2010) to enable citations of digital artefacts.
- Connection of 3D scans to other digital artefacts (e.g. transliterations, archaeological objects, historic contexts etc.).
- Development of annotation and citation standards for areas on 3D scans to further the scientific discourse and computational analysis.
- Linking transliterations (words/characters etc.) to image annotations.
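As a small illustration of the accessibility point above, the following sketch builds an image request URL according to the IIIF Image API pattern. The server prefix and identifier are invented, and the exact full-size keyword ("full" vs. "max") depends on which version of the Image API the server implements.

```python
def iiif_image_url(base: str, identifier: str, region: str = "full",
                   size: str = "max", rotation: int = 0,
                   quality: str = "default", fmt: str = "jpg") -> str:
    """Build a IIIF Image API URL: {region}/{size}/{rotation}/{quality}.{format}.

    'max' is the full-size keyword of Image API 3.0; servers implementing
    version 2.x expect 'full' instead.
    """
    return f"{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{fmt}"

# Invented server and identifier, for illustration only.
print(iiif_image_url("https://iiif.example.org/iiif/3", "EM0001_obverse"))
# -> https://iiif.example.org/iiif/3/EM0001_obverse/full/max/0/default.jpg
```

Because the URL pattern is standardized, annotations and citations can point to exactly reproducible image regions and sizes, regardless of which viewer or institution serves the image.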
§4.4. We can also see that related work is already tackling some of these challenges, so that these prerequisites might become more commonplace and be fulfilled by data portals in the near future. The IIIF 3D community group[18] is currently discussing standards for the creation, discovery and annotation of 3D data, so that 3D meshes might be hosted and cited using standardized means. Homburg, A. Cramer, et al. (2021) propose a metadata model that describes the capturing process of a 3D mesh using parameters automatically gathered from the scanning software, in order to make 3D meshes acquired from different capturing processes comparable. This might pave the way for an automatic suitability analysis of 3D scans for specific, to-be-defined use cases once such 3D scans are hosted on a larger scale in data repositories. Further challenges, such as annotation standards and vocabularies and suitable linking methods between digital objects, remain areas of research.
§5. The (Digital) Edition Process
§5.0.1. This section will describe the process of creating a cuneiform text edition using the resources described in the previous sections and the aspects specific to the digital medium.
§5.0.2. Now that the position of the digital edition within Assyriology has been discussed, the question arises of how the transliteration and the digital edition are represented. The TEI/XML format and framework is well established in various digital edition projects (Burnard 2020) but not very common in Assyriology. Projects like the Electronic Text Corpus of Sumerian Literature (ETCSL)[19] used TEI/XML to represent transliterations so that they could be processed using TEI/XML text processors. However, the more common representation of cuneiform digital editions is the ATF format used by major cuneiform repositories such as CDLI or ORACC. In contrast to a TEI/XML representation, ATF does not mandate a markup of the cuneiform text in certain ways, as is quite common with TEI/XML-encoded digital editions; rather, it is primarily a representation of the textual content itself. However, several XML dialects and markup standards have been developed[20] in various research projects, which may serve as inspiration not only for a data representation of cuneiform transliterations but also for an appropriate visual representation on the web or in traditional desktop applications.
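As a minimal sketch of what a TEI-style encoding of transliteration lines might look like, the following snippet builds a small XML tree with Python's standard library. The element choices (div, l) are illustrative assumptions and not a prescribed schema for cuneiform editions; real projects would follow a project-specific TEI customisation.

```python
import xml.etree.ElementTree as ET

# Minimal, illustrative TEI-like structure for two transliteration lines.
tei = ET.Element("TEI", xmlns="http://www.tei-c.org/ns/1.0")
text = ET.SubElement(tei, "text")
body = ET.SubElement(text, "body")
obverse = ET.SubElement(body, "div", type="surface", n="obverse")

for n, line in enumerate(["1(disz) udu", "ba-usz2"], start=1):
    l = ET.SubElement(obverse, "l", n=str(n))
    l.text = line

print(ET.tostring(tei, encoding="unicode"))
```

The contrast with ATF is visible even in this toy example: the markup carries explicit structure (surfaces, line numbers as attributes) that generic TEI tooling can process, whereas ATF encodes the same structure in its own line-based conventions.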
§5.1. Challenges in Palaeography
§5.1.1. The starting point of any work on ancient Near Eastern texts is engagement with the cuneiform script. This writing system was developed at the end of the 4th millennium B.C. in present-day southern Iraq and was used until about the 1st century A.D. for a multitude of different languages, dialects and language forms. It shows regional differences in its distribution throughout the Near and Middle East. In the 3500 years of cuneiform history, the individual signs were repeatedly developed, abstracted and standardized, so that from the graphemes of the 1st millennium B.C. one can only very rarely extrapolate to the signs of the 3rd millennium B.C. and vice versa (Table 1). This phenomenon poses great challenges for the cuneiform researcher: the approximately 900 attested cuneiform signs (Borger 2010) are reflected by over 10,000 sign variants that differ according to chronological and geographical location. Even in a small regional setting of the same period, several variants of a single sign may have been in use. Each scribe had his or her own handwriting, so that even within a single text corpus a wide variety of sign variants can exist. These signs follow a "standard form", which could be shaped with the addition or omission of so-called "filling wedges" depending on context and linguistic register. However, these palaeographic peculiarities are not only a challenge for the cuneiform researcher but also a clear advantage, since they help him or her to classify the texts chronologically and linguistically already at first glance.
§5.1.2. For the identification of the individual cuneiform signs, a number of analogue sign lists are available to the Assyriologist; these are often very specific (covering individual text corpora, language stages or regions, etc.) or offer a general but not complete overview of the sign repertoire. No sign list covers all known variants of the cuneiform signs; each offers only selected forms. For example, the very detailed sign list by R. Borger (Borger 2010) concentrates on the Neo-Assyrian and Neo-Babylonian forms of the signs, i.e. the sign forms of the 1st millennium B.C. Another example is the sign list (Mittermeyer 2006) listing the signs of Old Babylonian (early 2nd millennium) Sumerian literary texts. The only sign list so far depicting the development of the individual signs through all stages of cuneiform palaeography was created by R. Labat (Labat 1995). Though still highly valuable research tools, some sign lists are partly outdated, since the phonetic readings of the characters evolve and phonetic values are constantly added or deleted. Compact and complete sign lists, e.g. covering a complete language stage across the whole Mesopotamian area, therefore remain a desideratum.
§5.1.3. Even further challenges arise in the digital representation of cuneiform script[21]. Cuneiform signs have been introduced into the Unicode standard (Allen et al. 2012). During this process, each Unicode code point reserved for cuneiform characters has been assigned a sign name. However, transliterations do not necessarily follow the sign names given in Unicode, either because a cuneiform sign may be transliterated differently depending on the cuneiform language, or because the appropriate transliteration needs to be chosen according to the context of the given cuneiform sign. The opposite conversion, from transliteration to Unicode code points, is, however, unambiguous. Projects like Nuolenna[22] provide sign lists, and services like Cuneify[23] provide a conversion from transliterations to Unicode code points.
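The direction from sign name to code point can be illustrated with Python's standard unicodedata module. The small mapping from transliteration values to Unicode sign names below is an assumption for illustration only; in practice it would come from a curated sign list such as those used by Nuolenna or Cuneify, and names not known to Unicode are simply passed through.

```python
import unicodedata

# Illustrative mapping from transliteration values to Unicode sign names;
# a real mapping would come from a curated sign list, not be hard-coded.
SIGN_NAMES = {
    "a": "CUNEIFORM SIGN A",
    "an": "CUNEIFORM SIGN AN",
    "ka": "CUNEIFORM SIGN KA",
}

def to_unicode(translit: str) -> str:
    """Convert a hyphen-separated transliteration into cuneiform code points."""
    signs = []
    for value in translit.lower().split("-"):
        try:
            signs.append(unicodedata.lookup(SIGN_NAMES[value]))
        except KeyError:
            # Unknown value or sign name: keep the transliteration as a fallback.
            signs.append(f"[{value}]")
    return "".join(signs)

print(to_unicode("an-ka"))
```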
§5.1.4. However, cuneiform signs often lack a standardized appearance (cf. Edzard 1976-1980, 559). Therefore, the standardized forms of Unicode do not include essential information like the actual shape of the sign on the cuneiform object, the number of wedges and other characteristics of the respective scribal conventions. Unicode cuneiform signs will never capture the distinct sign variants on the cuneiform tablets that the Assyriologist is currently investigating, nor were they intended for this purpose. Currently, this information can only be observed on the visual representation of the cuneiform object (image, autograph, etc.; see §3) and has to be documented manually either in the scientific literature or in sign lists. So far, it is rarely documented in a formalised digital way.
§5.1.5. The Dagstuhl Seminar ”Digital Palaeography” (Hassner et al. 2014) identified these and other needs in digital palaeography of different text types and advocated for better support of palaeography in digital standards. In particular, the integration of palaeographic data as linked data has been discussed and was further emphasized in (Homburg 2020).
§5.1.6. We have identified several areas of improvement for palaeographic representations and challenges for researchers to overcome with respect to cuneiform. First, a digital abstract representation of cuneiform characters would be useful. Other disciplines like Egyptology have recently defined encodings to represent Egyptian hieroglyphs (Glass et al. 2017; Nederhof, Polis, and Rosmorduc 2019), which allow a computer to access the contents of subareas of an Egyptian hieroglyph and a user to input these characters easily. While certainly structured differently, the cuneiform script could profit from a similar encoding capturing the shape of a cuneiform character. Approaches in this direction have been taken in the application of the Gottstein encoding (Panayotov 2015), which allows the user to search for cuneiform sign variants. Recently, a new encoding, PaleoCodage (Homburg 2019a; Homburg 2021), has been proposed to capture all information included in a cuneiform character's shape. More precise encodings may help to classify sign variants that occur in the daily work of every Assyriologist. We see the potential of digital character encodings to improve the following fields of application (a toy encoding sketch follows the list):
- Machine learning and classification of cuneiform sign variants.
- Digital modelling and registration of cuneiform sign variants. (Panayotov 2015; Homburg 2019a; Homburg 2021)
- Annotation of character variants and connection of digital cuneiform signs across language and text barriers.
- Making sign variants searchable.
- Dynamic creation of cuneiform fonts for different text corpora with different criteria.
- Automatic creation of autographs for cuneiform texts from annotated transliteration content.
- Extension of cuneiform transliteration formats to include at least linked palaeographic information.
- Development of ontology models for the expression of palaeographic content.
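To make the idea of a count-based shape encoding concrete, here is a toy sketch in the spirit of the Gottstein system, which characterises a sign by counts of wedge categories. The letter-to-wedge-type mapping and the example wedge inventory below are illustrative assumptions and do not reproduce the published encodings (Panayotov 2015; Homburg 2019a).

```python
from collections import Counter

# Illustrative wedge categories; the actual Gottstein letters and their
# assignment to wedge types should be taken from the published system.
CATEGORIES = ("a", "b", "c", "d")  # e.g. horizontal, vertical, oblique, Winkelhaken

def count_code(wedges: list[str]) -> str:
    """Encode a sign as letter+count pairs, e.g. 'a2b1' for two 'a' and one 'b'."""
    counts = Counter(wedges)
    return "".join(f"{c}{counts[c]}" for c in CATEGORIES if counts[c])

# Invented wedge inventory of a fictional sign variant.
print(count_code(["a", "a", "b", "d"]))   # -> a2b1d1
```

Codes of this kind make sign variants with the same wedge inventory retrievable by simple string comparison, which is precisely what makes such encodings attractive for search and for machine learning features.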
§5.1.7. Nevertheless, capturing and saving cuneiform sign variants requires organising them. To tackle this challenge, we recommend creating a cuneiform sign variant registry that can serve as an online, continuously updatable database in which not only the shapes of cuneiform signs and their properties may be encoded, but also their metadata, such as the historical and geographical context, and relations between sign variants (e.g. cuneiform signs appearing as parts of other cuneiform signs). This is by no means a short-term endeavour, but it could be supported by technologies such as image recognition, which could identify and encode character variants from either 3D scans or 2D photos/renderings of cuneiform tablets.
§5.1.8. Furthermore, the registry should be enriched by appropriate metadata to build the foundation for a search engine that could answer questions such as "Where has sign variant XYZ occurred?". Naturally, such a search engine would depend on generated cuneiform characters, as fonts possibly enabled by PaleoCodage or as pictures of cuneiform signs, and on a discussion within the Assyriological community as to which abstract depictions of automatically generated palaeographic variants are acceptable. In addition, a data model is needed to represent these character variants. In this regard, one could think of extending related work such as the Ontolex-Lemon model for dictionaries to support palaeographic features, something for which no official standard is known to the authors to date. Similarly, TEI/XML representations of cuneiform palaeography might be useful.
§5.2. Standards for Annotations
§5.2.1. After a cuneiform text has been embedded in a chronological and linguistic context via the palaeographic specifics, further processing of the text begins. In this step, a transliteration and translation of the text in question are made. In the transliteration, the phonetic values of the individual cuneiform characters readable on the tablet are noted, and damaged parts and otherwise illegible passages are marked (see Figure 3).
§5.2.2. The transliteration already represents an interpretation of the text since most cuneiform signs are polyphonic, i.e. have several phonetic values and can be read as word signs at the same time (see Figure 2). The transliteration of a cuneiform sign thus depends largely on how the editor places the character in the overall context of the tablet. Therefore, in the course of editing, a great deal of grammatical, lexical, and semantic information is gathered that ultimately influences transliteration and translation. Much of this information is only indirectly perceptible in the product of text editing, i.e., the text edition, primarily through the translation, which maps the decisions made regarding grammar and lexis. An exception is a philological commentary, in which the editor specifically takes up and further explains certain grammatical, lexical or even semantic aspects of the text. Which phenomena are processed in a commentary, however, is entirely at the discretion of the editor.
§5.2.3. While in traditional text editions much of this information is not directly noted, digital processing offers the possibility of recording this information and making it usable for further applications later in the process (e.g. dictionaries based on semantic web technologies, character lists, machine learning applications, etc.; see below §7) (cf. Maiocchi 2021, p. 124).
§5.2.4. In Computer Science, these kinds of information are referred to as annotations. There, annotations comprise extra information associated with an existing medium such as documents, images or other pieces of information. In the digital age, markup languages provide means of adding annotations to textual or image content, so-called web annotations (Maiocchi 2021, p. 119–120; see also Figures 4 and 5). In the wake of the formalisation of meaning through semantic web technologies, annotations are often coupled with semantic web definitions to guarantee machine-readable accessibility of their contents and categorisation of information and, at the same time, human-readable access to the annotation content.
§5.2.5. Transliterations could e.g. be provided with detailed word-for-word grammatical information (Figure 5), which benefits not only the researcher or learner of the respective language but also creates digital resources for research in NLP. For cuneiform texts, annotations can provide new and more precise means of attaching information to 2D and 3D images, e.g. possible readings of an uncertain sign, additions of broken parts, or any other outstanding palaeographical information. Thus, annotations provide means to implement information directly in a text or image without interfering with the coherency of the edition.
§5.2.6. The World Wide Web Consortium (W3C) Web Annotation Data Model (Ciccarese, Soiland-Reyes, and Clark 2013) provides a framework for representing web annotations using semantic web technologies. However, the Web Annotation Data Model does not give guidance on how to represent the content of annotations for the respective to-be-annotated elements. For this, specific communities have created annotation standards which are followed to varying extents, and not all of which are applied within the W3C Web Annotation Data Model. The OCHRE environment provides annotations which can link images and transcriptions of cuneiform tablets (Prosser and Schloen 2021), although the annotation format does not appear to be openly accessible. ORACC provides linguistic annotations which are interwoven in its transliteration format[24]. These annotations provide part-of-speech tags, lemmatizations, single-word translations into English and decompositions of words into grammatical particles, however without a machine-readable description of their meaning. The CDLI is currently implementing linguistic annotations which have been gathered from the automated corpus analysis of the MTAAC project (Baker et al. 2017) in the CONLL-U and CONLL-RDF formats. These annotations are focused purely on linguistic aspects and provide no connection to annotated images. However, they provide, at least for Sumerian, to the authors' knowledge the most detailed linguistic annotations. In addition, to express linguistic annotations, the Ontologies of Linguistic Annotation (OLiA) (Chiarcos and Sukhareva 2015) may be used, e.g. to annotate nouns and verbs. Also, the ASCII Transliteration Format (ATF) may provide some guidance on annotating certain elements in cuneiform texts (e.g. broken characters, interpretations by the authors, among others) (see Figure 4). These annotations have not yet been formalised in terms of ontologies in the same way as other approaches for annotating images or 3D scans. Here, we see great potential for standardization and changed practices in Digital Assyriology. In addition, connecting annotations across different digital media is likely to become of greater interest for several reasons (a minimal web annotation sketch follows the list):
- Connecting annotations on textual content with annotations on image content provides a more comprehensive understanding of the decisions taken when creating a transliteration.
- Transfer of annotations: Users should be able to annotate, e.g. on a 2D image, and transfer these annotations to other mediums such as a 3D scan or a transliteration (see Figure 4).
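As referenced above, here is a minimal sketch of what such an annotation could look like in the W3C Web Annotation model, expressed as a Python dictionary serialised to JSON-LD. The image URL, the pixel region and the annotation body are invented for illustration.

```python
import json

# Illustrative web annotation linking a reading proposal to an image region;
# the URL, the selector region and the body text are invented.
annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "type": "Annotation",
    "motivation": "commenting",
    "body": {
        "type": "TextualBody",
        "format": "text/plain",
        "value": "Sign partly broken; reading udu proposed from context.",
    },
    "target": {
        "source": "https://iiif.example.org/iiif/3/EM0001_obverse/full/max/0/default.jpg",
        "selector": {
            "type": "FragmentSelector",
            "conformsTo": "http://www.w3.org/TR/media-frags/",
            "value": "xywh=120,340,60,60",
        },
    },
}

print(json.dumps(annotation, indent=2))
```

Because the target selector addresses a concrete region of a standard image URL, the same annotation body could, in principle, be transferred to a rendering of a 3D scan or linked to the corresponding token of a transliteration.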
§5.2.7. Once annotations have been standardized and opportunities for the transfer of annotation contents have been created, automatic and semi-automatic annotation strategies may be better examined and compared. While these technologies already work to a certain extent today, they will be easier to use once standardized classification targets have been defined. Apart from texts, the other digital media delivered in the course of the digital edition may also profit from annotations. Extralinguistic features, for example, might be better highlighted on a photo, a 2D rendering or a 3D model representing the given cuneiform tablet than in a comment in the digital edition text (see Figures 3 and 4). Through standardization of the annotation vocabulary, we can expect benefits not only for the Assyriological community but also for other research communities, which can comment on and reuse annotations for classification tasks in machine learning or automated translation.
§5.3. Standards for Dictionaries
§5.3.1. Almost all text editions include a translation of the text(s) studied. A translation, however, is always an interpretation of the text, based on the knowledge of the grammar and the lexicon of the language used in the respective text(s). This knowledge is generally collected in grammars and dictionaries. It must be kept in mind that grammars and dictionaries represent the state of research at the time of their publication. Through research, the knowledge of grammar and lexicon keeps growing, so that addenda become necessary from time to time. The availability of dictionary resources in Assyriology differs greatly between the cuneiform languages. Whereas in Akkadian studies two major dictionaries (CAD and AHw) as well as several smaller wordbooks and specialised glossaries exist, the situation is completely different for Sumerian. So far, there is no complete printed dictionary of Sumerian, but there are different glossaries (e.g. the Sumerischer Zettelkasten[25]) and a large digital compilation in the form of the Electronic Pennsylvania Sumerian Dictionary (ePSD)[26]. The main dictionaries in the field contain the following information: to allow navigation, the dictionary is organised by lemmas, representing the basic forms of the words. Furthermore, palaeographic information demonstrates how the word was written in different chronological and geographical areas and, in the case of Akkadian, Hittite, etc., which logograms were used to represent the word. Since the translation of a word at least partially depends on the context in which it is used, a detailed collection of references is added. Thus, a dictionary finally gives all possible translations of a word.
§5.3.2. Different data formats for representing cuneiform dictionaries are in use, and some of them have been standardized and are used across research communities. TEI/XML dictionaries (Bański, Bowers, and Erjavec 2017; Budin, Majewski, and Mörth 2012) are used across disciplines to model dictionaries and relate them to text corpora. Currently, cuneiform transliterations are in some cases represented using TEI/XML, but dictionaries represented in this way are not known to the authors. It could make sense to adopt TEI/XML dictionary representations for infrastructures that host such data. Another approach to representing dictionaries and relating them to machine-readable semantic descriptions is the Ontolex-Lemon (Lexicon Model for Ontologies) model (McCrae et al. 2017). The model can represent words and word forms and relate them to words in other languages as well as to semantic meanings. This model is being considered by the CDLI for the internal representation of dictionaries, and we see in it the greatest potential for the further representation of words. What Ontolex-Lemon lacks is metadata about given words. Here, the Assyriological and other interested communities would be well advised to define certain metadata to be attached to the definition of words and word forms in order to formalise the representation of a digital dictionary for Assyriologists. Another important aspect is the relation and enrichment of digitally created glossaries of different cuneiform digital edition projects. Glossaries can benefit from enrichment with dictionary contents, and likewise dictionaries may benefit from additional references to word occurrences presented in glossaries. Connecting these resources at the data level represents a huge potential for researchers.
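A minimal sketch of an Ontolex-Lemon lexical entry for a Sumerian word, built with rdflib, is given below. The base URI is invented, and the way the English gloss is attached (as a plain label on the sense) is a simplifying assumption; a real dictionary model would need to settle such choices and add the metadata discussed above.

```python
from rdflib import Graph, Namespace, Literal, RDF, RDFS

ONTOLEX = Namespace("http://www.w3.org/ns/lemon/ontolex#")
EX = Namespace("https://example.org/sux/lexicon/")   # invented base URI

g = Graph()
g.bind("ontolex", ONTOLEX)

entry = EX["udu"]
form = EX["udu#canonicalForm"]
sense = EX["udu#sense1"]

# The lexical entry and its canonical written form.
g.add((entry, RDF.type, ONTOLEX.LexicalEntry))
g.add((entry, ONTOLEX.canonicalForm, form))
g.add((form, RDF.type, ONTOLEX.Form))
g.add((form, ONTOLEX.writtenRep, Literal("udu", lang="sux")))

# One sense with an English gloss attached as a label; a richer model
# could instead point to a shared concept resource.
g.add((entry, ONTOLEX.sense, sense))
g.add((sense, RDF.type, ONTOLEX.LexicalSense))
g.add((sense, RDFS.label, Literal("sheep", lang="en")))

print(g.serialize(format="turtle"))
```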
§5.4. Standards for Sign Lists
§5.4.1. A graphical sign list compiles different variants of cuneiform signs and their phonetic and ideographic readings. Such lists can be generated from the text corpora of a single archaeological site or from documents of the same chronological classification across a broader region. A general sign list compiled from different time intervals can provide information on the development and improvement of the cuneiform signs. Such compilations are normally published in analogue form (e.g. Labat 1995, Borger 2010, etc.) and are extremely helpful to the researcher, as they not only show the development over time but can also reveal variants at different sites and facilitate the search for, for example, broken signs, character strings or even contextual readings.
§5.4.2. For (digital) sign lists, a distinction should be made between two types of search: the search for already transliterated signs and the search for the graphic representation of cuneiform signs. A digital sign search should be structured in such a way that both a search for the sign name of a cuneiform sign and a search for individual phonetic values are possible.
§5.4.3. The polyphony of the cuneiform signs results in several phonetic values for one grapheme, which are selected by the editor of the texts depending on the context (see Figure 2). When searching for a single phonetic value of a cuneiform sign, hits with other readings of this sign should also be suggested. In addition, the search should take into account both logographic and syllabic readings and ignore brackets, breaks or other interruptions. Therefore, the texts should already be prepared accordingly and, in the best case, linked to digital dictionaries so that sign lists can be generated automatically from the entered texts.
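A minimal sketch of such a preparation step: normalising an ATF-style transliteration line for search by stripping the most common editorial marks. The selection of marks handled here (square and angle brackets, #/?/! flags, determinative braces) is a simplified assumption and does not cover the full set of ATF conventions.

```python
import re

def normalize_for_search(line: str) -> list[str]:
    """Strip common editorial marks from an ATF-style line and return sign values."""
    # Remove square and angle brackets but keep their contents.
    line = re.sub(r"[\[\]<>]", "", line)
    # Remove damage/uncertainty/correction flags attached to signs.
    line = re.sub(r"[#?!]", "", line)
    # Replace determinative braces, e.g. {d}, with separators so the
    # determinative stays a separate token.
    line = re.sub(r"[{}]", "-", line)
    signs = []
    for word in line.split():
        signs.extend(word.split("-"))
    return [s for s in signs if s]

print(normalize_for_search("{d}szul-gi# [lugal] kal-ga"))
# -> ['d', 'szul', 'gi', 'lugal', 'kal', 'ga']
```

Normalised sign sequences of this kind can then be indexed so that a search for one reading of a sign can be expanded to its other readings via a linked sign list or dictionary.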
§5.4.4. In a graphical search query, it should be possible to search for sign shapes in particular. Also, all known sign variations should be displayed with references. In the best case, regional, chronological and linguistic filters can be applied (Endesfelder 2021)[27].
§5.4.5. At the data level, sign lists should be linked to globally available sign lists, character variant registries and further data resources of interest. As with dictionaries, sign lists created in a digital edition project should eventually become part of the global corpus of cuneiform sign information, which should be easily accessible to humans and machines alike.
§5.5. Requirements for Cuneiform Research Software
§5.5.1. Besides providing the ability to edit and publish the respective edition, software supporting digital scholarly editions should be held to further standards, as outlined, e.g., in (Sichani and Spadini 2018). Software should be performant, support a variety of common input and output formats relevant to many research communities and provide easily understandable tutorials for all target communities. For the latter, it is especially important to provide examples which demonstrate the correct usage of the research software. These should illustrate the intended usage with minimal examples and, at best, be written with increasing complexity. In this way, easy access for all target communities can be ensured. Not only for cuneiform research software but for research software in general, a few more prerequisites should be met: software should be citable (Druskat, Gruenpeter, Chue Hong, et al. 2017), open source, easily extendable and at best provide a roadmap and a contribution guide to make it easy for developers to submit further contributions. Developers should clearly communicate how to report errors in the software, how to submit feature requests and what the goals and anticipated limitations of their respective software are. In this way, users get a correct impression of what the software can and cannot do and what it was programmed to do. In this section, we can only highlight the most important aspects of accessible and usable cuneiform research software. For a more detailed assessment of research software, we refer to related work that has discussed this for the field of archaeology (Homburg, Klammt, et al. 2021). In particular, we recommend a discourse about useful software within the Assyriology community, either by means of software review articles, as is already done in related disciplines such as archaeology[28], or in conference sessions, in order to better elaborate the needs of the Assyriology community and to find a suitable developer stack for its implementation.
§5.6. Data Exports related to Transliterations
§5.6.1. In the last subsections, we have discussed the different digital artefacts generated when creating a digital scholarly edition of cuneiform texts. While only some of the data exports we mention in the following might be interesting for the Assyriological community, the question we try to answer here is which data exports should be provided by an infrastructure hosting digital transliterations. Data exports should represent viable options for many different research communities and should be compatible with the tool stack of the intended target audience. In general, we deem the following data essential for a fully modern digital edition. First, we list data products which would be generated on a per-text basis:
- A 3D representation of the cuneiform tablet with capturing metadata and under an open license (preferably Creative Commons) (Y.-H. Lin et al. 2006).
- Renderings of the 3D representation with image metadata.
- Representation of the transliteration in as many formats as possible (ATF, TEI/XML, JTF, RDF) including metadata representations.
§5.6.2. Next we list derived data products which may be created when describing a whole text corpus of cuneiform texts:
- Sign list and glossary/dictionary extractions from a given text corpus in either TEI/XML or RDF.
- Collections of annotations for specific purposes by annotation medium.
- Derived data products from annotations (e.g. extracted 2D images of annotation areas).
- Machine learning datasets derived from connected data.
- Corpus statistics.
§5.6.3. Together, these data products represent a holistic view of both the textual contents and the text corpus itself, providing room for statistical analysis and the critical examination of individual texts by Assyriologists.
§6. Infrastructures
§6.0.1. Infrastructures are repositories or content management systems used by scholars and algorithms to access relevant research data, store currently in-progress research projects, and promote the open science idea (Bezjak et al. 2018) of making research results publicly available whenever possible while at the same time giving credit to the individual research contributions (Robson, Rutz, and Kersel 2014). To function in this fashion, infrastructures should, on a technical level, adhere to certain principles, which we introduce in the following. First, data provided by infrastructures should be FAIR data (Wilkinson et al. 2016); that is, data should be:
- Findable: Data should be easy to find for both humans and computers (e.g. using a search engine).
- Accessible: Clear and precise information should be given on how and under which conditions the data can be accessed.
- Interoperable: Data should be stored in such a way that enables easy integration with other data of a similar kind.
- Reusable: Metadata should be added and well-described so that data can be replicated and/or combined in different settings.
§6.0.2. Next, data repositories are advised to follow the CARE principles (Alliance 2019) and the TRUST principles (D. Lin et al. 2020). The TRUST principles for digital repositories (D. Lin et al. 2020) advocate for Transparency, Responsibility, User focus, Sustainability and Technologies to support the aforementioned goals. The CARE principles demand that data provide a collective benefit, that the indigenous communities concerned retain authority to control them, and that those working with the data act responsibly and adhere to ethical principles, in particular concerning the (property) rights of indigenous people. As such, data published in cuneiform studies should always relate to the country of origin of the digital artefacts and be published in ways that make the data accessible to this audience. To implement these principles, well-established API standards for the distribution of research data, such as REST (Massé 2011) web services, should be followed. Clearly, access to these APIs needs to be documented as well, which is achieved by adhering to documentation standards for APIs such as OpenAPI (Initiative et al. 2017) or by following already clearly established guidelines outlined in web standards, as in the case of SPARQL Protocol And RDF Query Language (SPARQL) endpoints. Such guidelines are increasingly common even in application schemes for research grants, as shown by the example of the DFG guidelines on research data (Forschungsgemeinschaft 2015a).
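As a small illustration of machine access via such endpoints, the following sketch sends a SPARQL query using the SPARQLWrapper library. The endpoint URL and the query pattern are placeholders; a real query would be written against the actual vocabulary documented by the respective repository.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Placeholder endpoint; a real infrastructure would document its SPARQL endpoint URL.
endpoint = SPARQLWrapper("https://example.org/sparql")
endpoint.setReturnFormat(JSON)

# Generic query pattern: labels of any ten resources, for illustration only.
endpoint.setQuery("""
    SELECT ?s ?label WHERE {
        ?s <http://www.w3.org/2000/01/rdf-schema#label> ?label .
    } LIMIT 10
""")

results = endpoint.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["s"]["value"], "->", binding["label"]["value"])
```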
§6.0.3. Besides technical principles, one should not forget the requirements of the different research communities. Many of the requirements mentioned above apply equally from the point of view of Assyriological research. Thus, findability and accessibility already form one of the most important requirements. Additionally, a user-friendly, at best self-explanatory interface is necessary so that anyone, from student to researcher, can easily find the texts and information needed for learning, teaching and researching. Though no longer updated, an Assyriological example is the Electronic Text Corpus of Sumerian Literature (ETCSL)[29], whose main page provides the user with technical information about the project, a manual on how to use ETCSL, a full catalogue of all texts worked into the project and, of course, the opportunity to browse the content or specifically search the corpus for words. The texts themselves can be depicted in Unicode or ASCII and are provided with a transliteration and translation, as well as information on cuneiform sources, literature and revision history. In the long term, the infrastructure needs to be continuously maintained, updated where necessary and, of course, at best financed.
§6.0.4. Another basic requirement in Digital Assyriology, and Digital Humanities in general, is that data is published as open access (Klump et al. 2006) and provided using an appropriate open license, such as the Creative Commons Licenses (Aliprandi 2011). However, the content of the data repositories and their quality assurance is an equally important aspect.
§6.0.5. How can scientists ensure that data stored in repositories can be updated when new research insights warrant such an action? Can data infrastructures be more than just data repositories, enabling collaborative work and discourse within or between research areas? Examples of cooperation and of similar problems to be solved can already be seen in computational linguistics, e.g. (Chiarcos, Khait, et al. 2018), researching how to automatically translate cuneiform languages into other languages, or in digital epigraphy (De Santis and Rossi 2018), in which several research communities try to solve commonly occurring problems and compare working methods beyond the boundaries of their research areas.
§6.0.6. When looking at currently available resources for cuneiform infrastructures, one can observe that very few resources fulfil the requirements of a typical data infrastructure. Still, we found that many resources exist that would be very suitable either to be hosted as a data infrastructure or to be integrated into existing data infrastructures, as they represent vital pieces of research in their respective fields. Besides CDLI and ORACC, which represent the most sophisticated data infrastructures to date, we consider the following resources to be interesting candidates for inclusion in data infrastructures:
- ePSD (The electronic Pennsylvania Sumerian Dictionary)[30] and ePSD2[31]
- ETCSL
- BDTNS (Database of Neo-Sumerian Texts)[32]
- Leipzig-Münchner Sumerischer Zettelkasten[33]
- Archibab[34]
- Achemenet[35]
§6.0.7. The consolidation of the data included in these repositories, but also of resources which are to date only accessible in written form, is a task we see as essential in the coming years.
§6.1. Linked Data
§6.1.1. Linked Data and the Semantic Web (Berners-Lee, Hendler, and Lassila 2001) provide an interpreted and machine-readable wealth of information from different disciplines, often summarised under the term Linked Open Data (LOD) (Bauer and Kaltenböck 2011). Linked Open Data of good quality should adhere to the 5-star principles of linked data vocabulary use (Janowicz et al. 2014); that is, data should be:
- Available on the web.
- Available as machine-readable structured data.
- Available in a non-proprietary format.
- Encoded using open standards e.g. from W3C[36].
- Linked to other data to provide context.
§6.1.2. In that sense, 5-star linked open data can be seen as an application of the FAIR data principles using linked data technologies. Another perspective on this topic is given by the Linked Open Usable Data (LOUD) principles (Sanderson 2019), which advocate for data that offers the right abstraction level for its target audience, poses as few barriers as possible for users accessing it, is at best self-explanatory, is documented with working examples and follows consistent patterns, i.e. exhibits a clear structure of its contents. We can expect the linked open data cloud to grow constantly and in various domains of knowledge which may also be relevant to the Assyriological community. In particular, Linguistic Linked Open Data (LLOD) (McCrae et al. 2016), i.e. the transformation of text corpora and dictionary resources into linked data, is a prospect which is currently being researched, standardized (Chiarcos, Pagé-Perron, et al. 2018) and used (Svärd et al. 2018) to create machine-readable and machine-interpretable sources of cuneiform texts. Moreover, we can expect a consolidation of linked data resources not only in the LLOD cloud but also in the backends of research data infrastructures, so that connections between datasets might be realised via the connection of different research data repositories in the future.
§6.2. Crowdsourcing
§6.2.1. Crowdsourcing, according to (Estellés-Arolas and González-Ladrón-de-Guevara 2012, p. 197), can be defined as "a type of participative online activity in which an individual, an institution, a nonprofit organization, or company proposes to a group of individuals of varying knowledge, heterogeneity, and number, via a flexible open call, the voluntary undertaking of a task". This means that a heterogeneous community of individuals works towards increasing the quality of a certain dataset by undertaking a set of predefined tasks. The assumption is that most participants will come to a correct assessment of the task's solution, so that the solution may be automatically or semi-automatically applied to the given dataset. Crowdsourcing also often involves an aspect of gamification (Aparicio et al. 2012), giving participants a tangible or intangible recognition for their contribution to the crowdsourcing effort.
§6.2.2. First ideas of crowdsourcing in this field, also discussed under the term citizen science, have been outlined by (Nurmikko et al. 2014). For cuneiform texts, we know that only very few people can be considered experts in interpreting their textual content. Thus, collaboration on improving the quality of already created transliterations currently relies on a small group of people who cannot handle this workload individually in the long run. Hence, it seems natural to explore how crowdsourcing approaches may help increase the quality of existing transliterations or add further information. In general, digital cuneiform representations provide many opportunities for crowdsourcing approaches, and crowdsourcing has already been applied to transliterations in other disciplines (Khapra et al. 2014). The CuneiForce approach (López-Nores et al. 2019) investigates a first application for annotating unread cuneiform tablets with a gamified design. It shows that it may be possible to separate tasks which only an Assyriologist is able to decide from tasks which may also be solved by a layperson. For example, a layperson may determine in a classification task whether a cuneiform sign is broken or not, but may not be able to verify whether a transliteration has been assigned correctly to a cuneiform sign. We can see how crowdsourcing could improve several aspects of digital cuneiform-related artefacts:
- Annotation of Machine Learning Features: Given predefined classification targets or features for a machine learning task, the Assyriological community could confirm or discard certain classifications one at a time.
- Correction of automatically assigned linguistic annotations: The Assyriologist community could be presented a line of transliterations with a question concerning a word form or a translation to verify.
- Correct annotation of sign variants: A computer-generated abstract sign variant could be presented alongside an image of said sign variant on the cuneiform tablet. The Assyriologist community could confirm or discard the given abstract representation.
- Improvement of transliterations: The Assyriologist community could be presented with transliterations to evaluate and discuss.
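As referenced above, the following toy Python sketch illustrates how yes/no judgements from such a crowdsourcing task (e.g. ”is this sign broken?”) might be aggregated by majority vote, with low-agreement items escalated to expert review. The item identifiers, answers and agreement threshold are hypothetical.

```python
# Toy sketch of aggregating crowdsourced yes/no judgements (e.g. "is this
# sign broken?") by majority vote; items with low agreement are flagged
# for expert review. Data and threshold are hypothetical.
from collections import Counter

# votes[item_id] = list of answers given by volunteers
votes = {
    "P123456_sign_07": ["broken", "broken", "intact", "broken"],
    "P123456_sign_08": ["intact", "broken"],
}

AGREEMENT_THRESHOLD = 0.75  # assumed threshold, to be tuned per task

for item, answers in votes.items():
    label, count = Counter(answers).most_common(1)[0]
    agreement = count / len(answers)
    if agreement >= AGREEMENT_THRESHOLD:
        print(f"{item}: accept '{label}' (agreement {agreement:.0%})")
    else:
        print(f"{item}: send to expert review (agreement {agreement:.0%})")
```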
§6.2.3. To offer crowdsourcing applications effectively, correctly set-up infrastructures and easy access to the tasks to be solved are prerequisites. Infrastructures such as Zooniverse (Simpson, Page, and De Roure 2014) have the task of integrating crowdsourcing approaches and the data they gather in a sensible way, while at the same time motivating participants to keep contributing. This process should be supervised by experts. Results gathered by crowdsourcing should be clearly marked as such, so as not to mislead readers into thinking that all presented data has been curated. Crowdsourcing results should be treated with caution and quality-assured before being integrated into a possibly already curated database.
§6.3. Quality Assurance
§6.3.1. Data of high quality is usually defined as data that fulfils or exceeds the requirements set by the users of the respective data (8000-8:2015 2015). Typically, a community agrees on which requirements and expected outcomes are important and writes these down as data quality standards. Methods to ensure compliance with these data quality standards, such as compliance tests, are provided alongside the defined standards.
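In the simplest case, such a compliance test can be an automatic check of a formal rule. The following Python sketch tests a hypothetical rule that every transliteration line must begin with a line label; the regular expression is purely illustrative and does not implement any official ATF specification.

```python
# Toy compliance test for a hypothetical quality rule: every transliteration
# line must start with a line label such as "1." or "2'.". The pattern is
# illustrative only and is not the official ATF specification.
import re

LINE_LABEL = re.compile(r"^\d+'?\.\s+\S")

def check_lines(lines):
    """Return the indices of lines violating the rule."""
    return [i for i, line in enumerate(lines, start=1)
            if not LINE_LABEL.match(line)]

sample = ["1. a-na E2 {d}UTU", "2. i-ru-bu-ma", "broken line without label"]
print(check_lines(sample))  # -> [3]
```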
§6.3.2. In the last subsection, we discussed how crowdsourcing might be applied to correct transliterations or their associated content. Crowdsourcing may therefore be one method of quality assurance to consider. However, crowdsourcing cannot be a solution for all data quality issues. In particular, when transliterating a text, which transliteration is the ”best” or ”most accurate” is a matter of interpretation. We therefore think that a quality control mechanism for transliterations has to include the following aspects (a sketch of one such annotation-based comment follows the list):
- Representation of the opinions and transliteration changes of different researchers concerning the textual content.
- The ability to comment on existing transliterations of annotated text or image areas.
- Community-driven approach to accepting changes in transliterations under certain circumstances.
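The second point could build on the W3C Web Annotation model (cf. Ciccarese, Soiland-Reyes, and Clark 2013). The following sketch shows how a reviewer comment on a span of a transliteration might be expressed as JSON-LD in Python; the target URL, the quoted span and the reviewer are invented placeholders.

```python
# Sketch of a reviewer comment on part of a transliteration, expressed in
# the W3C Web Annotation model as JSON-LD. The target URL, selector values
# and reviewer name are invented placeholders.
import json

annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "type": "Annotation",
    "motivation": "commenting",
    "creator": {"type": "Person", "name": "Reviewer A"},
    "body": {
        "type": "TextualBody",
        "value": "Reading of the second sign uncertain; collation suggested.",
        "format": "text/plain",
    },
    "target": {
        "source": "https://example.org/editions/P123456/obverse",
        "selector": {"type": "TextQuoteSelector", "exact": "i-ru-bu-ma"},
    },
}

print(json.dumps(annotation, indent=2))
```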
§6.3.3. In essence, the requirements for quality assurance proposed here amount to peer review and scientific discourse using annotations, as proposed in the concept of a digital scholarly edition (Sahle 2016). Inspiration might also be taken from other communities, such as the OpenReview platform (Soergel, Saunders, and Mccallum 2013). It allows public discussion of scientific work during the reviewing process via comments by assigned reviewers, peers, and the authors themselves, and it has become an essential part of one of the major machine learning conferences. Such an approach, however, also requires the willingness of the community both to provide the additional reviewing effort and to participate in such a public form of discussion. In conclusion, we recommend establishing a peer-review-based model centred around the main infrastructures. Researchers should receive recognition for reviewing already created transliterations, and infrastructures should recommend reviews based on the individual researcher’s interests and expertise. In this way, researchers may demonstrate their expertise not only by publishing papers, but also by a significant amount of review work and thus their dedication to their respective area of research. Ultimately, peer review should also be carried out to validate crowdsourcing approaches. The validation of crowdsourced data is not only a necessity; validated citizen science data is also a valuable resource which may become the source of further publications in Digital Assyriology research.
§7. Machine Learning
§7.0.1. Supervised machine learning is one of the key components behind the boost in popularity of applications of artificial intelligence in recent years, such as image recognition or smart assistants. Its application also offers opportunities for the Assyriological community. One of the main use cases is the automation of processes that until now have had to be performed entirely manually. The advantage of automation is that many artefacts can be processed in a shorter time. Automatic transliteration from images of cuneiform tablets, e.g., could significantly speed up the creation of new text editions. Another example is the automatic translation of texts, which could open the corpora to new readers who lack extensive training in the languages written in cuneiform, as well as to computational processing (Pagé-Perron 2017).
§7.0.2. A supervised machine learning system works by learning from examples. Unlike other computational approaches, it does not make its decisions based directly on manually written (or programmed) rules. Instead, the system is given a set of examples that it learns to distinguish. As a simplified example, for the transliteration of cuneiform tablets from images, a machine learning system could be given images of a hundred tablets together with their corresponding transliterations (created by a human expert). From this data, the system learns patterns that help it to reproduce the transliterations. After this so-called training phase, the machine learning system can be applied to new cuneiform tablets and automatically predict transliterations for tablets for which no manually created transliterations exist yet. In many applications in other academic fields, such automatic learning from examples has been shown to handle complex tasks more accurately than manually created rules.
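The train-then-predict cycle described here can be illustrated with a few lines of Python using scikit-learn. The sketch below reduces the problem to classifying individual signs from feature vectors and uses random placeholder data; real systems work on images or 3D scans and typically rely on deep neural networks rather than the simple classifier shown here.

```python
# Minimal illustration of the supervised train-then-predict idea with
# scikit-learn, reduced to classifying individual signs from feature
# vectors. The data is random placeholder data; real systems work on
# images or 3D scans and typically use deep neural networks.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 64))       # stand-in for image features
y = rng.integers(0, 3, size=300)     # stand-in for sign labels (3 classes)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)          # "training phase" on labelled examples
predictions = model.predict(X_test)  # applied to unseen examples
print("accuracy on held-out data:", model.score(X_test, y_test))
```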
§7.0.3. For the field of Assyriology, initial studies have been performed on different tasks related to cuneiform artefacts. We have seen that one of the first steps in creating a digital edition is the transliteration or transcription into a digital text form. Optical Character Recognition (OCR) of cuneiform has been developed for recognising text on 3D scans (Bogacz, Gertz, and Mara 2015; Seiler, Mara, and Bogacz 2021; Somel et al. 2021) and 2D images (Mousavi and Lyashenko 2017; Dencker et al. 2020) as well as for handwritten texts (Mousavi and Lyashenko 2017; Yamauchi, Yamamoto, and Mori 2018). Related work includes the automatic conversion of transliterated text into phonological transcriptions (Sahala et al. 2020) and the encoding of cuneiform sign shapes (Homburg 2021). Once the digital text is available, higher-level tasks from Natural Language Processing (NLP) can be applied. These include the automatic segmentation of text (Homburg and Chiarcos 2016), the identification of entities such as specific individuals (Liu et al. 2015; Pagé-Perron 2018; Bansal et al. 2021), the annotation of morphosyntactic labels (part-of-speech) (Sukhareva et al. 2017; Bansal et al. 2021), the identification of periods (Bogacz and Mara 2020) as well as of languages and dialects (Jauhiainen et al. 2019; Bernier-Colborne, Goutte, and Léger 2019), the reconstruction of fragmented text (Fetaya et al. 2020; Bernhard and Hedderich 2021) and machine translation (Pagé-Perron et al. 2017; Bansal et al. 2021).
§7.0.4. Beyond the automation of existing processes currently done manually, machine learning can also offer opportunities to obtain new insights based on quantitative methods. Unsupervised machine learning methods are especially popular in this regard. They have been applied in literary analysis for different languages (Moretti 2005), and a few recent works also exist on cuneiform texts. Probably the most popular approach is clustering, where similar elements are automatically grouped together (Wagner et al. 2013; Monroe 2018). (Pagé-Perron 2018) experiments with exploratory data analysis, where the interaction between researcher and machine learning method is emphasised and where expert and machine influence each other with the insights they obtain.
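As a minimal illustration of clustering, the following Python sketch groups a handful of transliterated lines by their character n-gram profiles using TF-IDF features and k-means from scikit-learn; the example lines and the number of clusters are placeholders chosen for demonstration only.

```python
# Sketch of unsupervised clustering: group transliterated lines by their
# character n-gram inventory using TF-IDF features and k-means. The example
# lines and the number of clusters are placeholders.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "1(disz) udu ki ab-ba-sa6-ga",
    "2(disz) udu ki ab-ba-sa6-ga",
    "a-na E2 {d}UTU i-ru-bu-ma",
    "a-na E2 {d}UTU il-li-ku",
]

vectors = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)).fit_transform(texts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
for text, label in zip(texts, labels):
    print(label, text)
```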
§7.1. Challenges with Machine Learning
§7.1.1. Machine learning systems require hundreds or thousands of examples, or even more, to learn from. The more complex the task, the more training data is usually required. All these examples need corresponding labels (like the manual transliterations for the images above). Obtaining such large datasets is often a significant challenge at the beginning of a new project. Open access to existing datasets could help foster research, similarly to how it does in other communities like Natural Language Processing and Computer Vision. Standardized data formats would help further, as they reduce the pre-processing effort when combining data from different sources. The lack of labelled data is by no means specific to Digital Assyriology research, and a variety of methods have been developed to handle low-resource scenarios in Natural Language Processing (NLP) (Hedderich et al. 2020). For cuneiform texts, researchers have experimented, e.g., with weak supervision (Sukhareva et al. 2017; Dencker et al. 2020) and synthetic data generation (Rusakov et al. 2019).
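One generic, low-tech cousin of synthetic data generation is data augmentation: creating additional labelled examples by perturbing the ones that already exist. The following NumPy sketch adds noise and small shifts to a placeholder sign image; it is not the GAN-based generation of Rusakov et al. (2019), merely a simple illustration of the idea.

```python
# A simpler cousin of synthetic data generation: augment a small set of
# labelled sign images by adding noise and small horizontal shifts, so that
# more training examples carry the same label. (Generic augmentation, not
# the GAN-based generation of Rusakov et al. 2019.)
import numpy as np

def augment(image, rng, n_variants=5):
    """Return noisy, slightly shifted copies of a 2D grayscale image."""
    variants = []
    for _ in range(n_variants):
        noisy = image + rng.normal(scale=0.05, size=image.shape)
        shifted = np.roll(noisy, shift=rng.integers(-2, 3), axis=1)
        variants.append(np.clip(shifted, 0.0, 1.0))
    return variants

rng = np.random.default_rng(0)
sign_image = rng.random((32, 32))   # placeholder for a real sign crop
augmented = augment(sign_image, rng)
print(len(augmented), "augmented variants created")
```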
§7.1.2. While machine learning systems can support researchers by processing large amounts of artefacts, their predictions will, in most cases, contain errors. Where a perfect system would reach close to 100%, most of the works mentioned in the previous section obtain a performance score between 50% and 90% (the specific metric depends on the task, e.g. accuracy, precision or recall). Between 10% and 50% of the predictions are therefore incorrect. Moreover, even with further research, a certain share of incorrect predictions will probably remain. (Wagner et al. 2013) accordingly describe such output as ”a working hypothesis, rather than a conclusive result”.
§7.1.3. For some research questions, mistakes in the data can be acceptable compared to not having the data at all, if the material is too large to be fully processed manually. Alternatively, a domain expert can verify the output of the machine learning system. This so-called post-editing is popular, e.g., in translation, where translators check and correct the output of machine translation systems (Koponen 2016). Even though a substantial amount of manual effort is still needed, this can be faster than a purely manual process. Additionally, the machine learning system can learn from the expert’s corrections, an approach known as human-in-the-loop or active learning (Wang et al. 2021). These procedures could also be combined with the aforementioned crowdsourcing.
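In outline, a human-in-the-loop cycle with uncertainty-based active learning could look like the following Python sketch: train a model, select the unlabelled items it is least certain about, obtain expert labels for them and retrain. Here the data is random and the ”expert” is simulated with placeholder labels.

```python
# Sketch of a human-in-the-loop (active learning) cycle: train a model,
# pick the predictions it is least certain about, ask an expert to label
# those items, and retrain. The "expert" is simulated with placeholder labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_labelled = rng.normal(size=(50, 16))
y_labelled = rng.integers(0, 2, size=50)
X_pool = rng.normal(size=(200, 16))        # unlabelled items

for round_ in range(3):
    model = LogisticRegression(max_iter=1000).fit(X_labelled, y_labelled)
    probabilities = model.predict_proba(X_pool)
    uncertainty = 1.0 - probabilities.max(axis=1)
    ask = np.argsort(uncertainty)[-10:]    # 10 most uncertain items
    expert_labels = rng.integers(0, 2, size=len(ask))  # placeholder for expert input
    X_labelled = np.vstack([X_labelled, X_pool[ask]])
    y_labelled = np.concatenate([y_labelled, expert_labels])
    X_pool = np.delete(X_pool, ask, axis=0)
    print(f"round {round_}: labelled set now has {len(y_labelled)} items")
```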
§8. Conversion of Workflows
§8.0.1. In the last sections, we presented tools that enable a digital edition of cuneiform texts as well as approaches of other disciplines to working with cuneiform text data. In this section, we would like to explore which effects these new digital tools have on a traditional edition process and how they might shape the digital edition process once the technologies have evolved further. Which additional qualifications will the Assyriologist of the future have to attain? Which tasks will likely become obsolete? Which tasks are likely to gain better support from these technologies? And what needs to change for Assyriologists to be ready for these future technologies?
§8.0.2. So far, it can be concluded that the regular workflow of Assyriological research will not change dramatically. Even in digital form, the edition of a cuneiform text still requires the working steps elaborated on above: the object has to be observed, a transliteration and translation as well as a commentary have to be created, and finally, the results of the process have to be published. Digital methods can support and simplify the process, as shown e.g. by the creation of 3D images and renderings, but are unlikely to replace any of these steps, at least at the currently visible state of technology. However, researchers working on digital editions must keep in mind that the technical requirements listed in the previous chapters need to be observed during the editorial process. To adapt Assyriological research to the technical level already possible, it seems necessary to integrate knowledge of digital methods and technologies into classical Assyriological training (see also Maiocchi 2021, p. 125). The Assyriologist of the future, in our view, should acquire data literacy (Koltay 2017), i.e. they should understand which data artefacts are created, saved, versioned, attributed and annotated, which metadata is essential for data providers to host said data, and which advantages these data representations offer to which research communities. In addition, an understanding of the role of these individual artefacts in relation to a traditional text edition should be conveyed to new students of Assyriology. Beyond a basic understanding of data literacy, the essentials of data science (Igual and Seguí 2017) might greatly improve the ability of Assyriologists to draw new conclusions from larger quantities of data and to relate their current case of research to other data repositories. This includes an introduction to statistical methods using state-of-the-art tools and to analyses that can be performed especially on text corpus data. In addition to these skills, optional or mandatory interdisciplinary courses together with other research communities working with cuneiform texts might improve both the skill set of students of Assyriology and the mutual understanding with scholars of the cooperating research areas. So far, only a few binding standards have been established in the workflow of Digital Assyriology, and it seems necessary that the digital Assyriological community collaborates more closely to define such standards. Furthermore, since personnel resources in Assyriology are limited, closer collaboration is advisable to quality-assure, discuss, criticise and further enhance the scientific discourse on already existing and emerging cuneiform resources of all media, e.g. transliterations, renderings, photos and 3D models.
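As a small example of the kind of data literacy and data science skills meant here, the following Python sketch places invented catalogue records into a pandas table and computes simple descriptive statistics per period; real analyses would of course start from exported corpus or catalogue data.

```python
# Example of a basic data-science step: put catalogue metadata into a table
# and compute simple descriptive statistics with pandas. The rows below are
# invented placeholder records.
import pandas as pd

records = pd.DataFrame([
    {"id": "P100001", "period": "Ur III",         "genre": "administrative", "lines": 12},
    {"id": "P100002", "period": "Ur III",         "genre": "administrative", "lines": 8},
    {"id": "P100003", "period": "Old Babylonian", "genre": "letter",         "lines": 25},
])

# number of texts and mean line count per period
print(records.groupby("period")["lines"].agg(["count", "mean"]))
```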
§8.1. Inter-language Mappings and Evolving Data Infrastructures
§8.1.1. In the last sections, we have described the possibilities that arise from developing machine-readable and machine-interpretable resources such as semantic dictionaries, curated sign lists, and the documentation of character variants. In essence, the foundations of interlinking cuneiform representations on a semantic level are currently being researched. The next step would be to implement these in research data infrastructures and to create tools that use the data included in these repositories. A simple application would be the lookup of registered cuneiform character variants, either by roughly drawing their shape or via other metadata describing the character representation. In this context, it should be noted that many data repositories strive to create inter-language mappings in order to provide data and its relations in different languages or to be able to translate data contents. Wikidata, for example, curates a database of lexemes that can be associated with concepts present in the Wikidata knowledge base. This is not only a valuable resource for linguistic research, but also provides the possibility to store the contents of words, characters and parts of words and characters of cuneiform languages. Relations between words in the different cuneiform languages can often only be found by extracting this information from text corpora; stating it explicitly as linked data can improve the performance of Natural Language Processing approaches. In consequence, this helps researchers of many disciplines to draw new conclusions about the distribution of certain information. More integrated linguistic data, accompanied by information about digital artefacts and geographical data, may also pave the way for data infrastructures to become essential platforms for scientific discourse.
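A lookup of such lexeme data could be sketched as a SPARQL query against Wikidata's public endpoint, as in the following Python example. The Q-identifier used for the language is an assumption that must be verified against Wikidata before use, and running the query requires network access.

```python
# Sketch of looking up lexemes of one language via Wikidata's public SPARQL
# endpoint. LANGUAGE_ITEM is an assumed Q-identifier (here intended to mean
# Sumerian) and must be verified against Wikidata before use.
import requests

LANGUAGE_ITEM = "wd:Q36790"  # assumption: Wikidata item for the Sumerian language

query = f"""
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX ontolex: <http://www.w3.org/ns/lemon/ontolex#>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX wd: <http://www.wikidata.org/entity/>

SELECT ?lexeme ?lemma WHERE {{
  ?lexeme a ontolex:LexicalEntry ;
          dct:language {LANGUAGE_ITEM} ;
          wikibase:lemma ?lemma .
}}
LIMIT 10
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "cuneiform-llod-example/0.1 (demo)"},
    timeout=30,
)
for row in response.json()["results"]["bindings"]:
    print(row["lexeme"]["value"], row["lemma"]["value"])
```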
§8.2. A Digital Processing Pipeline?
§8.2.1. Having established in the previous sections the role of platforms and possible example applications they may provide, creating easily integrable data and converting and enriching formerly created data for these data infrastructures will become a priority. As (Homburg 2019b) pointed out, not only digital infrastructures are a necessity, but also digital environments in which to conduct digital editions. These environments might be part of digital infrastructure services (e.g. to create and edit transliterations, to upload images and 3D scans and to represent them correctly in data), or data to be imported might need to be created using external toolchains. Still, depending on the needs of the respective community, tools for the local preparation of data to be submitted to data repositories are likely to play a role as well. It will be important to consolidate these efforts and to give guidance to Assyriologists as to which tools should form a best-practice digital processing pipeline and whether to distinguish such pipelines by language, discipline or other factors.
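One building block of such a local preparation toolchain might look like the following Python sketch, which chains a few simplified steps (normalisation, naive sign tokenisation, JSON export) into a small pipeline; the steps, the tokenisation rule and the output format are placeholders rather than a proposed standard.

```python
# Sketch of a small local preparation pipeline: read transliteration lines,
# normalise them, split them into sign tokens and export the result as JSON
# for upload to a repository. Steps and formats are simplified placeholders.
import json

def normalise(line: str) -> str:
    return " ".join(line.strip().split())

def tokenise(line: str) -> list[str]:
    # naive split into words and hyphen-separated signs
    return [sign for word in line.split() for sign in word.split("-")]

def run_pipeline(lines: list[str]) -> list[dict]:
    return [{"raw": line,
             "normalised": normalise(line),
             "signs": tokenise(normalise(line))}
            for line in lines]

lines = ["1. a-na  E2 {d}UTU", "2. i-ru-bu-ma"]
print(json.dumps(run_pipeline(lines), indent=2, ensure_ascii=False))
```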
§8.3. Towards the integration of machine learning
§8.3.1. In Section §7, we gave an overview of exciting machine learning approaches that could automate parts of the edition process, such as transliteration from images or the annotation of entities. While promising first steps have already been taken in that direction, we see two main issues that need to be addressed before such tools find broader acceptance in the community:
§8.3.2. The first issue is the usability of the tools. Many approaches are developed by researchers with a Computer Science and machine learning background, which can make the resulting tools difficult for other users to work with. Some of the earlier mentioned works already provide demos and graphical user interfaces. We see it as essential that further effort is put into the development, maintenance and documentation of easily usable tools, also beyond a specific publication or project. On the other hand, this also requires Assyriologists to experiment with new tools and give feedback on how they could be integrated into their workflow, even if, in the beginning, this means more effort than their established workflow.
§8.3.3. The second issue is the accuracy of the machine learning models. For most tasks, the machine learning approaches are faster, but also make many more mistakes than a human expert. As proposed above, a solution could be human-in-the-loop approaches in which a machine learning system and a human expert cooperate, and the machine improves based on the expert’s feedback. From the machine learning community, this requires the development of such approaches. The Assyriological expert, in turn, needs to invest more time in the short term, correcting the machine’s output in order to benefit from better automation in the future. Another important aspect is the willingness to publish and share data with the community. Combining data from different sources can allow the building of better machine learning systems, and these improved systems can then again benefit the members of the community.
§9. Conclusions
§9.1. In this publication, we reflected on the discussions held by experts at the workshop ”Von analog zu digital” at Mainz University in February 2021. We identified shortcomings in several stages of the digital edition process and discussed how ideal data infrastructures for storing the various digital artefacts should look and how they should be interconnected. We have shown which approaches are already underway to support the many tasks of an Assyriologist in the creation of a digital edition. With this knowledge in mind, we drafted a realistic vision of the work of a digital Assyriologist under the circumstances described in this publication. We encourage the Assyriological community, independent of the languages its members analyse, to create an organisational body for data standards and to coordinate the many points raised in this publication. With the right organisation, we can envision newly educated digital Assyriologists equipped with basic data science skills and an understanding of how assisting technologies work. These newly educated Assyriologists, in a shift similar to that from the Humanities to the Digital Humanities, would be eager to exchange their research and ideas on how to improve and extend automated tasks, assisting their own and other research communities.
§10. Acknowledgements
§10.1. We thank the German Federal Ministry of Education and Research (BMBF) and the German Rectors’ Conference (HRK) for funding the IDCS workshop and therefore making these proceedings possible. The responsibility for the content of this publication lies with the authors.
Bibliography
-
8000-8:2015, ISO. 2015. “Data Quality - Part 8: Information and Data Quality: Concepts and Measuring.” Standard. Geneva, CH: International Organization for Standardization. https://www.iso.org/obp/ui/#iso:std:iso:8000:-8:ed-1:v1:en.
-
Aliprandi, Simone. 2011. Creative Commons: A User Guide. A Complete Manual with a Theoretical Introduction and Practical Suggestions. Ledizioni.
-
Global Indigenous Data Alliance (GIDA). 2019. “CARE Principles for Indigenous Data Governance.” https://www.gida-global.org/care.
-
Aparicio, Andrés Francisco, Francisco Luis Gutiérrez Vela, José Luis González Sánchez, and José Luis Isla Montes. 2012. “Analysis and Application of Gamification.” In Proceedings of the 13th International Conference on Interacción Persona-Ordenador, 1–2.
-
Baker, Heather D., Christian Chiarcos, Robert K. Englund, Ilya Khait, Émilie Pagé-Perron, and Maria Sukhareva. 2017. “Machine Translation and Automated Analysis of Cuneiform Languages (MTAAC).” https://hcommons.org/deposits/item/hc:12751.
-
Ball, Alex, and Mansur Darlington. 2007. “Briefing Paper: The Adobe eXtensible Metadata Platform (XMP).” UKOLN Research Organization.
-
Bansal, Rachit, Himanshu Choudhary, Ravneet Punia, Niko Schenk, Émilie Pagé-Perron, and Jacob Dahl. 2021. “How Low Is Too Low? A Computational Perspective on Extremely Low-Resource Languages.” In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop, 44–59. Online: Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.acl-srw.5.
-
Bański, Piotr, Jack Bowers, and Tomaz Erjavec. 2017. “TEI-Lex0 Guidelines for the Encoding of Dictionary Information on Written and Spoken Forms.” In Electronic Lexicography in the 21st Century: Proceedings of eLex 2017 Conference.
-
Bauer, Florian, and Martin Kaltenböck. 2011. “Linked Open Data: The Essentials.” Edition Mono/Monochrom, Vienna 710.
-
Berners-Lee, Tim, James Hendler, and Ora Lassila. 2001. “The Semantic Web.” Scientific American 284 (5): 34–43.
-
Bernhard, Johannes, and Michael A. Hedderich. 2021. “Rekonstruktion von Fragmentierten Dokumenten Mit NLP.” Hypotheses. https://idcs.hypotheses.org/216.
-
Bernier-Colborne, Gabriel, Cyril Goutte, and Serge Léger. 2019. “Improving Cuneiform Language Identification with BERT.” In Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects, 17–25. Ann Arbor, Michigan: Association for Computational Linguistics. https://doi.org/10.18653/v1/W19-1402.
-
Bogacz, Bartosz, Michael Gertz, and Hubert Mara. 2015. “Character Retrieval of Vectorized Cuneiform Script.” In 2015 13th International Conference on Document Analysis and Recognition (ICDAR), 326–30. IEEE.
-
Bogacz, Bartosz, and Hubert Mara. 2020. “Period Classification of 3D Cuneiform Tablets with Geometric Neural Networks.” In 2020 17th International Conference on Frontiers in Handwriting Recognition (ICFHR), 246–51. IEEE.
-
Borger, Rykle. 2010. Mesopotamisches Zeichenlexikon. Zweite, Revidierte Und Aktualisierte Auflage. Alter Orient und Altes Testament (AOAT) 305.
-
Brandes, Tim, and Eva-Maria Huber. 2020. “Die Texte Aus Haft Tappeh – Beobachtungen Zu Den Textfunden Aus Areal I.” Elamica 10: 9–41.
-
Budin, Gerhard, Stefan Majewski, and Karlheinz Mörth. 2012. “Creating Lexical Resources in TEI P5. a Schema for Multi-Purpose Digital Dictionaries.” Journal of the Text Encoding Initiative, no. 3.
-
Burnard, Lou. 2020. “What Is TEI Conformance, and Why Should You Care?” Journal of the Text Encoding Initiative, no. 12.
-
Charpin, Dominique. 2014. “The Assyriologist and the Computer: The «Archibab» Project.” Hebrew Bible and Ancient Israel 3 (1): 137–53.
-
Chiarcos, Christian, Ilya Khait, Émilie Pagé-Perron, Niko Schenk, Christian Fäth, Julius Steuer, William Mcgrath, Jinyan Wang, and others. 2018. “Annotating a Low-Resource Language with LLOD Technology: Sumerian Morphology and Syntax.” Information 9 (11): 290.
-
Chiarcos, Christian, Émilie Pagé-Perron, Ilya Khait, Niko Schenk, and Lucas Reckling. 2018. “Towards a Linked Open Data Edition of Sumerian Corpora.” In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
-
Chiarcos, Christian, and Maria Sukhareva. 2015. “Olia-Ontologies of Linguistic Annotation.” Semantic Web 6 (4): 379–86.
-
Ciccarese, Paolo, Stian Soiland-Reyes, and Tim Clark. 2013. “Web Annotation as a First-Class Object.” IEEE Internet Computing 17 (6): 71–75.
-
CIDOC CRM. 2003. “The CIDOC Conceptual Reference Model.” (2003-10). http://www.cidoc-crm.org.
-
Coburn, Erin, Richard Light, Gordon McKenna, Regine Stein, and Axel Vitzthum. 2010. “LIDO-Lightweight Information Describing Objects Version 1.0.” ICOM International Committee of Museums. http://lido-schema.org/schema/v1.0/lido-v1.0-specification.pdf.
-
Crüsemann, Nicola, et al., eds. 2013. Uruk: 5000 Jahre Megacity; Begleitband Zur Ausstellung “Uruk - 5000 Jahre Megacity” Im Pergamonmuseum, Staatliche Museen Zu Berlin in Den Reiss-Engelhorn-Museen Mannheim.
-
De Santis, Annamaria, and Irene Rossi. 2018. Crossing Experiences in Digital Epigraphy: From Practice to Discipline. De Gruyter Open.
-
Dencker, Tobias, Pablo Klinkisch, Stefan M. Maul, and Björn Ommer. 2020. “Deep Learning of Cuneiform Sign Detection with Weak Supervision Using Transliteration Alignment.” Plos One 15 (12): e0243039.
-
DeRose, Steven. 1999. “XML and the TEI.” Computers and the Humanities 33 (1): 11–30.
-
Doerr, Martin, and Maria Theodoridou. 2014. “CRMdig an Extension of CIDOC-CRM to Support Provenance Metadata.” Techreport. Technical Report 3.2. Heraklion: ICS-FORTH.
-
Druskat, Stephan, M. Gruenpeter, N. Chue Hong, and others. 2017. “Citation File Format-CFF.” Zenodo2017 10.
-
Edzard, Dietz Otto. 1976-1980. “Keilschrift.” Reallexikon Der Assyriologie Und Vorderasiatischen Archäologie (RlA) 5: 544–68.
-
Endesfelder, Marc. 2021. “Urukagina: Ein Framework Für Voll Maschinenlesbare Keilschriftkorpora.” Hypotheses. https://idcs.hypotheses.org/318.
-
Estellés-Arolas, Enrique, and Fernando González-Ladrón-de Guevara. 2012. “Towards an Integrated Crowdsourcing Definition.” Journal of Information Science 38 (2): 189–200.
-
Fetaya, Ethan, Yonatan Lifshitz, Elad Aaron, and Shai Gordin. 2020. “Restoration of Fragmentary Babylonian Texts Using Recurrent Neural Networks.” Proceedings of the National Academy of Sciences 117 (37): 22743–51.
-
Fischer, Franz. 2014. “Kriterienkatalog Für Die Besprechung Digitaler Editionen.” https://www.i-d-e.de/publikationen/weitereschriften/kriterien-version-1-1/.
-
Deutsche Forschungsgemeinschaft. 2015a. “DFG Guidelines on the Handling of Research Data.” https://www.dfg.de/download/pdf/foerderung/grundlagen_dfg_foerderung/forschungsdaten/guidelines_research_data.pdf.
-
———. 2015b. “Förderkriterien Für Wissenschaftliche Editionen in Der Literaturwissenschaft.” https://www.dfg.de/download/pdf/foerderung/grundlagen_dfg_foerderung/informationen_fachwissenschaften/geisteswissenschaften/foerderkriterien_editionen_literaturwissenschaft.pdf.
-
Gabler, Hans Walter. 2010. “Theorizing the Digital Scholarly Edition.” Literature Compass 7 (2): 43–56. https://doi.org/https://doi.org/10.1111/j.1741-4113.2009.00675.x.
-
Glass, Andrew, Ingelore Hafemann, Mark-Jan Nederhof, Stéphane Polis, Bob Richmond, Serge Rosmorduc, and Simon Schweitzer. 2017. “A Method for Encoding Egyptian Quadrats in Unicode.” Unicode Proposal. https://unicode.org/wg2/docs/n4818-quadrat-encoding.pdf.
-
Hahn, Daniel V., Donald D. Duncan, Kevin C. Baldwin, Jonathon D. Cohen, and Budirijanto Purnomo. 2006. “Digital Hammurabi: Design and Development of a 3D Scanner for Cuneiform Tablets.” In Three-Dimensional Image Capture and Applications VII, 6056:60560E. International Society for Optics.
-
Hajo, Cathy Moran. 2002. “Minimum Standards for Electronic Editions.” https://www.documentaryediting.org/wordpress/?page_id=508.
-
Hassner, Tal, Robert Sablatnig, Dominique Stutzmann, and Ségoléne Tarte. 2014. “Digital Palaeography: New Machines and Old Texts (Dagstuhl Seminar 14302).” In Dagstuhl Reports. Vol. 4. Schloss Dagstuhl-Leibniz-Zentrum für Informatik.
-
Hedderich, Michael A., Lukas Lange, Heike Adel, Jannik Strötgen, and Dietrich Klakow. 2020. “A Survey on Recent Approaches for Natural Language Processing in Low-Resource Scenarios.” Proceedings of the 2021 Conference of the North American Chapter of the ACL (NAACL-HLT). https://arxiv.org/abs/2010.12309.
-
Homburg, Timo. 2019a. “Paleo Codage - A Machine-Readable Way to Describe Cuneiform Characters Paleographically.” In DH 2019. Utrecht, Netherlands. https://dev.clariah.nl/files/dh2019/boa/0259.html.
-
———. 2019b. “Towards Creating A Best Practice Digital Processing Pipeline For Cuneiform Languages.” In DH 2019. Utrecht, Netherlands. https://dev.clariah.nl/files/dh2019/boa/1204.html.
-
———. 2020. “Towards Paleographic Linked Open Data (PLOD): A General Vocabulary to Describe Paleographic Features.” In 15th Annual International Conference of the Alliance of Digital Humanities Organizations, DH 2020, Ottawa, Canada, July 20-25, 2020, Conference Abstracts, edited by Laura Estill and Jennifer Guiliano. https://dh2020.adho.org/wp-content/uploads/2020/07/369%5C_TowardsPaleographicLinkedOpenDataPLODAgeneralvocabularytodescribepaleographicfeatures.html.
-
———. 2021. “PaleoCodage - Enhancing Machine-Readable Cuneiform Descriptions Using a Machine-Readable Paleographic Encoding.” Digital Scholarship in the Humanities. https://doi.org/10.1093/llc/fqab038.
-
Homburg, Timo, and Christian Chiarcos. 2016. “Word Segmentation for Akkadian Cuneiform.” In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), 4067–74.
-
Homburg, Timo, Anja Cramer, Laura Raddatz, and Hubert Mara. 2021. “Metadata Schema and Ontology for Capturing and Processing of 3D Cultural Heritage Objects.” Heritage Science.
-
Homburg, Timo, Anne Klammt, Hubert Mara, Clemens Schmid, Sophie Charlotte Schmidt, Florian Thiery, and Martina Trognitz. 2021. “Diskussionsbeitrag - Handreichung zur Rezension von Forschungssoftware in den Altertumswissenschaften / Impulse - Recommendations for the review of archaeological research software.” Archäologische Informationen. https://doi.org/10.11588/ai.2020.1.81422.
-
Igual, Laura, and Santi Seguí. 2017. “Introduction to Data Science.” In Introduction to Data Science, 1–4. Springer.
-
OpenAPI Initiative, and others. 2017. “OpenAPI Specification.” Retrieved from GitHub 1. https://github.com/OAI/OpenAPI-Specification/blob/master/versions/3.0.
-
Janowicz, Krzysztof, Pascal Hitzler, Benjamin Adams, Dave Kolas, and Charles Vardeman II. 2014. “Five Stars of Linked Data Vocabulary Use.” Semantic Web 5 (3): 173–76.
-
Jauhiainen, Tommi, Heidi Jauhiainen, Tero Alstola, and Krister Lindén. 2019. “Language and Dialect Identification of Cuneiform Texts.” In Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects, 89–98. Ann Arbor, Michigan: Association for Computational Linguistics. https://doi.org/10.18653/v1/W19-1409.
-
Khapra, Mitesh M., Ananthakrishnan Ramanathan, Anoop Kunchukuttan, Karthik Visweswariah, and Pushpak Bhattacharyya. 2014. “When Transliteration Met Crowdsourcing: An Empirical Study of Transliteration via Crowdsourcing Using Efficient, Non-Redundant and Fair Quality Control.” In LREC, 196–202. Citeseer.
-
Klump, Jens, Roland Bertelmann, Jan Brase, Michael Diepenbroek, Hannes Grobe, Heinke Höck, Michael Lautenschlager, Uwe Schindler, Irina Sens, and Joachim Wächter. 2006. “Data Publication in the Open Access Initiative.” Data Science Journal 5: 79–83.
-
Kluyver, Thomas, Benjamin Ragan-Kelley, Fernando Pérez, Brian E Granger, Matthias Bussonnier, Jonathan Frederic, Kyle Kelley, et al. 2016. Jupyter Notebooks-a Publishing Format for Reproducible Computational Workflows. Vol. 2016.
-
Koltay, Tibor. 2017. “Data Literacy for Researchers and Data Librarians.” Journal of Librarianship and Information Science 49 (1): 3–14.
-
Koponen, Maarit. 2016. “Is Machine Translation Post-Editing Worth the Effort?: A Survey of Research into Post-Editing and Effort.” The Journal of Specialised Translation, no. 25: 131–48.
-
Labat, René. 1995. Manuel d’épigraphie Akkadienne. Signes. Syllabaire, Idéogrammes.
-
Lin, Dawei, Jonathan Crabtree, Ingrid Dillo, Robert R Downs, Rorie Edmunds, David Giaretta, Marisa De Giusti, et al. 2020. “The TRUST Principles for Digital Repositories.” Scientific Data 7 (1): 1–5.
-
Lin, Yi-Hsuan, Tung-Mei Ko, Tyng-Ruey Chuang, and Kwei-Jay Lin. 2006. “Open Source Licenses and the Creative Commons Framework: License Selection and Comparison.” Journal of Information Science and Engineering 22 (1): 1–17.
-
Liu, Yudong, Clinton Burkhart, James Hearne, and Liang Luo. 2015. “Enhancing Sumerian Lemmatization by Unsupervised Named-Entity Recognition.” In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1446–51.
-
López-Nores, Martín, Juan Luis Montero-Fenollós, Marta Rodríguez-Sampayo, José Juan Pazos-Arias, Silvia González-Soutelo, and Susana Reboreda-Morillo. 2019. “CuneiForce: Involving the Crowd in the Annotation of Unread Mesopotamian Cuneiform Tablets Through a Gamified Design.” In Conference on E-Business, e-Services and e-Society, 158–63. Springer.
-
Lucas, Gavin. 2012. Understanding the Archaeological Record. Cambridge University Press.
-
Maiocchi, Massimo. 2021. “Current Approaches towards Ancient Near Eastern Textual Sources: Some Remarks on Contemporary Methodologies for Philological Research.” In dNisaba Za3-Mi2. Ancient Near Eastern Studies in Honor of Francesco Pomponio. Dubsar 19, edited by Palmiro Notizia et al., 117–27.
-
Mara, Hubert. 2019. “HeiCuBeDa Hilprecht - Heidelberg Cuneiform Benchmark Dataset for the Hilprecht Collection.” heiDATA. https://doi.org/10.11588/data/IE8CCN.
-
Marzahn, Joachim. 2010. “Zur Wahrnehmung Babylons in Der Öffentlichkeit.” Mitteilungen Der Deutschen Orient-Gesellschaft Zu Berlin (MDOG) 142: 181–89.
-
Massé, Mark. 2011. REST API. Design Rulebook: Designing Consistent RESTful Web Service Interfaces. O’Reilly Media, Inc.
-
McCrae, John P., Julia Bosque-Gil, Jorge Gracia, Paul Buitelaar, and Philipp Cimiano. 2017. “The Ontolex-Lemon Model: Development and Applications.” In Proceedings of eLex 2017 Conference, 19–21.
-
McCrae, John Philip, Christian Chiarcos, Francis Bond, Philipp Cimiano, Thierry Declerck, Gerard De Melo, Jorge Gracia, et al. 2016. “The Open Linguistics Working Group: Developing the Linguistic Linked Open Data Cloud.” In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), 2435–41.
-
Mittermayer, Catherine. 2006. Altbabylonische Zeichenliste Der Sumerisch-Literarischen Texte. Academic Press/Vandenhoeck & Ruprecht.
-
Monroe, M. Willis. 2018. “Using Quantitative Methods for Measuring Inter-Textual Relations in Cuneiform.” Digital Biblical Studies, 257.
-
Moretti, Franco. 2005. Graphs, Maps, Trees: Abstract Models for a Literary History. Verso.
-
Mousavi, Seyed Muhammad Hossein, and Vyacheslav Lyashenko. 2017. “Extracting Old Persian Cuneiform Font out of Noisy Images (Handwritten or Inscription).” In 2017 10th Iranian Conference on Machine Vision and Image Processing (MVIP), 241–46. IEEE.
-
Müller, Gerfrid G.W., and Daniel Schwemer. 2018. “Hethitologie-Portal Mainz (HPM). A Digital Infrastructure for Hittitology and Related Fields in Ancient Near Eastern Studies.” De Gruyter Open Poland. https://doi.org/10.1515/9783110607208-014.
-
Nederhof, Mark-Jan, Stéphane Polis, and Serge Rosmorduc. 2019. “Unicode Control Characters for Ancient Egyptian.” In Proceedings of the International Congress of Egyptologists XII, 14. IFAO.
-
Nurmikko, Terhi, Jacob Dahl, Nicholas Gibbins, and Graeme Earl. 2012. “Citizen Science for Cuneiform Studies.” In ACM Web Science 2012 (23/06/12). https://eprints.soton.ac.uk/341015/.
-
Pagé-Perron, Émilie. 2017. “Expanding Digital Assyriology With Open Access and Machine Learning.” Digital Humanities Quarterly (in Press).
-
———. 2018. “Network Analysis for Reproducible Research on Large Administrative Cuneiform Corpora.” Digital Biblical Studies, 194.
-
Pagé-Perron, Émilie, Maria Sukhareva, Ilya Khait, and Christian Chiarcos. 2017. “Machine Translation and Automated Analysis of the Sumerian Language.” In Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, 10–16. Vancouver, Canada: Association for Computational Linguistics. https://doi.org/10.18653/v1/W17-2202.
-
Panayotov, Strahil V. 2015. “The Gottstein System Implemented on a Digital Middle and Neo-Assyrian Palaeography.” Cuneiform Digital Library Notes 17.
-
Paskin, Norman. 2010. “Digital Object Identifier (DOI®) System.” Encyclopedia of Library and Information Sciences 3: 1586–92.
-
Pirngruber, Reinhard. 2019. “Cuneiform Palaeography in First Millennium BC Babylonia.” In Current Research in Cuneiform Palaeography 2. Proceedings of the Workshop Organised at the 64th Rencontre Assyriologique Internationale Innsbruck 2018, edited by Elena Devecchi et al., 157–75.
-
Prosser, Miller C, and Sandra R Schloen. 2021. “The Power of OCHRE's Highly Atomic Graph Database Model for the Creation and Curation of Digital Text Editions.” In Spadini et al., 55–72.
-
Rattenborg, Rune. 2019. “Cuneiform Site Index (CSI): A Gazetteer of Findspots for Cuneiform Texts in the Eastern Mediterranean and the Middle East.” https://ancientworldonline.blogspot.com/2019/12/cuneiform-site-index-csi-gazetteer-of.html.
-
Robson, Eleanor, M. Rutz, and M. Kersel. 2014. “Tracing Networks of Cuneiform Scholarship with Oracc, GKAB and Google Earth.” Archaeologies of Text. Archaeology, Technology, and Ethics, 142–63.
-
Rusakov, Eugen, Kai Brandenbusch, Denis Fisseler, Turna Somel, Gernot A. Fink, Frank Weichert, and Gerfrid G.W. Müller. 2019. “Generating Cuneiform Signs with Cycle-Consistent Adversarial Networks.” In Proceedings of the 5th International Workshop on Historical Document Imaging and Processing, 19–24.
-
Sahala, Aleksi, Miikka Silfverberg, Antti Arppe, and Krister Lindén. 2020. “Automated Phonological Transcription of Akkadian Cuneiform Text.” In Proceedings of the 12th Language Resources and Evaluation Conference, 3528–34. Marseille, France: European Language Resources Association. https://www.aclweb.org/anthology/2020.lrec-1.433.
-
Sahle, Patrick. 2016. “What Is a Scholarly Digital Edition.” Digital Scholarly Editing: Theories and Practices 1: 19–39.
-
Sanderson, Rob. 2019. “LOUD: Linked Open Usable Data.” https://linked.art/loud/.
-
Seiler, Martin, Hubert Mara, and Bartosz Bogacz. 2021. “Large Scale Wedge Extraction.” Hypotheses. https://idcs.hypotheses.org/248.
-
Sichani, Anna-Maria, and Elena Spadini. 2018. “Criteria for Reviewing Tools and Environments for Digital Scholarly Editing, Version 1.0.” https://www.i-d-e.de/publikationen/weitereschriften/criteria-tools-version-1/.
-
Simpson, Robert, Kevin R Page, and David De Roure. 2014. “Zooniverse: Observing the World’s Largest Citizen Science Platform.” In Proceedings of the 23rd International Conference on World Wide Web, 1049–54.
-
Snydman, Stuart, Robert Sanderson, and Tom Cramer. 2015. “The International Image Interoperability Framework (IIIF): A Community & Technology Approach for Web-Based Images.” In Archiving Conference, 2015:16–21. Society for Imaging Science.
-
Soergel, David, Adam Saunders, and Andrew Mccallum. 2013. “Open Scholarship and Peer Review: A Time for Experimentation.” Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA 28. https://openreview.net/pdf?id=xf0zSBd2iufMg.
-
Somel, Turna, Eugen Rusakov, Christopher Rest, Denis Fisseler, Gerfrid G.W. Müller, Gernot A. Fink, and Frank Weichert. 2021. “Zeichenerkennung Mithilfe Der Künstlichen Intelligenz – Computer-Unterstützte Keilschriftanalyse.” Hypotheses. https://idcs.hypotheses.org/283.
-
Sukhareva, Maria, Francesco Fuscagni, Johannes Daxenberger, Susanne Görke, Doris Prechel, and Iryna Gurevych. 2017. “Distantly Supervised POS Tagging of Low-Resource Languages under Extreme Data Sparsity: The Case of Hittite.” In Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, 95–104.
-
Svärd, Saana, Heidi Jauhiainen, Aleksi Sahala, and Krister Lindén. 2018. “Semantic Domains in Akkadian Texts.” Digital Biblical Studies, 224.
-
Tachibanaya, Tsurozoh. 2001. “Description of Exif File Format.” http://park2.wakwak.com/tsuruzoh/Computer/Digicams/exif-e.html.
-
Wagner, Allon, Yuval Levavi, Siram Kedar, Kathleen Abraham, Yoram Cohen, and Ran Zadok. 2013. “Quantitative Social Network Analysis (SNA) and the Study of Cuneiform Archives: A Test-Case Based on the Murašû Archive.” Akkadica 134 (2): 117–34.
-
Wang, Zijie J., Dongjin Choi, Shenyu Xu, and Diyi Yang. 2021. “Putting Humans in the Natural Language Processing Loop: A Survey.” In Proceedings of the First Workshop on Bridging Human-Computer Interaction and Natural Language Processing, 47–52. Online: Association for Computational Linguistics. https://www.aclweb.org/anthology/2021.hcinlp-1.8.
-
Wilkinson, Mark D, Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, et al. 2016. “The FAIR Guiding Principles for Scientific Data Management and Stewardship.” Scientific Data 3 (1): 1–9.
-
Yamauchi, Kenji, Hajime Yamamoto, and Wakaha Mori. 2018. “Building a Handwritten Cuneiform Character Imageset.” In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 719–22.
Abbreviations
AHw Akkadisches Handwörterbuch. von Soden 1965-1981. 12
API Application Programming Interface. 14, 30
ASCII American Standard Code for Information Interchange. 3, 11, 28, 30
ATF ASCII Transliteration Format. 3, 11, 30
CAD The Assyrian Dictionary of the Oriental Institute of the University of Chicago. CAD 1964-2010. 12
CDLI Cuneiform Digital Library Initiative. 12
CIDOC Comité international pour la documentation. 4
CIDOC-CRM CIDOC Conceptual Reference Model. 4, 30
CRMdig CRMdigital. 4
DOI Digital Object Identifier. 6, 30
ETCSL Electronic Text Corpus of Sumerian Literature. 6
EXIF Exchangeable Image File Format. 5, 30
FAIR Findable, Accessible, Interoperable, Reusable. 14
IDCS Initiative for Digital Cuneiform Studies. 1, 21
IIIF International Image Interoperability Framework. 6, 30
JSON JavaScript Object Notation. 3, 28
JTF JSON Transliteration Format. 3
Lemon Lexicon Model for Ontologies. 12, 30
LIDO Lightweight Information Describing Objects. 4, 30
LLOD Linguistic Linked Open Data. 16, 30
LOD Linked Open Data. 15
LOUD Linked Open Usable Data. 16, 30
NLP Natural Language Processing. 18, 30
OCR Optical Character Recognition. 18, 30
OliA Ontologies of Linguistic Annotation. 11
RDF Resource Description Framework. 14, 28, 30
REST Representational State Transfer. 14, 30
SPARQL SPARQL Protocol And RDF Query Language. 14, 30
TEI Text Encoding Initiative. 3
TRUST Transparency, Responsibility, User Focus, Sustainability, Technology. 14
W3C World Wide Web Consortium. 11
XML Extensible Markup Language. 12
XMP Extensible Metadata Platform. 5, 6, 30
Glossary
Application Programming Interface (API) An application programming interface (API) is a connection between computers or between computer programs. It allows a program to request a service or functionality from another program in a standardized way. Editorial software could, e.g., request the conversion of an image into a transliteration from a separate, specialized transliteration program via an API. 14
ASCII Transliteration Format (ATF) The ASCII Transliteration Format is a data format used by CDLI and ORACC to represent transliterations. Besides the plain text, it specifies additional commands to mark, e.g., broken elements or whether text belongs to the obverse or reverse side of an object. 11
CIDOC Conceptual Reference Model (CIDOC-CRM) The CIDOC Conceptual Reference Model is a linked data vocabulary and ontology to model metadata about archaeological objects. 4
Digital Object Identifier (DOI) A digital object identifier (DOI) is a handle (such as doi:10.1000/182) that allows to uniquely and persistently identify objects such as publications or data sets. It is standardized by the International Organization for Standardization (ISO). 6
Exchangeable Image File Format (EXIF) The Exchangeable Image File Format is a format that defines technical metadata for several common image formats. It allows a camera or a user to store additional information along with an image such as when the image was created, camera settings or GPS location. 6
International Image Interoperability Framework (IIIF) The International Image Interoperability Framework (IIIF) is a web API standard for sharing images online. 6
Lexicon Model for Ontologies (Lemon) The Lexicon Model for Ontologies is a vocabulary for modeling dictionary resources in linked data. 12
Lightweight Information Describing Objects (LIDO) Lightweight Information Describing Objects is an XML-based data format used to describe the metadata of museum and collection objects. 4
Linguistic Linked Open Data (LLOD) Linguistic Linked Open Data describes a part of the Semantic Web which deals with the representation of linguistic terminology, dictionaries and other natural language text resources. 16
Linked Open Usable Data (LOUD) Linked Open Usable Data describes a part of the Semantic Web which is sufficiently accessible to ordinary users for certain use cases. 15
Natural Language Processing (NLP) Natural Language Processing sits at the intersection of linguistics and computer science. Its goal is the automatic analysis and processing of typically large amounts of natural language text. 18
Optical Character Recognition (OCR) Optical Character Recognition describes the (automated) conversion of images with textual content into digital text. 18
Representational State Transfer (REST) Representational State Transfer is a standardized way to create software services used in the World Wide Web. 14
SPARQL Protocol And RDF Query Language (SPARQL) SPARQL is a query language which can be used to query linked data. A query can e.g. retrieve all entries with a specific name from the linked data. 14
Extensible Metadata Platform (XMP) Extensible Metadata Platform is a metadata standard to add information about e.g. photographers, the photo equipment and other related information to digital media such as images. 6
Transcription In cuneiform studies a transcription (or bound transcription) means merging the distinguished signs of a transliteration into bound words according to the grammar of the respective language. Example: a-na É dUTU i-ru-bu-ma (transliteration) // ana bīt Šamaš īrubūma (transcription). 18
Transliteration A transliteration is a sign-by-sign representation of a cuneiform text in the Latin alphabet. Additionally, the signs thought to form a word are connected by hyphens, meaning that every transliteration simultaneously interprets the text. Word signs, syllables, determinatives and, if necessary, foreign words are distinguished through different formats. Example: a-na É dUTU i-ru-bu-ma – "They entered the temple of Šamaš". 18
Footnotes
- [1] https://idcs.hypotheses.org/category/workshop-abstracts.
- [2] Cuneiform Digital Library Initiative: https://cdli.ucla.edu/; with a list of related work and up-to-date online resources https://cdli.ucla.edu/related-projects.
- [3] Open Richly Annotated Cuneiform Corpus: http://oracc.museum.upenn.edu/; with a detailed list of included projects, see http://oracc.museum.upenn.edu/projectlist.html.
- [4] The Munich Open-access Cuneiform Corpus Initiative: https://www.ag.geschichte.uni-muenchen.de/forschung/forschprojekte/mocci-deu/index.html.
- [5] Hethitologie Portal Mainz: https://www.hethport.uni-wuerzburg.de/HPM/index.php with an excellent infrastructure, see also (Müller and Schwemer 2018).
- [6] Electronic Text Corpus of Sumerian Literature: https://etcsl.orinst.ox.ac.uk/.
- [7] Archives babyloniennes XXe-XVIIe siècles av. J.-C.: https://www.archibab.fr/; Charpin 2014.
- [8] Machine Translation and Automated Analysis of Cuneiform Languages: https://cdli-gh.github.io/mtaac/; (Baker et al. 2017).
- [9] Ancient Near Eastern Empires: https://www2.helsinki.fi/en/researchgroups/ancient-near-eastern-empires.
- [10] http://www.achemenet.com/.
- [11] This is only a selection of important platforms and cuneiform repositories; the list of projects is by no means complete.
- [12] German perspective; cf. (Marzahn 2010, 181–189).
- [13] http://oracc.museum.upenn.edu/doc/help/editinginatf/cdliatf/index.html
- [14] https://github.com/cdli-gh/jtf-lib
- [15] https://github.com/Nino-cunei/oldbabylonian.
- [16] The presented metadata are based on the data provided by CDLI and ORACC.
- [17] https://www.ao.altertumswissenschaften.uni-mainz.de/digitale-edition-der-keilschrifttexte-aus-haft-tappeh/.
- [18] https://iiif.io/community/groups/3d/.
- [19] https://etcsl.orinst.ox.ac.uk/edition2/etcslmanual.php.
- [20] http://xml.coverpages.org/xmlMarkupANE.html#birmingham.
- [21] See the Late Babylonian Signs project (LaBaSi) (https://labasi.acdh.oeaw.ac.at/) and (Pirngruber 2019).
- [22] https://github.com/tosaja/Nuolenna/blob/master/signlist.txt.
- [23] http://cuneifyplus.arch.cam.ac.uk.
- [24] http://oracc.museum.upenn.edu/doc/help/languages/sumerian/index.html
- [25] https://www.Assyriologie.uni-muenchen.de/forschung/forschungsprojekte/sumglossar/zettelkasten200609.pdf.
- [26] http://psd.museum.upenn.edu/nepsd-frame.html.
- [27] https://corpus.writing-sumerian.Assyriologie.uni-muenchen.de/.
- [28] https://journals.ub.uni-heidelberg.de/index.php/ckit.
- [29] https://etcsl.orinst.ox.ac.uk/.
- [30] http://psd.museum.upenn.edu/nepsd-frame.html.
- [31] http://oracc.museum.upenn.edu/epsd2/sux.
- [32] http://bdtns.filol.csic.es/.
- [33] https://www.Assyriologie.uni-muenchen.de/forschung/forschungsprojekte/sumglossar/zettelkasten200609.pdf.
- [34] https://www.archibab.fr/.
- [35] http://www.achemenet.com/.
- [36] https://www.w3.org.
Version: 2023-10-16