To start exploring the database, please visit the database section.

The database offers a selection of Sumerian texts and terms related to writing and accounting (such as "scribe", "tablet", "seal", "to count", etc.). The texts have been extracted from a large digitized corpus using ad hoc data mining algorithms. The program evaluates the relevance of each text on the basis of two factors: the total number of attestations of interesting words in that text, and the weight attached to each of the selected words. Not all lexemes are equally relevant to the scope of the project. For instance, the word for "scribe" is central, whereas the one for "to go around" is relevant only if it involves scribes and accountants. Accordingly, each lemma in the list of interesting terms is assigned a weight from 1 to 5 (1 = poorly relevant, 5 = very relevant).

The Sumerian terms relevant to the research have been established manually, i.e. by producing an authoritative list of lemmas (including individual spellings and morphological forms). The list must be organized as a spreadsheet with the following format (only the first couple of entries are shown here):
#lemma_id | #class | #spellings | #senses | #akkadian | #phon_glosses | #bibliogr. | #periods | #freq | #url | #ranking | #notes | #forms
bisaŋdubak [ARCHIVIST] | (N) | bisaŋ-dub-ba, bisaŋ-dub, bisaŋ-dub-ba-a, pi-ša3-ad-ba-ar-ra, ša₁₃-dub-ba-a | archivist | šandabakku | - | ... | Old Akkadian, Lagash II, Ur III, Old Babylonian, Middle Babylonian | 1829x | http://oracc.org/epsd2/o0025217 | xxxxx | - | ša13-dub-e, pisaŋ-dub-ba, ša13-dub-ba-ka-na, bisaŋ-dub, ša13-dub-ba-ka, bisaŋ-dub-ba-ra, ša13-dub-ba-a, ...
hur[SCRATCH] | (V/t) | ḫur, ḫur-ḫur, ara3a-ra-ra, ḫuḫur, ara3-ra, ara3a-ra-ra, ḫuḫur | to scratch, draw | eṣēru | ara3a-ra-ra; ḫ uḫ ur | ... | Lagash II, Old Babylonian, Middle Babylonian, Neo-Assyrian, Neo-Babylonian, Hellenistic | 114x | http://oracc.org/epsd2/o0030406 | xxxx | hur-ra, hur-ra-za, a-ra-an-hur-hur-re, bi2-hur, hur-hur, mu-un-hur, u3-mu-ni-hur, hur-ra-gin7, ... | -
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
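
The relevance evaluation described before the table can be illustrated with a minimal sketch. The exact formula used by the program is not stated here, so the code below assumes the simplest reading: each attestation of an interesting lemma contributes that lemma's weight to the score of the text. The function name `relevance`, the sample weights and the sample text are all hypothetical.

```python
from collections import Counter

# Hypothetical weights on the 1-5 scale; the values are illustrative only.
WEIGHTS = {"scribe": 5, "tablet": 4, "to go around": 1}


def relevance(lemmas_in_text, weights):
    """Score a text as the weighted sum of its interesting attestations.

    Assumption: score = sum over lemmas of (weight * number of attestations).
    Lemmas that are not in the weight list contribute nothing.
    """
    counts = Counter(lemma for lemma in lemmas_in_text if lemma in weights)
    return sum(weights[lemma] * n for lemma, n in counts.items())


# A made-up text, already reduced to a sequence of lemmas.
text = ["scribe", "tablet", "scribe", "to go around", "house"]
print(relevance(text, WEIGHTS))  # 5*2 + 4*1 + 1*1 = 15
```

Under this assumption, a text with many attestations of a low-weight word can still outrank a text with a single attestation of a central word, which is one reason the output needs manual validation.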

The script takes advantage of two linguistic features of Sumerian:

  1. The language is agglutinative, which makes nominal and verbal roots rather easy to identify by a computer.
  2. Third-millennium BCE scribes often wrote words as "naked" roots (i.e. without further grammatical morphemes), which simplifies matching, although it complicates interpretation.
Combined, these two facts allow for a script implementation that circumvents the issue of dealing with grammar and content-aware text recognition.

However, early cuneiform sources also have drawbacks when it comes to lemmatization. The Sumerian writing system exploits homography quite extensively: two unrelated words may share the same orthographic representation. Consider for instance the following English sentences:

  1. Lead /lɛd/ is a chemical element with the symbol Pb
  2. You will lead /liːd/ the way
In these two examples, the same sequence of characters, namely l-e-a-d, is used to express two different words. Similarly, the sign DUB in Sumerian may be used to express either "inscribed document" or the verb "to heap up" (possible differences in the actual phonetic realization are still debated). If two or more Sumerian lexemes share at least one spelling, the program considers them as a "collection". Collection words are defined here as new words within the dictionary, consisting of the sum of all shared spellings and meanings. Let us simplify a bit and consider the following Sumerian words:
Lexeme | Basic meaning | Root forms | Morphological forms
SAR | GARDEN | sar, sar-sar, sa-risar | sar-ba, sar-bi, etc.
SAR | RUN | sar, sar-sar, sar-re | ba-ra-mu-un-da-ab-sa-re, ga-ba-ab-sar-sar
SAR | SHARPEN | sar, sar-sar | mu-ni-in-sar-sar-re-eš
SAR | SHAVE | sar, sakar-sakar, sa-karsar | sakar-sakar
SAR | SMOKE | sar, sar-sar | al-sar-sar-re-ne, etc.
SAR | WRITE | sar, sar-sar | a-ab-sar, sar-re-dam
As the table shows, these lexemes all share the form sar. If the program encounters an instance of SAR in a given text and no further morphological elements are attached to the root, it treats that instance as a collection word with the bulk meaning "GARDEN||RUN||SHARPEN||SHAVE||SMOKE||WRITE" (note that the individual basic meanings are joined in alphabetical order). Practitioners must use their linguistic competence to establish the proper meaning in context.
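
The construction of a bulk meaning can be sketched as follows. This is a simplified illustration, not the program's actual code: it groups meanings per spelling, whereas the collection described above is defined over whole lexemes. The dictionary `LEXICON` reproduces the root forms from the table, and the function name `collection_labels` is hypothetical.

```python
from collections import defaultdict

# Root spellings per basic meaning, copied from the SAR table above.
LEXICON = {
    "GARDEN": {"sar", "sar-sar", "sa-risar"},
    "RUN": {"sar", "sar-sar", "sar-re"},
    "SHARPEN": {"sar", "sar-sar"},
    "SHAVE": {"sar", "sakar-sakar", "sa-karsar"},
    "SMOKE": {"sar", "sar-sar"},
    "WRITE": {"sar", "sar-sar"},
}


def collection_labels(lexicon):
    """Map every spelling shared by two or more lexemes to a bulk meaning.

    The bulk meaning joins the basic meanings with "||" in alphabetical order.
    Spellings that belong to a single lexeme are left out.
    """
    by_spelling = defaultdict(set)
    for meaning, spellings in lexicon.items():
        for spelling in spellings:
            by_spelling[spelling].add(meaning)
    return {
        spelling: "||".join(sorted(meanings))
        for spelling, meanings in by_spelling.items()
        if len(meanings) > 1
    }


labels = collection_labels(LEXICON)
print(labels["sar"])  # GARDEN||RUN||SHARPEN||SHAVE||SMOKE||WRITE
```

In this per-spelling variant, a spelling such as sar-re, listed for one lexeme only, never becomes a collection word.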

However, since the program knows all the morphological forms attested for a given word, automatic disambiguation is possible whenever a root occurs in a unique morphological form (i.e. a form not shared with other homographs). For instance, if the program hits the form sar-re-dam, it correctly identifies it as meaning "to write", as the other homographs are not attested in that form.
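
A minimal sketch of this disambiguation step, assuming a simple lookup of the token in the known forms of each lexeme. The name `identify` and the data layout are hypothetical, and the form lists are limited to the examples given above (some of which are truncated with "etc." in the source).

```python
# Root forms and morphological forms per basic meaning (examples from the text).
KNOWN_FORMS = {
    "GARDEN": {"sar", "sar-sar", "sa-risar", "sar-ba", "sar-bi"},
    "RUN": {"sar", "sar-sar", "sar-re",
            "ba-ra-mu-un-da-ab-sa-re", "ga-ba-ab-sar-sar"},
    "SHARPEN": {"sar", "sar-sar", "mu-ni-in-sar-sar-re-eš"},
    "SHAVE": {"sar", "sakar-sakar", "sa-karsar"},
    "SMOKE": {"sar", "sar-sar", "al-sar-sar-re-ne"},
    "WRITE": {"sar", "sar-sar", "a-ab-sar", "sar-re-dam"},
}


def identify(token, known_forms=KNOWN_FORMS):
    """Return the meaning(s) compatible with a token, or None if unknown.

    A form attested for one lexeme only yields a single meaning;
    a shared form yields the alphabetically ordered bulk meaning.
    """
    candidates = sorted(m for m, forms in known_forms.items() if token in forms)
    return "||".join(candidates) or None


print(identify("sar-re-dam"))  # WRITE
print(identify("sar"))         # GARDEN||RUN||SHARPEN||SHAVE||SMOKE||WRITE
```

The result for an unambiguous form is only as reliable as the list of attested forms: a form missing from the list for one homograph would produce a false unique match.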

The program outputs a ranked list of promising texts, based on word ranking and total matches within individual texts. It also produces searchable tables that help validate the output. Wrong or unwanted results may be filtered out in a subsequent clean-up step: the program accepts a list of texts to be ignored, which the practitioner must compile manually after close inspection of the results from the first iteration. It is equally possible to provide the program with a list of morphological forms to be ignored. This is useful for rare or uncertain spellings, which can generate considerable background noise.
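
The clean-up step can be pictured with the following sketch. It assumes that matches are available as a mapping from text identifier to a list of (lemma, form) hits and that a text's score is the sum of the weights of its hits. The function `rank_texts`, the identifiers T1 to T3 and the weights are hypothetical and serve only as an illustration.

```python
def rank_texts(matches, weights, ignored_texts=(), ignored_forms=()):
    """Rank texts by weighted matches, skipping ignored texts and forms.

    matches: dict mapping a text id to a list of (lemma, form) hits.
    Texts whose score drops to zero after filtering are omitted.
    """
    scores = {}
    for text_id, hits in matches.items():
        if text_id in ignored_texts:
            continue
        score = sum(
            weights[lemma] for lemma, form in hits if form not in ignored_forms
        )
        if score > 0:
            scores[text_id] = score
    # Highest score first; ties are broken by text id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Made-up first-iteration results.
weights = {"scribe": 5, "tablet": 4}
matches = {
    "T1": [("scribe", "dub-sar"), ("tablet", "dub")],
    "T2": [("tablet", "dub"), ("tablet", "dub")],
    "T3": [("scribe", "dub-sar")],
}

print(rank_texts(matches, weights))
# [('T1', 9), ('T2', 8), ('T3', 5)]
print(rank_texts(matches, weights, ignored_texts={"T3"}, ignored_forms={"dub"}))
# [('T1', 5)]
```

Keeping the ignore lists as separate inputs means the first iteration can be rerun unchanged, with the practitioner's corrections applied on top.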