To start exploring the database, please visit the database section.
The database offers a selection of Sumerian texts and terms related to writing and accounting (such as "scribe", "tablet", "seal", "to count", etc.). The texts have been extracted from a large digitized corpus using ad hoc data mining algorithms. The program evaluates the relevance of each text on the basis of the total number of attestations of interesting words it contains, as well as the weight attached to each of the selected words. Not all lexemes are equally relevant to the scope of the project. For instance, the word for "scribe" is central, whereas the one for "to go around" is relevant only if it involves scribes and accountants. Accordingly, each lemma in the list of interesting terms is assigned a weight from 1 to 5 (1 = poorly relevant, 5 = very relevant). The Sumerian terms relevant to the research have been established manually, i.e. by producing an authoritative list of lemmas (including individual spellings and morphological forms). The list must be organized as a spreadsheet with the following format (only the first couple of entries are shown here):
#lemma_id | #class | #spellings | #senses | #akkadian | #phon_glosses | #bibliogr. | #periods | #freq | #url | #ranking | #notes | #forms |
---|---|---|---|---|---|---|---|---|---|---|---|---|
bisaŋdubak [ARCHIVIST] | (N) | bisaŋ-dub-ba, bisaŋ-dub, bisaŋ-dub-ba-a, pi-ša3-ad-ba-ar-ra, ša₁₃-dub-ba-a | archivist | šandabakku | - | ... | Old Akkadian, Lagash II, Ur III, Old Babylonian, Middle Babylonian | 1829x | http://oracc.org/epsd2/o0025217 | xxxxx | - | ša13-dub-e, pisaŋ-dub-ba, ša13-dub-ba-ka-na, bisaŋ-dub, ša13-dub-ba-ka, bisaŋ-dub-ba-ra, ša13-dub-ba-a, ... |
hur [SCRATCH] | (V/t) | ḫur, ḫur-ḫur, ara3a-ra-ra, ḫuḫur, ara3-ra, ara3a-ra-ra, ḫuḫur | to scratch, draw | eṣēru | ara3a-ra-ra; ḫuḫur | ... | Lagash II, Old Babylonian, Middle Babylonian, Neo-Assyrian, Neo-Babylonian, Hellenistic | 114x | http://oracc.org/epsd2/o0030406 | xxxx | - | hur-ra, hur-ra-za, a-ra-an-hur-hur-re, bi2-hur, hur-hur, mu-un-hur, u3-mu-ni-hur, hur-ra-gin7, ... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
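The weighting scheme described above can be illustrated with a minimal sketch. It is not the project's actual code: it assumes that the weight of a lemma equals the number of "x" marks in its #ranking column, that a text's score is the plain sum of the weights of all its attestations, and that each text is already available as a list of lemma guide words. The names `WEIGHTS`, `score_text` and `rank_texts` are hypothetical.

```python
from collections import Counter

# Hypothetical weights (1-5), assumed to be the number of "x" marks
# in the "#ranking" column of the lemma spreadsheet.
WEIGHTS = {
    "ARCHIVIST": len("xxxxx"),  # 5 = very relevant
    "SCRATCH": len("xxxx"),     # 4
}


def score_text(lemmas, weights=WEIGHTS):
    """Sum the weight of every attestation of an interesting lemma."""
    counts = Counter(lemma for lemma in lemmas if lemma in weights)
    return sum(weights[lemma] * n for lemma, n in counts.items())


def rank_texts(corpus, weights=WEIGHTS):
    """Return (text_id, score) pairs, best first; texts without matches are dropped."""
    scored = {text_id: score_text(lemmas, weights) for text_id, lemmas in corpus.items()}
    return sorted(
        ((text_id, score) for text_id, score in scored.items() if score > 0),
        key=lambda pair: (-pair[1], pair[0]),
    )
```

A simple sum is only one way to combine attestation counts and weights; the program may well normalize by text length or combine the two factors differently.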
The script takes advantage of two linguistic features of Sumerian:
However, early cuneiform sources also have drawbacks when it comes to lemmatization. The Sumerian writing system makes extensive use of homography: two unrelated words may share the same orthographic representation. Consider for instance the following English sentences:
Lexeme | Basic meaning | Root forms | Morphological forms |
---|---|---|---|
SAR | GARDEN | sar, sar-sar, sa-risar | sar-ba, sar-bi, etc. |
SAR | RUN | sar, sar-sar, sar-re | ba-ra-mu-un-da-ab-sa-re, ga-ba-ab-sar-sar |
SAR | SHARPEN | sar, sar-sar | mu-ni-in-sar-sar-re-eš |
SAR | SHAVE | sar, sakar-sakar, sa-karsar | sakar-sakar |
SAR | SMOKE | sar, sar-sar | al-sar-sar-re-ne, etc. |
SAR | WRITE | sar, sar-sar | a-ab-sar, sar-re-dam |
However, since the program knows all morphological forms actually attested for a given word, automatic disambiguation is possible, at least in those cases where a given root has a unique morphological form (i.e. a form not shared with other homographs). For instance, if the program hits the form sar-re-dam, it correctly identifies it as meaning "to write", as the other homographs are unattested in that form.
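The disambiguation rule just described can be sketched as a reverse lookup from form to lexeme. This is an illustrative sketch only: the form lists are a small subset copied from the SAR table above, and the names `FORMS`, `build_index` and `disambiguate` are hypothetical.

```python
from collections import defaultdict

# A subset of the forms listed in the SAR table (lists are not exhaustive).
FORMS = {
    "GARDEN": ["sar", "sar-sar", "sar-ba", "sar-bi"],
    "RUN": ["sar", "sar-sar", "sar-re", "ga-ba-ab-sar-sar"],
    "WRITE": ["sar", "sar-sar", "a-ab-sar", "sar-re-dam"],
}


def build_index(forms_by_lexeme):
    """Map every written form to the set of lexemes it is attested for."""
    index = defaultdict(set)
    for lexeme, forms in forms_by_lexeme.items():
        for form in forms:
            index[form].add(lexeme)
    return index


def disambiguate(form, index):
    """Return the lexeme if the form belongs to exactly one, else None."""
    candidates = index.get(form, set())
    if len(candidates) == 1:
        return next(iter(candidates))
    return None  # unknown form, or shared by several homographs
```

With this index, `disambiguate("sar-re-dam", build_index(FORMS))` resolves to `"WRITE"`, whereas the bare root `sar` stays ambiguous and returns `None`.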
The program outputs a ranked list of promising texts, based on word ranking and total matches within individual texts. It also produces searchable tables that help validate the output. Wrong or unwanted results may be filtered out in a subsequent clean-up step: the program accepts a list of texts to be ignored, which the practitioner must compile manually after close inspection of the results from the first iteration. It is equally possible to provide the program with a list of morphological forms to be ignored. This is useful for rare or uncertain spellings, which may otherwise generate considerable background noise.
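The clean-up step can be sketched as a filter applied to the matches of the first iteration. This is an assumption about how such a filter might look, not the program's actual interface: matches are modelled as (text identifier, matched form) pairs, and the function name `filter_matches` and the identifiers `text_A` and `text_B` are placeholders.

```python
def filter_matches(matches, ignored_texts=(), ignored_forms=()):
    """Drop matches that come from ignored texts or use ignored forms."""
    skip_texts = set(ignored_texts)
    skip_forms = set(ignored_forms)
    return [
        (text_id, form)
        for text_id, form in matches
        if text_id not in skip_texts and form not in skip_forms
    ]


# Placeholder matches from a first run: (text identifier, matched form).
matches = [
    ("text_A", "sar-re-dam"),
    ("text_B", "a-ab-sar"),
    ("text_A", "sar-ba"),
]

# Second iteration: the practitioner excludes one text and one noisy form.
cleaned = filter_matches(matches, ignored_texts=["text_B"], ignored_forms=["sar-ba"])
```

Keeping both ignore lists as plain inputs means the clean-up can be repeated after each inspection round without touching the lemma list itself.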