WritEMe - Lexicon database

To start exploring the database, please visit the database section.

The database offers a selection of Sumerian texts and terms related to writing and accounting (such as "scribe", "tablet", "seal", "to count", etc.). The texts have been extracted from a large digitized corpus using ad hoc data mining algorithms. The program evaluates the relevance of the entries on the basis of the total number of attestations of interesting words in the individual texts, as well as the weight attached to each selected word. Not all lexemes are equally relevant to the scope of the project. For instance, the word for "scribe" is central, whereas the one for "to go around" is relevant only if it involves scribes and accountants. Accordingly, each lemma in the list of interesting terms is assigned a weight from 1 to 5 (1 = marginally relevant, 5 = highly relevant). The Sumerian terms relevant to the research have been established manually, by producing an authoritative list of lemmas (including individual spellings and morphological forms). The list must be organized as a spreadsheet with the following format (only the first couple of entries are shown here):

#lemma_id	#class	#spellings	#senses	#akkadian	#phon_glosses	#bibliogr.	#periods	#freq	#url	#ranking	#notes	#forms
bisaŋdubak [ARCHIVIST]	(N)	bisaŋ-dub-ba, bisaŋ-dub, bisaŋ-dub-ba-a, pi-ša3-ad-ba-ar-ra, ša₁₃-dub-ba-a	archivist	šandabakku	-	...	Old Akkadian, Lagash II, Ur III, Old Babylonian, Middle Babylonian	1829x	http://oracc.org/epsd2/o0025217	xxxxx	-	ša13-dub-e, pisaŋ-dub-ba, ša13-dub-ba-ka-na, bisaŋ-dub, ša13-dub-ba-ka, bisaŋ-dub-ba-ra, ša13-dub-ba-a, ...
hur [SCRATCH]	(V/t)	ḫur, ḫur-ḫur, ara3^a-^ra-ra, ^ḫuḫur, ara3-ra, ara3^a-^ra-ra, ^ḫuḫur	to scratch, draw	eṣēru	ara3^a-^ra-ra; ^{ḫ u}ḫ ur	...	Lagash II, Old Babylonian, Middle Babylonian, Neo-Assyrian, Neo-Babylonian, Hellenistic	114x	http://oracc.org/epsd2/o0030406	xxxx	-	hur-ra, hur-ra-za, a-ra-an-hur-hur-re, bi2-hur, hur-hur, mu-un-hur, u3-mu-ni-hur, hur-ra-gin7, ...
...	...	...	...	...	...	...	...	...	...	...	...	...
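
As an illustration of how such a spreadsheet could be read, here is a minimal Python sketch. It assumes a tab-separated export with "#"-prefixed column names, and it assumes that the ranking column encodes the weight as a run of "x" characters ("xxxxx" = 5); neither assumption is confirmed by the project. The function names and the reduced set of columns in the sample are hypothetical.

```python
import csv
import io


def split_list(cell):
    """Split a comma-separated cell, dropping '-' and '...' placeholders."""
    items = [item.strip() for item in cell.split(",")]
    return [item for item in items if item and item not in ("-", "...")]


def parse_lemma_sheet(text):
    """Parse a tab-separated lemma list into a list of dictionaries."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    header = [name.strip().lstrip("#") for name in next(reader)]
    lemmas = []
    for row in reader:
        cells = [cell.strip() for cell in row]
        # Skip empty rows and the "..." rows that mark elided entries.
        if not cells or all(cell in ("", "...") for cell in cells):
            continue
        entry = dict(zip(header, cells))
        entry["spellings"] = split_list(entry.get("spellings", ""))
        entry["forms"] = split_list(entry.get("forms", ""))
        # Assumption: the weight is the number of "x" marks in the ranking cell.
        entry["weight"] = entry.get("ranking", "").count("x")
        freq = entry.get("freq", "").rstrip("x")
        entry["freq"] = int(freq) if freq.isdigit() else None
        lemmas.append(entry)
    return lemmas


# Reduced sample (hypothetical subset of the columns shown above).
SAMPLE = "\n".join([
    "\t".join(["#lemma_id", "#spellings", "#freq", "#ranking", "#forms"]),
    "\t".join(["bisaŋdubak [ARCHIVIST]", "bisaŋ-dub-ba, bisaŋ-dub", "1829x",
               "xxxxx", "ša13-dub-e, pisaŋ-dub-ba, ..."]),
    "\t".join(["...", "...", "...", "...", "..."]),
])

lemmas = parse_lemma_sheet(SAMPLE)
print(lemmas[0]["lemma_id"], lemmas[0]["weight"], lemmas[0]["freq"])
```

Elided cells ("...") are dropped rather than expanded, so the parsed lists contain only the forms actually written out in the sheet.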

The script takes advantage of two linguistic features of Sumerian:

The language is agglutinative, which makes nominal and verbal roots rather easy for a computer to identify.
Third-millennium BCE scribes often wrote words as "naked" roots (i.e. without further grammatical morphemes), which simplifies the matching task (although it complicates interpretation).

These two facts combined allow for a script implementation that circumvents the need to deal with grammar and content-aware text recognition.

However, early cuneiform sources also have drawbacks when it comes to lemmatization. The Sumerian writing system makes extensive use of homography: two unrelated words may share the same orthographic representation. Consider for instance the following English sentences:

Lead /lɛd/ is a chemical element with the symbol Pb
You will lead /liːd/ the way

In these two examples, the same sequence of characters, namely l-e-a-d, is used to express two different words. Similarly, the sign DUB in Sumerian may be used to express either "inscribed document" or the verb "to heap up" (possible differences in the actual phonetic realization are still debated). If two or more Sumerian lexemes share at least one spelling, the program considers them as a "collection". Collection words are defined here as new words within the dictionary, consisting of the sum of all shared spellings and meanings. Let us simplify a bit and consider the following Sumerian words:

Lexeme	Basic meaning	Root forms	Morphological forms
SAR	GARDEN	sar, sar-sar, ^sa-risar	sar-ba, sar-bi, etc.
SAR	RUN	sar, sar-sar, sar-re	ba-ra-mu-un-da-ab-sa-re, ga-ba-ab-sar-sar
SAR	SHARPEN	sar, sar-sar	mu-ni-in-sar-sar-re-eš
SAR	SHAVE	sar, sakar-sakar, ^sa-karsar	sakar-sakar
SAR	SMOKE	sar, sar-sar	al-sar-sar-re-ne, etc.
SAR	WRITE	sar, sar-sar	a-ab-sar, sar-re-dam

As the table shows, all these lexemes share the form sar. If the program encounters an instance of sar in a given text and no further morphological elements are attached to the root, it treats that instance as a collection word with the combined meaning "GARDEN||RUN||SHARPEN||SHAVE||SMOKE||WRITE" (note that the individual basic meanings are joined in alphabetical order). Practitioners must rely on their linguistic competence to establish the proper meaning in context.
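
The collection mechanism can be sketched in Python as a simple lookup from root form to the set of meanings that share it. This is only an illustration under stated assumptions: the data structure and the names (ROOT_FORMS, build_root_index, collection_meaning) are hypothetical, and the glossed spellings of the table are left out.

```python
from collections import defaultdict

# Root forms from the simplified table above (glossed spellings omitted).
ROOT_FORMS = {
    "GARDEN": ["sar", "sar-sar"],
    "RUN": ["sar", "sar-sar", "sar-re"],
    "SHARPEN": ["sar", "sar-sar"],
    "SHAVE": ["sar", "sakar-sakar"],
    "SMOKE": ["sar", "sar-sar"],
    "WRITE": ["sar", "sar-sar"],
}


def build_root_index(root_forms):
    """Map every root form to the set of basic meanings that use it."""
    index = defaultdict(set)
    for meaning, forms in root_forms.items():
        for form in forms:
            index[form].add(meaning)
    return index


def collection_meaning(form, index):
    """Join all meanings of a form in alphabetical order, or return None."""
    meanings = index.get(form)
    if not meanings:
        return None
    return "||".join(sorted(meanings))


index = build_root_index(ROOT_FORMS)
print(collection_meaning("sar", index))
print(collection_meaning("sar-re", index))
```

In this sketch a form attested for a single lexeme, such as sar-re, simply yields one meaning, while a shared form yields the alphabetically joined label.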

However, since the program knows all morphological forms attested for a given word, automatic disambiguation is possible whenever a root occurs in a unique morphological form (i.e. a form not shared with other homographs). For instance, if the program hits the form sar-re-dam, it correctly identifies it as meaning "to write", as the other homographs are unattested in that form.
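
A possible way to express this disambiguation step in Python is shown below. It is a sketch, not the project's actual code: the names MORPH_FORMS and resolve are hypothetical, and only the morphological forms written out in the simplified table are included.

```python
from collections import defaultdict

# Morphological forms from the simplified table above ("etc." entries omitted).
MORPH_FORMS = {
    "GARDEN": ["sar-ba", "sar-bi"],
    "RUN": ["ba-ra-mu-un-da-ab-sa-re", "ga-ba-ab-sar-sar"],
    "SHARPEN": ["mu-ni-in-sar-sar-re-eš"],
    "SHAVE": ["sakar-sakar"],
    "SMOKE": ["al-sar-sar-re-ne"],
    "WRITE": ["a-ab-sar", "sar-re-dam"],
}
# The bare root is shared by all six lexemes.
SHARED_ROOT = "sar"


def build_form_index(morph_forms, shared_root):
    """Map each form (root or morphological) to the meanings attested for it."""
    index = defaultdict(set)
    for meaning, forms in morph_forms.items():
        index[shared_root].add(meaning)
        for form in forms:
            index[form].add(meaning)
    return index


def resolve(form, index):
    """Return (label, is_unambiguous) for a form; (None, False) if unknown."""
    meanings = index.get(form)
    if not meanings:
        return None, False
    return "||".join(sorted(meanings)), len(meanings) == 1


form_index = build_form_index(MORPH_FORMS, SHARED_ROOT)
print(resolve("sar-re-dam", form_index))
print(resolve("sar", form_index))
```

A form is reported as unambiguous only when exactly one lexeme is attested with it; everything else falls back to the collection label and is left to the practitioner.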

The program outputs a ranked list of promising texts, based on word ranking and total matches within individual texts. It also produces searchable tables that help validate the output. Wrong or unwanted results may be filtered out in a subsequent clean-up step: the program accepts a list of texts to be ignored, which the practitioner must compile manually after close inspection of the results from the first iteration. It is equally possible to provide the program with a list of morphological forms to be ignored. This is useful for rare or uncertain spellings, which may sometimes generate considerable background noise.
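
The ranking and clean-up steps might look roughly like the following Python sketch. The exact scoring formula is not given here, so the sketch assumes the simplest option: a text's score is the sum of the weights of its matched forms, with the match count as a tie-breaker. The function names, the sample texts and the sample weights are all hypothetical.

```python
def score_text(tokens, weights, ignored_forms=frozenset()):
    """Return (weighted score, number of matches) for one tokenized text."""
    hits = [t for t in tokens if t in weights and t not in ignored_forms]
    return sum(weights[t] for t in hits), len(hits)


def rank_texts(texts, weights, ignored_texts=frozenset(), ignored_forms=frozenset()):
    """Rank texts by score, then by match count; skip ignored or empty texts."""
    ranked = []
    for text_id, tokens in texts.items():
        if text_id in ignored_texts:
            continue
        score, matches = score_text(tokens, weights, ignored_forms)
        if matches:
            ranked.append((text_id, score, matches))
    # Highest score first; ties broken by match count, then by text id.
    ranked.sort(key=lambda row: (-row[1], -row[2], row[0]))
    return ranked


# Hypothetical weights (1-5) per form and hypothetical tokenized texts.
weights = {"dub-sar": 5, "dub": 4, "niŋin": 1}
texts = {
    "T1": ["dub-sar", "dub", "lugal"],
    "T2": ["dub-sar", "dub-sar", "niŋin"],
    "T3": ["lugal"],
}

first_pass = rank_texts(texts, weights)
# Second iteration, after manual inspection of the first results.
cleaned = rank_texts(texts, weights, ignored_texts={"T2"}, ignored_forms={"niŋin"})
print(first_pass)
print(cleaned)
```

Keeping the ignore lists as plain arguments means the second iteration reuses the same function and differs from the first only in its inputs.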