

This tagging protocol gives an overview of the tags used for part-of-speech tagging in the SAND project (Syntactic Atlas of Dutch Dialects). Every tag is represented both by a numeric code and by one or more capitals. In the tagging application, which assigns the tags semi-automatically, the numeric code and the capital code are two equivalent, interchangeable ways of assigning a tag to the transcription.

Every tag has the following format: Category, Attribute, Value, Specification, Specification. For example, in V FEAT FIN PT 1.PL: Category = V (verb); Attribute = FEAT (feature/characteristic); Value = FIN (finite); Specification = PT (present tense); Specification = 1.PL (first person plural). Category, attribute, value and specification are written in capitals. If these capitals appear between brackets, filling them in is optional.
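The tag format described above can be sketched in Python as follows. This is only an illustration, not part of the SAND tagging application: the names `Tag` and `parse_tag` are hypothetical, and the sketch assumes that the fields of a capital tag are separated by whitespace and contain no internal spaces.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Tag:
    """One tag: Category, Attribute, Value and up to two Specifications."""
    category: str
    attribute: Optional[str] = None
    value: Optional[str] = None
    specifications: Tuple[str, ...] = ()


def parse_tag(text: str) -> Tag:
    """Split a capital tag such as 'V FEAT FIN PT 1.PL' into its fields.

    Assumption: fields are whitespace-separated; a tag has 1 to 5 fields.
    """
    parts = text.split()
    if not 1 <= len(parts) <= 5:
        raise ValueError(f"expected 1 to 5 fields, got {len(parts)}: {text!r}")
    category, *rest = parts
    attribute = rest[0] if len(rest) > 0 else None
    value = rest[1] if len(rest) > 1 else None
    # Whatever follows the value (at most two fields) is a specification.
    return Tag(category, attribute, value, tuple(rest[2:]))


tag = parse_tag("V FEAT FIN PT 1.PL")
print(tag.category, tag.attribute, tag.value, tag.specifications)
```

Keeping the two specifications in a tuple rather than in two named fields is a design choice of this sketch; it lets a tag with zero, one or two specifications be handled uniformly.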

Every tag corresponds to a five-digit code, structured as follows. The first digit indicates the category (for example 1 = N, i.e. noun). The second digit indicates the attribute (for example 3 = CASE, i.e. case). The third digit marks the value of the attribute (for example 1 = OBL, i.e. oblique). The fourth and fifth digits specify the value further (for example 2 = DAT, i.e. dative). A zero means, depending on its position, no category, no attribute, no value or no specification. The numeric code 13120 thus corresponds to the tag N CASE OBL DAT.
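The positional reading of the numeric code can be illustrated with a short Python sketch. It is not the project's own implementation: the function name `decode` and the table `LABELS` are hypothetical, the table holds only the digit-to-label pairs that this protocol mentions explicitly, and the sketch assumes that the meaning of a digit depends on the digits before it (so the table is keyed by code prefix).

```python
# Prefix -> capital label. Only the pairs named in this protocol are listed;
# a full table would have to be taken from the complete tagset.
LABELS = {
    "1": "N",          # category: noun
    "13": "CASE",      # attribute of N
    "131": "OBL",      # value of CASE: oblique
    "1312": "DAT",     # specification of OBL: dative
    "12": "POS",       # attribute of N
    "123": "POST-N",   # value of POS: postnominal position
}


def decode(code: str) -> str:
    """Translate a five-digit code into its capital tag.

    A zero means "nothing at this position" and is skipped; a non-zero
    digit that is missing from LABELS is rendered as '?'.
    """
    if len(code) != 5 or not (code.isascii() and code.isdigit()):
        raise ValueError(f"not a five-digit code: {code!r}")
    labels = []
    for position in range(1, 6):
        if code[position - 1] == "0":
            continue
        labels.append(LABELS.get(code[:position], "?"))
    return " ".join(labels)


print(decode("13120"))  # N CASE OBL DAT
print(decode("12300"))  # N POS POST-N
```

Rejecting any string that is not exactly five digits also catches codes that were mistyped with too many or too few digits.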

Every tag is followed by a short description of the category/attribute/value/specification and, where necessary, illustrative examples. In most cases, more than one numeric code has to be assigned to a word. For example, blackberries in "a bucket (of) blackberries" gets the code 111000: N INFL -es (noun with inflection -es), plus the code 12300: N POS POST-N (noun in postnominal position).

The tagset is inspired by the tagset used in the Corpus Gesproken Nederlands (CGN, Spoken Dutch Corpus; F. Van Eynde, Part of Speech Tagging en Lemmatisering, Centre for Computational Linguistics, K.U. Leuven, 2000), which is based on the EAGLES standard for tagsets. The SAND tagset differs in a number of ways from both the CGN and the EAGLES tagsets. These differences are mentioned and illustrated in this document where necessary.

0. Uncertainty tag

00000 O No tag. Use this code if the category of the word is not clear. Do not use it if the category is clear but the attribute, value or specification is not; in that case, insert a 0 in the position of the attribute, value or specification.
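The rule for unclear positions can be sketched as a small Python helper that composes a numeric code. The function `build_code` and its parameter names are hypothetical and serve only as an illustration; the sketch assumes that each of the five positions holds a single digit.

```python
from typing import Optional


def build_code(category: Optional[int] = None,
               attribute: Optional[int] = None,
               value: Optional[int] = None,
               spec1: Optional[int] = None,
               spec2: Optional[int] = None) -> str:
    """Compose a five-digit code, writing 0 for every unclear position.

    If the category itself is unclear, the uncertainty tag 00000 is
    returned regardless of the other arguments.
    """
    if not category:  # None or 0: category unclear, so no tag at all
        return "00000"
    digits = []
    for field in (category, attribute, value, spec1, spec2):
        digit = 0 if field is None else field
        if not 0 <= digit <= 9:
            raise ValueError(f"each position must be a single digit, got {digit}")
        digits.append(str(digit))
    return "".join(digits)


print(build_code())            # category unclear
print(build_code(1, 3, 1))     # N CASE OBL, specification unclear
print(build_code(1, 3, 1, 2))  # N CASE OBL DAT
```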
