Introduction

From Dialectsyntax
Revision as of 13:07, 24 October 2011 by Admin

This document provides an overview of the tags used for part-of-speech tagging in the SAND project (Syntactic Atlas of the Dutch Dialects). Every tag is represented both by a numeric code and by one or more labels in capitals. In the tagging application, which assigns the tags semi-automatically, the numeric code and the capital labels are two alternative but equivalent ways to assign a tag to the transcription.

Every tag has the following format: Category, Attribute, Value, Specification, Specification. For example, in V FEAT FIN PT 1.PL: Category = V (verb); Attribute = FEAT (feature/characteristic); Value = FIN (finite); first Specification = PT (present tense); second Specification = 1.PL (first person plural). Category, attribute, value and specification are written in capitals. Capitals that appear between brackets mark an element that is optional to fill in.
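The five-slot format can be illustrated with a small Python sketch. This is not part of the SAND tagging application; the names Tag and parse_tag are hypothetical, and the sketch assumes that the elements of a tag are separated by whitespace and that trailing elements may be left out.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Tag:
    """One tag: Category, Attribute, Value and up to two Specifications."""
    category: str
    attribute: Optional[str] = None
    value: Optional[str] = None
    specifications: Tuple[str, ...] = ()


def parse_tag(text: str) -> Tag:
    """Split a whitespace-separated tag string into its slots (assumed layout)."""
    parts = text.split()
    if not parts:
        raise ValueError("empty tag")
    if len(parts) > 5:
        raise ValueError(f"a tag has at most five elements, got {len(parts)}")
    return Tag(
        category=parts[0],
        attribute=parts[1] if len(parts) > 1 else None,
        value=parts[2] if len(parts) > 2 else None,
        specifications=tuple(parts[3:]),  # zero, one or two specifications
    )


tag = parse_tag("V FEAT FIN PT 1.PL")
print(tag.category, tag.attribute, tag.value, tag.specifications)
```

For the example tag, the category is V, the attribute FEAT, the value FIN, and the two specifications PT and 1.PL.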

Every tag corresponds to a five-digit code, structured as follows. The first digit indicates the category (for example 1 = N, i.e. noun). The second digit indicates the attribute (for example 3 = CASE, i.e. case). The third digit marks the value of the attribute (for example 1 = OBL, i.e. oblique). The fourth and fifth digits each specify the value further (for example 2 = DAT, i.e. dative). A zero means, depending on its position, no category, no attribute, no value or no specification.
The number code 13120 thus corresponds to the tag N CASE OBL DAT.
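
The positional reading of a code can be sketched in Python as follows. The lookup tables contain only the four mappings named in the example above (1 = N, 3 = CASE, 1 = OBL, 2 = DAT); the function names are hypothetical, and keying each table on the preceding labels is an assumption, made because the text does not say whether a digit means the same thing under every category.

```python
# Partial lookup tables: only the mappings given in the text are filled in.
CATEGORIES = {1: "N"}
ATTRIBUTES = {("N", 3): "CASE"}
VALUES = {("N", "CASE", 1): "OBL"}
SPECIFICATIONS = {("N", "CASE", "OBL", 2): "DAT"}

SLOTS = ("category", "attribute", "value", "specification_1", "specification_2")


def split_code(code: str) -> dict:
    """Map each digit of a five-digit code to its slot; a zero becomes None."""
    if len(code) != 5 or not (code.isascii() and code.isdigit()):
        raise ValueError(f"expected five digits, got {code!r}")
    return {slot: (int(digit) or None) for slot, digit in zip(SLOTS, code)}


def decode(code: str) -> str:
    """Translate a code into its capital labels, as far as the tables allow."""
    d = split_code(code)
    if d["category"] is None:
        return ""
    labels = [CATEGORIES[d["category"]]]
    if d["attribute"] is not None:
        labels.append(ATTRIBUTES[(labels[0], d["attribute"])])
        if d["value"] is not None:
            labels.append(VALUES[(labels[0], labels[1], d["value"])])
            for slot in ("specification_1", "specification_2"):
                if d[slot] is not None:
                    key = (labels[0], labels[1], labels[2], d[slot])
                    labels.append(SPECIFICATIONS[key])
    return " ".join(labels)


print(decode("13120"))  # prints "N CASE OBL DAT"
```

A digit that is missing from the tables raises a KeyError, which is intended here: the sketch does not guess at mappings the text does not give.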

Every tag is followed by a short description of its category, attribute, value and specification and, where necessary, illustrated with examples.
In most cases a word needs more than one number code. For example, blackberries in "a bucket (of) blackberries" gets code 111000: N INFL -es (noun with inflection -es), plus code 12300: N POS POST-N (noun in postnominal position).

The tagset is inspired by the tagset of the Corpus Gesproken Nederlands (CGN, Spoken Dutch Corpus; F. Van Eynde, Part of Speech Tagging en Lemmatisering, Centre for Computational Linguistics, K.U. Leuven, 2000), which is based on the EAGLES standard for tagsets. The SAND tagset differs in a number of ways from both the CGN and the EAGLES tagsets. These differences are mentioned and illustrated in this document where necessary.
