Introduction

Revision as of 13:25, 2 November 2011 by Franca (Talk | contribs)


The tags used in the Edisyn search engine can be viewed here. This tagset is used to label the parts of speech in (dialect) corpora. The document shows how the tags of the various corpora map to those of the Edisyn search engine. The column 'Edisyn search engine' lists the tags used in this search engine; the other columns show the tags that apply to each individual corpus. Each row shows the correspondence between a corpus tag and the search engine tag. An Edisyn tag consists of two parts: a linguistic category (e.g. V 'verb'), which may be modified by one or more features (e.g. 1,s 'first person', 'singular'). In the search engine one can search by category, by feature, or by both. To make many databases interoperable, the categories and features are kept as general as possible.

The protocol below is a manual for part-of-speech tagging. It was developed by Sjef Barbiers and Guido Vanden Wyngaerd for the SAND project (Syntactic Atlas of Dutch Dialects), but may also be useful for other dialect research groups and projects.

This protocol is also available in [www.meertens.knaw.nl/pdf/variatielinguistiek/dialectsyntax/Tagging-protocol.pdf PDF format].

Introduction

This document provides an overview of the tags used for part-of-speech tagging in the SAND project (Syntactic Atlas of Dutch Dialects). Every tag is represented both by a numeric code and by one or more capital-letter labels. In the tagging application, which assigns the tags semi-automatically, the numeric code and the capital-letter code are two alternative but equivalent ways of assigning a tag to the transcription.

Every tag has the following format: Category, Attribute, Value, Specification, Specification. For example, in V FEAT FIN PT 1.PL: Category = V (verb); Attribute = FEAT (feature/characteristic); Value = FIN (finite); Specification = PT (present tense); Specification = 1.PL (first person plural). Category, attribute, value and specification are written in capitals. Capitals placed between brackets are optional and need not be filled in.
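The five-part format can be illustrated with a short Python sketch. It is not part of the SAND tagging application: the names `Tag` and `parse_tag` are hypothetical, and the sketch assumes that the parts of a tag are separated by whitespace and that trailing parts may be left out.

```python
from typing import NamedTuple, Optional


class Tag(NamedTuple):
    """One tag in the order Category, Attribute, Value, Specification, Specification."""
    category: str
    attribute: Optional[str] = None
    value: Optional[str] = None
    specification_1: Optional[str] = None
    specification_2: Optional[str] = None


def parse_tag(text: str) -> Tag:
    """Split a whitespace-separated tag string into its (up to five) parts.

    Assumption: trailing parts may be missing; they are stored as None.
    """
    parts = text.split()
    if not 1 <= len(parts) <= 5:
        raise ValueError(f"expected 1 to 5 parts, got {len(parts)}: {text!r}")
    return Tag(*parts)


example = parse_tag("V FEAT FIN PT 1.PL")
print(example.category, example.value, example.specification_2)
```

With the example from the text, `example.category` is `V`, `example.value` is `FIN` and `example.specification_2` is `1.PL`.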

Every tag corresponds to a five-digit code, structured as follows. The first digit indicates the category (for example 1 = N, i.e. noun). The second digit indicates the attribute (for example 3 = CASE, i.e. case). The third digit marks the value of the attribute (for example 1 = OBL, i.e. oblique). The fourth and fifth digits specify the value further (for example 2 = DAT, i.e. dative). A zero means, depending on its position, no category, no attribute, no value or no specification. The code 13120 thus corresponds to the tag N CASE OBL DAT.
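The correspondence between code and tag can be sketched in Python as follows. The lookup table `LABELS` is deliberately incomplete: it contains only the four digit meanings given in the example above. Keying each label by the digits that precede it is an assumption of this sketch (the meaning of a digit is taken to depend on the category, attribute and value before it), and the function names are hypothetical.

```python
FIELDS = ("category", "attribute", "value", "specification_1", "specification_2")

# Partial lookup, limited to the example in the text (13120 = N CASE OBL DAT).
# Assumption: a digit is interpreted relative to the digits before it,
# so each label is keyed by the code prefix that ends in that digit.
LABELS = {
    "1": "N",        # first digit 1 = noun
    "13": "CASE",    # second digit 3 = case
    "131": "OBL",    # third digit 1 = oblique
    "1312": "DAT",   # fourth digit 2 = dative
}


def split_code(code: str) -> dict:
    """Map each of the five digits to its field; a zero becomes None."""
    if len(code) != 5 or not (code.isascii() and code.isdigit()):
        raise ValueError(f"expected a five-digit code, got {code!r}")
    return {name: (None if d == "0" else int(d)) for name, d in zip(FIELDS, code)}


def code_to_tag(code: str) -> str:
    """Translate a five-digit code into capital labels, skipping zero positions."""
    split_code(code)  # validates the code
    labels = []
    for i in range(1, 6):
        if code[i - 1] == "0":
            continue  # zero: nothing specified at this position
        labels.append(LABELS.get(code[:i], "?"))  # "?" = not in this partial table
    return " ".join(labels)


print(code_to_tag("13120"))
```

For the code 13120 the sketch yields `N CASE OBL DAT`; any digit that is not in the partial table is shown as `?`.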

Every tag is followed by a short description of its category, attribute, value and specification and, where necessary, by one or more examples. In most cases more than one code has to be assigned to a word. For example, 'blackberries' in 'a bucket (of) blackberries' gets the code 111000: N INFL -es (noun with inflection -es), plus the code 12300: N POS POST-N (noun in postnominal position).

The tagset is inspired by the tagset used in the Corpus Gesproken Nederlands (CGN, Spoken Dutch Corpus; F. Van Eynde, Part of Speech Tagging en Lemmatisering, Centre for Computational Linguistics, K.U. Leuven, 2000), which is based on the EAGLES standard for tagsets. The SAND tagset differs in a number of ways from both the CGN and the EAGLES tagsets. These differences are mentioned and illustrated in this document where necessary.

0. Uncertainty tag

00000 O No tag. Use this code if the category of the word is not clear. This code is not to be used if the category is clear but the attribute, value or specification is not. In that case, a 0 is inserted in the position of the unclear attribute, value or specification.
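The rule for zeros can be sketched in Python as follows. The function `build_code` is hypothetical and is not part of the tagging application. The sketch assumes that every position holds a single digit and, as a further assumption, returns 00000 whenever the category is unknown, even if other positions were supplied.

```python
from typing import Optional


def build_code(
    category: Optional[int] = None,
    attribute: Optional[int] = None,
    value: Optional[int] = None,
    specification_1: Optional[int] = None,
    specification_2: Optional[int] = None,
) -> str:
    """Build a five-digit code, writing 0 for every position that is not known."""
    if not category:
        # Category unclear: use the "no tag" code.
        # Assumption: any other digits supplied are ignored in this case.
        return "00000"
    digits = (category, attribute, value, specification_1, specification_2)
    for d in digits:
        if d is not None and not 0 <= d <= 9:
            raise ValueError(f"each position must be a single digit, got {d}")
    # Category is clear: a 0 fills only the positions that are not known.
    return "".join("0" if d is None else str(d) for d in digits)


print(build_code())               # category unclear
print(build_code(1, 3, 1))        # specification not known
print(build_code(1, 3, 1, 2))     # fully specified example from the text
```

The three calls give 00000, 13100 and 13120 respectively.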
