Combining different corpora
1. Introduction
One of the goals of the Edisyn project is to provide a centralized search engine which can search different corpora simultaneously, and show the combined search results. What follows is a concise description of how we plan to accomplish this from a technical point of view.
2. Requirements
- A single unified search interface for different European corpora of dialect transcriptions.
- A single mapping solution (map of Europe showing data from different corpora).
- Searching for text strings and patterns across different corpora. This poses no philosophical problems, but may be of limited use as we are dealing with different languages.
- Searching for PoS tags across different corpora. The necessary assumption here is that the tagging of different corpora will be similar enough for a unified search to be possible and useful.
3. A concrete proposal
A web service interface between different corpora seems the best way to achieve this ‘concurrent search’ goal. This means that each research group hosts, maintains, and stays responsible for its own corpus.
We further assume that these corpora are accessible via the web. Each corpus then adds an extra interface to its search facilities, meant to be accessed by a computer (in this case, our centralized search engine) instead of a human. This is called a web service interface, and it can exist alongside any existing search interface. The centralized search engine calls the different corpora via these web service interfaces and shows the aggregated results on its own results page.
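The aggregation step described above can be sketched roughly as follows. This is only an illustration under assumptions: the function name aggregate_search, the shape of a hit (a plain string) and the idea of querying the corpora in parallel are not part of the proposal. Each corpus is represented by an ordinary callable, so the sketch does not depend on any particular web service protocol.

```python
from concurrent.futures import ThreadPoolExecutor


def aggregate_search(query, corpora):
    """Send the same query to every corpus and merge the results.

    `corpora` maps a corpus name to a callable that takes a query string
    and returns a list of hits (hypothetical shape: plain strings).
    Returns (hits, errors): hits are labelled with their source corpus,
    errors maps the name of each failing corpus to an error message.
    """
    def call_one(item):
        name, search = item
        try:
            return name, search(query), None
        except Exception as exc:
            # One unreachable corpus should not sink the whole search.
            return name, [], str(exc)

    # Query all corpora concurrently; map() keeps the input order.
    with ThreadPoolExecutor(max_workers=max(1, len(corpora))) as pool:
        outcomes = list(pool.map(call_one, corpora.items()))

    hits = [{"corpus": name, "hit": hit}
            for name, found, _ in outcomes
            for hit in found]
    errors = {name: err for name, _, err in outcomes if err is not None}
    return hits, errors
```

Labelling every hit with the corpus it came from keeps the combined results page traceable to its sources, and collecting errors separately lets the results page report which corpora could not be reached.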
Several web service protocols exist; our proposal is to use the simplest, XML-RPC, as the connecting interface between the corpora. XML-RPC is language-agnostic and easy to get working in a variety of programming languages. See http://www.xmlrpc.com/ for more detailed information.
RPC stands for Remote Procedure Call: XML-RPC is a protocol for calling a function on a remote server over HTTP, using a simple standardized XML wrapper. An XML-RPC client (the centralized search engine) sends a request to an XML-RPC server (one of the participating corpora); the server decodes the request and calls the appropriate function; the return value of that function is sent back by the server, also wrapped in XML in a standard way. The XML-RPC client then decodes the result and shows it to the end user, together with the replies from other servers.
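To make the request/response cycle concrete, here is a minimal sketch using the XML-RPC modules from the Python standard library. The toy transcript list, the remote function name "search" and the helper names are assumptions made for illustration; a real corpus would expose its own search function over its own data.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

# Toy stand-in for a real corpus (hypothetical data).
TRANSCRIPTS = [
    "ik heb het niet gezien",
    "hij heeft het gezien",
    "wij gaan naar huis",
]


def search(pattern):
    """Return every transcript line that contains the given text string."""
    return [line for line in TRANSCRIPTS if pattern in line]


def start_corpus_server():
    """Start an XML-RPC server on a free local port, in a background thread."""
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
    # Expose the local function under the remote name "search" (assumed name).
    server.register_function(search, "search")
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def remote_search(url, pattern):
    """Client side: call the remote 'search' function and decode the reply."""
    with ServerProxy(url) as proxy:
        return proxy.search(pattern)


if __name__ == "__main__":
    server = start_corpus_server()
    host, port = server.server_address
    print(remote_search(f"http://{host}:{port}/", "gezien"))
    server.shutdown()
    server.server_close()
```

The client code never touches XML directly: the library encodes the call, sends it over HTTP, and turns the reply back into an ordinary list.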
Client and server do not need to know anything about each other's internal workings: they just communicate in XML. This means that the programming language used to implement the XML-RPC interface is not important: a PHP client can talk to a Java server, a Java client to a Perl server, and so on. Most major programming languages have an XML-RPC library, so developers don't have to deal with the low-level workings of the protocol.
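The XML that travels between client and server can be inspected with the same Python standard library. The sketch below encodes a call and a reply and decodes them again; the method name "search" and the example strings are assumptions made for illustration.

```python
import xmlrpc.client


def encode_call(method, *args):
    """Wrap a method name and its arguments in an XML-RPC request document."""
    return xmlrpc.client.dumps(tuple(args), methodname=method)


def encode_reply(value):
    """Wrap a single return value in an XML-RPC response document."""
    return xmlrpc.client.dumps((value,), methodresponse=True)


def decode(xml_text):
    """Unwrap a request or response into (params, method name or None)."""
    return xmlrpc.client.loads(xml_text)


if __name__ == "__main__":
    request_xml = encode_call("search", "gezien")
    print(request_xml)   # the XML document a client would send
    reply_xml = encode_reply(["hij heeft het gezien"])
    print(reply_xml)     # the XML document a server would send back
    print(decode(reply_xml))
```

Because both documents are plain XML with a fixed structure, a library in any other language can produce or consume them just as well.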
4. Usefulness
Web service interfaces for different corpora are useful in general, beyond the Edisyn project and its goals, and potentially long after the project has run its course. If different corpora expose their search facilities in this way, anyone could write a centralized search application focused on their own specific goals.