Next: Meta Data Up: Legal Aspects, Contracts Previous: BAS Contents

Sharing Model

The cost of producing a speech corpus ranges from EUR 20,000 for a small monolingual read speech corpus to several million EUR for a large multilingual, multi-modal Wizard-of-Oz (WOZ) corpus. In almost all cases it makes sense to share these corpora.

Small corpora are often highly innovative - sharing them after a period of exclusive use generates revenue for the owner without compromising the owner's competitive advantage.
Large corpora are often too expensive for a single institution to produce - a common specification, a distributed collection effort, and a one-to-one exchange of corpus data help to reduce the cost for each partner.
In general, the value of a corpus multiplies with the number of contexts (e.g. languages or recording environments) for which it is available.

For the production of a shared corpus, the obvious organizational form is collaboration: the partners form a consortium with the aim of creating a shared speech corpus, e.g. a multilingual corpus. Each partner is responsible for one part of the corpus, e.g. its own language, and in the end all parts are exchanged freely within the consortium. For the arrangement to work out satisfactorily for all partners, a very careful corpus design and strict monitoring by an independent party outside the consortium are indispensable.
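The cost argument behind this consortium model can be sketched with a few lines of Python. This is only an illustration under simplifying assumptions that are not part of the original text: every part of the corpus is assumed to cost the same, each partner produces exactly one part, and the cost of external monitoring (the hypothetical `overhead` parameter) is split equally. The function name and all figures are invented for the example.

```python
def cost_comparison(cost_per_part: float, partners: int, overhead: float = 0.0) -> dict:
    """Compare building all parts alone with building one part and exchanging.

    Assumptions (illustrative only): equal cost per part, one part per
    partner, and shared overhead (e.g. external monitoring) split equally.
    """
    if partners < 1:
        raise ValueError("a consortium needs at least one partner")
    # Cost for one institution that produces every part on its own.
    alone = cost_per_part * partners
    # Cost for one partner that produces a single part and receives the rest.
    shared = cost_per_part + overhead / partners
    return {"alone": alone, "shared": shared, "saving": alone - shared}


if __name__ == "__main__":
    # Hypothetical figures: 5 partners, EUR 100,000 per language part.
    print(cost_comparison(100_000.0, 5))
```

Under these assumptions each partner pays for one part but ends up with all of them, so the saving grows with the number of partners. Real projects will deviate because the parts differ in cost and coordination is not free.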

SpeechDat (M), SpeechDat (II) and SpeechDat Car were the first large corpus productions based on this sharing model; others might follow. See www.speechdat.org for details about the SpeechDat projects.


BITS Projekt-Account 2004-06-01