Even if you are not using a re-annotation technique, we recommend that the validators create a copy of each validated transcript or annotation file and mark the errors found in it in a way that allows the errors to be extracted automatically later. For example, suppose the following is a piece of phonemic segmentation from the corpus:
  SAP: 2343 16574 h
  SAP: 18917 9780 OY
  SAP: 28697 2376 d
  SAP: 31073 3289 @

If the validator checking these data decides that the phoneme category /d/ is wrong, he adds a specially marked line to his copy of the annotation file:
  SAP: 2343 16574 h
  SAP: 18917 9780 OY
  SAP: 28697 2376 d
  ANN_ERROR: SAP: 28697 2376 t
  SAP: 31073 3289 @

This way the validator can provide detailed information about the errors to the client / producer, which is often required in the validation contract.
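The automatic extraction mentioned above can be very simple if the validators stick to the ANN_ERROR convention. The following sketch assumes the convention shown in the example, namely that each ANN_ERROR line directly follows the label it corrects; the function name and return format are illustrative, not part of any standard tool.

```python
def extract_errors(lines):
    """Collect every ANN_ERROR line together with the label it corrects.

    Assumes the convention that an ANN_ERROR line directly follows the
    erroneous label line. Returns (original_label, correction) pairs.
    """
    errors = []
    previous = None
    for line in lines:
        line = line.strip()
        if line.startswith("ANN_ERROR:"):
            correction = line[len("ANN_ERROR:"):].strip()
            errors.append((previous, correction))
        elif line:
            previous = line
    return errors

# The annotated copy from the example above:
annotated = [
    "SAP: 2343 16574 h",
    "SAP: 18917 9780 OY",
    "SAP: 28697 2376 d",
    "ANN_ERROR: SAP: 28697 2376 t",
    "SAP: 31073 3289 @",
]
print(extract_errors(annotated))
```

Running this over all annotated copies yields an error list per file that can be handed to the client / producer without any manual collation.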
Only employ validators who are native speakers of the corpus language. If you are working with a group of validators, try to achieve the same level of expertise across all of them. For instance, if you are validating the phonemic segmentation of speech signals, hire only well-trained phoneticians and have them take part in a dedicated training session to make sure everybody has the same conception of the potential errors found in the data.
Define an error scheme for each type of annotation, i.e. a closed set of error types together with their descriptions and examples. Test the scheme on a small-scale data set before the whole group of validators starts working.
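Encoding the error scheme in machine-readable form keeps the set genuinely closed: a validator's tooling can reject any error type that is not in the scheme. The codes and descriptions below are invented examples for phonemic segmentation, not taken from an existing standard.

```python
# Hypothetical error scheme for phonemic segmentation validation.
# Each error type pairs a code with a description and an example.
ERROR_SCHEME = {
    "WRONG_LABEL": "The phoneme category is wrong, e.g. /d/ instead of /t/.",
    "BOUNDARY": "A segment boundary deviates clearly from the signal.",
    "MISSING_SEG": "A phoneme realised in the signal has no segment.",
    "SPURIOUS_SEG": "A segment has no corresponding event in the signal.",
}

def check_error_code(code):
    """Enforce the closed set: reject any code outside the scheme."""
    if code not in ERROR_SCHEME:
        raise ValueError(f"Unknown error type: {code}")
    return code
```

During the small-scale test phase, unknown-code errors raised by such a check are a useful signal that the scheme itself is incomplete and needs another error type.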
For larger validation groups, use a database system to keep track of which data have already been validated. Use some kind of server/client architecture to automatically deal out data that have not yet been validated and to collect the results. A simple and very effective tool for this is WWWTranscribe. See appendix B for a short description of WWWTranscribe and how to obtain it.
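The bookkeeping such a server has to do is modest. The following sketch shows one way to track assignment and results with an SQLite table; the table layout, file names, and function names are assumptions for illustration and have nothing to do with WWWTranscribe's internals.

```python
import sqlite3

def init_db(conn, items):
    """Register all annotation files that still need validation."""
    conn.execute("CREATE TABLE work (item TEXT PRIMARY KEY, "
                 "validator TEXT, result TEXT)")
    conn.executemany("INSERT INTO work (item) VALUES (?)",
                     [(i,) for i in items])

def deal_out(conn, validator):
    """Assign the next unassigned item to a validator, or None if done."""
    row = conn.execute("SELECT item FROM work WHERE validator IS NULL "
                       "ORDER BY item LIMIT 1").fetchone()
    if row is None:
        return None
    conn.execute("UPDATE work SET validator = ? WHERE item = ?",
                 (validator, row[0]))
    return row[0]

def collect(conn, item, result):
    """Store the validation result for an item."""
    conn.execute("UPDATE work SET result = ? WHERE item = ?", (result, item))

# Example session with two hypothetical annotation files:
conn = sqlite3.connect(":memory:")
init_db(conn, ["sig001.par", "sig002.par"])
first = deal_out(conn, "validator_A")
collect(conn, first, "ok")
```

A real server would add authentication and concurrency control, but the deal-out / collect cycle is the core of the architecture described above.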