Data Pipelining



Speech corpus collection is usually not a strictly linear process, even though this cookbook depicts it as one. You will therefore most likely start to process, or even annotate, your recorded data while the collection is still in progress. The term Data Pipelining refers to the logistical problem of ensuring that the required data are at the required location at the right time.

In large projects that require many post-processing steps and annotation procedures, possibly carried out in parallel by different working groups, this problem can be the hardest to tackle. One aid for avoiding costly idle times is a dynamic data flow chart in which staff members can see online which data are available, which data are to be processed next, and even which data are to be expected in the near future. One practical way to realize this might be a Web interface generated from a database in which all processing steps are logged. A working group checks out data for annotation in the database and later, after finishing the job, marks the data as ready. If this is done systematically and consistently, management can detect bottlenecks or idle times early enough to react accordingly.^6.3
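The check-out and mark-as-ready bookkeeping described above could be sketched roughly as follows. This is only an illustration using Python's built-in sqlite3 module; the table layout, the status names ('available', 'checked_out', 'ready') and all function names are assumptions made for this example and are not prescribed by the cookbook.

```python
import sqlite3


def open_tracker(path=":memory:"):
    """Open the (hypothetical) tracking database and create its table if needed."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS items (
               session      TEXT NOT NULL,   -- recording session identifier
               step         TEXT NOT NULL,   -- processing or annotation step
               status       TEXT NOT NULL DEFAULT 'available',
               worker_group TEXT,            -- group that checked the item out
               PRIMARY KEY (session, step)
           )"""
    )
    return db


def register(db, session, step):
    """Log that a session is now available for the given processing step."""
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT INTO items (session, step) VALUES (?, ?)", (session, step)
        )


def check_out(db, session, step, group):
    """Check out an available item; return False if it is not available."""
    with db:
        cur = db.execute(
            "UPDATE items SET status = 'checked_out', worker_group = ? "
            "WHERE session = ? AND step = ? AND status = 'available'",
            (group, session, step),
        )
    return cur.rowcount == 1


def mark_ready(db, session, step, group):
    """Mark an item as ready; only the group that checked it out may do so."""
    with db:
        cur = db.execute(
            "UPDATE items SET status = 'ready' "
            "WHERE session = ? AND step = ? "
            "AND status = 'checked_out' AND worker_group = ?",
            (session, step, group),
        )
    return cur.rowcount == 1


def overview(db):
    """Return (step, status, count) rows, e.g. as input for a Web status page."""
    rows = db.execute(
        "SELECT step, status, COUNT(*) FROM items "
        "GROUP BY step, status ORDER BY step, status"
    )
    return rows.fetchall()
```

A Web interface could render the result of overview() periodically: a step with many 'available' items and few 'ready' ones hints at a bottleneck, while a step with nothing 'available' hints at an idle working group.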

The concerns about storage and safety mentioned above also apply to the whole pipeline, of course. Very often you will find that idle times are caused not by working groups that are too slow or too fast but by missing resources such as disk space. Always be prepared to store the data of up to 10 recording days 'on the side' in case a problem in the data pipeline has to be solved first. If you do not provide this buffer, the whole pipeline might come to a stop, which might cost you a lot of money.
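Whether the storage buffer is still large enough can be checked automatically. The following is a minimal sketch using Python's standard shutil.disk_usage; the function names and the idea of estimating one fixed data volume per recording day are assumptions for illustration only.

```python
import shutil


def buffer_days(free_bytes, bytes_per_recording_day):
    """Return how many recording days fit into the given free space."""
    if bytes_per_recording_day <= 0:
        raise ValueError("bytes_per_recording_day must be positive")
    return free_bytes / bytes_per_recording_day


def has_buffer(path, bytes_per_recording_day, required_days=10):
    """Check whether the volume holding `path` can buffer `required_days`."""
    free = shutil.disk_usage(path).free  # free bytes on that volume
    return buffer_days(free, bytes_per_recording_day) >= required_days
```

For example, if one recording day is estimated to produce 8 GB (a made-up figure), has_buffer("/data", 8 * 10**9) reports whether the volume holding the hypothetical directory /data still has room for ten such days.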


BITS Projekt-Account 2004-06-01