
=====================================================
==== Use case anonymizing a video + annotation ======
=====================================================

What we have: A video file (*.mp4) with one or more persons speaking.
The video is annotated; the annotation is in BPF format.

What we want: Create a version of video and annotation so that the following
terms are masked (anonymized):
B-B-C
broadcasting
bulletins
report
reports
school
stories
studio
television
website
All masked words should be replaced by the tag *MASKED* and all masked
phonetic items (phonemes, syllables) should be replaced by the tag
<masked>; the speech signal should be masked by a sine tone; the output
annotation should be rendered as a Praat TextGrid.


Solution: Web Interface Anonymizer
(a detailed description of the 'Anonymizer' service is at the bottom of
this README.)

=================================================
=============== Procedure =======================
=================================================

* go to http://clarin.phonetik.uni-muenchen.de/BASWebServices

* go to service Anonymizer

* upload Video_100sec_deu-DE.mp4 and the BPF annotation file
  Video_100sec_deu-DE.par (in the directory where this README is)
  by dropping the files in the grey upload area and clicking 'Upload'.

* upload Video_100sec_deu-DE_aTerms.txt (= the list of terms to be 
  anonymized, in this directory) under the option 
  'File with terms to be anonymized'

* choose the following options:
    'Output format' = TextGrid
    'Tag for anonymized word labels' = *MASKED*
    'Tag for anonymized phone labels' = <masked>
    'Signal type to mask terms' = beep

* click 'Run Web Service'


  What happens now:
  The service looks for all matches of the terms in the ORT tier of the
  input BPF annotation file. All matches are cross-referenced to all other
  parallel annotations (word tokens and phonetic derivatives) and to the
  speech signal, and all of these are masked. The masked video and the
  annotation file (transformed into TextGrid) are then returned by the
  service in a ZIP archive.


* download the result ZIP archive Video_100sec_deu-DE.zip and extract it. 

* Play the video and inspect the annotation file.

Hint: you can do the same with text input instead of BPF input
(Video_100sec_deu-DE.txt, in this directory) by using the 'Pipeline'
service with the pipeline 'G2P_MAUS_ANONYMIZER'; the Anonymizer options
can be found under 'Expert options' in the Pipeline service.
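
The matching step described under 'What happens now' can be illustrated
with a small Python sketch. It is not the service's implementation: the
function name mask_terms and the sample word list are made up for this
README, and the comparison is an exact, case-sensitive string match (the
service's actual matching rules are not specified here). It only shows how
single-word and multi-word terms are matched against a sequence of ORT
words and replaced by a tag.

```python
def mask_terms(words, terms, tag="*MASKED*"):
    """Return a copy of `words` with every matched term replaced by `tag`.

    `words` is the sequence of orthographic words (as in the ORT tier),
    `terms` is the list of terms; a term containing blanks only matches
    if its words occur consecutively.
    """
    term_tokens = [t.split() for t in terms if t.strip()]
    masked = list(words)
    for i in range(len(words)):
        for tokens in term_tokens:
            # Compare the slice starting at position i with the term's words.
            if words[i:i + len(tokens)] == tokens:
                for j in range(i, i + len(tokens)):
                    masked[j] = tag
    return masked


# Hypothetical ORT word sequence, for illustration only.
ort = ["the", "B-B-C", "reports", "from", "the", "studio"]
print(mask_terms(ort, ["B-B-C", "reports", "studio"]))
# ['the', '*MASKED*', '*MASKED*', 'from', 'the', '*MASKED*']
```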

--------------------------------------

Description of Anonymizer

The 'Anonymizer' is a tool that allows you to mask terms (= words or phrases) within a recording and the corresponding annotation, so that the data can be published without violating legal rights. For this, the annotation must be formatted in BPF (see here for details on BPF) and must contain a time alignment (usually the result of a MAUS segmentation). Further, you must define all terms that should be anonymized in a single TXT file (UTF-8 encoded) and provide this file to the service via the option 'File with terms to be anonymized'. The 'Anonymizer' will then search for these terms in the orthographic tier ORT of the input BPF file and mask all occurrences in the signal and in the annotation.

The service reads as input a signal file (sound, video), a BAS Partitur Format annotation (BPF, *.par), and - via the option 'File with terms to be anonymized' - a list of terms to be anonymized in both inputs. The service then masks all occurrences of these terms in the signal and in the annotation, and returns the two anonymized files in a ZIP archive. It is highly recommended to use this service as the last module of a pipeline (see Pipeline service) whenever you require that certain terms are masked in your data. For instance, the pipeline 'G2P_CHUNKER_MAUS_ANONYMIZER' will deliver an anonymized, classical MAUS segmentation result together with the masked input signal. Note that the ANONYMIZER options can be found under 'Expert options' in the Pipeline service.


The input signal can be one of wav, nis, nist, sph, mp4, mpeg, mpg, avi, fvl, or can be omitted (if option 'Process Annotation Only' is set); the input annotation must be a BPF file *.par with at least an ORT tier and one of the MAU, SAP, PHO tiers (see here for details regarding the BPF).
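
Before uploading, the tier requirement can be checked locally. The following Python sketch is only an illustration and not part of the service: the function names and the sample content are made up, and it assumes that every BPF line starts with a three-character key followed by a colon (e.g. 'ORT:', 'MAU:').

```python
import re


def bpf_tiers(bpf_text):
    """Collect the three-character keys found at the start of the lines."""
    tiers = set()
    for line in bpf_text.splitlines():
        # Assumption: BPF lines look like 'KEY: ...' with a 3-character key.
        match = re.match(r"([A-Z0-9]{3}):", line)
        if match:
            tiers.add(match.group(1))
    return tiers


def is_valid_anonymizer_input(bpf_text):
    """True if an ORT tier and at least one of MAU, SAP, PHO are present."""
    tiers = bpf_tiers(bpf_text)
    return "ORT" in tiers and bool(tiers & {"MAU", "SAP", "PHO"})


# Hypothetical minimal BPF content, for illustration only.
sample = (
    "LHD: Partitur 1.3\n"
    "SAM: 16000\n"
    "LBD:\n"
    "ORT: 0 school\n"
    "MAU: 0 1599 0 S\n"
)
print(is_valid_anonymizer_input(sample))  # True
```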

The output is a ZIP archive. By default it contains the masked signal and the anonymized annotation. The signal is masked by noise or a beep (see option 'Signal type to mask terms'); sound files keep the same properties as the input, while video input is re-coded into MP4 with h264 and aac encoding. The annotation is written in the format given by the option 'Output format', with all occurrences of word labels replaced by the string given in option 'Tag for anonymized word labels' (default: 'ANONYMIZED') and all occurrences of phonetic labels replaced by the string given in option 'Tag for anonymized phone labels' (default: 'nib' for SAMPA, '(.)' for IPA encodings). If option 'Process Annotation Only' is set, the ZIP contains just the anonymized annotation file.

The (required) input list of terms to be anonymized must be uploaded via the option 'File with terms to be anonymized'; it must be encoded in UTF-8 and contain one term per line. Terms may contain blanks, in which case only consecutive occurrences of the words within the term are anonymized (phrases or sentences). Formulating the terms is tricky, because they must exactly match the way words are represented in the ORT tier of the input BPF. For instance, if you intend to mask all occurrences of the name Hitler, you should define a list like:

Hitler
Adolf Hitler
Hitlers
Hitler's

but the list need not contain combinations with quotes and punctuation marks like:

Hitler,
Hitler?
"Hitler"
'Hitler'

etc., because these are deleted from the ORT tier anyway (but of course they do not hurt the service if you include them). So, to be sure that all occurrences of a name/word are masked in the recording, include all possible word forms in the list.
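
A term file that meets the requirements above (UTF-8, one term per line) can be written with a few lines of Python. This is a minimal sketch, not a tool that ships with the service; the function name write_terms_file is made up for this README. It collapses repeated blanks, skips empty entries and drops duplicates, but it does not generate word forms for you.

```python
from pathlib import Path


def write_terms_file(path, terms):
    """Write `terms` to `path` as UTF-8, one term per line.

    Returns the list of terms actually written (cleaned, without duplicates).
    """
    seen = set()
    lines = []
    for term in terms:
        # Collapse runs of whitespace so multi-word terms use single blanks.
        cleaned = " ".join(term.split())
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            lines.append(cleaned)
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines
```

For example, calling write_terms_file with a file name of your choice and the list ["Hitler", "Adolf  Hitler", "Hitlers", "Hitler's", "Hitler"] writes four lines: the double blank is collapsed and the repeated entry is dropped.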

See also the list of FAQs: https://clarin.phonetik.uni-muenchen.de/BASWebServices/help#ExampleOfAnonymizerServiceUsage


