
# Example ABM simulation 

# F. Schiel, J. Harrington, R. Winkelmann, M. Stevens

For versioning history see header of script Rcmd/master.R

Warning: the function modules in sub dir functions/
(agentfns.R, specificAgentFunctions.R, equal_class.R, splitandmerge.R)
are used by other projects. Do not change them without maintaining strict
backwards compatibility!

-----------------------------------------------------
Open Issues

- speaker normalization: currently we apply Lobanov normalization to all synchronous
(initial) track data of an agent. That means that the mean and SD of all the
track values (ignoring both the time information and the fact that the
data within a sibilant are of course correlated) are used to shift/scale the
tracks. In Mary's first ABM experiment (2016), this led to some peculiarly
distorted data, e.g. for speaker F05. Alternatives would be:
  + Jmh: use only data from the extreme points of the sibilant space (e.g. from
  'seen' and 'sheen')
  + Flo: use another normalization, such as a morphic normalization between the 
  2/9% quantiles (similar to Gerstman)
  + do nothing
- when using many word types, it turns out that the split&merge algorithm actually
  favors splits until it hits the constraint that each phonological cluster must be shared
  by at least 2 words (see function phonsplit.sub(), comment =
  'Do not allow a cluster to consist of just one word'). Alternatives:
  + set the constraint higher than 2 words, e.g. to a percentage of the total number of
  word types?
  + do the split&merge based on all memory contents of all agents; I'm not sure how
  this should work...
  + do nothing, but then for analysis and displays convert 'derived labels' into
  'equivalence labels' (jmh), i.e. replace each phone label that developed in the
  agents' memory by one of 7 possible combinations of the 'canonic labels'
  /s/, /str/ and /S/ that are defined by the word categories (which don't change).
  E.g. a derived label 'xcvdgr' that appears only in 's' and 'str' words will
  be called equivalence label 's+str' and so on. This results in 7 possible
  equivalence labels: 's' 'S' 'str' 's+str' 'str+S' 's+S' 's+str+S'.
  The problem here is of course that we mask the real greedy clustering that
  takes place in most cases until every derived label covers exactly two word
  classes. On the other hand, it sidesteps the open problem of how to prevent the greedy
  splitting without any heuristics.
- merge() is technically more likely than a split, since it requires only equal
  distances, not significantly smaller distances; mergers should therefore occur more often
  than splits. In fact we observe the opposite: the initial splitandmergefull() usually
  ends in very small clusters (each covering 2-3 words).
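
To illustrate the first open issue, here is a minimal sketch of Lobanov
normalization over all pooled track values of a speaker. The data frame
layout (columns Speaker and value) and the numbers are hypothetical and
not the format used by the scripts.

```r
# Hypothetical toy data: pooled track samples of two speakers.
tracks <- data.frame(
  Speaker = rep(c("F01", "M01"), each = 4),
  value   = c(4000, 5000, 6000, 7000, 3000, 3500, 4000, 4500)
)

# Lobanov: z-score every value against the mean and SD of all values
# of the same speaker, ignoring time and within-token correlation.
lobanov <- function(x) (x - mean(x)) / sd(x)
tracks$norm <- ave(tracks$value, tracks$Speaker, FUN = lobanov)

# After normalization each speaker has mean 0 and SD 1.
aggregate(norm ~ Speaker, data = tracks, FUN = mean)
```

Because a single mean and SD per speaker are estimated from all samples,
one speaker with an unusual spread of values shifts and scales all of
that speaker's tracks at once.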

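The conversion of 'derived labels' into 'equivalence labels' discussed in the
open issues could look roughly like the following sketch. The data frame mem
and its column names are hypothetical; only the canonic classes 's', 'str'
and 'S' and the naming scheme are taken from the description above.

```r
# Hypothetical toy memory: derived labels and the fixed word class (Initial).
mem <- data.frame(
  derived = c("xcvdgr", "xcvdgr", "xcvdgr", "qwe", "qwe", "zz"),
  Initial = c("s", "str", "s", "S", "S", "str"),
  stringsAsFactors = FALSE
)

canonic <- c("s", "str", "S")  # fixed order used to build the names

# For each derived label, join the canonic classes it occurs in.
equiv_of <- tapply(mem$Initial, mem$derived, function(cl) {
  paste(canonic[canonic %in% cl], collapse = "+")
})

# Look up the equivalence label of every exemplar by its derived label.
mem$equivalence <- as.vector(equiv_of[mem$derived])
mem
```

With three canonic classes there are 2^3 - 1 = 7 non-empty combinations,
which matches the 7 equivalence labels listed above.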

-----------------------------------------------------
Description

# This script exemplifies agent-based modelling (ABM) after the fashion 
# of Harrington & Schiel 2017, Language,
# applied to within-group s-retraction in Australian English and extended
# by the possibility of split&merge of agent-specific phonological categories.
# It can be used to adapt the basic ABM technique to other 
# use cases.

# Refer to the following paper for background and details: 
# Harrington, J. & Schiel, F. (in press) /u/-fronting and agent-based modeling: The relationship between the origin and spread of sound change. Language, in press.
# download: http://www.phonetik.uni-muenchen.de/~jmh/papers/harringtonschiel.pdf

# This ABM assumes that the features exchanged between agents are triplets
# of DCT coefficients (DCT-0 ... DCT-2) derived from a feature track within a phoneme 
# (e.g. in this example the first spectral moment over the time course of a 
# sibilant /s/ or /ʃ/). There is no easy way 
# to change the script to another number of dimensions or parameterisation
# because tracks are frequently recovered from the DCT triplets, but technically 
# you can use different dimensions as you wish (plots might fail, though).
# This ABM mainly models within-group interactions; there is no group contact
# in the main hypothesis, but rather a grouping according to word types 
# (word initial 's', 'str' and 'ʃ'). 
# However, to give examples of group-related sound change
# modelling we add a group category 'gender' to the main dataframe str.df, 
# show some example plots across genders, and comment at 
# possible locations in the script where agent grouping might be applicable.
# In general you can think of the 'sibilant class' or 'word initial'
# expressed in the column str.df$Initial as the grouping factor for this ABM.
# If you intend to model 'real' agent groups you'll have to replace this factor
# by a grouping factor such as 'gender'. 
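
# A minimal sketch of how a DCT triplet can be computed from a track and how a
# smoothed track is recovered from it. The function names dct_coef and
# dct_inverse, the toy track and the scaling of the coefficients are
# assumptions of this sketch; the functions used by the scripts may scale
# the coefficients differently.

```r
# DCT-II coefficients 0..m of a track x (scaling is an assumption of this sketch).
dct_coef <- function(x, m = 2) {
  N <- length(x)
  n <- seq_len(N) - 0.5
  sapply(0:m, function(k) {
    w <- if (k == 0) 1 / N else 2 / N
    w * sum(x * cos(pi * k * n / N))
  })
}

# Rebuild a smoothed track of length N from the coefficients.
dct_inverse <- function(coef, N) {
  n <- seq_len(N) - 0.5
  k <- seq_along(coef) - 1
  as.vector(cos(pi * outer(n, k) / N) %*% coef)
}

track  <- c(5200, 5600, 5900, 6000, 5900, 5500, 5000)  # toy values
P      <- dct_coef(track)   # P[1], P[2], P[3] play the role of P1, P2, P3
smooth <- dct_inverse(P, length(track))
```

# With this scaling the first coefficient equals the mean of the track, which
# fits the names 'Height', 'Slope' and 'Curvature' used for P1, P2 and P3.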

# In this example script two basic types of ABM experiments are implemented
# which the user can select interactively:
# 1. Single-run ABM runs one simulation of simGroups * simGroupSize interactions;
# after every simGroup, plots are created and the agent memories are stored. Single runs
# can be performed with plotting of group-level distributions/tracks or for a single 
# agent only.
# 2. Multiple-run ABM runs the same configuration several times;
# some results can then be averaged over the ABM runs to make the result more robust.

# The term 'Harrington Rule' in the following refers to the basic idea that 
# sound change from one categorical feature distribution in the direction of 
# another category is driven by a simple rule: an agent memorizes a word token 
# only if it is close to the class distribution 
# within the memory of the agent. This script mainly implements the second method 
# described in Harrington & Schiel 2017, Language, Section 4,
# i.e. memorization is restricted by the following constraint:
# the log Mahalanobis distance of the received token to the phonological distribution 
# (perc.sibilant.2|3) must be less than or equal to a threshold (2.5), or the posterior probability 
# (perc.sibilant.1) must be higher than 1/3, to prevent outliers from being memorized.
# This is a strong constraint because it is independent of the other phonological classes
# and depends only on the closeness *and form* of the Gaussian of the exchanged class.
# Furthermore, memory loss (= the deletion of an exemplar from agent memory) is
# controlled by age or by outlier-removal (the largest Mahalanobis distance). 
# The package also contains alternative strategies, mainly the original memorization
# strategy of Harrington; please refer to the parameters prodFuncName and 
# percFuncName described below.
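
# The Mahalanobis variant of the memorization constraint and the outlier-based
# memory loss can be sketched as follows. This is not the code of the
# perc.sibilant functions: the function names accept_token and drop_outlier,
# the toy memory, and the choice to take the log of the unsquared distance are
# assumptions of this sketch. The posterior-probability variant is not shown.

```r
# Decide whether a received token is memorized.
# 'exemplars' holds the P1..P3 rows of the token's class in the listener's memory.
accept_token <- function(token, exemplars, threshold = 2.5) {
  # mahalanobis() returns the squared distance
  d2 <- mahalanobis(token, colMeans(exemplars), cov(exemplars))
  log(sqrt(d2)) <= threshold
}

# Memory loss by outlier removal: drop the exemplar farthest from the class mean.
drop_outlier <- function(exemplars) {
  d2 <- mahalanobis(exemplars, colMeans(exemplars), cov(exemplars))
  exemplars[-which.max(d2), , drop = FALSE]
}

set.seed(1)
memory <- matrix(rnorm(60), ncol = 3,
                 dimnames = list(NULL, c("P1", "P2", "P3")))
accept_token(colMeans(memory), memory)   # a token at the class mean is accepted
accept_token(c(1e6, 1e6, 1e6), memory)   # a far outlier is rejected
nrow(drop_outlier(memory))               # one exemplar fewer than before
```

# Since the distance uses the covariance of the class, a token at the same
# Euclidean distance may be accepted along the long axis of the Gaussian and
# rejected along the short one.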

# The term 'split&merge' in the following refers to a basic data-driven
# feature clustering based on k-means developed by J. Harrington in 2016.
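
# As a rough illustration only, a single split test based on k-means could look
# like the sketch below. It is not the algorithm in splitandmerge.R: the
# function name try_split, the use of 2 clusters, the toy data and the way the
# 'at least 2 words per cluster' constraint is checked are assumptions.

```r
# One split test: 2-means on the tokens of a class, rejected if any
# resulting cluster would consist of fewer than min_words word types.
try_split <- function(features, words, min_words = 2) {
  km <- kmeans(features, centers = 2, nstart = 10)
  words_per_cluster <- tapply(words, km$cluster,
                              function(w) length(unique(w)))
  if (all(words_per_cluster >= min_words)) {
    km$cluster                    # accept the split
  } else {
    rep(1L, nrow(features))       # keep everything in one cluster
  }
}

set.seed(42)
feat <- rbind(matrix(rnorm(40, mean = 0, sd = 0.1), ncol = 2),
              matrix(rnorm(40, mean = 5, sd = 0.1), ncol = 2))
words_ok  <- rep(c("seen", "sane", "stream", "street"), each = 10)
words_bad <- rep(c("seen", "sane", "stream"), times = c(10, 10, 20))

table(try_split(feat, words_ok))    # split accepted: two clusters
table(try_split(feat, words_bad))   # split rejected: one cluster
```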

# The software structure is as follows:
# - the master.R script residing in sub-dir Rcmd runs the experiment; 
# start it by issuing 'source("<path>/master.R")' on the R command line. It
# should lead the experimenter through the simulation. At numerous points
# the user is asked to confirm plots or printouts before continuing and/or
# is asked to select different ways to simulate (basically a single run or
# multiple runs). In the same location as the master script there must be 
# the script pathsAndLibraries.R which defines file paths (e.g. the 
# installation dir on the local computer), libraries/functions to be loaded,
# etc. Be sure to use an up-to-date R version, since you will probably have 
# to install some packages before the master.R script runs.

# A large number of parameters defined at the start of the script control the 
# simulation; for each run a log file <date&time>.txt is created in the 
# sub dir defined by LogDir, as well as other results named with 
# the same <date&time> (e.g. animation of the feature space, log files, 
# feature distributions/tracks before/after the ABM etc.).
# - the sub dir 'data' stores initial feature data to be loaded by the ABM
# (in most cases dataframes and/or trackobjects) 
# - the subdir 'functions' stores low level functions partly developed by 
# R. Winkelmann. Some of these exist in several different forms; the master
# script can select which functions are being used via parameters passed to
# the ABM simulation (see parameters 'percFuncName' 'prodFuncName' below)
# Split&merge functions are stored here as well.
# - the sub dir PNGs is used to store single frames of simulations
# - the sub dir LogDir stores automatic log files and plots; each ABM run resides in a
# sub dir LogDir/<date&time> which is referred to as 'LogDirDate' in the following  
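
# A minimal sketch of how a per-run directory LogDir/<date&time> and the
# matching log file could be set up. The timestamp format, the use of
# tempdir() and the variable names stamp and logFile are assumptions of this
# sketch; master.R may name things differently.

```r
# Assumed location for illustration: a LogDir below R's session temp dir.
LogDir <- file.path(tempdir(), "LogDir")

# One timestamp shared by the log file and the run directory.
stamp      <- format(Sys.time(), "%Y%m%d%H%M%S")
LogDirDate <- file.path(LogDir, stamp)
dir.create(LogDirDate, recursive = TRUE, showWarnings = FALSE)

# Log file <date&time>.txt in LogDir.
logFile <- file.path(LogDir, paste0(stamp, ".txt"))
cat("ABM run started\n", file = logFile)
```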

# Input data
# One of the main innovative features of the Harrington ABM is the usage of real
# synchronous production data observed from speech recordings. Therefore, the 
# first part of this script is dedicated to creating these input data in the form of a dataframe
# that can then be used to initialize agent memories. Basically, this means that 
# we make a very simple assumption, namely that the exemplars in the memory of an 
# agent are the same as his observed productions at a certain point in time.
# The input dataframe is called str.df in the following and must conform to 
# the code (here and in many functions residing in the sub dir functions). Here
# is a brief description that might help to adapt the script to other types of
# ABM experiments:
# Each line of str.df represents one exemplar token observed in the acoustics of 
# a real speaker. The obligatory columns in str.df and their meaning for the ABM 
# are (all other columns are ignored):
# labels    : the phonological label of the spoken phoneme (here: 's' or 'ʃ');
#             this implicitly defines the initial phonological inventory of each speaker
#             (here: every agent starts with 2 phoneme classes 's' and 'ʃ'; and to make 
#             things even more complicated, 'labels' is called 'V' in the later part 
#             of the script, sorry)
# Word      : the word class of the word in which the phoneme is contained
#             (here: 'sane', 'stream', 'shame' etc.)
# Speaker   : the speaker ID; for the grouping according to gender to work, the 
#             ID must be of the form 'F##' (female) or 'M##' (male)
# Initial   : the phonological class (here: the sibilant+context, 's', 'str' or 'S');
#             this never changes because it is defined by the word label, so 
#             this is redundant information but handy to have.
# Gender    : agent grouping factor; in other types of ABM experiments (e.g. group contacts) this can be used 
#             to mark a speaker group instead of gender
# P1,P2,P3  : the actual DCT coefficients of the track data: DCT-0 ... 2, also referred
#             to as 'Height', 'Slope' and 'Curvature'; this is the 'signal' exchanged
#             between the agents; these can be anything and can be any number of features; e.g. just 
#             P1 and P2 would change the dimensionality of the ABM to 2 dimensions.
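
# A toy dataframe with the obligatory columns might look like the sketch below.
# All values are invented for illustration, and the coding of Gender as the
# first letter of the speaker ID is an assumption of this sketch.

```r
# Hypothetical toy version of str.df with the obligatory columns only.
str.df <- data.frame(
  labels  = c("s", "s", "ʃ", "s", "s", "ʃ"),
  Word    = c("sane", "stream", "shame", "sane", "stream", "shame"),
  Speaker = rep(c("F01", "M01"), each = 3),
  Initial = c("s", "str", "S", "s", "str", "S"),
  P1      = c(6.1, 5.2, 4.3, 5.9, 5.0, 4.1),      # DCT-0, 'Height'
  P2      = c(0.2, -0.4, 0.1, 0.3, -0.5, 0.0),    # DCT-1, 'Slope'
  P3      = c(-0.1, 0.0, 0.1, -0.2, 0.1, 0.0),    # DCT-2, 'Curvature'
  stringsAsFactors = FALSE
)

# Derive the grouping factor from the speaker ID convention 'F##' / 'M##'.
str.df$Gender <- substr(str.df$Speaker, 1, 1)

# Check that all obligatory columns are present before starting the ABM.
required <- c("labels", "Word", "Speaker", "Initial", "Gender", "P1", "P2", "P3")
stopifnot(all(required %in% names(str.df)))
```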

# Plots
# Pictures are created as PDF/PNG in the LogDir.
# Several switches control the behaviour of the script, i.e. whether plots are 
# generated during the simulations, and whether and which time-slice frames 
# (*.png) are generated in the PNGs subdir that can later 
# be aggregated into animated GIFs (see boolean definitions below). For this,
# the software ImageMagick (command line command 'convert') must be installed.
# Ubuntu: sudo apt-get install imagemagick
# Mac: http://imagemagick.org/script/binary-releases.php#macosx


############################################################
# call with 'source("<pfad2>/Rcmd/master.R")'
############################################################

