Distinctive language cues
This is work coming out of discussions on bilingualism with my postdoc advisor Denise Klein as well as developmental researchers Krista Byers-Heinlein, Linda Polka and Lena Kremin. A big question in bilingual language learning is how an infant learns to discriminate languages.
To train a multilingual speech recognizer, I used Mozilla’s Common Voice corpus. This is a large crowd-sourced speech data set covering multiple languages, from which I picked German, English and French. The corpus consists of individual sentences recorded by volunteer contributors through a web interface. I aligned transcriptions and audio at the word and phoneme level using the Montreal Forced Aligner. The following is an example:
These are the numbers of utterances I used for the training, development and test splits.
During training, utterances were sampled from the three languages with uniform probability.
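As a minimal sketch of such sampling, the snippet below first draws a language uniformly at random and then draws an utterance from that language. This two-step reading of "uniform probability" is an assumption, as are the toy corpus, the language codes and the function name `sample_utterance`; the actual training pipeline may differ.

```python
import random

def sample_utterance(corpus, rng):
    """Draw a language uniformly at random, then an utterance from it.

    `corpus` maps a language code to a list of utterance IDs (hypothetical layout).
    """
    language = rng.choice(sorted(corpus))
    return language, rng.choice(corpus[language])

# Hypothetical toy corpus with deliberately unequal sizes per language.
corpus = {
    "de": [f"de_{i}" for i in range(50)],
    "en": [f"en_{i}" for i in range(500)],
    "fr": [f"fr_{i}" for i in range(5)],
}

rng = random.Random(0)
counts = {lang: 0 for lang in corpus}
for _ in range(3000):
    lang, _utt = sample_utterance(corpus, rng)
    counts[lang] += 1
```

With this scheme each language contributes roughly a third of the training utterances, regardless of how many recordings it has in the corpus.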
For a given utterance the neural network was fed with a (mel-scale) spectrogram representation of the audio as in the figure above. Based on this it was trained to predict phoneme and word labels for every time point, as shown in the following diagram.
graph LR
spec(<font size=2>Mel-scale<br>spectrogram) --> conv[<font size=2>Convolutional<br>layers]
conv --> rnn1[<font size=2>Recurrent<br>layer 1]
rnn1 --> cphon((<font size=2>Phonemes))
rnn1 --> rnn2[<font size=2>Recurrent<br>layer 2]
rnn2 --> cword((<font size=2>Words))
classDef whiteBox fill:#ddd,stroke:#888,stroke-width:2px;
class spec,conv,rnn1,rnn2,cphon,cword whiteBox
The input passes through a series of convolutional layers (which have fixed receptive fields in time), followed by two recurrent layers (which have longer memory; this matters for language discrimination, as discussed below). Phoneme- and word-level classifications are computed from separate recurrent layers (see Lugosch et al., 2019, Interspeech), which imposes some hierarchical structure on the network's processing.
In the following, I show example outputs from the trained network (phoneme classifications) plotted on top of the example spectrogram shown before. You can add/remove traces by clicking on the legend entries.
This is not meant to show the accuracy of the network: I automatically selected a set of phonemes that reached a high probability at some point during the utterance. Rather, it demonstrates the dynamic and probabilistic nature of the network’s output. The following shows examples of word classifications.
The plot is busy, but you can remove/add languages by clicking on the figure legend.
To study the language discrimination behavior of the network more explicitly, I replaced the phoneme and word classifiers with a linear language classifier that receives the concatenated outputs of the two recurrent layers, as shown in the following diagram:
graph LR
spec(<font size=2>Spectrogram) --> conv[<font size=2>Convolutional<br>layers]
conv --> rnn1[<font size=2>Recurrent<br>layer 1]
rnn1 --> clang((<font size=2>Language))
rnn1 --> rnn2[<font size=2>Recurrent<br>layer 2]
rnn2 --> clang
classDef whiteBox fill:#ddd,stroke:#888,stroke-width:2px;
class spec,conv,rnn1,rnn2,clang whiteBox
This language ‘readout’ is trained on the same data while the rest of the network’s weights are frozen. The resulting classifier allows us to view language discrimination dynamically as in the following graph:
Here, the classifier’s output (language probabilities) is displayed on top of the spectrogram from the example sentence shown earlier. Note that language probabilities do not sum to 1 during pauses, since there is a fourth option “no speech” that the classifier can assign; I left this trace out of the plot to avoid clutter.
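A linear readout of this kind can be sketched in a few lines of NumPy. The sketch assumes a softmax over four classes (three languages plus "no speech") applied independently at every time step; the layer sizes, the random stand-in activations and the name `language_readout` are hypothetical and only illustrate the shapes involved.

```python
import numpy as np

def language_readout(h1, h2, W, b):
    """Per-time-step linear classifier on concatenated recurrent outputs.

    h1: (T, d1) outputs of recurrent layer 1
    h2: (T, d2) outputs of recurrent layer 2
    W:  (d1 + d2, n_classes) readout weights, b: (n_classes,) bias
    Returns an array of shape (T, n_classes) with class probabilities.
    """
    features = np.concatenate([h1, h2], axis=1)
    logits = features @ W + b
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

# Hypothetical sizes: 100 time steps, 64 units per recurrent layer,
# classes ordered as [German, English, French, no speech].
rng = np.random.default_rng(0)
h1 = rng.standard_normal((100, 64))  # stand-in for frozen layer-1 activity
h2 = rng.standard_normal((100, 64))  # stand-in for frozen layer-2 activity
W = 0.1 * rng.standard_normal((128, 4))
b = np.zeros(4)

probs = language_readout(h1, h2, W, b)
language_probs = probs[:, :3]  # the three traces shown when "no speech" is hidden
```

Only `W` and `b` would be trained in this setup; the activations `h1` and `h2` come from the frozen network. Because the fourth class takes up some probability mass, the three language traces sum to less than 1.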
After some initial uncertainty, the network correctly classifies the utterance as English from 500 ms onward. The second sentence “the man..” is immediately recognized as English. This is an effect of context, i.e. memory encoded in the recurrent activity, as the following graph shows: there, the network classifies the three parts of the utterance separately (as marked by the vertical lines).
Because classification depends on the context accumulated so far, I evaluated the accuracy of language discrimination on random snippets of different lengths taken from utterances of the Common Voice test set. The results of the classification are shown below.
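Drawing such snippets could look like the following sketch. The sample rate, the set of durations, the silent placeholder waveform and the helper name `random_snippet` are assumptions made for illustration, not details of the actual evaluation.

```python
import numpy as np

def random_snippet(audio, sample_rate, duration_s, rng):
    """Cut a random contiguous snippet of `duration_s` seconds from `audio`.

    Returns None when the utterance is shorter than the requested duration.
    """
    n_samples = int(round(duration_s * sample_rate))
    if n_samples > len(audio):
        return None
    # rng.integers excludes the upper bound, so the last valid start is included.
    start = rng.integers(0, len(audio) - n_samples + 1)
    return audio[start:start + n_samples]

# Hypothetical example: a 3-second utterance at 16 kHz (silence as placeholder).
sample_rate = 16000
audio = np.zeros(3 * sample_rate)
rng = np.random.default_rng(0)

durations_s = [0.25, 0.5, 1.0, 2.0, 4.0]
snippets = {d: random_snippet(audio, sample_rate, d, rng) for d in durations_s}
```

Each snippet would then be passed through the network, and the language with the highest probability at the end of the snippet compared against the true language.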
We see that discrimination accuracy increases with audio length, plateauing at over 98% for all three language pairs. Accuracy for the pair ‘German-English’ stays well below the others for audio snippets shorter than one second. As Germanic languages, these two should indeed be more similar to each other than to French as a Romance language. This difference in discrimination accuracy is likely not explainable by the rhythmical differences between the languages (stress- vs. syllable-timed), because rhythmic cues would only be apparent in longer audio snippets: here, we see most divergence at short audio lengths.
Cues for language discrimination at this short time-scale could be distinctive phonemes or phoneme sequences (phonotactics). In the following graph I show the distinctiveness of each language’s phonemes. This was done by letting the network classify one-second audio snippets, just as shown for the example sentence above. Any distinctive cue for a given language should result in a rise of that language’s probability curve. As a measure of distinctiveness, I therefore averaged the steepness of this curve for each phoneme; this is what is plotted in the following graph.
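One way to compute such a measure is sketched below. It assumes that "steepness" means the time derivative of the language probability, averaged over the frames aligned to each phoneme occurrence and then over occurrences; the frame length, the toy probability curve, the segment format and the name `phoneme_distinctiveness` are hypothetical.

```python
from collections import defaultdict

import numpy as np

def phoneme_distinctiveness(prob, segments, frame_s):
    """Average slope of a language probability curve per phoneme.

    prob:     (T,) probability of the true language at each frame
    segments: list of (phoneme, start_frame, end_frame), end exclusive
    frame_s:  frame duration in seconds
    Returns a dict mapping phoneme -> mean slope (probability per second).
    """
    slope = np.gradient(prob, frame_s)
    per_phoneme = defaultdict(list)
    for phoneme, start, end in segments:
        per_phoneme[phoneme].append(slope[start:end].mean())
    return {p: float(np.mean(v)) for p, v in per_phoneme.items()}

# Toy curve: the probability rises during the first phoneme, then stays flat.
prob = np.concatenate([np.linspace(0.2, 0.9, 11), np.full(10, 0.9)])
segments = [("x", 0, 10), ("a", 10, 20)]
scores = phoneme_distinctiveness(prob, segments, frame_s=0.01)
```

In this toy case the phoneme during which the probability rises receives a much higher score than the one during which it stays flat, which is the behavior the measure is meant to capture.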
We can see that distinctive phonemes include those that are unique to a given language, such as German x, English dʒ or French ʒ, but also phonemes that have equivalents in all languages, such as German and French d, or French l. These might have different phonetic realizations or occur in different contexts between languages.
In the following, I plot distinctiveness of words, defined analogously to phonemes as average steepness of the language probability curve. It’s a busy graph, but you can click on it to open a larger, interactive version.
This measure likely highlights short words that can be recognized easily and with little ambiguity between languages, and that can thus act as strong language cues.