This repository contains two notebooks for working with the Buckeye corpus. In particular, the code here
- aligns transcriptions with Unicode IPA symbols.
- produces representations of the Buckeye corpus data convenient for use with a language model (specifically one trained on the Fisher corpus) and for predicting reduction.
The goal of the notebook `Converting Buckeye Transcriptions to Unicode IPA symbols` is to develop and document code
- to access Buckeye data where all segment symbols have been converted to IPA representations.
- to identify and fix (some) annotation anomalies in the Buckeye data.
Note! The alignment and 'anomaly' fixes documented here are based on
- what limited documentation I have found about the process of corpus generation.
- my intended use cases.
Additional documentation, more intense scrutiny of the data (e.g. of spectrograms), and different applications could very plausibly motivate different decisions. The point of this notebook is to document my exploration of the corpus and my alignment/anomaly patch decisions.
Critical:
- Scott Seyfarth's wonderful `buckeye` package (see https://github.com/scjs/buckeye)
- your own local copy of the Buckeye corpus
Convenient, but non-essential and relatively easy to replace:
- `more_itertools`
- `funcy`

For some analysis/plotting; convenient, but non-essential and relatively easy to replace with your preferred libraries:
- `pandas`
Given the size of the Buckeye data (it's a 2.5GB corpus - not just a lexicon in a single .csv file) and the fact that there's already a nice existing Python package for interfacing and interacting with the corpus, I'm not going to make an IPA'd copy of the Buckeye data, just some basic functions for converting representations.
The code at the end of the notebook is the main outcome of this notebook.
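As a rough illustration of what a "basic function for converting representations" could look like, here is a minimal sketch. The mapping below is a small, illustrative subset of symbol correspondences, and the names `BUCKEYE_TO_IPA` and `buckeye_to_ipa` are hypothetical; the notebook's actual mapping and function names may differ.

```python
# Illustrative subset only; the notebook's full mapping is not reproduced here.
# The names BUCKEYE_TO_IPA and buckeye_to_ipa are hypothetical.
BUCKEYE_TO_IPA = {
    "aa": "ɑ", "ae": "æ", "iy": "i", "ih": "ɪ",
    "sh": "ʃ", "th": "θ", "dh": "ð", "ng": "ŋ",
    "ch": "tʃ", "jh": "dʒ", "dx": "ɾ", "tq": "ʔ",
    "k": "k", "t": "t", "s": "s",
}


def buckeye_to_ipa(symbols, mapping=BUCKEYE_TO_IPA):
    """Convert a sequence of Buckeye segment symbols to IPA symbols.

    An unmapped symbol raises KeyError, so unexpected or anomalous
    symbols surface immediately instead of passing through silently.
    """
    return [mapping[symbol] for symbol in symbols]


print(buckeye_to_ipa(["k", "ae", "t"]))
```

Failing loudly on unknown symbols is a deliberate choice in this sketch: it makes it easier to spot the kind of annotation anomalies the notebook sets out to find.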
The main goal of the notebook `Preprocessing Buckeye corpus transcriptions for ease of processing and use with kenlm` is to produce (and document the production of) a representation of Buckeye corpus data whose vocabulary has been normalized with respect to the Fisher corpus and where utterance segmentation has been performed. The motivation for doing this is to apply a language model trained on (a slightly processed version of) the Fisher corpus to Buckeye. The second goal of the notebook is to create annotated relations describing each of the utterances and wordform tokens in the Buckeye corpus for the purpose of predicting reduction.
To that end,
- A smattering of orthographic wordforms and transcriptions are aligned and/or corrected per the other notebook in this repository (`Converting Buckeye Transcriptions to Unicode IPA symbols`).
- Utterances are segmented per Seyfarth (2014) / his `buckeye` package.
- Non-speech noises (e.g. `[laughter]` or `[silence]`) are removed from utterances.
- All orthographic characters are lower-cased.
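The last two steps above (removing non-speech noises and lower-casing) can be sketched as follows. This is a minimal illustration, not the notebook's code: it assumes an utterance is already available as a list of orthographic wordform strings and that non-speech noises are exactly the tokens wrapped in square brackets, as in the examples above. The name `clean_utterance` is hypothetical.

```python
import re

# Assumption: a non-speech noise is any token fully wrapped in square
# brackets, e.g. "[laughter]" or "[silence]".
NOISE_TOKEN = re.compile(r"^\[.*\]$")


def clean_utterance(wordforms):
    """Drop bracketed noise tokens, lower-case the rest, join with spaces."""
    kept = [w.lower() for w in wordforms if not NOISE_TOKEN.match(w)]
    return " ".join(kept)


print(clean_utterance(["Okay", "[laughter]", "I", "KNOW", "[silence]"]))
```

Joining with single spaces gives one utterance per string, which is the shape a one-utterance-per-line text file for a language model would need.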
Critical:
- A local copy of the Buckeye corpus data.
- The `buckeye` Python package.
Less important:
- `funcy`
- `pandas`
- `plotnine`
If run successfully, this notebook will create the following files as outputs:
1. A .json file containing a list of objects (Python dictionaries), where each object is a finitary relation describing an utterance (and associated metadata) in the Buckeye corpus.
2. A .txt file containing one utterance from Buckeye per line, suitable for use with a language model.
3. A .txt file containing the vocabulary (one wordform per line) of the previous file.
4. A .json file containing a list of objects (Python dictionaries), where each object is a finitary relation describing a wordform token (and associated metadata) in the Buckeye corpus.
5. A version of #4 for just those wordforms that meet certain exclusion criteria ('target' wordforms).
6. A version of #3 for just those wordforms that FIXME
7. Text files containing the {1,2,3,4} {preceding, following} wordforms of each 'target' wordform, plus a JSON file containing the full bidirectional context of each 'target' wordform.
8. A TSV file relating each non-disfluent, un-interrupted orthographic wordform in the corpus with a non-unk and non-empty-string processed orthographic representation to its unique phonemic transcription.
9. A TSV file relating each non-disfluent, un-interrupted orthographic wordform in the corpus with a non-unk and non-empty-string processed orthographic representation to its phonetic transcriptions.
10. A JSON file containing information (including speech rate statistics) about speakers in the corpus.
11. A JSON file containing information (including duration statistics) about word types in the corpus.
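To make the preceding/following context files described above more concrete, here is a minimal sketch of extracting the n wordforms on either side of a target token. It assumes the utterance is a list of wordform strings and that contexts are simply truncated at utterance edges; whether the notebook pads, truncates, or crosses utterance boundaries is not stated here. The name `context_window` is hypothetical.

```python
def context_window(wordforms, index, n):
    """Return (preceding, following): up to n wordforms on each side of
    the token at position `index`.

    Assumption: contexts are truncated at the utterance edges, with no
    padding symbols added.
    """
    preceding = wordforms[max(0, index - n):index]
    following = wordforms[index + 1:index + 1 + n]
    return preceding, following


utterance = ["i", "do", "not", "know", "that", "one"]
for n in (1, 2, 3, 4):
    print(n, context_window(utterance, 3, n))
```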