buckeye-lm

This repository contains two notebooks for working with the Buckeye corpus. In particular, the code here

aligns transcriptions with Unicode IPA symbols.
produces representations of the Buckeye corpus data convenient for use with a language model (specifically one trained on the Fisher corpus) and for predicting reduction.

IPA alignment

The goal of the notebook Converting Buckeye Transcriptions to Unicode IPA symbols is to develop/document code

to be able to access Buckeye data where all segment symbols have been converted to use IPA representations.
to identify and fix (some) annotation anomalies in the Buckeye data.

Note! The alignment and 'anomaly' fixes documented here are based on

what limited documentation I have found about the process of corpus generation.
my intended use cases.

Additional documentation, more intense scrutiny of the data (e.g. of spectrograms), and different applications could very plausibly motivate different decisions. The point of this notebook is to document my exploration of the corpus and my alignment/anomaly patch decisions.
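As a rough illustration of what symbol conversion involves, here is a minimal sketch. The mapping table is a small illustrative subset chosen for this example, not the notebook's actual table, and `BUCKEYE_TO_IPA` and `to_ipa` are hypothetical names. Labels missing from the table are passed through and reported, so that possible annotation anomalies can be inspected instead of being silently dropped.

```python
# Illustrative subset of Buckeye segment labels and plausible IPA counterparts.
# Assumption: this is NOT the notebook's full mapping, only an example table.
BUCKEYE_TO_IPA = {
    "aa": "ɑ", "ae": "æ", "ah": "ʌ", "iy": "i", "ih": "ɪ",
    "sh": "ʃ", "th": "θ", "dh": "ð", "ng": "ŋ", "dx": "ɾ",
    "k": "k", "t": "t", "s": "s",
}


def to_ipa(segments, mapping=BUCKEYE_TO_IPA):
    """Convert a sequence of segment labels to IPA symbols.

    Returns a pair (ipa, unmapped): `ipa` has one entry per input label,
    and `unmapped` lists the labels that were not found in `mapping`
    (these are copied into `ipa` unchanged).
    """
    ipa, unmapped = [], []
    for seg in segments:
        if seg in mapping:
            ipa.append(mapping[seg])
        else:
            ipa.append(seg)       # keep the original label in place
            unmapped.append(seg)  # and flag it for manual inspection
    return ipa, unmapped
```

For example, `to_ipa(["k", "ae", "t"])` converts every label, while an unexpected label comes back unchanged and is also listed in the second element of the result.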

Dependencies

Critical:

Scott Seyfarth's wonderful buckeye package (see https://github.com/scjs/buckeye)
your own local copy of the Buckeye corpus

Convenient, but non-essential and relatively easy to replace:

more_itertools
funcy

For some analysis/plotting; convenient, but non-essential and relatively easy to replace with your preferred libraries:

pandas

Results / outputs?

Given the size of the Buckeye data (it's a 2.5 GB corpus, not just a lexicon in a single .csv file) and the fact that a nice Python package for interfacing with the corpus already exists, I'm not going to make an IPA'd copy of the Buckeye data. Instead, the notebook provides some basic functions for converting representations.

The code at the end of the notebook is the main outcome of this notebook.

Language-model friendly and relational representations of corpus data

The main goal of the notebook Preprocessing Buckeye corpus transcriptions for ease of processing and use with kenlm is to produce (and document the production of) a representation of Buckeye corpus data whose vocabulary has been normalized with respect to the Fisher corpus and where utterance segmentation has been performed. The motivation is to apply a language model trained on (a slightly processed version of) the Fisher corpus to Buckeye. The second goal of the notebook is to create annotated relations describing each of the utterances and wordform tokens in the Buckeye corpus, for the purpose of predicting reduction.
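One simple reading of "normalized with respect to the Fisher corpus" is that wordforms outside the language model's vocabulary are replaced by an unknown-word marker. The sketch below shows only that step. The marker string `<unk>`, the function name, and the toy vocabulary are assumptions made for illustration; the notebook itself is the authority on what normalization actually involves.

```python
UNK = "<unk>"  # assumed unknown-word marker; the notebook may use another


def normalize_to_vocabulary(tokens, vocabulary, unk=UNK):
    """Replace every token that is not in `vocabulary` with `unk`."""
    return [tok if tok in vocabulary else unk for tok in tokens]


# Toy stand-in for a vocabulary derived from the language model's training data.
toy_vocabulary = {"i", "know", "you"}
normalized = normalize_to_vocabulary(["i", "know", "columbus"], toy_vocabulary)
```

Using a set for the vocabulary keeps each membership check constant-time, which matters when every token in the corpus is checked.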

Processing steps

To that end,

A smattering of orthographic wordforms and transcriptions are aligned and/or corrected per the other notebook in this repository (Converting Buckeye Transcriptions to Unicode IPA symbols).
Utterances are segmented per Seyfarth (2014) / his buckeye package.
Non-speech noises (e.g. [laughter] or [silence]) are removed from utterances.
All orthographic characters are lower-cased.
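The last two steps above can be sketched as follows. This assumes that non-speech annotations appear as whole tokens wrapped in square brackets, as in the `[laughter]` and `[silence]` examples; the pattern and the function name `clean_utterance` are hypothetical and may not match how the notebook detects such tokens.

```python
import re

# Assumption: a non-speech annotation is a whole token of the form "[...]".
NON_SPEECH = re.compile(r"\[[^\]]*\]")


def clean_utterance(tokens):
    """Drop bracketed non-speech tokens and lower-case the remaining ones."""
    return [tok.lower() for tok in tokens if not NON_SPEECH.fullmatch(tok)]


cleaned = clean_utterance(["[laughter]", "Okay", "I", "[silence]", "KNOW"])
```

Because `fullmatch` is used, only tokens that consist entirely of a bracketed annotation are removed; a wordform that merely contains a bracket is kept.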

Dependencies

Critical:

A local copy of the Buckeye corpus data.
The buckeye python package.

Less important:

funcy
pandas
plotnine

Outputs

If run successfully, this notebook will create the following files as outputs:

1. A .json file containing a list of objects (Python dictionaries), where each object is a finitary relation describing an utterance (and associated metadata) in the Buckeye corpus.
2. A .txt file containing one utterance from Buckeye per line, suitable for use with a language model.
3. A .txt file containing the vocabulary (one wordform per line) of the previous file.
4. A .json file containing a list of objects (Python dictionaries), where each object is a finitary relation describing a wordform token (and associated metadata) in the Buckeye corpus.
5. A version of #4 for just those wordforms that meet certain exclusion criteria ('target' wordforms).
6. A version of #3 for just those wordforms that FIXME
7. Text files containing the {1,2,3,4} {preceding, following} wordforms of each 'target' wordform, plus a JSON file containing the full bidirectional context of each 'target' wordform.
8. A TSV file relating each non-disfluent, uninterrupted orthographic wordform in the corpus that has a non-unk, non-empty processed orthographic representation to its unique phonemic transcription.
9. A TSV file relating each non-disfluent, uninterrupted orthographic wordform in the corpus that has a non-unk, non-empty processed orthographic representation to its phonetic transcriptions.
10. A JSON file containing information (including speech rate statistics) about speakers in the corpus.
11. A JSON file containing information (including duration statistics) about word types in the corpus.
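To make the shape of a few of these outputs concrete, here is a minimal sketch of writing an utterance-per-line text file, its vocabulary file, and a JSON list of relations, plus a helper that extracts the preceding and following wordforms of a token. All file names, record fields, and function names here are hypothetical; the notebook defines the real ones.

```python
import json
import tempfile
from pathlib import Path


def write_outputs(utterances, out_dir):
    """Write toy versions of three outputs for a list of token lists."""
    out_dir = Path(out_dir)
    lm_path = out_dir / "utterances.txt"       # hypothetical file name
    vocab_path = out_dir / "vocabulary.txt"    # hypothetical file name
    json_path = out_dir / "utterances.json"    # hypothetical file name

    # One utterance per line, wordforms separated by single spaces.
    lm_path.write_text(
        "\n".join(" ".join(u) for u in utterances) + "\n", encoding="utf-8")

    # Vocabulary of the text file above, one wordform per line.
    vocab = sorted({w for u in utterances for w in u})
    vocab_path.write_text("\n".join(vocab) + "\n", encoding="utf-8")

    # One dictionary per utterance; the fields are illustrative only.
    records = [{"index": i, "wordforms": u} for i, u in enumerate(utterances)]
    json_path.write_text(
        json.dumps(records, ensure_ascii=False), encoding="utf-8")
    return lm_path, vocab_path, json_path


def context(utterance, i, n):
    """Return up to n preceding and up to n following wordforms of token i."""
    return utterance[max(0, i - n):i], utterance[i + 1:i + 1 + n]


with tempfile.TemporaryDirectory() as tmp:
    lm_path, vocab_path, json_path = write_outputs(
        [["i", "know"], ["you", "know"]], tmp)
    lm_text = lm_path.read_text(encoding="utf-8")
    vocab_text = vocab_path.read_text(encoding="utf-8")
    loaded = json.loads(json_path.read_text(encoding="utf-8"))
```

Near the edges of an utterance, `context` simply returns fewer wordforms than requested, which is one possible convention; padding with boundary markers would be another.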
