Data Statement for Sentiment Analysis Multitool

Data set name: Sentiment Analysis Multitool Lexicon and Corpus

Citation (if available): Mads Guldborg Kjeldgaard Kongsbak, Steffan Eybye Christensen, Lucas Høyberg Puvis de Chavannes, Peter Due Jensen. "Sentiment Analysis Multitool, SAM". 2019. Bachelor dissertation, IT University of Copenhagen.

Data set developer(s): Mads Guldborg Kjeldgaard Kongsbak, Steffan Eybye Christensen, Lucas Høyberg Puvis de Chavannes, Peter Due Jensen

Data statement author(s): Leon Derczynski

Others who contributed to this document:

A. CURATION RATIONALE

The goal is to develop data useful for training sentiment analysis systems that are sensitive to a broad variety of Danish, especially the Danish used on social media. Thus, the majority of the data comes from social media sources and conversation turns in online discussions.

B. LANGUAGE VARIETY/VARIETIES

BCP-47 language tag: da-DK
Language variety description: Danish from social media; may include L2

C. SPEAKER DEMOGRAPHIC

Description: People using Danish as the main language in online discussions
Age: 13+
Gender: Mixed
Race/ethnicity (according to locally appropriate categories): Danish speakers, so predominantly Danish nationals
First language(s): Mostly L1 Danish; some L2
Socioeconomic status: Mixed
Number of different speakers represented: Hundreds to thousands, though for privacy reasons author IDs aren't stored, so there's no precise count.
Presence of disordered speech: At normal population levels

D. ANNOTATOR DEMOGRAPHIC

Description: Four bachelor students at ITU Copenhagen
Age: Late 20s
Gender: Male
Race/ethnicity (according to locally appropriate categories): White European
First language(s): Danish
Training in linguistics/other relevant discipline: None formal; this is from a research project in NLP.

E. SPEECH SITUATION

Description: Comments on public news articles on social media
Time and place: 2019, mostly Facebook and Twitter
Place: Online
Modality (spoken/signed, written): Written
Scripted/edited vs. spontaneous: Spontaneous
Synchronous vs. asynchronous interaction: Asynchronous
Intended audience: Others who read comments on news stories

F. TEXT CHARACTERISTICS

Typical social media variation: a mixture of formal and informal registers, with orthographic variation, especially for effect. See "Microblog-genre noise and impact on semantic annotation accuracy".

G. RECORDING QUALITY

Verbatim comments, including Unicode characters

H. OTHER

I. PROVENANCE APPENDIX

About this template

Data Statement template based on the worksheets distributed at the 2020 LREC workshop on Data Statements, by Emily M. Bender, Batya Friedman, and Angelina McMillan-Major. Adapted to Markdown by Leon Derczynski.
