Data Statement for Sentiment Analysis Multitool

Data set name: Sentiment Analysis Multitool Lexicon and Corpus

Citation (if available): Mads Guldborg Kjeldgaard Kongsbak, Steffan Eybye Christensen, Lucas Høyberg Puvis de Chavannes, Peter Due Jensen. "Sentiment Analysis Multitool, SAM". 2019. Bachelor dissertation, IT University of Copenhagen.

Data set developer(s): Mads Guldborg Kjeldgaard Kongsbak, Steffan Eybye Christensen, Lucas Høyberg Puvis de Chavannes, Peter Due Jensen

Data statement author(s): Leon Derczynski

Others who contributed to this document:

A. CURATION RATIONALE

The goal is to develop data useful for training sentiment analysis systems that are sensitive to a broad range of Danish, especially Danish as used on social media. The majority of the data therefore comes from social media sources and conversation turns in online discussions.

B. LANGUAGE VARIETY/VARIETIES

  • BCP-47 language tag: da-DK
  • Language variety description: Danish from social media; may include L2 Danish

C. SPEAKER DEMOGRAPHIC

  • Description: People using Danish as the main language in online discussions
  • Age: 13+
  • Gender: Mixed
  • Race/ethnicity (according to locally appropriate categories): Danish speakers, so predominantly Danish nationals
  • First language(s): Mostly L1 Danish; some L2
  • Socioeconomic status: Mixed
  • Number of different speakers represented: Hundreds to thousands, though for privacy reasons author IDs aren't stored, so there's no precise count.
  • Presence of disordered speech: At normal population levels

D. ANNOTATOR DEMOGRAPHIC

  • Description: Four bachelor students at ITU Copenhagen
  • Age: Late 20s
  • Gender: Male
  • Race/ethnicity (according to locally appropriate categories): White European
  • First language(s): Danish
  • Training in linguistics/other relevant discipline: None formal; this is from a research project in NLP.

E. SPEECH SITUATION

  • Description: Comments on public news articles on social media
  • Time and place: 2019, mostly Facebook and Twitter
  • Place: Online
  • Modality (spoken/signed, written): Written
  • Scripted/edited vs. spontaneous: Spontaneous
  • Synchronous vs. asynchronous interaction: Asynchronous
  • Intended audience: Others who read comments on news stories

F. TEXT CHARACTERISTICS

Typical social media variance: a mixture of formal and informal register, with orthographic variety, especially for effect. See "Microblog-genre noise and impact on semantic annotation accuracy".

G. RECORDING QUALITY

Verbatim comments, including Unicode

H. OTHER

I. PROVENANCE APPENDIX

About this template

Data Statement template based on the worksheets distributed at the 2020 LREC workshop on Data Statements, by Emily M. Bender, Batya Friedman, and Angelina McMillan-Major. Adapted to Markdown by Leon Derczynski.