Data set name: Sentiment Analysis Multitool Lexicon and Corpus
Citation (if available): Mads Guldborg Kjeldgaard Kongsbak, Steffan Eybye Christensen, Lucas Høyberg Puvis de Chavannes, Peter Due Jensen. "Sentiment Analysis Multitool, SAM". 2019. Bachelor dissertation, IT University of Copenhagen.
Data set developer(s): Mads Guldborg Kjeldgaard Kongsbak, Steffan Eybye Christensen, Lucas Høyberg Puvis de Chavannes, Peter Due Jensen
Data statement author(s): Leon Derczynski
Others who contributed to this document:
The goal is to develop data useful for training sentiment analysis systems that are sensitive to a broad variety of Danish, especially the Danish used on social media. Thus, the majority of the data comes from social media sources and conversation turns in online discussions.
- BCP-47 language tag: da-DK
- Language variety description: Danish from social media; may include L2
- Description: People using Danish as the main language in online discussions
- Age: 13+
- Gender: Mixed
- Race/ethnicity (according to locally appropriate categories): Danish speakers, so predominantly Danish nationals
- First language(s): Mostly L1 Danish; some L2
- Socioeconomic status: Mixed
- Number of different speakers represented: Hundreds to thousands, though for privacy reasons author IDs aren't stored, so there's no precise count.
- Presence of disordered speech: At normal population levels
- Description: Four bachelor students at ITU Copenhagen
- Age: Late 20s
- Gender: Male
- Race/ethnicity (according to locally appropriate categories): White European
- First language(s): Danish
- Training in linguistics/other relevant discipline: None formal; this is from a research project in NLP.
- Description: Comments on public news articles on social media
- Time and place: 2019, mostly Facebook and Twitter
- Place: Online
- Modality (spoken/signed, written): Written
- Scripted/edited vs. spontaneous: Spontaneous
- Synchronous vs. asynchronous interaction: Asynchronous
- Intended audience: Others who read comments on news stories
Typical social media variance; mixture of formal and informal register, orthographic variety especially for effect. See "Microblog-genre noise and impact on semantic annotation accuracy".
Verbatim comments, including Unicode
Data Statement template based on the worksheets distributed at the 2020 LREC workshop on Data Statements, by Emily M. Bender, Batya Friedman, and Angelina McMillan-Major. Adapted to Markdown by Leon Derczynski.