UD-WALS-Linguistic-Patterns

Overview

This repository was created as part of a university project for the Digital Linguistics Project course at the Faculty of Humanities and Social Sciences, University of Zagreb. The project documents and explores linguistic patterns using Universal Dependencies (UD) and the World Atlas of Language Structures (WALS).

The project was divided into two phases:

Phase 1: Contribution to a broader Slovenian linguistic project, involving the creation of a table mapping WALS features to UD queries.
Phase 2: An independent analysis focusing on word order patterns in Slovenian, comparing written and spoken corpora.

The repository serves as a transparent record of the work, including data processing scripts, analysis results, and relevant documentation.

Repository Structure

UD-WALS-Linguistic-Patterns
├── app/
│   ├── app.py                            # Streamlit app
├── data/                                 # Data files
│   ├── extracted/                        # Processed datasets
│   ├── features/                         # Feature data and visualizations
│   ├── results/                          # Analysis outputs
│   ├── src/                              # Raw CoNLL-U files
│
├── docs/                                 # Documentation
│   ├── reports/                          # Reports and drafts
│   ├── project_proposal.pdf              # Project proposal
│   ├── projekt_zg.v1.pdf                 # Draft paper
│   ├── qualitative_analysis_proposal.md  # Qualitative analysis proposal
│
├── scripts/                              # Python scripts
│   ├── 1_compare_features.py             # Compare features
│   ├── 2_fix_and_validate_conllu.py      # Fix and validate CoNLL-U files
│   ├── 3_remove_punct_conllu.py          # Remove punctuation
│   ├── 4_clean_stark_word_order.py       # Process word order
│   ├── 5_combine_both_processed.py       # Merge datasets
│   ├── 6_analyze_processed.py            # Analyze corpora
│
├── requirements.txt                      # Dependencies
├── README.md                             # Repository guide
└── LICENSE.txt                           # License
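
The tree above lists a punctuation-removal step for CoNLL-U files. The following is a minimal sketch of what such a step might involve, not the code in scripts/3_remove_punct_conllu.py. The function name remove_punct and the sample sentence are invented for illustration. The sketch assumes plain integer token IDs (no multiword ranges or empty nodes) and that punctuation tokens have no dependents, so dropping them only requires renumbering IDs and remapping the HEAD column.

```python
def remove_punct(token_lines):
    """Drop PUNCT tokens from one CoNLL-U sentence and renumber the rest.

    Assumption: IDs are plain integers and no kept token has a
    punctuation token as its head.
    """
    rows = [line.split("\t") for line in token_lines]
    kept = [row for row in rows if row[3] != "PUNCT"]  # column 4 is UPOS

    # Map old IDs to new consecutive IDs; the artificial root stays 0.
    new_id = {row[0]: str(i) for i, row in enumerate(kept, start=1)}
    new_id["0"] = "0"

    cleaned = []
    for row in kept:
        row = list(row)
        row[6] = new_id[row[6]]  # remap HEAD using the old head ID
        row[0] = new_id[row[0]]  # then renumber the token itself
        cleaned.append("\t".join(row))
    return cleaned


# Invented example sentence: "Seveda, Ana bere."
sentence = [
    "1\tSeveda\tseveda\tADV\t_\t_\t4\tadvmod\t_\t_",
    "2\t,\t,\tPUNCT\t_\t_\t1\tpunct\t_\t_",
    "3\tAna\tAna\tPROPN\t_\t_\t4\tnsubj\t_\t_",
    "4\tbere\tbrati\tVERB\t_\t_\t0\troot\t_\t_",
    "5\t.\t.\tPUNCT\t_\t_\t4\tpunct\t_\t_",
]

cleaned = remove_punct(sentence)
for line in cleaned:
    print(line)
```

After cleaning, the three remaining tokens are numbered 1 to 3 and the two dependents of the verb point to its new ID. Sentence-level comment lines would need to be passed through separately, since the sketch handles token lines only.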

Results: Word Order Analysis

The quantitative study conducted as part of this project reveals key differences in word order patterns between spoken and written Slovenian:

Written Corpus (SSJ):
- Strong preference for SVO (Subject-Verb-Object), reflecting the structured syntax typical of written language and aligning with the WALS value for Slovenian.
Spoken Corpus (SST):
- Greater variation, with word orders such as SOV, OSV, and OVS appearing more often.
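
To make the word order labels concrete, here is a rough sketch of how a clause could be assigned a label such as SVO from its UD annotation. It is an illustration under stated assumptions, not the pipeline used in this project (which relied on STARK and the scripts listed above). The helper names parse_conllu and word_order and the two sample sentences are invented. The sketch only looks at the root verb and its first nsubj and obj dependents, and it ignores subordinate clauses, multiword tokens, and empty nodes.

```python
from collections import Counter

# Two invented Slovenian sentences in CoNLL-U format (illustration only).
SAMPLE = (
    "1\tAna\tAna\tPROPN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tbere\tbrati\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tknjigo\tknjiga\tNOUN\t_\t_\t2\tobj\t_\t_\n"
    "\n"
    "1\tKnjigo\tknjiga\tNOUN\t_\t_\t2\tobj\t_\t_\n"
    "2\tbere\tbrati\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tAna\tAna\tPROPN\t_\t_\t2\tnsubj\t_\t_\n"
)


def parse_conllu(text):
    """Yield each sentence as a list of (id, upos, head, deprel) tuples."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():          # blank line ends a sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):      # skip sentence-level comments
            continue
        cols = line.split("\t")
        if not cols[0].isdigit():     # skip multiword ranges and empty nodes
            continue
        sentence.append((int(cols[0]), cols[3], int(cols[6]), cols[7]))
    if sentence:
        yield sentence


def word_order(sentence):
    """Return a label such as 'SVO' for the root clause, or None."""
    root = next((t for t in sentence if t[3] == "root" and t[1] == "VERB"), None)
    if root is None:
        return None
    positions = {"V": root[0]}
    for tok_id, _, head, deprel in sentence:
        if head == root[0] and deprel == "nsubj":
            positions.setdefault("S", tok_id)
        elif head == root[0] and deprel == "obj":
            positions.setdefault("O", tok_id)
    if len(positions) < 3:            # subject or object is missing
        return None
    # Order the letters S, V, O by their linear position in the sentence.
    return "".join(sorted(positions, key=positions.get))


counts = Counter(order for order in map(word_order, parse_conllu(SAMPLE)) if order)
print(counts)
```

On the two sample sentences the counter records SVO once and OVS once. Running the same counting over a written and a spoken treebank would give the kind of per-corpus distribution summarized above.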

Interpretation

Written language prioritizes unmarked SVO for clarity and consistency.
Spoken language is more flexible, using varied word orders to emphasize topics or structure information.
These findings highlight the adaptability of spoken syntax and the influence of pragmatics on word order. They also point to a need to revisit and update outdated WALS feature values.

Outputs

Visualizations and analysis results are available in data/results/ and in the paper draft located at docs/projekt_zg.v1.pdf.

License

This project is licensed under the Apache License 2.0.

Key Terms:

Permissions: This license allows use, modification, and distribution of the code, provided that recipients receive a copy of the license and that existing copyright, patent, trademark, and attribution notices are retained.
Limitations: The code is provided "as is," without warranty of any kind, express or implied, and without liability for any claims or damages arising from its use.
Attribution: If you modify and distribute this code, you must include a prominent notice stating that you modified the files.

For more details, please refer to the full license text.

Credits

Nives Hüll

Acknowledgments

Special thanks to Kaja Dobrovoljc for the opportunity to contribute to their project, which is part of the broader Gravitacija Project, a Slovenian initiative providing valuable resources and inspiration for research in syntactic typology and Universal Dependencies. I also thank Luka Terčon for his support and guidance during the project.
Thanks also to Petra Bago, professor of the Digital Linguistics Project course, for the encouragement and support throughout the project.

Resources

Libraries:
- BeautifulSoup for web scraping and parsing HTML.
- Matplotlib for visualizations.
- Pandas for data manipulation and analysis.
- Seaborn for statistical data visualization.
- NumPy for numerical computing.
- SciPy for scientific computing.
- Conllu for processing CoNLL-U files.
- Scikit-learn for machine learning utilities.
- Streamlit for interactive web applications.

Tools

ChatGPT for coding, debugging, and writing assistance.
GitHub for version control and collaboration.
Grew-match for syntactic structure identification in corpora.
Python for data processing and analysis.
Q-CAT for syntactic categorization.
STARK for linguistic corpus processing.
