This repository was created as part of a university project for the Digital Linguistics Project course at the Faculty of Social Sciences and Humanities, University of Zagreb. The project aims to document and explore linguistic patterns using Universal Dependencies (UD) and the World Atlas of Language Structures (WALS).
The project was divided into two phases:
- Phase 1: Contribution to a broader Slovenian linguistic project, involving the creation of a table mapping WALS features to UD queries.
- Phase 2: An independent analysis focusing on word order patterns in Slovenian, comparing written and spoken corpora.
The repository serves as a transparent record of the work, including data processing scripts, analysis results, and relevant documentation.
- Python: This project requires Python 3.7 or later.
- Libraries: The following Python libraries are needed:
  - matplotlib
  - pandas
  - seaborn
  - numpy
  - scipy
  - conllu
  - scikit-learn
  - streamlit

Install the required libraries using the requirements.txt file provided in the repository.
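Before running the scripts, it can help to confirm that the interpreter and the listed libraries are available. The following is a minimal sketch and not part of the repository: the helper names `python_is_supported` and `find_missing` are hypothetical, and the module list assumes the usual import names (scikit-learn is imported as `sklearn`).

```python
import importlib.util
import sys

# Import names for the libraries listed in this README.
# Assumption: scikit-learn is imported as "sklearn"; the other
# import names match their package names.
REQUIRED_MODULES = [
    "matplotlib", "pandas", "seaborn", "numpy",
    "scipy", "conllu", "sklearn", "streamlit",
]


def python_is_supported(version_info=sys.version_info, minimum=(3, 7)):
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


def find_missing(modules=REQUIRED_MODULES):
    """Return the module names that cannot be located in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    if not python_is_supported():
        print("Python 3.7 or later is required.")
    missing = find_missing()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

If any modules are reported missing, installing from requirements.txt should resolve them.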
- Operating System: Windows 11
- Version: 10.0.22631.4317 (Windows 11)
- Windows Subsystem for Linux (WSL): Enabled
- WSL Distribution: Ubuntu
- WSL Version: 2
To clone this repository, run the following command:
git clone https://github.com/UD-WALS-Linguistic-Patterns.git
To install the required dependencies, navigate to the project directory and run:
pip install -r requirements.txt
To run the scripts, navigate to the project directory and run:
python scripts/[script_name].py
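The script file names carry numeric prefixes (1_ to 6_). Assuming those prefixes reflect the intended processing order, which this README does not state explicitly, the commands could be listed in sequence with a small helper like the one below. The function names are hypothetical, and the sketch only prints the commands rather than executing them.

```python
import re
from pathlib import Path


def numeric_prefix(name):
    """Return the leading integer of a name like '3_remove_punct_conllu.py'."""
    match = re.match(r"(\d+)_", name)
    # Files without a numeric prefix sort after all numbered ones.
    return int(match.group(1)) if match else float("inf")


def ordered_scripts(names):
    """Return the .py file names sorted by numeric prefix, then by name."""
    python_files = [n for n in names if n.endswith(".py")]
    return sorted(python_files, key=lambda n: (numeric_prefix(n), n))


if __name__ == "__main__":
    scripts_dir = Path("scripts")
    if scripts_dir.is_dir():
        for name in ordered_scripts(p.name for p in scripts_dir.iterdir()):
            print(f"python scripts/{name}")
```

Sorting on the parsed integer rather than the raw string keeps a two-digit prefix from sorting ahead of a single-digit one.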
UD-WALS-Linguistic-Patterns
├── app/
│ ├── app.py # Streamlit app
├── data/ # Data files
│ ├── extracted/ # Processed datasets
│ ├── features/ # Feature data and visualizations
│ ├── results/ # Analysis outputs
│ ├── src/ # Raw CoNLL-U files
│
├── docs/ # Documentation
│ ├── reports/ # Reports and drafts
│ ├── project_proposal.pdf # Project proposal
│ ├── projekt_zg.v1.pdf # Draft paper
│ ├── qualitative_analysis_proposal.md # Qualitative analysis proposal
│
├── scripts/ # Python scripts
│ ├── 1_compare_features.py # Compare features
│ ├── 2_fix_and_validate_conllu.py # Fix and validate CoNLL-U files
│ ├── 3_remove_punct_conllu.py # Remove punctuation
│ ├── 4_clean_stark_word_order.py # Process word order
│ ├── 5_combine_both_processed.py # Merge datasets
│ ├── 6_analyze_processed.py # Analyze corpora
│
├── requirements.txt # Dependencies
├── README.md # Repository guide
└── LICENSE.txt # License
The quantitative study conducted as part of this project reveals key differences in word order patterns between spoken and written Slovenian:
- Written Corpus (SSJ):
  - Strong preference for SVO (Subject-Verb-Object), reflecting the structured syntax typical of written language and aligning with the WALS value for Slovenian.
- Spoken Corpus (SST):
  - Greater variation, with word orders such as SOV, OSV, and OVS appearing more often than in the written corpus.
- Written language prioritizes unmarked SVO for clarity and consistency.
- Spoken language is more flexible, using varied word orders to emphasize topics or structure information.
- These findings highlight the adaptability of spoken syntax and the influence of pragmatics on word order, emphasizing the need to revisit and update outdated WALS feature values.
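To make the word order labels above concrete, here is a minimal sketch of how a clause could be classified from CoNLL-U annotation using only the standard library. It is an illustration under stated assumptions, not the method used by the scripts in this repository: it considers only verbs tagged VERB that have exactly one nsubj and one obj dependent, and it skips multiword-token ranges and empty nodes. The function names and the two toy sentences are hypothetical and are not taken from the SSJ or SST corpora.

```python
from collections import Counter


def parse_sentences(text):
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        # Keep only plain tokens: integer ID, integer HEAD, 10 columns.
        if len(cols) != 10 or not cols[0].isdigit() or not cols[6].isdigit():
            continue
        current.append({"id": int(cols[0]), "upos": cols[3],
                        "head": int(cols[6]), "deprel": cols[7]})
    if current:
        sentences.append(current)
    return sentences


def word_orders(sentence):
    """Return a label such as 'SVO' for each verb with one nsubj and one obj."""
    labels = []
    for verb in sentence:
        if verb["upos"] != "VERB":
            continue
        deps = [t for t in sentence if t["head"] == verb["id"]]
        subjects = [t for t in deps if t["deprel"] == "nsubj"]
        objects = [t for t in deps if t["deprel"] == "obj"]
        if len(subjects) != 1 or len(objects) != 1:
            continue
        # Order the three elements by their linear position in the sentence.
        parts = sorted([(subjects[0]["id"], "S"), (verb["id"], "V"),
                        (objects[0]["id"], "O")])
        labels.append("".join(label for _, label in parts))
    return labels


def count_word_orders(text):
    """Count word order labels across all sentences in a CoNLL-U string."""
    return Counter(label for s in parse_sentences(text) for label in word_orders(s))


def make_row(idx, form, upos, head, deprel):
    """Build one tab-separated CoNLL-U row; unused columns are left as '_'."""
    return "\t".join([str(idx), form, "_", upos, "_", "_", str(head), deprel, "_", "_"])


# Toy input (hypothetical): "Janez bere knjigo" and "Knjigo bere Janez".
SAMPLE = "\n".join([
    "# text = Janez bere knjigo",
    make_row(1, "Janez", "PROPN", 2, "nsubj"),
    make_row(2, "bere", "VERB", 0, "root"),
    make_row(3, "knjigo", "NOUN", 2, "obj"),
    "",
    "# text = Knjigo bere Janez",
    make_row(1, "Knjigo", "NOUN", 2, "obj"),
    make_row(2, "bere", "VERB", 0, "root"),
    make_row(3, "Janez", "PROPN", 2, "nsubj"),
    "",
])

if __name__ == "__main__":
    print(count_word_orders(SAMPLE))
```

On the toy input, the first sentence is labeled SVO and the second OVS. Running the same counting over a written and a spoken treebank would give the kind of distribution the findings above compare, although the actual analysis may apply different filters.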
Visualizations and analysis results are available in data/results/ and in the paper draft located at docs/paper_v1.pdf.
This project is licensed under the Apache License 2.0.
- Permissions: This license allows for the use, modification, and distribution of the code, provided that all copies or substantial portions of the code include the original license.
- Limitations: The code is provided "as is," without warranty of any kind, express or implied, and without liability for any claims or damages arising from its use.
- Attribution: If you modify and distribute this code, you must include a prominent notice stating that you modified the files.
For more details, please refer to the full license text.
- Special thanks to Kaja Dobrovoljc for the opportunity to contribute to their project, which is part of the broader Gravitacija Project, a Slovenian initiative providing valuable resources and inspiration for research in syntactic typology and Universal Dependencies. I also thank Luka Terčon for the support and guidance during the project.
- Thanks also to Petra Bago, professor of the Digital Linguistics Project course, for the encouragement and support throughout the project.
- Libraries:
  - BeautifulSoup for web scraping and parsing HTML.
  - Matplotlib for visualizations.
  - Pandas for data manipulation and analysis.
  - Seaborn for statistical data visualization.
  - NumPy for numerical computing.
  - SciPy for scientific computing.
  - Conllu for processing CoNLL-U files.
  - Scikit-learn for machine learning utilities.
  - Streamlit for interactive web applications.