bre vs por stopwords #332

Closed
laygir opened this issue Dec 28, 2024 · 4 comments
Comments

laygir commented Dec 28, 2024

Hello,

Could you confirm whether bre is actually the stopword list for Breton, or whether it has been mixed up with Portuguese (por)?

https://github.com/fergiemcdowall/stopword/blob/main/src/stopwords_bre.js
https://github.com/fergiemcdowall/stopword/blob/main/src/stopwords_por.js

Collaborator

eklem commented Dec 29, 2024

Hi, I'm not sure. It's taken from the stopwords-json repository. I checked the issues there, and found this comment:

The Breton list you have (br.json), is actually Brazilian Portuguese.

And I can see that it removes only a few words from the Breton text in the tests. So it seems we need to add another list or generate one. Any suggestions, @laygir?
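One rough way to quantify that observation is to measure what fraction of a text's tokens a given list removes. The sketch below is self-contained and does not call the stopword package; `removalRate`, the short word list, and both token arrays are hypothetical stand-ins chosen for illustration, not the repository's real data or tests, and the Breton sample may not be idiomatic.

```javascript
// Hypothetical helper: fraction of tokens that appear in a stopword list.
function removalRate(tokens, stopwords) {
  if (tokens.length === 0) return 0;
  const stopSet = new Set(stopwords.map((w) => w.toLowerCase()));
  const removed = tokens.filter((t) => stopSet.has(t.toLowerCase())).length;
  return removed / tokens.length;
}

// Tiny illustrative samples (assumptions, not the lists shipped in the repo).
const samplePortugueseList = ['a', 'de', 'e', 'o', 'que', 'não', 'um'];
const bretonTokens = ['me', 'a', 'zo', 'o', 'vont', "d'ar", 'gêr'];
const portugueseTokens = ['o', 'homem', 'que', 'não', 'sabe', 'de', 'nada'];

const bretonRate = removalRate(bretonTokens, samplePortugueseList);
const portugueseRate = removalRate(portugueseTokens, samplePortugueseList);

console.log('Breton sample removal rate:', bretonRate.toFixed(2));
console.log('Portuguese sample removal rate:', portugueseRate.toFixed(2));
```

A list that really matches the language should remove a much larger share of tokens than a mismatched one, which only catches short words that happen to exist in both languages.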

Author

laygir commented Dec 29, 2024

Hey @eklem

Thank you for looking into it. The issue you mentioned confirms it.
Perhaps the words removed in the tests are just a coincidental overlap between the two languages.

I'm no language expert, by the way, but since bre is actually Brazilian Portuguese in this case, I'd do one or more of the following:

  • Rename and document it as Brazilian Portuguese. I'm not sure whether you strictly support language codes only, or whether a language-locale code (pt-BR) is also acceptable.
  • If the above wouldn't work, merge it with Portuguese after analyzing both lists and removing any stopwords that are strictly pt-BR.
  • Create a new Breton list with correct stopwords. All I could do was ask GPT to check the existing list and provide Breton stopwords. Here is the conversation.
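The "analyzing both lists" step in the second option could be sketched as a plain set comparison. This is a minimal illustration under stated assumptions: `compareLists` is a hypothetical helper, and the two short arrays are made-up samples standing in for the mislabelled bre list and the por list, not their actual contents.

```javascript
// Hypothetical helper: split two word lists into shared and unique entries.
function compareLists(listA, listB) {
  const setA = new Set(listA);
  const setB = new Set(listB);
  return {
    shared: [...setA].filter((w) => setB.has(w)),
    onlyInA: [...setA].filter((w) => !setB.has(w)),
    onlyInB: [...setB].filter((w) => !setA.has(w)),
  };
}

// Made-up samples (assumptions), standing in for the real lists.
const mislabelledBreSample = ['a', 'de', 'o', 'que', 'você'];
const porSample = ['a', 'de', 'o', 'que', 'tu'];

const comparison = compareLists(mislabelledBreSample, porSample);
console.log('shared:', comparison.shared);
console.log('only in mislabelled bre sample:', comparison.onlyInA);
console.log('only in por sample:', comparison.onlyInB);
```

Entries that appear only in the mislabelled list would be the candidates to review before any merge.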

Collaborator

eklem commented Dec 30, 2024

The bre code was added by this repository, but it seems br was misinterpreted. It hasn't been fixed in the super-repository stopwords-iso/stopwords-iso, but it has been in the actual Breton stopword repository. I'll replace our list with that one. I won't do any merge with Brazilian Portuguese, and trusting ChatGPT's actual output is maybe not so smart. But we have the output generated by Gene Diaz, so we're good to go. Thanks for opening the issue!

eklem added a commit that referenced this issue Dec 30, 2024
They seem to have been Brazilian Portugese - #332

Replaced with Gene Diaz' Breton stopword list - https://github.com/stopwords-iso/stopwords-b
Collaborator

eklem commented Dec 30, 2024

Published v3.1.4

eklem closed this as completed Dec 30, 2024