spaCy pipeline for French fiction and first-person point-of-view texts (with a focus on personal pronouns, moods, and tenses), trained mostly on novels.
| Feature | Description |
| --- | --- |
| Language | French |
| Name | fr_solipcysme |
| Version | 0.2.5 |
| spaCy | ==3.8.4 |
| Default Pipeline | jusqucy_tokenizer, commecy_normalizer, jusqucy_normalizer, pretagger_hunspell, morphologizer, viceverser_lemmatizer, parser |
| Components | jusqucy_tokenizer, jusqucy_normalizer, commecy_normalizer, morphologizer, viceverser_lemmatizer, parser |
| Vectors | 669785 keys, 6697856 unique vectors (100 dimensions) |
| Sources | Corpus narraFEATS (morphologizer), Universal Dependencies (parser), french-word-vectors (vectors) |
| License | GPL |
| Author | thjbdvlt |
pip install
import spacy

nlp = spacy.load("fr_solipcysme")
doc = nlp(
    "la MACHINE à (b)rouiller le temps s'est peuuut-etre déraillée..?"
)

for i in doc:
    print(
        i.norm_,  # commecy_normalizer / jusqucy_normalizer
        i.pos_,  # morphologizer
        i.morph,  # morphologizer
        i.lemma_,  # viceverser_lemmatizer
        i.dep_,  # parser
        i.head,  # parser
        i.sent_start,  # jusqucy_tokenizer
        i._.ttype,  # jusqucy_tokenizer
        i._.isword,  # jusqucy_tokenizer
    )

# These attributes are not especially useful on their own;
# they are mostly used to make the morphologizer more accurate.
print(
    doc._.jusqucy_ttypes,  # jusqucy_tokenizer
    doc._.hunspell_po,  # pretagger_hunspell
    doc._.hunspell_is,  # pretagger_hunspell
)
solipCysme is not only a trained pipeline but also a set of minimal pipeline components and model architectures that can be used independently.
A modified MultiHashEmbed that makes it possible to use Doc underscore attributes as features. The value of an attribute must be a list of int and must have the same length as the Doc.
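The shape constraint on attribute values can be illustrated without spaCy. The sketch below uses a hypothetical helper, `check_doc_feature`, which is not part of solipCysme; it only checks that a feature is a list of int with exactly one entry per token. The tokens and integer codes are made up for the example.

```python
def check_doc_feature(values, n_tokens):
    """Return True if `values` is a list of int with one entry per token.

    Hypothetical helper for illustration; not part of solipCysme.
    """
    if not isinstance(values, list):
        return False
    if len(values) != n_tokens:
        return False
    # bool is a subclass of int in Python, so exclude it explicitly.
    return all(isinstance(v, int) and not isinstance(v, bool) for v in values)


tokens = ["la", "MACHINE", "à", "(b)rouiller"]  # illustrative tokens
token_types = [0, 0, 0, 1]  # made-up integer codes, one per token

print(check_doc_feature(token_types, len(tokens)))  # True
print(check_doc_feature([0, 1], len(tokens)))  # False
```

A check of this kind may be useful before training, since a list that is shorter or longer than the Doc cannot be aligned with its tokens.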
A modified CharacterEmbed that makes it possible to use underscore attributes as features and that replaces nC (number of characters) with nCstart and nCend, so that one can choose an asymmetric representation of words (e.g., for French, using only the suffix, with nCstart = 0 and nCend = 6).
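To make the idea of an asymmetric window concrete, here is a minimal pure-Python sketch. The function `char_window` and its padding symbol are assumptions made for illustration; the actual architecture may work on bytes or pad differently. It keeps the first `n_start` and the last `n_end` characters of a word.

```python
def char_window(word, n_start, n_end, pad="_"):
    """Return the first n_start and last n_end characters of `word`.

    Short words are padded so the result always has n_start + n_end items.
    Hypothetical helper for illustration; not part of solipCysme.
    """
    start = list(word[:n_start]) + [pad] * max(0, n_start - len(word))
    # word[-0:] would return the whole word, so handle n_end == 0 explicitly.
    tail = list(word[-n_end:]) if n_end > 0 else []
    end = [pad] * max(0, n_end - len(word)) + tail
    return start + end


# Suffix-only representation, as in the French example (nCstart = 0, nCend = 6).
print("".join(char_window("déraillée", 0, 6)))  # aillée
print("".join(char_window("le", 0, 6)))  # ____le
```

With `n_start = 0`, only the ending of each word is represented, which is where most French inflectional information is found.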
A component that makes Hunspell morphological analysis available as features for the SolipcysmeMultiHashe or SolipcysmeCharEmbed.
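As a rough illustration of how Hunspell output could become integer features, the sketch below parses analysis strings in Hunspell's `key:value` morphological field format and maps the `po:` values to ints. The analysis strings, the helper names (`field_values`, `encode`), and the encoding scheme are all assumptions; how the actual component encodes its features is not described here.

```python
def field_values(analysis, key):
    """Collect the values of one field (e.g. 'po') from an analysis string."""
    prefix = key + ":"
    return [
        part[len(prefix):]
        for part in analysis.split()
        if part.startswith(prefix)
    ]


def encode(values_per_token, vocab):
    """Map each token's first value to an int; 0 is reserved for 'no value'."""
    ids = []
    for values in values_per_token:
        if not values:
            ids.append(0)
            continue
        # setdefault assigns the next free id the first time a value is seen.
        ids.append(vocab.setdefault(values[0], len(vocab) + 1))
    return ids


# Made-up analyses, one per token; the empty string stands for an unknown word.
analyses = ["st:machine po:nom is:fem", "st:temps po:nom is:mas", ""]
po_vocab = {}
po_ids = encode([field_values(a, "po") for a in analyses], po_vocab)
print(po_ids)  # [1, 1, 0]
```

The result is a list of int with one entry per token, which is the shape expected for underscore attributes used as features.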
- only knows about straight apostrophes (') and quotes (").
- morphologizer depends on the jusqucy_tokenizer, because this tokenizer sets a value on a doc extension (Doc._.jusqucy_ttypes) that is used by the morphologizer.
- morphologizer depends on the pretagger_hunspell component too, because the morphologizer uses the output of Hunspell as token features (po: features).
- no
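The ordering constraints on the morphologizer can be checked mechanically. The sketch below assumes the two prerequisites are `jusqucy_tokenizer` and `pretagger_hunspell`, and uses the default pipeline order from the table; `missing_prerequisites` is a hypothetical helper, not part of solipCysme.

```python
DEFAULT_PIPELINE = [
    "jusqucy_tokenizer",
    "commecy_normalizer",
    "jusqucy_normalizer",
    "pretagger_hunspell",
    "morphologizer",
    "viceverser_lemmatizer",
    "parser",
]

# Assumed prerequisites: components that must run before the morphologizer.
REQUIRED_BEFORE = {"morphologizer": ["jusqucy_tokenizer", "pretagger_hunspell"]}


def missing_prerequisites(pipeline, required_before):
    """Return (component, prerequisite) pairs that are absent or run too late."""
    position = {name: i for i, name in enumerate(pipeline)}
    problems = []
    for component, prerequisites in required_before.items():
        if component not in position:
            continue
        for prerequisite in prerequisites:
            # A missing prerequisite is treated as running after everything.
            if position.get(prerequisite, len(pipeline)) > position[component]:
                problems.append((component, prerequisite))
    return problems


print(missing_prerequisites(DEFAULT_PIPELINE, REQUIRED_BEFORE))  # []
```

A check like this may help when reusing the morphologizer in a custom pipeline, where a prerequisite is easy to drop or reorder by accident.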
This work is released under the GPL license (v3).