NLP pipeline design #21
Having a clearly defined pipeline is also useful for facilitating counting tokens in parallel (#20).
One way around it might be to return the original string, along with tokens (instead of just tokens), from tokenizers and analyzers.
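As a rough illustration of that idea (not an agreed design; the `TokenizedDoc` struct and `tokenize` function below are hypothetical), the tokenizer could hand back the source text together with tokens that borrow from it:

```rust
/// Hypothetical container returned by a tokenizer: the original text plus
/// tokens that borrow from it, so later steps (or parallel token counting)
/// can still refer back to the source string.
struct TokenizedDoc<'a> {
    text: &'a str,
    tokens: Vec<&'a str>,
}

fn tokenize(text: &str) -> TokenizedDoc<'_> {
    TokenizedDoc {
        text,
        tokens: text.split_whitespace().collect(),
    }
}

fn main() {
    let doc = tokenize("counting tokens in parallel");
    assert_eq!(doc.tokens.len(), 4);
    println!("{} -> {:?}", doc.text, doc.tokens);
}
```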
Data structure
I ran some simple benchmarks to compare the read/write performance of the candidate data structures. I ran 3 read performance tests: summing the length of tokens, determining the unique vocab, and applying the stemmer to each token. Overall, if you take the variance (+/-) in the results into consideration, there isn't much difference, which surprised me - I had assumed there would be a clearer winner. I also ran 2 write performance tests, tokenising the text and outputting each candidate structure.
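For illustration only, a minimal sketch of the first read test (summing token lengths) over two assumed candidate structures, `Vec<&str>` and `Vec<String>` - the actual benchmark code and the structures compared may differ:

```rust
use std::time::Instant;

fn main() {
    let text = "mary had a little lamb ".repeat(100_000);
    // Two assumed candidate token containers: borrowed slices vs owned strings.
    let borrowed: Vec<&str> = text.split_whitespace().collect();
    let owned: Vec<String> = text.split_whitespace().map(|s| s.to_string()).collect();

    // Read test: sum the length of tokens.
    let t = Instant::now();
    let n: usize = borrowed.iter().map(|s| s.len()).sum();
    println!("Vec<&str>:   {} bytes in {:?}", n, t.elapsed());

    let t = Instant::now();
    let n: usize = owned.iter().map(|s| s.len()).sum();
    println!("Vec<String>: {} bytes in {:?}", n, t.elapsed());
}
```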
Pipeline
I have been experimenting with different ways the pipeline components (tokenizer, stemmer, ngramer...) could be implemented. Here are my thoughts so far... Each component could have a standard transform method:
let tokens = tokenizer.transform(text);
let stemmed = stemmer.transform(tokens);
let filtered = filter.transform(stemmed).collect::<Vec<_>>();
Each component produces and consumes, where appropriate, an iterator, and the results are collected at the end of the pipeline. This reduces intermediate memory use and doesn't have a performance cost compared to raw for-loops. Here is a full dummy example (rust playground):
#[allow(dead_code)]
#[allow(unused_variables)]
#[derive(Debug)]
enum StringWrap<'a> {
Slice(&'a str),
String(String),
}
struct DummyTokenizer {
split_on: String,
}
impl DummyTokenizer {
pub fn transform<'a>(&'a self, text: &'a str) -> impl Iterator<Item = StringWrap<'a>> + 'a {
text.split(&self.split_on).map(|x| StringWrap::Slice(x))
}
}
struct DummyFilter {
word: String,
}
impl DummyFilter {
//for TokenToToken
pub fn transform<'a>(
&'a self,
tokens: impl Iterator<Item = StringWrap<'a>> + 'a,
) -> impl Iterator<Item = StringWrap<'a>> + 'a {
tokens.filter(move |x| match x {
StringWrap::Slice(s) => s.clone() != self.word,
StringWrap::String(s) => s.clone() != self.word,
})
}
}
struct DummyStemmer {}
impl DummyStemmer {
pub fn transform<'a>(
&'a self,
tokens: impl Iterator<Item = StringWrap<'a>> + 'a,
) -> impl Iterator<Item = StringWrap<'a>> + 'a {
// Outputs a StringWrap::string
tokens.map(|x| match x {
StringWrap::Slice(s) => StringWrap::String([s, "ing"].join("")),
StringWrap::String(s) => StringWrap::String([s, "ing".to_string()].join("")),
})
}
}
fn main() {
// API example
let text = "Marry had a little lamb.";
// Pipeline components
let tokenizer = DummyTokenizer {
split_on: " ".to_string(),
};
let filter = DummyFilter {
word: "lamb".to_string(),
};
let stemmer = DummyStemmer {};
// Pipeline
let output = tokenizer.transform(text);
let output = filter.transform(output);
let output = stemmer.transform(output).collect::<Vec<_>>();
println!("{:?}", output);
}
In the above, each component's transform method is chained onto the output of the previous one. In summary:
Some possible problems:
Thanks a lot for the analysis and investigation @joshlk !
+1 generally on that. The reason I didn't go with it initially, I think, was that the signature of the transform method wouldn't be exactly the same between different components. Using an enum for the token type (like StringWrap above) could help with that.
Sounds good as well. I was just thinking we might need some container object to avoid chaining the iterators manually - something along the lines of the pipeline object in spacy. It could look like,
let pipe = Pipeline::new(tokenizer, filter, stemmer);
let output = pipe.transform(text);
(full example here) but that means each component would need to be identified by some trait and designed in advance. One could then apply this e.g. on a set of documents with,
let documents = vec!["The Moon is an astronomical body orbiting Earth as its only natural satellite.", "It is the fifth-largest satellite in the Solar System", "The Moon is, after Jupiter's satellite Io, the second-densest satellite in the Solar System"];
let output = documents.iter().map(|doc| pipe.transform(doc).collect::<Vec<_>>()).collect::<Vec<_>>();
or in parallel,
use rayon::prelude::*;
let output = documents.par_iter().map(|doc| pipe.transform(doc).collect::<Vec<_>>()).collect::<Vec<_>>();
a bit similarly to what is currently done for vectorizers here. The double collect is not ideal, though.
Another alternative could be to define an arbitrary list of steps, e.g.
let steps = vec![tokenizer, filter, stemmer];
let pipe = Pipeline::new(steps);
let output = pipe.transform(text);
This would be more flexible, but again the signatures of transform would need to line up between components (a rough sketch of this idea is given below).
Another thing to consider is that the HuggingFace tokenizers crate is essentially solving this same problem. They have a pipeline struct (called "Tokenizer") that consists of 4 steps, where for instance the signature of a whitespace tokenizer can be found here. I wouldn't have minded re-using their API or even some models since they clearly have good traction and more resources at the moment, however the limitations I see are,
Studying their API more and possible ways to interact might be worthwhile though.
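Coming back to the "arbitrary list of steps" idea above, here is a rough sketch of how it could work if the listed steps are restricted to token-to-token transforms behind an object-safe trait (the tokenizer would then sit outside the list); all names here are hypothetical:

```rust
// All steps map an iterator of tokens to an iterator of tokens, so they can
// share one object-safe trait and live together in a Vec.
trait Step {
    fn transform<'a>(
        &'a self,
        tokens: Box<dyn Iterator<Item = String> + 'a>,
    ) -> Box<dyn Iterator<Item = String> + 'a>;
}

struct Lowercase;
impl Step for Lowercase {
    fn transform<'a>(
        &'a self,
        tokens: Box<dyn Iterator<Item = String> + 'a>,
    ) -> Box<dyn Iterator<Item = String> + 'a> {
        Box::new(tokens.map(|t| t.to_lowercase()))
    }
}

struct StopwordFilter {
    word: String,
}
impl Step for StopwordFilter {
    fn transform<'a>(
        &'a self,
        tokens: Box<dyn Iterator<Item = String> + 'a>,
    ) -> Box<dyn Iterator<Item = String> + 'a> {
        Box::new(tokens.filter(move |t| *t != self.word))
    }
}

struct Pipeline {
    steps: Vec<Box<dyn Step>>,
}
impl Pipeline {
    fn transform<'a>(&'a self, text: &'a str) -> Vec<String> {
        // The tokenizer is hard-coded here because its input type differs
        // from the token-to-token steps stored in the Vec.
        let mut tokens: Box<dyn Iterator<Item = String> + 'a> =
            Box::new(text.split_whitespace().map(|s| s.to_string()));
        for step in &self.steps {
            tokens = step.transform(tokens);
        }
        tokens.collect()
    }
}

fn main() {
    let pipe = Pipeline {
        steps: vec![
            Box::new(Lowercase),
            Box::new(StopwordFilter { word: "the".into() }),
        ],
    };
    println!("{:?}", pipe.transform("The Moon is the only natural satellite"));
}
```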
It could be nice to add some of those benchmarks for tokenizers. It would in particular allow measuring the overhead of the Python wrapper by comparing with the pure-Rust results.
We can construct a pipeline function using a macro that doesn't require a common trait. All that is required is a transform method on each component:
macro_rules! pipeline {
($input:expr, $($args:expr),*) => {{
let output = $input;
$(
let output = $args.transform(output);
)*
output.collect()
}}
}
let vecs: Vec<String> = pipeline!(
text,
tokenizer,
filter,
stemmer
);
If there isn't a common trait, are there any other downsides? Another option could be that the tokenizer's transform method maps an iterator to an iterator, and to use it you have to wrap the input in an iterator (see the sketch below). Another standard method could be added for the tokenizer to take a plain &str directly.
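A sketch of what that "wrap the input in an iterator" call site could look like (the free function here just stands in for a tokenizer's iterator-in/iterator-out transform; it is not a proposed API):

```rust
fn main() {
    // Stand-in for a tokenizer transform with an iterator-in/iterator-out shape.
    fn transform<'a>(
        docs: impl Iterator<Item = &'a str> + 'a,
    ) -> impl Iterator<Item = &'a str> + 'a {
        docs.flat_map(|d| d.split_whitespace())
    }

    // A single document is wrapped in a one-item iterator before being passed in.
    let tokens: Vec<&str> = transform(std::iter::once("Mary had a little lamb.")).collect();
    println!("{:?}", tokens);
}
```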
I don't have any experience with the HuggingFace tokenizers crate.
I don't think they have a unifying trait or interface for all pipeline components from what I can see.
I believe this has a performance impact as it uses dynamic dispatch. But I think it's necessary if you want all components to have a common trait. As far as I can work out (correct me if I'm wrong) you have to make a trade off:
So there are two considerations:
I ran a benchmark comparing using Box<dyn Iterator> with the unboxed impl Iterator version.
So it doesn't seem like much of a difference, which is good, as boxing is what makes a common trait possible. Here is an example of a shared trait using Box<dyn Iterator>:
#[allow(dead_code)]
#[allow(unused_variables)]
#[derive(Debug)]
enum StringWrap<'a> {
Slice(&'a str),
String(String),
}
trait PipelineComponent<'a> {
type InItem;
type OutItem;
fn transform(
&'a self,
items: Box<dyn Iterator<Item = Self::InItem> + 'a>,
) -> Box<dyn Iterator<Item = Self::OutItem> + 'a>;
}
struct DummyTokenizer {
split_on: String,
}
impl<'a> DummyTokenizer {
fn transform_text(&'a self, text: &'a str) -> Box<dyn Iterator<Item = StringWrap<'a>> + 'a> {
let iter = text.split(&self.split_on).map(|x| StringWrap::Slice(x));
Box::new(iter)
}
}
impl<'a> PipelineComponent<'a> for DummyTokenizer {
type InItem = &'a str;
type OutItem = StringWrap<'a>;
fn transform(
&'a self,
items: Box<dyn Iterator<Item = Self::InItem> + 'a>,
) -> Box<dyn Iterator<Item = Self::OutItem> + 'a> {
let iter = items.flat_map(move |s| self.transform_text(s));
Box::new(iter)
}
}
struct DummyFilter {
word: String,
}
impl<'a> PipelineComponent<'a> for DummyFilter {
type InItem = StringWrap<'a>;
type OutItem = Self::InItem;
fn transform(
&'a self,
items: Box<dyn Iterator<Item = Self::InItem> + 'a>,
) -> Box<dyn Iterator<Item = Self::OutItem> + 'a> {
let iter = items.filter(move |x| match x {
StringWrap::Slice(s) => s.clone() != self.word,
StringWrap::String(s) => s.clone() != self.word,
});
Box::new(iter)
}
}
struct DummyStemmer {}
impl<'a> PipelineComponent<'a> for DummyStemmer {
type InItem = StringWrap<'a>;
type OutItem = Self::InItem;
fn transform(
&'a self,
items: Box<dyn Iterator<Item = Self::InItem> + 'a>,
) -> Box<dyn Iterator<Item = Self::OutItem> + 'a> {
// Outputs a StringWrap::string
let iter = items.map(|x| match x {
StringWrap::Slice(s) => StringWrap::String([s, "ing"].join("")),
StringWrap::String(s) => StringWrap::String([s, "ing".to_string()].join("")),
});
Box::new(iter)
}
}
struct DummyEmbedder {}
impl<'a> PipelineComponent<'a> for DummyEmbedder {
type InItem = StringWrap<'a>;
type OutItem = Vec<u8>;
fn transform(
&'a self,
items: Box<dyn Iterator<Item = Self::InItem> + 'a>,
) -> Box<dyn Iterator<Item = Self::OutItem> + 'a> {
let iter = items.map(|x| match x {
StringWrap::Slice(s) => Vec::from(s.as_bytes()),
StringWrap::String(s) => Vec::from(s.as_bytes()),
});
Box::new(iter)
}
}
fn main() {
// API example
let text = "Marry had a little lamb.";
// Pipeline components
let tokenizer = DummyTokenizer {
split_on: " ".to_string(),
};
let filter = DummyFilter {
word: "lamb".to_string(),
};
let stemmer = DummyStemmer {};
let embedder = DummyEmbedder {};
// Pipeline
let output = tokenizer.transform_text(text);
let output = filter.transform(output);
let output = stemmer.transform(output);
let output = embedder.transform(output).collect::<Vec<u8>>();
println!("{:?}", output);
}
Key features:
I haven't done so already, but it should be relatively easy to create a Pipeline object.
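As a sketch of what such a Pipeline object could look like on top of the PipelineComponent trait above (the fixed three-stage arity and the names are assumptions, just to show the idea):

```rust
// A minimal three-stage container: each stage's output item type must match
// the next stage's input item type, enforced through the associated types.
struct Pipeline3<A, B, C> {
    first: A,
    second: B,
    third: C,
}

impl<'a, A, B, C> Pipeline3<A, B, C>
where
    A: PipelineComponent<'a>,
    B: PipelineComponent<'a, InItem = A::OutItem>,
    C: PipelineComponent<'a, InItem = B::OutItem>,
{
    fn transform(
        &'a self,
        items: Box<dyn Iterator<Item = A::InItem> + 'a>,
    ) -> Box<dyn Iterator<Item = C::OutItem> + 'a> {
        self.third
            .transform(self.second.transform(self.first.transform(items)))
    }
}

// Usage with the dummy components above (the tokenizer's input item being
// &str, a single document is wrapped in a one-item iterator):
//
// let pipe = Pipeline3 { first: tokenizer, second: filter, third: stemmer };
// let output: Vec<_> = pipe.transform(Box::new(std::iter::once(text))).collect();
```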
Thanks for investigating @joshlk !
The macro approach is interesting in that it would allow combining steps with more flexibility. The issue is that if there is no pipeline object, it would be a bit harder to interface with code that expects to take a pipeline as input (e.g. here), and it won't be possible to serialize pipelines.
Thanks for doing that! Yes, I was hoping there wouldn't be that much difference. I also get very similar results when running the benchmark in your branch.
I'm getting a compile error with this code, probably a minor issue,
but removing the DummyEmbedder step works as expected. Static dispatch for the input could also work if we want it, though maybe the signatures are getting a bit more complex,
trait PipelineComponent<'a> {
type InItem;
type OutItem;
fn transform<T>(&'a self, items: T) -> Box<dyn Iterator<Item = Self::OutItem> + 'a>
where
T: Iterator<Item = Self::InItem> + 'a;
}
(example in https://github.com/rth/vtext/blob/example/pipeline2/examples/example1.rs)
Maybe the main inconsistency is that the tokenizer takes a &str while the other components take an iterator of tokens.
To avoid this inconsistency I wonder if something like the following could be done,
impl<'a> PipelineComponent<'a> for DummyTokenizer {
type In = Iterator<Item = &'a str>;
type OutItem = StringWrap<'a>;
fn transform<T: Self::In + 'a>(&'a self, items: T) -> Box<dyn Iterator<Item = Self::OutItem> + 'a>
{ ... }
}
(so far I haven't managed to make this compile, but I'll try later). Overall the parametrizable types for input/output sound good.
As an alternative we could also consider using smolstr or smartstring to return a more consistent type. Benchmarks can be found in https://fasterthanli.me/articles/small-strings-in-rust
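For illustration, a minimal sketch of what that could look like with the smol_str crate (the crate choice and usage here are just an example, not a decision; smartstring would look similar):

```rust
// Requires the smol_str crate in Cargo.toml.
use smol_str::SmolStr;

// Every component returns the same owned-but-cheap token type, avoiding the
// StringWrap enum: short tokens are stored inline, longer ones on the heap.
fn tokenize(text: &str) -> impl Iterator<Item = SmolStr> + '_ {
    text.split_whitespace().map(SmolStr::new)
}

fn stem(tokens: impl Iterator<Item = SmolStr>) -> impl Iterator<Item = SmolStr> {
    // Dummy "stemmer" that appends a suffix, still producing a SmolStr.
    tokens.map(|t| SmolStr::new(format!("{}ing", t)))
}

fn main() {
    let tokens: Vec<SmolStr> = stem(tokenize("Mary had a little lamb")).collect();
    println!("{:?}", tokens);
}
```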
Another thought regarding pipeline design. I am experimenting with creating Rust functions that input and output iterators and that can be linked together in Python. In the code below, the function string_to_iterator takes a string and returns an iterator, and the function upercase takes an iterator and returns a new iterator. So the Python signatures of the two functions are (roughly) string -> iterator and iterator -> iterator.
Both are implemented in Rust and exposed to Python via pyo3. So in Python:
from rustlib import string_to_iterator, upercase
iter_1 = string_to_iterator("a bb ccc 333") # Creates an iterator
iter_2 = upercase(iter_1) # Creates an new iterator from the previous
print(iter_2.__next__()) # "A"
print(iter_2.__next__()) # "BB"
So the items are processed lazily as they are pulled from the Python iterator. This is achieved by wrapping the Rust iterators in a struct IterBox. Here is the Rust code:
extern crate pyo3;
use pyo3::prelude::*;
use pyo3::{wrap_pyfunction, PyIterProtocol};
#[pyfunction]
fn string_to_iterator<'py>(py: Python<'py>, input: String) -> PyResult<IterBox> {
let input_vec: Vec<String> = input.split(" ").map(|s| s.to_string()).collect();
let iter = input_vec.into_iter();
Ok(IterBox {
iter: Box::new(iter)
})
}
#[pyfunction]
fn upercase<'py>(py: Python<'py>, iter: IterBox) -> PyResult<IterBox> {
let iter_new = iter.iter.map(|s| s.to_uppercase());
Ok(IterBox {
iter: Box::new(iter_new)
})
}
#[pyclass]
struct IterBox {
iter: Box<dyn CloneIterator<Item=String> + Send>,
}
#[pyproto]
impl PyIterProtocol for IterBox {
fn __iter__(slf: PyRefMut<Self>) -> Py<Self> {
slf.into()
}
fn __next__(mut slf: PyRefMut<Self>) -> Option<String> {
slf.iter.next()
}
}
// --- Make IterBox cloneable
trait CloneIterator: Iterator + Send {
fn clone_box(&self) -> Box<dyn CloneIterator<Item = Self::Item> + Send>;
}
impl<T> CloneIterator for T
where
T: 'static + Iterator + Clone + Send,
{
fn clone_box(&self) -> Box<dyn CloneIterator<Item = Self::Item> + Send> {
Box::new(self.clone())
}
}
impl Clone for Box<dyn CloneIterator<Item=String> + Send> {
fn clone(&self) -> Box<dyn CloneIterator<Item=String> + Send> {
(**self).clone_box()
}
}
impl Clone for IterBox {
fn clone(&self) -> IterBox {
IterBox {
iter: self.iter.clone_box()
}
}
}
// ------
#[pymodule]
fn rustlib(_py: Python, m: &PyModule) -> PyResult<()> {
m.add_wrapped(wrap_pyfunction!(string_to_iterator))?;
m.add_class::<IterBox>()?;
m.add_wrapped(wrap_pyfunction!(upercase))?;
Ok(())
}
I think further work could be done to make the current implementation more ergonomic, as it currently clones the iterator when passed as an argument value. But I hope it paints the general picture of what's achievable. This setup could be used as a way of linking pipeline objects in Python, which might mitigate the need to have a pipeline object in Rust that is compatible with Python.
Ideally, an NLP pipeline in Rust could look something like the sketch below, where `collection` is an iterator over documents. There are several challenges with it though,
- `RegexpTokenizer` takes a reference to the document and returns an `Iterable` of `&str` with the same lifetime as the input document, but then the borrow checker doesn't appear to be happy when it is used in the pipeline. This may be related to using closures (cf. next point) though.
- `collection.map(tokenizer)` doesn't work, nor does `collection.map(tokenizer.tokenize)` (i.e. using a method) for some reason. We can use `collection.map(|document| tokenizer.tokenize(&document))` but then the lifetime is not properly handled between input and output (described in the previous point).
More investigation would be necessary, and both points are likely related.