Inference is very slow on Mac m1 #1064

Open · hhamud opened this issue Jan 16, 2025 · 5 comments
Labels: bug (Something isn't working)

Comments

hhamud commented Jan 16, 2025

Describe the bug

I am using a fine-tuned model based on Microsoft's Phi-3.5 Mini called 'Sciphi/triplex', which is designed to extract entities and relationships from a piece of text. However, it is very slow: inference takes approximately 30 s when it should take about 3 s.

Any ideas on why this would be?

Code is here:

use mistralrs::{
    MemoryGpuConfig, Model, PagedAttentionMetaBuilder, TextMessageRole, TextMessages,
    TextModelBuilder,
};
use serde_json::Value;

/// Builds the NER / knowledge-graph-triplet prompt for the given input text.
pub fn format(text: &str) -> String {
    // Comma-separated lists of the allowed entity types and predicates.
    let entity = ["LOCATION", "POSITION", "DATE", "CITY", "COUNTRY", "NUMBER"].join(",");
    let predicates = ["POPULATION", "AREA"].join(",");

    format!("
Perform Named Entity Recognition (NER) and extract knowledge graph triplets from the text. NER identifies named entities of given entity types, and triple extraction identifies relationships between entities using specified predicates.

        **Entity Types:**
        {entity}

        **Predicates:**
        {predicates}

        **Text:**
        {text}
")
}

// Wrapper around the built mistral.rs text model.
pub struct Llm {
    model: Model,
}

impl Llm {
    pub async fn new(model_id: &str) -> Self {
        let model = TextModelBuilder::new(model_id)
            .with_logging()
            .with_paged_attn(|| {
                PagedAttentionMetaBuilder::default()
                    .with_gpu_memory(MemoryGpuConfig::MbAmount(500))
                    .build()
            })
            .unwrap()
            .build()
            .await
            .unwrap();

        Self { model }
    }

    pub async fn send_message(&self, message: &str) -> Result<String, String> {
        let messages = TextMessages::new().add_message(TextMessageRole::User, message);

        let resp = self
            .model
            .send_chat_request(messages)
            .await
            .map_err(|e| e.to_string())?;
        let content = resp.choices[0]
            .message
            .content
            .clone()
            .ok_or_else(|| "failed to parse response".to_string())?;

        // Pretty-print the response when it is valid JSON; otherwise return the raw text.
        if let Ok(parsed) = serde_json::from_str::<Value>(&content) {
            if let Ok(formatted) = serde_json::to_string_pretty(&parsed) {
                return Ok(formatted);
            }
        }

        Ok(content)
    }
}

#[cfg(test)]
mod tests {
    use std::time::Instant;

    use super::*;

    #[tokio::test(flavor = "multi_thread", worker_threads = 1)]
    async fn test_spawn_llm() {
        let text = "San Francisco, officially the City and County of San Francisco, is a commercial, financial, and cultural center in Northern California. With a population of 808,437 residents as of 2022, San Francisco is the fourth most populous city in the U.S. state of California behind Los Angeles, San Diego, and San Jose.";
        let input = format(text);
        let llm = Llm::new("sciphi/triplex").await;

        // Time only the inference itself, not model loading.
        let start = Instant::now();
        let resp = llm.send_message(&input).await;
        let elapsed = start.elapsed();
        println!("Total time: {elapsed:?}");
        assert!(resp.is_ok());
    }
}

Latest commit or version

master branch, latest commit

hhamud added the bug label on Jan 16, 2025
cdoko (Collaborator) commented Jan 16, 2025

Is this compiled with the metal feature enabled?

hhamud (Author) commented Jan 16, 2025

> Is this compiled with the metal feature enabled?

Yes, I have this in my Cargo.toml:

mistralrs = { git = "https://github.com/EricLBuehler/mistral.rs.git", branch = "master", features = ["metal"] }

hiive commented Feb 12, 2025

Seconded. Performance seemed to take a nosedive about a month ago.

EricLBuehler (Owner) commented
@hiive I think I may have a solution for your case.

On Metal, our preallocation for a large PagedAttention KV cache can cause slowdowns for some reason.

I would recommend using the PagedAttentionMetaBuilder::with_gpu_memory method to set the memory amount (in MB) to something reasonable (for example, 4096 MB). I think this should improve speeds.
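
For illustration, a minimal sketch of that suggestion applied to the builder from the original report (same model id and setup as the code above; only the with_gpu_memory amount changes):

let model = TextModelBuilder::new("sciphi/triplex")
    .with_logging()
    .with_paged_attn(|| {
        PagedAttentionMetaBuilder::default()
            // Cap the PagedAttention KV-cache preallocation at 4096 MB
            // (the original snippet used MbAmount(500)).
            .with_gpu_memory(MemoryGpuConfig::MbAmount(4096))
            .build()
    })
    .unwrap()
    .build()
    .await
    .unwrap();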

@hhamud how much memory is available on your system?

hhamud (Author) commented Feb 21, 2025

> @hhamud how much memory is available on your system?

Mac M1 Pro, 32 GB
