Inference is very slow on Mac m1 #1064

Open · hhamud opened this issue Jan 16, 2025 · 5 comments
Labels: bug (Something isn't working)

Comments

hhamud commented Jan 16, 2025

Describe the bug

I am using a fine-tuned model based on Microsoft's Phi-3.5 Mini called 'Sciphi/triplex', which is designed to extract entities and relationships from a piece of text. However, it is very slow: inference takes approximately 30 s when it should take about 3 s.

Any ideas on why this would be?

Code is here:

use mistralrs::{
    MemoryGpuConfig, Model, PagedAttentionMetaBuilder, TextMessageRole, TextMessages,
    TextModelBuilder,
};
use serde_json::Value;

/// Builds the NER / knowledge-graph-triplet prompt for the given input text.
pub fn format(text: &str) -> String {
    // Comma-separated lists of the allowed entity types and predicates.
    let entity = ["LOCATION", "POSITION", "DATE", "CITY", "COUNTRY", "NUMBER"].join(",");
    let predicates = ["POPULATION", "AREA"].join(",");

    format!("
Perform Named Entity Recognition (NER) and extract knowledge graph triplets from the text. NER identifies named entities of given entity types, and triple extraction identifies relationships between entities using specified predicates.

        **Entity Types:**
        {entity}

        **Predicates:**
        {predicates}

        **Text:**
        {text}
")
}

// Wrapper around the built mistral.rs text model.
pub struct Llm {
    model: Model,
}

impl Llm {
    pub async fn new(model_id: &str) -> Self {
        let model = TextModelBuilder::new(model_id)
            .with_logging()
            .with_paged_attn(|| {
                PagedAttentionMetaBuilder::default()
                    .with_gpu_memory(MemoryGpuConfig::MbAmount(500))
                    .build()
            })
            .unwrap()
            .build()
            .await
            .unwrap();

        Self { model }
    }

    pub async fn send_message(&self, message: &str) -> Result<String, String> {
        let messages = TextMessages::new().add_message(TextMessageRole::User, message);

        let resp = self
            .model
            .send_chat_request(messages)
            .await
            .map_err(|e| e.to_string())?;
        let content = resp.choices[0]
            .message
            .content
            .clone()
            .ok_or_else(|| "failed to parse response".to_string())?;

        // Pretty-print the response when it is valid JSON; otherwise return the raw text.
        if let Ok(parsed) = serde_json::from_str::<Value>(&content) {
            if let Ok(formatted) = serde_json::to_string_pretty(&parsed) {
                return Ok(formatted);
            }
        }

        Ok(content)
    }
}

#[cfg(test)]
mod tests {
    use std::time::Instant;

    use super::*;

    #[tokio::test(flavor = "multi_thread", worker_threads = 1)]
    async fn test_spawn_llm() {
        let text = "San Francisco, officially the City and County of San Francisco, is a commercial, financial, and cultural center in Northern California. With a population of 808,437 residents as of 2022, San Francisco is the fourth most populous city in the U.S. state of California behind Los Angeles, San Diego, and San Jose.";
        let input = format(text);
        let llm = Llm::new("sciphi/triplex").await;

        // Time only the inference itself, not model loading.
        let start = Instant::now();
        let resp = llm.send_message(&input).await;
        let elapsed = start.elapsed();
        println!("Total time: {elapsed:?}");
        assert!(resp.is_ok());
    }
}

Latest commit or version

master branch, latest commit

hhamud added the bug label on Jan 16, 2025
cdoko (Collaborator) commented Jan 16, 2025

Is this compiled with the metal feature enabled?

hhamud (Author) commented Jan 16, 2025

> Is this compiled with the metal feature enabled?

Yes, I have this in my Cargo.toml:

mistralrs = { git = "https://github.com/EricLBuehler/mistral.rs.git", branch = "master", features = ["metal"] }

hiive commented Feb 12, 2025

Seconded. Performance seemed to take a nosedive about a month ago.

EricLBuehler (Owner) commented
@hiive I think I may have a solution for your case.

On Metal, our preallocation for a large PagedAttention KV cache can cause slowdowns for some reason.

I would recommend using the PagedAttentionMetaBuilder::with_gpu_memory method to set the memory amount (in MB) to something reasonable (for example, 4096 MB). I think this should improve speeds.
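
For illustration, a minimal sketch of that suggestion applied to the builder from the original report (same model id and setup as the code above; only the with_gpu_memory amount changes):

let model = TextModelBuilder::new("sciphi/triplex")
    .with_logging()
    .with_paged_attn(|| {
        PagedAttentionMetaBuilder::default()
            // Cap the PagedAttention KV-cache preallocation at 4096 MB
            // (the original snippet used MbAmount(500)).
            .with_gpu_memory(MemoryGpuConfig::MbAmount(4096))
            .build()
    })
    .unwrap()
    .build()
    .await
    .unwrap();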

@hhamud how much memory is available on your system?

hhamud (Author) commented Feb 21, 2025

> @hhamud how much memory is available on your system?

Mac M1 Pro, 32 GB
