LLMs on your Local

Published on January 27, 2025

Running LLMs Locally

The WHY and HOW to use them …

Image from LM Studio

There are a few reasons why you should consider running LLMs on your local machine. Company policies, data privacy, and security are a few. But there are also benefits! Local deployment offers greater flexibility to customize the LLM according to specific needs, such as fine-tuning the model on proprietary datasets or adjusting performance parameters.

So let's jump into a couple of options for doing this.

LM Studio offers a nice GUI from which you can download and run different models with just a couple of clicks!

The Llama 3.2 3B parameter model now running on my local machine:

LM Studio

As shown above, you can use different endpoints to check if the model is running, send a chat history to the model to predict the next assistant response, predict the next tokens given a prompt, or generate text embeddings for a given text input:

POSTMAN WEB
POSTMAN
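If you prefer the terminal over Postman, here is a minimal sketch of the same calls with curl. It assumes LM Studio's local server is running on its default port (1234) and that the model identifier matches what LM Studio shows for your downloaded model; adjust both if yours differ:

# Check which models the server has available (also confirms it is up)
curl http://localhost:1234/v1/models

# Send a chat history and get the next assistant response
curl http://localhost:1234/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "llama-3.2-3b-instruct",
  "messages": [
    { "role": "user", "content": "Why run an LLM locally?" }
  ]
}'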

You can easily download and run any of the available models on your local machine:

LM Studio does not require an internet connection in order to work: chatting with models, chatting with documents, and running a local server can all be done offline.

Let's explore Ollama!

This is as simple as downloading it from their site and running a model (any model) from your terminal:

ollama run llama3.2

VSCODE
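If you just want to download a model or see what you already have installed, Ollama also ships a couple of handy subcommands (the model name here is simply the one used above):

# Download the model without opening a chat session
ollama pull llama3.2

# List the models installed locally
ollama list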

You can also use and test the API endpoints:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "What are you?",
  "stream": false
}'
POSTMAN
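Since "stream": false returns a single JSON object, you can pull out just the generated text from its response field; a small sketch, assuming you have jq installed:

curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "What are you?",
  "stream": false
}' | jq -r '.response'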

You can generate the next message in a chat. This is a streaming endpoint, so there will be a series of responses. Streaming can be disabled using "stream": false. The final response object will include statistics and additional data from the request.

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {
      "role": "user",
      "content": "why should we consider using LLM on our local?"
    }
  ],
  "stream": false
}'
POSTMAN WEB
POSTMAN
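Because the endpoint takes the full chat history, you continue a conversation by appending the assistant's previous reply and your next question to the messages array. Here is a sketch of a second turn; the assistant content below is only a placeholder for whatever your model actually answered:

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    { "role": "user", "content": "why should we consider using LLM on our local?" },
    { "role": "assistant", "content": "(the previous answer from the model goes here)" },
    { "role": "user", "content": "Summarize that in one sentence." }
  ],
  "stream": false
}'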

Something to consider: you should have about 8 GB of RAM available to run the 7B models, 16 GB to run the 13B models, and 32 GB to run the 33B models.

As mentioned before, while an active internet connection is essential for the initial setup, downloading models, and accessing certain features of Ollama, the core functionality of running models locally does not require ongoing internet access. This design allows you to use the power of large language models within a secure, controlled, and offline environment once everything is properly installed and configured.

Keep it simple and secure :)

Best!