
How I Built a Personal Finance AI Assistant with Local Language Models
And what I learned along the way
I’ve always hated the idea of sending my financial data to cloud-based personal finance services. Account names and numbers, transactions, spending habits (and more) reveal intimate details about my life. Once in the cloud, my financial data is for sale and can be used to target me for advertisements or other purposes.
Now I know what you’re thinking — paid personal finance solutions exist, and some allegedly don’t sell or share your data, but some random person still has access to it. They also usually operate on a subscription model, and subscriptions don’t float my boat. I don’t even pay for a music streaming service.
Excel is the next best option, right? It’s fine for most people, but I have a deep-seated hatred for Excel, so that won’t work either.
So, I did what any rational human being would do. I built my own iOS budgeting app. Release date TBD.

I’ve Got the Power
With full control over my own budgeting app, I can integrate basically whatever I want.
Ever since modern “AI” entered the public consciousness with ChatGPT, I’ve been dying to apply the technology to personal finance. Being able to quickly answer complicated questions about income and spending trends, or to forecast the impact of significant changes to my finances, is valuable to me.
The problem? Cloud-based AI APIs aren’t free, and they aren’t private.
Still being my completely rational self, I set out to explore a question: could I build a useful AI-powered financial assistant that wasn’t cloud-based?
Proof of Concept

Local language models have come a long way since I first started experimenting with Llama 2 — they’ve become far more capable, and the parameter count needed for a given level of intelligence has shrunk dramatically.
The ultimate goal for me is on-device inference on an iPhone. To take full advantage of Apple’s hardware and software, I would need to use Apple’s Core ML framework. The only problem is that Core ML doesn’t natively support formats like GGUF or GGML — I would need to convert models to Apple’s .mlmodel format, and that conversion process is complex and time-consuming today. As a result, it’s very unattractive for a proof of concept. Since any good news on this front is at least a few months away at WWDC 2025, I had to find another way to experiment.
The Workaround
While on-device Core ML-based inference isn’t feasible for me right now, that doesn’t mean Apple silicon-native inference isn’t. Enter Ollama. Ollama lets me host and run inference on open-source language models locally on my M1 Pro MacBook Pro (thanks, Apple, for that great naming scheme), and it exposes an API for me to build on. For the purposes of this article, I used either the phi4-mini or gemma3:4b small language model to simulate what might be possible on an iPhone.
Architecture
The setup to use a locally networked Ollama inference server in my app is relatively straightforward — here’s my stack:

This setup allows me to pull models from Ollama, host them on Apple silicon that isn’t all that different from a modern iPhone (RAM notwithstanding), and interact with them over the network from my budgeting app. Changing out models is as simple as changing one line of config code, so I can rapidly test a whole host of models without complicated model format conversions.
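I haven’t shared the actual config here, but it boils down to a handful of values. A rough sketch (the names and values below are illustrative rather than copied from my codebase):

// Illustrative config only; the real values live in my app's settings layer
struct OllamaConfig {
    // Swap this single line to test a different model (e.g. "gemma3:4b")
    static let modelName = "phi4-mini"

    // The Mac running Ollama, reachable over the local network on Ollama's default port
    static let baseURL = URL(string: "http://192.168.1.50:11434/api")!

    // Sampling parameters passed through to /api/generate
    static let temperature = 0.7
    static let topP = 0.9
    static let maxTokens = 1024
}

Let’s start from the top.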
User Experience
UX is the core of any app, and to keep mine simple, I implemented a messaging-style UI that follows familiar patterns for an AI chat experience:
- A section for users to type and send their prompt to the assistant
- Persistent chat history for context continuity
- Support for follow-up questions that maintain conversational context
- Ability to clear context and start fresh via a simple button
When put together, those elements create a user experience of simply chatting with a financial assistant, while the complex data extraction, context management, and inference processes are executing behind the scenes. It looks a little something like this:

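Under the hood, the chat state is nothing exotic. This isn’t my exact implementation, but a minimal sketch of the message model and view model behind a UI like this could look as follows:

import SwiftUI

// A minimal sketch, not the actual implementation: the state backing a
// messaging-style chat UI with history and a "start fresh" action
struct ChatMessage: Identifiable {
    let id = UUID()
    var text: String
    let isFromUser: Bool
}

@MainActor
final class ChatViewModel: ObservableObject {
    @Published var messages: [ChatMessage] = []  // persistent chat history
    @Published var draft = ""                    // the prompt being typed

    func appendToCurrentResponse(_ token: String) {
        // Append streamed tokens to the assistant's in-progress message
        if let lastIndex = messages.indices.last, !messages[lastIndex].isFromUser {
            messages[lastIndex].text += token
        } else {
            messages.append(ChatMessage(text: token, isFromUser: false))
        }
    }

    func clearConversation() {
        // The "clear and start fresh" button simply drops the accumulated history
        messages.removeAll()
    }
}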
With the frontend and backend now taken care of, I needed to figure out the messy middle: getting them to talk to each other.
The Communication Layer
Getting the communication layer stood up actually wasn’t too bad — Ollama’s API documentation was useful, and there were plenty of supporting resources on the web. To stay consistent with users’ expectations of how AI chat experiences work today, I streamed the model’s response back to my app instead of displaying a loading icon until the response was complete. The heart of my implementation looks like the following:
/// Generates a streaming response from the Ollama API
/// - Parameters:
///   - prompt: The user's prompt
///   - systemPrompt: Optional system instructions
///   - onToken: Callback that receives each token as it arrives
///   - onCompletion: Callback invoked when the response completes or fails
func generateStreamingResponse(
    prompt: String,
    systemPrompt: String? = nil,
    onToken: @escaping (String) -> Void,
    onCompletion: @escaping (Result<Void, Error>) -> Void
) {
    let fullPrompt = formatPrompt(userPrompt: prompt, systemPrompt: systemPrompt)
    let request = GenerateRequest(
        model: modelName,
        prompt: fullPrompt,
        stream: true,
        options: GenerateRequest.Options(
            temperature: temperature,
            topP: topP,
            numPredict: maxTokens
        )
    )

    Task {
        do {
            // Create the URL request
            let url = baseURL.appendingPathComponent("generate")
            var urlRequest = URLRequest(url: url)
            urlRequest.httpMethod = "POST"
            urlRequest.addValue("application/json", forHTTPHeaderField: "Content-Type")
            urlRequest.timeoutInterval = 60.0

            let encoder = JSONEncoder()
            urlRequest.httpBody = try encoder.encode(request)

            // Start the streaming session
            let (stream, _) = try await URLSession.shared.bytes(for: urlRequest)
            let decoder = JSONDecoder()

            for try await line in stream.lines {
                guard !line.isEmpty else { continue }

                // Each line contains a complete JSON response chunk
                if let data = line.data(using: .utf8) {
                    do {
                        let response = try decoder.decode(GenerateResponse.self, from: data)
                        onToken(response.response)

                        // Check if this is the final chunk
                        if response.done == true {
                            onCompletion(.success(()))
                            break
                        }
                    } catch {
                        print("⚠️ Error decoding JSON: \(error)")
                        onCompletion(.failure(error))
                        break
                    }
                }
            }
        } catch {
            print("⚠️ Streaming error: \(error)")
            onCompletion(.failure(error))
        }
    }
}
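The GenerateRequest and GenerateResponse types referenced above aren’t shown here; they mirror the request and streamed-response fields of Ollama’s /api/generate endpoint. A minimal sketch of what they could look like (my actual structs may carry a few more fields):

// Request body for Ollama's /api/generate endpoint
struct GenerateRequest: Encodable {
    struct Options: Encodable {
        let temperature: Double
        let topP: Double
        let numPredict: Int

        enum CodingKeys: String, CodingKey {
            case temperature
            case topP = "top_p"
            case numPredict = "num_predict"
        }
    }

    let model: String    // e.g. "phi4-mini" or "gemma3:4b"
    let prompt: String   // system prompt + user prompt, already combined by formatPrompt
    let stream: Bool     // true so tokens arrive as they are generated
    let options: Options
}

// One line of the streamed response; the final chunk has done == true
struct GenerateResponse: Decodable {
    let response: String
    let done: Bool?
}

Ollama also accepts a separate system field on this endpoint, so folding the system prompt into the prompt string is a choice, not a requirement.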
After a good amount of troubleshooting and fixing, I was finally able to get HTTP 200 logs showing in my Ollama server terminal, and responses were returning to my frontend. They weren’t very good responses though.
Intelligent Context Management
It quickly became obvious that while I could ask the assistant questions, it didn’t have access to my data, which meant it was just making up answers. Manually providing data in my prompt produced much better results, but that defeated the purpose of interacting with an assistant, so I started thinking through ways to give it access to the data in the app’s Core Data store.
Retrieval-augmented generation (RAG) was my first thought for this scenario. However, setting up a vector database, embedding the user’s data, and retrieving it for the model is complete overkill for my app, so I ruled it out about as quickly as I thought of it. The next best option was to load relevant data directly into the model’s context window.
For some context (ha!) — GPT-3 had a context window of around 2k tokens when it went mainstream. Models today, even tiny ones, have context windows of 128k tokens or more. Knowing this, I figured I had far more context available than a chat session alone would use. I could probably just dump my whole Core Data store into context for the assistant to use, right?
Wrong. What I found in practice was:
- Even with just test data, my data’s complicated relationships used far more context than I assumed they would. Everything fit, but only barely, which means this wouldn’t work for a production implementation (some rough math on that below).
- The models couldn’t handle this amount of data and still give relevant responses — the rate of inaccurate and frustrating responses went through the roof.
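For a sense of scale, a common back-of-the-napkin check is to divide character count by roughly four to approximate tokens. The numbers below are illustrative, not measurements from my app:

// Very rough token estimate using the ~4 characters-per-token heuristic
func estimatedTokenCount(for contextString: String) -> Int {
    contextString.count / 4
}

// Illustrative math: a full serialization of the store weighing in around
// 400,000 characters would be on the order of 100,000 tokens, which eats
// most of a 128k context window before the system prompt, chat history,
// and the model's own response are accounted for.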
I had to find some other way to provide only the most relevant data. Enter:
The Context Generator
The context generator’s purpose is to analyze keywords in user questions (e.g., “spend,” “budget,” “income,” “account,” etc.) and determine what financial data is relevant to provide to the model. Based on the detected intent, the system selectively pulls that data from the Core Data store. I also time-box this to the user’s currently selected year in the app so as not to overwhelm the model with too much data.
This approach provides what I believe will be sufficient context to answer most questions, but is clearly lacking for larger context applications such as year-over-year comparisons. It’s also very brittle, and not something I intend to support long-term. You can see some of how I’m doing this here:
func getBudgetContext(for query: String, context: NSManagedObjectContext) -> String {
    var budgetContext = ""

    // Add budget summary data (always included)
    budgetContext += getBudgetSummary(context: context)

    // Analyze the query and add relevant data
    let queryAnalysis = analyzeQuery(query)

    // Add query-specific data based on detected keywords
    if queryAnalysis.needsSpendingDetails {
        budgetContext += getSpendingDetails(context: context)
    }
    if queryAnalysis.needsCategoryDetails {
        budgetContext += getCategoryDetails(context: context)
    }
    if queryAnalysis.needsIncomeDetails {
        budgetContext += getIncomeDetails(context: context)
    }
    if queryAnalysis.needsAccountDetails {
        budgetContext += getAccountDetails(context: context)
    }
    if queryAnalysis.needsUpcomingTransactions {
        budgetContext += getUpcomingTransactions(context: context)
    }

    return budgetContext
}

// MARK: - Query Analysis

private struct QueryAnalysis {
    var needsSpendingDetails = false
    var needsCategoryDetails = false
    var needsIncomeDetails = false
    var needsAccountDetails = false
    var needsUpcomingTransactions = false
}

private func analyzeQuery(_ query: String) -> QueryAnalysis {
    let lowercasedQuery = query.lowercased()
    var analysis = QueryAnalysis()

    // Spending keywords
    let spendingKeywords = ["spend", "expenses", "cost", "paid", "purchase", "bought"]
    analysis.needsSpendingDetails = spendingKeywords.contains { lowercasedQuery.contains($0) }

    // Category keywords
    let categoryKeywords = ["category", "budget", "allocation", "limit"]
    analysis.needsCategoryDetails = categoryKeywords.contains { lowercasedQuery.contains($0) }

    // Income keywords
    let incomeKeywords = ["income", "earn", "salary", "revenue", "money", "paid"]
    analysis.needsIncomeDetails = incomeKeywords.contains { lowercasedQuery.contains($0) }

    // Account keywords
    let accountKeywords = ["account", "balance", "bank", "savings", "checking"]
    analysis.needsAccountDetails = accountKeywords.contains { lowercasedQuery.contains($0) }

    // Upcoming transaction keywords
    let upcomingKeywords = ["upcoming", "recurring", "future", "next", "planned", "scheduled"]
    analysis.needsUpcomingTransactions = upcomingKeywords.contains { lowercasedQuery.contains($0) }

    return analysis
}
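The data helpers like getSpendingDetails aren’t shown above. As a rough sketch of how one of them might pull year-boxed data out of Core Data, assuming a hypothetical Transaction entity with date, payee, and amount attributes and a selectedYear property for the time-boxing (my real model is more involved):

import CoreData

private func getSpendingDetails(context: NSManagedObjectContext) -> String {
    // Time-box the fetch to the user's currently selected year (selectedYear
    // is assumed to be whatever year is picked in the app's UI)
    let calendar = Calendar.current
    guard let startOfYear = calendar.date(from: DateComponents(year: selectedYear)),
          let startOfNextYear = calendar.date(from: DateComponents(year: selectedYear + 1)) else {
        return ""
    }

    let request = NSFetchRequest<NSManagedObject>(entityName: "Transaction")
    request.predicate = NSPredicate(
        format: "date >= %@ AND date < %@",
        startOfYear as NSDate, startOfNextYear as NSDate
    )
    request.sortDescriptors = [NSSortDescriptor(key: "date", ascending: true)]

    guard let transactions = try? context.fetch(request), !transactions.isEmpty else {
        return ""
    }

    // Flatten the results into compact plain text the model can read
    var details = "\nSPENDING DETAILS:\n"
    for transaction in transactions {
        let date = transaction.value(forKey: "date") as? Date ?? Date()
        let payee = transaction.value(forKey: "payee") as? String ?? "Unknown"
        let amount = transaction.value(forKey: "amount") as? Double ?? 0
        details += "- \(date.formatted(date: .abbreviated, time: .omitted)): \(payee), \(amount.formatted(.currency(code: "USD")))\n"
    }
    return details
}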
Prompt Engineering
With context now being provided, I needed to shape the model’s responses to user inquiries. The system prompt is where I’ve primarily been addressing this, and it’s been an interesting exercise — my first iterations made the model’s responses nonsensical, too long, and overly flowery. Subsequent iterations overcorrected and made them too short and rigid. Eventually, I landed on a relatively useful system prompt:
// Helper function to create the system prompt
private func createSystemPrompt(with context: String) -> String {
    """
    You are a friendly budget assistant for a personal budgeting app. Provide clear, helpful insights about the user's finances.

    Here is the user's financial data:
    \(context)

    RESPONSE GUIDELINES:
    1. Be concise but warm - aim for 3-5 sentences for most responses. Do not overuse exclamation points or excessive congratulatory statements
    2. Lead with the most important financial fact or figure
    3. Balance accuracy with approachability
    4. Support answers with specific numbers from the data
    5. Use a conversational but straightforward tone
    6. For complex information, use short bullet points
    7. Offer one practical suggestion WHEN RELEVANT
    8. Avoid unnecessary technical jargon
    9. If the user asks for specifics for things like their income or expenses, be comprehensive in your response
    10. Do not offer a recap of the information you've already provided
    11. Make it clear when you're referencing data from a specific month or the entire year (e.g., "In March 2025, your expenses were..." or "For the entire year of 2025, your total income was...")
    12. Do not provide excess information - if the user wants to know specifics of a particular month, category, type of transaction, etc., do not provide information that doesn't fit their inquiry
    """
}
This system prompt generally keeps the model anchored to its role, the financial data provided to it, and how to best respond in a way that a human might appreciate. I’m still far from considering myself proficient in system prompt writing, though!
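For completeness, here is roughly how the pieces fit together when the user sends a message. The contextGenerator, ollamaService, chatViewModel, and viewContext names below are placeholders rather than my exact types:

// A rough sketch of the call site when the user sends a message
func sendMessage(_ userMessage: String) {
    // 1. Pull only the data relevant to this question
    let financialContext = contextGenerator.getBudgetContext(
        for: userMessage,
        context: viewContext
    )

    // 2. Wrap that data in the system prompt
    let systemPrompt = createSystemPrompt(with: financialContext)

    // 3. Stream the response back into the chat UI token by token
    ollamaService.generateStreamingResponse(
        prompt: userMessage,
        systemPrompt: systemPrompt,
        onToken: { token in
            // Hop back to the main actor before touching UI state
            Task { @MainActor in chatViewModel.appendToCurrentResponse(token) }
        },
        onCompletion: { result in
            if case .failure(let error) = result {
                print("Assistant error: \(error)")
            }
        }
    )
}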
End Result
When everything is put together, the experience looks like this:

Downsides
While this proof-of-concept is promising, it has significant limitations:
Dependency on Ollama and the Local Network
Obvious callout — this approach requires Ollama and my client device to be on the same network, or for the client to be VPN’d into my network. This is not my ideal end state, as I’d like this to be fully local on my iPhone, but it’s sufficient to experiment with for now.
Limited Context
Since the current implementation only has a relatively small amount of context to work with via the user prompt and context generator, it’s not fulfilling its true potential. I have some ideas on how to address this shortcoming (like model tool use), but we’ll see how far those options can go. With models this small, I suspect the results won’t be great.
(Other) Small Model Problems
We’ve seen incredible improvements in tiny models, and they’re generally usable now, but they’re still not very good. In my testing they often confuse basic things such as the difference between income and expenses, and you can generally only get a handful of prompts in before they start to go off the rails.
Inconsistent Performance
Response times vary from 2–10 seconds depending on query complexity and system load, creating a suboptimal user experience. This experience is generally acceptable for now, but I assume the clock is ticking until the responses need to be near-instantaneous.
Limited Financial Expertise
While tiny models understand financial concepts, they lack the depth of knowledge that a specialized financial advisor would have. I hope this improves with future models.
Wrap Up: Why This Matters
In the end, this experiment demonstrates that useful, privacy-preserving AI isn’t just theoretically possible — it’s achievable today. As the process to get language models running on edge devices improves, model compression techniques advance, and mobile hardware capabilities shift to prioritize on-device AI, local inference will become increasingly viable.
On the flip side, though, I could simply continue with this approach and run much larger models on my networked inference server. A reasoning model makes a lot of sense for this use case, and I don’t see those being feasible on an iPhone any time soon.
Regardless, I’m looking forward to what the future brings — a personal finance agent that completely eliminates the manual part of budgeting is my endgame, and hopefully we start to see green shoots for those experiences later this year.
Thanks for reading! Drop a comment or interact with the article if you enjoyed it.