
Rabbit R1, Large Action Models and Testing

Intro
I had something of a revelation the other day. Like many of us, I have used Large Language Models in my work and out of curiosity. Claude is my LLM of choice, as I enjoy the idea of Constitutional AI. In keeping with a ridiculously fast-moving world, I saw in the news a device called the Rabbit R1, which uses a Large Action Model. The claim here is that LAMs are able to predict human intent and replicate the interactions that fulfil it. As soon as I spied this claim, I began to clench in anticipation of the claims testing tool vendors would make. Once I’d calmed down, I set about thinking how it might actually be useful.
Tool vendors dream
The theory is that you have a device that you teach how to complete tasks on the internet (with many devices learning from many people, all the time), so that it can then repeat those tasks for you. This sounds very familiar to me. It could either be the gateway to yet more claims that testing is dead, or a way to free development teams from toil so they can focus on value. I think I know which will happen first, flying in the face of testing’s unblemished record of being alive while things are still being built. I look forward to wheeling out James Whittaker and Alberto Savoia’s talks from GTAC 2011 for the umpteenth time.
Interesting things for testing
LAMs will need testing, especially security testing
First and foremost, we shouldn’t talk about LAMs like they don’t need testing. Lots of people talk about LLMs replacing activities like testing when the models themselves are woefully undertested. Security and privacy are the obvious areas. An LLM can suggest the wrong path; a LAM could walk you down that path, open the gate, lead you through the town and out into the middle of the motorway. Bad actors, or even just plain old bugs, could have dreadful consequences.
LAMs are also being described as LLM agents: basically an LLM with abilities such as breaking down large tasks into subtasks, plus access to a bunch of tools such as search engines. And here’s me thinking we were a bit scared of LLMs with internet access. Given the combinations of behaviours and tools that a LAM/LLM agent might have, testing the intended actions of the model before handing it the tools (tested separately) would be a sound strategy. Then see how they get on together in a controlled environment.
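To make that concrete, here is a minimal sketch of what gating a model’s intended actions might look like, assuming the plan can be captured before anything executes. It is not a real framework; the ProposedAction type, tool names and allowlist are all hypothetical.

```python
# A minimal sketch, not a real agent framework: capture the actions a
# LAM/LLM agent *intends* to take and test them against a policy before
# any real tool is handed over. All names here are hypothetical.
from dataclasses import dataclass

@dataclass
class ProposedAction:
    tool: str        # e.g. "search", "http_get", "payment"
    argument: str    # the payload the agent wants to send

# Tools we have tested separately and are happy to grant; read-only first.
ALLOWED_TOOLS = {"search", "http_get"}

def plan_is_safe(plan: list[ProposedAction]) -> bool:
    """Reject any plan containing a tool outside the allowlist."""
    return all(action.tool in ALLOWED_TOOLS for action in plan)

def test_agent_never_reaches_for_payment_tool():
    # In reality the plan would come from the model under test;
    # here we hard-code the kind of plan the gate should catch.
    plan = [
        ProposedAction("search", "rooms in Lisbon"),
        ProposedAction("payment", "charge card on file"),  # must fail the gate
    ]
    assert not plan_is_safe(plan)
```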
Exploratory testing
You know what LLMs are better at: structured work, like generating code snippets, scripts and test data. LLMs are not so good at following their noses, doing weird things on purpose (though sometimes by accident) and having an off-the-cuff adventure through a feature. Interestingly, LAMs might be trained on human activity, and on other LAMs that were themselves trained on human activity, which were trained on yet other LAMs; you get the idea. I suspect a LAM would end up doing some wildly boring exploratory testing as the models gravitate towards the mean.
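As a sketch of that structured work, here is the Anthropic Python SDK being asked for test data. The model name, prompt and data fields are illustrative assumptions, and the output would still need a human eye before use.

```python
# A sketch of the "structured work" LLMs are good at: generating test data.
# Requires the anthropic package and an API key; the model name and the
# schema in the prompt are illustrative, not a recommendation.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": (
            "Generate 5 rows of JSON test data for a hotel booking form: "
            "guest_name (include at least one name with an apostrophe), "
            "check_in/check_out (ISO dates, one deliberately invalid), "
            "guests (integer, with one boundary value like 0)."
        ),
    }],
)
print(response.content[0].text)  # review before use: LLM output needs checking too
```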
If you watch the Rabbit R1 “What is a LAM” video, it mimics a simple journey to book a room on Airbnb, a journey you would probably already cover with an automated test, plus synthetics and analytics, for days. Not much novel happening there. Plus it is dry: the human intent is to book a room they actually want to stay in, at the right price. The pictures often seal it too, for me anyway.
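For comparison, that demo journey is the sort of thing a scripted browser check has covered for years. A Playwright sketch, with a made-up URL and selectors:

```python
# The sort of scripted check that already covers the R1 demo journey.
# The site URL and selectors below are invented for illustration.
from playwright.sync_api import sync_playwright, expect

def test_booking_journey():
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://example-stays.test/search")
        page.fill("#destination", "Lisbon")
        page.fill("#check-in", "2024-06-01")
        page.fill("#check-out", "2024-06-05")
        page.click("button:has-text('Search')")
        page.locator(".listing").first.click()   # pick the first result
        page.click("button:has-text('Reserve')")
        expect(page.locator(".confirmation")).to_be_visible()
        browser.close()
```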
A LAM which followed and learned from your exploratory testing may be an interesting thing, although I’m not sure it would stay fresh for long; it would need to constantly receive new input from a human (or other models), plus the surrounding context (bugs, other feature development). Sounds like a human would still be doing the exploratory testing then, at least at first. Better as an automation aid for now.
User interfaces
You know where we test all the time, which is really time-consuming, generally broken and suffers from extreme device, operating system and browser fragmentation? That’s right, the UI. You know where Rabbit R1 wants to train its LAM with your behaviour? That’s right, the UI! It seems a little silly to me to drive the browser with this technology. I mean, it’s taken a long time for browser automation to become anything like half decent in terms of reliability.
An API call is probably a bit more sensible and efficient, and less prone to the foibles of a web browser or mobile app, although API calls to services are limited, sensitive to bots and need an integration between the LLM and the API. But then how will the model be trained, unless a human drives the interface first? This makes me suspect that any convenience the human gets from such a model is a side effect of the real value, which is the training data. Remember, if it thinks and speaks but you can’t see where its brain is, you are the product.
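To illustrate the difference, the same booking intent could be a single call against a hypothetical API; the endpoint, payload and auth header below are invented for illustration.

```python
# The same intent expressed as one API call instead of a fragile UI walk.
# Entirely hypothetical endpoint and payload; a real service would also
# demand proper auth, rate limits and bot detection, as noted above.
import requests

response = requests.post(
    "https://api.example-stays.test/v1/bookings",
    json={
        "listing_id": "room-42",
        "check_in": "2024-06-01",
        "check_out": "2024-06-05",
        "guests": 2,
    },
    headers={"Authorization": "Bearer <token>"},  # placeholder credential
    timeout=10,
)
response.raise_for_status()
print(response.json()["booking_id"])
```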
Synthetic testing
Synthetic testing has been around for a while, pushed by the big monitoring and alerting tooling companies. It’s where we might extend some of our test automation suite to cover key scenarios in production, perhaps scaling the number of users and the cadence of test runs. Useful for gathering performance and journey completion metrics. Synthetic testing currently has its shortcomings: synthetic users repeat the same scenarios, whereas a LAM would gather information from your actual users. A LAM might have power in finding out what the actual key scenarios are versus what you think they are. A positive use of a LAM might be to help organisations with their crippling assumptions about what users are actually doing. We’ve all seen the glass of water meme.
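For the unfamiliar, a synthetic check can be as bare-bones as the sketch below: replay a key journey on a cadence and record completion and latency. The journey steps and the shape of the metrics are stand-ins, not any particular vendor’s API.

```python
# A bare-bones synthetic check of the kind the monitoring vendors sell:
# replay a key journey on a cadence and record completion plus latency.
# The journey steps and metric fields are stand-ins for illustration.
import time
import requests

def run_booking_journey_probe() -> dict:
    start = time.monotonic()
    try:
        # Each step mirrors what a real user (or a LAM) would do.
        requests.get("https://example-stays.test/search?city=Lisbon", timeout=10).raise_for_status()
        requests.get("https://example-stays.test/listing/room-42", timeout=10).raise_for_status()
        completed = True
    except requests.RequestException:
        completed = False
    return {
        "journey": "search_and_view_listing",
        "completed": completed,
        "duration_seconds": round(time.monotonic() - start, 2),
    }

if __name__ == "__main__":
    print(run_booking_journey_probe())  # in practice, ship this to your metrics backend
```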
Using analytics in testing
I don’t know about you, but I’ve yet to test anything that doesn’t have a truckload of analytics frameworks inserted into it, only for a new product/marketing/data person/agency to come in and demand yet another framework be added. Analytics need testing too, by the way; we get a bit loosey-goosey with them, or only remember they exist in the moment before pressing the merge button. Analytics are also pretty unreliable in transmitting their events, and inevitably everyone gets tied up in knots naming them. We all know what comes before cache invalidation in the hardest things in software development.
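One way to be less loosey-goosey is to assert that the event actually leaves the page. A Playwright sketch; the collector endpoint and event name are hypothetical stand-ins for whatever framework was bolted on this week.

```python
# Assert that clicking Reserve actually fires an analytics event, by
# watching outgoing requests to a (hypothetical) collector endpoint.
from playwright.sync_api import sync_playwright

def test_reserve_click_fires_analytics_event():
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        events = []

        def capture(request):
            # Keep only requests headed for the analytics collector.
            if "analytics.example.test/collect" in request.url:
                events.append(request.url)

        page.on("request", capture)
        page.goto("https://example-stays.test/listing/room-42")
        page.click("button:has-text('Reserve')")
        page.wait_for_timeout(1000)  # crude wait for the beacon to fire
        assert any("event=reserve_clicked" in url for url in events)
        browser.close()
```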
However, when used as one of many oracles for testing, analytics can be very useful. Would a LAM replace analytics, then? They are somewhat at cross purposes: analytics gather information from activity rather than attempting to predict intent; the LAM is the driver, the analytics framework the collector. For my money, we can continue to use our analytics frameworks as a flawed but useful oracle even as LAMs grow in use. If Rabbit R1 has tried it, more will follow.
End
LAMs could be the next generation of super-powered bots, a security disaster waiting to happen or just a marketing hype machine. Or all of the aforementioned. For testing though, there will inevitably be claims of exploratory testing models in the near future. As with an LLM, a more thoughtful approach to a LAM would be to ask how we safely train the model within our context to fix the problems we actually have, rather than finding a list of properties on Airbnb, which is a solved problem. It’s finding the right one that matters…
Interesting links:
- Marketing – https://www.rabbit.tech/
- Shallow Dive – https://www.youtube.com/watch?v=dWEjDIsMFB0
- Deeper Dive – https://www.geeksforgeeks.org/rabbit-ai-large-action-models-lams/#what-is-action-model-learning
- Company Sponsored Dive – https://www.trinetix.com/insights/what-are-large-action-models-and-how-do-they-work
- Hugging Face LAM – https://huggingface.co/posts/dhuynh95/390309349796467
- Github Topics – https://github.com/topics/large-action-model