Experimenting with Anthropic's Computer Use for QA
Adam Smith
December 6, 2024 • 4 min read
Anthropic recently introduced their “Computer Use” tool, which lets developers direct their AI models to perform tasks on a computer. As they describe it in their blog post:
Developers can direct Claude to use computers the way people do—by looking at a screen, moving a cursor, clicking buttons, and typing text.
Sounds cool, interesting, and a little scary. So we decided to kick the tires and see if this tool could help automate QA in any of our projects.
There are three basic types of QA we do:
- Exploratory, aka poke around and see if anything looks broken
- Verifying in-sprint development work
- Building up a suite of regression tests
TL;DR
The technology is very cool, but it's not ready for production-level QA tasks. It's slow and often hits our rate limit of 40,000 tokens per minute very early in the testing flow.
How we tested
To test this we ran a series of experiments using Anthropic's reference implementation in a Docker environment. The experiments ran against our pourwall.com marketing site and web app, as well as a few other projects.
The reference implementation was easy to get running. After starting the tool in Docker, we just needed to get an Anthropic API key and add it to the admin panel.
Each experiment was an exploratory test, an in-sprint validation, or a regression test. We gave the tool a prompt similar to what we’d ask a QA engineer to do and just let it rip; the sketch below shows roughly what that loop looks like under the hood. A few example experiments follow.
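Mechanically, “letting it rip” means the reference implementation runs an agent loop: send the prompt, execute whatever tool actions Claude requests (clicks, keystrokes, screenshots), feed the results back, and repeat until Claude stops asking for tools. A minimal sketch of that loop against the beta Messages API; `execute_tool` is a hypothetical stand-in for the reference implementation's actual tool executor, and the display dimensions are placeholders, not our exact configuration:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

TOOLS = [{
    # Virtual display Claude "sees" via screenshots and drives via mouse/keyboard
    "type": "computer_20241022",
    "name": "computer",
    "display_width_px": 1024,
    "display_height_px": 768,
}]

def run_experiment(prompt: str):
    """Drive the agent loop until Claude stops requesting tool actions."""
    messages = [{"role": "user", "content": prompt}]
    while True:
        response = client.beta.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            tools=TOOLS,
            messages=messages,
            betas=["computer-use-2024-10-22"],
        )
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason != "tool_use":
            return response  # Claude is done; its final text is the report
        # execute_tool is hypothetical: it performs the click/type/screenshot
        # and returns the result content (e.g. a screenshot image block).
        tool_results = [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": execute_tool(block.name, block.input),
            }
            for block in response.content
            if block.type == "tool_use"
        ]
        messages.append({"role": "user", "content": tool_results})
```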
Experiment 1: Exploratory website testing
Prompt: Navigate to pourwall.com, ensure that the page loads correctly and all the elements look OK. Record the results in a text file named pourwall_observations.txt. Then navigate to the FAQ and Plan pages and repeat for each.
Results: View here
It succeeded in navigating to each page and validating that it was working, but ended up making lots of unhelpful recommendations and observations (more specific prompting would likely help).
Notes: Took roughly 2 minutes (video edited for brevity) and cost $0.15 (~150,000 input tokens and 2,000 output tokens)
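An aside on how the token counts can be measured: each API response in the loop reports its own usage, so a run's totals can be summed and multiplied by your plan's per-token rates. A minimal sketch, assuming a list of Message objects returned by the Python SDK:

```python
def tally_usage(responses):
    """Sum token usage across all API responses from one experiment run.

    `responses` is a list of Message objects returned by
    client.beta.messages.create; each carries a `usage` field.
    """
    input_tokens = sum(r.usage.input_tokens for r in responses)
    output_tokens = sum(r.usage.output_tokens for r in responses)
    # Multiply by your plan's per-token rates to estimate dollar cost.
    return input_tokens, output_tokens
```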
Experiment 2: Exploratory web app testing (sign-up)
Prompt: Navigate to https://pourwall.com/ create a user account noting down the username and password in a text file. Login to the site, and perform exploratory analysis of the app. Make observations on bugs and potential fixes.
Results: View here
The experiment failed due to token rate limiting: it used 40,000 tokens in a minute and didn’t even finish signing up. Even before hitting the rate limit, the tool struggled with the sign-up flow.
Experiment 3: Exploratory web app testing (login and dashboard)
Prompt: login to https://pourwall.com with credentials: username: redacted and password: redacted
Create a file called pourwall_exploratory_observations.txt
Perform exploratory testing on the app noting any bugs or odd behavior in the file.
Results: Failed due to token rate limiting
Findings:
- The tool does not appear to be ready for practical QA work; we won’t know more until we can get a higher token rate limit. For now we’ll stick with our current mix of human QA (for exploratory testing and verifying in-sprint work) and scripted automated tests (for the regression suite).
- Many experiments failed due to token rate limiting early in the testing steps. I requested a rate limit increase and will try again if they grant it (a simple retry guard, sketched after this list, would at least keep longer runs alive).
- Best potential use case: exploratory testing. The tool's slowness, cost, and unpredictability don’t bode well for validating in-sprint work or running a suite of regression tests.
- The tool is cool, but still slow and goofy. Watch it struggle to sign up for Pourwall here.
- High costs: even the relatively simple task in Experiment 1 cost $0.15 for just one page of observations (not expensive compared to human QA, but it will add up and be costly compared to an automated test suite).
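Most of those rate-limit failures were hard stops: the run simply died when the per-minute token window was exhausted. A minimal backoff guard, assuming the official Python SDK's `RateLimitError`; the wait times are our guess, not a recommendation:

```python
import time

import anthropic

client = anthropic.Anthropic()

def create_with_backoff(max_retries: int = 5, **request_kwargs):
    """Retry the computer-use call when the per-minute token limit trips."""
    for attempt in range(max_retries):
        try:
            return client.beta.messages.create(**request_kwargs)
        except anthropic.RateLimitError:
            # Wait out (at least) the rest of the one-minute token window.
            time.sleep(60 * (attempt + 1))
    raise RuntimeError(f"Still rate limited after {max_retries} retries")
```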
Conclusion
We’ll definitely redo these experiments if our rate limit is increased, and we’ll update this post when that happens. This technology is going to improve, and we’re excited to come back and try again when it does.