Day 4: Agentic Task Executor
Building an agentic task executor with browser automation—from local model orchestration to headless Playwright recordings reviewed via WhatsApp.
Goals
The goals of a task executor are as follows.
- Allow human operators to instruct an agent to resolve customer tickets
- Allow agents to follow a set of curated skills to resolve issues on their own
Orchestration
Had I started this project three months ago, n8n would have been a solid choice. But with advances in agent skills, we now have a more flexible and effective way to solve the orchestration problem. Since agents shadow human operators in the same environment, it is easy to convert a workflow into an agent skill and close the feedback loop for continuous learning.
There are several community-driven attempts to build similar orchestrators by reverse engineering Claude Code and Codex. The main idea is to expose these CLI tools via WebSocket so that a web UI can be customised on top of the core agentic driver. I tried some of them on a local model provider running gpt-oss-20b and found OpenCode to offer the best compatibility for custom model providers.
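OpenCode reads custom providers from its `opencode.json` config via the OpenAI-compatible adapter. A sketch for pointing it at a local server hosting gpt-oss-20b might look like the following; the provider id, base URL, and port are placeholders for whatever your local model server actually exposes:

```json
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "local": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Local provider",
      "options": {
        "baseURL": "http://127.0.0.1:8080/v1"
      },
      "models": {
        "gpt-oss-20b": {
          "name": "gpt-oss-20b"
        }
      }
    }
  }
}
```

Any server that speaks the OpenAI chat completions API (llama.cpp, Ollama, vLLM, and similar) can sit behind that `baseURL`.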
Running models locally is good for many reasons: privacy, speed, and control. The performance of smaller models has improved drastically in recent years with more effective context engineering and long-term memory. Researchers have also found success with knowledge distillation, where learnings from larger pretrained models boost the reasoning capabilities of their smaller cousins. In fact, this field moves so fast that in a few years’ time, local models running on 16GB of GPU or unified memory at 200 tokens/s may well perform on par with Opus 4.6 today.
Browser Automation
The first use case I chose to build was browser automation. Most office jobs today are accomplished using several SaaS products, and new employees joining the company often need to learn the Standard Operating Procedures for using these online platforms. What if we could teach these SOPs to an agent so it can respond with a screencast whenever new employees have questions? Or better still, the agent could execute the clicks and keyboard presses itself, following the SOP directly. Browser automation is therefore the most general solution for a helpdesk agent.
The Chrome DevTools MCP gives AI agents most of the capabilities they need for browser automation. Running it from Claude requires a local Chrome browser and a recent Node / npm installation. Since the browser is installed locally, it can use the exact same display drivers for rendering the web as a regular office user. If you log in to those SaaS platforms on your local browser, the agent can also access the same content as you. The only downside is that it takes up a significant portion of your model’s context window, consuming ~17k extra tokens for each prompt.
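The server ships as the `chrome-devtools-mcp` npm package and can be registered in your MCP client config, run through npx so no global install is needed:

```json
{
  "mcpServers": {
    "chrome-devtools": {
      "command": "npx",
      "args": ["chrome-devtools-mcp@latest"]
    }
  }
}
```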
Hosted Fallback
While this local setup works great for personal use, accessing it remotely can be a challenge if you don’t have a home VPN or remote desktop. For those who prefer a hosted solution, I have also deployed a cloud version as a fallback. I installed Playwright and Chromium in a Docker image to run browsers in headless mode. Playwright ships with a CLI that agent skills can call directly without bloating the model’s context window. An open model like GLM-5 is perfectly fine for driving the agentic loop.
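A minimal image for this could be sketched as below; the base image and Playwright version are example choices, and `--with-deps` pulls in the system libraries Chromium needs in a slim container:

```dockerfile
FROM node:20-slim
WORKDIR /app
# Pin the Playwright npm package and browser build to the same release;
# install only Chromium to keep the image smaller (no Firefox/WebKit).
RUN npm install playwright@1.50.0 \
    && npx playwright install --with-deps chromium
```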
Another nice thing about Playwright is that it comes with a native recording function. Once the agent completes a browser automation task, the video is saved and uploaded to the human operator for review via WhatsApp. The initial implementation of this feature was buggy because WhatsApp did not support the WebM video format. I pasted the API error message back to Claude, and it adjusted its own skill to run FFmpeg to convert the screencast to mp4 before uploading.
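The conversion step the agent landed on can be sketched as a small wrapper. The filenames and helper names here are illustrative, but H.264 video with the `yuv420p` pixel format plus AAC audio is the mp4 combination that plays back most widely:

```python
import subprocess

def webm_to_mp4_cmd(src: str, dst: str) -> list[str]:
    """Build an FFmpeg command that converts a Playwright WebM
    screencast into a broadly compatible H.264/AAC mp4."""
    return [
        "ffmpeg", "-y", "-i", src,
        "-c:v", "libx264",          # H.264 video
        "-pix_fmt", "yuv420p",      # widest player compatibility
        "-c:a", "aac",              # AAC audio track
        "-movflags", "+faststart",  # moov atom up front for streaming playback
        dst,
    ]

def convert(src: str, dst: str) -> None:
    """Run the conversion, raising if FFmpeg exits non-zero."""
    subprocess.run(webm_to_mp4_cmd(src, dst), check=True)
```

Keeping the command construction separate from the `subprocess.run` call also makes the skill easy to unit-test without FFmpeg installed.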
Dockerfile Authoring
On the other hand, using agents to author Dockerfiles was surprisingly painful. While the generated instructions were mostly correct, the resulting image layers were not optimised for caching or size. For any complex image, I’d consider adding a validation loop so the agent minimises the image size as part of the development process.
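Such a validation loop could be as simple as querying the built image’s size and failing the agent’s turn when it exceeds a budget. The tag and the 500MB budget below are illustrative assumptions; `docker image inspect --format '{{.Size}}'` reports the size in bytes:

```python
import subprocess

def image_size_bytes(tag: str) -> int:
    """Ask the Docker daemon for an image's size in bytes."""
    out = subprocess.run(
        ["docker", "image", "inspect", tag, "--format", "{{.Size}}"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return int(out)

def within_budget(size_bytes: int, budget_mb: int = 500) -> bool:
    """Pure check the agent can loop on until the image fits the budget."""
    return size_bytes <= budget_mb * 1_000_000
```

The agent would rebuild, re-check `within_budget`, and keep editing the Dockerfile until the check passes.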
Running and recording a headless browser can be quite resource intensive depending on the website being loaded. I had to scale my deployment up to 4 vCPUs and 2GB of memory for snappier performance on most sites.