Let's imagine this scenario: you are building an LLM-powered application meant to automate all sorts of web tasks for you. You search on Google and are presented with two potential solutions:
1. Using vision models to access the website.
2. Reading the HTML/DOM elements to access the web application.
Let's take a deeper look at why both of these approaches can be challenging to implement:
Using Vision Models: The basic concept is that a vision LLM is fed periodic screenshots of the UI it is trying to drive. The model then predicts the pixel coordinates of the area to click, or you overlay a grid on the screenshot aligned with the page's elements to get a close estimate of where the model wants to click or navigate.
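To make that concrete, here is a minimal sketch of the screenshot-then-click loop, assuming GPT-4o through the OpenAI SDK and Playwright for browser control; the prompt wording, the JSON coordinate format, and the `click_via_vision` helper are illustrative assumptions, not a production recipe.

```python
# Minimal sketch of the vision-model loop (assumed model, prompt, and parsing).
import base64, json
from openai import OpenAI
from playwright.sync_api import sync_playwright

client = OpenAI()

def click_via_vision(page, instruction: str) -> None:
    # 1. Capture the current UI state as a screenshot.
    png = page.screenshot()
    data_url = "data:image/png;base64," + base64.b64encode(png).decode()

    # 2. Ask the vision LLM for pixel coordinates of the target element.
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f'{instruction}\nReply only with JSON: {{"x": <int>, "y": <int>}}'},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
    )

    # 3. Click wherever the model says; this is exactly where accuracy breaks down.
    coords = json.loads(resp.choices[0].message.content)
    page.mouse.click(coords["x"], coords["y"])

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    click_via_vision(page, "Click the 'More information' link")
    browser.close()
```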
Using HTML Code to Access the Website: This approach involves feeding the page's HTML to the LLM and asking it to generate XPath expressions for locating and interacting with the site's elements.
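A minimal sketch of that flow, under the same assumptions as the previous snippet (OpenAI SDK plus Playwright; the prompt and the reply-with-only-an-XPath convention are illustrative), designed to drop into the same browser setup:

```python
# Minimal sketch of the HTML-to-XPath approach (assumed model and prompt).
from openai import OpenAI

client = OpenAI()

def click_via_xpath(page, instruction: str) -> None:
    # 1. Dump the rendered DOM; on large apps this alone can blow past the context window.
    html = page.content()

    # 2. Ask the LLM to translate the instruction into an XPath expression.
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed model
        messages=[{
            "role": "user",
            "content": (
                f"Given this HTML:\n{html}\n\n"
                f"{instruction}\nReply with only the XPath of the element to click."
            ),
        }],
    )
    xpath = resp.choices[0].message.content.strip()

    # 3. Let the browser resolve the XPath and perform the click.
    page.click(f"xpath={xpath}")
```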
But what's the problem?

The problem with the first approach is pretty evident: it is going to drill a black hole in your wallet, and the accuracy is poor (the LLM often can't figure out where to click).

The problem with the second approach is less obvious:
You test it on a small web app, and it works amazingly well. Then you test it on a large web app with extremely long HTML, and you run into context-size limits. On top of that, CSR (client-side rendered) libraries cause plenty of trouble: they generate HTML on the client in response to user actions, so the markup you captured goes stale and your false-positive rate climbs even higher.
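One mitigation people often reach for is pruning the DOM before handing it to the model, stripping scripts, styles, and most attributes. Here is a minimal sketch with BeautifulSoup (my own assumption, not something prescribed above); on large, client-side rendered apps even this frequently isn't enough to fit the context window:

```python
# Minimal sketch of pre-pruning the DOM before sending it to the LLM.
from bs4 import BeautifulSoup

def prune_html(raw_html: str, keep_attrs=("id", "name", "href", "type", "aria-label")) -> str:
    soup = BeautifulSoup(raw_html, "html.parser")

    # Drop tags that carry no information the LLM needs for locating elements.
    for tag in soup(["script", "style", "svg", "noscript", "meta", "link"]):
        tag.decompose()

    # Strip every attribute except a small allow-list useful for building XPaths.
    for tag in soup.find_all(True):
        tag.attrs = {k: v for k, v in tag.attrs.items() if k in keep_attrs}

    return str(soup)
```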
So, what is the solution? vimGPT employed a best-of-both-worlds type of solution that works, but it is still not the most accurate, for the reasons mentioned above. Personally, I believe the way forward for this tech is vision LLMs. Give it one more year, and they will far exceed our expectations.
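For reference, here is a rough, hypothetical sketch of the "best of both worlds" idea (not vimGPT's actual code): enumerate clickable elements from the DOM, show the vision model a screenshot plus a short numbered menu, and have it answer with an index rather than raw pixel coordinates. The model name, prompt, and selector list are assumptions; the function drops into the same Playwright setup as the first sketch.

```python
# Hypothetical hybrid approach: DOM candidates + a screenshot, answered by index.
import base64
from openai import OpenAI

client = OpenAI()

def click_via_hybrid(page, instruction: str) -> None:
    # 1. Collect candidate elements from the DOM (tiny compared to the full HTML).
    candidates = page.locator("a, button, input, [role='button']").all()
    menu = "\n".join(
        f"{i}: {el.inner_text()[:60] or el.get_attribute('aria-label') or '<unnamed>'}"
        for i, el in enumerate(candidates)
    )

    # 2. Give the vision model both the screenshot and the numbered menu.
    data_url = "data:image/png;base64," + base64.b64encode(page.screenshot()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"{instruction}\nClickable elements:\n{menu}\nReply with only the number."},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
    )

    # 3. Click the chosen element via its DOM handle, not pixel coordinates.
    candidates[int(resp.choices[0].message.content.strip())].click()
```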