-
Notifications
You must be signed in to change notification settings - Fork 77
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add a basic implementation of a recording based tool and client #134
base: main
Are you sure you want to change the base?
Conversation
@kshmir is attempting to deploy a commit to the Antiwork Team on Vercel. A member of the Team first needs to authorize it. |
This aims to implement #124 through |
To be 100% fair, thinking it, "recording" is a better name than snapshot in this case. https://github.com/vcr/vcr is the main inspiration for me using these types of tests, I got used to snapshots as a "good enough" solution, but in this case recording is the only way. |
A key thing will be to recognize where things are being clicked instead of using coordinates, by using XPath, otherwise this approach would be too flaky, I'll make it in a subsequent PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your PR @kshmir. A few comments for you to look at :)
I ran the tests with your new feature and yes the test execution is faster but seems flaky. Seems like the only actions that are properly being executed and recorded are navigation and screenshot actions. If you know a better way to perform browser actions such as xpath instead of coordinates please address them in the same PR otherwise the package will become broken.
Also I already left you a review about this but I mention it again. Seems like each test step executed by AI is being executed again by the replay mechanism. At least that's what I understood from the debugging logs. Maybe you can help me understand this:
Overall your approach seems very interesting and definitely is faster. Please address these issues and let me know if you have any questions or suggestions.
import { sleep } from '@anthropic-ai/sdk/core'; | ||
import pc from 'picocolors'; | ||
|
||
const JUDGMENT_PROMPT = `You are a test result validator. Your task is to evaluate if a test passed or failed based on the final state. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the difference between JUDGMENT_PROMPT
and JUGDEMENT_SYSTEM_PROMPT
. Preferably, move this to a new file in prompts
dir and export from index.ts
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cursor just made it twice, fixing soon.
packages/shortest/src/ai/client.ts
Outdated
|
||
constructor(config: AIConfig, debugMode: boolean = false) { | ||
constructor(config: AIConfig, debugMode: boolean = true) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to update the docs. We have a CLI arg called pnpm shortest --debug-ai
. for your debug.ts
solution please adjust it so that it will show your logs when that CLI arg is provided. Please show your logs along with the already implemented logs that are in place. You can refer to cli/bin.ts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, will include support for cli arguments to make debugging easier, also for enabling snapshots.
} | ||
} | ||
|
||
async execute(input: ActionInput): Promise<ToolResult> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like this part is executing the test steps twice. Once when AI performs it and once as a replay. Perhaps this should be avoided. Maybe the first time we run pnpm shortest
we should let AI execute its actions and then run the replay in the consecutive time. Would be cool to have a smart mechanism to delete the snapshots once they become obsolete (No rush for now I will add this as a new issue later - unless you want to tackle this in the same PR).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try to make something usable inside a flag to keep moving forward.
Also please bump the version in Thank you. |
The latest updates on your projects. Learn more about Vercel for Git ↗︎
|
I do think snapshots work as a term over recording. Seems more accurate/intuitive to "screenshotting" which is happening, versus recording which implies videos imo. |
Makes sense, will keep it that way then. |
@m2rads this is what is executed, at the same time, I'll remove the screenshot steps, since they're irrelevant, and convert the click action to a XPath execution. |
Why is xpath better than clicking a coordinate? Feels like the latter is more like what a human would do. I can see it breaking more often I guess if the UI changes, but then it would just rerun the test and look for the button like a human would. Seems better to me than falling back on something deterministic like xpath (why not write that explicitly in the test then?) but perhaps I'm missing something. |
I agree with you and I am not convinced that xpath is better option either. Besides in some cases xpath can be flaky. The computer use API is trained on x,y coordinates. @kshmir Can you explain how you implemented xpath and used it with AI? |
@m2rads I'm still testing out the xpath solution... it's obviously a tradeoff... This is what I see... Some XPaths can be more reliable than others, the most basic is the dom tree structure which is similar to X,Y coordinates, except that inside a DOM. Based on what you say, XPaths could be a fallback, I'm now recording a structure like this |
I guess for now I'll leave the XPath part in another commits based on what you guys mentioned and release a final version tomorrow that minimizes screenshots. I think we have different goals for what the tool can be doing so I don't want to overdoit. |
I think the most important pending thing will be a flag to disable/enable all of this. I let the XPath stuff for other PR. |
Some stuff is still quite irrelevant as I was trying out the repo...
I need to clean it up, would be useful to ignore the screenshot steps, and maybe add a wait for load after clicking or navigating instead of a 1s sleep.
Also a --snapshot parameter should be added and a cleanup parameter as well.
Speed is clearly better, and the AI still makes the final judgement call with a different prompt.