Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add a basic implementation of a recording based tool and client #134

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

kshmir
Copy link

@kshmir kshmir commented Dec 24, 2024

Some stuff is still quite irrelevant as I was trying out the repo...

  1. Made a SnapshotBrowserTool that records the steps taken by the AI.
  2. Made a SnapshotAIClient that makes a final decision after all the recorded steps are rerun from the snapshot file (JSONL)

I need to clean it up, would be useful to ignore the screenshot steps, and maybe add a wait for load after clicking or navigating instead of a 1s sleep.

Also a --snapshot parameter should be added and a cleanup parameter as well.

Speed is clearly better, and the AI still makes the final judgement call with a different prompt.

Copy link

vercel bot commented Dec 24, 2024

@kshmir is attempting to deploy a commit to the Antiwork Team on Vercel.

A member of the Team first needs to authorize it.

@CLAassistant
Copy link

CLAassistant commented Dec 24, 2024

CLA assistant check
All committers have signed the CLA.

@kshmir
Copy link
Author

kshmir commented Dec 24, 2024

This aims to implement #124 through snapshotting recording!

@kshmir
Copy link
Author

kshmir commented Dec 24, 2024

To be 100% fair, thinking it, "recording" is a better name than snapshot in this case.

https://github.com/vcr/vcr is the main inspiration for me using these types of tests, I got used to snapshots as a "good enough" solution, but in this case recording is the only way.

@kshmir kshmir changed the title add a basic implementation of a snapshot based tool add a basic implementation of a recording based tool Dec 24, 2024
@kshmir kshmir changed the title add a basic implementation of a recording based tool add a basic implementation of a recording based tool and client Dec 24, 2024
@kshmir
Copy link
Author

kshmir commented Dec 24, 2024

A key thing will be to recognize where things are being clicked instead of using coordinates, by using XPath, otherwise this approach would be too flaky, I'll make it in a subsequent PR.

@bojl bojl mentioned this pull request Dec 24, 2024
@m2rads m2rads self-requested a review December 24, 2024 05:29
Copy link
Contributor

@m2rads m2rads left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your PR @kshmir. A few comments for you to look at :)

I ran the tests with your new feature and yes the test execution is faster but seems flaky. Seems like the only actions that are properly being executed and recorded are navigation and screenshot actions. If you know a better way to perform browser actions such as xpath instead of coordinates please address them in the same PR otherwise the package will become broken.

Also I already left you a review about this but I mention it again. Seems like each test step executed by AI is being executed again by the replay mechanism. At least that's what I understood from the debugging logs. Maybe you can help me understand this:

image

Overall your approach seems very interesting and definitely is faster. Please address these issues and let me know if you have any questions or suggestions.

import { sleep } from '@anthropic-ai/sdk/core';
import pc from 'picocolors';

const JUDGMENT_PROMPT = `You are a test result validator. Your task is to evaluate if a test passed or failed based on the final state.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the difference between JUDGMENT_PROMPT and JUGDEMENT_SYSTEM_PROMPT. Preferably, move this to a new file in prompts dir and export from index.ts.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor just made it twice, fixing soon.


constructor(config: AIConfig, debugMode: boolean = false) {
constructor(config: AIConfig, debugMode: boolean = true) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to update the docs. We have a CLI arg called pnpm shortest --debug-ai. for your debug.ts solution please adjust it so that it will show your logs when that CLI arg is provided. Please show your logs along with the already implemented logs that are in place. You can refer to cli/bin.ts

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, will include support for cli arguments to make debugging easier, also for enabling snapshots.

}
}

async execute(input: ActionInput): Promise<ToolResult> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this part is executing the test steps twice. Once when AI performs it and once as a replay. Perhaps this should be avoided. Maybe the first time we run pnpm shortest we should let AI execute its actions and then run the replay in the consecutive time. Would be cool to have a smart mechanism to delete the snapshots once they become obsolete (No rush for now I will add this as a new issue later - unless you want to tackle this in the same PR).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try to make something usable inside a flag to keep moving forward.

@m2rads
Copy link
Contributor

m2rads commented Dec 24, 2024

Also please bump the version in packages/shortest/package.json to "version": "0.1.1" and update the packages/shortest/CHANGELOG.md

Thank you.

Copy link

vercel bot commented Dec 24, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
shortest ✅ Ready (Inspect) Visit Preview 💬 Add feedback Dec 26, 2024 1:46am

@slavingia
Copy link
Contributor

I do think snapshots work as a term over recording. Seems more accurate/intuitive to "screenshotting" which is happening, versus recording which implies videos imo.

@kshmir
Copy link
Author

kshmir commented Dec 25, 2024

I do think snapshots work as a term over recording. Seems more accurate/intuitive to "screenshotting" which is happening, versus recording which implies videos imo.

Makes sense, will keep it that way then.

@kshmir
Copy link
Author

kshmir commented Dec 25, 2024

Thanks for your PR @kshmir. A few comments for you to look at :)

I ran the tests with your new feature and yes the test execution is faster but seems flaky. Seems like the only actions that are properly being executed and recorded are navigation and screenshot actions. If you know a better way to perform browser actions such as xpath instead of coordinates please address them in the same PR otherwise the package will become broken.

Also I already left you a review about this but I mention it again. Seems like each test step executed by AI is being executed again by the replay mechanism. At least that's what I understood from the debugging logs. Maybe you can help me understand this:

image

Overall your approach seems very interesting and definitely is faster. Please address these issues and let me know if you have any questions or suggestions.

image

@m2rads this is what is executed, at the same time, I'll remove the screenshot steps, since they're irrelevant, and convert the click action to a XPath execution.

@slavingia
Copy link
Contributor

Why is xpath better than clicking a coordinate? Feels like the latter is more like what a human would do. I can see it breaking more often I guess if the UI changes, but then it would just rerun the test and look for the button like a human would. Seems better to me than falling back on something deterministic like xpath (why not write that explicitly in the test then?) but perhaps I'm missing something.

@m2rads
Copy link
Contributor

m2rads commented Dec 26, 2024

Why is xpath better than clicking a coordinate? Feels like the latter is more like what a human would do. I can see it breaking more often I guess if the UI changes, but then it would just rerun the test and look for the button like a human would. Seems better to me than falling back on something deterministic like xpath (why not write that explicitly in the test then?) but perhaps I'm missing something.

I agree with you and I am not convinced that xpath is better option either. Besides in some cases xpath can be flaky. The computer use API is trained on x,y coordinates. @kshmir Can you explain how you implemented xpath and used it with AI?

@kshmir
Copy link
Author

kshmir commented Dec 26, 2024

@m2rads I'm still testing out the xpath solution... it's obviously a tradeoff...

This is what I see...

image

Some XPaths can be more reliable than others, the most basic is the dom tree structure which is similar to X,Y coordinates, except that inside a DOM.
data-testid is the "industry standard"
I also asked sonnet to give me a hint in the prompt that I can use to make a text search.

Based on what you say, XPaths could be a fallback, I'm now recording a structure like this

image

@kshmir
Copy link
Author

kshmir commented Dec 26, 2024

I guess for now I'll leave the XPath part in another commits based on what you guys mentioned and release a final version tomorrow that minimizes screenshots. I think we have different goals for what the tool can be doing so I don't want to overdoit.

@kshmir
Copy link
Author

kshmir commented Dec 26, 2024

I think the most important pending thing will be a flag to disable/enable all of this. I let the XPath stuff for other PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants