add a basic implementation of a recording based tool and client #134

kshmir · 2024-12-24T03:57:30Z

Some stuff is still quite irrelevant as I was trying out the repo...

Made a SnapshotBrowserTool that records the steps taken by the AI.
Made a SnapshotAIClient that makes the final decision after all the recorded steps are rerun from the snapshot file (JSONL).
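
For readers unfamiliar with the record/replay idea, here is a minimal sketch of storing steps as JSONL (one JSON object per line) and replaying them. The `RecordedStep` shape and the function names are assumptions for illustration; the actual format used by SnapshotBrowserTool is not shown in this thread.

```typescript
// Hypothetical shape of one recorded step (not the PR's actual format).
type RecordedStep =
  | { action: "navigate"; url: string }
  | { action: "click"; x: number; y: number }
  | { action: "type"; text: string }
  | { action: "screenshot" };

// JSONL: one JSON object per line, terminated by a newline.
function toJsonl(steps: RecordedStep[]): string {
  return steps.map((step) => JSON.stringify(step)).join("\n") + "\n";
}

// Parse JSONL back into steps, ignoring blank lines.
function fromJsonl(jsonl: string): RecordedStep[] {
  return jsonl
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as RecordedStep);
}

// Replay every step except screenshots, which do not change page state.
// Returns the number of steps actually executed.
function replay(steps: RecordedStep[], run: (step: RecordedStep) => void): number {
  let executed = 0;
  for (const step of steps) {
    if (step.action === "screenshot") continue;
    run(step);
    executed++;
  }
  return executed;
}
```

In this sketch the AI would only be consulted once, after `replay` finishes, to judge the final page state.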

I need to clean it up. It would be useful to ignore the screenshot steps, and maybe to wait for the page to load after clicking or navigating instead of using a fixed 1s sleep.
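
A rough sketch of replacing the fixed sleep with a bounded wait, assuming a Playwright-style page object. `PageLike` and `waitForSettled` are hypothetical names; only `waitForLoadState` is modeled on a real Playwright `Page` method.

```typescript
// Minimal structural type; Playwright's Page exposes a compatible method.
interface PageLike {
  waitForLoadState(state?: "load" | "domcontentloaded" | "networkidle"): Promise<void>;
}

// Hypothetical helper: wait for the load event, but never longer than maxMs.
async function waitForSettled(page: PageLike, maxMs = 5000): Promise<"loaded" | "timeout"> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<"timeout">((resolve) => {
    timer = setTimeout(() => resolve("timeout"), maxMs);
  });
  try {
    return await Promise.race([
      page.waitForLoadState("load").then(() => "loaded" as const),
      timeout,
    ]);
  } finally {
    // Always clear the timer so it cannot keep the process alive.
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

Calling this after each click or navigation returns as soon as the page reports it has loaded, so fast pages are not penalized the way they are with a constant sleep.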

Also, a --snapshot parameter should be added, as well as a cleanup parameter.

Speed is clearly better, and the AI still makes the final judgement call with a different prompt.

vercel · 2024-12-24T03:57:34Z

@kshmir is attempting to deploy a commit to the Antiwork Team on Vercel.

A member of the Team first needs to authorize it.

CLAassistant · 2024-12-24T03:57:36Z

All committers have signed the CLA.

kshmir · 2024-12-24T03:58:35Z

This aims to implement #124 through ~~snapshotting~~ recording!

kshmir · 2024-12-24T04:00:29Z

To be fair, thinking about it, "recording" is a better name than "snapshot" in this case.

https://github.com/vcr/vcr is my main inspiration for using these types of tests. I got used to snapshots as a "good enough" solution, but in this case recording is the only way.

kshmir · 2024-12-24T04:04:22Z

A key thing will be to identify what is being clicked by XPath instead of by coordinates; otherwise this approach would be too flaky. I'll do that in a subsequent PR.

m2rads

Thanks for your PR @kshmir. A few comments for you to look at :)

I ran the tests with your new feature and yes the test execution is faster but seems flaky. Seems like the only actions that are properly being executed and recorded are navigation and screenshot actions. If you know a better way to perform browser actions such as xpath instead of coordinates please address them in the same PR otherwise the package will become broken.

Also I already left you a review about this but I mention it again. Seems like each test step executed by AI is being executed again by the replay mechanism. At least that's what I understood from the debugging logs. Maybe you can help me understand this:

Overall your approach seems very interesting and definitely is faster. Please address these issues and let me know if you have any questions or suggestions.

m2rads · 2024-12-24T05:35:51Z

packages/shortest/src/ai/snapshot-client.ts

+import { sleep } from '@anthropic-ai/sdk/core';
+import pc from 'picocolors';
+
+const JUDGMENT_PROMPT = `You are a test result validator. Your task is to evaluate if a test passed or failed based on the final state.


What is the difference between JUDGMENT_PROMPT and JUGDEMENT_SYSTEM_PROMPT? Preferably, move this to a new file in the prompts directory and export it from index.ts.

kshmir · 2024-12-25

Cursor just generated it twice; fixing soon.

m2rads · 2024-12-24T05:39:32Z

packages/shortest/src/ai/client.ts


-  constructor(config: AIConfig, debugMode: boolean = false) {
+  constructor(config: AIConfig, debugMode: boolean = true) {


I need to update the docs. We have a CLI arg, pnpm shortest --debug-ai. For your debug.ts solution, please adjust it so that it shows your logs when that CLI arg is provided, along with the logs that are already in place. You can refer to cli/bin.ts.

kshmir · 2024-12-25

Yes, I will include support for CLI arguments to make debugging easier, and also for enabling snapshots.
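
A minimal sketch of gating log output on a CLI flag. `--debug-ai` is the flag named in the review and `--snapshot` is only proposed in this PR; `parseFlags` and `createDebugLogger` are hypothetical helpers and do not reflect what cli/bin.ts or debug.ts actually contain.

```typescript
interface CliFlags {
  debugAi: boolean;  // --debug-ai, mentioned in the review
  snapshot: boolean; // --snapshot, proposed in this PR
}

// Hypothetical parser: only checks for the presence of each flag.
function parseFlags(argv: string[]): CliFlags {
  return {
    debugAi: argv.includes("--debug-ai"),
    snapshot: argv.includes("--snapshot"),
  };
}

// Returns a logger that is a no-op unless debugging is enabled.
// The sink is injectable so output can go wherever existing logs go.
function createDebugLogger(enabled: boolean, sink: (line: string) => void = console.log) {
  return (scope: string, message: string): void => {
    if (enabled) sink(`[${scope}] ${message}`);
  };
}
```

With this shape, the replay code could call something like `log("snapshot", "replaying step 3")` unconditionally and the flag decides whether anything is printed.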

m2rads · 2024-12-24T05:42:31Z

packages/shortest/src/browser/core/snapshot-browser-tool.ts

+    }
+  }
+
+  async execute(input: ActionInput): Promise<ToolResult> {


It seems like this part executes the test steps twice: once when the AI performs them and once as a replay. This should probably be avoided. Maybe the first time we run pnpm shortest we should let the AI execute its actions, and only run the replay on subsequent runs. It would be cool to have a smart mechanism to delete the snapshots once they become obsolete (no rush for now; I will add this as a new issue later, unless you want to tackle it in the same PR).

kshmir · 2024-12-25

I'll try to make something usable behind a flag to keep moving forward.
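
The "AI on the first run, replay afterwards" suggestion could be sketched roughly as follows. All names here (`runOnce`, the in-memory `store`, the callbacks) are hypothetical and only illustrate the control flow, not the PR's implementation.

```typescript
type RunMode = "record" | "replay";

// Hypothetical policy: replay only when snapshots are enabled and a recording
// already exists; otherwise let the AI run and (if enabled) record its actions.
function runOnce(
  testName: string,
  store: Map<string, string[]>,          // stand-in for snapshot files on disk
  snapshotsEnabled: boolean,
  runWithAI: () => string[],             // returns the actions the AI performed
  replayActions: (actions: string[]) => void,
): RunMode {
  const recorded = store.get(testName);
  if (snapshotsEnabled && recorded !== undefined) {
    replayActions(recorded);
    return "replay";
  }
  const actions = runWithAI();
  if (snapshotsEnabled) store.set(testName, actions);
  return "record";
}
```

Each run takes exactly one of the two paths, so no step is executed twice within a single run.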

m2rads · 2024-12-24T05:55:44Z

Also please bump the version in packages/shortest/package.json to "version": "0.1.1" and update the packages/shortest/CHANGELOG.md

Thank you.

vercel · 2024-12-24T05:57:14Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

| Name | Status | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| shortest | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | Dec 26, 2024 1:46am |

slavingia · 2024-12-24T18:13:04Z

I do think snapshots work as a term over recording. Seems more accurate/intuitive to "screenshotting" which is happening, versus recording which implies videos imo.

kshmir · 2024-12-25T23:41:50Z

> I do think snapshots work as a term over recording. Seems more accurate/intuitive to "screenshotting" which is happening, versus recording which implies videos imo.

Makes sense, will keep it that way then.

kshmir · 2024-12-25T23:47:08Z

> Thanks for your PR @kshmir. A few comments for you to look at :)
>
> I ran the tests with your new feature and yes the test execution is faster but seems flaky. Seems like the only actions that are properly being executed and recorded are navigation and screenshot actions. If you know a better way to perform browser actions such as xpath instead of coordinates please address them in the same PR otherwise the package will become broken.
>
> Also I already left you a review about this but I mention it again. Seems like each test step executed by AI is being executed again by the replay mechanism. At least that's what I understood from the debugging logs. Maybe you can help me understand this:
>
> Overall your approach seems very interesting and definitely is faster. Please address these issues and let me know if you have any questions or suggestions.

@m2rads this is what is executed. At the same time, I'll remove the screenshot steps, since they're irrelevant, and convert the click action to an XPath execution.

… faster

slavingia · 2024-12-25T23:56:15Z

Why is xpath better than clicking a coordinate? Feels like the latter is more like what a human would do. I can see it breaking more often I guess if the UI changes, but then it would just rerun the test and look for the button like a human would. Seems better to me than falling back on something deterministic like xpath (why not write that explicitly in the test then?) but perhaps I'm missing something.
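
One way to read "it would just rerun the test" is as a fallback: replay the recorded actions, and if any of them fails, hand the test back to the AI. A sketch under that assumption, with hypothetical names throughout:

```typescript
// Hypothetical: replay recorded actions; if any step throws (for example because
// the UI changed), treat the recording as stale and let the AI redo the test.
function replayWithFallback(
  recorded: string[],
  replayStep: (action: string) => void,
  rerunWithAI: () => string[],
): { usedFallback: boolean; actions: string[] } {
  try {
    for (const action of recorded) replayStep(action);
    return { usedFallback: false, actions: recorded };
  } catch {
    // The fresh actions returned by the AI could replace the stale recording.
    return { usedFallback: true, actions: rerunWithAI() };
  }
}
```

Under this reading, a coordinate that no longer hits the button costs one slower AI run rather than a failed test.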

m2rads · 2024-12-26T01:51:28Z

> Why is xpath better than clicking a coordinate? Feels like the latter is more like what a human would do. I can see it breaking more often I guess if the UI changes, but then it would just rerun the test and look for the button like a human would. Seems better to me than falling back on something deterministic like xpath (why not write that explicitly in the test then?) but perhaps I'm missing something.

I agree with you, and I am not convinced that XPath is the better option either. Besides, in some cases XPath can be flaky. The computer use API is trained on x,y coordinates. @kshmir, can you explain how you implemented XPath and used it with the AI?

kshmir · 2024-12-26T04:17:12Z

@m2rads I'm still testing out the xpath solution... it's obviously a tradeoff...

This is what I see...

- Some XPaths can be more reliable than others. The most basic one is the DOM tree structure, which is similar to X,Y coordinates, except that it lives inside the DOM.
- data-testid is the "industry standard".
- I also asked Sonnet to give me a hint in the prompt that I can use to do a text search.
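
The three kinds of hints above could be turned into an ordered list of locator candidates, with the recorded coordinates kept as a last resort. This is only a sketch: `ClickTarget`, `Locator`, and the ordering are assumptions, and the order could just as well be reversed so that coordinates are tried first.

```typescript
// Hypothetical record of what was clicked.
interface ClickTarget {
  testId?: string;  // value of a data-testid attribute, if present
  text?: string;    // visible text hint (e.g. suggested by the model)
  domPath?: string; // structural XPath such as /html/body/div[2]/button[1]
  x: number;
  y: number;
}

type Locator =
  | { kind: "xpath"; expression: string }
  | { kind: "coordinates"; x: number; y: number };

// XPath 1.0 has no escape sequences, so strings containing both quote types
// must be built with concat().
function xpathLiteral(value: string): string {
  if (!value.includes('"')) return `"${value}"`;
  if (!value.includes("'")) return `'${value}'`;
  return `concat(${value.split('"').map((part) => `"${part}"`).join(`, '"', `)})`;
}

// Candidates from (assumed) most to least stable; callers try them in order.
function locatorCandidates(t: ClickTarget): Locator[] {
  const out: Locator[] = [];
  if (t.testId) {
    out.push({ kind: "xpath", expression: `//*[@data-testid=${xpathLiteral(t.testId)}]` });
  }
  if (t.text) {
    // May match more than one element (e.g. a wrapper with identical text).
    out.push({ kind: "xpath", expression: `//*[normalize-space(.)=${xpathLiteral(t.text)}]` });
  }
  if (t.domPath) out.push({ kind: "xpath", expression: t.domPath });
  out.push({ kind: "coordinates", x: t.x, y: t.y });
  return out;
}
```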

Based on what you say, XPaths could be a fallback, I'm now recording a structure like this

kshmir · 2024-12-26T04:18:54Z

I guess for now I'll leave the XPath part for other commits, based on what you both mentioned, and release a final version tomorrow that minimizes screenshots. I think we have different goals for what the tool can do, so I don't want to overdo it.

kshmir · 2024-12-26T04:31:37Z

I think the most important pending thing is a flag to enable/disable all of this. I'll leave the XPath work for another PR.

add a basic implementation of a snapshot based tool

6d1e10a

kshmir changed the title ~~add a basic implementation of a snapshot based tool~~ add a basic implementation of a recording based tool Dec 24, 2024

kshmir changed the title ~~add a basic implementation of a recording based tool~~ add a basic implementation of a recording based tool and client Dec 24, 2024

bojl mentioned this pull request Dec 24, 2024

Cache tests #124

Open

m2rads self-requested a review December 24, 2024 05:29

m2rads requested changes Dec 24, 2024


vercel bot deployed to Preview December 24, 2024 05:58 View deployment

Merge branch 'main' into feature/recorded-runs

511df20

fix irrelevant judgement prompt and skip snapshot steps to make tests faster

f74cd00

kshmir added 2 commits December 25, 2024 21:09

better usage of snapshot files, it should work by default now

d1597b7

bump version

a8576ee

vercel bot deployed to Preview December 26, 2024 01:46 View deployment

fix initial launches and issues with filenames

c722e2e
