
We then use this as context and ask GPT-4 to generate temporal localization questions that require further reasoning to answer. We also ask GPT-4 to simultaneously generate the answer that includes the queried start and end timestamps, along with the explanation about the reasoning process. #6

Open
rixejzvdl649 opened this issue Jul 29, 2024 · 1 comment

Comments

@rixejzvdl649

def get_caption_summary_prompt(gt_caption, predicted_captions):
    prompt_prefix_1 = "Generate a detailed and accurate description of a video based on the given ground-truth video caption and multiple frame-level captions. " \
                      "Use the following details to create a clear and complete narrative:\n"
    prompt_prefix_2 = "\nGround-truth Video Caption: "
    prompt_prefix_3 = "\nFrame-level Captions: "
    prompt_suffix = """\n\nInstructions for writing the detailed description:
    1. Focus on describing key visual details such as appearance, motion, sequence of actions, objects involved, and interactions between elements in the video.
    2. Check for consistency between the ground-truth caption and frame-level captions, and prioritize details that match the ground-truth caption. Ignore any conflicting or irrelevant details from the frame-level captions.
    3. Leave out any descriptions about the atmosphere, mood, style, aesthetics, proficiency, or emotional tone of the video.
    4. Make sure the description is no more than 20 sentences.
    5. Combine and organize information from all captions into one clear and detailed description, removing any repeated or conflicting details.
    6. Emphasize important points like the order of events, appearance and actions of people or objects, and any significant changes or movements.
    7. Do not mention that the information comes from ground-truth captions or frame-level captions.
    8. Give a brief yet thorough description, highlighting the key visual and temporal details while keeping it clear and easy to understand.
    Use your intelligence to combine and refine the captions into a brief yet informative description of the entire video."""

    # Assemble the prompt: task preamble, ground-truth caption, frame-level captions joined with '; ', then the instructions
    prompt = prompt_prefix_1
    prompt += f"{prompt_prefix_2}{gt_caption}{prompt_prefix_3}{'; '.join(predicted_captions)}"
    prompt += prompt_suffix

    return prompt
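
By analogy with the function above, the step quoted in the issue title (asking GPT-4 for a temporal localization question, an answer with start and end timestamps, and an explanation of the reasoning) might be sketched as follows. This is only a hypothetical illustration, not the authors' code: the function names `get_temporal_reasoning_qa_prompt` and `parse_qa_response`, the prompt wording, the `(start_sec, end_sec, caption)` event format, and the JSON reply format are all assumptions.

```python
import json


def get_temporal_reasoning_qa_prompt(video_description, events):
    """Build a prompt asking an LLM for one temporal localization QA pair.

    Hypothetical sketch: the wording and the JSON reply format are assumptions.
    `events` is a list of (start_sec, end_sec, caption) tuples.
    """
    # Render each event as "- [start - end] caption" so timestamps are explicit.
    event_lines = "\n".join(
        f"- [{start:.1f}s - {end:.1f}s] {caption}" for start, end, caption in events
    )
    prompt = (
        "You are given a description of a video and a list of timestamped events.\n"
        f"\nVideo Description: {video_description}\n"
        f"\nTimestamped Events:\n{event_lines}\n"
        "\nInstructions:\n"
        "1. Write one temporal localization question that requires further "
        "reasoning to answer, rather than simply restating an event caption.\n"
        "2. Give the answer, including the start and end timestamps (in seconds) "
        "of the queried segment.\n"
        "3. Explain the reasoning that leads to the answer.\n"
        'Return a JSON object with the keys "question", "answer", "start", '
        '"end", and "explanation".'
    )
    return prompt


def parse_qa_response(response_text):
    """Parse and sanity-check the (assumed) JSON reply from the model."""
    data = json.loads(response_text)
    required = {"question", "answer", "start", "end", "explanation"}
    missing = required - data.keys()
    if missing:
        raise ValueError(f"response is missing keys: {sorted(missing)}")
    if float(data["start"]) > float(data["end"]):
        raise ValueError("start timestamp is after end timestamp")
    return data
```

The returned string would be sent to the model with whatever client is in use; the parser only checks that the reply has the expected keys and that the timestamps are ordered.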


@rixejzvdl649
Author

Will similar code for the generation step described in the title be open-sourced?
