Sanitize null bytes before ingestion #2090

Merged
2 commits merged on Sep 25, 2024

laoqiu233
Contributor

Description

When ingesting documents using Postgres, some PDF documents could raise ValueError: A string literal cannot contain NUL (0x00) characters. This PR replaces all null bytes before ingesting the documents to make sure this error doesn't happen.
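
For illustration only, the kind of sanitization described above might look like the sketch below. It is not the code from this PR: the `Doc` class and the `strip_null_bytes` helper are hypothetical names, and the sketch assumes each document exposes its content as a plain `text` string. It drops the null bytes, which amounts to replacing them with an empty string.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for an ingested document (not the project's class)."""

    text: str


def strip_null_bytes(docs: list[Doc]) -> list[Doc]:
    """Remove NUL (0x00) characters from each document's text, in place."""
    for doc in docs:
        # Postgres text values cannot contain NUL, so drop those characters
        # before the text is handed to the store.
        doc.text = doc.text.replace("\x00", "")
    return docs


docs = strip_null_bytes([Doc("Hello\x00 world"), Doc("clean")])
print([d.text for d in docs])  # ['Hello world', 'clean']
```

Running the cleanup once, right after parsing and before anything is written to the vector or node store, keeps it in a single place.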

Type of Change

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

I ran the pgpt server locally using Ollama and Postgres as both the vector and node store. Before the fix, the document was not ingested; after adding the sanitization, the PDF was successfully ingested and stored.

  • I stared at the code and made sure it makes sense

Test Configuration:

  • Hardware: MacbookPro M2Pro
  • Toolchain: Ollama, Postgres

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • I ran make check; make test to ensure mypy and tests pass

Collaborator

@jaluma jaluma left a comment

Thank you!
Can you add a comment to explain why this is necessary?

@laoqiu233
Contributor Author

Thank you! Can you add a comment to explain why this is necessary?

Done! :]

Collaborator

@jaluma jaluma left a comment

LGTM!

Thank you!

@jaluma jaluma merged commit 5fbb402 into zylon-ai:main Sep 25, 2024
6 checks passed