Add a LinkML-based writer option to Koza #818

Open
kevinschaper opened this issue Sep 19, 2024 · 0 comments

To move past the implicit Biolink + KGX format assumptions of the current Koza writers, and to support writing multiple output files from a single ingest, I think we should define a new writer configuration based on LinkML models. Supplying a schema and a list of classes to a writer, along with an explicit output filename, lets the writer determine its output columns dynamically and in a model-agnostic way (a longer-term Koza goal). It is also less brittle than the current listing of node and edge properties, where a property set in the Python code but left out of the node/edge properties is never written to the file.
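
To illustrate how a schema plus a list of classes could determine the output columns, here is a minimal sketch. It uses a toy in-memory dictionary in place of a real LinkML schema, so `TOY_SCHEMA`, `induced_slots`, `columns_for`, and `row_for` are hypothetical names, not Koza or LinkML API. With a real schema, linkml-runtime's `SchemaView.class_induced_slots` would likely fill the role of `induced_slots`.

```python
# Toy stand-in for a LinkML schema: each class lists its own slots and an optional parent.
# The class and slot names are illustrative only, not copied from Biolink.
TOY_SCHEMA = {
    "NamedThing": {"is_a": None, "slots": ["id", "category", "name"]},
    "Gene": {"is_a": "NamedThing", "slots": ["symbol", "in_taxon"]},
    "Association": {"is_a": None, "slots": ["id", "subject", "predicate", "object"]},
}


def induced_slots(schema, class_name):
    """Return the slots of a class, inherited slots first, without duplicates."""
    cls = schema[class_name]
    inherited = induced_slots(schema, cls["is_a"]) if cls["is_a"] else []
    return inherited + [s for s in cls["slots"] if s not in inherited]


def columns_for(schema, class_names):
    """Union of induced slots across the configured classes, in first-seen order."""
    columns = []
    for name in class_names:
        for slot in induced_slots(schema, name):
            if slot not in columns:
                columns.append(slot)
    return columns


def row_for(columns, record):
    """Build a row in column order; a key outside the schema raises instead of being dropped."""
    unknown = set(record) - set(columns)
    if unknown:
        raise KeyError(f"properties not in schema: {sorted(unknown)}")
    return [str(record.get(col, "")) for col in columns]
```

Raising on unknown properties is one possible answer to the silent-drop problem described above; a real writer might prefer a warning instead.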

One challenge is how to specify the schema. The two initial use cases I'm imagining are writing Biolink node or association classes, and writing SSSOM associations. In both cases it probably makes the most sense to pull the model YAML via importlib, so my initial specification is {package}:{model.yaml}, which for our standard Koza STRINGDB example looks like:

writers:
  "nodes":
    filename: 'protein_links_nodes.tsv'
    linkml_schema: 'biolink_model.schema:biolink_model.yaml'
    classes:
      - 'Gene'
  "edges":
    filename: 'protein_links_edges.tsv'
    linkml_schema: 'biolink_model.schema:biolink_model.yaml'
    classes:
      - 'PairwiseGeneToGeneInteraction'
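
As a rough sketch of how a `{package}:{model.yaml}` value might be resolved, the snippet below splits the string and looks the file up with the standard library's `importlib.resources`. The function names `parse_schema_spec` and `locate_schema` are hypothetical, and this assumes the schema file ships inside an installed Python package.

```python
from importlib import resources


def parse_schema_spec(spec):
    """Split '{package}:{filename}' into its two parts."""
    package, sep, filename = spec.partition(":")
    if not sep or not package or not filename:
        raise ValueError(f"expected '{{package}}:{{filename}}', got {spec!r}")
    return package, filename


def locate_schema(spec):
    """Return a Traversable pointing at the schema file inside the installed package."""
    package, filename = parse_schema_spec(spec)
    resource = resources.files(package).joinpath(filename)
    if not resource.is_file():
        raise FileNotFoundError(f"{filename!r} not found in package {package!r}")
    return resource
```

The located file could then be handed to whatever loads the LinkML schema; failing early with a clear error seems preferable to discovering a bad spec halfway through an ingest.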

The Python part of the Koza transform would then change from:

koza.write(gene_a, gene_b, association)

to the slightly more verbose, but more specific:

koza.write(gene_a, gene_b, writer="nodes")
koza.write(association, writer="edges")
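
A minimal sketch of how the `writer=` argument could route records to separate files follows. `NamedWriters` and `open_in_memory` are hypothetical, records are plain dicts here rather than model objects, and the column lists are given directly instead of being derived from a schema; none of this is the actual Koza implementation.

```python
import csv
import io


class NamedWriters:
    """Hypothetical dispatcher: one TSV writer per name, each with its own columns."""

    def __init__(self, config, open_output):
        # config maps writer name -> {"filename": ..., "columns": [...]}
        # open_output is a callable returning a writable text stream for a filename.
        self._writers = {}
        for name, cfg in config.items():
            stream = open_output(cfg["filename"])
            writer = csv.DictWriter(
                stream, fieldnames=cfg["columns"], delimiter="\t",
                extrasaction="raise", lineterminator="\n",
            )
            writer.writeheader()
            self._writers[name] = writer

    def write(self, *records, writer):
        """Send each record (a dict here) to the writer registered under `writer`."""
        if writer not in self._writers:
            raise KeyError(f"no writer named {writer!r}")
        for record in records:
            self._writers[writer].writerow(record)


# Usage: keep output in memory so the example has no side effects on disk.
outputs = {}


def open_in_memory(filename):
    outputs[filename] = io.StringIO()
    return outputs[filename]


koza = NamedWriters(
    {
        "nodes": {"filename": "nodes.tsv", "columns": ["id", "name"]},
        "edges": {"filename": "edges.tsv", "columns": ["subject", "predicate", "object"]},
    },
    open_in_memory,
)
koza.write({"id": "A", "name": "geneA"}, {"id": "B", "name": "geneB"}, writer="nodes")
koza.write({"subject": "A", "predicate": "interacts_with", "object": "B"}, writer="edges")
```

Making `writer` keyword-only keeps the call sites unambiguous when several records are passed positionally.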

Note: I may back away from basing this entirely on LinkML, even though that's the big win. There have been times when we wanted to export an additional file, for example for debugging or QC, and in those cases it would be nice to have the option of simply specifying a list of columns.
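
One way the two options might coexist is sketched below: a writer entry carries either an explicit column list or a schema with classes, but not both. The `columns` key and the `resolve_columns` helper are assumptions made for illustration, and `columns_from_schema` stands in for whatever would derive columns from a LinkML model.

```python
def resolve_columns(writer_config, columns_from_schema):
    """Pick output columns from either an explicit list or a schema plus classes."""
    has_columns = "columns" in writer_config
    has_schema = "linkml_schema" in writer_config and "classes" in writer_config
    # Exactly one of the two styles must be present.
    if has_columns == has_schema:
        raise ValueError("give either 'columns' or 'linkml_schema' with 'classes'")
    if has_columns:
        return list(writer_config["columns"])
    return columns_from_schema(writer_config["linkml_schema"], writer_config["classes"])
```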

@kevinschaper kevinschaper added this to the 2024-10 Release milestone Sep 19, 2024
@kevinschaper kevinschaper self-assigned this Sep 19, 2024