Custom Genome and Visualizations #1519

Open
singhbhavya opened this issue May 23, 2024 · 6 comments

Comments

@singhbhavya

Hi there,

I apologize for the naive question I'm about to ask, but I've been struggling with this for a week and would appreciate some help. I created a custom gene index from a FASTA file of fusion genes (each ">" header line is one gene) and then aligned reads to these fusion genes. I'd like to visualize the FASTA as a "genome" in IGV, along with the read alignments to the genes. When I try to load the FASTA in IGV as a genome, I get an error saying that IGV cannot set the starting chromosome. When I try to import the BAM alignments, I get the error "Invalid BAM file header: missing sequence name in file". Could you please help me understand what I'm missing? Do I also need to provide an annotation to go with this custom "genome", describing what each gene is?

@jrobinso
Contributor

You should be able to load the FASTA from the "Genome" menu; I don't understand the error you are getting. What version of IGV are you using?

By "import" bam alignments I assume you are loading the BAM file from the "File" menu, correct? The error message indicates there is something wrong with your BAM file.
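One possible way to check the BAM header for this kind of problem is to dump it as text (for example with `samtools view -H`) and look for `@SQ` lines that lack a sequence name. The sketch below is only an illustration: the helper name `find_bad_sq_lines` and the example header are hypothetical, and it inspects only the text header, so it may not catch every cause of the error above.

```python
def find_bad_sq_lines(header_text):
    """Return (line_number, line) pairs for @SQ lines without a non-empty SN tag."""
    bad = []
    for number, line in enumerate(header_text.splitlines(), start=1):
        if not line.startswith("@SQ"):
            continue
        # SAM header fields are tab-separated TAG:VALUE pairs after the record type.
        fields = line.split("\t")[1:]
        tags = dict(f.split(":", 1) for f in fields if ":" in f)
        if not tags.get("SN"):
            bad.append((number, line))
    return bad


# Hypothetical header text, standing in for the output of `samtools view -H`.
example_header = "\n".join([
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:geneA_geneB\tLN:1200",
    "@SQ\tLN:900",
])

for number, line in find_bad_sq_lines(example_header):
    print(f"header line {number} has no sequence name: {line!r}")
```

If nothing is reported, the text header at least names every reference sequence, and the problem is more likely in the names themselves.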

If you are able to share these files (fasta, fasta index, bam, and bam index) email us at [email protected] and I can send you a secure dropbox link. But first confirm that you are using a recent version of IGV.

@singhbhavya
Author

Hi, thank you so much for the response! The version I am using is 2.3.98.

Yes, correct, I am loading the BAM file from the "File" menu. Please let me know whether or not I can email you - thank you again!

@jrobinso
Contributor

Sorry, I can't provide any help for that version; it was released in 2017. You might try the latest version, 2.17.4. If you would like me to look at your files, please send an email to the address noted above for a Dropbox link, or share them in some other way.

@singhbhavya
Author

Hi there, I updated to version 2.17.4 and received the same error. Sending you an email! Thank you so much!

@singhbhavya
Author

Hi there! I identified the problems and fixed them. In case anyone else goes through the same thing, here they are:

  • There were unexpected characters in the FASTA headers. I replaced those characters in the genome, and re-aligned the FASTQs to the genome.
  • Due to the characters, the genome wasn't being correctly loaded into IGV, and this fixed it as well.
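For anyone who wants to find such headers before re-aligning, here is a minimal sketch that lists header lines containing characters outside a conservative set. The "safe" set (letters, digits, `_`, `.`, `:` and `|`) is an assumption made for illustration, not a rule taken from IGV or any aligner, and `flag_headers` is a hypothetical helper name. The whole header text after the leading ">" is checked, so spaces in a description are reported too.

```python
import re

# Assumption: anything outside this set counts as "unexpected".
# Adjust the set to whatever your aligner and viewer accept.
UNSAFE = re.compile(r"[^A-Za-z0-9_.:|]")


def flag_headers(lines):
    """Yield (line_number, header_text, odd_chars) for suspicious FASTA headers."""
    for number, line in enumerate(lines, start=1):
        if not line.startswith(">"):
            continue  # sequence line, nothing to check
        name = line[1:].rstrip("\n")
        odd = sorted(set(UNSAFE.findall(name)))
        if odd:
            yield number, name, odd


# Hypothetical records; with a real file, pass the open file object instead.
sample = [">geneA>geneB(+)\n", "ACGT\n", ">geneC:geneD\n", "TTGA\n"]
for number, name, odd in flag_headers(sample):
    print(f"line {number}: {name!r} contains {odd}")
```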

I used a combination of these two scripts:

Python script to replace every ">" after the first one in each header line with "-" (it overwrites the file in place):

def replace_gt_with_dash_except_first(filename):
    # Read the whole file first, because it is rewritten in place below.
    with open(filename, 'r') as file:
        lines = file.readlines()

    with open(filename, 'w') as file:
        for line in lines:
            if line.startswith('>'):
                # Keep the leading '>' and join the remaining pieces with '-'
                parts = line.split('>')
                line = parts[0] + '>' + '-'.join(parts[1:])
            file.write(line)

filename = 'Genomic_sequences_fromFusion_batch1.fasta'
replace_gt_with_dash_except_first(filename)

Bash script to replace parentheses and dashes with colons:

#!/bin/bash

# Replace dashes and parentheses with colons in sequence names (header lines only)
replace_specific_chars() {
    local input_file=$1
    local output_file=$2

    # Quote the paths so file names containing spaces still work
    sed -E '/^>/ s/[-()]/:/g' "$input_file" > "$output_file"
}

input_file="Genomic_sequences_fromFusion_batch1.fasta"
output_file="output.fasta"

replace_specific_chars "$input_file" "$output_file"
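The two scripts above could also be folded into a single pass that leaves the original FASTA untouched. This is a sketch rather than the scripts actually used: `clean_header` and `clean_fasta` are hypothetical names, and the character set is taken from the two scripts (extra ">", "-", "(" and ")" all end up as ":").

```python
import re


def clean_header(line):
    """Replace extra '>', '-', '(' and ')' in a FASTA header line with ':'."""
    if not line.startswith(">"):
        return line  # sequence lines pass through unchanged
    # Keep the leading '>' and rewrite only the rest of the header.
    return ">" + re.sub(r"[>()\-]", ":", line[1:])


def clean_fasta(src, dst):
    """Write a cleaned copy of src to dst, one line at a time."""
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            fout.write(clean_header(line))


# Hypothetical header, shown only to illustrate the substitution.
print(clean_header(">geneA>geneB(+)-v1\n"), end="")
```

Calling `clean_fasta("Genomic_sequences_fromFusion_batch1.fasta", "output.fasta")` would then produce the cleaned file. The reads need to be re-aligned against it so that the BAM header carries the new names.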

@jrobinso
Contributor

jrobinso commented Jun 1, 2024

@singhbhavya Thanks for this, I'm sure it will be helpful. If you could post one of the offending FASTA header lines here, I will see if we can improve the parser to load it without modification. The main rule is that the sequence name should be the string between the initial ">" and the first whitespace; we should be able to change the parser to ignore everything else.
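The naming rule described here can be sketched in a few lines. This only illustrates the rule as stated and is not IGV's actual parser code; `sequence_name` is a hypothetical helper and the header in the example is made up.

```python
import re


def sequence_name(header_line):
    """Return the text between the leading '>' and the first whitespace."""
    match = re.match(r">(\S*)", header_line)
    if match is None:
        raise ValueError("not a FASTA header line")
    return match.group(1)


# Everything after the first whitespace is treated as a description and ignored.
print(sequence_name(">geneA>geneB(+) fusion candidate\n"))
```

Under this rule, characters such as ">" or parentheses remain part of the name as long as they come before the first whitespace.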
