Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dEploid mcmc not converging #346

Open
bredelings opened this issue Jan 29, 2021 · 17 comments
Open

dEploid mcmc not converging #346

bredelings opened this issue Jan 29, 2021 · 17 comments

Comments

@bredelings
Copy link
Collaborator

Describe the bug

When run multiple times on the same data set, dEploid oscillates between two different answers. For example,

first answer:

         Effective_K: 1.00005
          Inferred_K: 1
Adjusted_effective_K: 1.00005

Proportions:
1.09651e-05	  0.999976	1.32491e-05	2.21877e-08

second answer:

         Effective_K: 1.75907
          Inferred_K: 2
Adjusted_effective_K: 1.75907

Proportions:
  0.314951	4.04259e-06	  0.685045	3.59593e-08

In this specific, case its clear from looking at the data that K=1 is right.

dEploid version: 659c770

To Reproduce
Steps to reproduce the behavior:

  1. dEploid -vcf ERR2678989.vcf -plaf plaf.txt -o ERR2678989-nopanel -noPanel
  2. -seed=0 produces one answer, but -seed=1 produces a different answer

I also tried increasing the run time from 800 to 8000 generations, but it didn't help.

Expected behavior
It seems that that the MCMC would be give the same answer each time. However, it seems that there are two modes, and the MCMC cannot move between them.

Additional context
I might be able to help you improve the mixing.

@bredelings
Copy link
Collaborator Author

Here are some files:
plaf.txt
ERR2678989.vcf.txt

@shajoezhu
Copy link
Member

Hi @bredelings , had a quick look at the data. You have got a lot of missing data. Would it be ok to encode your "." to zeros? does it make sense? I am not an expert in the sequence you are dealing with, but DEploid does not handle missing data at the moment.

Made a quick plot with the alt vs ref count, on the left is what's in the data, on the right, is after encode "." as zeros.

tmp

Eyeballing this data, it should be a mixed sample, unless there is a lot of error with the reading that you may want to filter them out.

DEploid MCMC struggle with certain mixing is a known problem. I don't have a solution yet, It would be awesome if you could help to solve this problem. Very often, we find it is the data is very difficult. our strategy is to take multiple runs, and take the consensus result. There is a newer version, "Best-practice" is soon coming. I want to finalize some of the documentation. But for this version, you must provide a reference panel. We found that it is very useful when learning k, especially if you have a very balanced mixing.

vcfFileName = "ERR2678989.vcf.txt"
numberOfHeaderLines = 42
ADFieldIndex=7
vcf <- read.table( gzfile(vcfFileName), skip = numberOfHeaderLines,
        header = T, comment.char = "", stringsAsFactors = FALSE,
        check.names = FALSE)

    sampleName <- names(vcf)[10]

    tmp <- vcf[[sampleName]]
    field <- strsplit(as.character(tmp), ":")

    tmpCovStr <- unlist(lapply(field, `[[`, ADFieldIndex))
    tmpCov <- strsplit(as.character(tmpCovStr), ",")

    refCount <- as.numeric(unlist(lapply(tmpCov, `[[`, 1)))
    altCount <- as.numeric(unlist(lapply(tmpCov, `[[`, 2)))

refCount0 = refCount
refCount0[is.na(refCount)] = 0
altCount0 = altCount
altCount0[is.na(altCount)]=0
plot(refCount0, altCount0)
png("tmp.png", width = 1200, height =600)
par(mfrow=c(1,2))
plot(refCount, altCount)
plot(refCount0, altCount0)
dev.off()

@bredelings
Copy link
Collaborator Author

I modified the utilities scripts locally to convert . to 0, similar to the PR for DEploid itself. Octopus writes VCFs with . when there is no haplotype that covers on of the alleles -- the author insists that it is a legal way to write VCFs, so probably the scripts should handle . by converting it to 0. I don't think this indicates anything wrong with the data. I can submit a PR to the utilities repo.

I did not think this was a mixed sample, because a comparison with other samples that are clearly mixed shows a very sparse scattering of mixed sites. Here are two plots, one for ERR2678989 (not mixed, I think) and ERR2678993 (mixed).
ERR2678989altVsRefAndWSAFvsPLAF
ERR2678993altVsRefAndWSAFvsPLAF
Do you still think that ERR2678989 is a mixed sample? If it is really mixed, would additional loci be mixed as well? Could it be that some loci just have mismapped reads and so appear mixed, but are not?

@shajoezhu
Copy link
Member

@bredelings I am with you. Should be clonal I think. Do you have a reference panel for this?

What your initial k is set to?

@shajoezhu
Copy link
Member

https://deploid.readthedocs.io/en/latest/FAQ.html#over-fitting

Hey @bredelings We have seen this problem before as well. You can force and set k to 1. And just use the plaf as the reference to learn the haplotype, and use this as a reference strain and build a panel to deconvolve your other samples. I have used this approach for pf3k

@bredelings
Copy link
Collaborator Author

bredelings commented Feb 6, 2021

Hi @shajoezhu, thanks for the advice! I will try adding a reference panel. I wasn't sure it would help because it seems that rho/theta is pretty high in vivax, so linkage disequilibrium is pretty low. But maybe the panel could still help. Do you know if rho/theta is lower in falciparum -- i.e. there are more SNPs per recombination event?

I haven't had time to fully look at the mixing in sufficient detail yet - I will get back to you soon, hopefully next week.

@shajoezhu
Copy link
Member

shajoezhu commented Feb 6, 2021 via email

@bredelings
Copy link
Collaborator Author

bredelings commented Feb 11, 2021

OK, I had a few ideas about fixing mixing problems. It looks like you have transition kernels to update (w|h) and transition kernels to update (h|w), where h is the haplotypes and w is the weights. The transition kernels for updating (h|w) can resample one or two haplotypes at a time. Does that sound right?

I think the issue is that if h and w are correlated, then updating w but not h might not work. So, what you need is a move that

  • proposes a move from w1 to w2 by changing the relative frequencies of strain i and strain j. This could perhaps perturb x[i] and x[j]
  • computes the probability of w1 by summing out the haplotypes for i and j using dynamic programming.
  • computes the probability of w2 by summing out the haplotypes for i and j using dynamic programming.
  • accept or reject the new weights w2 using the summed-out probabilities
  • if w2 is accepted, resample haplotypes i and j from the dynamic programming matrix associated with w2
  • if w2 is rejected, then resample haplotypes i and j from the dynamic programming matrix associated with w1

I can write down the math for detailed balance if needed.

It looks like computing the probability by summing out the haplotypes could be implemented fairly easily in UpdatePairHap::calcFwdBwdProbs() by just summing the last column.

I'm not sure if I'm missing something in UpdatePairHap:: calcExpectedWsaf( ) though. What is that doing?

@bredelings
Copy link
Collaborator Author

bredelings commented Feb 11, 2021

Another idea was to allow adding and removing haplotypes instead of fixing the number at five and setting the prior so that some have very low frequencies. This probably makes summarizing the posterior more complex. Maybe you would want to generate multiple summaries in that case, i.e. a summary for K=1 with one haplotype, a summary for K=2 with two haplotypes, a summary for K=3 with 3 haplotypes.

In any case, I think that you could add a 0/1 variable Z[k] for each of the 5 haplotypes, and say that
x'[k] ~ normal(eta, sigma^2)
x[k] = if z[k] == 1 then x'[j] else -Inf

It would be possible to actually remove x'[k] and h[k] from the MCMC state when Z[k] ==0 and the haplotype is missing. You would draw x'[k] from the prior and draw h[k] using UpdateSingleHap when adding the haplotype back in. But it would also be possible to keep all 5 haplopypes even when Z[k] == 0. However, for the purposes of mixing, it would still be important to resample h[k] from the posterior using UpdateSingleHap when flipping Z[k] to 1.

@bredelings
Copy link
Collaborator Author

Any thoughts?

It also looks like the doc branch has disappeared... do you know where it went?

@shajoezhu
Copy link
Member

shajoezhu commented Feb 19, 2021 via email

@shajoezhu
Copy link
Member

shajoezhu commented Feb 19, 2021 via email

@shajoezhu
Copy link
Member

shajoezhu commented Feb 19, 2021 via email

@bredelings
Copy link
Collaborator Author

Sorry, I just saw this because the e-mails went into my spam folder. Yeah, finding some time to talk would be great. I am also really interested in your ideas about the sequencing technology and mapping quality. More later.

@bredelings
Copy link
Collaborator Author

I have some patches I wanted to submit to illustrate some of what I was talking about and clarify about the dynamic programming. Is there a good branch for me to make the patches against? Mostly its against updateHap.cpp

@bredelings
Copy link
Collaborator Author

bredelings commented Feb 26, 2021

BTW, it looks like there might be some double-counting in this line of UpdateSingleHap::calcFwdProbs():

            fwdTmp[i] = emission_[j][panel_->content_[hapIndex][i]] * (fwdProbs_.back()[i] * pNoRec + massFromRec);

It looks like the transition from i -> i is included in both fwdProbs_.back()[i] * pNoRec and also in massFromRec. If so, maybe this could be fixed by changing the line to:

            fwdTmp[i] = emission_[j][panel_->content_[hapIndex][i]] * (fwdProbs_.back()[i] * (pNoRec - pRecEachHap) + massFromRec);

Not entirely sure...

@shajoezhu
Copy link
Member

shajoezhu commented Feb 27, 2021 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants