Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

db: allow excises to unconditionally be flushable ingests #3897

Merged
merged 1 commit into from
Sep 25, 2024

Conversation

itsbilal
Copy link
Member

Previously, we'd only allow IngestAndExcise to become flushable ingests if we had a guarantee from the caller that the ingestion contained rangedel/rangekeydel overlapping the excise span. This change reworks flushable ingests to return rangedels/rangekeydels from their own iterators, even if one does not exist in the sstables themselves. This, plus the update to the WAL to be able to persist the excise span, allows for IngestAndExcise to allow flushable ingests in more cases than before.

Also resolves TODOs in replayWAL() around properly sequencing flushable ingests so their sstables can go in levels lower than L0. This sequencing was necessary to make excises work correctly; as excises require an up-to-date view of the version to work correctly.

Fixes #2676.

@itsbilal itsbilal requested a review from a team as a code owner August 28, 2024 18:55
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@itsbilal itsbilal force-pushed the excise-always-flushes branch 2 times, most recently from 7822bc8 to 47a772a Compare September 3, 2024 20:35
@itsbilal
Copy link
Member Author

itsbilal commented Sep 3, 2024

The earlier meta test issue is fixed now, so this should be RFAL.

@itsbilal itsbilal force-pushed the excise-always-flushes branch 2 times, most recently from 67b388b to 59ff3eb Compare September 3, 2024 22:06
Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: I think Jackson should take a look at the replayWAL changes.

Reviewable status: 0 of 28 files reviewed, 5 unresolved discussions (waiting on @aadityasondhi and @itsbilal)


ingest.go line 1289 at r3 (raw file):

	// The excise span is going to outlive this ingestion call. Copy it.
	exciseSpan = KeyRange{
		Start: append([]byte(nil), exciseSpan.Start...),

[nit] slices.Clone


open.go line 874 at r3 (raw file):

			return errors.Wrapf(err, "running compaction during WAL replay")
		}
		ve.NewFiles = append(ve.NewFiles, newVE.NewFiles...)

[nit] maybe this should be a VersionEdit.Merge method?


open.go line 883 at r3 (raw file):

			}
		}
		if newVE.CreatedBackingTables != nil {

[nit] no need for the nil check


open.go line 1007 at r3 (raw file):

				}
				if kind == base.InternalKeyKindExcise {
					exciseSpan.Start = append([]byte(nil), encodedFileNum...)

[nit] slices.Clone


open.go line 1028 at r3 (raw file):

							panic("pebble: multiple excise spans in a single batch")
						}
						exciseSpan.Start = append([]byte(nil), key...)

[nit] slices.Clone

Copy link
Member Author

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Reviewable status: 0 of 28 files reviewed, 1 unresolved discussion (waiting on @aadityasondhi, @jbowens, and @RaduBerinde)


ingest.go line 1289 at r3 (raw file):

Previously, RaduBerinde wrote…

[nit] slices.Clone

Done.


open.go line 874 at r3 (raw file):

Previously, RaduBerinde wrote…

[nit] maybe this should be a VersionEdit.Merge method?

In this case we can make some simplifying assumptions that allows us to just directly merge two VEs (that is, we'll either create brand new files, or we'll delete files that existed from the very start, we will never delete a file that was created in the same VE). In other cases we might need something like BulkVersionEdit.Accumulate. So maybe it's best to just leave it here, to avoid misuse of a generic Merge method where these assumptions don't hold.


open.go line 883 at r3 (raw file):

Previously, RaduBerinde wrote…

[nit] no need for the nil check

Done.

Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 14 of 28 files at r1, 2 of 6 files at r2, 2 of 3 files at r3, 3 of 5 files at r4.
Reviewable status: 21 of 28 files reviewed, 6 unresolved discussions (waiting on @aadityasondhi, @itsbilal, and @RaduBerinde)


flushable.go line 167 at r4 (raw file):

	// flush.
	exciseSpan KeyRange
	seqNum     base.SeqNum

nit: should this be called exciseSeqNum? or add a comment above seqNum discussing the sequence number assignment (with files adopting the len(files) sequence numbers starting at seqNum+1 if exciseSpan.Valid())


flushable.go line 305 at r4 (raw file):

	)
	miter.Init(s.comparer, keyspan.NoopTransform, new(keyspanimpl.MergingBuffers), liter)
	if rkeydelIter != nil {

if rkdelIter == nil should we just return liter directly (and avoid allocating a MergingIter)?


open.go line 936 at r4 (raw file):

			if kind, encodedFileNum, val, ok, err := br.Next(); err != nil {
				return nil, 0, err
			} else if ok && (kind == InternalKeyKindIngestSST || kind == InternalKeyKindExcise) {

can we pull this out into its own function? replayWAL is already a bit hard to follow


open.go line 940 at r4 (raw file):

				//
				// Ingests require an up-to-date view of the LSM to determine the target
				// level of ingested sstables, and to accurately compute excises. Outside

would it simplify things to just flushable ingest these into L0 during recovery? I don't think there's much of a loss in doing that and would love to keep the recovery code as simple as we can.


internal/base/internal.go line 163 at r4 (raw file):

	// if the current InternalKeyKindMax is a kind that is never added to an
	// sstable or memtable (eg. InternalKeyKindExcise).
	InternalKeyKindMaxForSstable InternalKeyKind = InternalKeyKindDeleteSized

nit: I think we tend to consistently use SSTable capitalization

Copy link
Member Author

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Reviewable status: 21 of 28 files reviewed, 4 unresolved discussions (waiting on @aadityasondhi, @jbowens, and @RaduBerinde)


flushable.go line 167 at r4 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: should this be called exciseSeqNum? or add a comment above seqNum discussing the sequence number assignment (with files adopting the len(files) sequence numbers starting at seqNum+1 if exciseSpan.Valid())

Done.


flushable.go line 305 at r4 (raw file):

Previously, jbowens (Jackson Owens) wrote…

if rkdelIter == nil should we just return liter directly (and avoid allocating a MergingIter)?

Done.


open.go line 936 at r4 (raw file):

Previously, jbowens (Jackson Owens) wrote…

can we pull this out into its own function? replayWAL is already a bit hard to follow

Done. I kept the versionEdit/toFlush interactions here but moved the flushableEntry construction and batch reading into a function and it's much cleaner now.


open.go line 940 at r4 (raw file):

Previously, jbowens (Jackson Owens) wrote…

would it simplify things to just flushable ingest these into L0 during recovery? I don't think there's much of a loss in doing that and would love to keep the recovery code as simple as we can.

The issue isn't so much just the L0 ingestions, but also the fact that we need to excise files we've written up to this point. I think it's easier to reason about correctness when the recovery code shares more code with the regular ingest pipeline than to have it do something unique and special.


internal/base/internal.go line 163 at r4 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: I think we tend to consistently use SSTable capitalization

Done.

Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 28 files at r1, 1 of 5 files at r4, 3 of 4 files at r5, all commit messages.
Reviewable status: 26 of 28 files reviewed, 7 unresolved discussions (waiting on @aadityasondhi, @itsbilal, and @RaduBerinde)


open.go line 940 at r4 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

The issue isn't so much just the L0 ingestions, but also the fact that we need to excise files we've written up to this point. I think it's easier to reason about correctness when the recovery code shares more code with the regular ingest pipeline than to have it do something unique and special.

Thanks, I see. I still think this code structure is getting too unwieldy, and we're doing something unique and special right here. IIUC, before this PR replayWAL performed no calls to logAndApply. Its responsibility was a) building up the flushables list and b) corresponding set of version edit mutations. Then the caller was responsible for either just going on its way with the flushables as-is (read-only mode) or committing the version edit (read-write mode). (Are we respecting read-only mode here?)


open.go line 792 at r5 (raw file):

	}
	if kind == base.InternalKeyKindExcise {
		exciseSpan.Start = slices.Clone(encodedFileNum)

rename encodedFileNum? it's not an encoded file num here.


open.go line 795 at r5 (raw file):

		exciseSpan.End = slices.Clone(val)
	} else {
		addFileNum(encodedFileNum)

I didn't follow why the first iteration is handled up outside the loop?


open.go line 823 at r5 (raw file):

		return nil, err
	} else if ok {
		panic("pebble: invalid number of entries in batch.")

nit: stray trailing period


open.go line 950 at r5 (raw file):

	// updateVE is used to update ve with information about new files created
	// during the flush of any flushable not of type ingestedFlushable. For the
	// flushable of type ingestedFlushable we use custom handling below.

this comment is outdated now


open.go line 1047 at r5 (raw file):

					entry = d.mu.mem.queue[len(d.mu.mem.queue)-1]
					d.mu.mem.mutable = nil
					d.mu.mem.queue = d.mu.mem.queue[:len(d.mu.mem.queue)-1]

this line is confusing. the mutable memtable is the last item in the queue. when we flush, we'll flush the entirety of the queue. why is this trimming off just the mutable memtable at the end? are we guaranteed that the queue is a single element, the mutable memtable? if so we should add a comment describing the invariant, enforce it and use d.mu.mem.queue[:0] here


open.go line 1061 at r5 (raw file):

				if d.mu.versions.logSeqNum.Load() < maxSeqNum {
					d.mu.versions.logSeqNum.Store(maxSeqNum)
				}

why do we need to update the logSeqNum/visibleSeqNum early instead of waiting for the caller of replayWAL to do it?

Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think this code structure is getting too unwieldy

In thinking about this more, why do we need to actually perform the excise immediately in in replayWAL? The goal of ingested flushables is to defer the operation until flush time. Shouldn't we able to queue the appropriate flushables during replayWAL and then perform flush operations to drain the queue of flushables? I.e, it seems like we uncouple the replaying into memtables/flushables from the construction of the version edits. We'll still be able to use the existing ingestion excise logic to perform the excise. If we go this route, we should first refactor the existing code to have the structure we want before we attempt to extend it to support excises with flushable ingestions.

Reviewable status: 26 of 28 files reviewed, 7 unresolved discussions (waiting on @aadityasondhi, @itsbilal, and @RaduBerinde)

@itsbilal itsbilal force-pushed the excise-always-flushes branch 2 times, most recently from e07a9ca to df24195 Compare September 20, 2024 20:34
Copy link
Member Author

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTRs!

I agree that the cleaner approach is to just always replay WALs into the flushable queue, and then do flushes until the flushable queue is empty. That way we can maintain one code path for flushes and flushable ingests, and makes this replay code much easier to reason about. I've gone ahead and done that refactor.

There's still one "memtable buffer not freed" bug in one test (more like, when one test is run with other tests and not on its own) that I spent a lot of time taking a look at, but wasn't able to quite get to the bottom of yet. Other than that, this is ready for a look.

Reviewable status: 22 of 39 files reviewed, 5 unresolved discussions (waiting on @aadityasondhi, @jbowens, and @RaduBerinde)


open.go line 792 at r5 (raw file):

Previously, jbowens (Jackson Owens) wrote…

rename encodedFileNum? it's not an encoded file num here.

Done.


open.go line 823 at r5 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: stray trailing period

Done.


open.go line 950 at r5 (raw file):

Previously, jbowens (Jackson Owens) wrote…

this comment is outdated now

Done.


open.go line 1047 at r5 (raw file):

Previously, jbowens (Jackson Owens) wrote…

this line is confusing. the mutable memtable is the last item in the queue. when we flush, we'll flush the entirety of the queue. why is this trimming off just the mutable memtable at the end? are we guaranteed that the queue is a single element, the mutable memtable? if so we should add a comment describing the invariant, enforce it and use d.mu.mem.queue[:0] here

Done.


open.go line 1061 at r5 (raw file):

Previously, jbowens (Jackson Owens) wrote…

why do we need to update the logSeqNum/visibleSeqNum early instead of waiting for the caller of replayWAL to do it?

Done.

Copy link
Member Author

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's still one "memtable buffer not freed" bug in one test (more like, when one test is run with other tests and not on its own) that I spent a lot of time taking a look at, but wasn't able to quite get to the bottom of yet.

This is addressed now.

Reviewable status: 22 of 39 files reviewed, 5 unresolved discussions (waiting on @aadityasondhi, @jbowens, and @RaduBerinde)

@itsbilal itsbilal force-pushed the excise-always-flushes branch 2 times, most recently from f3e286f to 6cffbb6 Compare September 22, 2024 21:15
Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we go this route, we should first refactor the existing code to have the structure we want before we attempt to extend it to support excises with flushable ingestions.

Do you mind pulling out the refactor into a separate PR?

Reviewable status: 22 of 39 files reviewed, 5 unresolved discussions (waiting on @aadityasondhi, @itsbilal, and @RaduBerinde)

@itsbilal itsbilal force-pushed the excise-always-flushes branch 2 times, most recently from 2f29b72 to ee0beae Compare September 23, 2024 19:45
@itsbilal
Copy link
Member Author

Do you mind pulling out the refactor into a separate PR?

Done: #3955

Previously, we'd only allow IngestAndExcise to become flushable
ingests if we had a guarantee from the caller that the ingestion
contained rangedel/rangekeydel overlapping the excise span. This
change reworks flushable ingests to return rangedels/rangekeydels
from their own iterators, even if one does not exist in the sstables
themselves. This, plus the update to the WAL to be able to persist
the excise span, allows for IngestAndExcise to allow flushable
ingests in more cases than before.

Also resolves TODOs in replayWAL() around properly sequencing
flushable ingests so their sstables can go in levels lower than
L0. This sequencing was necessary to make excises work correctly;
as excises require an up-to-date view of the version to work
correctly.

Fixes cockroachdb#2676.
Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm_strong:

Looks great!

Reviewed 4 of 16 files at r6, 1 of 1 files at r8, 12 of 12 files at r9, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @aadityasondhi, @itsbilal, and @RaduBerinde)

@itsbilal
Copy link
Member Author

TFTR!

@itsbilal itsbilal merged commit d73ab80 into cockroachdb:master Sep 25, 2024
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ingest: Allow excises with flushable ingests
4 participants