Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add section on Event Labeling Queries #70

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

bmcase
Copy link
Collaborator

@bmcase bmcase commented May 8, 2023

There has been considerable interest in supporting event level outputs in IPA (IPA issue 60, PATCG issue 41). This was discussed at the May 2023 PATCG meeting where the consensus was we could support this so long as we can enforce a per user bound on the information released.

Here we outline a first draft of how an Event Labeling Query that labels events with noisy labels can be done in a way that lets us maintain a per matchkey DP bound. We also consider how these new queries can be compatible with an IPA system that flexibly supports either aggregation queries or event labeling queries.

w/ @benjaminsavage

@csharrison
Copy link
Contributor

A few quick notes:

  1. For dealing with users with multiple contributions I think it will probably be better to cap the sensitivity than to scale the noise dynamically.
  2. Supporting binary labels is good but ideally we would support enum labels / value bucket labels (Add a mechanism to support breakdowns by trigger data #55, Consider a randomized-response-like privacy mechanism #60)

cc @badih

@bmcase
Copy link
Collaborator Author

bmcase commented May 9, 2023

@csharrison

For dealing with users with multiple contributions I think it will probably be better to cap the sensitivity than to scale the noise dynamically.

Yeah, that would be a nice option. I actually started out trying to do it that way, but ran into some issues. Once you pass the cap you have to still return something for that source row. For rows that exceeded the cap you can't give back a "not labeled" indicator or that can leak information. I think you could try to return just a random label, maybe by setting to 0 and then flipping with the same flip probability as other rows, but it wasn't immediately clear to me that this also wouldn't increase the DP bound on how much information was revealed about that user. It also wasn't clear to me if random labels on some events would be worst for utility than scaling all the noise.

I think we can probably make it work, but like I said this was just a quick first draft to get started.

@csharrison
Copy link
Contributor

I think you could try to return just a random label, maybe by setting to 0 and then flipping with the same flip probability as other rows, but it wasn't immediately clear to me that this also wouldn't increase the DP bound on how much information was revealed about that user.

I think this should be fine for a DP bound and the argument will look really similar to sensitivity capping + Laplace with per-event queries. Consider two databases D and D' that are adjacent. After rate-limiting to sensitivity $s$, a users contribution will look like $(x_1, x_2, ... x_n)$ and $(x_1', x_2', ..., x_n')$ for n events, where the value at any $x_i$ is the label. After capping, these vectors will differ on at most $s$ columns1. If you consider an output of $(y_1, ..., y_n)$, then we have

$$\frac{Pr[y_1,...,y_n | x'_1,...x'_n]}{Pr[y_1,...,y_n | x_1,...x_n]} = \prod_{i=1}^{n}{\frac{Pr[y_i | x'_i]}{Pr[y_i | x_i]}} = \prod_{i \in differing\_rows}{\frac{Pr[y_i | x'_i]}{Pr[y_i | x_i]}}$$

That is, we can analyze each of the $s$ differing columns independently, and if you can show that the mechanism on a single column is bounded by $e^\epsilon$ it implies the whole mechanism is $s \epsilon$-differentially private.

Footnotes

  1. This is with an add-remove notion of DP that will just strip conversions for a converting user or add fake conversions for a non-converting user. With a "replacement" notion of DP that merely modifies the conversion patterns for a user the rows will differ on $2s$ values and everything goes through with a factor of 2 in the bound.

@martinthomson
Copy link
Collaborator

GitHub fail:
image

Shame that I'm not smart enough to parse that out myself.

@bmcase
Copy link
Collaborator Author

bmcase commented May 17, 2023

Updated the doc to include the version that caps the number of events labeled per matchkey.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants