Note about IPA wire format #65

Open: wants to merge 1 commit into `main`.

172 changes: 172 additions & 0 deletions in `details/input.md`.
# Interoperable Private Attribution wire format


This documents the format that report collectors use to submit IPA queries to helper party networks.

## Query

An IPA query consists of a mix of source and trigger events obtained from one or more source and trigger websites.
**Query size** is determined by the number of events included within a single query request.

It is desirable for report collectors to submit large queries, as doing so brings more utility and reduces cost;
therefore it makes sense to optimize the query format on the wire.

The following sections propose a format that is space-optimized at the expense of being more complicated to assemble.

## Assumptions

* Report collectors use HTTP over TLS ([RFC 2818](https://www.rfc-editor.org/rfc/rfc2818)) to send queries to helper party networks.

> **Contributor:** Why are we choosing HTTP instead of TLS here?
>
> **Collaborator:** "Over" here means that one protocol layer (HTTP) runs on top of the other (TLS). It doesn't mean "instead of"; even though that is another meaning "over" can take, it isn't the usual assumption in this context.
>
> **Contributor:** Oh, OK. Maybe we should reword it to say "on top of" TLS to avoid confusion?
>
> **Author:** Heh, I never read it that way; I'm too used to this expression. I agree that it does sound confusing, so I'll just use HTTPS instead.
>
> **Collaborator:** "HTTP over TLS" is the name of the protocol, or "HTTPS". To call it something else would be far worse.
>
> **Contributor:** Didn't know about that. How about we add a link to the RFC for it: https://www.rfc-editor.org/rfc/rfc2818

* The number of events within a single query is between $10^6$ and $10^9$.
* The number of unique source and trigger websites is significantly lower than the total number of events in the input set.


## Format considerations

It is worth looking at a single event first. To be concrete, this assumes the match key to be a 40-bit byte string,
but this does not change the fundamentals. It is described in more detail
[here](https://github.com/patcg-individual-drafts/ipa/blob/main/IPA-End-to-End.md#generating-source-and-trigger-reports-by-the-report-collector).

A single event consists of encrypted replicated shares of the match key, replicated shares of the event data, and
additional authenticated data. Event data varies depending on whether it is a source or a trigger event. The
authenticated data is used to decrypt the shares of the match key.

The two things that consume the most space on the wire are the **site registrable domain** and the **match key provider
origin**. Both are potentially large ASCII strings that each event must refer to in order for the helper parties to
correctly obtain the plaintext shares.

The biggest savings from the custom format come from making each query carry only one copy of each unique site domain
and match key provider origin string. This proposal suggests building two lookup tables (one for each entity) on the
caller side, with each event referring to its entries by index.

> **Collaborator:** Given that source queries have a single source site and trigger queries have a single trigger site, you could start by indicating the type of query (which can be implicit or part of the query creation step). Then you can have two tables: one for the "same" side (source configurations for source queries, trigger configurations for trigger queries) and one for the "other" side (the converse). Then you could concatenate the two tables, index rows starting from zero, and refer to the configurations.
>
> Each row in the table would then effectively be a configuration that lists:
>
> 1. Site: length-prefixed ASCII; 1-byte length. Optimization hack: length = 0 copies the previous value.
> 2. Epoch: 2 bytes.
> 3. Key identifier: 1 byte.
>
> There are three implied values that fill out the common stuff:
>
> 1. (implied) Event type is inferred from the table type, so this is effectively run-length encoded.
> 2. (implied) The match key provider should be the same for all events, so that can be part of the query configuration.
> 3. (implied) The helper party should know its own name, so that can be omitted completely.
>
> Indexing into this table shouldn't take too many bytes, but I don't think that 1 byte is going to work out in all cases. The table size is known before you start processing individual items, though, so we can make the index size depend on the table size ($\lceil \log_2(t)/8 \rceil$).
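The variable index width described in the comment above can be sketched as a small helper (the function name is illustrative; it is not part of the proposal):

```python
import math

def index_width_bytes(table_size: int) -> int:
    """Bytes needed to index a table with `table_size` rows:
    ceil(log2(t) / 8), with a one-byte minimum."""
    if table_size <= 1:
        return 1
    return max(1, math.ceil(math.log2(table_size) / 8))
```

With the bounds proposed below ($2^{32}$ site entries, $2^8$ provider entries), this yields at most a 4-byte site index and a 1-byte provider index.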



## Proposed format

Each query request must carry the following information:
```text
[match key][event data][authenticated data][match key][event data][authenticated data]...
```

This proposal suggests splitting query requests into two sections: a lookup section for match key providers and
site domains, and a payload section with encrypted replicated shares of the match key and event data. Each encryption is
annotated with a pair of unique ids that point to the site and match key provider strings used to authenticate the match
key encryption.

```text
.lookups
[site origin 1][site origin 2]..[site origin N]
[match key provider 1][match key provider 2]..[match key provider M]
.payload
[site origin id][match key provider id][authenticated data][match key][event data]
[site origin id][match key provider id][authenticated data][match key][event data]
...
```

where `id` is the index of the site origin or match key provider inside the corresponding lookup table.

It is natural to assume $M \ll N$, so fewer bits are required to encode the match key provider index.

The total number of site domain entries inside the lookup table must be less than $2^{32}$.
The total number of match key provider origin entries inside the lookup table must be less than $2^8$.


### Metadata

Every query request must communicate to the helper parties several parameters that affect the size of the payload.
These parameters are sent in the [header fields](https://www.rfc-editor.org/rfc/rfc9110.html#name-header-fields) of the HTTP request.

The list of supported parameters includes:

| Header name | Type | Description | Accepted values | Default? | Mandatory? |
|------------------|------------------------------|----------------------------------------|-----------------|----------|------------|
| `x-ipa-field` | US-ASCII encoded string | Field type used to secret-share values | `fp32` | No | Yes |
| `x-ipa-query` | US-ASCII encoded string | Desired query to run in MPC | `ipa` | `ipa` | No |
| `x-ipa-version` | single byte unsigned integer | Version of the request | `1` | No | Yes |
> **Collaborator:** RFC 6648.
>
> We need a format for creating a query, which needs to include these values somehow. I would not use header fields for this, but instead define a payload format. This doesn't need to be tightly packed, so JSON is probably where I would go.
>
> Also, some of this is information that could be part of the resource identity. That is, you would have one URI that does IPA and another that does something different. That means that you don't need to include explicit versioning.
>
> Parameters are only necessary if you think that something needs tuning, or there are things that need to be known in order to accept the query. I think that we should directly signal the query size in this request, as that has a direct bearing on what is being requested.
>
> IPA already has a bunch of parameters that we have built into our implementation:
>
> * The number of breakdown keys.
> * The maximum value of individual trigger values.
> * The per-user cap.
> * The attribution window.
>
> These are what I would expect to see in the request that creates a query.
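As a rough illustration of the JSON direction suggested in the comment above, a query-creation body might look like this. Every field name here is hypothetical; none are part of an agreed-upon format:

```python
import json

# Hypothetical JSON body for the query-creation step, per the review
# comment above. All field names are illustrative assumptions.
query_config = {
    "query": "ipa",                  # which MPC query to run
    "field": "fp32",                 # field used for secret sharing
    "query_size": 1_000_000,         # number of events, signaled up front
    "num_breakdown_keys": 32,
    "max_trigger_value": 7,
    "per_user_cap": 3,
    "attribution_window_seconds": 7 * 24 * 3600,
}
body = json.dumps(query_config)
```

Sending these values in a JSON body (rather than `x-` headers) also sidesteps the RFC 6648 concern about custom header prefixes.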


### Lookup table

The lookup section consists of the unique site domain values followed by the unique match key provider origins, each
encoded as a length-prefixed ASCII string. Each section is terminated with a single zero-valued byte.

For example, a query that has two unique site origins and one match key provider will have the lookup table encoded as
follows:

```text
15www.example.com7docs.rs\016matchkeyprovider
```

All entries are implicitly zero-indexed, and the index of each entry is used inside the payload to indicate the
site origin that all events within that group are associated with.

In the example above, `www.example.com` has index 0 and `docs.rs` has index 1.
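The length-prefixed encoding shown above can be sketched as follows (assuming single-byte length prefixes and a zero terminator per section; the function name is mine):

```python
def encode_lookup_section(entries: list[str]) -> bytes:
    """Encode one lookup section: each entry as a 1-byte length prefix
    followed by its ASCII bytes, with a zero byte terminating the section.
    (A sketch; assumes every entry is shorter than 256 bytes.)"""
    out = bytearray()
    for entry in entries:
        data = entry.encode("ascii")
        assert 0 < len(data) < 256
        out.append(len(data))
        out.extend(data)
    out.append(0)
    return bytes(out)

# The two-section lookup table from the example above:
table = (encode_lookup_section(["www.example.com", "docs.rs"])
         + encode_lookup_section(["matchkeyprovider"]))
```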

### Payload

The payload section carries the match key encryption and event data, along with additional authenticated data not included
in the lookup section. It includes one or more events encoded as follows:

1) The index of the site origin for this event, encoded as a four-byte integer in big-endian byte order.
This must be a valid index into the lookup table.
2) The index of the match key provider origin for this event, encoded as a single-byte integer.
This must be a valid index into the lookup table.
3) The single-byte key identifier from the key configuration for the helper party.
4) The current epoch, encoded as a two-byte integer in big-endian byte order.
5) The [HPKE](https://datatracker.ietf.org/doc/html/rfc9180) encryption of the replicated match key shares:
1) The 32-byte [encapsulated key](https://datatracker.ietf.org/doc/html/rfc9180#section-4), encoded in big-endian byte order.
2) The ciphertext of the match key shares, encoded in big-endian order. Using 40-bit match keys results in 80 bits of ciphertext (two replicated shares), etc.
3) The 16-byte authentication tag, encoded in big-endian byte order.
6) The timestamp of the event, encoded as a three-byte integer in big-endian byte order, representing the
number of seconds since the beginning of the epoch.
7) The secret-shared value of the trigger bit, encoded as two [field](#metadata) values in big-endian order.
8) The secret-shared value of the trigger value, encoded as two [field](#metadata) values in big-endian order.
9) The secret-shared value of the breakdown key, encoded as two [field](#metadata) values in big-endian order.
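The per-event encoding listed above can be sketched as follows. The function and its signature are illustrative; the 10-byte ciphertext reflects the 40-bit match key assumption, and `shares` stands in for the six secret-shared field values (items 7-9, two replicated shares each):

```python
import struct

def encode_event(site_id: int, provider_id: int, key_id: int, epoch: int,
                 enc_key: bytes, ciphertext: bytes, tag: bytes,
                 timestamp: int, shares: list[bytes]) -> bytes:
    """Sketch of one payload event, following the field order above:
    site index (4B BE), provider index (1B), key id (1B), epoch (2B BE),
    HPKE parts, 3-byte timestamp, then the secret-shared field values."""
    assert len(enc_key) == 32 and len(tag) == 16
    header = struct.pack(">IBBH", site_id, provider_id, key_id, epoch)
    body = enc_key + ciphertext + tag + timestamp.to_bytes(3, "big")
    return header + body + b"".join(shares)
```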

## Simulation

It's worth simulating various distributions of events among unique site origins to estimate potential savings on the wire.
Intuitively, the biggest gains are achieved when relatively few site origins have a large number of events associated with them.

Note: The following estimations ignore the TCP/IP/Ethernet frame overhead as it remains the same regardless of the format chosen
by the implementations.

The following simulation assumes each event takes **112 bytes** on the wire, including encryption overhead
(see [encryption](./encryption.md)), and the site origin to be a random 25-160 byte ASCII string. The overhead of sending
additional authenticated data is ignored, except for the site domain. The assumption is that the match key provider set
per query is small, and while an additional lookup table is warranted, its relative overhead won't be visible in the
simulations.

> **Collaborator:** I think that you need to spell out the assumptions here. Something like:
>
> * enc = 32
> * ciphertext = 2 × 40 / 8 = 10
> * tag = 16
> * site = 1 + 50 (say)
> * key id = 1
> * epoch = 2
> * breakdown key = 1 (assuming XOR shares here and a small space; not sure about the state of the art)
> * trigger value = 4
> * ts = 4 (not sure here again)
>
> That's a little more than you have.
>
> But with a table, and if we make the breakdown key and trigger value mutually exclusive (and the same size), then we have 68 bytes, plus the table size, which is trivial for a large data set.
>
> The other thing with tables is that you can reuse them...
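A back-of-the-envelope model of the savings, under assumed parameters (112 fixed bytes per event, an average site-origin string of 92 bytes with a 1-byte length prefix, and a 4-byte site index in the optimized format; the document's exact simulation model may differ):

```python
# Rough size model; all parameters are assumptions, not the exact
# simulation used for the tables below.
def baseline_bytes(events: int, avg_site_len: int = 92) -> int:
    """Naive format: every event carries its full site-origin string."""
    return events * (112 + avg_site_len)

def optimized_bytes(events: int, unique_sites: int, avg_site_len: int = 92) -> int:
    """Lookup-table format: strings stored once, events carry a 4-byte index."""
    table = unique_sites * (1 + avg_site_len)
    return table + events * (112 + 4)
```

The table cost is amortized over all events, which is why the savings flatten out once the number of unique sites is small relative to the query size.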

### 1M input

When input size is 1M events, total size without any optimizations is **194 MiB**.

| Unique site origins | Optimized size |
| --- | --- |
| 1 M | 194.5 MiB |
| 500 k | 150.7 MiB |
| 250 k | 128.7 MiB |
| 100 k | 115.6 MiB |
| 50 k | 111.2 MiB |
| 20 k | 108.6 MiB |
| 10 k | 107.7 MiB |

### 1B input

With 1 billion events, the savings between 10K and 1M unique site origins become marginal.
Without any optimization, 1B events will take **190 GiB**.

| Unique site origins | Optimized size |
| --- | --- |
| 500 M | 147.2 GiB |
| 100 M | 112.9 GiB |
| 10 M | 105.2 GiB |
| 1 M | 104.4 GiB |
| 500 k | 104.4 GiB |
| 250 k | 104.3 GiB |
| 100 k | 104.3 GiB |
| 50 k | 104.3 GiB |
| 20 k | 104.3 GiB |
| 10 k | 104.3 GiB |


Space gains vary from 30% to 50%, assuming the number of unique websites lies between 10% and 30% of the number of events.