
Supporting batch uploads from the client (and routing reports through the collector) #64

Closed
csharrison opened this issue Jun 30, 2021 · 4 comments
Labels
parking-lot Parking lot for future discussions

Comments

@csharrison
Contributor

In today's design call we discussed the collector receiving encrypted reports from clients and forwarding them to the leader. This aligns with the design we have in the WICG with some of the reasoning documented here.

I also brought this up for discussion on our regular calls in the WICG (minutes), where there was some agreement that this was a good idea.

Pros / Cons of routing reports through the collector

These are probably non-exhaustive.
Pros:

  • Doesn't require aggregation servers to be highly online / available
  • Supports graceful failure ("If something goes wrong, we could re-query")
  • Gives some indication that "the API is working" on the server without needing to wait until query time, or via some other side-channel.
  • Distributes state out of the aggregation servers (modulo protection from replay attacks). Arguably this also aligns better, in our API, with who "owns" the data at a fundamental level.
  • Adds query flexibility "for free" without explicitly adding support in the aggregation servers by allowing querying subsets of reports (for instance)
  • Allows some level of report authentication by the collector, who (in our model) is the entity best positioned to validate reports. This could be done entirely outside the protocol.

Cons:

  • Adds query flexibility, which could be detrimental to privacy
  • Leaks some metadata about each request that may otherwise be visible only to the leader (e.g. IP address), unless an anonymizing proxy is used
  • Introduces a new vector for replay attacks

Protocol solution

It seems there is a fairly simple solution to this problem:

  • Instantiate the protocol where the collector is also the client, where the interactions between the "real clients" and the "client/collector" is unspecified by the protocol.
  • Allow the "client" to optionally batch upload reports in the protocol rather than sending them one by one.

In the existing protocol there is no client authentication, so it is technically possible to have a collector that simply collects encrypted reports from clients and forwards them on to the leader. Of course, the actual clients would need to be set up to do this, but it is permitted by the protocol. By allowing batch uploads, we just optimize this already-permitted configuration.
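The collector-as-client configuration described above can be sketched in a few lines. This is an editorial illustration, not part of any specified protocol: the `BatchingCollector` class, its endpoint URL, and the batch-size threshold are all hypothetical. The key property is that the collector only buffers and forwards encrypted blobs; it never decrypts them.

```python
# Hypothetical sketch: a collector acting as the protocol's "client",
# accepting already-encrypted reports from real clients and forwarding
# them to the leader in batches. Names/endpoints are illustrative.
from dataclasses import dataclass, field


@dataclass
class BatchingCollector:
    leader_upload_url: str            # where flushed batches would be POSTed
    batch_size: int = 100             # illustrative threshold
    pending: list = field(default_factory=list)  # buffered encrypted blobs

    def accept_report(self, encrypted_report: bytes) -> None:
        # The collector never decrypts reports; it only buffers them.
        self.pending.append(encrypted_report)

    def ready(self) -> bool:
        return len(self.pending) >= self.batch_size

    def flush(self) -> list:
        # In a real deployment this would upload the batch to
        # self.leader_upload_url; here we just return it.
        batch, self.pending = self.pending, []
        return batch
```

A caller would poll `ready()` (or flush on a timer) and send the result of `flush()` to the leader as a single batch upload.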

Alternatively, if we deem collector-clients to be bad for the protocol, we ought to have a mechanism which actually forbids them (e.g. by authenticating clients). However, I think that it is pretty reasonable to have this allowed by the protocol and leave it up to specific instantiations how the "client" is configured/trusted.

@csharrison changed the title from "Supporting batch" to "Supporting batch uploads from the client (and routing reports through the collector)" on Jun 30, 2021
@tgeoghegan
Collaborator

Gives some indication that "the API is working" on the server without needing to wait until query time, or via some other side-channel.

Can you elaborate on this? Is the idea here that when a client (that is, a real end-user client, not a batching one) gets a 200 OK after posting a report to a batching client, the client can be assured that its report has been durably persisted somewhere? I think a leader server could provide a similar guarantee at the end of the upload phase, so I'm trying to understand what extra assurances the batching client provides.

Adds query flexibility "for free" without explicitly adding support in the aggregation servers by allowing querying subsets of reports (for instance)

IIUC the query flexibility is because the batching client can submit the same reports multiple times, in different-sized batches. If we decide this query flexibility is bad, we could mitigate this by having the original client include a report timestamp in the encrypted input, where it can't be tampered with by the batching client. Aggregators would then maintain query/privacy budgets per aggregation window and would be able to refuse queries on reports that fall in an aggregation window whose budget is already spent.
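The per-window query budget tgeoghegan describes could be sketched as follows. This is an illustrative sketch, not anything in the spec: the window size, the one-query budget, and the `WindowBudget` class are all assumptions. The report timestamp is assumed to be inside the encrypted input, where the batching client can't tamper with it.

```python
# Sketch of a per-aggregation-window query budget: an aggregator
# refuses queries on reports whose window's budget is already spent.
# Window size and budget are illustrative assumptions.
WINDOW_SECONDS = 3600          # assumed aggregation window
MAX_QUERIES_PER_WINDOW = 1     # each window may be queried only once


class WindowBudget:
    def __init__(self):
        self.spent = {}  # window start -> number of queries served

    def window_of(self, report_timestamp: int) -> int:
        # Truncate the (encryption-protected) timestamp to its window.
        return report_timestamp - report_timestamp % WINDOW_SECONDS

    def try_query(self, report_timestamp: int) -> bool:
        w = self.window_of(report_timestamp)
        if self.spent.get(w, 0) >= MAX_QUERIES_PER_WINDOW:
            return False  # budget spent: refuse the query
        self.spent[w] = self.spent.get(w, 0) + 1
        return True
```

Resubmitting the same reports in a differently sized batch would land them in an already-spent window, so the aggregator would refuse the query.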

@csharrison
Contributor Author

csharrison commented Jul 7, 2021

Can you elaborate on this? Is the idea here that when a client (that is, a real end user client, not a batching one) gets a 200 OK after posting a report to a batching client, the client can be assured that its report has been durably persisted somewhere? I think a leader server could provide a similar guarantee at the end of the upload phase so i'm trying to understand what extra assurances the batching client provides.

I think the use case is more that the collector is assured the system is working without requiring an interaction with the helpers. It is possible this case could be met by introducing some "do I have some reports" functionality, though.

IIUC the query flexibility is because the batching client can submit the same reports multiple times, in different-sized batches. If we decide this query flexibility is bad, we could mitigate this by having the original client include a report timestamp in the encrypted input, where it can't be tampered with by the batching client. Aggregators would then maintain query/privacy budgets per aggregation window and would be able to refuse queries on reports that fall in an aggregation window whose budget is already spent.

I think this is one part of it. There are a few ways this batching introduces flexibility, even if reports can only be queried once. Mainly this is via separating/combining reports across multiple in-the-clear dimensions (in our design we expose some info in the clear, like the advertiser site a user converted on). A collector could combine multiple small advertisers' reports together if they are individually too small to receive aggregate data. This is recoverable with a robust query model in the helpers, though it adds complexity.

Another example along these lines is time-based querying. One collector might want data on hour boundaries, another might want on 4 hour boundaries etc.
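The hour-boundary vs. 4-hour-boundary flexibility amounts to bucketing reports by an in-the-clear timestamp with different window sizes. A minimal illustration follows; the `bucket` helper is hypothetical, not protocol machinery, and assumes each report carries a cleartext timestamp alongside its encrypted payload.

```python
# Sketch of time-based query flexibility: the same stream of reports
# can be grouped on different boundaries (hourly for one consumer,
# 4-hourly for another) without any support in the aggregators.
from collections import defaultdict


def bucket(reports, window_seconds):
    """Group (timestamp, encrypted_report) pairs by window start."""
    buckets = defaultdict(list)
    for ts, blob in reports:
        buckets[ts - ts % window_seconds].append(blob)
    return dict(buckets)
```

One collector could call `bucket(reports, 3600)` and another `bucket(reports, 14400)` over the same stream, each getting its preferred boundaries.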

@cjpatton
Collaborator

Closed the PR, but here's where we left the discussion: #78 (comment)

@cjpatton
Collaborator

Seems like the protocol already has everything needed to address this issue. In a combined Collector-Leader deployment, the details of the upload protocol in the spec can probably just be disregarded. What matters for interop in that case is the aggregation and collection flows running between the Leader and Helper.

Closing as "won't fix". @csharrison please feel free to re-open if there's more to discuss.

@cjpatton closed this as not planned on Sep 20, 2023