June 28, 2021
Meet link: https://meet.google.com/jnn-rhxv-nsy
Previous meetings: https://github.com/WICG/conversion-measurement-api/tree/main/meetings
Use Google Meet “Raise hand” for queuing
- Introductions
- Scribe volunteer
- (Charlie Harrison) Debugging (#160)
- Aggregate privacy-safe reports
- Other ideas? Debugging mode?
- (Charlie Harrison) Architecture questions → where to send encrypted reports for aggregation?
- Discussed a bit here
- Latency requirements
- Data ownership, etc
- (Maud Nalpas: quick note on the new README, especially the external documentation at the bottom; more to come there)
- Any other business
- Brian May (dstillery)
- Charlie Harrison (Google Chrome)
- Brendan Riordan-Butterworth (IAB Tech Lab / eyeo GmbH)
- Abdellah Lamrani Alaoui (Scibids)
- Andrew Pascoe (NextRoll)
- Larissa Licha (NextRoll)
- Erik Taubeneck (Facebook)
- Sean Bedford (Facebook)
- Aleksei Gorbushin (Walmart)
- Bill Landers (Xandr)
- Andrew Knox (Facebook)
- Maud Nalpas (Google Chrome)
- Ryan Avecilla (Neustar)
- Julien Delhommeau (Xandr)
- Viraj Awati (Amazon)
- Marshall Vale (Google Chrome)
- Erik Anderson (Microsoft)
- Aditya Desai (Google)
- Przemyslaw Iwanczak (RTB House)
- Daniel Rojas (Google Chrome)
- (Charlie): One of the concerns raised is that debugging is hard when things don’t work as expected, even in the presence of 3rd-party cookies. If there’s any mismatch, it’s hard to identify the culprit. We came up with something that’s in this issue (#160): send privacy-safe error reports that might be useful in aggregate and would not reveal cross-site information, e.g. flagging misconfigured tags. Anything involving cross-site data would either not be revealed or would be sent through the aggregate API. These don’t necessarily tell the whole story; we might need more. We had some concerns about the privacy of that. Fundamentally, there’s just a big tradeoff between user privacy and the ability to debug. One thing we’re looking for is feedback on this.
- (Brian) At the risk of suggesting something controversial, wouldn’t this ultimately be an error in the Chrome browser that would be sent back to help Chrome developers review it?
- (Charlie) One error that may come up is failing to reach some endpoint. That might be something that’s not the browser’s fault, but that you’d want to know about. I think there’s a large class of those. Similarly, there may be genuine bugs in the browser. If it’s difficult to find these, they may persist. It’s pretty difficult to test these end to end.
- (Brian) Perhaps one strategy is to have a set of services that ping things that should be working. That could be done in a way that’s privacy-preserving, since it doesn’t involve user data.
- (Charlie) Would that be owned and operated by the browser, or by a 3rd party?
- (Brian) I would set something up in my system. That happens all the time now. As long as the browser is giving the hints.
- (Charlie) The hints are the problem: a priori, you may not know what to set up to monitor. If you’ve done it preemptively, you’d be fine. But if you see some loss, what is your first step, rather than taking a scattershot approach across a complex API?
- (Brian) When I talk about hints, the browser says “this is how we’re going to do this thing” and “this is how you go about testing that this thing is doable by us”. You might want the browser to be able to generate random requests that could send something if it’s broken.
- (Erik) Which of these are in the origin trial?
- (Charlie) Event-level clicks are in the trial; view-through and aggregate are not.
- (Maud) There were originally fallbacks for retries; they were removed due to complexity.
- (Charlie) Something we could consider. Do you have any resources on why this was removed? Did they find it not useful?
- (Maud) I could dig it up; it was for a number of reasons.
- (Aleksei) We want to know if the browser is operating correctly. Is there any way we can use the headers?
- (Charlie) That is along the lines of what we are thinking of. Not reaching the endpoint is just one example; we might want to enumerate them. For instance, there might be misconfigured tags, where the browser doesn’t understand the request. If we have rate limiting, do we want to send an error that says there have been too many requests? That could be a privacy violation. You might be able to see things locally in the console, but not globally. There are also storage limits: if a user sees a million impressions and that’s going to use up all their disk space, is that something where we want to send an error?
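- For illustration only, a minimal sketch of what such a privacy-safe error report could look like; the field and category names below are assumptions for this sketch, not taken from the actual proposal in #160:

```ts
// Hypothetical shape for a privacy-safe error report. Each report carries a
// coarse, single-site error category and no cross-site identifiers.
interface AttributionErrorReport {
  // Coarse error category, e.g. a misconfigured tag or an unreachable endpoint.
  type:
    | "malformed-source-registration"
    | "report-endpoint-unreachable"
    | "rate-limit-reached"
    | "storage-limit-reached";
  // The origin whose tag or endpoint failed; one site's own data only.
  reportingOrigin: string;
  // Optional coarse time bucket so failures can be trended in aggregate.
  dayBucket?: string;
}

// Anything that depends on cross-site state (e.g. an attributed conversion
// being dropped) would instead be surfaced through the aggregate API, so no
// individual cross-site event is revealed.
```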
- (Brian) I second the suggestion of starting a list. I’m not sure we can deal with them all in general ways; I think you have to deal with them specifically and maybe classify them. Starting a list would be a good idea.
- (Charlie) I think John has started on these. He can start a list; he’s been the main person working on this.
- (Erik) I assume we have a local debugging mode; we could use the Aggregate API to “privatize” the state and return it for debugging.
- (Charlie) We do have a debugging mode, so you get reports immediately and it removes noise.
- (Charlie) We have a mode where you can get more fine-grained debugging information. It wouldn’t be enabled on normal devices, but maybe in a managed organization you could turn it on.
- (Brian) Having personally exploited this kind of thing, you want to be careful about having two classes of browsers. Someone will figure out how to exploit that.
- (Charlie) Can you elaborate on that?
- (Brian) In a previous product, the debug mode was used to take over the machine on Windows. If you do that in the browser, someone will figure out how to get in.
- (Charlie) This is an existing feature: enterprise mode, which can set certain flags.
- (Maud) What’s hard at the moment is debugging at scale. In the current origin trial you can already see issues in the console, but detecting what could go wrong for the end user is hard. We’re looking for feedback on what would be needed; maybe knowing that you dropped x% would be enough. Listing the issues, as Brian mentioned, would help.
- (Maud) Wanted to highlight that we updated the README. Originally this covered event-level measurement for clicks; we now have views, aggregate, etc. The other thing: if you scroll down, we’ve linked to external documentation to make it more discoverable. This will keep evolving, with API guides, higher-level intros if you’re less familiar, and others that are more in depth. If you have suggestions for what would be helpful, it would be awesome to hear; please open an issue.
- (Charlie) Wanted to get feedback on feeding the reports to the ad tech rather than directly to the aggregation helpers (a rough sketch of this flow follows the questions below). The main reasons:
- Avoids needing the aggregation helpers to be always online; if one is offline it might lead to data loss.
- The helpers would otherwise need to maintain state for everyone, a big responsibility: they’d be responsible for redundancy, etc. And if something bad happens, it’s not their data.
- Gets at a data ownership question: the data is owned by the ad tech even if they can’t read it.
- Gives the ad tech some flexibility by being able to filter the reports.
- Also some major cons:
- If we have strict latency requirements, there is a challenge with this batch-processing model: you need to collect the reports and then send them all to get processed. If latency is a big concern, we might want a streaming model, where results could be computed online rather than in one big chunk of compute at the end.
- The ad tech could associate an IP address, or other metadata, with these reports.
- Questions:
- Would you be comfortable with the aggregation servers holding the data?
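- A rough sketch, under assumptions, of the flow Charlie describes: the browser sends encrypted reports to the ad tech’s reporting endpoint, which stores them and later forwards a batch to an aggregation helper. The field names and batch format here are illustrative, not part of the proposal:

```ts
// The ad tech can store, retry, and filter reports, but cannot decrypt them;
// only the aggregation helpers can. (This is the data-ownership upside and
// the metadata/IP downside discussed above.)
interface EncryptedReport {
  payload: string;    // ciphertext readable only by the helper servers
  sharedInfo: string; // cleartext metadata attached by the browser
}

const pending: EncryptedReport[] = [];

// Browser -> ad tech: accept and store the opaque report.
function onBrowserReport(report: EncryptedReport): void {
  pending.push(report); // a real system would write to durable storage
}

// Ad tech -> helper: the batch model discussed above. A streaming variant
// would instead forward each report immediately on arrival.
async function flushBatch(helperUrl: string): Promise<void> {
  const batch = pending.splice(0, pending.length);
  await fetch(helperUrl, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ reports: batch }),
  });
}
```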
- (Andrew) I would say I strongly prefer the encrypted things going to the ad tech side. It gives us a lot of ability to know reports are coming through. If something goes wrong, we could re-query. We already collect a lot of data, so not something new.
- (Brian) I think there’s cases where you’d want things quickly. In other cases, you’re going to want to hold things on your server.
- (Erik) I think you can solve the first con by having the Ad Tech just forward immediately.
- (Charlie) One way to view the second con is: how much flexibility do you want the ad tech server to have in terms of queries? They might be able to filter out something coming from a certain geo IP.
- (Brian) Is the intent to provide specific sets of data to specific ad tech servers? It’s all opaque; could it be distributed? Ad tech servers might not know what they are caching.
- (Charlie) This sounds like adding another MPC system on top of the one we have. It’s maybe a middle-ground compromise. With a distributed system it may also be hard to see if you’re missing reports. We could think about something like this.
- (Brian) Sounds like there are two different use cases: making sure it was routed and giving people ownership.
- (Erik)
- There’s a 3rd con: the need to avoid replay attacks. Probably not too difficult, but it exists.
- Helper nodes could strip IP addresses.
- (Charlie) Seems a bit odd to route from helper -> ad tech -> helper, but it does make sense to allow ad tech to make their own trade-offs around uptime, storage, etc.
- (Abdellah) Comment on the README: thank you, it’s much easier to navigate. I was wondering if we could add something about the business use cases. I know there were business folks who were confused about how these could be used for campaign optimization, billing, etc.
- (Charlie) Right now, the sensitivity question applies to the aggregate service, but those limits are maintained exclusively on the client side. Whatever window (week/month) we end up choosing will have a local value tracking how much has been contributed to it. It’s important to separate that from any querying that happens; that’s orthogonal. Once the reporting endpoint has the reports, the aggregation servers just need to do some simple replay protection: you can’t ask questions of these reports more than N times, where N is some small number. When we’re talking about sharing budgets, it’s only within one of these partitions. The basic question is whether there are multiple reporting endpoints. If there is one endpoint, there’s no sharing. If the publisher has a reporting endpoint and the advertiser has an endpoint, right now it’s first come, first served. I think that’s a problem and something we can think about; right now there’s no mechanism, it’s just a shared public resource that can be consumed by any reporting service. The reason the budget has to be shared: otherwise it would be really easy to abuse the reporting API by sending to ad-tech-1.com, ad-tech-2.com, etc., which all forward to ad-tech.com.
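- A minimal sketch of the replay protection mentioned above, assuming reports can be keyed by some per-report ID (an assumption for illustration; how reports are identified is not specified here):

```ts
// The helper refuses to process a report in more than N queries, for some
// small N. Here N = 3 is an arbitrary placeholder.
const MAX_QUERIES_PER_REPORT = 3;
const queryCounts = new Map<string, number>();

function mayUseReport(reportId: string): boolean {
  const used = queryCounts.get(reportId) ?? 0;
  if (used >= MAX_QUERIES_PER_REPORT) {
    return false; // report has been queried too many times; reject as a replay
  }
  queryCounts.set(reportId, used + 1);
  return true;
}
```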
- (Abdellah) Will it be set by the reporting origin? Can they specify the value? The L1 would not be the same for a campaign that generates a lot of conversions as for one that doesn’t.
- (Charlie) As for how the L1 is set: we set it to something arbitrary and fixed for everyone. Instead of letting it change, we allow the contributions to scale. For all these partitions the L1 is fixed, to say 1, but a contribution can be any real number between 0 and 1. So you can set the contribution to 1 if you only expect 1 conversion, or 0.1 if you expect 10. This allows the browser to maintain a fixed L1 that can be hard-coded. To make things compatible with MPC we can’t work with real numbers, so we propose a discretization to integers 0 to 2^16; it’s just a discretization factor.
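- A worked example of this scaling, with a hypothetical helper name; only the fixed L1 of 1 and the 0 to 2^16 discretization come from the discussion above:

```ts
// The browser hard-codes L1 = 1. The ad tech scales each conversion's
// contribution so the expected total stays within L1, then discretizes it to
// an integer in [0, 2^16] for MPC compatibility.
const SCALE = 1 << 16; // discretization factor, 2^16

function discretizedContribution(expectedConversions: number): number {
  const realValue = 1 / expectedConversions; // L1 = 1 split across conversions
  return Math.round(realValue * SCALE);
}

discretizedContribution(1);  // 65536: expect 1 conversion, use the full budget
discretizedContribution(10); // 6554: expect 10 conversions, 0.1 each
```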
- (Erik) The publisher / advertiser should be responsible for specifying how to use the budget, and the default should be an even split.
- (Charlie) Unclear what “even” would be without any specification.
- (Aleksei) The reporting calls: do they happen while the user is browsing the publisher?
- (Charlie) In the current design it happens divorced from browsing: after the user browses to the conversion site, plus some delay.