Authors:
Akshay Pundle, Google Privacy Sandbox
Trenton Starkey, Google Privacy Sandbox
This document describes the proposed inference capabilities on Bidding and Auction Services (B&A). We examine the building blocks that B&A will provide for running inference.
B&A enables real-time bidding and auctions for the Protected App Signals (PAS) and Protected Audience (PA) APIs. Real-time bidding allows advertisers to bid on each impression individually. Calculating the bid is a core function in B&A. The B&A servers run in the cloud in a Trusted Execution Environment (TEE) and apply technical data protection measures through different phases of the auction including when bids are computed.
During bid calculation, it is often beneficial to make inference predictions. This is the process of using a pre-trained ML model supplied by the Ad Tech to make predictions using the given input. For example, predicting click-through (pCTR) and conversion rates (pCVR) can help to produce the right bid. To this end, we aim to enable the running of such inferences against a variety of models.
Bidding and Auction (B&A) will be expanded to include inference capabilities. This means B&A will have the ability to run pre-trained ML models supplied by the Ad tech to make predictions. Initially, we will concentrate on supporting inference for Protected App Signals (PAS). Details on how to do inference for Protected Audience API will be shared in a later update.
Inference will be supported from the PAS user-defined functions (UDFs)
generateBid() and prepareDataForAdRetrieval(). Throughout this document
we will use generateBid()
as our example. generateBid()
is a UDF defined by
the ad tech that includes ad tech business logic and encapsulates the logic for
calculating bids. The same inference capabilities will be available in
prepareDataForAdRetrieval()
and may be extended to other ad tech UDFs in the
future. For more details on these functions see Protected App Signals
(PAS)
We will support two main ways of running inference:
- Inference with embedded models: Models are encapsulated in ad tech provided JavaScript code
- Inference with externally provided models: Models are external to the ad tech provided JavaScript code and APIs are provided to access them
Models can be embedded in the ad tech's implementation of generateBid()
. The
ad tech can embed simpler models in their code, either implemented in JavaScript
or with WebAssembly (Wasm).
With this method, the model's binary data and all code required to use it for inference are bundled in the ad tech's provided code. This method does not require fetching the model binary data (such as weights) externally. Instead, it relies on all data being bundled along with the code.
This method can support fully custom models. Since the model and code are fully controlled by the ad tech, any compatible model format and code for inference can be used: code written in JavaScript and code that can be compiled into Wasm.
This method works better for smaller and simpler models.
- Complex models often require large frameworks or libraries, and other initialization, which may incur a higher cost to run inside the JavaScript sandbox.
- Not all model code can be translated into Wasm or JavaScript in a straightforward way. It may be more difficult to translate complex models and supporting code. There is a tradeoff between model size and memory cost.
In this method, we will support loading binary model data from external sources.
The models will be fetched periodically from a linked source (e.g. Google Cloud
Storage buckets, Amazon S3 buckets) and loaded into memory. An API
will be provided to generateBid()
to run inference against the loaded models.
The whole operation of invoking runInference()
, running predictions using the
loaded models and returning the values to the caller will take place inside a
single TEE. Data will not leave the TEE, and will not be observable externally
either. Outside observers will not be directly able to observe how many times
inference was invoked, if at all.
We will initially support Tensorflow and PyTorch based models.
We will provide the following APIs, which will be usable from ad tech code:
- Run inference using the external models: This is the main way to use
inference. A batch of inputs can be constructed in ad tech code and passed to
the API. Inference will run for each input on the given pre-loaded model and
the responses will be returned. The call, for illustrative purposes, may look
like the following:
runInference(modelPath, [input1, input2...])
- Get the list of paths of models available for use: This would get the list
of models currently loaded and available for use in the server. The paths of
models would typically encapsulate version information (see Versions in
B&A). Along with the versioning strategy, this API gives the ad tech
precise control in choosing versions of models to run inference against. The
call, for illustrative purposes, may look like the following:
getModelPaths(...)
Note that these APIs are for clarity only and may not reflect the actual specification. We will define the actual API in more detail in future iterations of this explainer. These APIs will also be callable from Wasm.
Model inputs can use private features (e.g. user data) and non-private features (e.g. ad data). Private features are available only inside the TEE and so, such inference must run inside the TEE. This is because the private data should not leave the privacy preserving boundary. Non-private features can be available on the contextual path (see B&A system design), and models relying only on such data can run outside the TEE, on ad tech hardware, and may also pre-compute inferences.
Common bid prediction models use contextual, ad and user features for prediction. Out of these, only the user features are private. The machine learning models using such private features are the only ones that need to run inside the TEE. Contextual and ad features are not private, so pre-computation on these can be done outside the TEE.
For example, a model that relies only on contextual features can be run in the
contextual path during the RTB request. A model based on only ad features can
pre-compute results per ad and upload them to the K/V server. Such results can
be queried during request processing and made available to generateBid()
.
On the other hand, a model that uses user features, or one that uses user and ad features will need to run inside the TE, and cannot run outside the TEE boundary.
A unified approach to such a prediction could run a model that takes contextual, ad and user features as input, and makes a prediction, running the entire model inside the TEE. B&A will support unified models directly through the two ways of running inference (embedded and external models) described above. In the unified approach, the entire model will need to run inside the TEE, since it uses private features. These models will tend to be larger. Both these factors can increase the cost and latency of running such models.
Model factorization is a technique that makes it possible to break a single model into multiple pieces, and then combine those pieces into a prediction. In B&A-supported use cases, models often make use of three kinds of data: user data, contextual data, and ad data.
As described above, in the unified case, a single model is trained on all three kinds of data. In the factorized case, you could break the model up into multiple pieces. The model that includes the sensitive user data needs to be executed within the trust boundary, on the buyer's bidding services.
This makes the following design possible:
- Break the model up into a private piece (the sensitive user data) and one or more non-private pieces (the contextual and ad data).
- Optionally, inferences from some or all of the non-private pieces can be
passed in as arguments to a User Defined Function (UDF), e.g.
generateBid()
. For example, contextual embeddings can be computed along the contextual path, returned to the seller's frontend in the contextual response, and then can be passed into the B&A auction as part of thebuyer_signals
constructed by the seller when they initiate the Auction. - Optionally, ad techs can create models for non-private pieces (such as ads features) ahead of time, then materialize embeddings from those models into the ad retrieval server and fetch these embeddings at runtime.
- To make a prediction within a UDF, combine private embeddings (from inference within the TEE) with non-private embeddings (from UDF arguments or from the ad retrieval server) with an operation like a dot product. This is the final prediction.
The following figure shows a factorized pCTR model broken into three parts, contextual, user and ads towers. Only the User tower needs to run inside the TEE. The contextual and ad towers can run outside the TEE since they do not rely on private features.
Figure 1. Factorized pCTR ModelFactorized models are useful because loading and executing large models inside the TEE may be expensive. Oftentimes, only a small part of the model relies on private data (like user data), and the rest needs non-private (e.g. contextual or ad) data. With factorization, parts that depend on non-private features can be outside the TEE where they can be executed on ad tech hardware, while keeping the models requiring private features inside the TEE.
Factorization can be utilized only when the model can be broken down into its constituents (as described above) fairly easily. If models rely on a lot of cross features (e.g. ad x contextual information), factorization may not be an effective strategy.
We will support data flows that allow querying and passing such embeddings to
generateBid()
, where they can be combined with user (private) embeddings to make the final prediction. The next section describes this data flow.
B&A will support loading models inside the TEE. B&A servers will fetch and load
ad tech models periodically from a connected cloud bucket and make them
available for inference in the ad tech provided code (e.g. generateBid()
). The
loading will be performed in the background, separate from the execution of
generateBid()
. After loading, the models will also be pre-warmed in the
background, so that when the models are used in the request flow, they are
already warmed.
The loaded models will be executed in the supported frameworks (e.g. Tensorflow,
PyTorch). When runInference()
is invoked from within `generateBid(), the inference request will be forwarded to the corresponding executor for processing, and the results will be returned back to the calling function. These calls occur within the confines of a single TEE host, and data never leaves the TEE.
Note that runInference()
will not call out to the cloud buckets, or trigger
loading of the model. It will only execute requests against models that have
already been loaded.
To support factorized models, B&A will support data flows that enable
transferring embeddings produced from the model parts to generateBid()
, so
that they can be combined together with the prediction made through the
runInference()
call. These flows are based on the existing request
processing flows in B&A.
B&A will support these flows as following:
- Support passing embeddings generated as a part of the Contextual RTB auction
to
generateBid()
throughbuyer_signals
. These embeddings can be produced in real time, or can be looked up from ad tech servers during the contextual RTB request processing. - Support model loading and inference from
prepareDataForAdRetrieval()
through therunInference()
call.prepareDataForAdRetrieval()
can be used to generate private embeddings that are passed to the Ad Retrieval service to filter and fetch Ads. - Support querying embeddings from the Ad Retrieval service (runs inside TEE)
and passing them to
generateBid()
. These embeddings will be produced offline and uploaded to the Ad retrieval service. The keys will define what the embeddings are for, and appropriate keys can be looked up during retrieval request to fetch the pre-computed embeddings - Support model loading and inference in
generateBid()
through therunInference()
call, as described above. Here, inference runs inside TEE and can use private features based on the inputs ofgenerateBid()
. This produces an embedding that can be combined with the embeddings from the contextual path and ad retrieval service, as described above. The data flows make these embeddings available ingenerateBid()
so they can be combined with the private embedding to make a final prediction.
The entire flow is encapsulated in the diagram below along with an explanation of each step. This flow is based on the Protected App Signals flow which fits into the B&A request processing flow.
Figure 3. Sample flow for factorized models- Client sends a request to the seller's untrusted server.
- As part of real-time bidding (RTB) ad request, the seller sends contextual data (publisher provided data) to various buyers. At this stage, buyers run an ML model that requires only contextual features. This request occurs on the contextual path not running on the TEEs. It runs on ad tech-owned hardware of their choosing.
- Once the contextual requests are processed, the contextual ad with bid is
sent back to the seller. Along with this, the buyer can send
buyer_signals
, which the seller will forward to the appropriate buyer front end (BFE). Thebuyer_signals
is an arbitrary string, so the buyer can send the contextual embedding and the version of the models being used (thatgenerateBid()
should use) as a part of this. - The seller server forwards the request to the seller front end (SFE), which runs inside the TEE.
- The SFE sends requests to the buyer's BFE. It will include the
buyer_signals
as a part of the request. - The BFE sends a request to the Bidding service. The Bidding service runs
prepareDataForAdRetrieval()
that can optionally invoke inference. This UDF prepares the request to send to the Ad retrieval server. - The request prepared above is sent to the Ad retrieval server, that will run logic to fetch Ads.
- The retrieved ads are sent back to the bidding service, where they are given
as input to
generateBid()
. Ad embeddings can be included as a part of this response. The Bidding service then invokesgenerateBid()
. This call includes thebuyer_signals
and data from the Ad retrieval server. Thus, contextual embedding and version (part ofbuyer_signals
) and the ad embeddings (part of data looked up from the Ad retrieval server) are sent to thegenerateBid()
invocation. - The calculated bid is returned. To compute the bid,
generateBid()
can run models throughrunInference()calls
and combine the output embedding with other embeddings that were fetched above to make a final prediction.
In the above flow, prepareDataForAdRetrieval()
and generateBid()
can use
external models through the runInference()
API call. This runs inside the TEE
and can perform inference with models using private features (e.g. the user
tower). Alternatively (not shown in figure), each of the UDFs can use embedded
models to perform inference.
Inference on B&A will support factorized models. There are multiple models and embeddings that contribute towards the final prediction, some of which may be pre-computed. It is important to make sure that model and embedding versions used are consistent.
For facilitating this, we will provide the buyer a way to choose the version of
models to run per request. As a part of contextual request processing, a version
for inference can be decided by the buyer. The model version information can be
passed through the buyer_signals
. This field is a string and can be used to
pass arbitrary information from the contextual RTB request to generateBid(). buyer_signals
will also be passed to the K/V server where this model version
information can be used to look up embeddings. The ad tech is free to choose the
structure of the buyer_signals
including how the model information is passed
and what it contains. For example, something like the following could be used:
buyer_signals = {
...
"model_version": { "pCVR_version": 1, "pCTR_version": 2 }
...
}
The runInference()
API runs inferences against a named model. generateBid()
is responsible for interpreting the version included in buyer_signals
and
translating it to an actual path that corresponds to an entity in the cloud
bucket. A simple translation scheme could be something like
${model_name}-${model_version
}. This name will be translated into a cloud
bucket folder from which the model was loaded as shown in the figure below. This
figure is for illustrative purposes only. It shows the mapping from model path
to an actual path in the cloud bucket. The cloud bucket will only be synced
periodically, so each runInference()
call will not call out to the cloud
bucket. Thus, there will be some delay between a new model being available in
the cloud bucket, and it being made available to runInference()
. We will cover
such details in future updates to this explainer.
The getModelPaths()
function described earlier can be used along with
the version information to implement a versioning strategy by calling
runInference()
with the chosen version. Here are some examples how this
could be done:
// run inference against a hardcoded version of a model
modelPath = translateToPath("pcvr", 1.0) // custom ad tech logic
runInference(modelPath, [inputs...])
// run inference against version specified in buyer signals
modelPath = translateToPath("pcvr", buyer_signals.model_version.pCVR_version);
runInference(modelPath, [inputs...])
// run inference against the latest version of a model
models = getModelPaths()
latestPcvrVersion = findLatestPcvr(models) // custom ad tech logic
modelPath = translateToPath("pcvr", latestPcvrVersion);
runInference(modelPath, [inputs...])
Ad tech will have full control over what platforms and model types are used here. The models can be encapsulated in the ad tech provided JavaScript or Wasm code, thus JavaScript and any language that compiles to Wasm will be supported. Ad tech can bundle any libraries of their choice along with their code, so any frameworks or custom written code can be supported as long as it is capable of executing in the JavaScript sandbox.
We will initially support 2 popular ML platforms: Tensorflow and PyTorch. More detail to come in future updates on what model formats we will support based on ecosystem feedback.
Accelerator hardware availability is dependent on the cloud platforms. As of this writing, confidential processing within TEEs takes place on CPUs. If or when additional hardware like GPUs becomes available in cloud provider TEE environments, we look forward to supporting them.
Note that this pertains only to models executed inside the TEE (i.e. on B&A servers). For factorized models with components outside the TEE , the ad tech is free to choose whatever hardware they prefer.
Larger models can help ad techs get higher utility while trading off cost. Larger models have more memory and computation cost, and thus can be more expensive to run. They can also cause greater latency in predictions. Thus, the size of models that are reasonable to run depends on ad tech setup and the tradeoffs made.
We will not have any particular model size limits, and it will be up to the ad tech to determine the cost vs utility curve to see what sizes of models are sufficient, and how many inferences are made per request.
For external models, based on our experiments, models up to 100Mb should have a reasonable cost and latency. Ad techs should verify performance against their own models to make sure the cost, utility and latency is in line with their expectations.
Embedded models are much more custom, and ad techs should experiment with their setup to determine reasonable limits based on their cost and latency thresholds.
For inference, the cost predominantly comes from the resource usage to perform inference (CPU costs, in our case). There can be some overhead from various factors (e.g. translating inputs during request dispatch, fetching and storing models etc.) Since this translates to cost, we will try to minimize such overhead.
We have seen that CPU utilization for inference scales with model size and number of inference requests. Larger models consume more resources, as do more requests. This can be a factor in considering how you factorize your models.
For example, if the model running inside the TEE uses only user features,
then the results from this model can be shared across all ads being processed in
that invocation of generateBid()
. Thus, inference needs to be called only once
per model class (i.e. once for pCVR, once for pCTR, etc.). This cost will scale
with the number of model classes.
On the other hand, if the model running inside the TEE uses user x ad
features, then the predictions will be different per ad. Thus, inference will
need to be called for each ad that is being processed in generateBid()
. So,
for each ad, pCVR and pCTR models will need to be called. Here the cost will
scale with the (number of model classes) x (number of ads).
We will provide more guidance around cost aspects in future updates, or in a separate explainer.