One of the key features of VETRI’s user-controlled digital identity solution is that users should be able to easily start monetizing their personal data in a fully secure and controlled fashion on the VETRI marketplace. In return, data consumers gain direct access to reliable, anonymized data in a compliant and cost-efficient way.
How to most effectively match users and data consumers on the VETRI marketplace is where VETRI’s Matching Engine comes into the picture.
In this longer-than-usual blog, we would like to outline some of our thinking on this. First, though, some context for those who might be reading this without first having studied our White Paper: Users’ sensitive data is stored locally on their mobile devices in a VETRI wallet. By design, VETRI will not be able to access, read or collect any information contained in users’ wallets. VETRI will not, therefore, have a database of users’ sensitive data.
Working without a honey pot
VETRI can still match users and data consumers because it assumes that data consumers do not need to know people’s names or contact details to run effective marketing campaigns. Our assumption is that all they require is assurance that they can reach a desired target audience, defined by a combination of users’ demographic data and their tastes and preferences (psychographic data). With VETRI, users will be able to share, either proactively or on demand, anonymized demographic and psychographic data, thus enabling the creation of target audience filters on the VETRI platform for digital advertisers to use. As an extra layer of protection, VETRI ensures that these advertisers can never directly read or copy the information. Instead, they are simply assured that a defined pool of users corresponds to their audience criteria and has consented to receiving advertising from them or from similar companies, as specified by the users’ interest and preference settings.
This is where the Matching Engine comes in. The Matching Engine selects optimal samples along predefined characteristics. And this can be done very flexibly:
- Admissible ranges can be specified for all variables.
- Restrictions can be specified on how the variables are distributed in the sample, and on how they move together.
- Restrictions may be hard or soft: Under a hard restriction, user data does not enter the sample if it does not match certain conditions. Soft restrictions mean that, all else equal, users who match the restrictions are preferred to users who do not. Soft restrictions may be asymmetric.
- Missing values may be handled by replacing them with statistical estimates.
In the end, the data consumer gets a highly customizable sample at an optimal price using the VLD token. All data is, of course, anonymized and hence user identity is protected at all times.
Let’s take the following example: For a pharmaceutical cohort study, the Matching Engine may need to select a sample of 5,000 users with the following characteristics:
- Females in Europe
- Aged between 35 and 55, matching the Swiss age structure as a soft criterion
- Normal blood pressure, but a body mass index (BMI) above average
- Salary ideally above EUR50,000 but at least EUR40,000
Even if some users prefer not to offer their salary statistics and even if most users who have responded are European males with an average age of 30, the Matching Engine would still be able to put together a sample to reflect the confidence level required.
A closer look
Since the quality of the data is what matters for pricing, the Matching Engine computes the matching quality of the sample, i.e. it assesses how well a given sample fulfills the specified restrictions.
The simplest restrictions are ranges. For instance, one may require that all individuals in the sample are between 35 and 55 years old. Such restrictions may be hard (“must be”) or soft (“ideally should be”). Soft restrictions essentially give a selection penalty to individuals whose data violates the set conditions. Such penalties may be asymmetric: it may be less desirable to be younger than 35 years than to be over 55.
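A minimal Python sketch of how such a range restriction could be scored. The function names, the linear penalty shape and the two weights are illustrative assumptions, not VETRI's actual implementation.

```python
def is_admissible(value, low, high):
    """Hard restriction: the value must lie inside [low, high]."""
    return low <= value <= high


def range_penalty(value, low, high, weight_below=2.0, weight_above=1.0):
    """Soft restriction: 0 inside [low, high], otherwise a penalty that
    grows linearly with the distance from the range.

    Different weights on each side make the penalty asymmetric
    (assumed here: being too young costs twice as much as being too old).
    """
    if value < low:
        return weight_below * (low - value)
    if value > high:
        return weight_above * (value - high)
    return 0.0


# Ages 33 and 57 are both two years outside the 35-55 range,
# but the younger individual is penalized more heavily.
penalty_young = range_penalty(33, 35, 55)
penalty_old = range_penalty(57, 35, 55)
```

With these assumed weights, a 33-year-old receives a penalty of 4.0 and a 57-year-old a penalty of 2.0, while anyone inside the range receives none.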
Under a hard restriction, individuals who are too old or too young are excluded outright. Under soft restrictions, however, such individuals may still become part of the sample if they contribute much to the matching quality through other variables.
Specifying ranges may not be enough, though, and a data consumer may prefer that the distribution of variables in the sample matches a target distribution. The Matching Engine can handle this too. In the cited example, it might be undesirable to end up with a sample of only 35-year-olds. Suppose age is distributed in the available data as follows, with each grey line representing an individual.
If we randomly selected people between 35 and 55 years, then because of the high proportion of young people in the available data set, the age distribution in the sample will probably be skewed:
Instead, the data consumer may prefer — as a soft restriction — that age is distributed evenly:
(Other target distributions are possible, too: one may, for instance, prefer to match the empirical distribution of variables in a population.)
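One way to quantify how far a sample's age distribution is from a target is to compare binned shares. The sketch below is an assumption for illustration only: the bin count, the function names and the choice of total variation distance as the mismatch measure are ours, not a description of the Matching Engine.

```python
def age_histogram(ages, low=35, high=55, n_bins=4):
    """Count how many ages fall into each of n_bins equal-width bins."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for age in ages:
        if low <= age <= high:
            # The upper edge (age == high) is put into the last bin.
            index = min(int((age - low) / width), n_bins - 1)
            counts[index] += 1
    return counts


def distribution_mismatch(ages, target_shares, low=35, high=55):
    """Total variation distance between sample shares and target shares.

    Returns 0.0 for a perfect match and approaches 1.0 for a complete miss.
    """
    counts = age_histogram(ages, low, high, len(target_shares))
    total = sum(counts)
    if total == 0:
        return 1.0
    return 0.5 * sum(abs(c / total - t) for c, t in zip(counts, target_shares))


uniform = [0.25, 0.25, 0.25, 0.25]
skewed_sample = [35, 36, 37, 38]   # everyone in the youngest bin
even_sample = [36, 42, 47, 52]     # one individual per bin
mismatch_skewed = distribution_mismatch(skewed_sample, uniform)
mismatch_even = distribution_mismatch(even_sample, uniform)
```

A soft distribution restriction could then subtract such a mismatch, suitably weighted, from the matching quality. Replacing `uniform` with population shares would target an empirical distribution instead.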
Distributions as pictured above only relate to a single variable at a time. However, the Matching Engine can make sure that a target distribution is achieved in subsamples; for instance, among all female individuals in the sample. This feature may be used to disentangle correlated variables: For instance, empirically, there is a positive correlation between BMI and blood pressure, which would normally persist in the sample. Because the data set contains enough users, the Matching Engine can be very selective about the individuals it chooses. And even from highly correlated data, one may extract subsets that lack correlation or exhibit exactly the desired correlation features. A graphical example:
In addition, the Matching Engine can also group similar users together in clusters, which is useful for experiments. The engine may also build samples that directly match properties of external datasets, even those of single users.
Another of the Matching Engine’s tasks is the calculation of the marginal contribution that a single user’s data has made to the overall matching quality. In other words, the Matching Engine quantifies how valuable every single user has been in achieving the final matching quality.
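One simple way to think about a marginal contribution is leave-one-out: compare the quality of the full sample with the quality of the sample without a given user. The sketch below assumes this definition and a toy quality function (closeness of the mean age to a target); both are illustrative assumptions, since the post does not specify how the Matching Engine computes contributions.

```python
def quality(ages, target_mean=45.0):
    """Toy matching quality: higher is better, 0.0 is a perfect match."""
    return -abs(sum(ages) / len(ages) - target_mean)


def marginal_contributions(ages, quality_fn=quality):
    """Leave-one-out contribution of each individual.

    A positive value means the sample is better with this user than
    without; a negative value means the user lowers the quality.
    """
    full = quality_fn(ages)
    return [
        full - quality_fn(ages[:i] + ages[i + 1:])
        for i in range(len(ages))
    ]


sample_ages = [40, 45, 50, 65]
contributions = marginal_contributions(sample_ages)
```

In this toy sample, the 65-year-old pulls the mean away from the target and gets a negative contribution, while the 40-year-old offsets that pull and gets a positive one.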
Often, user data will not be complete, so the Matching Engine also takes care of this. Missing data may be handled in different ways. The simplest way is to require complete observations, in which case individuals with incomplete data may never enter the sample. The Matching Engine supports this option, but a better way is to replace such missing values with estimates through a process called multivariate imputation: a model is fitted in the cross-section of individuals, and missing values are replaced by the estimates of this model.
For instance, suppose we are interested in an individual’s blood pressure. Blood pressure can be well predicted by a function of age, sex, number of steps walked per day and other variables. While such a replacement remains an estimate, it is more useful than discarding incomplete data and more accurate than simply taking an average value.
Since the imputation procedure may use all correctly flagged data in the database, such estimates are much more accurate than those that rely on the sample only.
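To make the idea concrete, here is a deliberately simplified sketch of regression-based imputation. It uses a single predictor (age) and ordinary least squares, whereas the multivariate imputation described above would use several variables. The field names and the toy records are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def impute(records, predictor="age", target="blood_pressure"):
    """Fill missing target values with the fitted model's estimates.

    The model is fitted on all complete records, then applied to the
    incomplete ones. Observed values are never overwritten.
    """
    complete = [r for r in records if r[target] is not None]
    slope, intercept = fit_line(
        [r[predictor] for r in complete],
        [r[target] for r in complete],
    )
    filled = []
    for record in records:
        record = dict(record)  # copy, so the input stays untouched
        if record[target] is None:
            record[target] = intercept + slope * record[predictor]
        filled.append(record)
    return filled


records = [
    {"age": 30, "blood_pressure": 110.0},
    {"age": 40, "blood_pressure": 120.0},
    {"age": 50, "blood_pressure": 130.0},
    {"age": 60, "blood_pressure": None},  # missing value to be estimated
]
completed = impute(records)
```

Fitting on every complete record, rather than only on those in the selected sample, mirrors the point that a larger fitting base gives better estimates.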
There remains a final job that the Matching Engine has to do, and it is the most crucial. Given a set of restrictions, it needs to find the sample that optimally matches them. Such a search is a combinatorial problem, so-called because, in principle, there is a straightforward way to solve it: suppose there is data on two million users and a sample size of five thousand is to be selected. Then write down all possible samples of five thousand users, for each compute its matching quality, and keep the best sample.
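To get a feel for the size of this search space, the number of ways to choose 5,000 users out of two million can be estimated on a logarithmic scale. The sketch below uses the log-gamma function from Python's standard library; the function name is our own.

```python
import math


def log10_combinations(n, k):
    """Base-10 logarithm of the binomial coefficient C(n, k).

    Uses log-gamma, so the enormous factorials are never formed.
    """
    log_e = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_e / math.log(10)


# Number of decimal digits in the count of possible samples.
digits = math.floor(log10_combinations(2_000_000, 5_000)) + 1
```

The result is a number with more than 15,000 digits, far beyond the largest value a standard double-precision number can hold (roughly 1.8e308).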
That only works in principle; in practice, the number of possible combinations is so huge that a computer cannot compute it using standard arithmetic. But such a problem can still be solved with modern optimization techniques, which are inspired by the principles of evolution. The search algorithm will, in effect, simulate evolution: it will start with a random combination of users and then, through hundreds of thousands of iterations, improve the solution further and further. This is a lot of computational work for the Matching Engine, but it is made fast by distributing computations over several computers, and also through some clever implementation, which reuses parts of the computation that has already been done.
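The following is a minimal sketch of such an evolution-inspired search, not the Matching Engine's actual algorithm. It assumes a toy cost (distance of the sample's mean age from a target) and a very simple scheme: mutate the current sample by swapping in one random user, and keep the mutation only if it is no worse. All names and parameters are illustrative.

```python
import random


def sample_cost(pool, chosen, target_mean):
    """Toy cost: distance of the sample's mean age from the target."""
    ages = [pool[i] for i in chosen]
    return abs(sum(ages) / len(ages) - target_mean)


def evolve_sample(pool, size, target_mean, iterations=2000, seed=0):
    """Start from a random sample and improve it by random swaps."""
    rng = random.Random(seed)
    chosen = rng.sample(range(len(pool)), size)
    chosen_set = set(chosen)
    cost = sample_cost(pool, chosen, target_mean)
    for _ in range(iterations):
        newcomer = rng.randrange(len(pool))
        if newcomer in chosen_set:
            continue  # already in the sample, try another mutation
        slot = rng.randrange(size)
        candidate = chosen.copy()
        candidate[slot] = newcomer
        candidate_cost = sample_cost(pool, candidate, target_mean)
        if candidate_cost <= cost:  # selection: keep only non-worse mutants
            chosen_set.remove(chosen[slot])
            chosen_set.add(newcomer)
            chosen, cost = candidate, candidate_cost
    return chosen, cost


# Toy pool of 500 users with ages between 20 and 69.
pool = [20 + (i % 50) for i in range(500)]
best_sample, best_cost = evolve_sample(pool, size=20, target_mean=45.0)
```

A production search would typically keep a population of candidate samples, combine them, and update the cost incrementally after each swap instead of recomputing it from scratch, which is one way to reuse computation that has already been done.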
The challenge for us is that, in the early stages following VETRI’s launch, the data set from which a sample is to be selected may be small and biased, given that the composition of the available data will almost certainly not reflect the demographic profile of the population at large. This challenge can partially be addressed by increasing the number of users.
But even if VETRI has an enormous number of users, that data set will still not represent a perfect distribution. And even with a very large number of users, not all VETRI users will want to share the required data, even in an anonymized form.
We can still improve over the alternatives, though, such as random sampling. Even if the available data is biased, selective sampling methods would do better for many applications than drawing randomly. For example, let’s suppose that most users are very young, and that there are many more men than women (i.e. the age and sex distributions are very different from the real world). Then, if a data consumer wishes to obtain a sample that better matches the actual demographic properties of the population, a sampling scheme that gives more weight to groups underrepresented in the database makes a lot of sense.
Our guess is that data consumers will often know quite well what they want. For instance, if they wish to use the sample as the basis for experiments, they may well appreciate the possibility that the Matching Engine can extract samples in which variables are orthogonal (i.e. uncorrelated), despite being correlated in the population and/or the data set.
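A hedged sketch of one such scheme: each user is weighted by the inverse of their group's frequency, and a sample is then drawn without replacement using random keys (the Efraimidis-Spirakis method). The group labels, function names and the choice of this particular method are assumptions for illustration.

```python
import random
from collections import Counter


def inverse_frequency_weights(groups):
    """Weight each user by 1 / (size of their group).

    Every group then carries the same total weight, however
    over- or underrepresented it is in the database.
    """
    counts = Counter(groups)
    return [1.0 / counts[g] for g in groups]


def weighted_sample(users, groups, size, seed=0):
    """Weighted sampling without replacement (Efraimidis-Spirakis).

    Each user gets the key u ** (1 / weight) with u uniform in [0, 1);
    the users with the largest keys are selected.
    """
    rng = random.Random(seed)
    weights = inverse_frequency_weights(groups)
    keyed = [
        (rng.random() ** (1.0 / w), user) for user, w in zip(users, weights)
    ]
    keyed.sort(reverse=True)
    return [user for _, user in keyed[:size]]


# Toy database: 90 men and 10 women, identified by index.
groups = ["m"] * 90 + ["f"] * 10
users = list(range(100))
weights = inverse_frequency_weights(groups)
picked = weighted_sample(users, groups, size=20)
```

In this toy database each woman carries nine times the weight of each man, so women are far more likely to be drawn than under uniform random sampling.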
The Matching Engine is being built to deliver high-quality results in an era where users are claiming sovereignty over the data they generate.