

Jan 24 — 2023

Data Clean Rooms

In this article we return to the subject of Data Clean Rooms (DCRs), which are becoming an increasingly topical challenge due to the regulatory changes around data that are arriving in Europe and, eventually, throughout the world. In case you missed it, I encourage you to give our original article a quick read here.

Data Clean Rooms are being adopted across all facets of data-driven marketing for many use cases, such as:

  • Direct partnerships between media buyers and sellers to improve ROI and yield, respectively
  • Providing a framework for media owners to scale and control their 1st party data
  • Creating one transparent environment to resolve identities in a unified manner
  • Acting as a 2nd party Data Hub to further enrich audience and identity profiles

The value of Data Clean Rooms goes beyond these technical advantages: once implemented, a DCR quickly becomes a critical company asset. It underpins a future-proof data strategy aligned with the company’s interests, ensuring both data quality and robustness in the analyses and decisions that follow.

In this article, we will delve into the inner workings of a DCR and what to keep in mind when setting one up. Specifically, we will cover the following sections:

  • How does a DCR actually work?
  • What technical characteristics does it have to ensure data privacy?
  • What are some common use cases?
 

How does a Data Clean Room work?

Data Clean Rooms are encrypted, secure and decentralised storage locations where first-party data is anonymised, layered, and matched between parties safely. This technology provides a future-proofed solution, safeguarded against new privacy changes, while you remain in complete control of your data.

If you recall from our previous article, there are three main types of DCRs, distinguished mainly by whether you use an existing one (e.g. walled gardens) or build your own DCR together with a partner. For the adventurous types reading this article, we will detail here what a DCR looks like and the relevant points to take into account.

The first step is to decide which relevant 1st party data you want to add to the DCR, such as ecommerce transactions, CRM or CDP records. It’s important to note at this stage that personal and personally identifiable information (PII) is read, encrypted and copied to a client-specific DCR cloud storage; the sensitive data itself is never moved, and only a “hashed” copy of it is present inside the DCR. Data ownership remains on the client side, and no leak or exposure of data is risked while on the DCR platform.

Figure 1. Infrastructure screenshot as one client, part of a multiple-partner collaboration

The selected data sources are then read, encrypted and stored on the DCR provider’s platform, where they are further anonymised and aggregated into user and demographic groups. The resulting data is then available in the DCR for further analysis or overlapping with additional data sources; one example is audience matching between two brands using encrypted data inside the DCR.

In order to avoid fingerprinting and to keep the data anonymous, further mathematical techniques that introduce noise are applied to the dataset. This keeps aggregate data queries consistent while maintaining the privacy of individuals and complying with privacy regulations. It is especially important when dealing with smaller audiences, where the risk of identifying a single individual increases. One simple example is the following:

Suppose a particular person lives in Madrid. And suppose you are collaborating with a partner that wants to identify if this person is on your CRM.

Without privacy controls, after uploading your CRM dataset to the DCR, they could upload a database of their own to their DCR instance, listing thousands of people who live in Barcelona and just this one person who lives in Madrid.

Then, they could ask for statistics on the audience overlaps. If this report revealed that there was at least someone living in Madrid, then they would know that it was this very same person.
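
A rough illustration of the idea, borrowing the Laplace mechanism from differential privacy (the exact noise mechanism varies by DCR provider, so treat this only as a sketch of the principle):

```python
import numpy as np

def noisy_count(true_count: int, epsilon: float = 0.5) -> float:
    """Return an aggregate count with Laplace noise added.

    A count query has sensitivity 1 (adding or removing one person changes
    it by at most 1), so noise with scale 1/epsilon suffices; a smaller
    epsilon means more noise and stronger privacy.
    """
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# The true "people in Madrid" overlap is 1, but the reported value could
# just as plausibly have come from an overlap of 0.
print(noisy_count(1))
```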

In collaboration scenarios, approved partners and advertisers are granted access to the anonymised data by the data owner on the client side, using mathematical models that contain no PII and are shared between the client’s DCR instances. The protected data can be used by partners and advertisers for data analytics, audience intelligence or data enrichment with identity providers, among others. More details on the actual data sharing are given in the section below.

Figure 2. Data can also be matched on multiple IDs at once to eliminate the reliance on one type of identity resolution or partner

Audience activation can be done directly from the DCR when connected through APIs to the programmatic stack, or the relevant insights can be taken outside of the platform to fuel campaign planning as an ad-hoc solution.

All in all, the main DCR providers offer a technology stack rooted in a decentralised and agnostic infrastructure, allowing the greatest degree of freedom for any need while maintaining full data ownership and control.

How is data handled?

As mentioned in the previous section, data remains on the client’s premises and is not stored inside the DCR itself. Instead, an encrypted hashed copy of the data is created and stored for analysis and sharing; but what does hashing mean?

In simple terms, hashing refers to the act of changing a plain text value into an alphanumeric value by using a function. Hashing returns an anonymised version of the input value that can’t be reversed and acts as the first privacy wall against possible data leaks. It’s important to note that hashing always returns the same output for a given input, in order to keep consistency.

Encryption, by contrast, is a two-way process: a private key is generated for the specified data and is used to scramble it. Then, only someone with the same key can decrypt and access the raw data.
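
A minimal sketch of the difference, using Python’s built-in hashlib for hashing and the third-party cryptography package’s Fernet recipe for two-way encryption (DCR platforms use their own algorithms and key management; this is purely illustrative):

```python
import hashlib
from cryptography.fernet import Fernet

email = "jane.doe@example.com"

# Hashing: one-way and deterministic, so the same input always yields the
# same token, which is what makes privacy-safe matching possible.
hashed = hashlib.sha256(email.strip().lower().encode()).hexdigest()
print(hashed[:16], "... cannot be reversed")

# Encryption: two-way, so whoever holds the key can recover the raw value.
key = Fernet.generate_key()
cipher = Fernet(key)
token = cipher.encrypt(email.encode())
assert cipher.decrypt(token).decode() == email
```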

Data Clean Rooms go one step further and apply advanced mathematical models such as homomorphic encryption to ensure data is secure, but also allow the analysis of encrypted data without the need for the private key! Further detail on this model can be seen here, but in short, it allows operations on the encrypted data that can be directly translated to the unencrypted data.

Figure 3. Ordinary encryption process (left) versus homomorphic process (right)

Thus, data owners can share the encrypted data and perform analyses on top of it without ever decrypting it. This is the fundamental piece of privacy that enables DCRs to comply with all privacy regulations and allows DCR users to stay ahead of the market.
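
To make the idea tangible, here is a toy sketch using the open-source python-paillier (phe) package, a partially homomorphic scheme that supports addition on ciphertexts. The schemes used by actual DCR providers differ, so this only illustrates the principle:

```python
from phe import paillier  # pip install phe (python-paillier)

public_key, private_key = paillier.generate_paillier_keypair()

# Two partners contribute encrypted audience counts
enc_a = public_key.encrypt(1200)
enc_b = public_key.encrypt(340)

# The platform can aggregate the ciphertexts without seeing the raw numbers
enc_total = enc_a + enc_b

# Only the holder of the private key can read the aggregated result
assert private_key.decrypt(enc_total) == 1540
```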

Figure 4. Example of audience matching and activation for a brand inside the DCR. Note that your hashed IDs are matched and activated by the partner without any PII being shared

Relevant DCR use cases

Now that we have reviewed how the internals of a DCR work, we will take a look at three use cases in the industry where DCRs present a competitive advantage for companies implementing or collaborating in them.

Single customer view and breaking internal silos

DCRs are the perfect tool for the digital transformation of companies, especially those with disconnected customer data sources and information silos. With DCRs, not only can these sources be brought together, but the (sometimes political) internal barriers that prevent data movement can be circumvented.

These silos can be the internal departments of a company, different companies that are part of the same group, or any similar entity structure where there’s reticence to share data but also a value in analysing it together.

Recall that the actual data is not moved between the instances and only the encrypted data is matched and analysed; in the end, these entities act as if they were different clients, so they can keep ownership of their data while still allowing cross-collaboration.

Figure 5. Different teams or departments can collaborate securely to allow a single customer view across their organisation

Audience overlap

One of the other common uses of DCRs is crossing 1st party consumer information between several parties in order to broaden their perspective on the behaviour of their own consumers.

This benefits all parties involved: this information is usually both very personal (so privacy compliance makes it difficult to analyse across companies) and a major asset for each brand, which makes a private and secure environment such as a DCR the perfect fit.

One specific example could be a clothing retail brand crossing its data with a fashion magazine. The added benefit for the brand is clear: it can understand whether that channel is a relevant media source for its own consumers, and thus a way to improve brand awareness and visibility. For the magazine, the advantage is knowing more about its own readers through their fashion interests, opening further collaboration opportunities with other brands.
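
A simplified sketch of the matching step behind such a collaboration: both parties normalise and hash their customer emails with the same function, and only the resulting tokens are compared (a real DCR adds encryption, noise and access controls on top of this):

```python
import hashlib

def hash_id(email: str) -> str:
    # Normalise before hashing so both parties produce identical tokens
    return hashlib.sha256(email.strip().lower().encode()).hexdigest()

brand_crm = {"ana@mail.com", "luis@mail.com", "marta@mail.com"}
magazine_readers = {"Luis@mail.com ", "pedro@mail.com", "marta@mail.com"}

overlap = {hash_id(e) for e in brand_crm} & {hash_id(e) for e in magazine_readers}
print("Shared audience size:", len(overlap))  # an aggregate; no emails exchanged
```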

Figure 6. Crossing 1st party data while remaining in such an environment allows brands to understand other interests of their consumers

Monetisation of your owned data

This is a relevant use case for media owners and networks, as the prospect of a secure environment and standardised procedures to share their 1st party data is critical to taking their business model one step further. In this use case, brands and interested parties would come to understand their consumers’ media interests and behaviour. The direct impact for brands is that media planning can then determine the expected reach of each channel with more certainty than with the aggregated data sources commonly used today.

Figure 7. Monetise your owned datasets in a safe manner, while allowing brands to enrich their own understanding of their consumers

Conclusion

Data Clean Rooms are an emerging solution that is growing in demand and that many companies are already harnessing to their advantage. We hope this article has given you a new perspective on how a Data Clean Room could suit your organisation’s needs.

We are one of the leading providers of DCR solutions in Southern Europe, and we work with all the major DCR technologies on the market today. If you feel this solution may be the best fit for you, contact us anytime and we will be happy to discuss it in detail with you!

Nov 28 — 2022

Consumer-centric marketing in a cookie-less world

It’s no exaggeration to say that data has become a crucial part of marketing & media planning, fully integrated into current workflows. However, new privacy regulations around personal data and its use are coming (or are already here!), and many brands and marketeers are wondering how they can still understand their audiences and stay relevant while remaining privacy compliant.

Google’s announcement of third-party cookie deprecation is pushing this even further (although browsers such as Firefox or Safari had already removed these tracking solutions in the past), but you may be wondering: why is this having such a large impact? One of the main reasons is that companies believe that cookies are essential to effectively reach their audiences, and that while first-party data is valuable, it’s still far from offering the same value.

We do not agree with this view, and believe that there are options that can bridge this gap by improving on the analytic capabilities surrounding both the collection and use of all available data.

How can we do this?

We encourage brands to not lose sight of what’s important, take a step back, and reflect on the most important factor: relevancy. Traditional media approaches based on psychology and sociology, coupled with the latest advances in technology and data in the MarTech landscape, can provide better insights into audiences’ behaviour and interests. This, in turn, can lead to a unified and holistic solution for all brands.

Backed by Data Science capabilities, this approach can greatly improve consumers’ interactions with the brand across all stages of their customer journey. From our perspective, some important points that brands should bear in mind are:

  • Focus on building better first-party data
  • Unlock the power of “hyper-local” actions
  • Understand the most receptive context
  • Create real-time lead-scoring models
  • Don’t buy into one-size-fits-all solutions
  • Integrate your short-term solutions with a long-term business plan

Focus on building better first-party data

One of the most important points to consider is the intelligent enrichment of the first-party data stack. “Intelligent” because data has traditionally been collected recklessly, gathering every available data point so as not to miss anything.

Nowadays, this strategy must instead be focused on identifying and capturing only relevant information. What constitutes relevant information depends on each company’s audience and strategy, but in the end relies on understanding what their brand represents and how their audience behaves and interacts with them.

Otherwise, Data Science and Artificial Intelligence solutions run the risk of becoming another “black box” where multiple data sources are thrown into the mix without first checking whether they have actual value for the company’s purposes. The risk is precisely that these superfluous data points can affect the result in unexpected ways!

Once the first-party data is correctly designed, advertisers should then turn towards how to leverage this data to, for instance, activate “lookalikes” when cookies are gone. Multiple alternatives exist today, but one such solution is Data Clean Rooms: a secure and private environment to store and analyse relevant anonymous user information, and to further enrich it with additional data sources. Check the following article written by my colleague Jesús Templado for further context and information.

Unlock the power of “hyper-local” actions

Most current solutions in the market are reliant on IDs or identifiers for users based on some encryption and anonymisation process; for example, based on hashed emails or other unique user identifiers. However, Data Science can enable further analysis detail by focusing on location and geographical information instead.

Although direct geolocation data is strictly regulated by the GDPR, complementary data at a highly granular level can help improve the first-party data stack and yield valuable insights to drive the marketing strategy. One such example is Google Search data, which can be analysed with Data Science algorithms such as clustering or regression models to identify regional trends, build keyword universes to understand competitor actions, and run targeted campaigns down to the city level.
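
As a hedged sketch of what such an analysis could look like, the snippet below clusters cities by a hypothetical search-interest profile using scikit-learn’s KMeans (cities, keyword groups and figures are invented for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical search-interest index per city for three keyword groups
cities = ["Madrid", "Barcelona", "Valencia", "Sevilla", "Bilbao"]
interest = np.array([
    [80, 35, 10],
    [75, 40, 12],
    [30, 70, 25],
    [28, 65, 30],
    [55, 50, 20],
])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(interest)
for city, label in zip(cities, labels):
    print(city, "-> regional trend cluster", label)
```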

Understand the most receptive context

Couple traditional adserving strategies and methodologies with the right medium to present your message. This is exactly what content relevancy through contextual marketing provides.

It uses Artificial Intelligence (AI) models to identify and index the content of different websites and place relevant ads on websites with similar content. This is done through a combination of Natural Language Processing (NLP) to analyse the text content of the webpage and the search results, Computer Vision (CV) models to identify the image and video content, clustering models to group and identify the topicality segments for webpages, and so on.

In short, the usefulness of context can help drive a message to users when they are most receptive, and can help increase the efficiency of specific steps in a marketing campaign.

Create real-time lead-scoring models

User intent is defined as the set of signals and actions a user performs that leave a measurable footprint and showcase the user’s inclination towards a certain brand, category or product. These signals are not only signs of impending purchase intention, but also “softer” intent signals, such as reading blog posts, subscribing to a newsletter, following the brand’s LinkedIn page, or browsing several product pages from a specific product category.

User intent can help multiple areas inside an organisation, from helping with Account-Based Marketing (ABM) and Sales Departments’ lead qualification processes to identifying lower intent users and guiding them to discover and engage more with the brand.

One application of intent in marketing is the real-time intent modelling of website visitors, assigning an intent score towards their possible conversion. Since most of the web traffic received on a webpage is anonymous, this solution can help drive more conversions on previously unknown users. This solution uses web analytics data, the user’s real-time web interaction, and other relevant data sources to build an AI model capable of qualifying each new user at each point of their website experience.
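
A minimal sketch of such a scoring model with scikit-learn, assuming a handful of illustrative session features (pages viewed, time on site, product pages visited, newsletter signup) and synthetic labels; a production model would use real analytics data and far richer features:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((1000, 4))  # pages, time on site, product pages, newsletter flag
y = (0.5 * X[:, 0] + 0.4 * X[:, 2] + rng.normal(0, 0.1, 1000) > 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier().fit(X_train, y_train)

# Intent score for a live (anonymous) session = predicted conversion probability
live_session = [[0.7, 0.2, 0.9, 1.0]]
print("Intent score:", round(model.predict_proba(live_session)[0, 1], 3))
```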

Don’t buy into one-size-fits-all solutions

More often than not, tech companies tend to replicate their solutions across clients, independently of the context surrounding each client’s business, operations and specific characteristics. This is usually because bespoke work imposes demanding requirements and higher operational expenses on the provider side, but it can hurt the overall solution in the end.

A simple example to illustrate this could be an AI-powered attribution model. Imagine such a model was developed for a car manufacturer, presenting the most relevant factors influencing its media investment, the ROAS per channel and so on, so as to inform its strategic decisions. Could this model be reused, with the new datasets available, for a chocolate manufacturer? Aside from the obvious difference in product category, the underlying differences (price, planned vs impulse buying, different media spends, different target audiences…) already show that it would not reflect the reality of the brand.

Whenever possible, it should be ensured that data solutions implemented in a company are tailored specifically to suit the brand’s needs, in order to avoid headaches in the future.

Integrate your short-term solutions with a long-term business plan

Data projects are often seen as tactical approaches to solve one impending need or digitalise a specific process. However, these “one-off” solutions often cause more problems than they solve, since:

  • They might not be aligned with the company strategy
  • They can be disconnected from other one-off projects or internal processes
  • They usually lack context on the root of the challenge, focusing on a consequence and not answering the important questions

Data is just one more asset that can help brands stay ahead of the market, and therefore must be aligned with the company strategy and processes. Defining a long-term roadmap of actions and subdividing it into specific projects and tasks keeps all solutions aligned with the brand objectives and strategy.

In order to achieve this, an internal culture of change and data-driven decisions must be fostered inside the company, as adoption and belief in these solutions is key to their usefulness.

In short:

We hope these examples have shown several options that advertisers can take advantage of to stay relevant and ahead of the market. Although the landscape may seem daunting, allying yourself with a trusted partner that can help bridge the gap from vision to execution can be of great help moving forward.

BEDROCK is a Data Consultancy company with ample experience in the media & marketing field, accompanying clients in their journey to becoming data-driven while making it understandable, easy and useful at every step of the way.

Aug 9 — 2022

Optimisation techniques for Neural Networks II: Activation functions & vanishing gradient

Have you ever stopped to think about why the ReLU function works so well?

In our previous article in this series we presented two techniques to improve the optimisation of a neural network, both based on adaptive techniques: adapting the learning rate α and adapting the gradient descent formula by including new terms in the equation.

Previous Article: How using adaptive methods can help your network perform better.

In this article we are going to focus on another essential element of Artificial Neural Networks (ANNs), the activation function, and we will see what limitations it has and how they can be overcome.

The activation function

We already mentioned the activation function in our introduction article to ANNs, but what does it do specifically and how can we know which one to use?

The nodes of an ANN are characterised by performing an operation on the information they receive through this activation function, which we’ll denote by φ. Thus, the output of a neuron j, yⱼ, is the result of the activation function φ applied to the information that neuron j receives from its m predecessor neurons:

yⱼ = φ(vⱼ) = φ( ∑ᵢ₌₁ᵐ ωᵢⱼ yᵢ )
Conventionally this φ function was equal to the sigmoid function or the hyperbolic tangent, but these functions have several drawbacks that can make our optimisation worse.

Why are these functions not always perfect?

The sigmoid activation function takes the input v and transforms it into a value between 0 and 1, while the hyperbolic tangent does it between -1 and 1.
Left: Plot of sigmoid function. Right: Plot of tanh function

The problem is that the neurons can saturate: arbitrarily large input values will always return 1, and very small values 0 (or -1 in the case of tanh). As a result, these functions are only sensitive to changes around the middle of their range (an output near 0.5 for the sigmoid and near 0 for tanh, i.e. when vⱼ is close to 0). Once the neurons saturate, it becomes very difficult for the algorithm to adapt the weights to improve the performance of the model.

In addition, deep networks (those with many hidden layers) can be difficult to train because of the way the gradients of the first layers relate to those of the final layers. The magnitude of the error gradient can decrease exponentially with each additional layer we add, which means the algorithm won’t know how to adjust the parameters to improve the cost function. This is the well-known vanishing gradient problem.

Understanding the vanishing gradient problem

Let us look at this problem in a little more detail. Suppose we have a network with m hidden layers that has a single neuron in each layer, and let us denote the weights between layers by ω⁽¹⁾, ω⁽²⁾, …, ω⁽ᵐ⁾. Suppose too that the activation function of each layer is the sigmoid function and that the weights have been randomly initialised so that they have an expected value equal to 1. Let x be the input vector, y⁽ⁱ⁾ the hidden values of each layer and φ⁽ᵗ⁾’(v⁽ᵗ⁾) the derivative of the activation function in the hidden layer t. From the backpropagation algorithm we know the expression:
∂J/∂y⁽ᵗ⁾ = φ⁽ᵗ⁺¹⁾’(v⁽ᵗ⁺¹⁾) · ω⁽ᵗ⁺¹⁾ · ∂J/∂y⁽ᵗ⁺¹⁾

What happens if we explicitly calculate the derivative φ’(x) for the sigmoid function? It is φ’(x) = φ(x)(1 − φ(x)).
We can see that the value of φ’(x) reaches, at maximum, 0.25. As the expected absolute value of ω⁽ᵗ⁺¹⁾ is 1, it follows from the backpropagation equation that each layer we move back multiplies the gradient by a factor smaller than 0.25, so ∂J/∂y⁽ᵗ⁾ is less than 0.25 · (∂J/∂y⁽ᵗ⁺¹⁾). Therefore, when we move backwards r layers, we will find that the value is less than 0.25ʳ times the original. If r equals, say, 10, the gradient updates fall to around 10⁻⁶ of the value they originally had. Consequently, the first layers of the network receive updates much smaller than those closest to the output layer.

Left: Plot of the derivative of the sigmoid function. Right: Plot of the derivative of the tanh function

This implies that the parameters of the last layers undergo large variations with each update, while those of the first layers barely change. As a result, it is common to find ourselves in a situation where, even after training for a long time, it is not possible to reach the optimum. Usually, unless we initialise each weight of each connection between neurons of different layers so that the product ω ⋅ φ’ is exactly one, we will have problems.
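
To see the attenuation numerically, here is a small sketch of the single-neuron-per-layer chain described above; it simply multiplies the per-layer factors ω · φ’(v) for a sigmoid activation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(42)
n_layers = 10

# Weights with expected absolute value 1 and pre-activations around zero,
# which is the most favourable case for the sigmoid derivative (max 0.25).
omega = rng.choice([-1.0, 1.0], size=n_layers)
v = rng.normal(0.0, 1.0, size=n_layers)
factors = omega * sigmoid(v) * (1.0 - sigmoid(v))

print("Gradient scale after 10 layers:", abs(np.prod(factors)))
# Bounded above by 0.25**10, roughly 1e-6, so the early layers barely get updated.
```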

The ReLU function

This is nowadays an open problem, and there are different solutions that can be applied to this particular issue. One which has been proposed in recent years is to use the Rectified Linear Unit or, as it is known in Data Science circles, the ReLU function. It is defined by the expression φ(v) = max(0, v).
We see that the function acts as the identity if the input is positive and if not, it returns zero. With this function the vanishing gradient problem occurs in fewer situations since most neurons have a gradient equal to one. When ReLU was first used — between 2009 and 2011 — it provided better performance in networks that had previously been trained with the sigmoid or hyperbolic tangent functions. Some benefits of the ReLU to highlight are the following:
  • Computational simplicity: Unlike the other functions, it does not require the calculation of an exponential function.
  • Representational sparsity: A great advantage of this function is that it is able to return an absolute zero. This allows hidden layers to contain nodes that are exactly “off”, producing a sparse representation, which is a desirable property since it speeds up and simplifies the model.
However, recent work has found that the use of this function can cause another type of problem: the death of certain neurons. Imagine a situation where the input of a neuron is always positive and the weights, by chance, have been initialised with negative values. Then the output of that neuron will always be zero and we will lose the information it receives. To solve this, variants such as the leaky ReLU are currently being studied, defined as φ(v) = v if v > 0 and φ(v) = α·v otherwise, where α is a small positive constant (commonly 0.01).
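
A minimal NumPy sketch of both activations (the 0.01 slope on the negative side is just a common default):

```python
import numpy as np

def relu(v):
    # Identity for positive inputs, exactly zero otherwise
    return np.maximum(0.0, v)

def leaky_relu(v, alpha=0.01):
    # Keeps a small slope (alpha) for negative inputs so the neuron never "dies"
    return np.where(v > 0, v, alpha * v)

v = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(v))        # [0.  0.  0.  1.5]
print(leaky_relu(v))  # [-0.02  -0.005  0.     1.5  ]
```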

Conclusion

When a neural network we are training is not converging, the cause can be this vanishing gradient problem, which is especially relevant when there are several hidden layers. It is important to keep in mind that machine learning algorithms are based on mathematical rules; they are not magic tricks, and if we pull back the curtain we can see how they work. Sometimes we can improve their training simply by understanding what is going on behind the scenes. A possible solution is the one proposed in this article, but there are others that can be tried. There will be trade-offs: some may give better results, but at the cost of being less efficient. We must evaluate each problem and try various alternatives. One size does not fit all.

Aug 1 — 2022

Cash & Carry adoption of data science is gaining momentum

The first book on wholesaling, Wholesaling Principles and Practice (1937) by Beckman and Engle, states that “During the era of rapid change in the field of wholesaling which began in the middle of the twenties, the cash and carry wholesale house was ushered in.”

Wholesalers buy primarily from manufacturers and sell mainly to retailers, industrial users and other wholesalers. They also perform many value-added functions.

The main traits of cash and carry are summarised best by the following definitions:

  • Cash and carry is a trade in which goods are put on sale by a wholesale warehouse and sold either on a self-service basis or by sample.
  • Customers include retailers, caterers, institutional buyers, etc., who settle the invoice on-site and carry the goods away.
  • The main difference between “classical” wholesale sales and cash and carry is that their customers transport goods themselves and pay on the spot, rather than on credit.
  • Access to cash and carry is normally restricted to operators of businesses, and the general public is not admitted.

The problem

The evolution of FMCG and consumer goods companies in the B2C segment towards consumer understanding and improved brand experiences, driven by data, automation and process improvement, has left the wholesale segment lagging behind.

The opportunity

Data Science, Artificial (Augmented) Intelligence and advanced analytics will generate improvements across the whole supply chain, from procurement through inventory control to cart check-out. AI can also create more efficient lines of communication and interoperability between systems, functions, processes and people (staff, providers and customers alike).

How to seize this opportunity

As B2B operators, wholesale companies necessarily collect their customers’ purchase history.

These companies therefore have a significant advantage over B2C operators (who often struggle to obtain this kind of information).

Many already have data sets of great value stored and underused in one or more data warehouses and should start to make better use of this information.

Most are not fully aware of the value of this data as a strategic asset, nor of its economic contribution to their growth potential.

We will highlight a few different systems that can work autonomously or combined. Their application in the wholesale sector not only focuses on customer experience and revenue generation but also enables a rational use of resources, improving the management of stocks and perishables. In turn, this creates a positive impact on the sustainability of the wholesale distribution sector.

  • Recommendation engines can improve the purchase process and help clients remember what they usually purchase by anticipating their needs. They can also encourage customers to buy more of the same or try new products that engage them. The algorithms in the back-end of recommendation engines suggest products based on each customer’s purchasing habits while considering key variables such as product supply, margin per unit and availability (a simplified sketch follows after this list).
  • Optimisation of purchase times and movement flow of customers through the store by using AI applied to digital floor plans. Additionally, insights distilled from these applications can help with a more efficient design of new floor plans and redesign of underperforming stores based on the shopping behaviours of existing end customers.
  • Stock management and forecasting: Prediction models applied to stock management and storage forecasting, aligned with future purchase recommendations. Forecasting using Data Science for precision is critical for sustainability and competitiveness, especially when dealing with perishable goods. Market players in this sector need to be conscious that forecasting using spreadsheets is no longer an option to stay competitive.
  • Loyalty systems can evolve to enable individualised, tailor-made programmes and promotions in line with the products suggested by the recommendation engine. Supported by automated alerts (product offers, suggestions and ideas via SMS or app notifications), they can increase average shopping basket spend significantly while improving shopper experience and satisfaction at the same time.
  • Smart interconnected stock systems to know the accurate inventory of each store in real-time while feeding the recommendation engine to send up-to-date suggestions.
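
As referenced in the first point above, here is a deliberately simplified item-to-item collaborative-filtering sketch in NumPy. The purchase matrix is invented, and a production recommender would also weigh margin, supply and availability as described:

```python
import numpy as np

# Hypothetical purchase-history matrix: rows = customers, columns = products,
# values = units bought (real data would come from the wholesaler's invoices).
purchases = np.array([
    [5, 0, 2, 0],
    [4, 1, 0, 0],
    [0, 3, 0, 4],
    [0, 2, 1, 5],
], dtype=float)

# Item-to-item cosine similarity
norms = np.linalg.norm(purchases, axis=0, keepdims=True)
similarity = (purchases.T @ purchases) / (norms.T @ norms + 1e-9)

# Score unseen products for customer 0 by similarity to what they already buy
customer = purchases[0]
scores = similarity @ customer
scores[customer > 0] = -np.inf  # don't re-recommend what they already purchase
print("Suggested product index for customer 0:", int(np.argmax(scores)))
```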

But the role of AI isn’t only to forecast and predict. Access to richer customer insights of higher value will help get other stakeholders and business units more involved, reduce silos and improve communication throughout the operation because Data Science and AI are cross-functional and interdisciplinary tools.

What is in it for early adopters?

AI can help to revitalise and transform the B2B wholesale market, add value to the final consumer, increase revenues, boost sustainability and generate efficiencies that lead to growth and profit.

Those who act quickly will see how algorithms can help with resource allocation, stock management, customer engagement, and brand experience all at once.

However, it is still hard to decide how to take the first steps, and there are dozens of possible solutions and approaches depending on the current digital mindset and situation of the organisation.

At Bedrock we are working with partners and clients in the sector, defining custom roadmaps and implementing immediate wins first whilst identifying mid-long term high growth opportunities.

Our pioneering solutions have been backed by the Spanish Ministry of Science and Innovation since 2021, which awarded us a NEOTEC* non-refundable grant to help clients finance the implementation of our ground-breaking solutions.

We are open to including further partners in the project, so don’t hesitate to get in touch if you want to know more!

__________________

*About NEOTEC:

NEOTEC grants are non-refundable and open to business projects in any technological or sectorial field. For business models based primarily on services to third parties, only projects developing their own new technology are eligible.

On average, about 1 in 8 such projects is approved, and those granted are widely regarded as among the most innovative and pioneering applications reviewed across industries on a national scale.


Jun 6 — 2022

Data Clean Rooms

An effective data-driven solution for modern advertising


Intro

Unless your head has been buried in the sand for the last few years, you already know that the deprecation of third-party cookies and IDs is coming at us quickly. A tighter privacy framework and growing Walled Gardens have us all scratching our heads for technical solutions that allow us to develop data-driven marketing strategies and top-notch Customer Journeys.

The Marketing Data Technology (MarDaTech) space is hungry for compliant solutions that allow relevant advertising, accurate campaign performance measurement and media attribution. Everyone is looking for safe mechanisms to understand prospects and customers as well as to identify them across different online environments.

At Bedrock we have started to help our clients navigate this changing and uncertain situation with effective, yet compliant, solutions. One of these is Data Clean Rooms (DCRs).

DCRs deliver security, privacy, and data governance controls and processes. Moreover, data science teams can run queries to build an analysis layer on top of them.

Data Science on Data Clean Rooms

These neutral environments offer statistical rigour, analytical possibilities and privacy protection. In data engineering jargon, DCRs are Data Warehouses where brands and/or advertisers can exchange, share and enrich their own data to run and measure optimised advertising actions. DCRs enable two or more parties to bring data together for joint analysis and custom data science developments.

For data sets (e.g. from a CRM or transaction logs) to be loaded, PII needs to be “hashed” for transmission and upload, and then appropriately secured and encrypted. The resulting datasets are merged into aggregates and later divided into cohorts (groups of users). This way, marketing and data science teams can work together to find value across these cohorts without relying on user-level data or individual addressability.

Aside from advertising, data scientists can apply their analytical techniques and algorithms to find insights across these groups, including: common relevant characteristics of high-value clients, their passions, interests and unexploited user profiles, and even predictions of each cohort’s lifetime value.

By performing multivariate statistical analysis, the most important data attributes that allow us to understand the clients can be derived. Through PCA (Principal Component Analysis), MDS (Multi-dimensional scaling) or density-based unsupervised clustering, we can understand the taxonomies for each client base; and to understand the evolutions of these groups we could rely on algorithms such as STATIS methods.
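
A hedged sketch of that kind of analysis with scikit-learn, using randomly generated stand-in data for the aggregated cohort attributes that would normally be exported from the DCR:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
cohorts = rng.normal(size=(200, 8))  # rows = cohorts, columns = aggregated attributes

X = StandardScaler().fit_transform(cohorts)
components = PCA(n_components=2).fit_transform(X)                # main attribute axes
labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(components)  # density-based segments

print("Cohort segments found:", len(set(labels) - {-1}))
```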


Three groups of companies are offering them:

  1. Walled Gardens: Google or Amazon. A large number of companies using Data Clean Rooms are those that have already spent, and still invest, a significant chunk of their advertising budget within the Walled Gardens. In return, the big players allow brands to obtain user-level data, though this is only available within those environments. There are two main Data Clean Rooms provided by Big Tech, since Facebook Advanced Analytics (FAA) was deprecated as of July 1st 2021. Google Ads Data Hub (ADH) was one of the first Data Clean Room solutions in 2017, developed as a privacy-based replacement for Google’s previous advertising solution, DoubleClick. It is a privacy-native warehouse built on Google Cloud, and Google BigQuery is where we construct and perform the queries that get our clients the insights they are looking for. No personally identifiable information (PII) is stored in BigQuery, which is excellent for processing large data volumes coming from Google-owned tools and data sources such as Campaign Manager, Display & Video 360 (DV360), Google Ads, and YouTube. Amazon Marketing Cloud (AMC) is built on the Amazon Web Services cloud; this solution helps our clients discover the net impact of cross-channel media investments, perform analytics across multiple pseudonymised data sets, and generate aggregate reports. A unique feature is that anyone can use Amazon Marketing Cloud without needing an AWS account.
  2. Emerging tech companies like InfoSum, LiveRamp (Safe Haven), Optable, Habu (CleanML) or Permutive. These companies run independently and act as a middleman between two companies that want to exchange data. They build neutral rooms on demand for those seeking to adapt their advertising strategy, so that a brand and a retailer can match data and refine audience segments, for instance. All of them offer pre-built turnkey integrations to ease data collaboration and they work across many verticals, including retail, CPG, travel, etc. In most cases only two parties are allowed at a time, but there are exceptions; for example, Snowflake (distributed data clean room) enables data sharing with multiple parties at the same time while keeping the same level of security.
  3. Organisations that own huge amounts of user and content data (e.g. membership platforms or retailers). Business models built on long-term subscriptions and memberships tend to own and acquire large amounts of visitor and user information, as Netflix or TikTok do, allowing them to build their own instances and convince providers and/or other retailers to partner and exchange their own first-party data. An example is Disney’s Advertising Sales group, which launched its own clean room solution in October 2021 using a combination of the Snowflake, Habu and InfoSum tech stacks. Disney now allows advertisers to access thousands of first-party segments from its vast portfolio of brands across media platforms.

What to do next?

At Bedrock we help organisations decide which Data Clean Room software fits them best, as we already work with all major providers of Cloud infrastructure. We not only support the implementation of DCRs, but can also help you integrate new data sources and provide data intelligence and analysis on demand.

Reach out to learn how we can help you leverage best-of-breed Data Clean Room technologies and value-added services!

Apr 29 — 2022

Driving adoption in data science solutions

Have you ever seen or participated in a project where a great tool was developed and then no one ever used it? It’s a story that repeats itself time and time again: a project is delivered and adoption is not as high as would have been desired. How is that possible if the solution works perfectly and fully meets the specifications?

Maybe because the interface looked confusing to the final users, maybe because they didn’t perceive the value that it would bring, maybe because the users didn’t feel confident about what the solution was doing in the back-end. Or maybe it is a combination of all these, plus many other reasons.

Successful data solutions must ensure the involvement of all parties in the development phase or risk becoming “a cool demo collecting dust in a drawer”. Human-centred solutions make all parties feel involved and proud of the process, which naturally drives adoption.

This article includes some steps and thoughts about what’s needed to ensure that a tool has widespread adoption once it’s finished. And by adoption I don’t mean making the use of the tool compulsory (which could also get 100% adoption!) but adoption because the users feel involved in the creation process and see value in the solution.

How to develop meaningful solutions

The starting point should be to understand the tool’s users. Ask the people who are actually going to use the tool: what are your day-to-day problems? Where do you think there is room for improvement? Additionally, involve yourself in their processes to really know what they are dealing with; listening to someone’s problem is not the same as actually seeing the problem with your own eyes. This step of the project is crucial: make informed decisions and do not rush them, since doing it well the first time saves time and money in the long run.

Once we know the first version of what we want to develop, we need to start actually developing it. As the agile principles say, we should deliver working software frequently and welcome changing requirements. Expectations evolve with business needs, and data-driven teams must adapt quickly to ensure they are driving value where needed. Even if no new suggestions are made, keeping everyone updated about the development process is an investment that’s useful in and of itself. A strong focus should always be kept on the validation of this code: it’s a two-way process where the code is delivered, the stakeholders test it, and feedback is sent back. Otherwise the whole exercise is rendered almost useless.

If communication with the end user is constant from the beginning and everyone makes sure that they are on the same page, then the requirements are less likely to change. And if something is found along the way, thanks to constant delivery, it’s corrected early. What’s often most highlighted about this is that it makes the tool better (which is indeed true), but it’s just as important to note that it also helps make everyone feel that they own the solution, so they’ll want to use it, and they’ll also want other people within the organisation to use it.

So in summary, we should involve everyone during the development of the tool and listen to what they think. This way we will get adoption not only by them, but also by everyone in the company.

Highlights for data science solutions

Although in the title I talked about data science solutions, so far everything could apply to any technical (or non-technical) solution. So where does the difference lie when applying this for data science solutions? There are plenty of different factors, and I’d like to highlight these two:

  • Fear of being substituted by an AI that does your job: this is something that I’ve seen in quite a few projects, where the humans who are going to be helped by the AI are scared that they will be replaced. To overcome this they need to understand perfectly what the AI can and cannot do, and why although it’s a great help, it isn’t a substitute for humans. If the users are scared of the tool, adoption will be impossible.
  • The results of these projects are fully dependent on the quality of the data, and this data needs to be provided and explained by humans. Very often, different people and departments need to collaborate in order to get all the data required for a project, so we must involve those people from the beginning and explain the importance of the task.

In summary, the way to get adoption for data science solutions is through constant communication where everyone listens to each other: periodic meetings to explain the status of the project and receive feedback, individual meetings with different stakeholders/departments, and always being reachable by anyone who might have relevant feedback (which is everyone!). This helps develop these projects faster and with better results, both in terms of the quality of the product delivered and in terms of the use of this product.


Apr 7 — 2022

Outliers in data preprocessing: Spotting the odd one out!

More than 2.5 quintillion bytes of data were generated daily in 2020 (source: GS Analytics). To put this in perspective, a quintillion is a million million million or, more simply, a 1 followed by 18 zeros. It is therefore not surprising that a significant amount of this data is subject to errors. Data scientists are aware of this and routinely check their databases for values that stand out from the rest. This process, referred to as “outlier identification” in the case of numerical variables, has become a standard step in data preprocessing.

The search for outliers

The search for univariate outliers is quite straightforward. For instance, if we are dealing with human heights and most of the individuals’ measurements are expected to range between 150cm and 190cm, then heights such as 1.70cm or 1700cm must be understood to be annotation errors. Aside from such gross outliers, which should definitely be cleaned when performing data preprocessing tasks, there is still room for outliers that are inherent to the type of data we are dealing with. For instance, some people could be 140cm or 200cm tall. This type of outlier is typically identified with rules of thumb such as the absolute value of the z-score being greater than 3. Unless there is an obvious reason (such as an annotation error), these outliers should not be removed or cleaned in general; still, it is important to identify them and monitor their influence on the modelling task to be performed.
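
A minimal sketch of that rule of thumb on simulated heights (the data and the gross error are invented; in practice, robust variants based on the median are often preferred):

```python
import numpy as np

rng = np.random.default_rng(0)
heights = rng.normal(170, 8, size=500)   # plausible heights in cm
heights = np.append(heights, 1700.0)     # one gross annotation error

z_scores = (heights - heights.mean()) / heights.std()
print(heights[np.abs(z_scores) > 3])     # only the 1700.0 entry should be flagged
```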

Multivariate outliers

A more difficult problem arises when we are dealing with multivariate data. For example, imagine that we are dealing with human heights and weights and that we have obtained the data represented in the scatterplot below. The individual marked in red is not a univariate outlier in either of the two dimensions separately, however, when jointly considering both height and weight this individual clearly stands out from the rest.

A popular technique for the identification of multivariate outliers is based on the use of the Mahalanobis distance, which is just a measure of how far a point x is from the centre of the data. Mathematically speaking, the formula is as follows:

d(x) = √( (x − μ)ᵀ Σ⁻¹ (x − μ) )

where μ represents the mean vector (i.e., the centre of the data) and Σ the covariance matrix, both of them typically being estimated from the data by the sample mean vector and the sample covariance matrix.

Interestingly, the Mahalanobis distance may be used for drawing tolerance ellipses of points that are at a certain Mahalanobis distance from the centre of the data, thus allowing us to easily identify outliers. For instance, returning to the example of human height and weight, it can be seen that the individual marked in red is actually the most outlying point when taking into account the graphical shape of our dataset.

In fact, one could understand the Mahalanobis distance as the multivariate alternative to the z-score. More precisely, ‘being at a Mahalanobis distance d from the centre’ is the multivariate equivalent of ‘being d standard deviations away from the mean’ in the univariate setting. Therefore, under certain assumptions, such as the data being obtained from a multivariate Gaussian distribution, it is possible to estimate the proportion of individuals lying inside and outside a tolerance ellipse. In the case above, we are representing a 95% tolerance ellipse, meaning that around 95% of the data points are expected to lie inside the ellipse if the data is obtained from a multivariate Gaussian distribution.
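
The following sketch computes squared Mahalanobis distances on simulated height-weight data and flags the points lying outside the 95% tolerance region via the corresponding chi-square quantile (all figures are illustrative):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
heights = rng.normal(170, 8, 300)                              # cm
weights = 70 + 0.9 * (heights - 170) + rng.normal(0, 6, 300)   # kg, correlated
X = np.column_stack([heights, weights])

mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mu
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)  # squared Mahalanobis distances

threshold = chi2.ppf(0.95, df=2)                    # 95% ellipse in 2 dimensions
print("Points outside the 95% ellipse:", int((d2 > threshold).sum()))
```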

The identification of multivariate outliers becomes even more problematic as the number of dimensions increases because it is no longer possible to represent the data points in a scatterplot. In such a case, we should rely on two/three-dimensional scatterplots for selected subsets of the variables or for new carefully-constructed variables obtained from dimensional reduction techniques. Quite conveniently, the Mahalanobis distance may still be used as a tool for identifying multivariate outliers in higher dimensions, even when it is no longer possible to draw tolerance ellipses. For this purpose, it is common to find graphics such as the one below, where the indices of the individuals on the dataset are plotted against their corresponding Mahalanobis distances. The blue dashed horizontal line represents the same level as that marked by the tolerance ellipse above. It is easy to spot the three individuals lying outside the tolerance ellipse by looking at the three points above the blue dashed horizontal line and, in particular, the individual marked in red is shown again to clearly stand out from the other data points.

As a drawback of this method for the identification of multivariate outliers, some authors have pointed out that the Mahalanobis distance is itself very influenced by the outliers. For instance, imagine that five additional individuals (also marked in red in the scatterplot below) are added to the dataset. The tolerance ellipse (in red) has now been broadened and contains the individual previously considered as the most outlying. To avoid this problem, we may replace the sample mean vector and the sample covariance matrix in the definition of the Mahalanobis distance with alternatives that are not strongly influenced by the outliers. A popular option is the Minimum Covariance Determinant (MCD) estimator, which jointly estimates the mean vector and the covariance matrix and will identify a tolerance ellipse that is closer to the original ellipse (the blue one) than to the ellipse heavily influenced by the outliers (the red one).
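
scikit-learn ships an implementation of the MCD estimator; reusing X and the chi-square threshold from the previous sketch, the robust distances take only a couple of lines:

```python
from sklearn.covariance import MinCovDet

# Robust squared Mahalanobis distances, far less influenced by the outliers themselves
robust_d2 = MinCovDet(random_state=0).fit(X).mahalanobis(X)
print("Robustly flagged points:", int((robust_d2 > threshold).sum()))
```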

Another potential drawback for the identification of multivariate outliers is the shape of the dataset, since the Mahalanobis distance only accounts for linear relationships between variables. More specifically, the Mahalanobis distance should not be used when there is clear evidence that several clusters of individuals exist in the data or, more generally, if the shape of the dataset is not somehow elliptical. In this case, we may want to tap into alternatives such as “depth-based” and “density-based” outlier detection methods.

Conclusion

To summarise, in this article we have seen a prominent technique for outlier identification that should be performed as a data preprocessing task. Additionally, data analysts or data scientists may also be interested in reducing the influence of outliers on the resulting model by considering techniques that are less sensitive to the presence of outliers (for such purposes, the reader is directed to classic books on robust statistics). However, the study of outliers should not end there since it is also important to ultimately analyse the influence of the outliers on the performance of the analytical model. More precisely, one must be careful with the so-called influential points, which are outliers that, when deleted from the dataset, noticeably change the resulting model and its outputs. Further analysis of the reasons why these influential points appear in the dataset must be performed, not only by the data professionals but also by experts with vast specific domain knowledge on the nature of the problem.


Jan 20 — 2022

7 trends that will define transformational programs and data initiatives in 2022

You can argue about how much the pandemic had to do with the increasing pace at which Artificial Intelligence (AI) was adopted throughout 2021, but what you cannot argue with is that Covid has pushed leaders to accelerate research and work in this field. Managing uncertainty for the past two years has been a major reason for our clients to keep a data-driven business model as their top strategic priority to stay relevant and competitive, empowering them to actively and effectively respond to rapidly shifting situations.

However, all of us are faced with a myriad of technology solutions and tools of increasing technical complexity. To help navigate this sheer amount of information, I have prepared a brief summary of my own perspective on what lies ahead for the next year. When putting this article together I found it really helpful to summarise the more than 30 conversations I had while recording our Data Stand-up! podcast. I spoke with successful entrepreneurs, CIOs, CDOs and Lead Data Scientists from all around the world, and all of them brought a great share of perspectives on the question: where is data going in 2022 when it comes to supporting business strategies?

So, what does 2022 have in store for us? Let‘s dive in!

1. Data Lake Houses

Putting it simply, there have been two “traditional” ways to operationalise data analytics at a business level in terms of the underlying infrastructure used and the type of data being fed:
  • Structured datasets and Data Warehouses: This is about retrieving datasets that display a consistent schema (i.e. data from business applications such as CRMs) and importing them into a Data Warehouse storage solution that then feeds Business Intelligence tools. These “warehousing architectures” particularly struggle with advanced data use cases. For instance, their inability to store unstructured data for machine learning development is a downside that cannot be overlooked. Furthermore, proprietary Data Warehouses do not always integrate well with open-source data science and engineering tools like Spark.
  • Unstructured, semi-structured datasets and Data Lakes: Data Lakes were designed to store unprocessed data or unstructured data files such as pictures, audio or video that cannot fit as neatly into data warehouses. Retrieving raw data and importing it directly into a Data Lake without any cleansing or pre-processing in between becomes handy when dealing with these files. The majority of data being generated today is unstructured, so it is now imperative to use tools that enable processing and storing unstructured sets. The drawback of Data Lakes is the difficulty in maintaining data quality and governance standards; they sometimes become “Data Swamps” full of unprocessed information lacking a consistent schema, which makes it difficult to search, find and extract data at will.
The reality is that both scenarios need to “coexist”: integrating and unifying a Data Warehouse and a Data Lake becomes a requirement, as analytics teams need structured and unstructured data both indexed and stored. Any modern company needs the best of both worlds, building a cost-efficient, resilient enterprise ecosystem that flexibly supports its analytical demands. In other words, any Data Engineer should be able to configure data pipelines and grant retrieval access to Data Scientists, regardless of the underlying infrastructure, so they can perform their downstream analytics duties. This is the idea and vision behind the “Data Lakehouse”:
A unified architecture that provides the flexibility, cost-efficiency, and ease of use of Data Lakes with the data granularity, consistency, reliability, and durability of data warehouses, enabling subsequent analyses, ML and BI projects.
There are a few providers out there that offer top-notch Data Lakehouse solutions. Databricks seems to be leading the race, as it was the original creator of the Lakehouse architecture (i.e. Delta Lake). Amazon Web Services (AWS) is another winning horse with its own Lakehouse architecture (i.e. Lake Formation + AWS Analytics), and Snowflake is also a relevant provider of this emerging "hybrid" infrastructure. I predict that the Data Lakehouse architecture will continue to be in the spotlight in 2022, as companies focus on Data Engineering even more than before. There is already huge demand for data architects and engineers in charge of platforms, pipelines and DevOps.
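To make the idea a little more concrete, here is a minimal, hedged sketch of the lakehouse pattern using PySpark with the open-source delta-spark package. The bucket paths and the event schema are hypothetical, and this is an illustration of the concept rather than any vendor's reference implementation:

from pyspark.sql import SparkSession

# Configure a Spark session with Delta Lake support (requires the delta-spark package)
spark = (
    SparkSession.builder.appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Land semi-structured raw events in the lake...
raw_events = spark.read.json("s3://example-bucket/raw/events/")

# ...and store them as a Delta table, which adds warehouse-like guarantees
# (ACID transactions, schema enforcement, time travel) on top of cheap object storage
raw_events.write.format("delta").mode("append").save("s3://example-bucket/lakehouse/events")

# The same table can now serve BI-style aggregations and ML feature extraction alike
events = spark.read.format("delta").load("s3://example-bucket/lakehouse/events")
events.groupBy("event_type").count().show()

Whether the engine is Delta Lake, Lake Formation or Snowflake, the point is the same: one storage layer serving both the raw, unstructured side and the curated, warehouse-style side.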

2. Low-code and No-code AI. Is it really the future?

Data Science is not just a research field anymore, and it has been many years since it was validated as a powerful tool that every area of the business wants a piece of. However, the market continues to struggle to fill new openings, as demand for talent still exceeds supply. Low-code and no-code platforms were, and still are, one of the promising solutions to turn this around, as they empower non-technical business professionals to "act" as Data Scientists. Moreover, these tools present an added benefit: more people across the organisation may begin to understand what can be done with data and, therefore, know better what questions can realistically be asked. Some well-known solutions such as DataRobot, H2O AutoML, BigML or ML Studio allow the development of practical data applications with little to no programming experience but…
Is it realistic for people who haven’t learned how to code to implement functional and safe analytical systems or AI solutions? Yes, but only if these non-technical professionals are guided and supported.
These days you may find a marketing executive building an NLP solution for sentiment analysis or a hypermarket operations manager building a demand prediction system, but I must share a word of caution based on recent experience. Codeless does not mean maths-less. Background knowledge of the processes and mathematics behind data transformation, feature engineering and algorithms is needed for the correct ideation and implementation of effective solutions. My take here:
Adoption of these tools will continue to grow and low-code solutions will remain a relevant trend in 2022. However, new roles (QA, coaches, evangelists, etc.) will also need to be defined around their adoption.
Many have quickly realised that the supervision and guidance of qualified data professionals is critical, more so when explainable and transparent AI is an upcoming legal prerequisite.
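To illustrate just how little code these platforms require, and why supervision still matters, here is a hedged sketch using H2O's AutoML Python interface; the customers.csv file and its churn column are hypothetical, and the snippet is a conceptual example rather than a recommended workflow:

import h2o
from h2o.automl import H2OAutoML

# Start a local H2O cluster and load a (hypothetical) customer data set
h2o.init()
customers = h2o.import_file("customers.csv")
customers["churn"] = customers["churn"].asfactor()   # treat the target as a class, not a number

train, test = customers.split_frame(ratios=[0.8], seed=42)

# The "no-code-style" part: a handful of lines produces a whole leaderboard of models
aml = H2OAutoML(max_models=10, seed=42)
aml.train(y="churn", training_frame=train)

# The maths-literate part: inspecting held-out performance instead of trusting the leaderboard blindly
print(aml.leaderboard.head())
print(aml.leader.model_performance(test).auc())

The few lines of training code are easy; judging class balance, leakage, held-out performance and explainability is where the guidance of qualified data professionals comes in.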

3. Augmented and hybrid human workforce

Employees have been understandably concerned about robots taking over during the last few years, especially since Gartner claimed that one in three jobs would be taken by software or robots, as some form of AI, by 2025. It seems common sense that organisations should have highlighted earlier that AI is only meant to augment our capabilities, giving us more time for creative and strategic thinking, not simply to replace people. In my view, Machine Learning will now start to really enhance the lives of employees. Boring and repetitive admin tasks will fade into obscurity and soon be long gone. I believe that 2022 will be the year when we begin to see that AI, in the form of digital co-workers, is genuinely welcomed by people at organisations. Whether you choose to call them robots, RPA systems or digital co-workers, AI will allow us all to make quicker decisions, automate processes and handle vast amounts of information at scale much faster.
In order to remain competitive, businesses of all kinds will have to start designing a hybrid workforce model where humans and “digital co-workers” work hand in hand.
We should still be realistic about expecting automation to fully replace some jobs, but I do hope that reinvented jobs and new positions will balance out all the jobs lost. Cultural adoption barriers still pose a major challenge, but despite popular pessimistic beliefs and potential drawbacks, the redefined augmented workforce is one of the key trends to keep an eye on during 2022 and beyond.

4. Efficiency vs complexity

Whilst a huge chunk of the research efforts and R&D data initiatives by the FANGs are directed towards pushing the boundaries of narrow AI in the pursuit of General AI, developing, training and running the complex models involved has inevitably had a negative collateral impact on the environment. Given the computational power required to fuel some hyper-parameterised models, it is no surprise that data centres are beginning to represent a significant share of global CO2 emissions. For reference, back in 2018 the largest AI models had around 94 million parameters, and this grew to 1.6 trillion in 2021 as the larger players pushed the boundaries of complexity. Today, these trillion-parameter models are language- and image- or vision-based. Models such as GPT-3 can comprehend natural language, but they also require a lot of computational power to function. This has motivated leading organisations to explore how they can effectively reduce their Machine Learning carbon footprint.
Big players have started to look at ways of developing efficient models, and this has had an impact on the Data Science community: teams now seem to be looking for simpler models that perform as well as complex ones for solving specific problems.
A relatively simple Bayesian model may sometimes perform as well as a 3D-CNN while using significantly less data and computational power. In this context, “model efficiency” will be another key aspect of modern data science.
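As a toy illustration of this point (not taken from any of the projects mentioned here), comparing a cheap probabilistic baseline against a heavier neural model on a small public data set is often the quickest way to find out whether the extra complexity and compute are actually buying anything:

from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

# Small public data set purely for illustration
X, y = load_digits(return_X_y=True)

# A very cheap Bayesian baseline vs. a (comparatively) heavy neural network
baseline = GaussianNB()
heavier = MLPClassifier(hidden_layer_sizes=(256, 128), max_iter=500, random_state=0)

print("GaussianNB accuracy:", cross_val_score(baseline, X, y, cv=5).mean())
print("MLP accuracy:       ", cross_val_score(heavier, X, y, cv=5).mean())

If the simple model gets within a point or two of the heavy one, the extra parameters, data and energy are hard to justify.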

5. Multi-purpose modelling

It takes a lot of data sets, hard-to-get talent, costly computing resources and valuable time to ideate, develop and train AI models. Data teams are very familiar with the effort it takes to deploy a model that works properly and accurately, so Data Scientists understand that every piece of development work should, where possible, be reapplied in other modelling exercises. We have seen this happening in many industries, and the trend seems to be pointing towards training capable general-purpose models that can handle very diverse data sets and therefore solve thousands of different tasks. This is something that may develop incrementally over the next few years.
These multimodal models could be thought of and designed from the beginning to be highly efficient reapplicable tools.
These AI models would combine many ideas that have so far been pursued independently. For instance, Google is already following this vision with a next-generation architecture and research umbrella that they have named Pathways. You should not be surprised if you read about substantial progress in the field of multi-purpose modelling in the next few months.
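For a small, hedged taste of what "multi-purpose" already looks like in practice today (using the HuggingFace transformers library rather than anything Pathways-specific, and a made-up input sentence): a single pretrained natural language inference model can be reused, zero-shot, to classify text against label sets it was never explicitly trained on.

from transformers import pipeline

# One pretrained model, many downstream tasks: zero-shot classification against arbitrary labels
classifier = pipeline("zero-shot-classification")

result = classifier(
    "Customer complaints about late deliveries increased sharply last quarter.",
    candidate_labels=["logistics", "marketing", "finance"],
)
print(result["labels"][0], result["scores"][0])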

6. People Analytics

Dissatisfaction with job conditions, reassessments of work-life balance, and lifestyle alterations due to the hardships of the pandemic led to the Great Resignation, an informal name for the widespread phenomenon of hundreds of thousands of workers leaving their jobs during the COVID-19 era. Also called the Big Quit, it has often been mentioned in reference to the US workforce, but the trend is now international. The full effects of the pandemic remain unpredictable, but organisations have been forced to wake up and now seem committed to understanding their people. Companies are looking for effective ways to gain this understanding of their employees, and many have come to the realisation that People Analytics could be the answer. In my view, there are two main drivers that have encouraged leaders to consider People Analytics:
  • The KPIs that define business value have changed over the past few years. Value used to be tied to tangible assets such as warehouse stock, money in the bank or owned real estate, but nowadays it is highly tied to having a talented workforce that can be an industry reference and that nurtures innovation. This relates to the previous trend about workforce changes, where creativity will become more and more important, hence the need for a motivated and innovative team that thinks outside the box.
  • Data Technology and AI now form the backbone of the strategic decision-making toolkit at most advanced companies.
People analytics has become a data-driven tool that allows businesses to measure and track their workforce behaviour in relation to their strategy.
People Analytics is built upon the collection of individual talent data and its subsequent analysis, allowing companies to comprehend the evolving workplace while also surfacing insights that drive customer behaviour and engagement. Moreover, it helps management and HR units steer the holistic people strategy by prescribing future actions. These actions may be related to, but are not limited to, improving talent-related decisions, improving workforce processes and promoting a positive employee experience. In the past, People Analytics was only adopted by large enterprises with big budgets, and it is only recently that mid-size organisations have joined in too. As of 2020, more than 70 percent of organisations were investing in people analytics solutions to integrate the resulting insights into their decision-making, and I am pretty certain that this percentage will increase significantly over the coming months.

7. Data marketplaces

If data is now understood as the new oil and the most valuable asset for any company, data marketplaces may become a mainstream way to exchange and trade information in 2022. Even though some companies in specific sectors still jealously guard their data, others have spotted an opportunity in exchanging information. Platforms such as Snowflake's Data Marketplace allow businesses to become data providers, enabling them to easily share and monetise large data sets. Enterprises that generate large or highly unique datasets as part of their day-to-day activities may find it worthwhile to explore this route as a new way of generating additional revenue. In contrast, a few years back it was common for medium and large businesses to fully outsource data analytics projects to an IT provider that would eventually use the third party's data without consent. Now that everyone has understood that data is the most valuable asset, data will be exchanged and shared at will, but always with the expectation of something in return. Nevertheless, companies that aim to capitalise on this opportunity need to devise a robust strategy for it by carefully assessing all legal and privacy implications. Similarly, they will have to build processes that automate the required data transformations so that data exports comply with existing regulations.
The rise in AI applications will contribute to the widespread adoption of this trend. Complex models require vast amounts of data to be fed and many will also use these exchanges as a way of developing and training models.
2022 might be the year when The Economist's well-known 2017 statement about data being the new oil comes closer to business reality with the first 'commodity exchanges'.
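To ground the compliance point above, here is a minimal, hypothetical sketch of the kind of automated transformation a data provider might run before listing a data set on an exchange: pseudonymising direct identifiers and only exporting aggregates. The column names and the hashing choice are illustrative, not legal advice or a complete anonymisation scheme:

import hashlib
import pandas as pd

# Hypothetical raw transactions containing a direct identifier (email)
orders = pd.DataFrame({
    "email": ["ana@example.com", "luis@example.com", "ana@example.com"],
    "city": ["Madrid", "Oviedo", "Madrid"],
    "amount": [120.0, 80.0, 45.0],
})

# Pseudonymise the identifier and drop the raw value before any export
orders["customer_id"] = orders["email"].map(lambda e: hashlib.sha256(e.encode()).hexdigest()[:16])

# Export only an aggregate view, never the row-level identifiers
export = (
    orders.drop(columns=["email"])
    .groupby("city", as_index=False)["amount"]
    .sum()
)
print(export)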

Conclusions

There are almost certainly more than these 7 trends, but I have chosen to focus on the high-level ones in order to provide a rough prediction of what may shape corporate strategies and business plans around the world. Now to recap the 7 trends that we have discussed and that you could expect to see in the Analytics and the Artificial Intelligence space in 2022:
  • Data Lakehouses as a hybrid architecture that allows efficient processing and analysis of structured, semi-structured and unstructured data sets.
  • Low and no-code data solutions will continue to be a way to democratise Data Science, but new supervisor roles may appear around them.
  • The AI-enhanced workforce will continue to rise, with analytical mechanisms and automation becoming the norm.
  • Model efficiency and simplicity will be a defining metric more than ever.
  • Data Science teams will demonstrate significant interest in multi-purpose AI as a way to efficiently reutilise pieces of modelling work from previously tested developments.
  • People Analytics will be one of the most sought-after data initiatives that can realistically support business goals.
  • Data Marketplaces and data exchanges will present a new revenue opportunity for businesses that generate large or unique data sets.
So what data trends are here to stay and what is coming next? Truthfully, no one can tell you, and this is just my opinion, so we will have to play along and see!
Back to Articles

Jan 11 — 2021

TRANSFORMERS: multi-purpose AI models in disguise (Part 2)

TRANSFORMERS: multi-purpose AI models in disguise

In the first part of this article, we took a look at the Transformer model and its main use cases related to NLP, which have hopefully broadened your understanding of the topic at hand. If you have not read it yet, I suggest you give it a brief glance first since it will help you understand its current standing.

In the second part of this article, we will present novel model architectures and research employing Transformers in several fields unrelated to NLP, as well as showing some code examples of the capabilities of these remarkable new approaches.

 

APPLYING THE TRANSFORMER TO OTHER AI TASKS

As previously mentioned, the Transformer architecture provides a suitable framework designed to take advantage of long-term relationships between words. This allows the model to find patterns and meanings in the sentences, and makes it suited for many tasks in NLP. The most common ones are:

 

  • Text classification into categories, such as obtaining the sentiment of a text (see the short example after this list)
  • Question answering, where the model can extract information from a text when prompted to do so
  • Text generation, such as GPT-3
  • Translation; Google Translate already employs this technology
  • Summarization of a text into few words or sentences
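As a quick, illustrative sketch of the first task in the list above, here is a sentiment classifier using the default pretrained model that the HuggingFace pipeline API downloads, with a made-up input sentence:

from transformers import pipeline

# Sentiment analysis with a default pretrained model from the HuggingFace hub
sentiment = pipeline("sentiment-analysis")
print(sentiment("The new release fixed every bug I reported. Fantastic work!"))
# Expected output along the lines of: [{'label': 'POSITIVE', 'score': 0.99...}]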

 

After the success of the Transformer model applied to NLP tasks, people began to wonder: if it can find long-term relationships in the data and be trained efficiently, could it be just as effective in tasks other than NLP? This question is the starting point of the current research movement, where this model is used as the backbone for many AI and machine learning algorithms previously dominated by other techniques. Some amazing contributions in other AI fields are:

 

Lip reading and text transcription:

A recurring problem in society is text transcription, especially for hearing-impaired people. Current advances fall into two groups: using an audio track and transcribing it, or directly interpreting the words from the speaker's lip movements. The latter problem is a lot harder to solve since many more factors are involved.

 

Potentially, this solution could help in situations where recording audio is impossible, such as in noisy areas. In this regard, most researchers use a CNN or LSTM as the main model, interpreting this as a pure Computer Vision task. Recently, studies such as this one (https://ieeexplore.ieee.org/document/9172849) have been published in which Transformer-based solutions to the problem are presented, providing better results than the current state-of-the-art.

Traffic route prediction at sea:

Figure x: maritime traffic density map as of 2015. (https://www.researchgate.net/figure/2015-worldwide-maritime-traffic-density-map-The-density-is-evaluated-as-the-number-of_fig1_317201419)

 

One of the main focuses of AI models is the prediction of routes, either for individual people or for traffic. However, the models employed are mostly trained on land traffic, since it is easier to obtain and model the data. For sea traffic, the scarcity of data and its dependence on external factors such as weather or sea currents make it more difficult to provide accurate predictions even for the next few hours.

Most of the models employed rely on LSTMs or CNNs for the same reasons as before, but these models struggle with long-term predictions and do not take into account the specific characteristics of data obtained at sea. A recent study (https://arxiv.org/abs/2109.03958) presents a novel algorithm that accounts for the data's nuances and provides vessel trajectory predictions using a Transformer model. The accuracy of its predictions is well above that of the available alternatives, precisely where long-term predictions are mandatory.

 

Object detection:

This is a subset of Computer Vision and one of the most common AI tasks. In this task, the model can detect certain objects in an image or video and draw a box around them; some common examples are your phone’s face recognition functionality when you take a picture or unlock it, or CCTV detection of license plates.

In this regard, the models that have been employed in the past are mostly based on CNN since these excel at finding relationships in images; the most common ones being SSD and Faster R-CNN. As a result, most algorithms currently used in these tasks have some variation of this model architecture.

However, as was the case for the other tasks, the Transformer architecture has also been tried for finding patterns in images. This has led to several approaches where CNNs and Transformers are used jointly, like Facebook's DETR (https://arxiv.org/abs/2005.12872), or purely Transformer-based architectures like the Vision Transformer (https://arxiv.org/abs/2010.11929). The most impactful research in the past few months has been the novel approach of shifted windows in the Swin Transformer (https://arxiv.org/abs/2103.14030), achieving cutting-edge results on a number of image analysis benchmarks.

LEARN BY SEEING: OBJECT DETECTION WITH DETR

For most of these models, the code and training data are publicly available and open-sourced, which eases their use for inference and fine-tuning. As an example, we will show below how to load and use the DETR model on a specific image.

First, install the dependencies (transformers, timm) and load an image of a park using its URL:

Figure x: image of pedestrians in a park.

# Install dependencies
!pip install -q transformers
!pip install -q timm

# Load the needed libraries to load images
from PIL import Image
import requests

# In our case, we selected an image of a park
url = 'https://www.burnaby.ca/sites/default/files/acquiadam/2021-06/Parks-Fraser-Foreshore.jpg'
im = Image.open(requests.get(url, stream=True).raw)

# Show the image
im

Then, we apply the feature extractor to resize and normalize the image so the model can interpret it correctly. This will use the simplest DETR model, with the ResNet-50 backbone:

from transformers import DetrFeatureExtractor

feature_extractor = DetrFeatureExtractor.from_pretrained("facebook/detr-resnet-50")

encoding = feature_extractor(im, return_tensors="pt")

encoding.keys()

Next, load the pre-trained model and pass the image through:

from transformers import DetrForObjectDetection

model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

outputs = model(**encoding)

And that’s it! Now we only have to interpret the results and represent the detected objects with some boxes:

import matplotlib.pyplot as plt
import torch

# Colours for visualisation
COLORS = [[0.000, 0.447, 0.741], [0.850, 0.325, 0.098], [0.929, 0.694, 0.125],
          [0.494, 0.184, 0.556], [0.466, 0.674, 0.188], [0.301, 0.745, 0.933]]

# Define an auxiliary plotting function
def plot_results(pil_img, prob, boxes):
    plt.figure(figsize=(16, 10))
    plt.imshow(pil_img)
    ax = plt.gca()
    colors = COLORS * 100
    for p, (xmin, ymin, xmax, ymax), c in zip(prob, boxes.tolist(), colors):
        # Draw the bounding box
        ax.add_patch(plt.Rectangle((xmin, ymin), xmax - xmin, ymax - ymin,
                                   fill=False, color=c, linewidth=3))
        # Label it with the most likely class and its confidence
        cl = p.argmax()
        text = f'{model.config.id2label[cl.item()]}: {p[cl]:0.2f}'
        ax.text(xmin, ymin, text, fontsize=15,
                bbox=dict(facecolor='yellow', alpha=0.5))
    plt.axis('off')
    plt.show()

# Keep only predictions of queries with 0.9+ confidence (excluding the no-object class)
probas = outputs.logits.softmax(-1)[0, :, :-1]
keep = probas.max(-1).values > 0.9

# Rescale the bounding boxes to the original image size
target_sizes = torch.tensor(im.size[::-1]).unsqueeze(0)
postprocessed_outputs = feature_extractor.post_process(outputs, target_sizes)
bboxes_scaled = postprocessed_outputs[0]['boxes'][keep]

# Show the detection results
plot_results(im, probas[keep], bboxes_scaled)

Figure x: image of pedestrians in a park, with the detected objects and their bounding boxes.

The accuracy of these models is remarkable! Even smaller objects, which are harder for the usual neural networks to detect, are identified correctly. Thank you @NielsRogge (https://github.com/NielsRogge) for the awesome implementation (https://github.com/NielsRogge/Transformers-Tutorials) of these models in the Transformers library!

These examples are just the tip of the iceberg of this research movement. The high flexibility of this architecture and the numerous advantages it provides suit a wide range of AI tasks, and advances are being made daily on multiple fronts. Recently, Facebook AI published a new paper presenting the scalability of these models for CV tasks that has stirred the community quite a bit; you can also check it out here (https://medium.com/syncedreview/a-leap-forward-in-computer-vision-facebook-ai-says-masked-autoencoders-are-scalable-vision-32c08fadd41f).

Will this be the future of all AI models? Is the Transformer the best solution for all tasks, or will it be relegated to its NLP applications? One thing is for sure: for the time being, the Transformer is here to stay!

Back to Articles

Jan 10 — 2021

TRANSFORMERS: multi-purpose AI models in disguise (Part 1)

TRANSFORMERS: multi-purpose AI models in disguise

Novel applications of this powerful architecture set the bar for future AI advances.

If you have dug deep into machine learning algorithms, you will probably have heard of terms such as neural networks or natural language processing (NLP). Regarding the latter, a powerful model architecture has appeared in the last few years that has disrupted the text mining industry: the Transformer. This model has altered the way researchers approach text analysis, introducing a novel way of analysing language that has improved on the models used previously. In the NLP field it has become the game-changing mechanism and the main focus of research around the world. This has brought the model wide recognition, especially through developments such as OpenAI's GPT-3 model for text generation.

Moreover, it has also been concluded that the architecture of Transformers is highly adaptable, hence applicable to tasks that may seem totally unrelated to each other. These applications could drive the development of new machine learning algorithms that rely on this technology.

The goal of this article is to present the Transformer in this new light, showing common applications and solutions that employ this model, but also highlighting the novel uses of this architecture that take advantage of its many strengths and high versatility.

So, a brief introduction to the Transformer, its beginnings and the most common uses will be presented next. In the second part of this article, we will delve deeper into the new advances being made by the research community, presenting some exciting new use cases and code examples along the way.

It should be noted that AI solutions sometimes lack the responsibility and rigour required when practising Data Science. The undesired effect is that models can retain the inherent bias of the data sets used to train them, and this can lead to fiascos such as the one involving Google's Photos app (https://www.bbc.com/news/technology-33347866). I recommend you check out my colleague Jesús Templado's article on responsible AI (https://medium.com/bedrockdbd/part-i-why-is-responsible-ai-a-hot-topic-these-days-da037dbee705), which offers some hands-on criteria to follow when ideating, training or fine-tuning these models.

 
 

TRANSFORMER: APPEARANCE & RESEARCH

NLP is one of the cornerstones of Data Science, and it is involved in most of our daily routines: web search engines, online translations or social networks are just some examples where AI algorithms are applied in the understanding of textual data. Until 2017, most research in this field was focused on developing better models based on recurrent and convolutional neural networks. These models were the highest performers in terms of accuracy and explainability at the time, albeit at the cost of enormous processing power and long training times. This meant the focus of the whole research community was on how to make these models perform better, or how to reduce the machine processing costs. However, a bottleneck was quickly being reached in terms of computational power, and novel ways of analysing text were needed more than ever.

In December 2017, the Transformer model architecture was proposed by Google Brain and Google Research members in the paper Attention Is All You Need (https://arxiv.org/abs/1706.03762), providing a new approach to NLP tasks through self-attention. This architecture completely outperformed previous models, both in accuracy and training time, and quickly became the state-of-the-art architecture for these applications.

One question may come to your mind: How does a Transformer work? How and why is it better? Although we will avoid highly technical explanations, a basic grasp of the fundamentals for each model is needed to understand its many advantages.

Figure x: schema of a neural network. (https://www.w3schools.com/ai/ai_neural_networks.asp)

Neural networks are connections of nodes that represent relationships within the data. They consist of input nodes where data is introduced, intermediate layers where it is processed, and output nodes where the results are obtained. Each of these nodes performs an operation on the data, essentially a weighted sum followed by a non-linear activation (much like a small regression), that contributes to the final result.
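As a tiny, illustrative example of the operation a single node performs (the numbers below are arbitrary and not from the article):

import torch

# Inputs, learned weights and bias for one node (values are arbitrary)
x = torch.tensor([0.5, -1.2, 3.0])
w = torch.tensor([0.8, 0.1, -0.4])
b = torch.tensor(0.2)

# Weighted sum of the inputs passed through a non-linearity (here, ReLU)
output = torch.relu(w @ x + b)
print(output)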

Figure 2: Graphical comparison between a neural network and an RNN. The loop provides the time dimension to the model.

Recurrent neural networks, or RNNs, also take into account the time dimension of the data, so the outcome is influenced by the previous values. This allows the previous state of the data to be kept and carried forward to the next step. A variation of the RNN called the LSTM, or long short-term memory network, keeps track of information across many previous steps, avoiding the short-term memory issues that plain RNNs usually present.
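A minimal, illustrative sketch of this idea using PyTorch's nn.LSTM (random data and arbitrary sizes, purely to show how the hidden and cell states are carried along the sequence):

import torch
import torch.nn as nn

# One batch containing a single sequence: 10 time steps, 8 features per step
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
sequence = torch.randn(1, 10, 8)

outputs, (hidden_state, cell_state) = lstm(sequence)
print(outputs.shape)       # torch.Size([1, 10, 16]): one output per time step
print(hidden_state.shape)  # torch.Size([1, 1, 16]): the state carried forward to the next step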

Figure 3: schematic view of a CNN. Feature learning involves the training process, while classification is the model output.

Convolutional neural networks or CNN apply a mathematical transformation called convolution to the data over a sliding window; this essentially looks at small sections of the data to understand its overall structure, finding patterns or features. The architecture is especially useful for Computer Vision applications, where objects are detected after looking at pieces of each picture.

Recurrence and convolution are the main strengths of these models and make them particularly suited to tasks such as Computer Vision, but recurrence becomes a burden when dealing with text analysis and NLP. The increase in computational power needed to handle more complex word relationships and context quickly became a limiting factor for the direct application of these models.

 

The advantage of the Transformer is that it replaces recurrence with attention. Attention in this context is a relation mechanism that works "word-to-word", computing the relationship of each word with all the others, including itself. Since this mechanism does not rely on recurrent components, the computational cost needed is lower than that of recurrence-based methods.

In the original Transformer architecture, this mechanism is actually a multi-headed attention that runs these operations in parallel, both to speed up the calculations and to learn different interpretations of the same sentence. Although other factors are involved, this is the main reason why the Transformer takes less time to train and produces better results than its counterparts, and why it is the predominant algorithm in NLP.
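To make the "word-to-word" idea concrete, here is a minimal, illustrative sketch of the basic single-head scaled dot-product self-attention over a four-word sentence, using random vectors instead of real embeddings (so the numbers themselves are meaningless):

import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model = 8
words = torch.randn(4, d_model)       # one (random) embedding per word

# Self-attention: queries, keys and values all come from the same words
Q, K, V = words, words, words
scores = Q @ K.T / d_model ** 0.5     # relevance of every word to every other word (and itself)
weights = F.softmax(scores, dim=-1)   # each row sums to 1
attended = weights @ V                # each word becomes a weighted mixture of all words

print(weights)                        # the 4x4 word-to-word attention matrix

The multi-headed version described above simply runs several of these attention computations in parallel and concatenates the results.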

If you want to learn more about the original Transformer and its most famous variants, I suggest you take a look at Transformers for Natural Language Processing by Denis Rothman; it includes a hands-on explanation and coding lines for each step performed by the model, which helps to understand its inner workings.

Another great thing about the Transformer research community is its willingness to share and spread knowledge. The online community HuggingFace provides a model repository, a Python library and plenty of documentation for using and training new models based on the frameworks developed by researchers. They also provide a course for those interested in learning about their platform, so it should be your first stop if you aim to learn more about the current state-of-the-art models!

Using these models is also very easy with the help of their library: in just a few lines of code we can use pre-trained models for different tasks. One example is using one of the more than 1,000 translation models developed by the University of Helsinki:

# Import the libraries
from transformers import MarianMTModel, MarianTokenizer
import torch

# Load a pretrained "English to Spanish" model
tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-es")
model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-es")

# Tokenise an input sentence
inputs = tokenizer("Transformers are a really cool tool for multiple NLP tasks, but they can do so much more!!", return_tensors="pt", padding=True)

# Generate the translation and print the result
print(tokenizer.batch_decode(model.generate(**inputs), skip_special_tokens=True)[0])

The output is the sentence: Los transformadores son una herramienta realmente genial para múltiples tareas NLP, pero pueden hacer mucho más!!

Our team at Bedrock has been able to leverage these models to deliver powerful business solutions to People Analytics companies, further reinforcing their utility in the professional environment!

Stay tuned for the next part of this article, where we will present cutting-edge uses of the Transformer in other areas of application of AI, where previously other models reigned supreme.
