How to solve the problem of product similarity with data science

What makes cat food similar to dog food? Yes, they’re both food and both for household pets. But they’re hardly interchangeable. Questions of similarity are everywhere in retail, and they’re a real issue for retailers.

When you browse items on Amazon, you’ll notice their “people who bought this also bought…” feature. This system identifies products that complement the one you’re looking at and nudges you to add them to your basket. So if you’re thinking of buying cat food, they may offer you another flavour, or cat litter. When you order groceries online, you might expect some products to be out of stock and substituted with ‘similar’ items. However, your cat will be less than impressed if their favourite dish has been replaced by canine treats.

Similarity wanted, £1 million reward

The truth is that solving item similarity is hard; I’d say it’s one of the grand challenges in data science. It’s so hard that companies have invested millions of pounds trying to solve it. Whole teams of data scientists have created enormous computing clusters and ultra-complex algorithms, just to calculate similarities. Famously, Netflix offered $1m to the team that could best predict how its users would rate films they hadn’t yet seen. The winning approach drew on the ratings of other users with ‘similar’ tastes. Though the winning algorithm was the most accurate, it was never fully implemented, because the gain in accuracy did not justify the engineering effort required.

To find an answer to the similarity problem, we’ve looked beyond retail to language. Colleagues in the field of natural language processing (NLP) have been pondering the question for decades and have become pretty good at it. An Amazon search for “pet food” will deliver some pretty relevant results. We would be less impressed if the search algorithm returned food that wasn’t for pets, even though the word “food” is in the search term. So how are they doing this?

Are embeddings the answer?

One of the tricks NLP researchers use to represent words and sentences is known as ‘embeddings’. These are special representations used to compute similarities between items. When the embeddings are good, similar items (e.g. words) end up with similar embeddings; dissimilar items have dissimilar embeddings. NLP researchers spend hours fine-tuning algorithms to create embeddings of text corpora; for example, training them on words in a newspaper.

Can you learn good embeddings? One popular technique monitors the co-occurrence of words in sentences. When words co-occur a lot, they tend to end up with similar embeddings. We can use these word-level embeddings to create document-level embeddings. For example, we can calculate the similarity between sentences or paragraphs by comparing the embeddings of the words inside them.
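To make the idea of comparing embeddings concrete, here is a minimal sketch in Python. It assumes a handful of invented three-dimensional word vectors (real embeddings are learnt from data and have far more dimensions), measures similarity with the cosine of the angle between vectors, and builds a sentence embedding by simply averaging word vectors. Averaging is only one of several ways to combine word embeddings, and the names WORD_VECTORS, cosine_similarity and sentence_embedding are ours, chosen for illustration.

```python
import math

# Toy 3-dimensional word embeddings. The numbers are invented for
# illustration; real embeddings are learnt from a text corpus.
WORD_VECTORS = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "food": [0.1, 0.9, 0.0],
    "litter": [0.7, 0.0, 0.3],
    "laptop": [0.0, 0.1, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def sentence_embedding(words):
    """Average the vectors of the known words (one simple choice of many)."""
    vectors = [WORD_VECTORS[w] for w in words if w in WORD_VECTORS]
    if not vectors:
        raise ValueError("none of the words has an embedding")
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]


cat_food = sentence_embedding(["cat", "food"])
dog_food = sentence_embedding(["dog", "food"])
laptop = sentence_embedding(["laptop"])

print("cat food vs dog food:", round(cosine_similarity(cat_food, dog_food), 3))
print("cat food vs laptop:  ", round(cosine_similarity(cat_food, laptop), 3))
```

With these made-up vectors, the two pet food phrases score much closer to each other than either does to the laptop, which is the behaviour we want from a good embedding.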

Words are to sentences as products are to baskets

In a sense, words in sentences are similar to products in baskets. Some occur together frequently, others less so. We’ve been considering this and experimenting with NLP algorithms (such as those that learn embeddings), switching out words and sentences for products and baskets or customers.
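The analogy can be sketched with a few lines of Python. The example below is not the algorithm we used: it is a deliberately simple stand-in that represents each product by its raw co-occurrence counts with every other product, then compares products with cosine similarity. Learnt embeddings compress this kind of signal into far fewer dimensions. The baskets and product names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations

# Hypothetical baskets; product names are illustrative only.
BASKETS = [
    ["cat food", "cat litter", "milk"],
    ["cat food", "cat litter", "bread"],
    ["cat food", "cat treats", "milk"],
    ["dog food", "dog treats", "bread"],
    ["dog food", "dog treats", "milk"],
    ["dog food", "dog lead", "bread"],
]


def cooccurrence_vectors(baskets):
    """Represent each product by how often it shares a basket with each other product."""
    products = sorted({p for basket in baskets for p in basket})
    index = {p: i for i, p in enumerate(products)}
    counts = defaultdict(lambda: [0] * len(products))
    for basket in baskets:
        # Count every unordered pair once per basket, in both directions.
        for a, b in combinations(set(basket), 2):
            counts[a][index[b]] += 1
            counts[b][index[a]] += 1
    return {p: counts[p] for p in products}


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(product, vectors, top_n=3):
    """Rank the other products by cosine similarity to the given product."""
    scores = [
        (other, cosine_similarity(vectors[product], vec))
        for other, vec in vectors.items()
        if other != product
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


vectors = cooccurrence_vectors(BASKETS)
for name, score in most_similar("cat litter", vectors):
    print(f"{name}: {score:.3f}")
```

Note that cat litter and cat treats never appear in the same basket here, yet they come out as similar because both tend to be bought alongside cat food. That is the ‘bought in similar contexts’ effect in miniature.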

Training these algorithms isn’t straightforward. We tend to lack a ‘ground truth’ for similarity, which means we have to manually monitor the appropriateness of the embeddings’ sense of similarity. Thankfully, NLP researchers have come up with a solution to that too: visualising the similarities in a 3D plot.

By visualising embeddings in this way, we could see the impact of the algorithm. For example, in one experiment the algorithm had learnt that Easter products were all similar to each other. It had no idea that the word ‘Easter’ appeared in the descriptions of these products – it just used the fact they tended to be bought together. In another experiment, the algorithm had started to group vegetarian products together. This is equally impressive, as the algorithm had no idea that these products contained ‘vegetarian’ in the name, or were meat free. It simply figured out that they tended to be bought in similar contexts.

Reasoning experiments, for veggie burgers

NLP researchers have a slightly more objective way of measuring the quality of their embeddings. They run little maths experiments with their embeddings, called ‘analogical reasoning’ experiments. One of the most famous analogical reasoning experiments is:

King – man + woman = ?

When NLP researchers perform this calculation between the embedding vectors for the words (king, man and woman) they find that the embedding that is most similar to the result is ‘queen’. This, they argue, suggests that the algorithm understands the semantic relationships between words, such as the gender relationships between men, women, kings and queens.

Inspired by these examples, we ran the following:

Frozen Burgers – Beef + Tofu = ?

The results showed that the algorithm considered ‘Frozen vegetarian meat’ to be among the most similar products. This suggests that the embeddings had distilled the concepts of ‘vegetarianism’ and ‘frozen’ foods.
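The mechanics of an analogical reasoning experiment are straightforward to sketch: add and subtract the embedding vectors, then look for the nearest remaining embedding. The Python below illustrates those mechanics only. The four-dimensional vectors are hand-crafted by us (the dimensions loosely stand for frozen, burger-like, beef and plant-based), so the answer is built in by construction; in a real experiment the vectors would be learnt from basket data and the result would not be known in advance.

```python
import math

# Hand-crafted toy vectors, NOT learnt embeddings. Dimensions loosely mean:
# [frozen, burger-like, beef, plant-based]. All values are invented.
PRODUCT_VECTORS = {
    "frozen burgers": [1.0, 1.0, 1.0, 0.0],
    "beef": [0.0, 0.0, 1.0, 0.0],
    "tofu": [0.0, 0.0, 0.0, 1.0],
    "frozen vegetarian meat": [1.0, 0.8, 0.0, 1.0],
    "fresh beef mince": [0.0, 0.2, 1.0, 0.0],
    "frozen peas": [1.0, 0.0, 0.0, 0.3],
}


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def analogy(start, minus, plus, vectors, top_n=3):
    """Compute start - minus + plus and rank the remaining items by similarity."""
    target = [
        s - m + p
        for s, m, p in zip(vectors[start], vectors[minus], vectors[plus])
    ]
    # Exclude the query items themselves, as is usual in these experiments.
    candidates = [
        (name, cosine_similarity(target, vec))
        for name, vec in vectors.items()
        if name not in {start, minus, plus}
    ]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:top_n]


for name, score in analogy("frozen burgers", "beef", "tofu", PRODUCT_VECTORS):
    print(f"{name}: {score:.3f}")
```

The same function would run the ‘King – man + woman’ experiment given word vectors in place of product vectors.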

We don’t get excited easily, but…

Embeddings are becoming an important part of our science infrastructure. We are exploring their value in ranging, where we can use these representations to suggest alternative products when others have been delisted. Integrated into personalised recommender systems, embeddings can also be used to suggest products that match a customer’s tastes.

There’s huge potential for embeddings to transform retail. Perhaps one of the most exciting things about them is that they can help us address one of the thorniest problems in our industry: the problem of similarity.
