10 vital ingredients for the dunnhumby data scientist

At dunnhumby, our team of 500 data scientists are a vital part of our services for retailers, brands and customers. Julie Sharrocks, Head of Science for Category Management and Price & Promotions, looks at the ten key traits that are a must for a good dunnhumby data scientist.

A strong background in mathematics

Mathematics is the foundation of any modern field of science, and that includes machine learning. While some of your colleagues in other organisations might be happy to apply standardised algorithms and approaches to answer their business question, you will have the edge if you can build a deeper understanding of what’s going on. Many data scientists will study mathematics, science or engineering at university but some of these courses fall short of covering the right level of detail. It’s equally important to have a good understanding of statistics, linear algebra and calculus to understand some of the techniques you’ll go on to learn.

Flexible programming skills

Only a few years ago, a combination of R, Hadoop, Impala and Mahout was considered cutting edge, whereas nowadays these are considered old-fashioned. Today we talk about combining Python for machine learning and Spark for its processing power. In the future we can expect even more change. A data scientist must have the ability and appetite to pick up new tools quickly, so that they’re not left behind.

To get a job as a data scientist you’ll need to demonstrate some experience with programming and manipulating large data sets. Even more important will be the ability to quickly learn new coding languages. The technologies and software that we make available to our data scientists through our Data Science Toolkit will evolve over time as they bring new capability to the business.

The ability to communicate to a variety of audiences

Soft skills have become increasingly important, and the best data scientists differentiate themselves on this basis. Employers look for professional skills, including managing timelines, priorities and stakeholders, as well as the ability to communicate difficult concepts to a non-technical audience. Good customer data scientists understand the commercial realities of retail: what retailers are trying to achieve and the issues they need to solve. A solid understanding of the sector’s pains and gains will help scientists choose the right strategy to solve retail problems.

At dunnhumby we have a comprehensive training plan, including courses in critical thinking, problem solving and effective communication, with data visualisation taught as a powerful channel of communication.

A high awareness of privacy and security

New privacy legislation is rightly giving consumers more rights over how their data is used. This includes GDPR in Europe and CCPA in California. As well as keeping consumer data secure and private, we must also understand the implications of the new legislation for data science. As a data scientist, you can’t simply blame the computer: the author needs to take full responsibility for the outputs. New legislation gives people the right to be forgotten; this also needs to be considered for copies of data used for building science. And we all have a responsibility to keep individual data private, employing techniques that analyse data without any individual ever being identifiable.
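One common way to analyse data without exposing individuals is to report only aggregates and to suppress any group that is too small. The sketch below is illustrative only: the function name, the row layout and the threshold of five customers are assumptions, not dunnhumby's method, and suppression alone does not by itself make a dataset compliant with any particular regulation.

```python
from collections import defaultdict

MIN_GROUP_SIZE = 5  # assumed threshold; the right value depends on policy


def aggregate_spend(rows, min_group_size=MIN_GROUP_SIZE):
    """Return mean spend per segment, dropping segments with too few customers.

    `rows` is an iterable of (customer_id, segment, spend) tuples (hypothetical layout).
    """
    spend_by_segment = defaultdict(dict)
    for customer_id, segment, spend in rows:
        per_customer = spend_by_segment[segment]
        per_customer[customer_id] = per_customer.get(customer_id, 0.0) + spend

    report = {}
    for segment, per_customer in spend_by_segment.items():
        # Only publish a figure when enough distinct customers contribute to it.
        if len(per_customer) >= min_group_size:
            report[segment] = sum(per_customer.values()) / len(per_customer)
    return report


# Made-up rows: six customers in one segment, a single customer in another.
rows = [(f"c{i}", "families", 20.0 + i) for i in range(6)] + [("c99", "students", 7.5)]
report = aggregate_spend(rows)
```

Here the single-customer "students" segment is left out of the report, because publishing its average would reveal that one customer's spend.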

A thirst for the new…

At dunnhumby we stay at the cutting edge of scientific method and machine learning. We benefit enormously from our work with world class universities to introduce new thinking, allowing us to ‘bring the outside in’. We couple it with our insatiable appetite for the new. Our Data Science Club is a global movement that engages colleagues across the world and extends to:

Sessions to celebrate the projects undertaken as part of our academic partnerships
Updates from global conferences to gain insight into new machine learning techniques used in multiple sectors
Advanced onsite training programmes led by our academic partnership students
Local reading groups deep-diving into research papers and investigating new machine learning techniques

…but also a thirst for the old

While staying up to date with new data science trends, we see some of our best-loved paradoxes popping up across a whole host of our data science solutions. These paradoxes are far from new, with some more than 100 years old. Two of our favourites are Simpson’s Paradox (where a trend that appears in each group is reversed when the groups are combined) and Hitchhiker’s Paradox (look this one up the next time you are waiting for a bus!).
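Simpson’s Paradox is easiest to see with numbers. In the sketch below, all counts are hypothetical and chosen purely for illustration: promotion A converts better than promotion B in both small and large stores, yet B looks better once the two store types are pooled, because B was mostly run where conversion is high anyway.

```python
# Hypothetical counts: (purchases, shoppers) per promotion and store segment.
data = {
    "A": {"small_stores": (81, 87), "large_stores": (192, 263)},
    "B": {"small_stores": (234, 270), "large_stores": (55, 80)},
}


def segment_rates(promo):
    """Conversion rate within each store segment for one promotion."""
    return {seg: p / n for seg, (p, n) in data[promo].items()}


def overall_rate(promo):
    """Conversion rate after pooling all segments for one promotion."""
    purchases = sum(p for p, _ in data[promo].values())
    shoppers = sum(n for _, n in data[promo].values())
    return purchases / shoppers


rates_a, rates_b = segment_rates("A"), segment_rates("B")

# A beats B inside every segment...
a_wins_each_segment = all(rates_a[seg] > rates_b[seg] for seg in rates_a)
# ...but B beats A once the segments are combined.
b_wins_overall = overall_rate("B") > overall_rate("A")

for promo in data:
    print(promo, segment_rates(promo), round(overall_rate(promo), 3))
```

The reversal comes from the unequal mix of store types behind each promotion, which is why it pays to check results at the segment level before drawing conclusions from a combined figure.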

An eye for false detail

As data scientists we must know the limits of our science. We are often asked to do something that would supply a level of precision that may look great but that, in practice, would come at the expense of accuracy. For example, if we build a predictive model at too granular a level, we may find that the results look better on average, but there will in fact be a high level of spread in the results, so each individual prediction will be worse.
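One way to see why finer granularity can hurt is that each estimate rests on fewer observations. The simulation below is a minimal sketch under assumed values (a true mean of 10, noise with standard deviation 4, and cell sizes of 5 versus 100 observations); it compares how widely the estimates scatter at the two levels.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable
TRUE_MEAN, NOISE_SD = 10.0, 4.0  # assumed values for the simulation


def estimate_spread(sample_size, n_estimates=200):
    """Standard deviation of sample-mean estimates, each built from `sample_size` observations."""
    estimates = [
        statistics.mean([rng.gauss(TRUE_MEAN, NOISE_SD) for _ in range(sample_size)])
        for _ in range(n_estimates)
    ]
    return statistics.stdev(estimates)


# Granular cells: few observations behind each estimate.
granular_spread = estimate_spread(sample_size=5)
# Pooled cells: many observations behind each estimate.
pooled_spread = estimate_spread(sample_size=100)

print(f"granular spread: {granular_spread:.2f}, pooled spread: {pooled_spread:.2f}")
```

The spread of a sample mean shrinks roughly with the square root of the sample size, so the granular estimates are noticeably noisier even though both target the same underlying value.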

Knowledge of the value of an explainable model

Standard machine learning techniques such as cross-validation and regularisation will help reduce overfitting. But at dunnhumby our advice is to go one step further to ensure your model is explainable. In short, we need to fully understand what is going on under the bonnet. A black-box model may contain many spurious correlations that will break when fed a new dataset.
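As a rough illustration of how cross-validation and regularisation work together, the sketch below fits a one-feature ridge regression and uses k-fold cross-validation to choose the penalty. The data are simulated, and the helper names and candidate penalties are assumptions made for this example rather than anything taken from dunnhumby's toolkit.

```python
import random


def fit_ridge(xs, ys, lam):
    """One-feature ridge regression without intercept: w = sum(x*y) / (sum(x*x) + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def k_fold_mse(xs, ys, lam, k=5):
    """Average held-out mean squared error over k folds."""
    n = len(xs)
    fold_errors = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))  # every k-th point forms the held-out fold
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        w = fit_ridge(train_x, train_y, lam)
        squared = [(ys[i] - w * xs[i]) ** 2 for i in test_idx]
        fold_errors.append(sum(squared) / len(squared))
    return sum(fold_errors) / k


# Simulated data: y = 2x plus noise (assumed for illustration).
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(40)]
ys = [2.0 * x + rng.gauss(0, 0.5) for x in xs]

candidates = [0.0, 0.1, 1.0, 10.0, 100.0]  # hypothetical penalty grid
scores = {lam: k_fold_mse(xs, ys, lam) for lam in candidates}
best_lam = min(scores, key=scores.get)
print("chosen penalty:", best_lam)
```

A larger penalty shrinks the coefficient towards zero, and the held-out error shows when that shrinkage starts to cost more than it saves. With a single coefficient the fitted model also stays easy to inspect.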

Our science must be robust enough to be effective across the dozens of retailers that we work with. Our unified demand model for price and promotions, for example, is econometric at its core to ensure we capture the most important and explainable effects.

The ability to build at scale

Our data architecture and platforms have come a long way over the last 25 years. Today we work with vast quantities of data (into the petabytes) across a flexible combination of multi-cloud and dunnhumby-hosted data platforms. Despite this new environment, there will always be limitations to what is practical to run in a productionised environment. Storage may be infinite but costly, while processing excessively large volumes of data is not likely to curry favour with your infrastructure team, or your team members when the lights start to dim.

As we build science products, we need to consider the balance between complexity and value. New science will be productionised through the science engineering team and made available as reusable science modules for the wider dunnhumby teams.

Awareness that the ultimate goal is automation

We want our data scientists to be able to easily plug in and use each other’s code. Doing this saves time and helps us to focus our energy on true innovation. Code needs to be well written, efficient and commented so it can be shared and picked up elsewhere. Equally important is proactively sharing or publishing your approach and code, to ensure there is no simultaneous invention happening within the same community.

At dunnhumby, we have long been applying automated science through our products. Recent examples include our assortment planning software where we have automated the process of identifying shopper need states and customer decision trees, allowing our assortment science to be easily accessible for retailers, whatever their size or sophistication.

Find out more

Interested in joining dunnhumby’s data science teams? Find out more about our science here and learn how we apply data science to help retailers and brands put their customers first.

Julie Sharrocks is Head of Science for Category Management and Price & Promotions at dunnhumby. She has been a member of the data science community at dunnhumby for more than 15 years and currently manages a large team of expert research data scientists and science engineers.
