At dunnhumby, our team of 500 data scientists is a vital part of our services for retailers, brands and customers. Julie Sharrocks, Head of Science for Category Management and Price & Promotions, looks at the ten key traits that are a must for a good dunnhumby data scientist.
Mathematics is the foundation of any modern field of science, and that includes machine learning. While some of your colleagues in other organisations might be happy to apply standardised algorithms and approaches to answer their business question, you will have the edge if you can build a deeper understanding of what’s going on. Many data scientists will study mathematics, science or engineering at university but some of these courses fall short of covering the right level of detail. It’s equally important to have a good understanding of statistics, linear algebra and calculus to understand some of the techniques you’ll go on to learn.
Only a few years ago, a combination of R, Hadoop, Impala and Mahout was considered cutting edge, whereas nowadays these are considered old-fashioned. Today we talk about combining Python for machine learning and Spark for its processing power. In the future we can expect even more change. A data scientist must have the ability and appetite to pick up new tools quickly, so that they’re not left behind.
To get a job as a data scientist you’ll need to demonstrate some experience with programming and manipulating large data sets. Even more important will be the ability to quickly learn new coding languages. The technologies and software that we make available to our data scientists through our Data Science Toolkit will evolve over time as they bring new capability to the business.
Soft skills have become increasingly important, and the best data scientists differentiate themselves on this basis. Employers look for professional skills, including managing timelines, priorities and stakeholders, as well as the ability to communicate difficult concepts to a non-technical audience. Good customer data scientists understand the commercial realities of retail: they know what retailers are trying to achieve and the issues they need to solve. A solid understanding of the sector’s pains and gains will help scientists choose the right strategy to solve retail problems.
At dunnhumby we have a comprehensive training plan, with courses in critical thinking, problem solving and effective communication, including data visualisation as a powerful channel of communication.
New privacy legislation is rightly giving consumers more rights over how their data is used. This includes GDPR in Europe and CCPA in California. As well as keeping consumer data secure and private, we must also understand the implications of the new legislation for data science. As a data scientist, you can’t simply blame the computer: the author needs to take full responsibility for the outputs. New legislation gives people the right to be forgotten; this also needs to be considered for copies of data used for building science. And we all have a responsibility to keep individual data private, employing techniques that analyse data without any individual ever being identifiable.
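One commonly used technique of this kind is to report only aggregates, and to suppress any group that contains too few customers to hide an individual. The sketch below is illustrative only: the transactions are made up, and the threshold `MIN_CUSTOMERS` and the function `aggregate_spend` are hypothetical names, not a description of dunnhumby's own tooling.

```python
from collections import defaultdict

MIN_CUSTOMERS = 3  # hypothetical threshold; a real value is a policy decision

# Made-up transactions: (customer_id, segment, spend)
transactions = [
    ("c1", "families", 42.0),
    ("c2", "families", 35.5),
    ("c3", "families", 51.0),
    ("c1", "families", 12.5),
    ("c4", "students", 18.0),
    ("c5", "students", 22.0),
]


def aggregate_spend(rows, min_customers=MIN_CUSTOMERS):
    """Total spend per segment, dropping segments with too few distinct customers."""
    customers = defaultdict(set)
    spend = defaultdict(float)
    for customer_id, segment, amount in rows:
        customers[segment].add(customer_id)
        spend[segment] += amount
    # Only segments that meet the threshold are reported; no customer IDs leave this function.
    return {
        segment: {"customers": len(ids), "total_spend": spend[segment]}
        for segment, ids in customers.items()
        if len(ids) >= min_customers
    }


report = aggregate_spend(transactions)
print(report)
```

Here the "students" segment is suppressed because only two distinct customers contribute to it, while "families" is reported as a total across three customers.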
At dunnhumby we stay at the cutting edge of scientific method and machine learning. We benefit enormously from our work with world class universities to introduce new thinking, allowing us to ‘bring the outside in’. We couple it with our insatiable appetite for the new. Our Data Science Club is a global movement, engaging colleagues across the world and extends to:
While staying up to date with new data science trends, we see some of our best loved paradoxes popping up across a whole host of our data science solutions. These paradoxes are far from new, with some more than 100 years old. Two of our favourites include Simpson’s Paradox (where trends appear to be reversed when groups are combined) and Hitchhiker’s Paradox (look this one up the next time you are waiting for a bus!).
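Simpson’s Paradox is easy to reproduce with a handful of counts. The sketch below uses made-up numbers for two hypothetical offers across two hypothetical store formats: offer A has the higher purchase rate within each format, yet offer B looks better once the formats are combined, because the two offers were shown to very different mixes of shoppers.

```python
# Made-up counts: (purchases, shoppers) for each offer within each store format.
data = {
    "convenience": {"offer_a": (81, 87), "offer_b": (234, 270)},
    "superstore": {"offer_a": (192, 263), "offer_b": (55, 80)},
}


def rate(purchases, shoppers):
    """Purchase rate for one offer in one store format."""
    return purchases / shoppers


def pooled_rate(offer):
    """Purchase rate for an offer after combining all store formats."""
    purchases = sum(data[fmt][offer][0] for fmt in data)
    shoppers = sum(data[fmt][offer][1] for fmt in data)
    return purchases / shoppers


for fmt, offers in data.items():
    a, b = rate(*offers["offer_a"]), rate(*offers["offer_b"])
    print(f"{fmt}: offer A {a:.1%} vs offer B {b:.1%}")

print(f"combined: offer A {pooled_rate('offer_a'):.1%} vs offer B {pooled_rate('offer_b'):.1%}")
```

The reversal comes from the weighting: most of offer A’s shoppers are in the format where rates are lower for everyone, so its combined rate is dragged down.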
As data scientists we must know the limits of our science. We are often asked for a level of precision that may look great but that, in practice, comes at the cost of accuracy. For example, if we build a predictive model at too granular a level, the results may look better on average, but there will be a high level of spread in the results, so each individual prediction will be worse.
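One way to see this effect is with a small simulation. The sketch below is a deliberately extreme, made-up case: every store has the same true average sales, and each store has only three observations. A "granular" model that predicts each store from its own average fits the data it was built on more closely, but predicts fresh data worse than a single pooled average. All values and names here are illustrative assumptions.

```python
import random
from statistics import mean

rng = random.Random(0)
TRUE_MEAN, NOISE_SD = 100.0, 20.0  # made-up values, identical for every store
N_STORES, N_OBS = 500, 3           # many stores, very little data per store

train = [[rng.gauss(TRUE_MEAN, NOISE_SD) for _ in range(N_OBS)] for _ in range(N_STORES)]
test = [[rng.gauss(TRUE_MEAN, NOISE_SD) for _ in range(N_OBS)] for _ in range(N_STORES)]


def mse(predictions, actuals):
    """Mean squared error, applying one prediction per store to all of its observations."""
    return mean((y - p) ** 2 for p, ys in zip(predictions, actuals) for y in ys)


# Granular model: one estimate per store. Pooled model: one estimate for all stores.
granular = [mean(store) for store in train]
pooled = [mean(y for store in train for y in store)] * N_STORES

granular_train_mse, pooled_train_mse = mse(granular, train), mse(pooled, train)
granular_test_mse, pooled_test_mse = mse(granular, test), mse(pooled, test)

print(f"fit on training data: granular {granular_train_mse:.0f}, pooled {pooled_train_mse:.0f}")
print(f"error on new data:    granular {granular_test_mse:.0f}, pooled {pooled_test_mse:.0f}")
```

In real data stores do differ, so the right level of granularity sits somewhere between these two extremes; the point of the sketch is only that a closer fit is not the same as a better prediction.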
Standard machine learning techniques such as cross-validation and regularisation will help reduce overfitting. But at dunnhumby our advice is to go one step further and ensure your model is explainable. In short, we need to fully understand what is going on under the bonnet. A black-box model may contain many spurious correlations that will break when fed a new dataset.
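As a minimal illustration of how cross-validation and regularisation work together, the sketch below uses a very simple form of regularisation: each group’s average is shrunk towards the overall average, with the strength `lam` chosen by k-fold cross-validation. The data are simulated and the names `shrunk_mean` and `cv_error` are hypothetical; this is not a description of any dunnhumby model.

```python
import random
from statistics import mean


def shrunk_mean(values, global_mean, lam):
    """Group average pulled towards the global average; lam=0 means no regularisation."""
    return (sum(values) + lam * global_mean) / (len(values) + lam)


def cv_error(groups, lam, k=3):
    """Mean squared error on held-out observations across k folds."""
    squared_errors = []
    for fold in range(k):
        # Observation i of every group is held out when i % k == fold.
        train = {g: [y for i, y in enumerate(ys) if i % k != fold] for g, ys in groups.items()}
        test = {g: [y for i, y in enumerate(ys) if i % k == fold] for g, ys in groups.items()}
        global_mean = mean(y for ys in train.values() for y in ys)
        for g, held_out in test.items():
            prediction = shrunk_mean(train[g], global_mean, lam)
            squared_errors.extend((y - prediction) ** 2 for y in held_out)
    return mean(squared_errors)


# Simulated data: groups differ a little, but individual observations are noisy.
rng = random.Random(42)
groups = {}
for g in range(200):
    true_mean = rng.gauss(100, 5)
    groups[g] = [rng.gauss(true_mean, 20) for _ in range(6)]

candidates = [0.0, 1.0, 4.0, 16.0, 64.0]
cv_scores = {lam: cv_error(groups, lam) for lam in candidates}
best_lam = min(cv_scores, key=cv_scores.get)

for lam, score in cv_scores.items():
    print(f"lam={lam:>5}: held-out MSE {score:.1f}")
print("chosen lam:", best_lam)
```

Cross-validation guards against picking the setting that merely fits the training data best. It does not, on its own, make the resulting model explainable.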
Our science must be robust enough to be effective across the dozens of retailers that we work with. Our unified demand model for price and promotions, for example, is econometric at its core to ensure we capture the most important and explainable effects.
Our data architecture and platforms have come a long way over the last 25 years. Today we work with vast quantities of data (into the petabytes) across a flexible combination of multi-cloud and dunnhumby-hosted data platforms. Despite this new environment, there will always be limits to what is practical to run in a productionised environment. Storage may be effectively unlimited, but it is costly, and processing excessively large volumes of data is not likely to curry favour with your infrastructure team, or with your team members when the lights start to dim.
As we build science products, we need to consider the balance between complexity and value. New science will be productionised through the science engineering team and made available as reusable science modules for the wider dunnhumby teams.
We want our data scientists to be able to easily plug in and use each other’s code. Doing this saves time and helps us to focus our energy on true innovation. Code needs to be well written, efficient and commented so that it can be shared and picked up elsewhere. Equally important is proactively sharing or publishing your approach and code, so that the same thing is not being invented twice within the same community.
At dunnhumby, we have long been applying automated science through our products. Recent examples include our assortment planning software where we have automated the process of identifying shopper need states and customer decision trees, allowing our assortment science to be easily accessible for retailers, whatever their size or sophistication.
Find out more
Interested in joining dunnhumby’s data science teams? Find out more about our science here and learn how we apply data science to help retailers and brands put their customers first.
Julie Sharrocks is Head of Science for Category Management and Price & Promotions at dunnhumby. She has been a member of the data science community at dunnhumby for more than 15 years and currently manages a large team of expert research data scientists and science engineers.