(Nearly) Real-world data
Here at dunnhumby, we understand the importance of great data and the analysts who make sense of it. Uncovering patterns, predicting trends, validating theories — insight gained through analysing customer data is the foundation of our business and key to the success of every one of our clients.
But more than that, we just really love data. We love connecting the dots. We love the human stories data can help you tell. And we love the people who love data as much as we do. That’s why we created Source Files, a platform for sharing datasets inspired by the real world, where fellow data geeks – from professors to students to data scientists – can easily access rich data sources. Whether you’re teaching a course, completing a class project, testing an algorithm, or running a hackathon, Source Files is the place to go to put your theory into practice.
Breakfast at the Frat
What’s inside?
A representation of sales and promotion information on five products from three brands within four categories (mouthwash, pretzels, frozen pizza, and boxed cereal) over 156 weeks.
Unit sales, households, visits, and spend data by product, store, and week
Base Price and Shelf Price, to determine a product’s discount, if any
Promotional support details (e.g. sale tag, in-store display), if applicable
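As a minimal sketch of how Base Price and Shelf Price can be turned into a discount, the Python snippet below computes the discount as a fraction of base price. The row layout, the product and store identifiers, and the function name `discount_fraction` are illustrative assumptions, not the dataset's actual schema.

```python
def discount_fraction(base_price: float, shelf_price: float) -> float:
    """Return the discount as a fraction of base price (0.0 means no discount)."""
    if base_price <= 0:
        raise ValueError("base_price must be positive")
    # Clamp at zero so a shelf price above base price counts as "no discount".
    return max(0.0, (base_price - shelf_price) / base_price)


# Hypothetical rows: (product, store, week, base_price, shelf_price)
rows = [
    ("PRETZELS_A", 101, 1, 2.00, 2.00),  # sold at base price
    ("PRETZELS_A", 101, 2, 2.00, 1.50),  # on promotion
]

discounts = [round(discount_fraction(base, shelf), 4) for (_, _, _, base, shelf) in rows]
print(discounts)  # [0.0, 0.25]
```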
What’s it for?
This dataset is designed to facilitate time series analyses, including:
Price sensitivity analysis
Promotional effectiveness analysis
Comparing/contrasting results across products, categories or store geographies
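One common way to approach price sensitivity, offered here as an illustration rather than the method this dataset prescribes, is to fit a straight line to log(units) against log(price) and read the slope as a price elasticity. The sketch below uses synthetic numbers built to have an elasticity of exactly -2, so the expected result is known in advance. The function name `log_log_slope` is an assumption.

```python
import math


def log_log_slope(prices, units):
    """Least-squares slope of log(units) on log(price), a simple elasticity estimate."""
    xs = [math.log(p) for p in prices]
    ys = [math.log(u) for u in units]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("prices must vary to estimate a slope")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic weekly observations (not real data): units = 100 * price^-2
prices = [1.00, 1.25, 1.50, 2.00]
units = [100 * p ** -2 for p in prices]

elasticity = log_log_slope(prices, units)
print(round(elasticity, 6))
```

With real data you would usually fit this per product or per category, and control for promotional support such as displays, since those also move unit sales.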
Carbo-Loading
What’s inside?
A representation of household level transactions over a period of two years from four categories: Pasta, Pasta Sauce, Syrup, and Pancake Mix
What’s it for?
Classroom projects and case studies
Understanding the process required to mine data
Learning how to merge data tables and aggregate data
How should I use it?
Professors have had success asking students questions such as:
What is the household penetration of Product X? That is, out of all customers purchasing Pasta Sauce, what percentage purchase Product X or Brand Z?
Did any customers first purchase an item or category using a coupon? If so, how many of these customers made additional purchases of the item or category?
In two complementary categories (e.g. Pasta and Pasta Sauce), what products, if any, are commonly purchased together?
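To make the first and third questions concrete, here is a small sketch using only the Python standard library. The tuple layout (household, basket, category, product), the sample rows, and the function names are hypothetical stand-ins for whatever columns the downloaded tables actually contain.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: (household_id, basket_id, category, product)
transactions = [
    (1, "b1", "Pasta Sauce", "X"),
    (1, "b1", "Pasta", "P1"),
    (2, "b2", "Pasta Sauce", "Y"),
    (2, "b2", "Pasta", "P1"),
    (3, "b3", "Pasta Sauce", "X"),
    (3, "b3", "Pasta", "P2"),
    (4, "b4", "Pasta", "P1"),
    (5, "b5", "Pasta Sauce", "X"),
    (5, "b5", "Pasta", "P1"),
]


def penetration(transactions, category, product):
    """Share of households buying the category that also bought the product."""
    category_households = {h for h, _, c, _ in transactions if c == category}
    product_households = {
        h for h, _, c, p in transactions if c == category and p == product
    }
    if not category_households:
        return 0.0
    return len(product_households) / len(category_households)


def co_purchases(transactions):
    """Count how often each pair of products appears in the same basket."""
    baskets = {}
    for _, basket_id, _, product in transactions:
        baskets.setdefault(basket_id, set()).add(product)
    pairs = Counter()
    for products in baskets.values():
        # Sort so each pair is always counted under the same key.
        pairs.update(combinations(sorted(products), 2))
    return pairs


print(penetration(transactions, "Pasta Sauce", "X"))  # 0.75
print(co_purchases(transactions).most_common(1))      # [(('P1', 'X'), 2)]
```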
Special considerations
Don’t forget, you’re dealing with Big Data! Large files may take five minutes or more to download, and importing the millions of rows they contain will require specialised software such as R, Microsoft Excel with PowerPivot, Microsoft Access, SAS, SPSS, or a SQL database.
The Complete Journey
What’s inside?
A representation of household level transactions over two years from a group of 2,500 households who are frequent shoppers at a retailer
All of a household’s purchases within the store, not just those from a limited number of categories
Customer attributes and direct marketing contact history for select households
What’s it for?
More advanced classroom settings
Academic research on the effects of direct marketing to customers
How should I use it?
Professors have had success asking students questions such as:
How many customers are spending more/less over time?
Which customer attributes appear to affect customer spend?
Is there evidence to suggest that direct marketing improves overall customer engagement?
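For the first question, one simple approach is to split the period at a chosen week and compare each household's spend before and after. The sketch below assumes hypothetical (household, week, spend) rows and an arbitrary split at week 52. Both the layout and the split point are illustrative choices, not part of the dataset.

```python
from collections import defaultdict


def spend_trend(transactions, split_week):
    """Label each household 'more', 'less' or 'same' by comparing spend
    from split_week onwards against spend before it."""
    early = defaultdict(float)
    late = defaultdict(float)
    for household, week, spend in transactions:
        bucket = early if week < split_week else late
        bucket[household] += spend
    trend = {}
    for household in set(early) | set(late):
        diff = late[household] - early[household]
        trend[household] = "more" if diff > 0 else "less" if diff < 0 else "same"
    return trend


# Hypothetical rows: (household_id, week, spend)
transactions = [
    (1, 1, 10.0), (1, 60, 15.0),
    (2, 2, 20.0), (2, 70, 5.0),
    (3, 3, 8.0), (3, 80, 8.0),
]

trend = spend_trend(transactions, split_week=52)
print(sorted(trend.items()))  # [(1, 'more'), (2, 'less'), (3, 'same')]
```

A fuller analysis would look at spend per visit or per week rather than a single before/after split, but the same grouping step applies.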
Special considerations
Don’t forget, you’re dealing with Big Data! Large files may take five minutes or more to download, and importing the millions of rows they contain will require specialised software such as R, Microsoft Excel with PowerPivot, Microsoft Access, SAS, SPSS, or a SQL database.
Let’s Get Sort-of-Real
What’s inside?
By the numbers
117: Weeks of dummy till-transaction data
300M: Total number of transactions
47M: Total number of baskets
400,000: Average number of baskets per week
2.6M: Average number of transactions per week
~500,000: Number of distinct customers
~5,000: Number of distinct products
~760: Number of distinct stores
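As a quick sanity check, the weekly averages above follow from the totals and the number of weeks. The snippet below redoes that arithmetic. The items-per-basket figure assumes that a "transaction" here means a single item line within a basket, which this page does not state explicitly.

```python
weeks = 117
total_transactions = 300_000_000
total_baskets = 47_000_000

avg_baskets_per_week = total_baskets / weeks            # roughly 400,000
avg_transactions_per_week = total_transactions / weeks  # roughly 2.6 million

# Assumption: one transaction = one item line, so this is items per basket.
transactions_per_basket = total_transactions / total_baskets

print(round(avg_baskets_per_week))
print(round(avg_transactions_per_week))
print(round(transactions_per_basket, 1))
```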
What’s it for?
We’ve replicated the typical patterns found in real in-store data to help data scientists test their techniques and algorithms in a (nearly) real-world environment.
A note on download times
Please remember, you’re dealing with Big Data! Large file sizes can result in download times of five minutes or more. Please be patient.
Samples available
Data preview
2,000 baskets, randomly selected, over a period of two weeks
All transactions for a randomly selected sample of 5,000 customers
All transactions for a randomly selected sample of 50,000 customers
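If you want to build a similar customer-level sample yourself, the idea is to pick customers at random and then keep every transaction belonging to them. Here is a minimal sketch, assuming hypothetical (customer_id, basket_id) rows. It illustrates the technique and is not a description of how the samples above were produced.

```python
import random


def sample_customer_transactions(transactions, n_customers, seed=0):
    """Pick n_customers at random and return them with all of their transactions."""
    customers = sorted({customer for customer, _ in transactions})
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    chosen = set(rng.sample(customers, n_customers))
    kept = [t for t in transactions if t[0] in chosen]
    return chosen, kept


# Hypothetical data: 10 customers with 3 baskets each
transactions = [
    (customer, f"basket-{customer}-{i}")
    for customer in range(1, 11)
    for i in range(3)
]

chosen, sampled = sample_customer_transactions(transactions, n_customers=4)
print(len(chosen), len(sampled))  # 4 12
```

Sampling by customer rather than by row keeps each selected customer's full purchase history intact, which matters for any analysis of behaviour over time.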
Full dataset
Ready to get real? Grab the full 4.3GB dataset below (in nine ~500MB files, for your downloading convenience).

