The not so secret weapon in data science

27 September 2017

A recent LinkedIn article claimed that Python has overtaken R as the leading language used by data scientists to create Machine Learning platforms. While this may come as no surprise to many in the data science community who have embraced Python for its scalability and ease of use, the uninitiated may be wondering what all the fuss is about.

Programming languages, believe it or not, have existed for over 200 years. We’ve come a long way from punch-card programmable looms and machine-specific assembly languages to the programming languages we are so familiar with today. Higher-level languages evolved because the early low-level ones were far too laborious and error-prone to build entire systems with. Object-oriented programming, in turn, gave non-specialists a good way to create meaningful applications by hiding complex implementations from the user. Today, the drive for more reliable and versatile code sustains an open-source ecosystem among developers.

More recently, APIs have found their way into code in every industry from government to gaming. Google Maps was one of the first widely known examples of mashing together data from different web applications to make a new one. Its enormous popularity prompted Google to release an API so that developers could use its map services without the need to hack. Demand drives development forward.

The principles that drove the evolution of programming are why Python has become the programming language of choice today. Although out of the box it doesn’t do anything clever like statistical modelling or even matrix multiplication, it does have dedicated fans who, over time, have established a thriving ecosystem of statistical and analytical tools built for common purposes. Libraries created by academic institutions, as well as corporates, to solve their own scientific challenges have been open-sourced and shared with the global community, extending Python’s capability to do things like analysing celestial objects or creating the next Picasso using deep learning. Easy to use, general-purpose and transparent, Python also encourages self-service.

Python’s active ecosystem has enabled users to interface with tools written in other programming languages. Such is its versatility that newer computing frameworks provide APIs that allow users to code in Python. One example is the PySpark API, which lets Python programmers take advantage of a cluster computing framework. Cluster computing allows tasks to run in parallel across many machines, which means Machine Learning algorithms can be scaled to run faster. Companies like eBay, Yahoo and Netflix are using Apache Spark to enhance the customer experience with targeted offers, personalised content and online recommendations.
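To make the idea of running tasks in parallel a little more concrete, here is a minimal sketch that uses only Python's standard library rather than Spark itself. The helper names (`partition`, `sum_of_squares`, `parallel_sum_of_squares`) are made up for illustration, and a pool of threads on one machine stands in for the workers of a real cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items, n_parts):
    """Split a list into n_parts roughly equal, contiguous chunks."""
    size, extra = divmod(len(items), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        # The first `extra` chunks take one additional item each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """The per-partition task: it needs no data from any other chunk."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(items, n_parts=4):
    """Run the task on every partition concurrently, then combine the results."""
    chunks = partition(items, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partial_sums = list(pool.map(sum_of_squares, chunks))
    return sum(partial_sums)


total = parallel_sum_of_squares(list(range(1, 101)))
print(total)
```

The shape is what matters here: split the data, apply the same independent task to each piece, then combine the partial results. A cluster framework applies that pattern across many machines. On a single machine, threads will not speed up CPU-bound work like this in standard Python, so treat the example as an illustration of the pattern rather than a performance recipe.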

At dunnhumby, Python is contributing to greater productivity within our data science teams. We benefit greatly from the active ecosystem and use a whole host of open-source libraries. For example, packages like pandas, numpy, scipy, statsmodels and scikit-learn enable us to quickly iterate through different machine learning models. Building on other Python modules, we created dunnhumby’s own Python library for Data Science, which has reduced routine analytical workload from weeks to days. Using libraries like xlsxwriter to output pre-formatted reports eliminates the need to manually highlight numbers or fields. Graphics libraries like matplotlib and seaborn allow us to create visually enticing charts and graphs which help communicate key insights and results to our retail clients. For any repetitive task in a daily workflow (for example, a data analyst publishing data into marts for analysis), it’s very likely that Python can automate it and save valuable time. We’re already taking advantage of these benefits, with 200 of our data science professionals skilled in Python programming today and plans in place to have all analysts trained in Python by mid-2018, further boosting productivity and development capabilities.
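As an illustration of what quickly iterating through different models can look like, the sketch below scores three scikit-learn classifiers on the iris dataset that ships with the library. The choice of dataset and models, and the helper name `compare_models`, are assumptions made for this example; it is not a description of dunnhumby's own library.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def compare_models(X, y, cv=5):
    """Return the mean cross-validated accuracy for a few candidate models."""
    candidates = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "decision_tree": DecisionTreeClassifier(random_state=0),
        "k_nearest_neighbours": KNeighborsClassifier(),
    }
    return {
        name: cross_val_score(model, X, y, cv=cv).mean()
        for name, model in candidates.items()
    }


# Illustrative data only: a small public dataset bundled with scikit-learn.
X, y = load_iris(return_X_y=True)
scores = compare_models(X, y)

# Print the candidates from best to worst mean accuracy.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Because every scikit-learn estimator exposes the same fit-and-score interface, trying another model is a one-line change to the `candidates` dictionary.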

So is Python the great enabler that will truly revolutionise data science practices? It’s certainly got a role to play in opening up the discipline, as this recent story from the US suggests: Berkeley, the well-known university in California, is now requiring all its undergraduates (regardless of their majors) to take the Foundations of Data Science course, which is taught in Python[1]. Not only is this a testament to how fundamental programmatic and statistical thinking is for the next generation of skilled workers, it also places Python firmly at the pinnacle when it comes to data science.


Senior Analyst
