Standard-compliant data projects

A tale about applying standards to data projects, and whether it’s necessary.

Published in Data & Smart Services · 4 min read · Nov 20, 2020

Even before the current hype about data and data science, the industry was already dealing with data-related projects, and the need for standardization grew. About 23 years ago, a first attempt at a standard was defined. Data projects can get complex, so it is necessary to establish a process for tackling them. Let’s take a look at some common standards and how they could help to structure your data science projects.

CRISP-DM

In 1996, a consortium of five companies introduced a process model for data mining projects. Back then, no one talked about data science. The model is called CRISP-DM, short for CRoss-Industry Standard Process for Data Mining. It splits a project into six phases, and the idea is to iterate over them.

The inner process is sequential. Nevertheless, it is often necessary to switch back and forth between different phases. Let’s take a look at the six steps.

Business Understanding
At the beginning you define goals and requirements. What do you want to achieve with your project?
Data Understanding
Collect and understand the existing data. In this phase you can identify problems with your data or its quality.
Data Preparation
A self-explanatory phase: prepare and clean your existing data for the models and goals of your project.
Modelling
Create your models and optimize their parameters. Usually, more than one model is created in this step.
Evaluation
In this step you evaluate which model fits your current goals and requirements best. Check the result against your initial goal to be sure the requirements are met.
Deployment
In this final step, you "deploy" your results. This could mean giving a presentation or delivering a system that uses your model. It depends on your goals.
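The Modelling and Evaluation steps can be illustrated with a small sketch: fit more than one candidate model, score each on held-out data, and check the best one against the requirement set during Business Understanding. Everything here is hypothetical and chosen for illustration only: the toy data, the two candidate models, the function names, and the `max_error` requirement.

```python
from statistics import mean

# Hypothetical toy data: (input, observed value) pairs.
train = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]
holdout = [(5, 10.1), (6, 11.9)]


def fit_mean_model(data):
    """Candidate 1: always predict the mean of the training targets."""
    avg = mean(y for _, y in data)
    return lambda x: avg


def fit_linear_model(data):
    """Candidate 2: least-squares line through the training data."""
    xs = [x for x, _ in data]
    ys = [y for _, y in data]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in data) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return lambda x: slope * x + intercept


def mean_absolute_error(model, data):
    """Average absolute difference between prediction and observation."""
    return mean(abs(model(x) - y) for x, y in data)


def select_model(candidates, train, holdout, max_error):
    """Modelling: fit every candidate. Evaluation: score on hold-out data
    and report whether the best candidate meets the requirement."""
    scores = {
        name: mean_absolute_error(fit(train), holdout)
        for name, fit in candidates.items()
    }
    best = min(scores, key=scores.get)
    return best, scores[best], scores[best] <= max_error


candidates = {"mean": fit_mean_model, "linear": fit_linear_model}
best, error, meets_requirement = select_model(candidates, train, holdout, max_error=0.5)
print(best, round(error, 2), meets_requirement)
```

If no candidate meets the requirement, that is a signal to go back to an earlier phase rather than to deploy.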

All phases should start over once your first model is deployed: data changes, goals are adjusted, and so on. This is what the circle around the representation of the model tries to visualize. You can also see that some phases feed back into others, e.g. Business Understanding ↔ Data Understanding. This means you should revalidate your goals if you find out that your data cannot help to fulfil them. Keep in mind that iterative development in the form of Scrum was first presented by Ken Schwaber in 1995, and the first book came out around 2001. CRISP-DM is surprisingly close to Scrum, but at its core it is still sequential and V-Model-like; only the process as a whole iterates. That put IBM on the scene, 19(!) years later.
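One way to make the sequence and its feedback loops concrete is to write the phases down as a small transition table. This is only a sketch: the Business Understanding ↔ Data Understanding loop and the restart after Deployment come from the description above, while the other backward arrows (Modelling back to Data Preparation, Evaluation back to Business Understanding) are assumptions taken from the commonly shown CRISP-DM diagram. The names `Phase`, `ALLOWED` and `is_valid_path` are made up for this example.

```python
from enum import Enum


class Phase(Enum):
    BUSINESS_UNDERSTANDING = "Business Understanding"
    DATA_UNDERSTANDING = "Data Understanding"
    DATA_PREPARATION = "Data Preparation"
    MODELLING = "Modelling"
    EVALUATION = "Evaluation"
    DEPLOYMENT = "Deployment"


# Which phases may follow which (assumed from the usual CRISP-DM diagram).
ALLOWED = {
    Phase.BUSINESS_UNDERSTANDING: {Phase.DATA_UNDERSTANDING},
    Phase.DATA_UNDERSTANDING: {Phase.BUSINESS_UNDERSTANDING, Phase.DATA_PREPARATION},
    Phase.DATA_PREPARATION: {Phase.MODELLING},
    Phase.MODELLING: {Phase.DATA_PREPARATION, Phase.EVALUATION},
    Phase.EVALUATION: {Phase.BUSINESS_UNDERSTANDING, Phase.DEPLOYMENT},
    # The outer circle: after deployment the whole process starts over.
    Phase.DEPLOYMENT: {Phase.BUSINESS_UNDERSTANDING},
}


def is_valid_path(path):
    """Return True if every consecutive pair of phases is an allowed transition."""
    return all(nxt in ALLOWED[cur] for cur, nxt in zip(path, path[1:]))


# A project that goes back once to re-check its goals against the data.
project = [
    Phase.BUSINESS_UNDERSTANDING,
    Phase.DATA_UNDERSTANDING,
    Phase.BUSINESS_UNDERSTANDING,
    Phase.DATA_UNDERSTANDING,
    Phase.DATA_PREPARATION,
    Phase.MODELLING,
    Phase.EVALUATION,
    Phase.DEPLOYMENT,
    Phase.BUSINESS_UNDERSTANDING,
]
print(is_valid_path(project))
```

Jumping straight from Business Understanding to Deployment would be rejected by this table, which is the sequential core described above.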

IBM revised the process

In 2015, IBM published its idea for a standard data mining and predictive analytics process: ASUM-DM, the Analytics Solutions Unified Method for Data Mining/Predictive Analytics.

Analytics Solutions Unified Method (ASUM) Process Model.

IBM proposed a process adapted to modern requirements. It has five phases supported by a continuous project management stream. The phases are not strictly chronological as in CRISP-DM. They can be run through several times, and you can go back to earlier phases, depending on your application. The process is based on CRISP-DM, extended with tasks and activities on infrastructure, operations, project, and deployment, and it adds templates and guidelines to all the tasks.

Analyze
As in CRISP-DM, you define your goals and requirements first.
Design
Define the components, development environments and resources needed to complete the task.
Configure & Build
The needed components are gradually implemented and tested. In this step you develop models and test them.
Deploy
Integrate the developed components into your final environment.
Operate and Optimize
Continuous optimization is important and can lead to new requirements.

Like Scrum, ASUM is more of a framework than a fixed process or standard.

IBM has another idea of how to define a process for data mining and modelling.

IBM DataFirst Method

Based on the IBM Cloud Garage, the DataFirst Method targets IT transformation to get infrastructure, processes, and employees ready for AI.

Again, it is a five-step process with self-explanatory phases that relies heavily on the Cloud Garage Method.

*IBM DataFirst Method Process Model. Source: IBM Corporation*

Which method should I use?

This is a question no one can answer for you; it really depends on your project. In my opinion, you should use what your project needs. As in Scrum, there are parts of the framework you should not dismiss, e.g. defining your goals and analyzing your data. The key to a successful data science project is having good and reliable data, and most of the time will be spent on analyzing and exploring it. You should definitely iterate, and don’t hesitate to go back in your process when needed. Revalidate often: business goals or data may change, and a better model could help. This is why a data-based project is never finished. You always have to revalidate models, requirements and data sources.
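The "revalidate often" advice can be sketched as a small recurring check that tells you how far back in the process to go. This is only one possible mapping, using the CRISP-DM phase names as vocabulary; the `Revalidation` fields, the `next_step` function and the error threshold are hypothetical and would need to be adapted to your project.

```python
from dataclasses import dataclass


@dataclass
class Revalidation:
    """Result of a periodic review of a deployed model (hypothetical fields)."""
    goals_changed: bool          # did the business goals or requirements change?
    data_sources_changed: bool   # did the data sources or their quality change?
    current_error: float         # error of the deployed model on recent data
    accepted_error: float        # error the business is willing to accept


def next_step(check: Revalidation) -> str:
    """Decide where to re-enter the process, earliest affected phase first."""
    if check.goals_changed:
        return "Business Understanding"
    if check.data_sources_changed:
        return "Data Understanding"
    if check.current_error > check.accepted_error:
        return "Modelling"
    return "keep current model"


# Goals and sources are unchanged, but the model got worse on recent data.
review = Revalidation(
    goals_changed=False,
    data_sources_changed=False,
    current_error=0.31,
    accepted_error=0.25,
)
print(next_step(review))
```

The ordering of the checks reflects the idea that a changed goal invalidates more of the earlier work than a degraded model does.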

It’s always good to know which methods and standards are available, but applying them to every project without adaptation won’t work. These lessons were already learned in the V-Model era. Always keep in mind: each project is unique and has never been seen before.