Data Science is an interdisciplinary field of study whose ultimate goal is to extract useful information and knowledge from structured or unstructured data. Seen from a business perspective, data science makes it possible to translate a business problem into a research and analysis project and then transform it, again with the help of data, into a practical solution. Today Data Science, while maintaining some peculiarities of its original definition (i.e., the transversality of scientific disciplines: computer science, mathematics, and statistics), is understood as a multidisciplinary science that requires skills closer to the business world, linked to the ability to read, interpret, understand, and capitalize on data to extract practical value.
And it is precisely from this definition that the phases of a typical Data Science process derive, phases that concern a process of data analysis and interpretation that must be seen more as iterative than linear, subject to continuous verification. In any case, the cyclical aspect of the process does not prevent the identification of the fundamental steps. To date, there is no single and universally shared framework to define Data Science processes.
Discovery: Identify And Understand The Problem (Identify Objectives)
Data Science aims to translate a problem into an analysis project and then into a practical solution. The first important step is to identify and understand the problem. Before solving a problem, it is essential to define precisely what it is, which means being able to “translate” the questions about the data into something usable. Simplifying and generalizing a bit, Data Science is useful for answering five basic types of questions:
- How much or how many? (regression)
- Which category? (classification)
- Which group? (clustering)
- Is it strange? (anomaly detection)
- Which option should be taken? (recommendation)
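As a purely illustrative sketch (the names below are hypothetical, not from any standard library), the mapping between these five question types and the families of techniques usually applied to them could be encoded as:

```python
# Hypothetical lookup table pairing each basic question type with the
# family of techniques typically used to answer it.
QUESTION_TO_TASK = {
    "how much or how many": "regression",
    "which category": "classification",
    "which group": "clustering",
    "is it strange": "anomaly detection",
    "which option should be taken": "recommendation",
}

def task_for(question: str) -> str:
    """Return the analysis task usually associated with a question type."""
    return QUESTION_TO_TASK[question.strip().lower().rstrip("?")]
```

For example, `task_for("Which category?")` returns `"classification"`.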
Looking at these questions from a business perspective, identifying and understanding the problem means asking the right questions of interlocutors (business people) who often provide ambiguous and subjective inputs. It takes skill – and intuition – to transform those inputs into information and to ask the right questions: those whose outputs will feed the other phases of the Data Science process. For example, business questions such as:
- Who are the best customers?
- Why are they buying “that” product?
- How to predict if a customer will buy another type of product?
- Why haven’t those customers been buying in a long time?
Questions like these can help to “find” the problem (some customers buy less than expected or behave differently), from which to start a series of analyses to support certain decisions (continue to invest in a product or change the offer). It is essential that, at the end of this phase, all the elements are in place to define the specific business and company context and to have the problem well delineated, so that the analysis project – even before thinking about the data – can be planned to give a concrete answer to a clear business need (including needs of an organizational nature). This phase is critical because it establishes a clear objective for what you want to do with the data.
Data Selection / Collect The Correct Data: Recover The “Raw Data” Necessary For The Identified Problem
Once you have defined the problem and clarified the goal, you need to collect the data and find the datasets required to solve it. Data Selection, in simple terms, is collecting data from different sources. This phase of the process requires some attention because it involves both thinking a priori about which data will be needed and the practical “recovery” of data from a plurality of sources (internal to the company and external datasets).
The data can be structured (for example, from databases and internal applications of the company, such as a CRM or an industrial application, for example for production or warehouse management) or unstructured (texts, images, videos coming from emails, documents, collaboration platforms, but also from external sources such as social networks, open document repositories, web pages, etc.). It may also happen that companies do not have a reasonable amount of data available when the project starts.
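A minimal sketch of what this collection step can look like in practice, assuming pandas and a CRM export inlined as CSV for self-containment (the column names and the free-text notes are invented for illustration; in a real project the structured data would come from a database or internal application):

```python
import io
import pandas as pd

# Hypothetical structured source: a CRM export, inlined here as CSV.
crm_csv = io.StringIO(
    "customer_id,last_purchase,total_spent\n"
    "1,2021-03-10,250.0\n"
    "2,2020-11-02,90.5\n"
)
structured = pd.read_csv(crm_csv, parse_dates=["last_purchase"])

# Hypothetical unstructured source: free-text notes (e.g. from emails
# or support tickets) keyed by the same customer id.
unstructured = {
    1: "Asked about the premium plan; very satisfied.",
    2: "Complained about shipping times.",
}
```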
Data Cleaning / Data Preparation: Process Data For Analysis
The Data Cleaning phase consists of manipulating and preprocessing the raw data coming from various sources and in different formats, to clean, arrange, and harmonize them and transform them into data that the analysis tools can then use. Data preprocessing is perhaps the most time-consuming phase and involves procedures for preparing data such as profiling, cleaning, validating, and data transformation (often with ETL – Extract, Transform, Load – and Data Quality technologies). Working on “data cleansing” means:
- Eliminating “noise” and other disturbances in the data;
- Organizing and preparing the data to be analyzed;
- Developing strategies to manage missing data and data that vary over time.
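A minimal cleaning sketch in pandas, on invented raw data, touching each of the points above: duplicates and a sentinel value are treated as noise, and a missing amount is filled with the column median (one possible strategy among many):

```python
import numpy as np
import pandas as pd

# Invented raw data: a duplicated row, a -999 sentinel meaning
# "unknown", and a genuinely missing value.
raw = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "amount": [100.0, 100.0, -999.0, np.nan],
})

clean = raw.drop_duplicates().copy()                     # remove duplicates
clean["amount"] = clean["amount"].replace(-999.0, np.nan)  # sentinel -> missing
clean["amount"] = clean["amount"].fillna(clean["amount"].median())  # impute
```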
An integral part of Data Preparation is also so-called Data Enrichment, that is, the set of processes necessary to integrate and improve the raw data and information in databases (the process is essential because it makes it possible to compare data from different sources and to unify and integrate them, yielding complete and accurate datasets).
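In practice, Data Enrichment often boils down to joining internal records with an external dataset on a common key. A minimal sketch with pandas (both tables are invented for illustration):

```python
import pandas as pd

# Internal sales records, enriched with a hypothetical external
# demographics dataset via a shared customer_id key.
sales = pd.DataFrame({"customer_id": [1, 2], "total_spent": [250.0, 90.5]})
demographics = pd.DataFrame({"customer_id": [1, 2], "region": ["North", "South"]})

enriched = sales.merge(demographics, on="customer_id", how="left")
```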
Although it is a challenging phase, it must be carried out with the utmost care, because transferring incongruent, inconsistent, missing, or poor-quality data to the next phase would compromise the analysis. A crucial element of data preparation is regulatory compliance: it is essential to ensure from the early stages of the project that the data used (and how it is used and analyzed) comply with the laws on privacy and data protection.
Data Exploration / Data Transformation: Explore And Transform Data
Once the data has been obtained, the process continues with the Data Exploration phase, an initial “exploratory analysis”: statistical tests are carried out, the first analyses are made, and the first Data Visualization techniques are tried out. In this phase, the Data Scientists identify and prepare what is then necessary to experiment with the analytical models, understand their performance for the problem to be solved and the data available, and, above all, identify any “biases in the data.” It is perhaps from here that the iterative, non-linear nature of the process begins to show.
In the Data Exploration phase, errors in the data – or, in any case, needs for intervention – may emerge that “bring” the teams back to the previous phase of cleaning, preparation, and data enrichment. Part of the Data Exploration phase – or, rather, strictly connected to it – is experimentation and modeling, which identifies and builds the analysis model for solving the specific problem identified in the very first phase of the entire Data Science process. In this phase, the Data Visualization systems are also tested to verify that the information generated by the analysis models is accessible, usable, and understandable to the business people involved in the decision-making process.
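An exploratory pass of this kind can be as simple as summary statistics, a correlation check, and a look at the class balance; a heavily skewed target, as in the invented data below, is exactly the kind of “bias in the data” this phase should surface:

```python
import pandas as pd

# Invented customer data: age, spend, and whether the customer churned.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "spend": [120.0, 180.0, 300.0, 350.0, 220.0, 150.0],
    "churned": [0, 0, 0, 0, 0, 1],
})

summary = df.describe()               # distribution of each column
corr = df["age"].corr(df["spend"])    # linear correlation: age vs spend
class_balance = df["churned"].mean()  # fraction of positive examples
```

Here the strong age/spend correlation suggests a pattern worth modeling, while the 1-in-6 churn rate flags a class imbalance that later modeling must account for.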
Data Mining: Advanced Analytics And Machine Learning
At this point in the process, algorithms are used to analyze the data, discover hidden patterns, or extract interesting knowledge from the data. The “typical” operations of this phase are: identification of parameters, processing, modeling, and model evaluation. It is here, in fact, that we define how to extract practical value from large volumes of information, choosing the algorithms and “training” methods used to search for patterns in the data (for example, with machine learning), as well as the form of representation in which the extracted information will be presented.
After testing the first models, Data Scientists could identify others to do more in-depth analysis (for example, to discover trends in data that were not distinguishable in initial graphs and statistics) or to “build predictions” (for example, by analyzing past data and finding the characteristics that have influenced past trends to build models for so-called predictive analytics). An essential part of this phase of the Data Science process is also to provide business people with all the necessary elements (both quantitative and qualitative) to access information and knowledge that are genuinely relevant to the identified problem and to the possible applicable solution and, therefore, effective for the business decision.
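As a minimal sketch of predictive analytics in this sense, past data can be fitted with a simple linear model and used to forecast the next period. Real projects would use a library such as scikit-learn with a proper train/test split; the numbers here are invented:

```python
import numpy as np

# Invented history: months of activity vs. customer spend.
months = np.array([1, 2, 3, 4, 5, 6], dtype=float)
spend = np.array([100, 210, 290, 410, 500, 610], dtype=float)

# Least-squares linear fit, then a forecast for month 7.
slope, intercept = np.polyfit(months, spend, deg=1)
predicted = slope * 7 + intercept
```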
Data Visualization: Communicate And Visualize The Results
As mentioned, Data Visualization comes into play several times during the various phases of the typical Data Science process. Although it finds ample space in phase six, it is good to place it also in the previous phases, especially Data Exploration. The “final” phase of the process concerns the communication of the results of the analyses, understood not so much as the return of information by the Data Scientists to business people, but rather as the visualization of those results through analysis systems that are readily available and easy for business users to use.
Here the Data Visualization and Data Storytelling systems come into play: advanced data analysis systems that allow you to “read,” among hundreds of thousands of data points (of different formats and structures, coming from diversified sources, such as Big Data), the information, correlations, and patterns – that is, to find the “story” hidden in these data, which can only come “to the surface” through advanced analysis and which can become usable by business people without specific technical skills, thanks to Data Storytelling and information visualization.
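As a toy, dependency-free illustration of why visualization matters, the same figures become far easier to “read” once rendered as bars (real dashboards would use tools such as matplotlib, Tableau, or Power BI; the helper and data below are invented):

```python
def ascii_bars(data: dict, width: int = 30) -> str:
    """Render a label -> value mapping as horizontal text bars."""
    peak = max(data.values())
    lines = []
    for label, value in data.items():
        bar = "#" * round(width * value / peak)   # bar length scaled to peak
        lines.append(f"{label:>8} | {bar} {value:g}")
    return "\n".join(lines)

# Invented regional sales figures rendered as a text chart.
chart = ascii_bars({"North": 350.0, "South": 90.5, "West": 210.0})
print(chart)
```

Even in this crude form, the dominance of the North region is visible at a glance, which is the whole point of the visualization step.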