February 4, 2025
In the early 2000s, when I was considering pursuing higher education in India, the field of engineering was abuzz with the emergence of a new course called “IT”. At that time, we didn’t have much information on what the Information Technology (IT) course offered and how it differed from the existing Computer Science course. Fast forward to today, and technology and education have transformed significantly. The buzzwords have shifted to Data, Data Science, and Artificial Intelligence.
In this blog, I aim to provide an outline of what data science encompasses and offer an overview of the lifecycle of a data science project to assist those considering a career switch to data science or pursuing an educational course in this dynamic field.
At its core, data science is about analyzing data and extracting valuable insights for a business; that is what a data scientist primarily does.
When we say data, what type of data are we referring to? Where is the data coming from? Who is generating it? And how can we make it useful for business?
The sections below aim to answer these questions.
Data is a broad term covering a wide variety of forms. Beyond text and images, useful data can also be found in audio, video, databases, emails, blogs, server logs, graphs, network diagrams, and live feeds. This data can be generated from a variety of sources, and almost anybody could be generating it.
Customers of e-commerce websites and banking services generate data that can be used to record and analyze sales and purchases; IoT devices such as smartwatches and fitness trackers produce data for analyzing activity and energy consumption; and government databases offering census and economic data support research and development. Data in the form of images, videos, and audio is used in applications such as image classification and recognition, video recommendation, and more.
These are just a few examples of applications and the data associated with them.
Though the steps overlap somewhat, we can broadly break a data science project into the following stages:
Understanding the business and the challenges it faces is the first step in collecting data: it tells us what type of data can help solve the problem or analyze current business patterns.
For example, investigating why sales at a particular store are declining could require data such as historical sales figures, product-wise sales, customer information, and promotional, operational, and location data.
Most of the store-related and sales data can be collected efficiently from the software and hardware systems the company already uses: customer service platforms capture customer interactions, issues, and resolution times, while inventory management systems provide data on stock levels. Open datasets are also available from websites like Kaggle and Google Dataset Search.
This data is often made available in tabular form in different formats, such as Excel, CSV, JSON, SQL, and HTML files.
We can import these diverse formats into a Python environment for machine learning and analysis using robust libraries tailored to each one: pandas for tabular files, SQLAlchemy for SQL databases, and BeautifulSoup for HTML, as in the sketch below.
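As a minimal sketch, here is how pandas (with SQLAlchemy for the database case) can load each of these formats; the file names used below are hypothetical placeholders:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical file names; each pandas reader returns a DataFrame.
df_csv = pd.read_csv("sales.csv")        # comma-separated values
df_xlsx = pd.read_excel("sales.xlsx")    # Excel workbook (needs openpyxl)
df_json = pd.read_json("sales.json")     # JSON records

# read_html returns a list with one DataFrame per table on the page.
tables = pd.read_html("https://example.com/report.html")

# For SQL sources, pandas can query through a SQLAlchemy engine.
engine = create_engine("sqlite:///sales.db")
df_sql = pd.read_sql("SELECT * FROM transactions", engine)
```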
Data cleaning and preparation is one of the most important steps in machine learning. It involves identifying and handling missing, duplicate, and irrelevant data, which can otherwise negatively affect the model's performance.
Once the data is loaded, methods like df.info(), df.shape, and df.describe() give a quick overview of the dataset.
Next, we need to check the data for problems such as missing values: df.isnull().sum() shows the count of missing entries in each column. Missing values in numerical columns can be addressed with various strategies: if a column's values follow a roughly normal distribution, we can fill the gaps with the mean; for a skewed distribution, the median is a better option; and for categorical columns, the mode is the usual choice.
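As a rough sketch of these strategies on a tiny hypothetical DataFrame (in practice, df would be the dataset loaded earlier):

```python
import numpy as np
import pandas as pd

# Tiny hypothetical DataFrame illustrating each imputation strategy.
df = pd.DataFrame({
    "price": [10.0, 12.0, np.nan, 11.0],        # roughly symmetric -> mean
    "income": [30_000, 32_000, 500_000, None],  # skewed -> median
    "category": ["a", "b", None, "a"],          # categorical -> mode
})
print(df.isnull().sum())  # missing values per column

df["price"] = df["price"].fillna(df["price"].mean())
df["income"] = df["income"].fillna(df["income"].median())
df["category"] = df["category"].fillna(df["category"].mode()[0])
```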
Exploratory Data Analysis (EDA) is the process of visualizing datasets to summarize their main characteristics, often using graphics and other data visualization methods. EDA helps to uncover patterns, spot anomalies, test hypotheses, and check assumptions, providing a solid foundation for further data analysis and modeling.
Some common ways to perform EDA are plotting histograms and box plots to study distributions and spot outliers, and drawing correlation heatmaps to see how features relate to one another, as in the sketch below.
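Here is a minimal EDA sketch using matplotlib and seaborn, assuming df is the cleaned DataFrame from the previous step and price is a hypothetical numerical column:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Distribution of every numeric column at a glance.
df.hist(figsize=(10, 6))
plt.show()

# Box plot of a single column to spot outliers.
sns.boxplot(x=df["price"])
plt.show()

# Heatmap of pairwise correlations between numeric features.
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.show()
```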
After we have analyzed the data with EDA, it needs to be preprocessed before it can be fed into algorithms. Common steps include standardizing features to aid model convergence, selecting the most relevant features, and partitioning the data into training and testing subsets for robust model evaluation, as sketched below.
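A minimal preprocessing sketch with scikit-learn, assuming df holds the features and target is a hypothetical label column:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Separate the features from the hypothetical label column.
X = df.drop(columns=["target"])
y = df["target"]

# Hold out 20% of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit the scaler on training data only, to avoid leaking test information.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```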
Based on the problem type (classification or regression), we have to choose an appropriate machine learning model; one way to do this is sketched below.
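For instance, continuing from the training split above, here is one way to pick a scikit-learn model by problem type; random forests are just a reasonable default here, not the only choice:

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Hypothetical flag describing our problem type.
is_classification = True

# Pick a model family to match the problem, then fit on the training data.
model = (RandomForestClassifier(random_state=42) if is_classification
         else RandomForestRegressor(random_state=42))
model.fit(X_train, y_train)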
Model tuning, or hyperparameter tuning, involves adjusting a model's hyperparameters to achieve the best performance. GridSearchCV, RandomizedSearchCV, and Bayesian optimization are some common tuning methods.
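A minimal GridSearchCV sketch, continuing with the random-forest model above; the parameter grid is a hypothetical starting point, not a recommended configuration:

```python
from sklearn.model_selection import GridSearchCV

# Try a small, hypothetical grid of hyperparameter values with
# 5-fold cross-validation on the training data.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10, 30],
}
search = GridSearchCV(model, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print(search.best_params_, search.best_score_)
model = search.best_estimator_  # keep the best-performing model
```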
Before the model is deployed to production, it is tested on new or unseen data. This helps us evaluate how well it is likely to perform in a production environment.
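Continuing the sketch, the held-out test set from earlier can stand in for unseen data:

```python
from sklearn.metrics import classification_report

# Evaluate on data the model never saw during training or tuning.
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
```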
Model deployment integrates the trained model into a production environment where it can receive new data and make predictions in real time. For example, if we have built a model that classifies a particular variety of beans, we can deploy it as a Flask application so the business can access its predictions and gain insights, as sketched below.
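A minimal Flask deployment sketch; the model.pkl file name and the /predict endpoint are hypothetical choices for illustration:

```python
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the trained model saved earlier (hypothetical file name).
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body like {"features": [[5.1, 3.5, 1.4, 0.2]]}.
    features = request.get_json()["features"]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```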
The model should be continuously monitored for performance and retrained as necessary to maintain accuracy over time. The APIs developed during deployment also make it easy to integrate the model with other applications and systems.
Each step, from initial data acquisition and preprocessing to model training and evaluation, is critical for building accurate and reliable models. Successful model deployment and ongoing monitoring ensure the delivery of actionable intelligence and sustained impact in real-world scenarios.