In this section of the machine learning playbook, we look at the critical questions that arise when dealing with the complexities of data. In short, we tackle one question: what information do you need about the data right from the beginning of your data science project? In case you missed the first article of the series, which focused on the project brief, you can find it here.

The Data Puzzle

Before immersing yourself in planning the methodology or implementation of your project, it is crucial to address a fundamental aspect: the data. Seasoned data scientists with hands-on experience in real-world solutions and projects will tell you that it is not the algorithm but the data that makes all the difference. The pivotal role of data in developing solutions and executing client projects cannot be emphasized enough.

Hence, it is very important to ask the right questions in order to connect the dots and proactively anticipate potential data challenges inherent in the field of data science.

There are several forms of data out there and each has its own set of nuances. In this series, we explore the questions associated with common data types, namely images, text and structured tables. While there are also other forms of data, such as audio and geo-spatial data, we are not going to cover those within the scope of this series.

Let’s begin with some general concerns that you might have:

  • Type of data (text, image, audio)
  • Volume of the data (number of rows / columns)
  • Source of the data (how was it created, and how is it stored)
  • Known issues or challenges associated with the data
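
As a rough illustration of the volume and known-issues checks above, here is a minimal sketch that counts rows, columns and missing values in a small CSV. The sample data, the column names and the `profile_csv` helper are all hypothetical; a real project would read the client's actual export instead.

```python
import csv
import io

# Hypothetical sample; in a real project this would be the client's export.
SAMPLE_CSV = """campaign_id,channel,clicks,spend
c1,google_ads,120,50.0
c2,facebook,,30.5
c3,,75,
"""

def profile_csv(text):
    """Summarise volume and missing values for a CSV given as a string."""
    reader = csv.DictReader(io.StringIO(text))
    columns = reader.fieldnames or []
    missing = {col: 0 for col in columns}
    n_rows = 0
    for row in reader:
        n_rows += 1
        for col in columns:
            # Treat empty or whitespace-only cells as missing.
            if not (row[col] or "").strip():
                missing[col] += 1
    return {"rows": n_rows, "columns": len(columns), "missing": missing}

profile = profile_csv(SAMPLE_CSV)
print(profile)
```

A summary like this will not answer where the data came from or how it is stored, but it gives you concrete numbers to bring into the first conversation with the client.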

Example

In marketing, let’s say the client is running campaigns and has tasked you with identifying the channel that drives customer acquisition in order to optimize their advertising budget. To help illustrate the process, let’s simulate a conversation to gain a better understanding of the scenario:

How frequently are campaigns run?

How many projects or products are the campaigns run for?

Which channels are used for the campaigns?

Data Science professional

We run 10-15 campaigns every week for 3 products and have been doing so for the past 2 years.

The channels we use include Google Ads, Bing Ads, Facebook marketing and Instagram influencer marketing.

Client

Let’s take a second to dissect the details shared by the client to decide our next line of questioning. It is clear that there are multiple sources of data due to the variety of channels being used. This also suggests that they might not have a system to standardize the data and store it in a consolidated manner. Considering the volume of data generated by 10 to 15 campaigns a week over a span of 2 years, it becomes crucial to ascertain what specific data points are being captured.
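
To make the standardization concern concrete, here is a minimal sketch of mapping records from two channels onto one common schema. The field names in `FIELD_MAPS`, the sample records and the `standardise` helper are assumptions made for illustration; they are not the actual export formats of any ad platform.

```python
# Hypothetical per-channel field names; real exports will differ.
FIELD_MAPS = {
    "google_ads": {"Campaign": "campaign", "Clicks": "clicks", "Cost": "spend"},
    "facebook": {
        "campaign_name": "campaign",
        "link_clicks": "clicks",
        "amount_spent": "spend",
    },
}

def standardise(channel, record):
    """Rename a channel-specific record to the common schema and tag its source."""
    mapping = FIELD_MAPS[channel]
    row = {common: record[original] for original, common in mapping.items()}
    # Convert text values to numbers so channels can be compared directly.
    row["clicks"] = int(row["clicks"])
    row["spend"] = float(row["spend"])
    row["channel"] = channel
    return row

raw = [
    ("google_ads", {"Campaign": "spring_sale", "Clicks": "120", "Cost": "50.0"}),
    ("facebook", {"campaign_name": "spring_sale", "link_clicks": "80", "amount_spent": "30.5"}),
]
consolidated = [standardise(channel, record) for channel, record in raw]
print(consolidated)
```

If the client has nothing like this in place, expect to spend early project time building it before any modelling can start.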

What specific data points are being captured?

Where do you store this data?

Up until now, how have you been determining the allocation of funds for different channels?

Data Science professional

We receive orders and invoices through automated software, while another system is responsible for inventory management.

All this data is currently pushed into a SQL database.

Our current approach involves evaluating the performance of these campaigns on a weekly basis, measured by the number of clicks on the buy button as reported by the different channels.

Client

How do you establish a connection between the orders and the purchases reported by the channels?

Are there any known issues with the data?

How effective has your current strategy been in achieving the desired outcomes?

Data Science professional

We are only able to connect it at an aggregate level and cannot pinpoint the exact source for each customer.

To elaborate on the previous point, suppose a potential customer views an ad on Google and clicks on it, but then goes elsewhere. Later on, they encounter an ad on Facebook, click on it and go on to make a purchase. Both Google and Facebook will report it as a conversion.

Since we update our spending on a weekly basis using consolidated data, there is an opportunity for better optimization based on specific products or campaign types. However, this aspect remains unexplored, and we currently observe variations in our key performance indicators (KPIs) on a weekly basis.

Client

Okay, let’s analyze the insights we have gathered:

  • There are known issues with the data currently stored in the database.
  • Establishing connections between the data poses a challenge.
  • Discrepancies are observed in reporting across different channels.
  • The client has defined key performance indicators (KPIs) to measure campaign performance and recognizes the potential for improvement.
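
The reporting discrepancy from the conversation can be illustrated with a small sketch. It assumes a customer-level click log is available, which the client in this example does not currently have, and all of the data below is made up. The `last_click` function shows one common attribution convention; it is not the client's method, nor the only reasonable choice.

```python
from collections import Counter

# Hypothetical click log: (customer_id, channel, time as an ordered integer).
clicks = [
    ("cust_1", "google", 1),
    ("cust_1", "facebook", 2),
    ("cust_2", "google", 3),
]
# Hypothetical orders from the client's database: (customer_id, time).
orders = [("cust_1", 5), ("cust_2", 4)]

def channel_reported(clicks, orders):
    """Each channel claims an order if the customer clicked its ad before buying."""
    claims = Counter()
    for customer, order_time in orders:
        channels = {ch for c, ch, t in clicks if c == customer and t < order_time}
        claims.update(channels)
    return claims

def last_click(clicks, orders):
    """Credit each order to the channel of the most recent click before it."""
    credit = Counter()
    for customer, order_time in orders:
        prior = [(t, ch) for c, ch, t in clicks if c == customer and t < order_time]
        if prior:
            credit[max(prior)[1]] += 1
    return credit

reported = channel_reported(clicks, orders)
attributed = last_click(clicks, orders)
print(sum(reported.values()), "conversions reported for", len(orders), "orders")
print(dict(attributed))
```

Here the channels together report three conversions for two real orders, which is exactly the kind of gap that aggregate-level reporting hides.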

This serves as an excellent initial step in comprehending the intricacies of the data puzzle.

In the next article of this series, we will take a look at the nuances of dealing with three common data types: images, text, and structured tables. Stay tuned for more actionable insights and guidance in our next instalment.

Key takeaways

In this post, you have discovered essential elements that form the bedrock of understanding data in any type of data science project. These insights serve as a solid foundation for exploring further factors that are specific to your unique project requirements. By grasping these fundamental aspects, you are better equipped to navigate the intricacies of data and make informed decisions to drive the success of your data science endeavours.

  • Data plays a pivotal role in developing solutions and executing client projects in the field of data science. Asking the right questions is crucial to anticipate potential data challenges and connect the dots effectively.
  • Different forms of data, such as images, text, and structured tables, have their own nuances and require specific considerations.
  • General concerns about data include its type, volume, source, and known issues or challenges.
  • Known issues with data quality and variations in key performance indicators highlight the need for improvement.

About Us

Data Science Discovery is a step on the path of your data science journey. Please follow us on LinkedIn to stay updated.

About the writers:

  • Ujjayant Sinha: Data scientist with professional experience in market research and machine learning in the pharma domain across computer vision and natural language processing.
  • Ankit Gadi: A knack and passion for data science, coupled with a strong foundation in Operations Research and Statistics, have helped me embark on my data science journey.
