After finalising the planning and preparation aspects of a machine learning project, the next step is to dive into the data itself. In the previous post, we talked about the intricacies of assessing data requirements and explored strategies for assessing data quality. As an extension, in this article, we will delve into the considerations we should keep in mind when working with three common data types: images, text, and structured tables. Although there are additional data types like audio and geospatial data, they will not be covered in this article.
To begin with, regardless of the data type we are dealing with, be it images, text, or a combination of numeric and categorical columns, we need to start with some general guidelines:
- Does the dataset have the information that is needed to tackle the business outcome of the project? For example, for a ticket classification system, where user complaints are routed to the right support group, the training data must have columns containing the textual description of the issue and the corresponding department it was sent to.
- Are there any annotation errors or inconsistencies present in the dataset? This is particularly relevant in supervised tasks like image and text classification. If annotators make mistakes, such as incorrectly drawing bounding boxes around objects in images or following different conventions for different images, it can significantly impede the model’s ability to learn the correct behaviour. A machine learning model is akin to a child that requires nurturing. To accomplish this, we need to provide it with clear and consistent examples of what is considered right and wrong.
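One inexpensive way to surface annotation inconsistencies in a text dataset is to look for identical inputs that received different labels. The snippet below is a minimal sketch of that idea for the ticket routing example; the function name `find_conflicting_labels` and the sample tickets are hypothetical and only serve as illustration.

```python
from collections import defaultdict


def find_conflicting_labels(records):
    """Return {normalised text: sorted labels} for texts that received more than one label.

    `records` is an iterable of (text, label) pairs. Texts are compared after
    lowercasing and collapsing whitespace, which is an assumption of this sketch.
    """
    labels_by_text = defaultdict(set)
    for text, label in records:
        key = " ".join(text.lower().split())
        labels_by_text[key].add(label)
    return {
        text: sorted(labels)
        for text, labels in labels_by_text.items()
        if len(labels) > 1
    }


# Hypothetical support tickets: (description, department it was routed to)
tickets = [
    ("Cannot reset my password", "IT support"),
    ("cannot reset  my password", "Billing"),
    ("Invoice shows the wrong amount", "Billing"),
]

conflicts = find_conflicting_labels(tickets)
print(conflicts)
```

Exact-match comparison only catches the most obvious conflicts; near-duplicate descriptions with different labels would need a fuzzier comparison and a manual review.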
Structured data/relational tables
Now, let us start off with structured data, usually stored in relational tables. We could divide the inspection into categories of sufficiency, missing values, exploration and outliers, and ask the following questions:
- Are the columns present in line with the target of prediction? In case of unsupervised tasks, are the given columns sufficient to differentiate data points properly into bins?
- Is there any class imbalance in the target variable, in case of a categorical type?
- In case of a numeric target, are the values clustered around a specific range? What do the tails of the distribution look like?
- Is the dataset skewed toward a particular demographic? If yes, then the model will not generalise well and will only be useful for the represented demographic.
- Missing values:
- Which columns have a high proportion of missing values? A very high percentage often makes a column unsuitable for training.
- Is there a reason why they are missing? Sometimes, there may be data entry errors or the field may simply not be applicable for that particular input case.
- Are there rows which have garbage values? This is often observed with surveys, where sometimes, participants enter illogical responses to questions.
- For numeric columns, what is the distribution like? Specifically, we want to focus on the measures of central tendency and dispersion.
- For categorical variables, what are the counts of each bin?
- What is the Pearson correlation among the independent numeric variables, as well as with the target, in case of a regression task?
- What do cross tables tell us about the relationship between independent categorical variables, as well as with the target, be it for a regression or a classification task? Are those relationships statistically significant?
- Do the relations and trends we discover support our assumptions derived from understanding business subject matter?
- Should some numeric columns be converted to categorical columns through bins instead?
- Are the outliers present in the data due to entry errors or genuine?
For instance, in a medical history dataset, a blood sugar level of 100,000 mg/dL is evidently an error made by the clerk. However, a value of 200 mg/dL, although relatively high and potentially appearing as an outlier, is a plausible example.
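A few of the checks above (missing values, class balance and outliers) can be sketched with nothing more than the Python standard library. The helper names, the sample rows and the 1.5 × IQR rule are assumptions made for illustration, not a prescribed method; in practice a dataframe library would usually do the same work.

```python
from collections import Counter
from statistics import quantiles


def missing_fraction(rows, columns):
    """Fraction of rows in which each column is None (assumes `rows` is non-empty)."""
    total = len(rows)
    return {col: sum(row.get(col) is None for row in rows) / total for col in columns}


def class_shares(rows, target):
    """Share of each class in a categorical target, ignoring missing targets."""
    counts = Counter(row[target] for row in rows if row.get(target) is not None)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def iqr_outliers(values, k=1.5):
    """Values lying more than k * IQR below the first or above the third quartile."""
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < low or v > high]


# Hypothetical medical history rows; None marks a missing value
rows = [
    {"age": 54, "blood_sugar": 95, "diabetic": "no"},
    {"age": None, "blood_sugar": 200, "diabetic": "yes"},
    {"age": 61, "blood_sugar": None, "diabetic": "no"},
    {"age": 47, "blood_sugar": 102, "diabetic": "no"},
]
print(missing_fraction(rows, ["age", "blood_sugar", "diabetic"]))
print(class_shares(rows, "diabetic"))

# Hypothetical blood sugar readings in mg/dL
readings = [85, 88, 90, 92, 95, 98, 100, 102, 105, 110, 200, 100000]
print(iqr_outliers(readings))
```

On these readings the rule flags both 200 and 100,000. It cannot tell a plausible high value from an entry error, so the final call still rests on domain knowledge.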
Images
Images can be obtained through various means such as cameras (both digital and analog), specialised instruments like x-ray and ultrasound machines, satellites, and more. These images serve as valuable inputs for computer vision applications, enabling tasks like object classification, segmentation, detection, and even the utilisation of generative adversarial models.
Despite the variety of sources, there are some common considerations, anchored around the goals of the assignment:
- Are the given images suitable for the intended task?
- Is the image quality sufficient for humans to identify its contents accurately?
- Have the images been annotated correctly? Incorrect annotations can lead to poor model performance. For example, mislabelling images by drawing bounding boxes around objects or animals instead of human faces will result in an inaccurate face detection system.
- Is the annotation consistent across the dataset? For instance, in a face detection task, one annotator may draw separate bounding boxes for a couple in a photo, while another annotator may use a single box encompassing both individuals. Although both cases may be valid to humans, they can confuse the model’s understanding of correct detection.
- Are the images consistent in terms of pose, lighting, and alignment? Unless the model needs to be trained on a wide variety of such factors, it is beneficial to strive for uniformity.
- Are there sufficient samples representing different demographics or slices of data? Without diverse representation, the model may struggle to generalise to different types of inputs.
- Will the model need to infer on images of varying degrees of quality, such as those sourced from the web, mobile phones and even DSLR cameras? If so, it is advisable to train the model on a dataset created from a mix of these sources.
- Have the images been sized to the same dimensions?
- Are the images in colour or grayscale? It is preferable to have consistency in either colour or grayscale throughout the dataset.
- Do the images have sufficient resolution and detail necessary for the downstream task?
- Is the region of interest within the image obstructed in any way? For instance, models trained to predict a person’s age based on facial input will yield inaccurate results if the forehead or cheeks, areas that exhibit visible signs of ageing, are obstructed by hair or accessories.
- Is it possible to simplify the images in order to isolate the region of interest for the prediction task? For instance, it could be beneficial to mask out the background or crop a person’s face specifically for age prediction tasks.
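Annotation consistency for bounding boxes can be quantified when two annotators have labelled the same image. A common measure is intersection over union (IoU): the area shared by two boxes divided by the area they cover together. The sketch below assumes boxes are given as `(x_min, y_min, x_max, y_max)` tuples in pixels; the function names, the sample boxes and the 0.5 review threshold are hypothetical.

```python
def box_area(box):
    """Area of an (x_min, y_min, x_max, y_max) box; 0 if the box is empty or inverted."""
    x_min, y_min, x_max, y_max = box
    return max(0, x_max - x_min) * max(0, y_max - y_min)


def iou(box_a, box_b):
    """Intersection over union of two boxes, between 0.0 and 1.0."""
    intersection_box = (
        max(box_a[0], box_b[0]),
        max(box_a[1], box_b[1]),
        min(box_a[2], box_b[2]),
        min(box_a[3], box_b[3]),
    )
    intersection = box_area(intersection_box)
    union = box_area(box_a) + box_area(box_b) - intersection
    return intersection / union if union else 0.0


# Hypothetical photo of a couple: annotator 1 boxed one person,
# annotator 2 drew a single box around both people.
annotator_1_box = (10, 10, 50, 100)
annotator_2_box = (10, 10, 100, 100)

agreement = iou(annotator_1_box, annotator_2_box)
REVIEW_THRESHOLD = 0.5  # assumed cut-off; tune it for the task at hand
needs_review = agreement < REVIEW_THRESHOLD
print(round(agreement, 3), needs_review)
```

Images whose boxes agree poorly are good candidates for a second look and for tightening the annotation guidelines.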
Text
Text is undeniably one of the most prevalent forms of information, alongside images. Well-known applications involving text include classification, summarisation, language translation, and question answering. When working with textual data, there are several factors to consider, including:
- Does the text extract even have the necessary information to solve the problem at hand? For instance, to train a movie sentiment model, the reviews must actually convey some positive, negative or neutral emotion towards a particular movie.
- Can a human expert perform the intended task manually? If not, the text likely contains too much noise for the model to learn anything useful.
- Has the annotation been done correctly?
- Has the annotation been done in a consistent fashion?
- What extent of processing and cleaning is needed for the task? This can include things like stemming, stop word removal, spelling correction etc.
- Is the task likely to benefit from tweaking the letter case?
- Can the text extract be shortened or isolated to focus on just where the relevant information is present? This can be helpful for classification and question answering tasks.
- Can some phrases be grouped under one single umbrella term for the sake of simplicity and uniformity? For instance, terms such as mother, father, brother, children, spouse and sister can be grouped together as first-degree relatives. Of course, the motivation behind this step is purely dependent on the intent.
- What kinds of entities are present in the data?
- Is the prediction task so domain-specific that it requires extensive fine-tuning of existing pre-trained models, or even training from scratch?
- When framing questions for question answering, are we using terms that are likely to appear in the text, or that are similar in meaning to what exists there? This enables the model to assign higher scores to the correct set of answers.
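Several of the cleaning steps listed above (letter case, stop word removal and grouping phrases under an umbrella term) can be sketched in a few lines. The stop word list here is a tiny illustrative one, and `STOP_WORDS`, `UMBRELLA_TERMS` and `normalise` are hypothetical names; a real project would rely on a fuller list and a proper tokeniser.

```python
import re

# Tiny illustrative stop word list (an assumption of this sketch)
STOP_WORDS = {"the", "a", "an", "and", "of", "is", "was", "has", "have"}

# Terms collapsed under a single umbrella label, as in the example above
UMBRELLA_TERMS = {
    term: "first_degree_relative"
    for term in ("mother", "father", "brother", "sister", "children", "spouse")
}


def normalise(text, lowercase=True, drop_stop_words=True):
    """Tokenise on alphanumeric runs, then optionally lowercase and drop stop words.

    Umbrella terms are always replaced, regardless of the original letter case.
    """
    cleaned = []
    for token in re.findall(r"[A-Za-z0-9]+", text):
        key = token.lower()
        if drop_stop_words and key in STOP_WORDS:
            continue
        if key in UMBRELLA_TERMS:
            cleaned.append(UMBRELLA_TERMS[key])
        else:
            cleaned.append(key if lowercase else token)
    return cleaned


print(normalise("Mother and brother of the patient have diabetes."))
print(normalise("The MRI was clear", lowercase=False))
```

Whether to lowercase is task dependent: keeping the original case preserves acronyms such as "MRI", which can matter for entity-heavy text.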
These points are simple guidelines meant to get you started, and there will definitely be data-related considerations specific to the particular use case. What is important is that we explore and investigate the information available to us from every angle possible.
In the next article of this series, we will delve into the methodology of designing a comprehensive approach to meet the client’s specific requirements. Stay tuned for more actionable insights and guidance in our next instalment.
Key takeaways
- When working with data for a machine learning project, it is important to consider the specific characteristics and requirements of the data type, whether it is images, structured tables, or text.
- Ensure data quality, accuracy, consistency, and representation for effective model training.
- The analysis and exploration of data should align with the business objectives and subject matter understanding to ensure meaningful and actionable insights.
- Continuous evaluation and validation of data quality and relevance are necessary throughout the machine learning project lifecycle.
- Tailoring data preprocessing techniques and approaches to specific data types and tasks can enhance the quality and usability of the dataset.
- Collaboration and expertise from domain experts, data annotators, and data scientists are vital in ensuring accurate and valuable data for machine learning projects.
Data Science Discovery is a step on the path of your data science journey. Please follow us on LinkedIn to stay updated.
About the writers:
- Ujjayant Sinha: Data scientist with professional experience in market research and machine learning in the pharma domain across computer vision and natural language processing.
- Ankit Gadi: Driven by a knack and passion for data science coupled with a strong foundation in Operations Research and Statistics has helped me embark on my data science journey.