How AWS SageMaker Ground Truth speeds up cancer identification.

Jul 22, 2023

Introduction

Developing and training machine learning (ML) models poses a significant challenge: obtaining enough accurately labeled data, at scale, to support precise predictions. While labeling data may initially seem straightforward, the reality is far more complex.

Before any data is annotated, we must create a custom labeling workflow and user interface tailored to our project. This requires both robust tools and skilled workers, and it consumes considerable effort. Once the workflow and interface are in place, we organize and train a workforce to use them, all before a single data point is labeled.

Data labelling

Data labeling is a foundational step in building accurate machine learning models. High-quality labeled data facilitates effective learning, better generalization, and reliable predictions, leading to the success of machine learning applications across various domains. It is a crucial process in machine learning that involves assigning meaningful and relevant tags or categories to the data points in a dataset. These labels provide the ground truth information necessary for training supervised learning models. Data labeling helps algorithms understand the relationship between input data and corresponding output or target values.

The importance of data labeling lies in building accurate machine learning models. Here’s why it matters:

Training Supervised Models: In supervised learning, models learn from labelled examples to make predictions on new, unseen data. Accurate labels enable the model to recognize patterns and associations between features and outcomes, leading to better predictions.
Quality of Training Data: The accuracy of data labels directly affects the quality of training data. If the labels are incorrect or noisy, the model might learn incorrect patterns, leading to poor generalization and performance.
Domain-Specific Knowledge: Domain expertise is often required to interpret the data correctly and assign appropriate labels. Human expertise ensures that labels reflect the real-world context, improving the model’s understanding of the data.
Model Evaluation: Testing the model against correctly labelled data allows us to measure its accuracy, precision, recall, and other metrics.
Bias and Fairness: Properly labeled data can help identify and mitigate bias in the training process, ensuring fairness in predictions across different groups.
Iterative Improvement: Continuous improvement of data labelling ensures that the model becomes more accurate and reliable over time.
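
To make the Model Evaluation point concrete, here is a minimal sketch of how accuracy, precision, and recall can be computed once correctly labelled data is available. The function name and the sample labels are hypothetical and chosen only for illustration; in practice a library such as scikit-learn would typically be used instead.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, and recall for a binary task."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    # Count true positives, false positives, and false negatives.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs) if pairs else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical ground-truth labels (1 = cancer-positive) and model predictions.
ground_truth = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = classification_metrics(ground_truth, predictions)
print(metrics)
```

If the ground-truth labels themselves are wrong, every one of these numbers becomes misleading, which is why label quality matters for evaluation as much as for training.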

Challenges in Data labelling

Some of the main challenges in data labelling include:

Subjectivity and Ambiguity: Different labellers may assign different labels to the same data, leading to inconsistency and potential bias in the dataset.
Scalability: Labelling large datasets can be time-consuming and expensive, especially for complex tasks or specialized domains. Scaling up data labelling efforts to handle big datasets can be a significant challenge.
Expertise and Training: Certain tasks require domain-specific knowledge or expertise, and finding qualified labellers who understand the nuances of the data can be difficult.
Cost: Hiring human labellers can be costly, especially for tasks that demand specialized skills. Additionally, iterative improvements and re-labelling for quality assurance can add to the overall cost.
Labeller Bias: Labellers can introduce human biases into the labeling process, which affects the quality and fairness of the labelled dataset.
Data Imbalance: Data imbalance can lead to biased model training and result in poor performance on underrepresented classes.
Time Sensitivity: In some real-time applications, data labelling needs to be done quickly, making it challenging to ensure accuracy and consistency.
Privacy and Security: Data labeling may involve sensitive information, and maintaining data privacy and security is crucial to avoid any potential breaches.
Continuous Learning: As machine learning models evolve and encounter new scenarios, continuous updating and re-labelling of data might be necessary to maintain model accuracy.
Complex Labelling Tasks: Certain tasks, such as object detection or semantic segmentation, require more intricate and detailed labelling, making the process more labor-intensive and prone to errors.
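
The Subjectivity and Ambiguity challenge can be quantified. One common measure, which this article does not prescribe and is shown here only as an illustration, is Cohen's kappa: the agreement between two labellers corrected for the agreement expected by chance. The sketch below assumes two labellers have annotated the same items; the label values are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two labellers on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both labellers must label the same non-empty set of items")
    n = len(labels_a)
    # Fraction of items on which the two labellers agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each labeller assigned labels independently
    # according to their own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labellers used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical annotations of six images by two labellers.
labeller_1 = ["malignant", "benign", "benign", "malignant", "benign", "benign"]
labeller_2 = ["malignant", "benign", "malignant", "malignant", "benign", "benign"]

kappa = cohens_kappa(labeller_1, labeller_2)
print(f"kappa = {kappa:.3f}")
```

A kappa near 1 indicates strong agreement, while a value near 0 means the labellers agree no more often than chance would predict, which usually signals that the labeling guidelines need to be tightened.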

Furthermore, after constructing the labeling systems, designing workflows, and training the workforce, we continuously monitor and verify the data passing through the system to guarantee consistent, high-quality results. Only when enough data has been labeled can we finally proceed to train the ML model.

Each of these steps represents a significant investment of time, resources, and energy.

With Amazon SageMaker Ground Truth Plus, we can reclaim these valuable resources and redirect them towards building ML models.

Amazon SageMaker Ground Truth Plus

Amazon SageMaker Ground Truth Plus is a simple solution that empowers us to create high-quality training datasets without having to develop labeling applications or manage labeling teams. We no longer need in-depth ML expertise or extensive workflow design and quality management knowledge. We simply provide our data and labeling requirements, and Ground Truth Plus handles the rest, setting up data labeling workflows and managing them according to our specifications.

For instance, if we require medical experts to label radiology images, we can specify this in the guidelines provided to Ground Truth Plus. The service will automatically select labelers trained in radiology to annotate our data, supported by an expert workforce trained in a variety of ML tasks. This infusion of ML-powered automation enhances the output dataset’s quality while simultaneously reducing data labeling costs.

Ground Truth Plus employs a multi-step labeling workflow that leverages ML techniques for active learning, pre-labeling, and machine validation. This reduces the time required for labeling datasets across various use cases, including computer vision and natural language processing. Additionally, the service offers transparency into data labeling operations and quality management through interactive dashboards and user interfaces. We can monitor the progress of training datasets across multiple projects, track project metrics such as daily throughput, inspect label quality, and provide feedback on the labeled data.

Core Components of Amazon SageMaker Ground Truth Plus

Ground Truth Plus uses ML techniques, including active learning, pre-labeling, and machine validation. This increases the quality of the output dataset and decreases data labeling costs.

Project: Each qualified engagement with an AWS expert results in a SageMaker Ground Truth Plus project. A project can be in the pilot or production stage.
Batch: A batch is a collection of similar recurring data objects such as images, video frames and text to be labeled. A project can have multiple batches.
Metrics: Metrics are data about your SageMaker Ground Truth Plus project for a specific date or over a date range.
Task type: SageMaker Ground Truth Plus supports five task types for data labeling: text, image, video, audio, and 3D point cloud. You can also have a custom task type.
Data objects: Individual items that are to be labeled.

How SageMaker Ground Truth Plus works

To get started, we head to the new Ground Truth Plus console and complete a form outlining our data labeling project requirements. After that, a team of AWS Experts will schedule a call to discuss our project in detail.

Once the call concludes, we simply upload our data to an Amazon Simple Storage Service (Amazon S3) bucket for labeling. AWS-assigned experts will configure the data labeling workflow according to our needs and assemble a team of labelers with the expertise required to annotate our data effectively.

These expert labelers use the tools provided by Ground Truth Plus to quickly and accurately label the datasets. As they annotate the data, the service’s ML systems simultaneously kick in and start pre-labeling images on behalf of the expert workforce.

As more data is labeled, the ML model becomes more adept at pre-labeling images, reducing the need for manual labeling and thus lowering our costs and accelerating dataset creation without compromising quality.

Throughout the process, ML models will also highlight potential areas of interest that the labeling workforce may have missed or mislabeled. These suggestions are presented to human labelers for confirmation or correction, iteratively improving the pre-labeling and machine validation stages, and ensuring a high-quality output.
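
The interplay between pre-labeling and human review described above can be sketched as a simple routing rule. This is only an illustration of the idea, not how Ground Truth Plus is implemented internally: the `PreLabel` class, the `route_prelabels` function, and the 0.9 threshold are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class PreLabel:
    image_id: str
    label: str
    confidence: float  # model's estimated probability for `label`, in [0, 1]


def route_prelabels(prelabels, confirm_threshold=0.9):
    """Split model pre-labels into two human work queues.

    High-confidence pre-labels go to a quick "confirm or correct" queue;
    low-confidence ones go to full manual annotation. Nothing is accepted
    without a human looking at it.
    """
    to_confirm, to_annotate = [], []
    for item in prelabels:
        if item.confidence >= confirm_threshold:
            to_confirm.append(item)
        else:
            to_annotate.append(item)
    return to_confirm, to_annotate


# Hypothetical pre-labels produced by a model for three images.
prelabels = [
    PreLabel("scan_001", "tumor", 0.97),
    PreLabel("scan_002", "tumor", 0.62),
    PreLabel("scan_003", "no_finding", 0.91),
]

to_confirm, to_annotate = route_prelabels(prelabels)
print("confirm: ", [p.image_id for p in to_confirm])
print("annotate:", [p.image_id for p in to_annotate])
```

As the model improves, more items land in the faster confirmation queue, which is where the time and cost savings come from.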

We can easily monitor the project’s progress and output through the Ground Truth Plus Project Portal. The portal allows us to track the amount of data labeled on a day-by-day basis and ensure the project is progressing at an acceptable rate.

Once each batch of images is labeled, we can decide whether to accept them or request relabeling if necessary.

Upon completion of the labeling process, we can retrieve the labeled data from a secure S3 bucket and proceed with training our models. Ground Truth Plus simplifies and streamlines the dataset labeling process, making it more efficient, cost-effective, and accessible to all.

Example usage for cancer detection

Amazon SageMaker Ground Truth Plus can support the identification of cancer-positive cases by providing a streamlined and efficient data labeling solution for medical image datasets. Here’s how it can be used in the context of cancer detection:

Data Collection: Medical imaging datasets, such as radiology images, need to be collected from various sources. These images will be the input data for the cancer detection model.
Setting Up the Labeling Project: The user, typically a medical expert or researcher, creates a data labeling project in the Amazon SageMaker Ground Truth Plus console. They outline the specific requirements for labeling the medical images, including guidelines and instructions for identifying cancerous regions in the images.
Expert Workforce Selection: Based on the guidelines provided, Ground Truth Plus automatically selects a team of expert labelers trained in radiology or medical imaging. These experts possess the necessary skills to accurately identify and label cancerous regions in the images.
Data Labeling Workflow: Ground Truth Plus sets up the data labeling workflow according to the project’s requirements. This workflow may include pre-labeling using ML techniques, active learning, and machine validation, which help accelerate the labeling process and improve accuracy over time.
Image Annotation: The expert labelers begin annotating the medical images, marking the areas that indicate the presence of cancer. As they proceed, the ML model starts pre-labeling the images to assist the experts and reduce their manual efforts.
ML Model Refinement: As more images get labeled, the ML model becomes better at pre-labeling, highlighting potential regions of interest for the expert labelers to verify. This iterative process helps improve the model’s performance and reduces the time required for manual labeling.
Quality Management: Ground Truth Plus provides transparency into the data labeling operations through interactive dashboards and interfaces. This allows the user to monitor the labeling progress, inspect the quality of labels, and provide feedback to improve the dataset’s accuracy.
Review and Validation: Once the labeling process is complete, the user can review the labeled data to ensure accuracy and make any necessary corrections or adjustments.
Model Training: After the dataset is labeled and verified, it can be used to train a cancer detection model. The labeled images serve as input data, and the corresponding cancer labels act as the ground truth for training the model.
Model Deployment: The trained cancer detection model can then be deployed and used to analyze new medical images, automatically identifying potential cancerous regions and assisting healthcare professionals in the diagnosis process.

By utilizing Amazon SageMaker Ground Truth Plus for cancer detection, the process of data labeling becomes more efficient, cost-effective, and accurate. The combination of expert labelers and machine learning techniques helps to create high-quality datasets for training robust cancer detection models, ultimately benefiting the medical community and improving patient outcomes.

Serious Note on Expert workforce selection

While Amazon SageMaker Ground Truth Plus can leverage machine learning techniques for pre-labeling and assist expert labelers in the annotation process, it should always be used in conjunction with a team of trained medical experts who can validate and refine the labels. The expertise of medical professionals is critical for the accuracy and reliability of the cancer detection model, ultimately leading to better patient care and outcomes.

Cancer diagnosis is a critical task that requires a high level of expertise and accuracy to ensure patient safety and well-being.

Complements Medical experts

Our approach uses Amazon SageMaker Ground Truth Plus to assist and empower the expert workforce that labels data for cancer detection, in the following ways:

Automated Workflow Setup: Ground Truth Plus automates the process of setting up the data labeling workflow. This eliminates the need for manual workflow design and configuration, saving valuable time and effort for the expert labelers.
ML-Powered Pre-Labeling: The platform leverages machine learning techniques for pre-labeling images, assisting the expert labelers in their annotation task. The pre-labeling reduces the amount of manual labeling required by the experts, making the overall process more efficient.
Active Learning: Ground Truth Plus employs active learning, which selects the most informative samples for labeling based on the model’s uncertainty. This means that the platform intelligently prioritizes the most challenging or uncertain images for expert annotation, allowing them to focus on critical cases that require their expertise the most.
Machine Validation: The ML model also performs machine validation, identifying potential areas of interest that the expert labelers may have missed or mislabeled. The model’s suggestions are presented to the expert labelers for verification, improving the overall accuracy and quality of the labeled dataset.
Quality Monitoring: Ground Truth Plus provides interactive dashboards and user interfaces that allow the expert workforce to monitor the progress of the labeling project. They can inspect the labeled data, track daily throughput, and ensure that the labeling process meets the desired quality standards.
Feedback Loop: Expert labelers can provide feedback to continuously improve the performance of the ML model and the labeling workflow, which in turn improves the dataset and the overall efficiency of the process.
Resource Allocation: By automating various aspects of the data labeling process, Ground Truth Plus allows the expert workforce to focus on the most critical and challenging cases. It optimizes resource allocation, ensuring that the expert labelers’ time and expertise are utilized where they are needed the most.
Simplified Management: Ground Truth Plus manages the labeling workforce and workflows, which lets expert labelers concentrate solely on their core task of accurate data annotation and reduces their burden.
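
The Active Learning point in the list above can be illustrated with uncertainty sampling, one common active learning strategy. The sketch below ranks unlabeled images by the entropy of the model's predicted class probabilities and picks the most uncertain ones for expert annotation. The function names and the probabilities are hypothetical, and the strategy Ground Truth Plus actually uses is not specified here.

```python
import math


def entropy(probs):
    """Shannon entropy (in bits) of a predicted class distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def select_most_uncertain(predictions, k):
    """Return the ids of the k items whose predictions have the highest entropy.

    `predictions` maps an item id to a list of class probabilities.
    """
    ranked = sorted(predictions, key=lambda item_id: entropy(predictions[item_id]), reverse=True)
    return ranked[:k]


# Hypothetical model outputs: [P(cancer-positive), P(cancer-negative)] per image.
predictions = {
    "scan_001": [0.98, 0.02],
    "scan_002": [0.55, 0.45],
    "scan_003": [0.80, 0.20],
    "scan_004": [0.50, 0.50],
}

selected = select_most_uncertain(predictions, k=2)
print(selected)
```

Images on which the model is nearly undecided are sent to the experts first, so their time goes to the cases where a human judgment adds the most information.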

Conclusion

Overall, Amazon SageMaker Ground Truth Plus is a powerful tool that complements the expert workforce. By automating time-consuming tasks, offering intelligent assistance through ML techniques, and providing quality monitoring and feedback mechanisms, it enables expert labelers to work more efficiently, accurately, and confidently. This ultimately leads to improved machine learning model performance and better outcomes in various domains, including cancer detection.

Reference: https://aws.amazon.com/sagemaker