How to generate labeled data for deep learning?


Your machine learning model is only as good as your training data

Modern machine learning techniques, especially in the fields of deep learning and AI, are 'humanizing' machines at a rapid pace. This has made machine learning a key enabler of automation in many areas, including image-analysis-based fields such as medical imaging and microscopy.

While machine learning 'magically' answers many challenges, the magic only happens with the right training data. The quality and relevance of training data directly affect the performance of the machine learning model. In the past decade, many startups have emerged that offer data annotation and labeling services. They often outsource the annotation work to individuals who may not fully understand the application and research goals.

For a research-oriented field such as microscopy, machine learning deserves data that are precisely annotated by a subject matter expert. This is not to say that the researcher must spend her valuable time annotating every pixel of every image. The journey starts with annotating a few images manually using an annotation tool. This nucleates the process of generating reliable training data for deep learning.

Deep learning requires large amounts of training data, on the order of thousands or even hundreds of thousands of images. Hand annotation of that many images is not practical and not a good use of a researcher’s valuable time. Fortunately, transfer learning has enabled the reuse of a model pre-trained on a different data set for a new problem. This has made it possible to train deep learning models with hundreds of images rather than thousands.

These hundreds of labeled images can be generated using traditional machine learning. Its superior performance with small amounts of training data can be leveraged to help automate the annotation process.
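As a rough sketch of this idea, the snippet below trains a scikit-learn RandomForestClassifier on only the annotated pixels of a single image and then predicts a label for every pixel. The synthetic image, the three per-pixel features, and the helper name pixel_features are assumptions made for illustration; a real workflow would likely use a richer feature set and real micrographs.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def pixel_features(image):
    """Hypothetical helper: intensity, 3x3 local mean, and gradient magnitude per pixel."""
    padded = np.pad(image, 1, mode="edge")
    h, w = image.shape
    # 3x3 local mean built from nine shifted views of the padded image
    local_mean = sum(
        padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)
    ) / 9.0
    gy, gx = np.gradient(image)
    grad_mag = np.hypot(gx, gy)
    return np.stack([image, local_mean, grad_mag], axis=-1).reshape(-1, 3)


# Synthetic stand-in for a micrograph: a bright square on a dark, noisy background
rng = np.random.default_rng(0)
truth = np.zeros((64, 64), dtype=np.uint8)
truth[20:44, 20:44] = 1
image = truth * 0.6 + rng.normal(0.2, 0.05, size=truth.shape)

# Partial annotation: 0 = unlabeled, 1 = background, 2 = object
annotation = np.zeros_like(truth)
annotation[2:8, 2:8] = 1        # a small background scribble
annotation[28:34, 28:34] = 2    # a small object scribble

features = pixel_features(image)
labeled = annotation.ravel() > 0

# Train only on the annotated pixels
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(features[labeled], annotation.ravel()[labeled])

# Predict a label for every pixel to obtain a full segmentation mask
mask = clf.predict(features).reshape(image.shape)
accuracy = np.mean((mask == 2) == (truth == 1))
print(f"pixel agreement with ground truth: {accuracy:.3f}")
```

Only 72 of the 4,096 pixels carry a label here, which illustrates why a few partial annotations can be enough for this kind of model.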

A handful of partially annotated images is often enough to train a traditional machine learning algorithm (e.g. RandomForest). This trained model can be used to segment tens or hundreds of images. These raw and segmented images can be further divided into smaller patches and augmented using random image transformation operations. This process results in the thousands of images and corresponding labels (masks) that are required for deep learning. The following video walks you through the process described above, all the way from raw images to thousands of labeled images for deep learning.
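The patching and augmentation steps can be sketched with NumPy alone. The function names extract_patches and augment, the patch size of 128, and the choice of flips and 90-degree rotations are assumptions for illustration, and the random array is only a placeholder for a real raw image and its segmented mask.

```python
import numpy as np


def extract_patches(image, mask, patch_size):
    """Split an image and its mask into non-overlapping square patches."""
    patches = []
    h, w = image.shape[:2]
    for y in range(0, h - patch_size + 1, patch_size):
        for x in range(0, w - patch_size + 1, patch_size):
            patches.append(
                (image[y:y + patch_size, x:x + patch_size],
                 mask[y:y + patch_size, x:x + patch_size])
            )
    return patches


def augment(image, mask, rng):
    """Apply the same random 90-degree rotation and flip to image and mask."""
    k = int(rng.integers(0, 4))          # number of 90-degree rotations
    image, mask = np.rot90(image, k), np.rot90(mask, k)
    if rng.random() < 0.5:               # random left-right flip
        image, mask = np.fliplr(image), np.fliplr(mask)
    return image.copy(), mask.copy()


rng = np.random.default_rng(42)
# Placeholder data standing in for one raw image and its segmented mask
raw = rng.random((512, 512))
seg = (raw > 0.5).astype(np.uint8)

patches = extract_patches(raw, seg, patch_size=128)   # 4 x 4 = 16 patches
augmented = [augment(img, msk, rng) for img, msk in patches for _ in range(8)]

print(len(patches), len(augmented))  # 16 128
```

The key design point is that each random transformation is applied identically to the image and its mask, so the labels stay aligned with the pixels. Under these assumed settings, 100 segmented images would yield 100 x 16 x 8 = 12,800 image and mask pairs.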

NOTE: The terms ‘annotate’ and ‘label’ have been used interchangeably in this blog article.

Sreenivas Bhattiprolu
