Azure Language: Auto-labeling with GPT

Azure Language

Azure Language is a service in the Azure Cognitive Services suite dedicated to Natural Language Processing (NLP). It can be combined with Azure OpenAI to extend its language processing capabilities.

Azure Language provides a variety of features, including sentiment analysis, entity recognition, key phrase extraction, language detection, summarization, and language understanding. Translation and speech recognition belong to the separate Translator and Speech services of the same suite. These features rely on machine learning and deep learning models to interpret language, so that applications can make effective use of textual data.

Language x OpenAI

Let’s explore together how Azure Language coupled with GPT can help you label your documents using the “custom text classification” feature.

Imagine you are responsible for Electronic Document Management (EDM) at your company. You need an efficient system that makes it easy to navigate all internal documents. Naturally, you will consider classifying these documents according to several criteria: business domain (HR / Marketing / Legal / IT / Technical Implementation, etc.), document format, document access type, confidentiality level, creation date, modification date, author, and more.

All of these criteria can be treated as metadata attached to your documents. The challenge lies in obtaining and assigning this metadata in a complete and high-quality manner. One option is to delegate the task to employees in each department, but this has several limitations:

Humans are prone to errors: they may misunderstand what a metadata field means or make careless mistakes when assigning values.
The capacity of an individual to process documents is limited and can be influenced by their workload, leading to potential delays in metadata assignment.
As the company evolves, its document management approach may change, requiring the creation of new metadata for a significant number of documents.

The auto-labeling feature offered by Azure Language provides a solution to these challenges. With a GPT model and text documents in the .txt format, you can perform document classification. You don’t even need pre-labeled documents, as you can add labels directly within the studio interface. These labeled documents then serve as examples for GPT to suggest relevant labels.

Label your data with GPT

To begin, once in Azure Language Studio, you need to:

Create a Custom Text Classification project.
Associate an Azure Storage account.
Specify the language(s) of the text to analyze.
Specify the container that contains your .txt files.
Specify whether the documents are already labeled or not.

For my example, I chose to use GPT to suggest tags for examination reports from the time when I was a biology student. I provided unlabeled text documents. First, I added the labels of interest using “Add class” and assigned a few example documents to each class for GPT to follow (see screenshot below). Here are the classes I considered:

Personal or group work: Individual / Group.
Report submission date: 2014 / 2015 / 2016.
Report topic: Genetics / Ecology / Statistics.

Once the classes were created and populated with examples, I tried the auto-labeling feature with Azure OpenAI. I simply needed to specify the OpenAI resource to associate and the deployment to use (text-davinci-002), indicate the classes to suggest, and provide the files to analyze. After the job was completed, I was able to observe the suggested classes for each document and directly indicate on the interface whether I accepted the tags or not (see image below).
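
To give an intuition of why class names and examples matter, the sketch below assembles a few-shot classification prompt from labeled examples. This is only an assumption about the general technique: the prompt that Azure Language actually sends to the deployment is not shown in the studio, and the function build_few_shot_prompt as well as the sample sentences are invented for this illustration.

```python
def build_few_shot_prompt(classes, examples, document):
    """Assemble a plain-text classification prompt from labeled examples."""
    lines = ["Classify the document into one of: " + ", ".join(classes) + ".", ""]
    # Each labeled example shows the model the expected input/output pattern.
    for text, label in examples:
        lines += [f"Document: {text}", f"Class: {label}", ""]
    # The unlabeled document comes last, with the class left for the model to fill in.
    lines += [f"Document: {document}", "Class:"]
    return "\n".join(lines)


classes = ["Genetics", "Ecology", "Statistics"]
# Made-up placeholder excerpts, not taken from the real reports.
examples = [
    ("We crossed two Drosophila lines and counted phenotypes.", "Genetics"),
    ("We surveyed plant species along a forest transect.", "Ecology"),
]
prompt = build_few_shot_prompt(
    classes, examples, "We fitted a linear regression to the samples."
)
print(prompt)
```

In this sketch no example is given for the Statistics class, so the model would have to rely on the class name alone for it.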

In the last image, we can observe the tags associated with each of my documents. And to obtain these labels, I didn’t have to train a custom model! I used GPT together with meaningful class names and relevant examples.

It is worth noting that one must be careful with the provided classes; they should be relevant and enable GPT to classify effectively. Similarly, the provided examples should be appropriate and ideally sufficient in quantity for all the classes being suggested. In my case, I intentionally reduced this number to save time.
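
A quick way to check that every class has enough examples is to count them before launching the job. The sketch below is a hypothetical helper: the function name under_represented, the file names, and the threshold of two examples per class are assumptions chosen for illustration, not a limit documented by Azure Language.

```python
from collections import Counter


def under_represented(labeled_docs, classes, min_examples=2):
    """Return {class: example_count} for classes below the chosen threshold."""
    counts = Counter(label for _, label in labeled_docs)
    # Counter returns 0 for classes that have no example at all.
    return {c: counts[c] for c in classes if counts[c] < min_examples}


# Hypothetical (file name, class) pairs standing in for the labeled examples.
labeled = [
    ("report_01.txt", "Genetics"),
    ("report_02.txt", "Genetics"),
    ("report_03.txt", "Ecology"),
]
print(under_represented(labeled, ["Genetics", "Ecology", "Statistics"]))
# {'Ecology': 1, 'Statistics': 0}
```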

Another approach would have been custom named entity recognition. Still using GPT for auto-labeling, we provide examples of the entities to recognize by marking the corresponding text within the documents. Once a representative dataset is formed, we ask GPT to indicate the entities it recognizes in the unlabeled documents.
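
For entity recognition, an example is a span of text inside a document and not a label on the whole document. A minimal sketch, assuming a span is described by a category, a character offset, and a length, could look as follows. The EntityLabel class, the entity categories, and the sample sentence are hypothetical and do not reproduce the exact labels file used by the studio.

```python
from dataclasses import dataclass


@dataclass
class EntityLabel:
    category: str  # hypothetical entity type, e.g. "Author"
    offset: int    # position of the first character of the span
    length: int    # number of characters in the span


def entity_text(document: str, label: EntityLabel) -> str:
    """Return the text covered by an entity label."""
    return document[label.offset : label.offset + label.length]


# Made-up sentence used only to illustrate the span representation.
doc = "Report on Mendelian genetics, submitted by A. Martin in 2015."
labels = [
    EntityLabel("Author", doc.index("A. Martin"), len("A. Martin")),
    EntityLabel("Year", doc.index("2015"), len("2015")),
]
print([(label.category, entity_text(doc, label)) for label in labels])
# [('Author', 'A. Martin'), ('Year', '2015')]
```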

Conclusion

In summary, auto-labeling with GPT increases productivity by automating part of the data labeling process, which has traditionally relied on time-consuming and costly manual work. This feature offers several benefits: it saves time through quick tag suggestions, reduces costs by limiting the need for qualified staff during labeling, and can improve consistency, since the model draws on what it learned from large amounts of training data and is not subject to fatigue or careless mistakes.

AI Data Vision