Building a Text Classifier using Azure Machine Learning
Recently a client came to us to see if we could help them automate their RFP distribution system. Currently the client has an employee manually check several websites for RFPs and alert the appropriate business vertical when a relevant RFP is found. The current system requires manual data scraping, meaning the process is slow and results in RFPs being missed. For the proof of concept phase with the client, we decided to build a machine learning model to classify the RFPs correctly and provide a way to automate the routing of the RFPs. The client wanted to break the project into stages so once the initial Proof of Concept was successful, other parts required to automate the whole process would receive the go-ahead.
Due to the abbreviated time period, we decided to use Microsoft’s Azure Machine Learning Studio to build the model. Azure Machine Learning Studio also provided great visualizations of the model for the client. When we develop an end-to-end solution for the client, Azure Machine Learning Service will be implemented. If you are curious about the differences between Azure ML Studio and Azure ML Service, this article provides an excellent explanation.
Model Evaluation – Confusion Matrix:
If the model is built correctly, one should see a distribution like the one shown above. The model assigns a probability per category to reflect its confidence in how to categorize each story. It is normal for a news story to be classified in one main class, but the model recognizes there is a probability that the story could belong to multiple classes.
The model's evaluation metrics also showed good accuracy.
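To make the evaluation concrete, here is a minimal sketch of how a confusion matrix and overall accuracy can be computed from actual and predicted labels. The labels and predictions below are hypothetical and are not output from the model described here; Azure Machine Learning Studio produces these figures itself in its Evaluate Model output.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Return a nested dict: matrix[actual_label][predicted_label] -> count."""
    counts = Counter(zip(actual, predicted))
    return {a: {p: counts[(a, p)] for p in labels} for a in labels}


def accuracy(actual, predicted):
    """Fraction of examples whose predicted label matches the actual label."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


# Hypothetical labels and predictions, for illustration only.
labels = ["business", "sport", "tech"]
actual = ["business", "sport", "tech", "tech", "sport", "business"]
predicted = ["business", "sport", "tech", "business", "sport", "business"]

matrix = confusion_matrix(actual, predicted, labels)
overall_accuracy = accuracy(actual, predicted)

# Print one row per actual label; columns follow the order in `labels`.
for actual_label in labels:
    print(actual_label, [matrix[actual_label][p] for p in labels])
print("accuracy:", round(overall_accuracy, 3))
```

Rows are actual classes and columns are predicted classes, so off-diagonal counts show which categories the model confuses with each other.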
Step 1: Receiving and cleaning the data.
The client uploaded several RFPs into different folders in Teams that were labeled with the client’s verticals. One of the challenges not solved in this POC is scraping the data from an RFP. Our focus was on starting small with the classifier to keep things moving forward. Every municipality creates its own version of an RFP, so most RFPs are not uniform. For this POC, the RFP summary data was extracted manually and added to a data file.
Step 2: Create the Model
To start, we followed the BBC News Classifier model outline. The R-Script module and the text_processing.zip found in the BBC News Classifier were switched with the pre-built Preprocess Text module. When the initial model was run, it classified all the data into the one bucket that had the largest number of examples. The model was run again including only data with labels with a high number of examples and a comparable amount in each bucket. Again, poor results. Time to re-think the model.
Microsoft has a great reference library around the modules available in Machine Learning Studio. While looking through the documentation around Text Analytics, two additional modules were found to test: “Extract Key Phrases from Text” and “Extract N-Gram Features from Text.” The Extract Key Phrases from Text module extracts one or more phrases deemed meaningful. The Extract N-Gram Features from Text module creates a dictionary of n-grams from free text and identifies the n-grams that have the most information value. The new model was run with a Multi-Class Decision Forest algorithm instead of the Multi-Class Neural Network. When the model was run with all the category labels, the results were closer to what was expected, but still not accurate.
One drawback was that the labels with minimal data were not being classified correctly. The model was re-run with only the category labels that had higher and comparable amounts of data.
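As a rough illustration of what an n-gram dictionary looks like, the sketch below builds one from a few hypothetical RFP summaries using only the Python standard library. It keeps n-grams by document frequency, which is a simplification chosen for this example; the Studio module scores n-grams by information value, and its internals are not reproduced here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, n):
    """Return all contiguous n-token phrases as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_ngram_dictionary(documents, max_n=2, min_doc_freq=2):
    """Collect n-grams (n = 1..max_n) that appear in at least min_doc_freq documents."""
    doc_freq = Counter()
    for doc in documents:
        tokens = tokenize(doc)
        seen = set()
        for n in range(1, max_n + 1):
            seen.update(ngrams(tokens, n))
        doc_freq.update(seen)  # count each n-gram once per document
    return sorted(g for g, c in doc_freq.items() if c >= min_doc_freq)


# Hypothetical RFP summaries, for illustration only.
documents = [
    "Road resurfacing and bridge repair services",
    "Bridge repair and inspection services",
    "Cloud data warehouse migration",
]

dictionary = build_ngram_dictionary(documents)
print(dictionary)
```

With a dataset this small, very few n-grams recur across documents, which hints at why the n-gram approach may struggle when each category has only a handful of examples.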
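Restricting the data to well-represented labels can be sketched as follows. The vertical names, the thresholds, and the choice to cap each label at a fixed number of rows are all assumptions made for illustration; the post does not say how the subset was actually selected in Studio.

```python
from collections import Counter


def select_balanced_rows(rows, min_examples, max_examples):
    """Keep labels with at least min_examples rows, capped at max_examples rows each."""
    label_counts = Counter(label for _, label in rows)
    kept_per_label = Counter()
    selected = []
    for text, label in rows:
        if label_counts[label] < min_examples:
            continue  # too few examples to train on reliably
        if kept_per_label[label] >= max_examples:
            continue  # cap large buckets so the label sizes stay comparable
        kept_per_label[label] += 1
        selected.append((text, label))
    return selected


# Hypothetical (summary, vertical) pairs, for illustration only.
rows = (
    [("transportation rfp %d" % i, "Transportation") for i in range(5)]
    + [("it rfp %d" % i, "IT") for i in range(3)]
    + [("parks rfp 0", "Parks")]
)

balanced = select_balanced_rows(rows, min_examples=3, max_examples=3)
print(Counter(label for _, label in balanced))
```

Capping the larger buckets is one simple way to keep a dominant category from swamping the others, at the cost of discarding some training data.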
Whoops! That was a step in the wrong direction. Maybe the n-gram feature wasn’t the best text analytics module to try. What happens if Feature Hashing is used instead? Feature Hashing transforms a stream of English text into a set of features represented as integers. The hashed features can then be passed to the machine learning algorithm to train the text analysis model.
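The idea behind feature hashing can be shown in a few lines. This is a minimal sketch, not the Studio module: it uses MD5 from the Python standard library purely to get a stable hash, and the `num_bits` parameter name and the sample text are assumptions chosen for illustration.

```python
import hashlib
import re


def hash_features(text, num_bits=10):
    """Map text to a fixed-length count vector of size 2**num_bits via token hashing."""
    size = 2 ** num_bits
    vector = [0] * size
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        # Use the first four bytes of the digest as an integer bucket index.
        index = int.from_bytes(digest[:4], "little") % size
        vector[index] += 1
    return vector


# Hypothetical RFP summary, for illustration only.
features = hash_features("Bridge repair and bridge inspection services")
non_zero = {i: count for i, count in enumerate(features) if count}
print(len(features), non_zero)
```

Because the vector length is fixed up front, no dictionary has to be built or stored, and unseen words at prediction time still land in some bucket. The trade-off is that different tokens can collide in the same bucket, which becomes more likely as `num_bits` shrinks.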
Well, that accuracy is much better but maybe a bit too good. Even though the lowest number of decision trees, the least depth, and the fewest random splits were used, the accuracy of the model was too good. We should expect to see some distribution, or a small probability that the RFP could be classified in other categories.
This could be due to the size of the dataset that is being used. It was good to find out that Feature Hashing does a better job than Extract Key Phrases from Text or Extract N-Gram Features from Text. What happens if a different machine learning algorithm (like the Multi-Class Neural Network) is used?
This is the best model yet. Distribution is across category labels as expected. There is a good chance of overfitting, but that can be worked out with additional data added to the model.
Since this was the best model yet, it was re-run with data from all category labels.
Results were encouraging, but clearly more data will be required to appropriately label all categories. As more data is added, there will be more improvements to the model. Two options worth considering would be applying an ensemble approach or trying NLP techniques like entity extraction, chunking, or isolating nouns and verbs.
Step 3: Automate the Model
Azure Machine Learning Studio’s option to Set Up a Web Service was used to create a Predictive Experiment and deploy it as a web service. Then, using the ML Studio add-in in Excel, a template was created where data can be added, the model can be run, and predictions are written to a scored probability column.
The next step was to create a table that reads the predicted data and can be picked up by a Flow. The Flow is set up to send a notification to a channel on Microsoft Teams.
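The POC used the Excel add-in, but the same web service could also be called from code. The sketch below builds (without sending) a request in the request-response JSON shape commonly used by Machine Learning Studio web services. The URL, API key, input name `input1`, and the `Summary` column are placeholders and assumptions; the real values come from the deployed service's API help page.

```python
import json
import urllib.request


def build_scoring_request(url, api_key, column_names, rows):
    """Build a POST request carrying rows to score; the caller decides when to send it."""
    payload = {
        "Inputs": {
            # "input1" is an assumed input name; check the service's API help page.
            "input1": {"ColumnNames": column_names, "Values": rows}
        },
        "GlobalParameters": {},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + api_key,
        },
        method="POST",
    )


# Placeholder endpoint, key, and column name; none of these are real values.
request = build_scoring_request(
    "https://example.invalid/score",
    "YOUR_API_KEY",
    ["Summary"],
    [["Bridge repair and inspection services"]],
)
print(request.get_method(), request.full_url)
# Sending would be: urllib.request.urlopen(request), omitted here on purpose.
```

Keeping request construction separate from sending makes the payload easy to inspect before anything is posted to a live endpoint.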
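The routing decision that feeds the notification can be sketched as picking the category with the highest scored probability and holding back anything the model is unsure about. The vertical names, the 0.5 threshold, and the idea of a manual-review fallback are assumptions for illustration and are not part of the Flow described here.

```python
def route_prediction(scored_probabilities, threshold=0.5):
    """Return the most probable label, or None when confidence is below the threshold."""
    label, probability = max(scored_probabilities.items(), key=lambda item: item[1])
    if probability < threshold:
        return None  # not confident enough; leave for manual review
    return label


def build_notification(rfp_title, scored_probabilities, threshold=0.5):
    """Compose the message text a notification step could post to a channel."""
    label = route_prediction(scored_probabilities, threshold)
    if label is None:
        return "RFP '%s' needs manual review." % rfp_title
    return "RFP '%s' routed to %s." % (rfp_title, label)


# Hypothetical scored probabilities, for illustration only.
confident = {"Transportation": 0.7, "IT": 0.2, "Parks": 0.1}
uncertain = {"Transportation": 0.4, "IT": 0.35, "Parks": 0.25}

print(build_notification("Bridge repair", confident))
print(build_notification("Data platform", uncertain))
```

A threshold like this matters most for the sparsely populated categories, where the model's top guess is least trustworthy.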
This is not a final solution. Several additional steps in a further POC will need to be completed to set up a fully automated solution, but the initial results are promising. What’s important to understand is how flexible this process can be. If the client scoped a different set of requirements, or was in a different industry, we could easily tailor a solution to fit their pain points.
If you’re thinking about a data platform, business analytics, or machine learning project, reach out here. We would love to work with you on a POC.
Jon is a Microsoft Data Analytics Consultant at Beyond Impact. With experience in cloud analytics solutions, Power BI, and personalized analytics training, he is an expert in tackling challenging business problems with data.