Introduction

This paper draws on the experience our team gained while digitizing customer feedback forms for a UK-based travel firm. It covers the lessons we learned over the course of the project, the roadblocks we faced, and the specific solutions we used to overcome them, which other teams can reuse when facing similar problems.

Problem Statement

The project was an offshoot of an ideation workshop that we conducted with a UK-based travel firm, in which we demonstrated our capabilities in cognitive services, especially computer vision, OCR (Optical Character Recognition) and ICR (Intelligent Character Recognition), using our home-grown framework SLICE (Self Learning Intelligent Content Extractor).

The customer is a leading travel operator in the UK and conducts branded tours for various destinations and lifestyles across multiple pricing options. As part of their feedback capture process, each customer is given a feedback form at the end of the tour, on which they rate the various services provided by the company. The company wanted to act on customer feedback much faster to ensure a great user experience. They were exploring options for an automated solution that extracts content from the scanned forms and uploads it to their Salesforce system. Their existing process required scanning the feedback forms and having an outsourced offshore team manually enter the data into Excel sheets and then add it back to Salesforce. The whole process not only took 8 to 10 weeks but was also error prone. To add to the complexity, the firm operates about 40 different types of tours, and each tour had a unique feedback form with a varying design. Volumes exceeded 250,000 forms per year.

Our Approach

Since the project was experimental, and neither the viability of the technology nor the accuracy of the output could be guaranteed upfront, we decided on a multi-staged implementation. The project was split into three distinct phases (PoC, Pilot and Production), with a decision on whether to continue taken at each checkpoint.

Proof of Concept

As part of the PoC phase, we collected a few scanned images of the feedback forms for one of the brands and focused on identifying the areas of interest within the form and the content that needed to be extracted. For example, the forms had specific logos identifying the brand of tour, and printed labels identifying the tourist along with a unique id. There were distinct sections for providing ratings and comments on areas of the tour such as Itinerary, Travel Director, Sightseeing, Accommodation, Driver, etc. Ratings were provided by ticking checkboxes, and comments as free-flowing handwriting in a multi-line text area. The feedback form also captured the tourist's future travel plans, destinations they wanted to visit, tentative schedules, etc.

We ran the experiments on 20 forms of one brand, and the accuracy of extraction was calculated manually. The proof of concept took 4 weeks to complete, and the results were discussed with the firm before commencing the next stage.

A few samples of the feedback forms are given below for reference:

[Figure: sample feedback forms]

Pilot

After the content extraction exercise was validated, we proceeded to the next phase, in which one brand was selected for a Pilot. During the Pilot we created the extraction logic to pull all content from the feedback form, and the extracted content was assembled into a specific input format so that it could be uploaded to Salesforce seamlessly. We also conducted various sandbox experiments to finalize the technology for extracting handwritten content. A small study is provided in the Appendix of this document. During the Pilot phase we also established that a scan quality of 300 dpi was the optimum specification for good extraction results. Any further increase did not improve the quality of extraction and would only have consumed more storage space.

Throughout the Pilot stage, the product owner from the customer side participated actively, and weekly calls were set up to monitor progress and evaluate extraction accuracy and efficiency. At the end of the Pilot phase, the decision was taken to move all brands to Production. The Pilot phase ran for 3 months.

Production

At the end of the Pilot phase, we reviewed the performance of the extraction process on the one brand, and the customer was happy with both the accuracy of extraction and the attainable velocity of execution. With this, a decision was made to digitize all the major brands, and a plan was devised to move the brands over in a structured way. During the Production phase we also drafted a plan for continuous integration and deployment of the code, code repository management, etc. We decided to run the extraction logic as an AWS Lambda process to provide for scaling. A process was established to pick up scanned images added to an AWS S3 bucket, write the extracted content in a Salesforce-uploadable format to another AWS S3 bucket, handle error scenarios, and report issues back so that the SLICE-based solution could continuously improve. A novel transaction-based pricing model was agreed with the firm, charged for each form extracted.

A diagram depicting the flow is given below:

[Figure: flow of scanned forms through extraction to Salesforce upload]

Technology

As described earlier, the SLICE framework was used to build the process for extracting content from the feedback forms. A diagram depicting the general process of extraction using SLICE is given below. Since SLICE is a reusable framework with multiple built-in capabilities, only those needed for this project were used, and the rest of the solution was custom coded.

[Figure: general process of extraction using SLICE]

Using the SLICE pre-processing capabilities, the feedback forms are prepared for extraction by de-skewing the form, improving the contrast of the scanned image, and applying standard noise removal.

Extraction

The feedback forms contained four distinct areas that needed to be extracted. We grouped them into four classes based on the technique that had to be applied for the best extraction: logo identification (to determine the brand), printed labels, checkboxes and handwritten comments. The techniques we used to extract each of these areas are explained below.

Logo Identification

Since the firm operates 40+ brands of tours, each brand has its own feedback form design. This meant that the extraction logic had to be altered slightly for each brand, so the first step was to identify the brand from the logo on the form. During the Pilot phase we attempted an image classification approach, training a model on all the different types of feedback forms. However, this required a very large set of sample forms for training, and because the form designs were so similar to each other, identification accuracy suffered. We therefore adopted a simpler approach: the logos of all known brands were stored, and a technique named “Template Matching”, available in the OpenCV library, was used to compare the logo on the form against the stored logos.

For certain brands, the feedback form carried no specific logo. In those cases, the naming convention of the scanned file was used to identify the tour type.
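The idea behind the template-matching step can be illustrated with a small pure-Python sketch. It slides each stored logo over a grayscale page and scores every position with a zero-mean normalized cross-correlation, which is the kind of score OpenCV's `cv2.matchTemplate` produces in its `cv2.TM_CCOEFF_NORMED` mode. The source does not state which matching mode or threshold was used, so the scoring mode, the `0.8` threshold and the function names `best_match` and `identify_brand` are assumptions made for illustration only; a real pipeline would more likely call OpenCV directly on the scanned image.

```python
from math import sqrt


def ncc_score(image, template, top, left):
    """Zero-mean normalized cross-correlation between the template and one image window."""
    th, tw = len(template), len(template[0])
    window = [image[top + r][left + c] for r in range(th) for c in range(tw)]
    patch = [value for row in template for value in row]
    mean_w = sum(window) / len(window)
    mean_p = sum(patch) / len(patch)
    num = sum((w - mean_w) * (p - mean_p) for w, p in zip(window, patch))
    den = sqrt(sum((w - mean_w) ** 2 for w in window)
               * sum((p - mean_p) ** 2 for p in patch))
    # A flat window or flat template has no structure to correlate with.
    return num / den if den else 0.0


def best_match(image, template):
    """Slide the template over the image; return (best score, (top, left))."""
    th, tw = len(template), len(template[0])
    ih, iw = len(image), len(image[0])
    best = (-1.0, (0, 0))
    for top in range(ih - th + 1):
        for left in range(iw - tw + 1):
            score = ncc_score(image, template, top, left)
            if score > best[0]:
                best = (score, (top, left))
    return best


def identify_brand(image, logos, threshold=0.8):
    """Return the brand whose stored logo matches best, or None below the threshold."""
    best_brand, best_score = None, threshold
    for brand, logo in logos.items():
        score, _ = best_match(image, logo)
        if score >= best_score:
            best_brand, best_score = brand, score
    return best_brand
```

Returning `None` when no logo clears the threshold leaves room for a fallback such as the file-name rule mentioned above.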

Printed Label

Printed labels are plain printed stickers attached at the top of each scanned form. The label contains information such as the tourist's name and address along with a unique code. To isolate the label, the “Template Matching” technique was used against a pre-created set of the label types that could apply to the tour type. Once the printed label was located programmatically, it was passed to the Tesseract OCR engine for extraction. Tesseract is an open-source OCR engine released under the Apache License, Version 2.0. It supports various languages and can be deployed on premise. Tesseract extracts content accurately as long as the content is printed and the image contains very little noise.

Checkbox

Checkboxes in the feedback form capture when the user plans to travel next (3 months, 6 months, 1 year, etc.), where they plan to travel (Asia, Americas, Africa, Europe, etc.) and their ratings for areas of the tour such as Itinerary, Travel Director, Sightseeing, Accommodation, Driver, etc. Each of these sections had multiple sub-options such as Customer Service, Responsiveness, Communication and Location. The tourist could tick the checkbox for a choice ranging from Excellent to Poor. If multiple checkboxes were marked for the same option, the lowest rating was captured. To determine which checkboxes were checked, samples of the various styles of ticked checkboxes were captured and compared against the form using the “Template Matching” technique, and the relative position of each ticked checkbox was then used to associate it with the corresponding data field.
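The association and tie-breaking rules described above can be sketched as follows. This is an illustration under stated assumptions: the intermediate labels in `RATING_ORDER` ("Good", "Average") are hypothetical since the source only gives the end points Excellent and Poor, and the helper names and the idea of mapping each ticked box to its nearest row and column coordinate are one simple way to use relative position, not necessarily the method used in the project.

```python
# Hypothetical rating scale, best to worst; the source only says "Excellent to Poor".
RATING_ORDER = ["Excellent", "Good", "Average", "Poor"]


def nearest(value, anchors):
    """Return the label whose anchor coordinate is closest to the given value."""
    return min(anchors, key=lambda label: abs(anchors[label] - value))


def resolve_ratings(ticked_boxes, row_y, column_x):
    """Map ticked checkbox centres (x, y) to {question: rating}.

    row_y maps each question to the y coordinate of its row,
    column_x maps each rating to the x coordinate of its column.
    When several boxes are ticked for one question, the lowest rating wins.
    """
    ratings = {}
    for x, y in ticked_boxes:
        question = nearest(y, row_y)
        rating = nearest(x, column_x)
        current = ratings.get(question)
        if current is None or RATING_ORDER.index(rating) > RATING_ORDER.index(current):
            ratings[question] = rating
    return ratings
```

For example, a form with both Excellent and Average ticked on the Itinerary row would resolve to Average for that row.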

Handwritten

As part of the rating section for areas of the tour such as Itinerary, Travel Director, Sightseeing, Accommodation and Driver, a multi-line free-text area was provided for the user to give their feedback. Extracting this content was the toughest part, since handwriting varied from user to user, ranging from block letters to cursive. The amount of noise in this area was also quite high, as users could scratch out what they had written, overwrite existing text, or write beyond the area provided.

The first step was to crop out the handwritten area of each section and attempt to extract content from each one. Multiple techniques were applied to extract the handwritten comments; each was evaluated for accuracy and efficiency before one approach was finally chosen.

The first approach was to perform the extraction using the Tesseract OCR engine, but the accuracy Tesseract provided on handwriting was poor. We tried various post-extraction corrections, such as n-gram techniques backed by a repository of words and their most relevant associated word blocks, but even this did not yield good accuracy.

The second approach was to send the individual handwritten areas to a cloud OCR engine such as Azure Computer Vision or Google Cloud Vision. A trial showed that these provided far more accurate extraction. The two cloud engines were then compared, and Azure Computer Vision was seen to give better accuracy (a sample comparison is attached in the Appendix). Even though the cloud OCR engine is a paid service charged per transaction, its better output made it the preferred option.
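The source does not say which metric was used to compare the engines. One common way to score OCR output against a manually transcribed reference is character-level accuracy based on edit distance, sketched below. The function names are hypothetical, and the metric itself is an assumption about how such a comparison could be made rather than a description of the study in the Appendix.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_accuracy(truth, extracted):
    """Character accuracy in [0, 1], relative to the length of the reference text."""
    if not truth:
        return 1.0 if not extracted else 0.0
    return max(0.0, 1.0 - edit_distance(truth, extracted) / len(truth))


def rank_engines(truth, outputs):
    """Return engine names ordered from most to least accurate on one sample."""
    return sorted(outputs, key=lambda name: char_accuracy(truth, outputs[name]), reverse=True)
```

Averaging `char_accuracy` over a set of transcribed samples would give one figure per engine that can be compared directly.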

Data output for Salesforce upload

Once all the data was extracted, it had to be written in a specific pipe-separated format so that it could be integrated easily with Salesforce. One text file was created for each feedback form and added to an AWS S3 bucket. A sample of the data output is shown below:

[Figure: sample pipe-separated data output]
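As a rough illustration of producing such a file, the sketch below writes extracted records as pipe-separated lines using Python's standard `csv` module. The field names in `FIELDS` are hypothetical, since the actual layout is only shown in the sample output, and the quoting of values that themselves contain a pipe is an assumption about what the receiving side accepts.

```python
import csv
import io

# Hypothetical field layout; the real column order is defined by the Salesforce upload format.
FIELDS = ["form_id", "brand", "section", "rating", "comment"]


def to_pipe_separated(records, include_header=False):
    """Render a list of dicts as pipe-separated text, one line per record."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="|", quoting=csv.QUOTE_MINIMAL,
                        lineterminator="\n")
    if include_header:
        writer.writerow(FIELDS)
    for record in records:
        # Missing fields become empty strings so every line has the same column count.
        writer.writerow([record.get(field, "") for field in FIELDS])
    return buffer.getvalue()
```

Handwritten comments can contain a literal pipe character, which is why the writer quotes such values instead of emitting them raw.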

Continuous Deployment

The code was maintained in the customer's GitHub account. A bi-weekly schedule was established in which all reported issues were collated and the code was corrected to fix the bugs. Any new feature or newly inducted brand was also included in the build. Once the code was branched, labelled and checked in, a script started the build process, and the executables were then moved to the Testing environment, then Staging, and finally Production. At each of these stages, automated test scripts were run against the build to look for regression errors.

Hosting

During the move from the Pilot phase to the Production phase, one of the key architectural decisions was to run the SLICE-based extraction tool on the serverless architecture of AWS Lambda so that the solution could scale with demand. The flow was designed so that scanned images are added to an AWS S3 bucket and the AWS Lambda function is invoked whenever a file is added to this bucket.
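A minimal sketch of such an S3-triggered handler is shown below. It only demonstrates how the bucket name and object key are read from the S3 event notification that Lambda receives; `process_form` is a hypothetical placeholder for the download, extraction and result upload, and the returned dictionary shape is an assumption for illustration.

```python
from urllib.parse import unquote_plus


def s3_objects(event):
    """Yield (bucket, key) pairs from an S3 event notification payload."""
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in the event (for example, spaces as '+').
        yield s3["bucket"]["name"], unquote_plus(s3["object"]["key"])


def process_form(bucket, key):
    """Hypothetical placeholder: download the scan, run extraction, upload the result."""
    return {"bucket": bucket, "key": key, "status": "queued"}


def lambda_handler(event, context):
    """Entry point invoked by Lambda for each S3 notification."""
    results = [process_form(bucket, key) for bucket, key in s3_objects(event)]
    return {"processed": len(results), "results": results}
```

Keeping the event parsing separate from the processing step makes the handler easy to exercise locally with a hand-built event dictionary.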

One of the key problems we faced during this phase was the inherent limitations of AWS Lambda. One lesson was that, for the executable to run in AWS Lambda, the build had to be created on a specific type of Linux instance, the Amazon Linux AMI. The other issue was Lambda's limit on deployment package size. Lambda had a hard limit of 250 MB on the unzipped deployment package, and our executable ran to more than 1 GB, mainly because the Tesseract engine was embedded in our build. To overcome this, we took the open-source code of Tesseract, stripped out all the parts that were not relevant to our use, and rebuilt the package, which finally brought it below 250 MB.
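A simple pre-deployment check can catch an oversized package before it reaches Lambda. The sketch below sums the file sizes under an unpacked build directory and compares the total with the limit. Treating 250 MB as 250 * 1024 * 1024 bytes and the function names are assumptions for illustration; this is not part of the build script described in this paper.

```python
import os

# Assumed interpretation of the 250 MB unzipped package limit, in bytes.
LAMBDA_UNZIPPED_LIMIT_BYTES = 250 * 1024 * 1024


def directory_size(path):
    """Total size in bytes of all regular files under path (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def fits_lambda_limit(path, limit=LAMBDA_UNZIPPED_LIMIT_BYTES):
    """Return True if the unpacked build directory is within the size limit."""
    return directory_size(path) <= limit
```

Running such a check as a build step would fail fast when a newly added dependency pushes the package back over the limit.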

Technology Stack

[Figure: technology stack]

Pricing

While moving from the Pilot phase to Production, one of the most important topics discussed was how to price the project. After multiple iterations, we arrived at a pricing structure with a one-time payment for all the effort spent on customizing the SLICE framework for this project. Whenever a new brand needed to be inducted, the effort would be estimated, a change request (CR) raised and an invoice drawn for that effort. For the feedback form data extraction itself, we arrived at a transactional pricing model that charges the customer a specific dollar amount for each feedback form. This cost was calculated after factoring in the cloud provisioning charges.

Conclusion

The earlier process, which involved sending the scanned feedback forms to an offshore BPO company where data from each form was manually entered into an Excel sheet and then moved into Salesforce, took around 3-4 months. It was time consuming and error prone, and it took too long for the data to reach an SME who could derive analytics from it. With the introduction of the SLICE-based solution, it now takes less than 30 seconds for the extracted content to reach Salesforce, and this straight-through processing (STP) requires no manual intervention at any stage.

The project was a big success for the travel firm. They gave us very positive feedback, and the project was quoted in the HfS Research POV paper on AI.

References:

https://en.wikipedia.org/wiki/Optical_character_recognition

https://en.wikipedia.org/wiki/Intelligent_character_recognition

https://docs.opencv.org/master/d4/dc6/tutorial_py_template_matching.html

https://github.com/tesseract-ocr

https://azure.microsoft.com/en-in/services/cognitive-services/computer-vision

https://aws.amazon.com/amazon-linux-ami

https://aws.amazon.com/lambda

https://www.hfsresearch.com/pointsofview/Change-the-game-with-verticaliszed-AI-NIIT-Technologies-unique-play-as-a-post-digital-firm