About the project
Brief results of the collaboration:
- The company cut time spent on analyzing each document from 12 minutes to 10 seconds.
- The delivered solution achieved 99% precision, freeing the customer's analysts to focus on more important business tasks.
The company is involved in investment management, helping organizations to allocate their financial assets to gain value. Headquartered in Boston, the customer has affiliates in London, Singapore, Tokyo, and Sydney. Operating globally, the company serves customers across 25 countries in Europe, Asia, the Middle East, North America, and Australia.
To find optimal investment opportunities, the company manually analyzed publicly available financial reports. Turning to Altoros, the customer wanted to automate the recognition and extraction of explicit tables of contents (ToCs) from reports in PDF format.
Under the project, the team at Altoros had to address the following issues:
- The entries in tables of contents varied greatly from company to company, so engineers at Altoros needed to bring them to a unified form for better recognition.
- In many cases, it was impossible to extract text from a PDF file directly. So, developers at Altoros needed to rely on optical character recognition (OCR), treating a PDF page as an image and parsing text from it.
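The second issue implies a decision step: use the text layer of a page when it is usable, and fall back to OCR when it is not. Below is a minimal sketch of such a check, not the project's actual logic. The function name `needs_ocr` and both thresholds are hypothetical, and the input is assumed to be whatever a PDF text extractor returned for one page.

```python
import string

# Hypothetical thresholds, chosen for illustration only.
MIN_CHARS = 20            # fewer characters than this suggests a scanned page
MIN_READABLE_RATIO = 0.8  # minimum share of letters, digits, and punctuation


def needs_ocr(extracted_text: str) -> bool:
    """Return True when a page's extracted text layer looks unusable."""
    # Drop all whitespace so the ratio reflects visible characters only.
    stripped = "".join(extracted_text.split())
    if len(stripped) < MIN_CHARS:
        return True  # empty or near-empty text layer
    readable = sum(
        ch.isalnum() or ch in string.punctuation for ch in stripped
    )
    # A low ratio usually means broken font encoding or control characters.
    return readable / len(stripped) < MIN_READABLE_RATIO
```

Pages flagged by a check like this would be rendered as images and passed to an OCR engine, while the rest keep their directly extracted text.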
At the preprocessing stage, our engineers parsed PDF files into symbols to recover text in a human-readable format and to extract geometrical and formatting features of text lines, such as fonts and coordinates. Using a classifier trained with scikit-learn, TensorFlow, and XGBoost, experts at Altoros were able to identify the pages containing tables of contents.
Our team also built another classifier to extract ToCs from files’ metadata, which was present in 10% of the documents.
To detect a table of contents in a file, developers at Altoros trained a classifier on a subset of documents annotated with bounding boxes that label the areas containing tables of contents. While parsing, our team employed different features based on the styles of ToCs. All the potentially relevant features were calculated and stored for each text line.
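To illustrate what per-line features might look like, here is a minimal sketch. The feature set, the `TextLine` structure, and the regular expressions are assumptions made for this example. The source does not list the features that were actually used.

```python
import re
from dataclasses import dataclass


@dataclass
class TextLine:
    """Hypothetical container for one parsed text line."""
    text: str
    x0: float         # left edge of the line on the page
    font_size: float


# Trailing page number, optionally preceded by dots or spaces.
_TRAILING_PAGE = re.compile(r"[.\s]*(\d{1,4})\s*$")
# Dot leaders such as "......" or ". . . ." between a title and a page number.
_DOT_LEADER = re.compile(r"\.{3,}|(?:\.\s){3,}")
# Section numbering such as "1", "1.", or "2.3.1" at the start of a line.
_NUMBERING = re.compile(r"^\s*(\d+(?:\.\d+)*)[.)]?\s+")


def line_features(line: TextLine) -> dict:
    """Compute simple ToC-related features for a single text line."""
    numbering = _NUMBERING.match(line.text)
    return {
        "ends_with_page_number": _TRAILING_PAGE.search(line.text) is not None,
        "has_dot_leader": _DOT_LEADER.search(line.text) is not None,
        # "1.2" has depth 2; lines without numbering get depth 0.
        "numbering_depth": numbering.group(1).count(".") + 1 if numbering else 0,
        "left_indent": line.x0,
        "font_size": line.font_size,
    }
```

Dictionaries like these can be turned into a feature matrix and passed to any of the classifiers mentioned above.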
Then, engineers at Altoros identified the exact page to which each ToC entry referred. The extracted table of contents had a page number sequence, and the algorithms created by our experts detected the difference between the page number printed in a ToC and the actual page number in the PDF file.
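One simple way to estimate such a difference is to take a few ToC entries whose headings have been located in the file and use the most frequent offset. The sketch below assumes that approach. It is not a description of the algorithms built in the project, and the function names are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Tuple


def estimate_page_offset(matches: List[Tuple[int, int]]) -> Optional[int]:
    """Estimate the offset between printed and actual page numbers.

    Each match is (page number printed in the ToC, 1-based index of the
    PDF page where the entry's heading was found).
    """
    if not matches:
        return None
    offsets = Counter(pdf_page - printed for printed, pdf_page in matches)
    # The most frequent offset tolerates a few wrongly located headings.
    return offsets.most_common(1)[0][0]


def resolve_pdf_page(printed_page: int, offset: int) -> int:
    """Map a page number printed in the ToC to a PDF page index."""
    return printed_page + offset
```

For example, if a cover and a contents page precede page 1, most located headings yield an offset of 2, and an entry printed as page 58 resolves to PDF page 60.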
Finally, developers at Altoros implemented a searchable database to easily access and search through the information contained in PDF reports.
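The idea of a searchable store for extracted entries can be sketched with an in-memory SQLite table. SQLite is only a stand-in chosen to keep the example self-contained. The table layout, the function names, and the sample file names are hypothetical and do not describe the delivered database.

```python
import sqlite3


def build_index(entries):
    """Store (report, title, pdf_page) rows in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE toc_entries (report TEXT, title TEXT, pdf_page INTEGER)"
    )
    conn.executemany("INSERT INTO toc_entries VALUES (?, ?, ?)", entries)
    conn.commit()
    return conn


def search(conn, term):
    """Find ToC entries whose title contains the term."""
    # LIKE is case-insensitive for ASCII characters in SQLite by default.
    rows = conn.execute(
        "SELECT report, title, pdf_page FROM toc_entries "
        "WHERE title LIKE ? ORDER BY report, pdf_page",
        (f"%{term}%",),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = build_index([
        ("acme_2019.pdf", "Risk Factors", 14),
        ("acme_2019.pdf", "Financial Statements", 60),
        ("globex_2019.pdf", "Principal risks", 22),
    ])
    print(search(conn, "risk"))
```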
Partnering with Altoros, the customer automated the manual processing of financial reports, cutting the time spent on analyzing each document from 12 minutes to 10 seconds. With 99% precision, the delivered solution freed the customer's analysts to focus on more important business tasks.
TensorFlow, scikit-learn, XGBoost, Google BigQuery, Google Dataproc, Tesseract, pdfminer
Google Cloud Storage