Brief results of the collaboration:
- A provider of investment management services turned to Altoros to automate manual aggregation of financial reports.
- The company cut time spent on analyzing each document from 12 minutes to 10 seconds.
- Achieving 99% precision, the delivered solution enabled the customer to refocus its analyst team on more important business tasks.
The company is involved in investment management, helping organizations to allocate their financial assets to gain value. Headquartered in Boston, the customer has affiliates in London, Singapore, Tokyo, and Sydney. Operating globally, the company serves customers across 25 countries in Europe, Asia, the Middle East, North America, and Australia.
To find optimal investment opportunities, the company had been manually analyzing publicly available financial reports. Turning to Altoros, the customer wanted to automate the process of recognizing and extracting explicit tables of contents (ToCs) from reports in PDF format.
Under the project, the team at Altoros had to address the following issues:
- The entries in tables of contents varied greatly from company to company, so engineers at Altoros needed to normalize them for better recognition of the contents.
- In many cases, it was impossible to extract text from a PDF file directly. So, developers at Altoros needed to rely on optical character recognition (OCR), treating a PDF as an image and parsing text from it.
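The fallback from direct text extraction to OCR can be sketched as follows. This is a minimal illustration, not the delivered implementation: the function name, the `min_chars` threshold, and the two injected callables are assumptions. In practice, the callables could wrap a PDF text extractor and an OCR engine, respectively.

```python
from typing import Callable, Tuple


def extract_text_with_fallback(
    pdf_path: str,
    direct_extract: Callable[[str], str],
    ocr_extract: Callable[[str], str],
    min_chars: int = 50,  # hypothetical threshold for "the PDF has a usable text layer"
) -> Tuple[str, str]:
    """Return (text, method), where method is "direct" or "ocr"."""
    text = direct_extract(pdf_path)
    # If the embedded text layer yields enough characters, use it as is.
    if len(text.strip()) >= min_chars:
        return text, "direct"
    # Otherwise treat the pages as images and run OCR on them.
    return ocr_extract(pdf_path), "ocr"


# Usage with stand-in extractors (real ones would read the file).
text, method = extract_text_with_fallback(
    "report.pdf",
    direct_extract=lambda path: "",            # simulates a scanned PDF with no text layer
    ocr_extract=lambda path: "Contents 1 Overview 3",
)
print(method)
```

Keeping both extractors behind plain callables makes the decision logic easy to test without any PDF or OCR dependency.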
At the preprocessing stage, our engineers parsed PDF files into symbols to recover text in a human-readable format and extracted geometrical and formatting features of text lines, such as fonts and coordinates. Using a classifier trained with scikit-learn, TensorFlow, and XGBoost, experts at Altoros were able to extract pages containing tables of contents.
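To illustrate the idea of recovering text lines from individual symbols, here is a minimal sketch. The `Symbol` and `TextLine` structures, the `y_tolerance` value, and the grouping rule are assumptions made for the example. Detecting word gaps to insert spaces is omitted for brevity.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Symbol:
    char: str
    x: float      # left edge of the glyph
    y: float      # baseline; in PDF coordinates y grows upward
    font: str
    size: float


@dataclass
class TextLine:
    text: str
    x0: float     # left edge of the line
    y: float
    font: str     # most frequent font in the line
    size: float   # largest font size in the line


def _build_line(symbols: List[Symbol]) -> TextLine:
    ordered = sorted(symbols, key=lambda s: s.x)  # restore reading order
    font = Counter(s.font for s in ordered).most_common(1)[0][0]
    return TextLine(
        text="".join(s.char for s in ordered),
        x0=ordered[0].x,
        y=ordered[0].y,
        font=font,
        size=max(s.size for s in ordered),
    )


def group_into_lines(symbols: List[Symbol], y_tolerance: float = 2.0) -> List[TextLine]:
    """Group symbols whose baselines lie within y_tolerance into text lines."""
    lines: List[TextLine] = []
    current: List[Symbol] = []
    # Sort top to bottom (descending y), then left to right.
    for s in sorted(symbols, key=lambda s: (-s.y, s.x)):
        if current and abs(current[0].y - s.y) > y_tolerance:
            lines.append(_build_line(current))
            current = []
        current.append(s)
    if current:
        lines.append(_build_line(current))
    return lines
```

Each resulting `TextLine` carries the coordinates and font information that a downstream classifier could use as input features.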
Our team also built another classifier to extract ToCs from files’ metadata, which was present in 10% of the documents.
To detect a table of contents in a file, developers at Altoros trained a classifier on a subset of documents annotated with bounding boxes that label the areas containing tables of contents. While parsing, our team employed different features based on the styles of ToCs. All potentially relevant features were calculated and stored for each text line.
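The kind of per-line features that may help distinguish ToC entries from body text can be sketched as follows. The feature names and regular expressions here are illustrative assumptions, not the actual feature set used in the project.

```python
import re
from typing import Dict

_TRAILING_PAGE = re.compile(r"(\d{1,4})\s*$")                      # "... 12"
_DOT_LEADER = re.compile(r"(\.\s?){4,}")                           # ". . . ." or "....."
_NUMBERING = re.compile(r"^\s*(\d+(\.\d+)*\.?|[IVXLC]+\.)\s+")     # "1.", "2.3", "IV."


def line_features(
    text: str,
    x0: float,
    font_size: float,
    page_width: float,
    is_bold: bool = False,
) -> Dict[str, float]:
    """Compute simple numeric features for one text line."""
    stripped = text.strip()
    page_match = _TRAILING_PAGE.search(stripped)
    return {
        "ends_with_number": 1.0 if page_match else 0.0,
        "trailing_number": float(page_match.group(1)) if page_match else -1.0,
        "has_dot_leader": 1.0 if _DOT_LEADER.search(stripped) else 0.0,
        "starts_with_numbering": 1.0 if _NUMBERING.match(stripped) else 0.0,
        # Indentation relative to page width, so it is comparable across page sizes.
        "relative_indent": x0 / page_width if page_width > 0 else 0.0,
        "font_size": font_size,
        "is_bold": 1.0 if is_bold else 0.0,
        "char_count": float(len(stripped)),
    }


# Usage: a typical ToC entry versus a body-text sentence.
toc_line = line_features("1. Introduction ........ 3", x0=72.0, font_size=11.0, page_width=612.0)
body_line = line_features("The company reported growth in all regions.", 72.0, 11.0, 612.0)
```

Feature dictionaries like these can be stacked into a matrix and passed to a gradient-boosted or other tabular classifier.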
Then, engineers at Altoros identified the exact page to which each ToC entry referred. The extracted table of contents had a page number sequence, and the algorithms created by our experts detected the difference between the page number printed in the ToC and the actual page number in the PDF file.
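One simple way to detect such a difference is offset voting, sketched below. This is an assumed approach for illustration only: every page where an entry title occurs casts a vote for the offset between the physical page and the printed page, and the most frequent offset wins. The function names are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Tuple


def _normalize(text: str) -> str:
    # Lowercase and collapse whitespace so titles match across line breaks.
    return " ".join(text.lower().split())


def estimate_page_offset(
    toc_entries: List[Tuple[str, int]],   # (title, page number printed in the ToC)
    page_texts: List[str],                # page_texts[i] is the text of physical page i + 1
) -> Optional[int]:
    """Return the most common (physical page - printed page) offset, or None."""
    votes: Counter = Counter()
    pages = [_normalize(t) for t in page_texts]
    for title, printed_page in toc_entries:
        needle = _normalize(title)
        if not needle:
            continue
        for physical_page, text in enumerate(pages, start=1):
            if needle in text:
                votes[physical_page - printed_page] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Usage: printed page 1 is physical page 3, so the offset is 2.
pages = [
    "Annual Report",
    "Contents Overview 1 Financials 3 Outlook 4",
    "Overview The company operates globally.",
    "Details continue.",
    "Financials Revenue grew.",
    "Outlook We expect stable demand.",
]
entries = [("Overview", 1), ("Financials", 3), ("Outlook", 4)]
offset = estimate_page_offset(entries, pages)
print(offset)  # 2
```

The ToC page itself also contains every title, but it votes for a different offset per entry, so the consistent true offset still collects the most votes.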
Finally, developers at Altoros implemented a searchable database to easily access and search through the information contained in PDF reports.
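As a rough illustration of what a searchable store of extracted sections might look like, here is a sketch that uses SQLite from the Python standard library as a stand-in. The schema, the function names, and the choice of SQLite are assumptions for the example and do not describe the delivered database.

```python
import sqlite3
from typing import Iterable, List, Tuple

Section = Tuple[str, str, int, str]  # (report, section title, physical page, body text)


def build_index(sections: Iterable[Section]) -> sqlite3.Connection:
    """Create an in-memory table holding one row per extracted section."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sections (report TEXT, title TEXT, page INTEGER, body TEXT)"
    )
    conn.executemany("INSERT INTO sections VALUES (?, ?, ?, ?)", sections)
    conn.commit()
    return conn


def search(conn: sqlite3.Connection, term: str) -> List[Tuple[str, str, int]]:
    """Find sections whose title or body contains the term (ASCII case-insensitive)."""
    pattern = f"%{term}%"
    rows = conn.execute(
        "SELECT report, title, page FROM sections "
        "WHERE title LIKE ? OR body LIKE ? ORDER BY report, page",
        (pattern, pattern),
    )
    return rows.fetchall()


# Usage with made-up sample rows.
conn = build_index([
    ("acme_2019.pdf", "Overview", 3, "The company operates globally."),
    ("acme_2019.pdf", "Financials", 5, "Revenue grew year over year."),
])
hits = search(conn, "revenue")
```

A substring match with `LIKE` is enough for a small demo; a production system would more likely rely on a full-text index or a warehouse query engine.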
Partnering with Altoros, the customer automated manual processing of financial reports, cutting the time spent analyzing each document from 12 minutes to 10 seconds. Achieving 99% precision, the delivered solution enabled the customer to refocus its analyst team on more important business tasks.
TensorFlow, scikit-learn, XGBoost, Google BigQuery, Google Dataproc, Tesseract, pdfminer
Google Cloud Storage