Portfolio

Highly Scalable System for DNA Analysis

The customer turned to Altoros to improve its biotechnology system, which analyzes DNA samples to detect mutations at an early stage. Working with Altoros, the customer gained:

  • A scalable system that can process 10,000+ DNA samples at a time (10x more than the legacy tool)
  • Analysis time cut from hours to minutes
  • A reference architecture for a reporting solution based on open-source technologies, saving thousands of dollars on costly BI licenses

The customer's legacy tool could de-duplicate at most 1,000 samples because of memory and CPU limitations, and the pipeline still took hours (or even days) to run. The goal was to fix the performance bottlenecks and to enable linear scalability for processing 10,000+ bio samples at a time.

Our Java engineers helped install and configure Cloudera CDH 5.2 for distributed data storage and computation, and enabled cluster monitoring and profiling with Cloudera Manager. Our developers then created a mini-framework, based on MapReduce jobs, with custom partitioners that distribute data efficiently between parallel tasks. The team also built a converter that transforms binary variant files (samples) into the Hadoop SequenceFile format for storage in HDFS. In addition, we designed a reference architecture for an improved reporting solution integrated with Apache Spark. Our experts suggested using Spark SQL to preserve the existing SQL-based structure of the reporting module and to make it easy to change data sources if needed.
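To illustrate why partitioning matters for de-duplication, here is a minimal plain-Java sketch with no Hadoop dependency. It is not the customer's code: the class name, method names, and string sample keys are hypothetical, and the partition arithmetic simply mirrors the sign-masked hash-modulo scheme of Hadoop's default HashPartitioner. The point it shows is that when identical keys are always routed to the same partition, each partition can be de-duplicated independently and in parallel.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical illustration only: names and key format are assumptions.
class SampleDedupSketch {

    // Mask the sign bit, then take the modulus, so the result is always
    // in [0, numPartitions) and equal keys always map to the same partition.
    static int partitionFor(String sampleKey, int numPartitions) {
        return (sampleKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Stand-in for the shuffle step: route every key to exactly one bucket.
    static List<List<String>> partition(List<String> sampleKeys, int numPartitions) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String key : sampleKeys) {
            buckets.get(partitionFor(key, numPartitions)).add(key);
        }
        return buckets;
    }

    // Stand-in for the reduce step: de-duplicate each bucket on its own.
    // No cross-bucket check is needed because duplicates share a bucket.
    static List<String> deduplicate(List<String> sampleKeys, int numPartitions) {
        List<String> unique = new ArrayList<>();
        for (List<String> bucket : partition(sampleKeys, numPartitions)) {
            unique.addAll(new LinkedHashSet<>(bucket));
        }
        return unique;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList(
                "sample-A", "sample-B", "sample-A", "sample-C", "sample-B");
        // Prints the three distinct keys; their order depends on the hash values.
        System.out.println(deduplicate(keys, 4));
    }
}
```

In a real MapReduce job the buckets would be reducer inputs on different nodes, and a custom partitioner would replace `partitionFor` with logic suited to the data, for example to avoid skew when some keys are far more frequent than others.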

Altoros delivered a highly scalable analytical system for de-duplicating genome samples as part of the customer’s analytical platform. Thousands of hospitals and laboratories worldwide use the system to detect DNA mutations, saving thousands of lives. The analysis now takes minutes rather than hours, and the system processes 10x more genome samples than the legacy one.

Technology Stack

Server Platform

Linux

Technologies

Apache Hadoop (Cloudera CDH 5.2.1), MapReduce, Apache Spark (Spark SQL), Bash

Programming Language

Java, Perl

Database

HDFS


© 2001 – 2018 Altoros