Overview

One of the most important tasks in scientific research is identifying the informative variables in the phenomenon under study.

Imagine a medical condition that likely has a genetic background which has not yet been identified. Finding the genetic defects linked to this disease is the first step towards early diagnosis, the development of new drugs, or dietary changes that may remove symptoms.

The challenge is that the number of possible causes is enormous: millions of genetic variations could influence the disease in question. Most of these variations are benign, and nearly all of them are unrelated to the disease.

Direct links between a disease and genetic variations are established using well-known statistical methods. In most cases the analysis measures the influence of one variable at a time, assuming that the effects of different variables are mutually independent. This approach is well understood and works very well when the disease can indeed be attributed to independent effects of individual variables. However, this is not necessarily the case.

One can easily imagine a situation in which metabolising some otherwise benign food produces a potentially harmful chemical compound. A single mutation that decreases the efficiency of processing this metabolite may have a negligible effect on health, since alternative processing pathways may exist. However, a simultaneous mutation in the alternative pathway may result in the accumulation of the harmful metabolite and consequently lead to the disease. Both mutations must be present for the medical condition to appear. If both mutations are rare, carriers of either one are mostly healthy, so standard univariate testing may fail to discover them.
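The two-mutation scenario can be sketched numerically. The example below is a minimal illustration, not the project's method: the cohort counts are invented, and mutual information (in bits) is used as an assumed stand-in for whatever test statistic one prefers. It compares how much each mutation alone tells us about the disease with how much the pair tells us jointly.

```python
from math import log2

# Hypothetical cohort of 1000 people; all counts are invented for illustration.
# Key: (mutation_a, mutation_b), value: (sick, healthy).
COHORT = {
    (0, 0): (40, 770),
    (1, 0): (5, 85),
    (0, 1): (5, 85),
    (1, 1): (10, 0),   # disease appears reliably only when both mutations occur
}

def entropy(counts):
    """Shannon entropy (bits) of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def information_gain(cohort, key):
    """Mutual information between disease status and the grouping key(a, b)."""
    outcome = [0, 0]   # [sick, healthy] over the whole cohort
    groups = {}        # [sick, healthy] within each group
    for (a, b), (sick, healthy) in cohort.items():
        outcome[0] += sick
        outcome[1] += healthy
        group = groups.setdefault(key(a, b), [0, 0])
        group[0] += sick
        group[1] += healthy
    total = sum(outcome)
    conditional = sum(sum(g) / total * entropy(g) for g in groups.values())
    return entropy(outcome) - conditional

gain_a = information_gain(COHORT, lambda a, b: a)        # mutation A alone
gain_b = information_gain(COHORT, lambda a, b: b)        # mutation B alone
gain_ab = information_gain(COHORT, lambda a, b: (a, b))  # both considered jointly

print(f"A alone:   {gain_a:.4f} bits")
print(f"B alone:   {gain_b:.4f} bits")
print(f"A and B:   {gain_ab:.4f} bits")
```

With these made-up counts, the joint gain exceeds the sum of the two single-variable gains, which is the kind of signal a one-variable-at-a-time analysis can miss.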

The hypothetical situation described here is not limited to medicine. Similar phenomena may occur in any setting where multiple variables describe the system under study, from the composition of the microbiome in the human gut, through the analysis of failures in complex climate models, to the analysis of satellite images in search of pest invasions.

The goal of the project was to develop tools for the multidimensional analysis of datasets described by thousands or even millions of variables.

We have developed a new methodological approach to the analysis of such phenomena. The computer programs that implement these algorithms perform all required computations very efficiently, executing even billions of tests per second. The algorithms are implemented using portable tools and can take advantage of GPU accelerators. The library can be compiled and run on Windows, Linux, the major BSDs, Solaris and OS X. GPU acceleration is limited to platforms with CUDA support.

The tools are publicly available via this Internet service and as a library for R, a popular environment for statistical computing.

Why register?

                         Unregistered users   Registered users
Maximum number of files  20                   50
Maximum number of tasks  20                   150
Files storage time       2 days               30 days
Results storage time     3 days               90 days

This project was funded by the Polish National Science Centre, grant DEC-2013/09/B/ST6/01550.