A few months ago, we received a phone call from a bioinformatics group at a European university. The problem they were having appeared very simple. They wanted to know how to usemmap()
to be able to load a large data set into RAM at once. OK I thought, no problem, I can handle that one. Turns out this has grown into a complex and interesting exercise in profiling and threading.
The background is that they are performing Markov-Chain Monte Carlo simulations by sampling at random from data sets containing SNP (pronounced "snips") genetic markers for a selection of people. It boils down to a large 2D matrix of floats where each column corresponds to an SNP and each row to a person. They provided some small and medium sized data sets for me to test with, but their full data set consists of 500,000 people with 38 million SNP genetic markers!
The analysis involves selecting a column (SNP) at random in the data set and then performing some computations on the data for all of the individuals and collecting some summary statistics. Do that for all of the columns in the data set, and then repeat for a large number of iterations. This allows you to approximate the underlying true distribution from the discreet data that has been collected.
That's the 10,000 ft view of the problem, so what was actually involved? Well we undertook a bit of an adventure and learned some interesting stuff along the way, hence this blog series.
The stages we went through were:
- Preprocessing
- Loading the Data
- Fine-grained Threading
- Preprocessing Reprise
- Coarse Threading
In this blog, I’ll detail stages 1 and 2. The rest of the process will be revealed as the blog series unfolds, and I’ll include a final summary at the end.
1. Preprocessing
The first thing we noticed when looking at the code they already had is that there is quite some work being done when reading in the data for each column. They do some summary statistics on the column, then scale and bias all the data points in that column such that the mean is zero. Bearing in mind that each column will be processed many times, (typically 10k - 1 million), this is wasteful to repeat every time the column is used.
So, reusing some general advice from 3D graphics, we moved this work further up the pipeline to a preprocessing step. The SNP data is actually stored in a compressed form which takes the form of quantizing 4 SNP values into a few bytes which we then decompress when loading. So the preprocessing step does the decompression of SNP data, calculates the summary statistics, adjusts the data and then writes the floats out to disk in the form of a ppbed
file (preprocessed bed where bed is a standard format used for this kind of data).
The upside is that we avoid all of this work on every iteration of the Monte Carlo simulation at runtime. The downside is that 1 float per SNP per person adds up to a hell of a lot of data for the larger data sets! In fact, for the full data set it's just shy of 69 TB of floating point data! But to get things going, we were just worrying about smaller subsets. We will return to this later.
2. Loading the data
Even on moderately sized data sets, loading the entirety of the data set into physical RAM at once is a no-go as it will soon exhaust even the beefiest of machines. They have some 40 core, many-many-GB-of-RAM machine which was still being exhausted. This is where the original enquiry was aimed - how to use mmap()
. Turns out it's pretty easy as you'd expect. It's just a case of setting the correct flags so that the kernel doesn't actually take a copy of the data in the file. Namely, PROT_READ
and MAP_SHARED
:
When dealing with such large amounts of data, be careful of overflows in temporaries! We had a bug where ppBedSize
was overflowing and later causing a segfault.
So, at this point we have a float *ppBed
pointing at the start of the huge 2D matrix of floats. That's all well and good but not very convenient for working with. The code base already made use of Eigen for vector and matrix operations so it would be nice if we could interface with the underlying data using that.
Turns out we can (otherwise I wouldn't have mentioned it). Eigen provides VectorXf
and MatrixXf
types for vectors and matrices but these own the underlying data. Luckily Eigen also provides a wrapper around these in the form of Map
. Given our pointer to the raw float data which is mmap()
'd, we can use the placement new operator to wrap it up for Eigen like so:
1 Comment
27 - May - 2019
Rolf Eike Beer
I would have done things slightly different at a few points.
First, your casting is inconsistent:
Since you already had problems with overflows you may agree that finding casts easily is a good thing, which makes searching for "_cast<" handy. So this boils down to:
Yes, the latter one can also be a static_cast, as casting from void* is permitted.
Next is, given that you opened the file with O_RDONLY and mapped it with PROT_READ I would expect ppBedMap to be "const float *" for all the good reasons that "const" has.
And then there is ppBedFd, which I suspect you don't need anymore. Once you have mmap()'d a file you can close the file descriptor unless you need it for other fancy things