The Qatar Genome Programme (QGP) is an initiative that aims to use the latest DNA sequencing technology to establish a genome map of the local population. It uses a collection of samples and data from Qatar Biobank participants to identify genotype-phenotype associations relevant to the Qatari population. This will provide unique insights that will enable the development of personalized healthcare in Qatar.
The QCRI Bioinformatics group is part of the QGP consortium and is using whole genome sequencing to reveal the germline landscape of cancer-susceptibility gene variation in the Qatari population.
Availability and requirements for GIGI-Quick:
Bash 4.3 or newer is required to use the -q option (wait -n was added in 4.3). The rest will work with older Bash versions, though we have not tested how much older.
If you need to compile the binaries: a C++ compiler, and CMake if you want to use the included compilation script.
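Since the -q option depends on wait -n, you can check whether your Bash meets the 4.3 minimum before relying on queued mode. This is a minimal sketch of such a check; it simply mirrors the version requirement stated above:

```shell
# Check the running Bash version and whether it meets the 4.3 minimum
# needed for 'wait -n' (and hence the -q option).
new_enough=$(bash -c 'if (( BASH_VERSINFO[0] > 4 || (BASH_VERSINFO[0] == 4 && BASH_VERSINFO[1] >= 3) )); then echo yes; else echo no; fi')
echo "Bash version: $(bash -c 'echo $BASH_VERSION')"
echo "Supports wait -n (needed for -q): $new_enough"
```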
Run the following command to clone the repository with git (git is a version control program started by Linus Torvalds; download: https://git-scm.com/downloads):
git clone https://cse-git.qcri.org/Imputation/GIGI-Quick.git
Go to this url: https://cse-git.qcri.org/Imputation/GIGI-Quick/tree/master
Click on the icon with the download arrow above the column “Last Update” on the right hand side.
There are several download options with different compressions of the same files. If you obtain GIGI-Quick this way, you will need to decompress the archive before proceeding.
Once you have the files, most users won’t need to do anything else to use GIGI. There are executables compiled on Ubuntu 64 bit Linux for 64 bit and 32 bit (via multilib) x86 systems.
GIGI-Quick will automatically choose which of these to run. We recommend using these unless your system has a different architecture (e.g. PowerPC, ARM). When GIGI-Quick runs, if there are locally compiled versions of the binaries then it will use those; it checks for them in the following locations: ./GIGI/GIGI, ./MERGE/gigimerge, ./SPLIT/gigisplit.
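You can check which locally compiled binaries would be picked up with a quick loop over those same paths (run from the GIGI-Quick top-level folder; this is just a convenience check, not part of run_GIGI):

```shell
# Report which of the locally compiled binaries run_GIGI would find.
# Paths are the ones listed above, relative to the GIGI-Quick folder.
for b in ./GIGI/GIGI ./MERGE/gigimerge ./SPLIT/gigisplit; do
    if [ -x "$b" ]; then
        echo "local build found: $b"
    else
        echo "no local build at $b (bundled binary will be used)"
    fi
done
```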
We use CMake to create make files for the architecture being compiled on; to use that method you will need a reasonably recent CMake installed. This approach should be compiler and architecture agnostic. To do this, one need only run the included make.sh script:
./make.sh
This should create the make files and then compile all three binaries, writing a log file to ./make.log. If the CMake method is not working on your system, you can compile directly with your compiler; we give an example with g++ from GNU GCC:
g++ -O2 GIGISplit.cpp -o gigisplit
g++ -O2 GIGIMerge.cpp -o gigimerge
g++ -O2 GIGI.cpp -o ../../GIGI
The folder structure of GIGI-Quick should not be rearranged: GIGI-Quick depends on relative paths to locate the included scripts and executables other than run_GIGI.
If you like, you can now add GIGI-Quick to your path (the examples below assume that you have). You can do this by adding the following to your .bashrc (located in your home folder), replacing the placeholder path with the location of your GIGI-Quick folder:
export PATH="$PATH:/path/to/GIGI-Quick"
Then source your .bashrc to apply the changes right away:
source ~/.bashrc
To add run_GIGI to the path system-wide for all users you can create a symlink in /usr/bin pointing to the run_GIGI script:
ln -s /path/to/run_GIGI/script /usr/bin/run_GIGI
Note: The parameter file is the same as you would use for GIGI normally; if you are using the long format, then pass the “-l” option.
The examples shown below use the file “param-v1_06.txt” because it is included in the repository, so each example line can be run by simply cutting and pasting it.
run_GIGI parameter_file -o [OUTPUT FOLDER] -n [RUN NAME] -t [THREADS] -m [MEMORY IN MB] [-l] [-v] -q [THREADS] -r [START] [END] [-V] [-h]
-o [OUTPUT FOLDER] : This is the path to use for the outputs from the run_GIGI scripts, including temporary files.
-n [RUN NAME] : This is a path relative to the [OUTPUT FOLDER] to use to keep the outputs from more than one run of run_GIGI separated.
-t [THREADS] : The number of threads to use for run_GIGI, and also the number of chunks to split the input into.
-m [MEMORY IN MB] : The amount of RAM that run_GIGI will restrict its use to (not yet implemented).
-l : Specifies that the input is in the long format.
-V : Verbose mode. Output from run_GIGI is now much quieter by default; with -V you can see much more of what it is doing and what variables are set to at various stages.
-v : Display the version of GIGI-Quick and exit.
-h : Display this help text.
-r [START] [END] : Run on only a selected region, starting at [START] and ending at [END]; this region will be selected before any further splitting.
-q [THREADS] : Run in queued mode. This mode will run up to THREADS instances of GIGI at a time and will attempt to keep the total amount of memory in use below
              [MEMORY IN MB], using an estimate of the amount of memory GIGI may need. If -m [MEMORY IN MB] was not given, it will use the amount of memory available
              as shown by ‘free’. Older kernels do not report this figure, in which case we fall back to an estimate (amount free + amount of buff/cache) that is no longer accurate on modern systems:
              https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=34e431b0ae398fc54ea69ff85ec700722c9da773 Also, -t is ignored when -q is given.
-e [MEMORY IN MB] : Manual estimate of how much memory GIGI will need for queued mode in case the calculated estimate is too inaccurate
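To illustrate the fallback described for -q, here is one way to read the “available” figure that newer kernels expose through free. This is a sketch of the idea only, not run_GIGI's actual code, and it assumes a Linux system with procps installed:

```shell
# Read the "available" column (in MB) from free, as reported by
# kernels 3.14+ -- roughly the figure queued mode falls back to
# when -m is not given.
avail_mb=$(free -m | awk '/^Mem:/ {print $7}')
echo "memory available: ${avail_mb} MB"
```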
./run_GIGI ./INPUTS/Sample_Input/param-v1_06.txt #Output in the current folder with no run name identifying subfolder, threads and memory determined automatically
./run_GIGI ./INPUTS/Sample_Input/param-v1_06.txt -o ./OUTPUTS -n test_run #Output in ./OUTPUTS/test_run
./run_GIGI ./INPUTS/Sample_Input/param-v1_06.txt -o ./OUTPUTS -n test_run -V #Output in ./OUTPUTS/test_run, verbose mode (print more detailed information)
./run_GIGI ./INPUTS/Sample_Input/param-v1_06.txt -o ./OUTPUTS -n test_run -l #Output in ./OUTPUTS/test_run for a parameter file in the long format; do not cut and paste this one because the included param-v1_06.txt is NOT in the long format
./run_GIGI ./INPUTS/Sample_Input/param-v1_06.txt -t 2 #Limit to only 2 threads (and hence two chunks)
./run_GIGI ./INPUTS/Sample_Input/param-v1_06.txt -m 1000 #Limit memory use to 1 GB; please read the section on memory and cgroups
./run_GIGI ./INPUTS/Sample_Input/param-v1_06.txt -lmt 1000 2 #Limit memory use to 1 GB and threads to 2, with input in the long format; please read the section on memory and cgroups, and do not cut and paste this one because the included param-v1_06.txt is NOT in the long format
./run_GIGI INPUTS/Sample_Input/param-v1_06.txt -o RUN_FOLDER/ -n test_run -m 20 -q 3 -V -r 3 70 #Output in ./RUN_FOLDER/test_run, limit memory to 20 MB, use queued mode with up to 3 threads at a time, and run on only the region from 3 to 70; note: the memory estimated as needed in queued mode does not account for the restricted region
If there is a problem that makes GIGI stop before completion, then the output files are left as they are in order to allow users to rerun only failed portions as needed.
If you are unsure where the failure occurred, the safest approach is to remove the intermediate files before rerunning (e.g. rm -R [OUTPUT FOLDER]/[RUN NAME]); as always, use rm with caution.
e.g. if the 2nd example failed, I would “rm -R ./OUTPUTS/test_run” before rerunning.
The -n option is largely redundant, as it is equivalent to using the -o option with a longer path giving the subfolder, e.g.
./run_GIGI ./INPUTS/Sample_Input/param-v1_06.txt -o ./OUTPUTS -n test_run
is equivalent to:
./run_GIGI ./INPUTS/Sample_Input/param-v1_06.txt -o ./OUTPUTS/test_run
The inclusion of -n is mostly a semantic convenience.
With the addition of -V and the cleanup of output, you may notice that even with -V you no longer see the output of split, gigi, and merge directly. These are now written to their own individual log files in the output directory/run subdirectory.
e.g. ./run_GIGI ./INPUTS/Sample_Input/param-v1_06.txt -o ./OUTPUTS -n test_run will have logs in ./OUTPUTS/test_run/LOGS
We handle memory restrictions using cgroups. After looking at a number of different memory limiting mechanisms, we saw this as the best solution; unfortunately it has some caveats. One is that root/sudo access is required to create the initial cgroup. If you are on a shared machine, we encourage you to discuss this with your system administrator if you intend to use the cgroups. For most shared clusters, we encourage you to use the built-in memory limiting mechanisms of your submission system (e.g. qsub, SLURM, Torque) instead of limiting through run_GIGI; most of these themselves make use of cgroups (e.g. https://slurm.schedmd.com/cgroups.html and HTCondor http://help.uis.cam.ac.uk/supporting-research/research-support/camgrid/camgrid/technical3/cgroups).
If you are using this on your own system where you have root/sudo access, then you will need to make sure that cgroups are set up and that your distribution's equivalent of the libcgroup library is installed, for the cgcreate and cgexec commands.
If you have a very old (e.g. maybe 7+ years old) kernel, then you may need to install a newer kernel that has cgroups (cgroups are technically part of the Linux kernel itself).
Here is a list of common distributions and links to help/documentation on cgroups
Arch: https://wiki.archlinux.org/index.php/cgroups (note that libcgroup is an AUR package; to install such packages see https://wiki.archlinux.org/index.php/Arch_User_Repository)
Once you have a functional cgcreate command to create cgroups, you can make them permanent (unfortunately using a different syntax) by editing /etc/cgconfig.conf on Linux distributions using systemd (most of them).
Already covered in many of the other links above
If your distro isn’t covered, it is still worth looking at the above guides; most things will be similar in your distro, though details may not be exactly the same (e.g. package names, package manager, etc.).
Here is some distribution agnostic information on cgroups: http://man7.org/linux/man-pages/man7/cgroups.7.html
cgroups will eventually be replaced with cgroups2, but most of their controllers are not yet functional: https://www.kernel.org/doc/Documentation/cgroup-v2.txt
Technically you can create the cgroup(s) we need with mount and mkdir commands, but our code depends on cgcreate and cgexec. You could of course create cgcreate and cgexec scripts and add them to your path instead of using the programs from cgroup-tools, but we would not recommend that route.
Essentially, the goal here is to get a user writable cgroup setup that run_GIGI (running as your user) can make use of to create its own subcgroup.
On Ubuntu in BASH you can do this as follows:
First we install cgroup-tools to get cgcreate and cgexec, etc…
sudo apt-get install cgroup-tools
Then we create a cgroup that your user has access to
sudo cgcreate -a $USER -g memory,cpu:user_cgroup
We can see that it was created by checking the contents of /sys/fs/cgroup/memory and/or /sys/fs/cgroup/cpu; both should now contain a user_cgroup folder whose contents your user has write permission to:
ls -la /sys/fs/cgroup/memory/user_cgroup
ls -la /sys/fs/cgroup/cpu/user_cgroup
When run normally as your user with -m, run_GIGI will make its own sub-cgroup of this cgroup (do not run run_GIGI with sudo).
These are not persistent cgroups (that is, they will disappear on reboot).
To make persistent ones, please see the distribution documentation above; for most distributions this involves editing the configuration file /etc/cgconfig.conf.
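As an illustration, a persistent user-writable cgroup like the one created above might look like this in /etc/cgconfig.conf (syntax per cgconfig.conf(5); “youruser” is a placeholder for your username, and details can vary between libcgroup versions):

```
group user_cgroup {
    perm {
        admin {
            uid = youruser;    # user allowed to administer the cgroup
        }
        task {
            uid = youruser;    # user allowed to add tasks to the cgroup
        }
    }
    memory { }
    cpu { }
}
```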
We may soon also add the ability to control swap usage through the cgroups for run_GIGI (https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-memory.html). Some distributions need a kernel parameter set at boot to allow this (Debian, Ubuntu, Arch, ??).
See the issue and solution here: http://matthewkwilliams.com/index.php/2016/03/17/docker-cgroups-memory-constraints-and-java-cautionary-tale/
The same method in a shorter read here: https://unix.stackexchange.com/questions/147158/how-to-enable-swap-accounting-for-memory-cgroup-in-archlinux
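For GRUB, the fix described in those two links amounts to adding the swap-accounting parameter (swapaccount=1 in those guides; verify the name for your distribution) to the kernel command line in /etc/default/grub:

```
# /etc/default/grub (excerpt)
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash swapaccount=1"
```

After editing, regenerate the GRUB configuration (e.g. sudo update-grub on Ubuntu) and reboot for the parameter to take effect.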
If you have a different bootloader, adding that same option to your boot command should work but you’ll need to consult the documentation for your bootloader to see how to do this.
Be careful when editing this boot line, mistakes may cause your machine to fail to boot linux. This will not harm your data but you may need to manually fix or reinstall your bootloader.
Useful resource for that situation: https://help.ubuntu.com/community/Grub2/Troubleshooting#Editing_the_GRUB_2_Menu_During_Boot
One could also edit the boot line during boot like that to test it without making it permanent, thereby avoiding any GRUB issues more serious than a single failed boot.
If you have messed up your GRUB and can’t figure out how to get it back, the most reliable method I have used to restore GRUB is to reinstall it via chroot: https://help.ubuntu.com/community/Grub2/Installing#via_ChRoot
Going forward, part of this may become easier for Ubuntu users: from 14.04 onwards there should be a user-writable cgroup by default. It is created by systemd automatically, though I’m not sure how consistent the location is: https://help.ubuntu.com/lts/serverguide/cgroups-delegation.html
I think the best way to make use of this may be through cgmanager; we will explore this possibility.