# katcher

## Basic Information

- **Repository Path**: quanrd_admin/katcher
- **License**: GPL-3.0
- **Default Branch**: master
- **Created**: 2023-11-20

# Overview

The katcher program aims to extract reads containing any of a set of *k*-mers from a BAM file. The program is written in the C language and makes use of multithreading to improve execution speed. It uses the [htslib](https://github.com/samtools/htslib) library for reading/writing BAM files and its implementation of [khash](https://github.com/samtools/htslib/blob/develop/htslib/khash.h) for efficient hash tables.

# Usage

## katcher

The main program provided in this repository is `katcher`. You can get an overview of command-line parameters by running `katcher --help`:

    Usage: katcher -i -k
    -i, --input: Input file in bam format (required)
    -k, --kmers: List of k-mers to look for in the reads (required)
    -o, --output: Name of the output file in bam format (default: stdout)
    -t, --threads: Number of analysis threads used in addition to decompression thread (default: 1)
    -b, --bufsize: Number of records stored in the buffers (default: 100,000)
    -m, --maxlength: Maximum read length allowed, used for memory allocation (default: 300)
    -h, --help: Print this help message

The list of *k*-mers should be a text file with one *k*-mer per line. You do not need to include the reverse-complemented sequence of the *k*-mers that you are looking for. `katcher` will automatically add them to the list of queried *k*-mers.

`katcher` will append a `KM` tag to every read it extracts indicating which of the query *k*-mers were found.
If there is more than one query *k*-mer in the read, the value of the `KM` tag will be a list of comma-separated *k*-mers.

Because of the way multithreading is currently implemented, the `--bufsize` parameter must be a multiple of the number of threads specified by `--threads`. `katcher` will complain accordingly if this requirement is not met.

## Other software

The `katcher` program was developed to process the output of the [*k*-mer GWAS](https://github.com/voichek/kmersGWAS) program developed by [Voichek and Weigel (2020)](https://doi.org/10.1038/s41588-020-0612-7), but it can be used for any application requiring the efficient querying of *k*-mers within mapped reads. However, three other binaries were developed as part of the work on `katcher` with a more specific focus on the analysis of the output of the *k*-mer GWAS approach.

### extract_qname

The `extract_qname` program included in this repository can be used to retrieve all the reads in a BAM file that have the QNAME of any read in a subset of interest. This can be used, for example, to retrieve all read pairs in which at least one of the two reads contains a *k*-mer of interest. It was developed with the idea of using those reads for local assembly. The usage for this program is:

    Usage: extract_qname >

The output is written to stdout.

### add_pvalues

The `add_pvalues` program can be used to add p-values corresponding to significant *k*-mers to a BAM file that has already been processed with `katcher`. The usage for this program is:

    Usage: add_pvalues >

The output is written to stdout.

The p-value file must be a table of *k*-mers and their p-values as output by the *k*-mer GWAS approach, such as the `test/kmer_pvalues.txt` file included in this repository. The p-values will be added as a `PV` tag of comma-separated values, listed in the same order as the *k*-mers in the matching `KM` tag.
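
To illustrate how the `KM` and `PV` tags line up, here is a minimal Python sketch of a downstream script pairing each *k*-mer with its p-value by position. The function name `pair_kmers_with_pvalues` and the example tag values are hypothetical and not part of this repository. The sketch also assumes that each `PV` entry parses as a plain or scientific-notation number, which is not specified above.

```python
def pair_kmers_with_pvalues(km_value, pv_value):
    """Pair each k-mer of a KM tag value with the p-value at the same
    position in the PV tag value (both are comma-separated strings)."""
    kmers = km_value.split(",")
    # Assumption: every PV entry can be parsed as a float.
    pvalues = [float(p) for p in pv_value.split(",")]
    if len(kmers) != len(pvalues):
        raise ValueError("KM and PV tags have different numbers of entries")
    return list(zip(kmers, pvalues))


# Hypothetical tag values for a read containing two 31-mers of interest.
km = "ACGTACGTACGTACGTACGTACGTACGTACG,TTGCATTGCATTGCATTGCATTGCATTGCAT"
pv = "1.5e-09,3.2e-08"

for kmer, pvalue in pair_kmers_with_pvalues(km, pv):
    print(kmer, pvalue)
```

Raising an error when the two lists differ in length is a design choice of this sketch. It makes a read processed by `katcher` but carrying an inconsistent `PV` tag visible instead of silently mispairing values.
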
### list_kmers

Finally, the `list_kmers` program can be used to obtain a list of *k*-mers to use as input to `katcher` from a presence/absence table of *k*-mers and the name of a sample of interest. This can be useful to avoid looking for *k*-mers that are known beforehand not to be present in a given sample. The usage for this program is:

    Usage: list_kmers >

An example of a PAV table file (`test/significant_pav_table.txt`) is included in the repository. Such a PAV table can be obtained using the `filter_kmers` program included in the *k*-mer GWAS suite of tools.

# Installation

## Prerequisites

You will need a Linux system with the [htslib](https://github.com/samtools/htslib) and pthread libraries installed in order to compile and run `katcher` and associated binaries. Users will find that `katcher` runs faster if htslib is compiled with libdeflate support, although this is not necessary. pthread should be installed by default on most Linux systems.

## Download and compilation

The code can be downloaded by running `git clone https://github.com/malemay/katcher`. Once this is done, the following commands should be sufficient to compile the programs:

    cd katcher
    make

The four compiled binaries will be located in the `bin/` directory. You can add them to your `$PATH` or call them directly from that location.

IMPORTANT: the *k*-mer length is a constant that is set at compile time. By default, it is 31. If you want to change it, you need to change the value of the `KMER_LENGTH` constant in `include/functions.h` and re-compile the binaries.

## Testing

You can type `make test` to test that the compiled binaries work properly. This will use sample data files in the `test` directory. The commands that show up when running `make test` can be used as examples of how to run these binaries.

# Implementation

The development of `katcher` was motivated by the necessity to query large (>20 Gb) BAM files for tens of thousands of *k*-mers and hundreds of samples.
To tackle this problem, `katcher` builds a hash table of the *k*-mers to look for. It then scans each read and looks up every sub-sequence of length *k* in the hash table. For this reason, the queried *k*-mers must all have the same length, which is fixed at compile time.

`katcher` implements multithreading by using a decompression thread that continuously reads the input BAM file in blocks while analysis threads (specified by the `--threads` option) look for *k*-mers of interest within the reads. Because the buffer filled by the decompression thread is split evenly between the analysis threads, the value of the `--bufsize` option must be a multiple of the value of the `--threads` option. This implementation is not optimal, but should be sufficient for most use cases. Please open an issue in the issues section of this repository if you believe that the multithreading implementation should be improved.

# Citation

The reference to the katcher software will be uploaded shortly.

# References

Voichek, Y. and Weigel, D., 2020. Identifying genetic variants underlying phenotypic variation in plants without complete genomes. *Nature Genetics*, 52(5), pp. 534-540. https://doi.org/10.1038/s41588-020-0612-7