Distribution of undetermined indices in Illumina HiSeq experiments

Posted: March 23rd, 2015 | Filed under: Sequencing

One often wants to look at the distribution of undetermined indices in an Illumina HiSeq experiment to spot problems with the experiment's sample sheet or with the library itself. These indices are stored in the fastq file(s) in the Undetermined_indices folder. The Bash script below processes these fastq files (gzip'ed and/or uncompressed) and prints the distribution of indices to stdout as comma-separated values. If the script is named "index_stats", call it from the Undetermined_indices folder as:

   index_stats > lane2_undetermined.csv

Recent versions of GNU sort support parallel sorting. The script below runs GNU sort with 8 threads. If unpigz is present, the script also runs it with 8 threads to uncompress fastq.gz files; otherwise, it falls back to the single-threaded gunzip command.
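
The fallback from unpigz to gunzip can also be sketched as a pair of small functions. This is only an illustration, not part of the script below: the function names pick_decompressor and fastq_gz_headers are made up for this example, and unlike the script it streams the data with -c, so the .gz file is left untouched.

```shell
#!/bin/bash

# Print the decompressor command to use: unpigz with the requested
# number of threads if it is installed, otherwise plain gunzip.
# (Hypothetical helper, not part of the original script.)
pick_decompressor() {
    local threads=${1:-8}
    if command -v unpigz > /dev/null 2>&1; then
        echo "unpigz -p $threads"
    elif command -v gunzip > /dev/null 2>&1; then
        echo "gunzip"
    else
        return 1
    fi
}

# Write the header lines of a gzip'ed fastq file to stdout.
# Headers are every 4th line starting with line 1 (GNU sed syntax).
# -c sends the uncompressed data to stdout and keeps the .gz file.
fastq_gz_headers() {
    local ungz
    ungz=$(pick_decompressor 8) || {
        echo "Cannot find unpigz or gunzip" >&2
        return 2
    }
    # $ungz is intentionally unquoted so "unpigz -p 8" splits into words
    $ungz -c "$1" | sed -n '1~4p'
}
```

With these definitions, fastq_gz_headers some_file.fastq.gz (the file name is a placeholder) prints one header line per read.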

#!/bin/bash

## This tool extracts indexes from Illumina's fastq files.
## It is intended to process files in Undetermined_indices folder.
## Karol M. - Apr 2014

## This version does not take any argument - it processes
## all .fastq and .fastq.gz files in current directory.

## GNU parallel sort running with 8 threads
SORT="/usr/local/bin/sort --parallel=8"

## Do we find any .fastq or fastq.gz files?
##
ls -1 *fastq > /dev/null 2>&1
if [ "$?" != "0" ]; then
    ls -1 *fastq.gz > /dev/null 2>&1
    if [ "$?" != "0" ]; then
        echo " No fastq files found in current directory or they are not readable."
        echo " Exiting"
        exit 1
    fi
fi

## Do we have unpigz or gunzip? (needed only if there are fastq.gz files)
##
ls -1 *fastq.gz > /dev/null 2>&1
if [ "$?" = "0" ]; then
    UNGZ=`which unpigz 2> /dev/null`
    if [ ! "$?" = 0 ] ; then
        UNGZ=`which gunzip 2> /dev/null`
        if [ ! "$?" = 0 ] ; then
            echo " Cannot find unpigz or gunzip commands that"
            echo " are needed to process fastq.gz files"
            echo " Exiting"
            exit 2
        fi
    fi
fi
## If unpigz was found, run it with 8 threads
echo "$UNGZ" | grep unpigz > /dev/null
if [ "$?" = 0 ] ; then
    UNGZ="$UNGZ -p 8"
fi

## Process fastq and fastq.gz files
##

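## Header lines are every 4th line, starting with line 1 (GNU sed syntax)
ls -1 *fastq > /dev/null 2>&1
if [ "$?" = "0" ]; then
    for f in *.fastq ;
    do
        sed -n '1~4p' "$f" >> .indices_$$
    done
fi
ls -1 *fastq.gz > /dev/null 2>&1
if [ "$?" = "0" ]; then
    for fz in *.fastq.gz ;
    do
        ## Uncompress in place (foo.fastq.gz becomes foo.fastq).
        ## $UNGZ already carries "-p 8" for unpigz; gunzip has no -p option.
        $UNGZ "$fz"
        fnz=${fz%.*}
        sed -n '1~4p' "$fnz" >> .indices_$$
    done
fi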

## total number of indices
TOT=`wc -l .indices_$$ | awk '{print $1}'`

## sorting and counting
$SORT -i --field-separator=: -k 10 .indices_$$ | sed 's/.*://' | /usr/local/bin/uniq -c | $SORT -b -n -r -k 1 > .indices_counted_$$

## create the output - CSV format
awk -v tot="$TOT" '{printf "%s,%d occurrences,%6.4f per cent of all\n",$2,$1,($1/tot)*100}' .indices_counted_$$

## clean temporary files
rm .indices_$$ .indices_counted_$$
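
To see what the extract, count and format steps do, here is a minimal sketch that runs the same pipeline on a tiny made-up fastq file. The three reads, the instrument name and the index sequences are invented for illustration, and the function name index_distribution is hypothetical. It uses whichever sort and uniq are first in PATH rather than the ones in /usr/local/bin, and it skips the parallel options.

```shell
#!/bin/bash

# Print the index distribution of one uncompressed fastq file as CSV.
# (Hypothetical helper that mirrors the steps of the script above.)
index_distribution() {
    local tot
    # Total number of reads = number of header lines
    tot=$(sed -n '1~4p' "$1" | wc -l)
    # Take header lines, keep only the text after the last colon
    # (the index), count identical indices, most frequent first.
    sed -n '1~4p' "$1" | sed 's/.*://' | sort | uniq -c | sort -b -n -r -k 1 |
        awk -v tot="$tot" '{printf "%s,%d occurrences,%6.4f per cent of all\n", $2, $1, ($1/tot)*100}'
}

# Made-up sample: three reads, two of them with the same index
sample=$(mktemp)
cat > "$sample" <<'EOF'
@INSTR:1:FLOWCELL:2:1101:1000:2000 1:N:0:ACGTAC
NNNN
+
####
@INSTR:1:FLOWCELL:2:1101:1001:2001 1:N:0:ACGTAC
NNNN
+
####
@INSTR:1:FLOWCELL:2:1101:1002:2002 1:N:0:TTGCAA
NNNN
+
####
EOF

result=$(index_distribution "$sample")
echo "$result"
# prints:
# ACGTAC,2 occurrences,66.6667 per cent of all
# TTGCAA,1 occurrences,33.3333 per cent of all
rm -f "$sample"
```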

