Distribution of undetermined indices in Illumina HiSeq experiments

Posted: March 23rd, 2015 | Author: karol | Filed under: Sequencing

One often wants to look at the distribution of undetermined indices in Illumina HiSeq experiments to spot problems with the experiment’s sample sheet or with the library itself. These indices are stored in the read headers of the fastq file(s) in the Undetermined_indices folder. Below is a Bash script that processes these fastq files (gzip-compressed, uncompressed, or a mix of both) and prints the distribution of indices to stdout as comma-separated values. If the script is named “index_stats”, one calls it from the Undetermined_indices folder as:

   index_stats > lane2_undetermined.csv

Recent versions of the GNU sort program (coreutils 8.6 and later) support parallel sorting. The script as presented below runs GNU sort with 8 threads. It also uses unpigz (invoked with -p 8) to uncompress fastq.gz files if unpigz is present; otherwise, it falls back to the single-threaded gunzip command.
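Not every installed sort accepts the --parallel option, so it can be worth probing for it before relying on it. The following is a minimal sketch, not part of the original script; the helper name pick_sort is hypothetical, and it assumes that a sort without parallel support rejects the unknown option with a non-zero exit status.

```shell
#!/bin/bash
# Hypothetical helper (not in the original script): print a sort command
# line, adding --parallel=N only if the local sort accepts that option.
pick_sort() {
    local threads="${1:-8}"
    # Probe with empty input; an unsupported option makes sort exit non-zero.
    if sort --parallel="$threads" < /dev/null > /dev/null 2>&1; then
        echo "sort --parallel=$threads"
    else
        echo "sort"
    fi
}

# Usage: build the command once, then use it like the SORT variable.
SORT=$(pick_sort 8)
printf 'b\na\nc\n' | $SORT
```

The probe costs one extra process start, which is negligible next to sorting millions of read headers.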

   #!/bin/bash
   ## This tool extracts indices from Illumina's fastq files.
   ## It is intended to process files in the Undetermined_indices folder.
   ## Karol M. - Apr 2014
   ## This version does not take any arguments - it processes
   ## all .fastq and .fastq.gz files in the current directory.

   ## GNU sort running with 8 threads (the path is site-specific; adjust as needed)
   SORT="/usr/local/bin/sort --parallel=8"

   ## Do we find any .fastq or .fastq.gz files?
   if ! ls -1 *.fastq > /dev/null 2>&1 && ! ls -1 *.fastq.gz > /dev/null 2>&1; then
       echo " No fastq files found in current directory or they are not readable."
       echo " Exiting"
       exit 1
   fi

   ## Do we have unpigz or gunzip? (needed only if there are fastq.gz files)
   if ls -1 *.fastq.gz > /dev/null 2>&1; then
       if UNGZ=$(command -v unpigz 2> /dev/null); then
           ## unpigz found: run it with -p 8 (added once, here only)
           UNGZ="$UNGZ -p 8"
       elif ! UNGZ=$(command -v gunzip 2> /dev/null); then
           echo " Cannot find unpigz and gunzip commands that"
           echo " are needed to process fastq.gz files"
           echo " Exiting"
           exit 2
       fi
   fi

   ## Process fastq and fastq.gz files: the index sits in the read header,
   ## i.e. every 4th line starting from line 1 (GNU sed address syntax)
   if ls -1 *.fastq > /dev/null 2>&1; then
       for f in *.fastq ; do
           sed -n '1~4p' "$f" >> .indices_$$
       done
   fi
   if ls -1 *.fastq.gz > /dev/null 2>&1; then
       for fz in *.fastq.gz ; do
           ## -c streams to stdout, so the .gz file itself is left untouched
           $UNGZ -c "$fz" | sed -n '1~4p' >> .indices_$$
       done
   fi

   ## total number of indices
   TOT=$(wc -l < .indices_$$)

   ## sorting and counting (the index is the 10th colon-separated field)
   $SORT -i --field-separator=: -k 10 .indices_$$ | sed 's/.*://' |
       /usr/local/bin/uniq -c | $SORT -b -n -r -k 1 > .indices_counted_$$

   ## create the output - CSV format
   awk -v tot="$TOT" '{printf "%s,%d occurrences,%6.4f per cent of all\n", $2, $1, ($1/tot)*100}' .indices_counted_$$

   ## clean temporary files
   rm -f .indices_$$ .indices_counted_$$
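To see what the extraction and counting steps do, it may help to run them on a few records. The sketch below uses made-up reads (the instrument name, flowcell and index sequences are invented for illustration) and assumes a header layout in which the index is the last colon-separated field, as the script's sed 's/.*://' step implies. It swaps the GNU-only sed -n '1~4p' for the equivalent awk 'NR % 4 == 1' so that it also runs where GNU sed is not available; the function names demo_fastq and index_counts are hypothetical.

```shell
#!/bin/bash
# Made-up FASTQ records for illustration only; real headers will differ.
demo_fastq() {
    cat <<'EOF'
@HWI-EXAMPLE:1:FCX:2:1101:1000:2000 1:N:0:ACGTAC
NNNN
+
####
@HWI-EXAMPLE:1:FCX:2:1101:1001:2001 1:N:0:TTGCAA
NNNN
+
####
@HWI-EXAMPLE:1:FCX:2:1101:1002:2002 1:N:0:ACGTAC
NNNN
+
####
EOF
}

# Count indices read from stdin: keep every 4th line starting at line 1
# (the headers), strip everything up to the last colon, then tally the
# indices and rank them by count, highest first.
index_counts() {
    awk 'NR % 4 == 1' | sed 's/.*://' | sort | uniq -c | sort -b -n -r -k 1 |
        awk '{printf "%s,%d\n", $2, $1}'
}

demo_fastq | index_counts
```

On these three records the pipeline prints ACGTAC,2 followed by TTGCAA,1: the index seen twice ranks first.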
