PGAP

Description

pgap website

The NCBI Prokaryotic Genome Annotation Pipeline (PGAP) is designed to annotate bacterial and archaeal genomes (chromosomes and plasmids). Genome annotation is a multi-level process that includes prediction of protein-coding genes as well as other functional genome units such as structural RNAs, tRNAs, small RNAs, and pseudogenes. NCBI has developed an automatic prokaryotic genome annotation pipeline that combines ab initio gene prediction algorithms with homology-based methods. The first version of the pipeline was developed in 2001, and it is regularly upgraded to improve structural and functional annotation quality (Li W, O'Neill KR et al 2021). Recent improvements include the use of curated protein profile hidden Markov models (HMMs) and curated complex domain architectures for functional annotation of proteins, and annotation of Enzyme Commission numbers and Gene Ontology terms.

Environment Modules

Run module spider pgap to find out what environment modules are available for this application.
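
The discovery step above can be sketched as follows. This is a minimal example that assumes an Lmod-style module command is defined in the current shell; the guard and the module_found variable are illustrative additions, not part of the module itself.

```shell
#!/bin/bash
# Assumes an Lmod-style `module` command; the guard covers shells without it.
if type module >/dev/null 2>&1; then
    module spider pgap    # list every pgap version installed on the cluster
    module load pgap      # load the default version
    module list           # confirm pgap now appears among the loaded modules
    module_found="yes"
else
    echo "The 'module' command is not defined in this shell." >&2
    module_found="no"
fi
```

To pin a specific release instead of the default, name the version explicitly, for example module load pgap/20220414.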

Environment Variables

  • HPC_PGAP_DIR - installation directory
  • HPC_PGAP_BIN - executable directory
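
Once the module is loaded, a quick check along these lines can confirm that both variables are defined. It is only a sketch: describe_var is a hypothetical helper name introduced for this example, and it relies on the variables being exported to the environment.

```shell
#!/bin/bash
# describe_var is a hypothetical helper: it reports whether the named
# environment variable is set and, if so, prints its value.
describe_var() {
    name="$1"
    # printenv prints nothing and fails when the variable is not exported
    value=$(printenv "$name" || true)
    if [ -n "$value" ]; then
        echo "${name}=${value}"
    else
        echo "${name} is not set (is the pgap module loaded?)"
    fi
}

describe_var HPC_PGAP_DIR   # installation directory
describe_var HPC_PGAP_BIN   # executable directory
```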

Additional Usage Information

The PGAP module provides a wrapper function, pgap.py, that fine-tunes usage on HPG. Because of this, run the command as pgap.py ... without a path. See the Job Script Examples section below for an example.

Job Script Examples

Script used to test application installation

#!/bin/bash
#SBATCH --job-name=pgap_20220414_test
#SBATCH --mail-type=NONE
#SBATCH --cpus-per-task=28
#SBATCH --mem-per-cpu=4gb
#SBATCH --time=24:00:00
#SBATCH --output=pgap_20220414_test.log

echo "Setting up test environment..."
TEST_PWD=/data/apps/tests/pgap/20220414
TEST_DATADIR=${TEST_PWD}/example_data
TEST_WORKDIR=${TEST_PWD}/test_output

cd "${TEST_PWD}" || exit 1
module load pgap/20220414

# Remove any previous test results (the glob matches every directory whose
# name starts with ${TEST_WORKDIR})
if [ -d "${TEST_WORKDIR}" ]; then rm -rf "${TEST_WORKDIR}"*/; fi

echo "Starting test run at $(date) on $(hostname)..."

###############################################################################

pgap.py \
    --cpus ${SLURM_CPUS_ON_NODE:-4} \
    -o ${TEST_WORKDIR} \
    ${HPC_PGAP_DIR}/test_genomes/MG37/input.yaml

###############################################################################

# There should be some files in the work directory
echo "There should be some results listed below:"
find ${TEST_WORKDIR} -type f ! -empty -ls

echo "Test complete at $(date)."
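
The final check in the script above only lists files. Where a pass/fail signal is more useful, the same find test can be wrapped in a small counter, as in the sketch below. The helper name count_results and the default ./test_output path are placeholders for illustration; in the test script the directory would be ${TEST_WORKDIR}.

```shell
#!/bin/bash
# count_results is a hypothetical helper: it prints how many non-empty
# regular files exist under the directory given as its first argument.
count_results() {
    find "$1" -type f ! -empty 2>/dev/null | wc -l | tr -d ' '
}

# Placeholder path; in the test script this would be ${TEST_WORKDIR}.
RESULTS_DIR="${1:-./test_output}"

n_files=$(count_results "${RESULTS_DIR}")
if [ "${n_files}" -gt 0 ]; then
    echo "Found ${n_files} non-empty result file(s) in ${RESULTS_DIR}."
else
    echo "No results found in ${RESULTS_DIR}." >&2
    # exit 1   # uncomment inside a job script to mark the job as failed
fi
```

A missing directory is treated the same as an empty one, so the check reports zero results rather than aborting.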

Citation

If you publish research that uses pgap, please cite NCBI:

https://github.com/ncbi/pgap

Categories

biology, genomics, annotation, pipeline