Program Options for the Pattern Matching Program seedtop

 

Tao Tao, Ph.D.

NCBI User Service

July 22, 2005

 

1. Introduction

 

seedtop is one of the programs found in the NCBI standalone blast package. It is a little-known program, whose main function is to search for patterns in an input sequence or database.

 

seedtop has four modes of usage, which are referred to as “subprograms”: two that only search for patterns in an input query or database, and two that perform pattern-initiated sequence alignment. The following table lists these subprograms, their functions, and their required inputs.

 

Table 1.1 Functions of the individual seedtop subprograms

Program call (1)  Function                                     Required inputs
-p patmatch       Search for patterns in an input sequence     Pattern (-k) and sequence (-i)
-p pattern        Search for patterns in an input database     Pattern (-k) and database (-d)
-p patseed        Search for patterns in the query and align   Pattern (-k), input sequence (-i),
                  the query against a database                 and target database (-d)
-p seed           Search for a specific pattern in the query   Same as for patseed (2)
                  and align the query against a database

Notes:

1. The program strings listed are for nucleotide searches. For protein searches, append a lowercase p to the program name (for example, patmatchp).

2. The pattern file needs an extra line beginning with HI, which specifies the position in the input sequence at which the pattern occurrence of interest starts.

 

2. Setup

 

Installation of the standalone blast archive is fairly easy. Once the archive is placed in a desired directory and extracted, the whole package will be installed in a newly created subdirectory called blast-2.2.11 (assuming 2.2.11 release here). All the programs, including seedtop, will be in the blast-2.2.11/bin/ subdirectory (blast-2.2.11\bin\ for PC).

 

Appropriate setup requires creating a .ncbirc configuration file, which the blast programs (including seedtop) read upon startup to locate the files they need.

In this .ncbirc, we can specify the location of the DATA directory and the BLASTDB directory using the following lines:

 

[NCBI]

DATA=/path/data

 

[BLAST]

BLASTDB=/path/db

 

The [NCBI] section is used by most of the NCBI programs to locate the data directory and retrieve the specific files they need (a matrix file, for example). The [BLAST] section specifies the path to the directory where the databases are stored.

 

The db directory does not come with the NCBI setup, so it must be created after installation. If it is placed somewhere other than the location given in .ncbirc, the BLASTDB path must be changed accordingly. For simplicity, we suggest creating it under the blast-2.2.11 directory, at the same level as the data directory.
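
The setup described above can be sketched in Python. This is only an illustration, not part of the BLAST package: the function name write_ncbirc and its arguments are hypothetical, and because this document does not state where seedtop looks for .ncbirc, the caller passes the target path explicitly.

```python
import configparser
from pathlib import Path


def write_ncbirc(blast_root, rc_path):
    """Create the db directory and write a minimal .ncbirc file.

    blast_root is the extracted package directory (for example blast-2.2.11);
    rc_path is where the .ncbirc file should be written (an assumption left
    to the caller).
    """
    blast_root = Path(blast_root)
    data_dir = blast_root / "data"  # shipped with the archive
    db_dir = blast_root / "db"      # not shipped, so create it here
    db_dir.mkdir(parents=True, exist_ok=True)

    config = configparser.ConfigParser()
    config.optionxform = str  # keep DATA and BLASTDB in upper case
    config["NCBI"] = {"DATA": str(data_dir)}
    config["BLAST"] = {"BLASTDB": str(db_dir)}

    with open(rc_path, "w") as handle:
        # write KEY=value without spaces, as in the listing above
        config.write(handle, space_around_delimiters=False)
    return db_dir
```

Calling write_ncbirc("blast-2.2.11", ".ncbirc") would place the db directory next to data, as suggested above, and write both sections in one step.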

 

3. Program Options

 

Once the standalone BLAST package is set up, we can “cd” to the directory and issue the “seedtop -” command to display the options for this program. Here we list each option in a table and describe its function, argument value, and example usage.

 

Table 3.1

Option        -d
Function      Specifies the target database to search
Default       nr
Input format  Database formatted by formatdb; use the name without extension
Example       To search against est_human, use: -d est_human
Note          This option is not mandatory: searching for patterns in a single
              input sequence does not require it.

 

Table 3.2

Option        -i
Function      Specifies the input query file
Default       stdin
Input format  [File In], file name with extension
Example       To take my_pept.txt as the input query, use: -i my_pept.txt
Note          To use stdin, either redirect or pipe the input:
              seedtop -k pat -p patmatchp < input_file
              more input_file | seedtop -k pat -p patmatchp

 

Table 3.3

Option        -k
Function      Specifies the input pattern file (Hit File)
Default       hit_file
Input format  Complete file name with extension
Example       If the pattern file is named my_pat.txt, use: -k my_pat.txt
Note          See section 4.1 below for details.

 

Table 3.4

Option        -o
Function      Specifies the output file name
Default       stdout
Input format  File name with or without extension
Example       To save the result in my_output, use: -o my_output
Note          Redirection or piping can be used instead.

 

Table 3.5

Option        -G
Function      Specifies the cost to open a gap
Default       11
Input format  [Integer]
Example       To change this to 12, use: -G 12
Note          The choice of the -M option determines the values available for
              this option and for the -E option. Only a selected set is
              supported; the detailed list is in the blastall document.

 

Table 3.6

Option        -E
Function      Specifies the cost to extend a gap
Default       1
Input format  [Integer]
Example       To change this to 2, use: -E 2
Note          See Table 3.5 for more information.

 

Table 3.6

Option        -D
Function      Specifies the cost to decline alignment
Default       99999
Input format  [Integer]
Example       N/A
Note          Functions similarly to the -L option in blastpgp. If enabled, it
              would implement Dr. Altschul’s 3-parameter gap model for scoring.

 

Table 3.7

Option        -X
Function      Specifies the X dropoff value for gapped alignment (in bits)
Default       15
Input format  [Integer]
Example       To increase this dropoff value to 20, use: -X 20
Note          Increasing this value may produce a longer alignment.

 

Table 3.8

Option        -S
Function      Specifies the cutoff cost
Default       30
Input format  [Integer]
Example       N/A
Note          Currently it is overridden in pseed3.c. It could allow the user
              to control the score threshold applied to the part of the
              alignment that does not include the pattern when deciding which
              alignment(s) to report.

 

Table 3.9

Option        -C
Function      Score only or not
Default       1
Input format  [Integer]
Example       N/A
Note          Relevant only to searches with -p seed(p) or -p patseed(p).
              Not implemented yet.

 

Table 3.10

Option        -I
Function      Whether to show GIs in deflines
Default       F
Input format  [T/F]
Example       To display GIs in the deflines, use: -I T
Note          Relevant only to searches with -p seed(p) or -p patseed(p).

 

Table 3.11

Option        -e
Function      Specifies the expectation value (E) cutoff
Default       10.0
Input format  [Real]
Example       To set this to 0.001, use: -e 0.001 or -e 1e-3
Note          Relevant only to searches with -p seed(p) or -p patseed(p).

    

Table 3.12

Option        -J
Function      Whether to believe the query defline
Default       F
Input format  [T/F]
Example       To set this to true, use: -J T
Note          Saving the SeqAlign object requires -J T.

 

Table 3.13

Option        -O
Function      Specifies the output file for the SeqAlign object
Default       Optional
Input format  [File Out]
Example       N/A
Note          Relevant only to searches with -p seed(p) or -p patseed(p).
              Not implemented yet.

 

Table 3.14

Option        -M
Function      Specifies which matrix file to use
Default       BLOSUM62
Input format  [String]
Example       To set the matrix to PAM30, use: -M PAM30
Note          Relevant to seedp/patseedp searches. Only a limited set of
              matrices is supported.

 

 

 

Table 3.15

Option        -p
Function      Specifies which subprogram to run
Default       patmatchp
Input format  [String]
Example       To find protein patterns in a database, use: -p patternp
Note          The four choices for nucleotide searches are:
              patmatch, pattern, seed, and patseed
              The four choices for protein searches are:
              patmatchp, patternp, seedp, and patseedp

 

Table 3.16

Option        -r
Function      Specifies the reward for a match
Default       10
Input format  [Integer]
Example       To increase the reward to 20, use: -r 20
Note          Relevant only to nucleotide searches with -p seed or -p patseed.

 

Table 3.17

Option        -q
Function      Specifies the cost for a mismatch
Default       -10
Input format  [Integer]
Example       To increase the penalty to -15, use: -q -15
Note          For nucleotide searches with seed/patseed only.

 

Table 3.18

Option        -F
Function      Whether to filter the query sequence with SEG
Default       F
Input format  [T/F]
Example       To activate the filter, use: -F T
Note          Relevant to seedp/patseedp searches only.

 

4. Execution and Practical Usage

 

The most useful functions of seedtop are patmatchp and patternp. Because patseedp and seedp do not generate the actual alignment and their function is already incorporated in blastpgp, they are described only briefly. Nucleotide searches work much like protein searches, so they are covered only in a short note at the end of this section.

 

4.1 Pattern specification

 

The pattern input file format is unique to seedtop. Each pattern is specified by two lines: a line beginning with ID, which identifies the pattern, and a line beginning with PA, which gives the pattern itself. The pattern specification uses the ProSite syntax. Multiple patterns can be used, each separated from the next by a line containing only a /.

 

Single pattern specification:

 

ID Cyclic nucleotide-binding domain signature 2.
PA [LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV].

 

Multiple pattern specification:

 

ID Cyclic nucleotide-binding domain signature 2.

PA [LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV].

/

ID Cyclic nucleotide-binding domain signature 1

PA [LIVM]-[VIC]-x-{H}-G-[DENQTA]-x-[GAC]-{L}-x-[LIVMFY](4)-x(2)-G
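
To make the file layout above concrete, here is a minimal Python sketch that splits pattern-file text into records. It is not seedtop's own reader; the function name parse_pattern_file is hypothetical, and the sketch assumes that consecutive PA lines of one pattern are simply concatenated.

```python
def parse_pattern_file(text):
    """Split pattern-file text into a list of {'ID': ..., 'PA': ...} records."""
    records = []
    current = {"ID": "", "PA": ""}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines carry no information
        if line == "/":
            # a line holding only "/" separates two patterns
            if current["PA"]:
                records.append(current)
            current = {"ID": "", "PA": ""}
        elif line.startswith("ID"):
            current["ID"] = line[2:].strip()
        elif line.startswith("PA"):
            # assumption: a long pattern continues on further PA lines
            current["PA"] += line[2:].strip()
    if current["PA"]:
        records.append(current)
    return records
```

Applied to the two-pattern example above, the function returns two records, one per ID/PA pair.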

 

Pattern lines should be less than 100 characters long. Longer patterns can be broken up into two or more PA lines. Multiple individual patterns should be separated by a line containing a single forward slash (/). The other general pattern rules can be summarized as follows:

 

Symbol  Meaning
[]      marks a single position; a match to any one residue in the brackets is acceptable
(x,y)   marks a range for the residue(s) before it; matches within the range are acceptable
(x,)    represents a range with no upper limit for the residue(s) before it
(x)     represents the exact number of matches for the residue(s) before it
{}      marks a single position; none of the residues in the braces may be present there
-       separates the individual positions in the pattern
.       at the end, marks the end of a pattern

symbol at the end marks an incomplete pattern (optional)
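
One way to get a feel for these rules is to translate a pattern into a regular expression. The Python sketch below does this with the standard re module; it illustrates the syntax only and is not how seedtop itself matches patterns. The function name prosite_to_regex is hypothetical, a lowercase x is taken to mean any residue (as in the ProSite syntax), and only the symbols listed above are handled.

```python
import re

# one pattern element: x, a residue, [..] or {..}, with an optional repeat
ELEMENT = re.compile(
    r"^(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(,)?(\d*)\))?$"
)


def prosite_to_regex(pattern):
    """Translate a ProSite-style pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")  # drop the optional final period
    parts = []
    for element in pattern.split("-"):
        match = ELEMENT.match(element)
        if match is None:
            raise ValueError("unsupported pattern element: " + element)
        base, low, comma, high = match.groups()
        if base == "x":
            base = "."                        # any residue
        elif base.startswith("{"):
            base = "[^" + base[1:-1] + "]"    # none of these residues
        if low is not None:
            # (n) -> {n}, (n,m) -> {n,m}, (n,) -> {n,}
            base += "{" + low + (comma or "") + (high or "") + "}"
        parts.append(base)
    return "".join(parts)
```

For example, re.search(prosite_to_regex(pa), sequence) returns a match object whose start() is the 0-based offset of the first occurrence, or None if the pattern does not occur.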

 

4.2 patmatchp

 

This function reads the patterns in an input pattern file and identifies their occurrences in an input protein sequence. The sample command line below (shown with its output) takes the pattern file pattern.txt, searches the query sequence in query.aa, and saves the output in a file named query.out.

 

seedtop -k pattern.txt -i query.aa -p patmatchp -o query.out

 

 

Name  Cyclic nucleotide-binding domain signature 2.

Pattern [LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV].

 At position 521 of query sequence

Name  Cyclic nucleotide-binding domain signature 1

Pattern [LIVM]-[VIC]-x-{H}-G-[DENQTA]-x-[GAC]-{L}-x-[LIVMFY](4)-x(2)-G

 At position 483 of query sequence

 

 

The result only lists the pattern starting positions in the query sequence.
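
If the positions are needed in a script, the output above is simple enough to parse. The following Python sketch assumes the output always has the Name / Pattern / "At position N of query sequence" layout shown in the sample; the function name parse_patmatch_output is hypothetical.

```python
import re

POSITION = re.compile(r"At position (\d+) of query sequence")


def parse_patmatch_output(text):
    """Return a list of (name, pattern, position) tuples from patmatchp output."""
    hits = []
    name = pattern = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Name"):
            name = line[len("Name"):].strip()
        elif line.startswith("Pattern"):
            pattern = line[len("Pattern"):].strip()
        else:
            found = POSITION.search(line)
            if found:
                # one tuple per reported occurrence of the current pattern
                hits.append((name, pattern, int(found.group(1))))
    return hits
```

With the sample output above, this yields one tuple for signature 2 at position 521 and one for signature 1 at position 483.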

 

4.3 patternp

 

This function matches the patterns in an input pattern file against a database and reports the database entries that contain one or more of the patterns, along with the pattern locations. The sample command line below (shown with partial output) takes the pattern file pattern.txt, searches the refseq_protein database, and saves the entries with pattern matches in the output file db.out.

 

 

seedtop -k pattern.txt -d refseq_protein -p patternp -o db.out

 

seqno=892602    gi|33859524|ref|NP_034048.1|

 

ID  Cyclic nucleotide-binding domain signature 1

PA  [LIVM]-[VIC]-x-{H}-G-[DENQTA]-x-[GAC]-{L}-x-[LIVMFY](4)-x(2)-G

HI (449 450) (452 454) (456 457) (459 462) (465 465)

seqno=892873    gi|51470807|ref|XP_290552.4|

 

ID  Cyclic nucleotide-binding domain signature 1

PA  [LIVM]-[VIC]-x-{H}-G-[DENQTA]-x-[GAC]-{L}-x-[LIVMFY](4)-x(2)-G

HI (374 375) (377 379) (381 382) (384 387) (390 390)

 

 

Each hit is described by the following lines:

- seqno: sequence number and seqid of the database sequence with pattern matches

- ID: pattern ID, repeated from the pattern input

- PA: pattern, repeated from the pattern input

- HI: hit positions on the database sequence, given as (start end) regions that are broken up at the x positions of the pattern
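
The hit records can be collected with a small parser. The Python sketch below assumes the seqno / ID / PA / HI layout of the sample output and emits one record per HI line; the function name parse_pattern_output and the dictionary keys are hypothetical.

```python
import re

PAIR = re.compile(r"\((\d+) (\d+)\)")


def parse_pattern_output(text):
    """Return one dictionary per HI line found in patternp output."""
    hits = []
    seqno = seqid = pattern_id = pattern = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("seqno="):
            fields = line.split()
            seqno = int(fields[0][len("seqno="):])
            seqid = fields[1] if len(fields) > 1 else ""
        elif line.startswith("ID"):
            pattern_id = line[2:].strip()
        elif line.startswith("PA"):
            pattern = line[2:].strip()
        elif line.startswith("HI"):
            # each "(start end)" pair is one region of the hit
            regions = [(int(a), int(b)) for a, b in PAIR.findall(line)]
            hits.append({"seqno": seqno, "seqid": seqid,
                         "ID": pattern_id, "PA": pattern, "HI": regions})
    return hits
```

In the first sample hit, the regions (449 450), (452 454), (456 457), (459 462), and (465 465) skip positions 451, 455, 458, and 463 to 464, which is where the pattern has x, x, x, and x(2).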

 

4.4 patseedp

 

This function takes three inputs: a pattern file, a query protein sequence containing the pattern, and a protein sequence database. It identifies the pattern in the query and aligns the query against the database entries that contain the same pattern. It reports the pattern position in the query, the total number of pattern occurrences in the database, and the database entries that contain the pattern and align to the query. For each such entry, it reports the seqid, the E-value and scores of its alignment with the query, and the pattern position.

 

seedtop -k pattern.txt -d refseq_protein -p patseedp -o pat_db.out -i query_aa.txt

 

 1 occurrence(s) of pattern in query

Name  Cyclic nucleotide-binding domain signature 2.

Pattern [LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV].

 At position 488 of query sequence

effective database length=3.1e+008

 pattern probability=3.4e-008

lengthXprobability=1.0e+001

 

Number of occurrences of pattern in the database is 265

892602  gi|33859524|ref|NP_034048.1|

0 Total Score 3279 Outside Pattern Score 3162 Match start in db seq 488

       Extent in query seq 1 631 Extent in db seq 1 631

 

1 occurrence(s) of pattern in query

Name  Cyclic nucleotide-binding domain signature 1

Pattern [LIVM]-[VIC]-x-{H}-G-[DENQTA]-x-[GAC]-{L}-x-[LIVMFY](4)-x(2)-G

 At position 450 of query sequence

effective database length=3.1e+008

 pattern probability=7.0e-008

lengthXprobability=2.2e+001

 

Number of occurrences of pattern in the database is 247

892602  gi|33859524|ref|NP_034048.1|

0 Total Score 3279 Outside Pattern Score 3188 Match start in db seq 450

       Extent in query seq 1 631 Extent in db seq 1 631

 

 

The input pattern.txt file contains two patterns, so the result contains two sections, one for each pattern. Only one database match is shown.
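
The three statistics lines in each section are related by simple arithmetic: as its label suggests, lengthXprobability is the effective database length multiplied by the pattern probability. The short Python check below reproduces this from the sample values. Reading the product as the number of pattern occurrences expected by chance is an assumption made here, not something stated in the output, and the function name is hypothetical.

```python
def length_times_probability(effective_db_length, pattern_probability):
    """Multiply the two values reported by patseedp."""
    return effective_db_length * pattern_probability


# (effective database length, pattern probability, reported product)
SAMPLE = [
    (3.1e8, 3.4e-8, 1.0e1),  # signature 2
    (3.1e8, 7.0e-8, 2.2e1),  # signature 1
]

for length, probability, reported in SAMPLE:
    product = length_times_probability(length, probability)
    # the reported figures are rounded, so agreement is only approximate
    print("%.1e x %.1e = %.2f (reported %.1e)"
          % (length, probability, product, reported))
```

The products agree with the reported values to within the rounding of the printed figures.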

 

4.5 seedp

 

This function is similar to patseedp. The only difference is that the pattern file must also specify, on an HI line, the position of the pattern in the input query sequence. This selects which pattern occurrence is used during the search. The output of patternp can be used for this purpose.

 

An actual command line and pattern file are listed below. The output is omitted since it is essentially the same as that from patseedp given above.

 

 

seedtop -p seedp -k pattern2a.txt -d refseq_protein -i q_aa.txt  -o seed.out

 

ID  Cyclic nucleotide-binding domain signature 1

PA  [LIVM]-[VIC]-x-{H}-G-[DENQTA]-x-[GAC]-{L}-x-[LIVMFY](4)-x(2)-G

HI (450 451) (453 455) (457 458) (460 463) (466 466)
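
For a pattern whose elements all have a fixed length, the HI line can be derived from the pattern and its start position in the query. The Python sketch below does this under an assumption inferred from the sample output only: each (start end) pair covers a run of consecutive non-x positions, and the x positions are skipped. The function name hi_line is hypothetical, and variable-length elements such as x(5,11) are rejected because their extent cannot be known without the sequence.

```python
import re

# fixed-length elements only: x, a residue, [..] or {..}, optionally (n)
ELEMENT = re.compile(r"^(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)\))?$")


def hi_line(pattern, start):
    """Build an HI line for a fixed-length pattern starting at `start`."""
    regions = []
    position = start
    run_start = None  # first position of the current non-x run
    for element in pattern.strip().rstrip(".").split("-"):
        match = ELEMENT.match(element)
        if match is None:
            raise ValueError("only fixed-length elements are supported: " + element)
        base, count = match.groups()
        length = int(count) if count else 1
        if base == "x":
            if run_start is not None:
                regions.append((run_start, position - 1))  # close the run
                run_start = None
        elif run_start is None:
            run_start = position
        position += length
    if run_start is not None:
        regions.append((run_start, position - 1))
    return "HI " + " ".join("(%d %d)" % region for region in regions)
```

For signature 1 and a start position of 450, this reproduces the HI line of the pattern file above.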

 

 

The seedp functionality has been incorporated into standalone blastpgp; on the BLAST web server, it is called Pattern-Hit-Initiated BLAST, or PHI-BLAST. Unlike seedtop, blastpgp does not report the total number of pattern occurrences in the database, but it does generate actual sequence alignments. The blastpgp implementation also provides more functionality, in that the results of the first round of a PHI-BLAST search can be used seamlessly as the start of an iterated PSI-BLAST search.

 

4.6 Nucleotide searches

 

Searching for nucleotide patterns with seedtop is very similar to what is described above for protein queries. The subprogram names for nucleotide searches have no terminating p.
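
The naming rule can be captured in a small helper that assembles a seedtop command line for either sequence type. This Python sketch uses only the options described in section 3 (-p, -k, -i, -d, -o); the function name seedtop_command and its parameters are hypothetical, and options left as None are omitted so that the program defaults apply.

```python
def seedtop_command(mode, pattern_file, protein=True, query=None,
                    database=None, output=None):
    """Return the argument list for a seedtop run.

    mode is one of patmatch, pattern, patseed, or seed; a trailing p is
    appended for protein searches and left off for nucleotide searches.
    """
    if mode not in ("patmatch", "pattern", "patseed", "seed"):
        raise ValueError("unknown subprogram: " + mode)
    program = mode + ("p" if protein else "")
    argv = ["seedtop", "-p", program, "-k", pattern_file]
    if query is not None:
        argv += ["-i", query]
    if database is not None:
        argv += ["-d", database]
    if output is not None:
        argv += ["-o", output]
    return argv
```

The returned list can be passed to subprocess.run(argv, check=True), assuming seedtop is on the search path; seedtop_command("patmatch", "pat.txt", protein=False, query="query.fa") gives the nucleotide counterpart of the patmatchp example in section 4.2 (the file names here are placeholders).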

 

5. Tech Support

 

For additional questions and comments, please write to:

            blast-help@ncbi.nlm.nih.gov