Entrez User's Guide
Release 10.0
April 15, 1994
Includes versions for Apple Macintosh and Microsoft Windows systems

============================================================

Entrez: Sequences
    GenBank, April 15, 1994 (Release 82.0)
    EMBL, March 30, 1994 (Release 37.0 plus updates)
    DDBJ, March 30, 1994
    SWISS-PROT, February, 1994 (Release 28.0)
    PIR, September 30, 1993 (Release 38.0)
    PDB, October, 1993
    PRF, November, 1993
    dbEST, March 30, 1994 (Release 2.2)
    U.S. and European Patents

Entrez: References
    MEDLINE, Molecular Sequence Data Subset, March, 1994

GenBank Direct Submissions

Please note that direct submissions of sequence data to GenBank should be sent to NCBI. The address for e-mail submissions is: gb-sub@ncbi.nlm.nih.gov

NCBI encourages the submission of data through the Authorin software program for Macintosh or PC computers. Authorin is now provided on the Entrez discs. If you have any questions about the sequence submission process, you may contact us by phone at (301) 402-1301 or by e-mail at authorin@ncbi.nlm.nih.gov.

National Center for Biotechnology Information
---------------------------------------------
National Library of Medicine
National Institutes of Health, Bldg. 38A
8600 Rockville Pike
Bethesda, MD 20894

Public Domain Notice and Copyright Restrictions

The code for the Entrez retrieval software is a "United States Government Work" under the terms of the United States Copyright Act. It was written as part of the author's official duties as a United States Government employee and thus cannot be copyrighted. This software is freely available to the public for use. The National Library of Medicine (NLM) and the U.S. Government have not placed any restriction on its use or reproduction. Although all reasonable efforts have been taken to ensure the accuracy and reliability of the software and data, the NLM and the U.S. Government do not and cannot warrant the performance or results that may be obtained by using this software or data.
The NLM and the U.S. Government disclaim all warranties, express or implied, including warranties of performance, merchantability or fitness for any particular purpose. Some material in MEDLINE is from copyrighted publications of the respective copyright claimants. Users of MEDLINE are solely responsible for compliance with any copyright restrictions and are referred to the publication data appearing in the bibliographic citations, as well as to the copyright notices appearing in the original publications, all of which are hereby incorporated by reference. National Library of Medicine Cataloging in Publication Data Entrez. Sequences [computer file] / NCBI. -- Release 1.0 (Oct. 15, 1992) - . -- Computer data and programs. -- Bethesda, MD : National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health. ; [Pittsburgh, Pa. : For sale by Supt. of Docs., U.S. G.P.O.], 1992- computer laser optical disks ; 4 3/4 in. + guides. -- (GenInfo compact library series) System requirements for use with Windows: PC based on Intel 286, 386, or 486 micro-processor or compatible; 2 MB; MS-DOS 3.1 or later, Microsoft Windows 3.1 or later; hard disk drive, CD-ROM drive, graphics display, mouse. System requirements for use with Macintosh: Macintosh computer; 2 MB; foreign file access and ISO 9660 file access extensions; hard disk drive, CD-ROM drive. Bimonthly. Title from label. Packaged with: Entrez. References, Feb. 15, 1993 - . Publication preceded by 6 pre-releases in 1991-92. Each release cumulates previous disks. ISSN 1065-707X = Entrez. Sequences. 1. Molecular Sequence Data - periodicals - CD-ROMs I. National Center for Biotechnology Information (U.S.). II. Series W1 EN98K Entrez. References [computer file] / NCBI. -- Release 3.0 (Feb. 15, 1993) - . -- Computer data and programs. -- Bethesda, MD : National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health. ; [Pittsburgh, Pa. 
: For sale by Supt. of Docs., U.S. G.P.O.], 1993- computer laser optical disks ; 4 3/4 in. + guides. -- (GenInfo compact library series) System requirements for use with Windows: PC based on Intel 286, 386, or 486 micro-processor or compatible; 2 MB; MS-DOS 3.1 or later, Microsoft Windows 3.1 or later; hard disk drive, CD-ROM drive, graphics display, mouse. System requirements for use with Macintosh: Macintosh computer; 2 MB; foreign file access and ISO 9660 file access extensions; hard disk drive, CD-ROM drive. Bimonthly. Title from label. Packaged with: Entrez. Sequences, Feb. 15, 1993 - . Each release cumulates previous disks. Consists of citations in which molecular sequences appearing in Entrez. Sequences were published and other citations on this subject from: MEDLINE. ISSN 1065-707X = Entrez. References. 1. Molecular Sequence Data - periodicals - CD-ROMs I. National Center for Biotechnology Information (U.S.). II. Title: References. III. MEDLINE. II. Series W1 EN98K

Subscriptions

Entrez is distributed by subscription through the U.S. Government Printing Office (GPO). New orders may be placed by fax at (202) 512-2233 or by phone at (202) 783-3238. Order status queries can be made by fax at (202) 512-2168 or by phone at (202) 783-3238. To determine whether your order has been processed you should ask, "Am I on the mailing list yet?"

Correspondence

Specific comments on Entrez, e.g., problem reports, and suggestions for improvement of the retrieval system and documentation, should be sent to: entrez@ncbi.nlm.nih.gov. Alternatively, you can call us at (301) 496-2475. Indicate that you have a question about the Entrez: Sequences or Entrez: References CD-ROMs. (General questions should be directed to info@ncbi.nlm.nih.gov.)

Trademarks

The mention of trade names, commercial products, or organizations does not imply endorsement by the NCBI or the U.S. Government.
Apple and Macintosh are registered trademarks of Apple Computer, Inc.; ATCC is a registered trademark of the American Type Culture Collection; Compact Pro is a trademark of Bill Goodman; DEC is a trademark of Digital Equipment Corporation; DOS Mounter is a trademark of Dayna Communications, Inc.; Entrez is a trademark and Index Medicus, MEDLARS, MEDLINE, and MeSH are registered trademarks of the National Library of Medicine; GenBank is a registered trademark of the U.S. Department of Health and Human Services; IBM is a registered trademark of International Business Machines Corporation; Intel is a registered trademark of Intel Corporation; Microsoft and MS-DOS are registered trademarks and Windows is a trademark of Microsoft Corporation; PIR is a registered trademark of the National Biomedical Research Foundation; PKZIP is a registered trademark of PKware, Inc. Entrez 10.0 Highlights Taxonomy-based Retrieval This release of Entrez introduces NCBI's comprehensive phylogenetic taxonomy for use in querying by organism name. The taxonomy began as a consolidation of several existing taxonomies. With this release, the viral, bacterial, and chordate taxonomies have undergone major revision by volunteer curators with expertise in those areas. Further developments and refinements will appear in subsequent releases of Entrez. This is the first time that a comprehensive taxonomy has been applied over all sequence databases, and its structure may be examined with Entrez. In addition, it maps scientific and common names, synonyms, misspellings, and acronyms to the appropriate sequence records, allowing queries to be made at any level in the taxonomic hierarchy. For example, "Eucaryota", "fungi", and "Primates" are all valid search terms, as are "mouse", "mice", and "Mus musculus". See the sections on Organism field and Taxonomy query mode in the Reference chapter for more information. 
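The name mapping and hierarchical querying described above can be pictured with a small sketch. Everything in it is an illustrative assumption: the table names `PARENT`, `NAME_TO_TAXON`, and `RECORDS`, the record identifiers, and the heavily abbreviated lineages are invented for the example and do not reflect how Entrez stores its taxonomy.

```python
# Toy taxonomy (child -> parent). Real lineages contain many more ranks;
# the intermediate levels are omitted here purely for brevity.
PARENT = {
    "Mus musculus": "Rodentia",
    "Homo sapiens": "Primates",
    "Rodentia": "Mammalia",
    "Primates": "Mammalia",
    "Mammalia": "Eucaryota",
}

# Common names, plurals, and scientific names all map to one taxon.
NAME_TO_TAXON = {
    "mouse": "Mus musculus",
    "mice": "Mus musculus",
    "mus musculus": "Mus musculus",
    "primates": "Primates",
    "eucaryota": "Eucaryota",
}

# Hypothetical sequence records and the organism each one comes from.
RECORDS = {"SEQ1": "Mus musculus", "SEQ2": "Homo sapiens"}


def is_within(taxon, ancestor):
    """Return True if `taxon` equals `ancestor` or descends from it."""
    while taxon is not None:
        if taxon == ancestor:
            return True
        taxon = PARENT.get(taxon)
    return False


def search_by_organism(name):
    """Resolve a name to a taxon, then return records at or below that taxon."""
    taxon = NAME_TO_TAXON.get(name.strip().lower())
    if taxon is None:
        return []
    return sorted(rec for rec, org in RECORDS.items() if is_within(org, taxon))
```

In this toy setup, `search_by_organism("mice")` and `search_by_organism("Mus musculus")` resolve to the same taxon, while a higher-level name such as "Eucaryota" matches every record beneath it.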
Scientists with taxonomic expertise who wish to participate in this ongoing project are encouraged to contact NCBI.

Sequence Submission Software

The Entrez discs now include software for preparing sequence records to submit to GenBank. The current program, called "Authorin", has been in use for several years. Versions for both Macintosh and DOS computers are provided in the AUTHORIN directory. Please read the README file for more information. Later this year, NCBI will be providing an improved sequence submission program called "Sequin", which will also be distributed on the Entrez CD-ROMs, as well as via diskette and FTP.

Updated Entrez and EntrezCf Software May Need To Be Installed

The Entrez configuration program, EntrezCf, allows you to specify whether you have one or two CD-ROM drives, and whether to copy index files to your hard disk for better performance of document retrieval. Versions of EntrezCf prior to Release 8.0 expected to find MEDLINE index files on the Entrez: Sequences disc. Because MEDLINE records have been removed from the Entrez: Sequences disc, older versions of EntrezCf will fail. If you did not install Release 8.0 software, you will need to install the latest version of the software to avoid this problem.

Users who copy index files to their hard disk for faster performance, or who specify disc swapping, will need to rerun the EntrezCf program with each new release in order to update the hard disk images of index or cdromdat.val files.

HOW TO USE THIS GUIDE

We recommend that you read the introductory section to gain a basic understanding of the purpose and capabilities of Entrez and the databases. Then follow the tutorial for an overview of the query and neighboring processes. If you need or want more details, consult the reference section.
INTRODUCTION
    What is Entrez
        What information is in the databases
    What's Different about Entrez
        The concept of neighboring
        Links between databases
    Contents and Construction of Entrez Databases
        Entrez: Sequences
        Entrez: References
        Computation of neighbors

TUTORIAL
    How to use Entrez -- a step-by-step example
    How to use the menus, buttons, and boxes

REFERENCE
    Query Modes
        More detail on the modes used in querying
    Searchable Fields
        Fields indexed for Boolean queries
    Menu Bar Options
        Choices and options present in the menu bar
    Document Window Controls
        Buttons used for neighboring and linking
    Advanced Boolean Queries
    Search Strategies

INSTALLATION
    Macintosh Software Extraction
    Microsoft Windows Software Extraction
    Configuration
    Troubleshooting
    Customization
    Entrez on Other Platforms

CREDITS AND ACKNOWLEDGMENTS

INTRODUCTION

Entrez is a molecular sequence retrieval system developed at the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine (NLM). Entrez provides an integrated approach for gaining access to nucleotide and protein sequence information, to the MEDLINE citations in which the sequences were published, and to a sequence-associated subset of MEDLINE. The sequence records are derived from a variety of database sources, including GenBank, EMBL, DDBJ, PIR, SWISS-PROT, PRF, and PDB. With Entrez and a personal computer, you can rapidly search several hundred megabytes of sequence and literature data using techniques that are fast and intuitive. The retrieval software and associated databases are distributed on CD-ROMs, but may be mounted on a file server and installed on multiple computers on a local network if desired.

WHAT'S DIFFERENT ABOUT ENTREZ

If you have searched other versions of MEDLINE or other bibliographic databases via an online system or CD-ROM, you probably have used some form of Boolean searching with AND, OR, and NOT.
Entrez permits specification of Boolean queries to access MEDLINE documents or sequence records, but adds a valuable concept called "neighboring." After an initial query is completed, neighboring allows a user to locate references or sequences related to a given paper or sequence. The user can ask Entrez to "Find all papers that are like this one" or "Find all similar sequences." The neighbors are pre-computed using algorithms developed at the NCBI that relate records within the same database by statistical measurements of similarity. In addition to neighboring, there are also "hard links," which connect entries in different databases. For each MEDLINE record, there are hard links to any protein or nucleotide sequences that were published in that article. The cited protein or nucleotide sequences have reciprocal hard links back to the MEDLINE records. Nucleotide sequences and the proteins derived from them by conceptual translation also have hard links to one another. The pre-computed neighbors and hard links are stored on the CD-ROM along with the databases and retrieval indices. The ability to traverse the literature and molecular sequences via neighbors and links provides a very powerful yet intuitive way of accessing the information in those databases. Given a record or sequence of interest, the user can switch to the appropriate entry in a different database for viewing the associated record or for neighboring within that database. This multi-database structure with neighbors within databases and hard links between them is shown in Figure 1. CONTENTS AND CONSTRUCTION OF ENTREZ DATABASES The Entrez databases are distributed on two discs: Entrez: Sequences and Entrez: References. If you have only a single CD-ROM drive, you should configure Entrez to allow disc swapping. The Sequences disc includes protein and nucleotide sequence data from the GenBank, EMBL, DDBJ, PIR, SWISS-PROT, PRF, and PDB databases. 
An effort has been made to identify and eliminate duplicate entries based on cross-referencing of accession numbers and citation matching. In the future we intend to also use sequence similarity to detect duplicate records. The References disc includes those MEDLINE records that were cited by the sequence database records, plus additional records that were indexed under terms in the "Molecular Sequence Data" tree. The index terms are taken from the NLM's Medical Subject Headings (MeSH), a tree-structured vocabulary of index terms used to classify the medical literature. The sequence databases are compared against themselves to compute groups of similar sequences using the BLAST algorithm for finding ungapped local alignments (1). Protein sequences whose similarity scores would be expected to appear by chance less than once in a database of this size are included as neighbors. As a result, most biologically significant similarities will be classified as neighbors, but some may be missed. Furthermore, some chance similarities may be included as neighbors. Nucleotide sequence similarities are most successfully used to build contigs, rather than to discern biological function. Therefore, nucleotide neighbor sequences must either overlap at the ends, or one must be completely contained within the other, in addition to having appropriate similarity scores. For each of the MEDLINE records a list of its nearest neighbors has been computed by comparing the record against the database using the relevance pairs model of retrieval (2). In this model key terms coming from the titles, abstracts, and MeSH headings throughout the database are weighted according to their individual usefulness in relating documents. When retrieval is carried out the top 30 documents retrieved become the neighbor list for the query document, and this information is also stored on the CD-ROM. 
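As a rough illustration of how a fixed-length neighbor list can be pre-computed, the sketch below scores every other document by the summed weights of the key terms it shares with the query document and keeps the 30 best. The scoring rule is a deliberately simple stand-in and is not the relevance pairs model of reference 2; the function name `build_neighbor_lists`, its parameters, and the way the weights are supplied are all assumptions made for this example.

```python
from heapq import nlargest


def build_neighbor_lists(doc_terms, term_weights, top_n=30):
    """Return {doc_id: [neighbor ids]} ranked by a shared-term score.

    doc_terms:    {doc_id: set of key terms from title, abstract, MeSH}
    term_weights: {term: weight}; how the weights are derived is not shown
    """
    neighbors = {}
    for doc_id, terms in doc_terms.items():
        scores = {}
        for other_id, other_terms in doc_terms.items():
            if other_id == doc_id:
                continue
            # Score = sum of the weights of the terms both documents share.
            score = sum(term_weights.get(t, 0.0) for t in terms & other_terms)
            if score > 0:
                scores[other_id] = score
        # Keep only the top_n best-scoring documents (ties: larger id first).
        ranked = nlargest(top_n, scores.items(), key=lambda kv: (kv[1], kv[0]))
        neighbors[doc_id] = [other_id for other_id, _ in ranked]
    return neighbors
```

Because the lists are computed once and stored, looking up the neighbors of a record at query time is a simple table read rather than a fresh search.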
Thus, given a MEDLINE record, you immediately can access other records in the database that are most likely to be of interest because of their statistical similarity to the given record. While not all records found in this way are guaranteed to be relevant, studies of this type of neighboring process have shown it to be one of the most effective methods of finding useful documents (3). 1 Altschul SF, Gish W, Miller W, Myers EW, and Lipman DJ. Basic local alignment search tool. J Mol Biol 1990;215:403-10. 2 Wilbur WJ. A retrieval system based on automatic relevance weighting of search terms. In: Shaw, D., ed. Proceedings of the 55th American Society of Information Science Annual Meeting. Pittsburgh, PA: Learned Information, Inc., 1992:216-20. 3 Wilbur WJ, and Coffee L. The effectiveness of document neighboring in search enhancement. Inf Process Manage 1994;30:253-66. (Figure 1 not shown.) Figure 1. TUTORIAL The Entrez databases are distributed on two discs: Entrez: Sequences and Entrez: References. Entrez can be configured to use both discs in a single session with a single CD-ROM drive, and will prompt the user to swap discs as necessary. The example screens were produced with a previous release. Running the same example against a subsequent release is likely to result in a somewhat different set of documents retrieved at each step. In this tutorial, we will ask for MEDLINE records about binding or mutation of the ras oncogene. We will specify that "ras" must be in the title, but we will accept either "mutation" or "binding" in either the title or abstract of the record. After finding a set of MEDLINE records that satisfies this query, we will use the neighboring function to find related papers, then link to published sequences, and finally, neighbor again to find similar sequences. You will be exposed to the major screen characteristics -- menus, boxes, and buttons -- while following this step-by-step process. 
This tutorial demonstrates one method, or mode, of searching with Entrez. There are other modes available that provide additional ways to explore terms and deal with different word endings. Each mode has advantages for particular types of queries. The choice of mode depends in part on how specific or how general your query is. You will quickly become familiar with the best mode for your topic. Read about the modes in the Query Modes section. The Searchable Fields section provides complementary information on the parts of the MEDLINE, Protein, and Nucleotide records that can be searched. To fully exploit the capabilities for displaying, printing, and saving records, read the section on Menu Bar Options.

This tutorial example is for the purpose of illustration. Since the databases are updated every two months, the actual documents shown, and the neighbor and link relationships explored, may not reflect exactly what you will see on a given release.

SPECIFICATION OF QUERY TERMS

An Entrez search typically begins with selecting initial MEDLINE or sequence records through a keyword query. This is done on the Query window that opens when you start up Entrez. There are two steps:

1. Entering or selecting the query terms.
2. Specifying the desired logical (Boolean) relationships between those terms.

Entrez starts out in the MEDLINE database, using the Abstract or Title field, in Multiple mode. (Other indexed fields and query modes will be discussed later.) In Multiple mode, you enter one or more words that must be in the title or abstract in order for a paper to be found. Enter the words in the Term entry box after the I-beam pointer, and press the Return key. All of the terms will be chosen and moved to the Query Refinement box.

Figure 2 shows how the Term Selection box appears after the user types ras mutation binding and presses Return. The program has already chosen "ras" and "mutation", and is shown in the process of choosing "binding". (In actual use, the Term entry and the Term Selection boxes would be cleared once all terms had been processed.)

 Database:        Field:                   Mode:
+------------+   +--------------------+   +------------+   /=========\
|  MEDLINE   |   | Abstract or Title  |   |  Multiple  |   | Accept  |
+============+   +====================+   +============+   \=========/

      +-----------------------------------------------------------------+
Term: | ras mutation binding                                            |
      +-----------------------------------------------------------------+

+------------------------------------------------------------------------+
| binding                                          |     1380 |     7406 |
| binding-activity                                 |        0 |        1 |
| binding-affected                                 |        0 |        1 |
| binding-affinity                                 |        0 |        1 |
| binding-chain                                    |        0 |        1 |
| binding-cyclic                                   |        1 |        1 |
| binding-deficient                                |        0 |        1 |
+------------------------------------------------------------------------+
 /\  Term Selection
 ||  Query Refinement                                 Special      Total
 ||
 \/

Figure 2.

RESULT OF INITIAL QUERY

The Query Refinement box allows you to specify the query by merging or intersecting lists of documents in which terms appear. The Special column indicates the number of articles in which the word appears in the title, and the Total column indicates occurrences in the title or abstract. Upon entering the terms into the Query Refinement box, Entrez automatically selects the Total column for each term. The individual terms are independent and intersected (ANDed) by default, as indicated by the brackets to the left of each term.

The results of the initial intersection are shown at the bottom of the screen in the Retrieve button. The button title Retrieve 4 Documents in Figure 3 indicates that 4 documents contain the configuration of terms that satisfies the unrefined query.
 /\  Term Selection
 ||  Query Refinement                                 Special      Total
 ||
 \/
+------------------------------------------------------------------------+
| [_ras                                            |      102 |      289 |
| |_mutation                                       |      627 |     3031 |
| |_binding                                        |     1380 |     7406 |
| |                                                |          |          |
| |                                                |          |          |
| |                                                |          |          |
+------------------------------------------------------------------------+
/---------------------------------\  /---------\  /--------------------\
|      Retrieve 4 Documents       |  |  Reset  |  |  More Booleans...  |
\---------------------------------/  \---------/  \--------------------/

Figure 3.

Since we wish to find documents that have "ras" in the title and either "mutation" or "binding" in the abstract or the title, we will want to refine the query.

GROUPING SYNONYMOUS TERMS

Similar or synonymous terms may be grouped (ORed) by clicking on an entry in the Query Refinement box (Figure 4) and, while keeping the mouse button depressed, dragging the pointer on top of another term (Figure 5), and releasing the mouse button (Figure 6).

 /\  Term Selection
 ||  Query Refinement                                 Special      Total
 ||
 \/
+------------------------------------------------------------------------+
| [_ras                                            |      102 |      289 |
| |_mutation                                       |      627 |     3031 |
| |_binding                                        |     1380 |     7406 |
| |                                                |          |          |
| |                                                |          |          |
| |                                                |          |          |
+------------------------------------------------------------------------+
/---------------------------------\  /---------\  /--------------------\
|      Retrieve 4 Documents       |  |  Reset  |  |  More Booleans...  |
\---------------------------------/  \---------/  \--------------------/

Figure 4.

 /\  Term Selection
 ||  Query Refinement                                 Special      Total
 ||
 \/
+------------------------------------------------------------------------+
| [_ras                                            |      102 |      289 |
| |_mutation                                       |      627 |     3031 |
| |_binding                                        |     1380 |     7406 |
| |                                                |          |          |
| |                                                |          |          |
| |                                                |          |          |
+------------------------------------------------------------------------+
/---------------------------------\  /---------\  /--------------------\
|      Retrieve 4 Documents       |  |  Reset  |  |  More Booleans...  |
\---------------------------------/  \---------/  \--------------------/

Figure 5.

 /\  Term Selection
 ||  Query Refinement                                 Special      Total
 ||
 \/
+------------------------------------------------------------------------+
| [_ras                                            |      102 |      289 |
| | binding                                        |     1380 |     7406 |
| |_mutation                                       |      627 |     3031 |
| |                                                |          |          |
| |                                                |          |          |
| |                                                |          |          |
+------------------------------------------------------------------------+
/---------------------------------\  /---------\  /--------------------\
|      Retrieve 86 Documents      |  |  Reset  |  |  More Booleans...  |
\---------------------------------/  \---------/  \--------------------/

Figure 6.

The first term is moved, and the bracket at the left of the table expands to encompass both terms. (The terms within an OR group are sorted in alphabetical order.) Terms can be ungrouped by dragging a term to another group or to the blank region at the bottom of the Query Refinement box. (Holding down the shift key and dragging with the mouse will merge multiple terms in one step.) Note that 86 documents satisfy this new formulation of the query.

Entrez selects documents by first calculating the union (logical OR) of each group, then calculating the intersection (logical AND) between groups. This visual method of query refinement eliminates the need to retype terms and avoids the difficulty of constructing Boolean queries using English phrases and parentheses. We will specify that "ras" must be in the title in the final refinement of the query.

FURTHER REFINEMENT OF QUERY

Clicking on the number in the Special or Total column further refines your query. Terms for which neither the Special nor the Total count is chosen do not take part in the query. The count in the Retrieve button reflects the number of documents that satisfy the query. In Figure 7, the user has clicked on the "102" in the Special column of the "ras" row (indicating that "ras" must be in the title), resulting in 30 documents now satisfying the query.
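The evaluation order described in this section (union within each group, then intersection between groups) can be sketched with ordinary sets. This is a minimal illustration only: the function name `evaluate_query` and the document numbers are invented, and the tiny posting sets stand in for the real index, so the resulting counts do not match the figures.

```python
def evaluate_query(groups):
    """Evaluate a grouped query.

    groups: a list of OR-groups; each group is a list of sets of document ids.
    Returns the intersection (AND) across groups of the union (OR) within
    each group.
    """
    result = None
    for group in groups:
        union = set().union(*group)
        result = union if result is None else result & union
    return result if result is not None else set()


# Hypothetical postings: "title" plays the role of the Special column and
# "total" the role of the Total column (title or abstract).
ras_title = {1, 2}
ras_total = {1, 2, 3}
mutation_total = {2, 4}
binding_total = {3, 5}

# Default query: every term in its own group, so all terms are ANDed.
initial = evaluate_query([[ras_total], [mutation_total], [binding_total]])

# After dragging "binding" onto "mutation": one OR-group, ANDed with "ras".
grouped = evaluate_query([[ras_total], [mutation_total, binding_total]])

# After choosing the Special count for "ras": restrict "ras" to titles.
refined = evaluate_query([[ras_title], [mutation_total, binding_total]])
```

With these toy sets, grouping widens the result and the title restriction narrows it again, which mirrors the direction of the changes seen in the Retrieve button during the tutorial.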
/\ Term Selection || Query Refinement Special Total || \/ +------------------------------------------------------------------------+ | [_ras | 102 | 289 | | | binding | 1380 | 7406 | | |_mutation | 627 | 3031 | | | | | | | | | | | | | | | | | +------------------------------------------------------------------------+ /---------------------------------\ /---------\ /--------------------\ | Retrieve 30 Documents | | Reset | | More Booleans... | \---------------------------------/ \---------/ \--------------------/ Figure 7. RETRIEVING CHOSEN DOCUMENTS When the number of articles that satisfy the query is reasonably small, click on the Retrieve button to actually retrieve the documents from the CD-ROM. The Document window appears, and the first author, year of publication, and title of each document is displayed, along with an icon indicating whether or not the article has an abstract in the database. A check box at the left of each document is used to select records for neighboring. +------------------------------------------------------------------------+ | [ ] Fukumoto, Molecular cloning and characterization of a novel | | 1990 type of regulatory protein (GDI) for the rho | | proteins, ras p21-like small GTP-binding | | proteins. | | | | [ ] Yamamoto, Purification and characterization from bovine | | 1990 brain cytosol of proteins that regulate the | | GDP/GTP exchange reaction of smg p21s, ras | | p21-like GTP-binding proteins. | | | | [X] Tanaka, IRA2, a second gene of Saccharomyces cerevisiae | | 1990 that encodes a protein with a domain homologous | | to mammalian ras GTPase-activating protein. | | | | [ ] Matsui, Molecular cloning and characterization of a novel | | 1990 type of regulatory protein (GDI) for smg p25A, | | a ras p21-like GTP-binding protein. | | | | [ ] Campo, The Harvey ras 1 gene is activated in | | 1990 papillomavirus-associated carcinomas of the | | upper alimentary canal in cattle. 
| | | | [ ] Gautam, A G protein gamma subunit shared homology with | | 1989 ras proteins. | +------------------------------------------------------------------------+ /----------------\ /------\ | Neighbor 18 | (*) MEDLINE () Protein () Nucleotide | Prev | \----------------/ \------/ /-----\ /------\ /---------\ /------\ Select: | ALL | | None | | Parents | | Next | \-----/ \------/ \---------/ \------/ Figure 8. In Figure 8, the user has retrieved the 30 documents, has scrolled through the list, and has already clicked on the check box next to Tanaka, 1990, deciding that it is a paper of interest. (Note that the results of a Boolean query are not ranked by any measure of relevance, and thus you should scan the entire list for relevant articles.) To mark records to be used in neighboring or saved to a disk, click on the check box. NEIGHBORING ON RELATED MEDLINE RECORDS As you activate check boxes for documents, the Neighbor button changes to indicate the number of documents that will be retrieved. The radio button group at the bottom of the window indicates the target database. With MEDLINE documents loaded and the MEDLINE button chosen, clicking the Neighbor button will retrieve the related articles. In Figure 8, with one document checked there are 18 MEDLINE neighbors that have been stored. The first round of neighboring from the Tanaka paper in Figure 8 retrieves a number of papers related to RAS and its regulators (Figure 9). If the Parents Persist item in the Preferences menu is set, as it is in this example, the Tanaka paper will appear at the top of the list identified with a bullet. The neighbors include Ballester, 1990, which is already checked for the next round of neighboring. It is worth noting that although the Ballester paper did not satisfy the initial query, it is relevant to the spirit of the current inquiry. 
+------------------------------------------------------------------------+ | [ ] * Tanaka, IRA2, a second gene of Saccharomyces cerevisiae | | 1990 that encodes a protein with a domain homologous | | to mammalian ras GTPase-activating protein. | | | | [ ] Kim, Overexpression of RPl1, a novel inhibitor of the | | 1991 yeast Ras-cyclic AMP pathway, down-regulates | | normal but not mutationally activated ras | | function. | | | | [ ] Tanaka, IRA1, an inhibitory regulator of the RAS-cyclic | | 1989 AMP pathway in Saccharomyces cerevisiae. | | | | [ ] Ruggieri, MSl1, a negative regulator of the RAS-cAMP | | 1989 pathway in Saccharomyces cerevisiae. | | | | [X] Ballester, The NF1 locus encodes a protein functionally | | 1990 related to mammalian GAP and yeast IRA proteins. | | | | [ ] Imai, Identification of a GTPase-activating protein | | 1991 homolog in Schizosaccharomyces pombe. | | | | [ ] Bussereau, The CCS1 gene from Saccharomyces cerevisiae | | 1992 which is involved in mitochondrial functions is | | identified as IRA2 an attenuated of RAS1 and | +------------------------------------------------------------------------+ /----------------\ /------\ | Neighbor 11 | (*) MEDLINE () Protein () Nucleotide | Prev | \----------------/ \------/ /-----\ /------\ /---------\ /------\ Select: | ALL | | None | | Parents | | Next | \-----/ \------/ \---------/ \------/ Figure 9. VIEWING A MEDLINE ABSTRACT The relationship of NF1 to GAP and IRA looks interesting, and it can be useful to read the abstract of the article. Double clicking on the Ballester entry displays the reference, title, initials and last names of all authors, the primary author's affiliation, the full abstract, MeSH terms for the paper, gene symbols, and chemical substance names (Figure 10). +----------------------------------------------------------------------------+ | Cell 63: 851-9 (1990) [91029516] | | | | The NF1 locus encodes a protein functionally related to mammalian | | GAP and yeast IRA proteins. 
| | | | R. Ballester, D. Marchuk, M. Boguski, A. Saulino, R. Letcher, | | M. Wigler & F. Collins | | | | Cold Spring Harbor Laboratory, New York 11724. | | | | The von Recklinghausen neurofibromatosis locus, NF1, encodes a protein | | with homology restricted to catalytic region of the RAS GTPase-activating | | protein, GAP, and with extensive homology to the IRA1 and IRA2 gene | | products of the yeast S. cerevisiae. A segment of the NF1 cDNA gene, | | expressed in yeast, can complement loss of IRA function and can inhibit | | both wild-type and mutant activated human H-ras genes that are coexpressed | | in yeast. Yeast expressing the NF1 segment have increased H-ras GTPase- | | stimulating activity. These studies indicate that the NF1 gene product | | can interact with RAS proteins and demonstrate structural and functional | | similarities and differences among the GAP, IRA1, IRA2, and NF1 proteins. | | | | MeSH Terms: | | Amino Acid Sequence | | Base Sequence | | Fungal Proteins/*genetics | +----------------------------------------------------------------------------+ Figure 10. The Options menu lets you choose between viewing MEDLINE abstracts as a MEDLINE Report, in MEDLARS Format suitable for use with bibliographic software packages, or in MEDLINE ASN.1 text form. The open document can be sent to the printer using the Print menu command, or can be saved to a disk file using the Save or Save As menu commands. SECOND ROUND OF NEIGHBORING Since the NF1 (neurofibromatosis) connection is considered worth pursuing, we can easily change the focus of our inquiry. A second round of neighboring on the Ballester entry reveals several relevant articles on the relationship of the neurofibromatosis gene and RAS activation, and on the relationship between GAP and the yeast IRA proteins (Figure 11). The original Tanaka, 1990, article will reappear as a neighbor of the Ballester, 1990, paper as long as it is among the articles computed to be the most relevant to Ballester. 
Neighboring can associate two papers even if neither one cites the other. +------------------------------------------------------------------------+ | [ ] * Ballester, The NF1 locus encodes a protein functionally | | 1990 related to mammalian GAP and yeast IRA proteins. | | | | [ ] Martin, The GAP-related domain of the | | 1990 neurofibromatosis type 1 gene product interacts | | with ras p21. | | | | [X] Xu, The neurofibromatosis type 1 gene encodes a | | 1990 protein related to GAP. | | | | [ ] Tanaka, IRA2, a second gene of Saccharomyces cerevisiae | | 1990 that encodes a protein with a domain homologous | | to mammalian ras GTPase-activating protein. | | | | [ ] Garrett, Purification and N-terminal sequence of the | | 1991 p21 rho GTPase-activating protein, rho GAP. | | | | [ ] Nishi, Differential of a GTPase-activating protein | | 1991 neurofibromatosis type 1 ( NF1 ) gene | | transcripts related to neuronal differentiation. | | | | [ ] Wang, sar1, a gene from Schizosaccharomyces pombe | | 1991 encoding a protein that regulates ras1. | | | +------------------------------------------------------------------------+ /----------------\ /------\ | Lookup 3 | () MEDLINE (*) Protein () Nucleotide | Prev | \----------------/ \------/ /-----\ /------\ /---------\ /------\ Select: | ALL | | None | | Parents | | Next | \-----/ \------/ \---------/ \------/ Figure 11. We can determine that a protein sequence was published in the Xu, 1990, paper by checking it and clicking on the Protein radio button. In preparation for looking up the neurofibromatosis protein sequence, we have done this. The Neighbor button now reads Lookup, indicating that we are about to switch to another database. LOOKING UP ASSOCIATED SEQUENCES With MEDLINE documents loaded and either the Protein or Nucleotide radio button chosen, the Neighbor button reads Lookup and will retrieve the sequences that were published in those MEDLINE articles. 
In a similar manner you can look up the articles in which sequences were published, or can look up the nucleotide sequences of proteins that are translated from GenBank, EMBL, or DDBJ, simply by choosing the appropriate radio button. (Note that Neighbor is used to find related records within a single database, and Lookup is used to find associated records (hard links) in a different database.) With the Xu, 1990, paper checked, and the Protein radio button chosen (Figure 11), pressing Lookup retrieves the records for any protein sequences published in that paper (Figure 12). +------------------------------------------------------------------------+ | [X] NF1_HUMAN NEUROFIBROMIN (NEUROFIBROMATOSIS- | | RELATED PROTEIN NF-1.) | | | | [X] HUMNF1MRNA neurofibromatosis type 1 protein | | cds 1 | | | | [X] HUMNF1MRNB neurofibromatosis protein type 1 | | cds 1 | | | | | | | | | | | | | | | | | | | | | | | | | | | +------------------------------------------------------------------------+ /-----------------\ /------\ | Neighbor 38 | () MEDLINE (*) Protein () Nucleotide | Prev | \-----------------/ \------/ /-----\ /------\ /---------\ /------\ Select: | ALL | | None | | Parents | | Next | \-----/ \------/ \---------/ \------/ Figure 12. NEIGHBORING ON RELATED SEQUENCES With protein or nucleotide sequences loaded and the same database selected in the radio button group, the program can then find the precomputed nearest neighbors (based on BLAST sequence similarity). Neighboring on the human NF1 protein records retrieves additional neurofibromatosis, IRA and GAP proteins (Figure 13). Note that neighboring will not find proteins that interact with one another (unless they happen to be evolutionarily related), but this information may frequently be found in the comments of a sequence record or the abstract from the associated MEDLINE record.
+------------------------------------------------------------------------+ | [ ] RATNF1ASAB neurofibromatosis protein type 1 | | cds1 | | | | [ ] A35910 *Neurofibromatosis-related protein NF1 - | | Human (fragment) | | | | [ ] YSCIRA2A cds1 IRA2 protein | | | | [ ] RGBYI2 Probable GTPase-activating protein IRA2 - | | Yeast (Saccharomyces cerevisiae) | | | | [ ] IRA2_YEAST INHIBITORY REGULATOR PROTEIN IRA2 (GLC4 | | PROTEIN). | | | | [ ] IRA1_YEAST INHIBITORY REGULATOR PROTEIN IRA1. | | cds 1 | | | | [ ] A40258 *GTPase activating protein homology - Yeast | | (Schizosaccharomyces pombe) | | | | [ ] Wang, sar1=RAS GTPase-activating protein | | 1992 [Schizosaccharomyces pombe, Peptide, 766 aa] | +------------------------------------------------------------------------+ /----------------\ /------\ | Neighbor 0 | () MEDLINE (*) Protein () Nucleotide | Prev | \----------------/ \------/ /-----\ /------\ /---------\ /------\ Select: | ALL | | None | | Parents | | Next | \-----/ \------/ \---------/ \------/ Figure 13. VIEWING A SEQUENCE REPORT Double clicking anywhere on the "RGBYI2" document summary (on the icon, caption, or the title) displays a sequence report (Figure 14). The report is generated from data that are stored in Abstract Syntax Notation 1 (ASN.1), using a specification for sequence information designed at the NCBI. This results in a tagged, hierarchical representation of the data that facilitates access to all of the features associated with the sequences.
+-------------------------------------------------------------------------+ | gi|69012 --------------------------------------------------- | | | | Definition Probable GTPase-activating protein IRA2 - Yeast | | (Saccharomyces cerevisiae) | | | | Protein Names: Probable GTPase-activating gim|69012: [ Whole ] | | protein IRA2, GLC4 protein | | | | PIR Name: RGBYI2 | | | | NCBI GenInfo ID: 69012 | | | | Organism Saccharomyces cerevisiae | | | | Citation Tanaka K., Nakafuku M., Tamanoi F., Kaziro Y., | | Matsumoto K. & Toh-e A. (1990). IRA2, a second gene of | | Saccharomyces cerevisiae that encodes a protein with a | | domain homologous to mammalian ras GTPase-activating | | protein. Mol. Cell. Biol. 10, 4303-4313. MEDLINE | | identifier: 90318397 | | | | Sequence 3079 aa | | | | 1 msqptknkkk ehgtdskssr mtrtlvnhil ferilpilpv esnlstysev | | 51 eeyssfiscr svlinvtvsr danamvegtl eliesllqgh eiisdkgssd | | 101 viesiliilr llsdaleynw qnqeslhynd isthvehdqe qkyrpklnsi | | 151 lpdyssthsn gnkhffhqsk pqalipelas kllescaklk fntrtlqilq | | 201 nmishvhgni lttlsssilp rhksyltrhn hpshckmids tlghilrfva | +-------------------------------------------------------------------------+ Figure 14. Sequences from PDB are sometimes numbered relative to a family-specific reference sequence. The displayed sequence breaks to a new line at each point of discontinuity in the numbering, which can result in sequence reports that are not as nicely formatted as that shown in Figure 14. All bases or residues in numbered regions map to coordinates in the three-dimensional structure. Residues that are marked unnumbered were not present in the model structure reported by PDB, due to crystallographic disorder or other reasons. The Options menu lets you choose between viewing sequence records as a Sequence Report, in GenBank Format, in FASTA Format, or in Sequence ASN.1 text form. 
As with MEDLINE articles, the open document can be sent to the printer using the Print menu command, or can be saved to a disk file using the Save or Save As menu commands. This concludes the tutorial guide to Entrez. REFERENCE QUERY MODES There are five modes used in Boolean retrieval searching with Entrez. Each provides a different method to choose the terms that will be used in your query. The mode options are found on the Query window in a popup menu. Their use varies somewhat according to the field that is being searched -- not all modes are appropriate (or available) for all fields. The differences among the five are described below. (Note that "term" as used in these descriptions can refer to a word from an abstract, an author's name, a journal title, or a gene name.) +------------+ | Selection | | Multiple | | Truncation | +------------+ The two columns of numbers at the right of the Term Selection and Query Refinement boxes give the counts for the number of database records that correspond to that term. The right-hand number is always the broader approach, and the left, more specific, if that distinction is appropriate for the field being searched. Selection Mode Selection mode allows you to see a list of terms alphabetically adjacent to a single term entered in the Term entry box. When you click on Accept (or press Return) after a term is entered, the list of available terms appears in the Term Selection box. You can scroll through the list and select one or more terms (by double clicking) for transfer to the Query Refinement box. After finishing your selection(s) for the first term, enter the next term of interest in the Term entry box. Another list of terms adjacent to that term will appear for selection. The number of times you repeat this process will depend on the complexity of your search question, and the database and field being searched. 
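The alphabetical term window that Selection mode presents can be pictured as a binary search into a sorted term index. The Python sketch below is only an illustration of that idea: the term list, the record counts, and the function name adjacent_terms are hypothetical and are not taken from the Entrez software.

```python
from bisect import bisect_left

# Hypothetical sorted term index: (term, record count). Not real Entrez data.
TERM_INDEX = [
    ("bind", 12),
    ("binding", 950),
    ("binding-activity", 1),
    ("binding-site", 3),
    ("binds", 40),
    ("biochemical", 210),
]


def adjacent_terms(index, entered, window=3):
    """Return up to `window` entries starting at the first term >= `entered`."""
    terms = [term for term, _count in index]
    start = bisect_left(terms, entered.lower())
    return index[start:start + window]


# The user types "binding"; the list opens at that position in the index.
for term, count in adjacent_terms(TERM_INDEX, "binding"):
    print(f"{term:20s} {count:6d}")
```

If the entered term is not itself in the index, the list simply opens at the next term in alphabetical order, which matches the browsing behavior described above.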
Selection mode allows you to browse the variety of terms that exist in the field before committing to any terms. (You may decide after seeing the available terms that Truncation mode would be more appropriate for your search.) Multiple Mode Multiple mode is used to search using one or more terms that are entered in the Term entry box together, separated by spaces. When you click on Accept (or press Return) after the term(s) are entered, the terms are automatically retrieved, one at a time, and put in the Query Refinement box with their corresponding record counts. One screen of alphabetically adjacent terms appears briefly for each term, but you do not have the option to select from these lists. Multiple mode is the quickest and most straightforward way to enter a query, but with less reflection on the choice of each search term. It is not available for most fields (e.g., Author Name, Journal Title, and Organism), where the usage would be inappropriate because those terms can contain spaces. Truncation Mode Truncation mode is a bit more complex because it acts differently depending on the field being searched. If it is possible to search a field in Multiple mode (such as Abstract or Title), it searches on one or more term-beginnings that are entered in the Term entry box together, separated by spaces. If Multiple mode is not available for a field (such as Author Name) because spaces are possible within terms, then truncation acts on the entire term. For example, if you search on the author Smith J in Truncation mode, Entrez will retrieve articles written by all authors whose last names are Smith and whose first names begin with the letter J. In either case described above, all available endings will be retrieved for each term-beginning. (This is not stem or root searching; it is strict truncation and merging, and any available ending will be included in the results.) 
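Strict truncation and merging can be sketched as a prefix match followed by a union of the matching record sets. The following Python fragment is a minimal illustration under stated assumptions: the index contents and UID values are made up, and truncate_search is a hypothetical name, not an Entrez function.

```python
# Hypothetical index: term -> set of record UIDs (made-up values).
INDEX = {
    "interact": {1, 2},
    "interacting": {2, 3},
    "interaction": {4},
    "interactor": {5},
    "interface": {6},
}


def truncate_search(index, beginning):
    """Merge the record sets of every term that starts with `beginning`."""
    matched = sorted(term for term in index if term.startswith(beginning))
    merged = set()
    for term in matched:
        merged |= index[term]
    return matched, merged


terms, uids = truncate_search(INDEX, "interact")
print(terms)         # every available ending is included, wanted or not
print(sorted(uids))  # records that contain at least one of those terms
```

The same function applied to an author index with the beginning "smith j" would merge every author entry that starts with those characters, which is the whole-term behavior described for fields that can contain spaces.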
When you click on Accept (or press Return) after the term-beginnings are entered, the term or terms appear, one at a time, in the Query Refinement box with the corresponding record counts. No alphabetical lists of terms appear. Truncation mode allows you to search for common endings of terms that appear in the literature without selecting them, one at a time, from lists in Selection mode. For example, if you want records with the terms interact, interaction, or interacting, it is quicker to enter interact in Truncation mode. All three words will be selected, and the term will appear as interact... in the Query Refinement box to indicate that assorted endings are included. Remember, however, that you will often select some terms that you may not have anticipated. In the example above, several additional terms would be included in the truncation, including interaction-born, interactivation, and interactor. You can always examine the available terms in Selection mode to help you decide where to truncate a term. Taxonomy Mode Taxonomy mode allows you to explore the taxonomic hierarchy. The top of the hierarchy is the "root" node. If you first select a term in Selection mode in the Organism field, then switch to Taxonomy mode, the hierarchy will be displayed at that point. Otherwise, it will start at "root". The popup list that replaces the Term entry box represents the "parent" lineage, starting from "root" and ending at the current node. Choosing an element of the parent list moves up the hierarchy. The Term Selection box displays the "children" of the current node. Double clicking on an item moves down the hierarchy, with the chosen item becoming the new parent. A mark to the left of an item in the Term Selection box indicates that it has children. Clicking on Accept (or pressing Return) will accept a node into the Query Refinement box. If a child is highlighted (having been selected by single clicking), that node is used. 
Otherwise, the current node in the popup list is accepted. The number of documents that satisfy the query may be less than that indicated in the term list if there are records that contain multiple organism references.

Lookup Mode

Lookup mode is used to look up an article by its unique identification number. If you specify the MEDLINE database and the MEDLINE ID field, by implication you want to display the record for an ID already known to you. The NCBI Seq ID field in the Protein or Nucleotide databases allows you to retrieve records based on the unique identifier assigned by NCBI's ID tracking database. In Lookup mode, the title of the Accept button changes to Find, and an article is retrieved directly, bypassing the Boolean query refinement step.

SEARCHABLE FIELDS

There are a number of searchable fields in the MEDLINE, Protein, and Nucleotide databases. Some fields are found in all three; others are unique to one database. This section briefly describes the fields. The fields are always located on the Field popup menu. The field choices for the MEDLINE, Protein, and Nucleotide databases, respectively, are shown below:

+-------------------+  +---------------+  +---------------+
| Abstract or Title |  | Text Terms    |  | Text Terms    |
| MeSH Term         |  | Keyword       |  | Keyword       |
| Author Name       |  | Author Name   |  | Author Name   |
| Journal Title     |  | Journal Title |  | Journal Title |
| Gene Name         |  | Organism      |  | Organism      |
| Substance         |  | Accession     |  | Accession     |
| E.C. Number       |  | Gene Name     |  | Gene Name     |
| MEDLINE ID        |  | Protein Name  |  | Protein Name  |
+-------------------+  | E.C. Number   |  | NCBI Seq ID   |
                       | NCBI Seq ID   |  +---------------+
                       +---------------+

Abstract or Title (MEDLINE)
Text Terms (Protein and Nucleotide)

The Abstract or Title field in the MEDLINE database and the Text Terms field in the Protein and Nucleotide databases are very similar. When you use these fields to search, you are looking for "free-text" words from specific fields in the records.
Words searched in the Abstract or Title field in MEDLINE come from, as the field name indicates, the abstract or the title. The count in the Special column refers to the number of articles in which the term appears in the title. The Total column count includes articles in which the term appears in either the abstract or the title.

Words searched in the Text Terms field in the Protein database come from four fields: definition, comment, protein name, and protein description. The definition field serves as a title for the record, and definition field terms are indexed in the Special column. There can be multiple comments in a record, each of which contains descriptive information about the protein.

Words searched in the Text Terms field in the Nucleotide database come from four fields: definition, comment, gene name, and gene description. The definition field serves as a title for the record, and definition field terms are indexed in the Special column.

Accession

The Accession field in the Protein and Nucleotide databases is used to find one or more records based on sequence name or accession number. Accession numbers are assigned by the sequence database builders and are typically published in the literature. The Special column indicates a primary accession number. Patent accession numbers have a two-letter "country" code followed by a space and a number. The majority of country codes are US or EP, representing patents issued by the U.S. Patent and Trademark Office and the European Patent Office, respectively. However, there are a large number of possible country codes.

Author Name

The Author Name field contains up to ten authors of papers in the indexed literature. Authors in protein or nucleotide records are the authors of the MEDLINE articles linked to those sequence records. The format for names is: Last name followed by a space and initial(s). For example, the author James Michael Lang would appear in MEDLINE as: Lang JM.
Therefore, when searching for an author, enter the author's last name followed by a space and initial(s). Author Name is available in Selection or Truncation mode. Selection mode will result in a list from which you can select one or more authors. Truncation mode will retrieve all authors whose names start with the name or partial name that you supply. For example, if you search on the name Lloyd (followed by space) in Truncation mode, papers written by all the Lloyds in the database (ranging from Lloyd AD to Lloyd WS) will be retrieved. If Lloyd P is entered, only papers by Lloyd P and Lloyd PE will result.

E.C. Number

The E.C. (Enzyme Commission) Number is used to find a record for a specific enzyme. An example is 4.1.1.39, which is the E.C. number for ribulose-1,5-bisphosphate carboxylase.

Gene Name

The Gene Name field is used to find one or more records by standard gene symbol. It can also be used in combination with the Organism field in the sequence databases to find the literature about a particular gene in a particular organism. Examples include recA, lacZ, and apoA-I. Greek letters in gene names appear spelled out in angle brackets (e.g., ). Try using the Text Terms field if you cannot find a particular gene using Gene Name.

Journal Title

The Journal Title field is used to search for all articles published in one or more particular journals. It is particularly useful in combination with other fields, such as Author Name.

Keyword

The Keyword field is used to search controlled vocabularies, or index terms, used with the GenBank, EMBL, DDBJ, PIR, SWISS-PROT, PRF, and PDB databases. Unless you are familiar with these particular vocabularies, you would not be likely to use this field in a query. However, the information in this field will add to the overall description of proteins and nucleotides. This field is not present in all records. PDB records are indexed under the keyword "PDB structure".
MEDLINE ID The MEDLINE ID field is primarily used to find a record for which you already know the number. Each MEDLINE record has a unique number. MeSH Term MeSH stands for Medical Subject Headings, which is a controlled vocabulary, or thesaurus, of terms used by the National Library of Medicine in MEDLINE and other databases. Each journal article is indexed by a group of terms from MeSH, which supplements the title and abstract information in describing the contents of the article. Use the MeSH Term field to query the MEDLINE database for articles indexed by MeSH terms. It can also be useful to group a term from the title or abstract with the most closely related MeSH term. MeSH terms can be one or more words and may have internal punctuation. For example, Heart; Cytochrome B; and Neoplasms, Hormone-Dependent are all valid MeSH terms. Entrez will display the list of MeSH terms on MEDLINE papers below the abstract information. Frequently, MeSH terms have an additional term, called a "subheading" following the term and separated with a slash (e.g., Genetic Code/ Physiology or Kearns Syndrome/ Genetics). These subheadings give you further information as to the particular aspect of the subject being indexed by the MeSH term. The counts for the number of database records in the Special and Total columns for each MeSH term provide further information about the references that will be retrieved. The Total column is, as the name suggests, the total number of references on that topic. The Special column indicates the number of references where that MeSH term indicates that the concept is a focal point of the article. Special occurrences are marked with an asterisk (*) on MEDLINE Reports in Entrez, and also indicate that a reference will appear in that section of Index Medicus. NCBI Seq ID The NCBI Seq ID field allows you to find a record based on the unique identifier assigned by NCBI's ID tracking database. 
It is used to ensure that separate analyses performed on the same accession from different releases actually used the same sequence. Unlike accession numbers, this number will remain the same unless the sequence changes, at which point a new Seq ID is assigned and the old one is retained in the history list for that record.

Organism

The Organism field is used to search for one or more particular organisms. Both scientific and common names are included. Although there is no standardization across databases for this field, NCBI has developed a comprehensive phylogenetic taxonomy for GenBank, and we superimpose this taxonomy on all sequence records on the Entrez discs, whether nucleotide or protein, for the purpose of indexed retrieval. For example, sequence reports referring to the mouse may have mouse, mice or Mus musculus as the organism, depending upon the database of origin. Querying with any of these terms will now retrieve all relevant records.

The new GenBank taxonomy is built upon a number of sources, including the nucleotide and protein sequence databases (PIR, GenBank, EMBL, DDBJ, and SWISS-PROT), the ICTV (International Committee on Taxonomy of Viruses) taxonomy for viruses, the USDA taxonomy for higher plants, the FlyBase taxonomy for drosophilids, and the NOAA/NODC (National Oceanic and Atmospheric Administration/National Oceanographic Data Center), ATCC (American Type Culture Collection), and RDP (Ribosomal Database Project) taxonomies. Several branches of the taxonomy have already been reviewed and revised in detail by individual volunteer curators from the systematics and phylogenetics communities.

The new taxonomy maps scientific and common names, synonyms, misspellings, acronyms, and teleomorphs and anamorphs (the sexual and asexual stages of fungi) to the appropriate sequence records. In addition, terms at any level of the taxonomic hierarchy can be queried, not just the genus and species.
Thus, querying on fungi (a common name for Eumycota) will retrieve all records containing fungal sequences, not just a few that were not further classified. We have not, at this time, modified the actual database records, in part because the taxonomy is still being refined. When a common name refers to multiple groups in the taxonomy, individual entries are distinguished by a number. For example, monkeys ((#1)) and monkeys ((#2)) refer to New World and Old World monkeys, respectively. When duplicated scientific names exist between the plant and animal kingdoms, both terms are qualified by their domains, unless one of the levels is clearly higher. For example, Proboscidea is an order of mammals that includes elephants, and Proboscidea ((plant)) is a genus that includes the pale devil's-claw. Synonyms can indicate that organisms were reclassified, or promoted from subspecies to species level, or they can collect variant spellings of a term. This allows retrieval of all eukaryotes by Eukaryota, Eukaryotae, or Eucaryotae, for example. Humans are currently classified as Eucaryotae; Metazoa; Chordata; Vertebrata; Gnathostomata; Osteichthyes; Sarcopterygii; Choanata; Tetrapoda; Amniota; Mammalia; Theria; Eutheria; Archonta; Primates; Catarrhini; Hominidae; Homo; Homo sapiens. As mentioned above, any of these levels can be accepted for Boolean query refinement. Notice that the phylogenetic taxonomy, based on evolutionary relationships, can contain more nodes (branch points) than can be easily assigned to a fixed number of levels (i.e., kingdom, phylum, class, order, family, genus, species, and their sub- and super-level categories). For this reason, and because assignment of taxonomic names to categories is subject to controversy, we do not display what level each node represents. Furthermore, it is not possible to infer level from the suffix of the term, since taxonomic naming conventions differ in different domains of systematics. 
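Retrieval at any level of the hierarchy can be pictured as a test of whether a record's organism lineage passes through the chosen node. The Python sketch below is a simplified illustration, not the TaxMan or Entrez implementation. The human lineage is abbreviated from the one quoted above, while the yeast lineage, the record UIDs, and the function name records_at are assumptions made for the example.

```python
# Abbreviated lineages, root-most node first. The human lineage is shortened
# from the one quoted in the text; the yeast lineage is an assumed illustration.
LINEAGE = {
    "Homo sapiens": ["Eucaryotae", "Metazoa", "Chordata", "Mammalia",
                     "Primates", "Homo", "Homo sapiens"],
    "Saccharomyces cerevisiae": ["Eucaryotae", "Eumycota",
                                 "Saccharomyces cerevisiae"],
}

# Hypothetical sequence records: UID -> organism named in the record.
RECORDS = {
    101: "Homo sapiens",
    102: "Homo sapiens",
    201: "Saccharomyces cerevisiae",
}


def records_at(node):
    """Return UIDs of records whose organism lineage passes through `node`."""
    return {uid for uid, organism in RECORDS.items()
            if node in LINEAGE[organism]}


print(sorted(records_at("Eumycota")))    # fungal records only
print(sorted(records_at("Eucaryotae")))  # every record in this toy set
```

Because the test is membership in a lineage rather than a match on a fixed rank, the sketch works without knowing whether a node is a kingdom, phylum, or genus.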
The source databases also differed in breadth and depth of coverage at various levels in their taxonomies. This is one reason why Deuterostomia is not shown as a taxonomic node between the kingdom Metazoa and the phylum Chordata in the example above. The other reason is that Deuterostomia is a morphological distinction that may not reflect a fundamental evolutionary event (branch point). Taxonomy mode can be used to browse through and examine the structure of the taxonomic hierarchy. The taxonomy itself is maintained with TaxMan, a tree-structured database management program developed at NCBI for this project. Protein Name The Protein Name field is used to search for a specific protein or type of protein. The common name of a protein may not be indexed in this field. Try using the Text Terms field if you cannot find a particular protein using Protein Name. Substance The Substance field is used to search chemical names from the Chemical Abstract Service (CAS) registry and the MEDLINE Name of Substance field. Text Terms (Protein and Nucleotide) See Abstract or Title section. MENU BAR OPTIONS The menu titles used in Entrez are File, Edit, Options, and Preferences. These appear on the menu bar on the Macintosh, and in the appropriate windows on the PC. Each of the menus is illustrated below with a brief description of commands that are unique to Entrez. FILE MENU The File menu contains the standard Save, Save As, Print, and Quit choices. Open can be used to open a checked document instead of double clicking, and Save will save all checked documents to a single file. If you Save an open document, Entrez will construct a legal file name from the document's window title. Save All will save all records without the need to check them first (which would also calculate their neighbors). Print Documents must be open in order to print. Printer fonts can now be specified independently from screen fonts. See the section on Customization below. 
Save Uid List / Load Uid List The lists of unique identification numbers (UIDs) for a Boolean query in the Query Refinement box or given set of document summaries in the Document window can be saved to a file for later use. When the saved list is later loaded back into Entrez it is placed into the Query Refinement box. This allows Boolean queries to be performed on the results of neighboring, and allows the UIDs for records at any taxonomic level to be obtained. OPTIONS MENU +-----------------+ | MEDLINE Report | | MEDLARS Format | | MEDLINE ASN.1 | |-----------------| | Sequence Report | | GenBank Format | | FASTA Format | | Sequence ASN.1 | |-----------------| | NucProt | | SegSet | | BioSeq | +-----------------+ These options allow you to display, save, and print the MEDLINE and sequence records in a variety of formats. Choose the desired format prior to opening the document. MEDLINE Report / MEDLARS Format / MEDLINE ASN.1 A MEDLINE Report is a cleanly formatted presentation of title, authors (up to 10), institutional affiliation of first author, journal publication information, abstract (if available), MeSH terms, and unique MEDLINE ID. MEDLARS Format contains the same information found in MEDLINE Report format, plus some additional fields, in the original National Library of Medicine format with mnemonics for each field (e.g., TI for Title, SO for Source). Use this format when you are going to transfer records to personal bibliographic manager software, which needs the mnemonics to receive and store each field correctly. MEDLINE ASN.1 format displays the data in the Abstract Syntax Notation 1 (ASN.1) form in which it is stored. ASN.1 is a machine-readable syntax that can represent complex, hierarchical information in a well-defined linear form. All other MEDLINE report forms are generated from this information. 
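The value of the field mnemonics for bibliographic software can be seen in a small parsing sketch. The Python fragment below assumes a common tagged layout (a short uppercase mnemonic, a hyphen, the value, and indented continuation lines); the exact layout that Entrez writes is an assumption here, the AU mnemonic is used for authors by assumption, and the sample text is assembled from the citation shown in Figure 10 rather than copied from real Entrez output.

```python
import re

# Assumed layout of a tagged record; the real file may differ in spacing.
SAMPLE = """\
TI  - The NF1 locus encodes a protein functionally related to mammalian
      GAP and yeast IRA proteins.
AU  - Ballester R
AU  - Marchuk D
SO  - Cell 63: 851-9 (1990)
"""

# A field line starts in column 1 with a 2-4 letter mnemonic and a hyphen.
TAG_LINE = re.compile(r"^([A-Z]{2,4})\s*-\s?(.*)$")


def parse_tagged_record(text):
    """Collect field values by mnemonic; repeated fields become list items."""
    fields = {}
    current = None
    for line in text.splitlines():
        match = TAG_LINE.match(line)
        if match:
            current = match.group(1)
            fields.setdefault(current, []).append(match.group(2).strip())
        elif current and line.strip():
            # Indented continuation line: extend the most recent value.
            fields[current][-1] += " " + line.strip()
    return fields


record = parse_tagged_record(SAMPLE)
print(record["TI"][0])
print(record["AU"])
```

Keeping every field as a list lets repeated mnemonics, such as one line per author, be stored without special cases.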
Sequence Report / GenBank Format / FASTA Format / Sequence ASN.1 A Sequence Report is an easily readable summary of a sequence and associated features and citations. Sequence records from nucleotide databases can be viewed in GenBank Format, which is used by some commercial sequence analysis software packages. (PIR, SWISS-PROT, and PRF entries in Entrez do not contain nucleotide sequences and cannot be converted to GenBank format.) Records viewed in FASTA Format contain a definition line and a sequence. This is the format used by the FASTA sequence similarity search program. Sequence records are stored in Sequence ASN.1 format. Data from GenBank, EMBL, DDBJ, PIR, SWISS-PROT, PRF, and PDB have been converted into a common ASN.1 format. All information has been validated against the formal ASN.1 specification for sequence records developed at the NCBI. All other sequence report forms are generated from this information. NucProt / SegSet / BioSeq These choices control the level of complexity desired in any type of sequence report. A given sequence record can contain a nucleotide sequence and all of the protein sequences it is known to encode. (A sequence record will also contain the citations to articles in which the sequences were published.) This structure is called a NucProt set. We create this object out of individual component database records because it is a biologically meaningful construct combining a genetic sequence and its potential or actual products. The nucleotide sequence within a NucProt set may be composed of multiple segments, typically corresponding to exons or to individual sequence contigs. This substructure is called a SegSet. We will combine records into a SegSet when they are listed as "segment m of n", when a coding region has a join across multiple accession numbers, or in journal scanning when the data are presented in discrete segments. Each individual contiguous sequence in these structures is called a BioSeq. 
Use of SegSet and BioSeq will give you a more traditional view of the data in a NucProt set. PREFERENCES MENU +-------------------+ | Parents Persist | | Show Sequence | | Use Timer +----+ | CharsPerLine > | 30 | | Query Defaults > | 40 | +------------------+ 50 | | 60 | +----+ There are five entries on the Preferences menu. The function of each is described below. Parents Persist Causes the documents used for neighboring to appear at the top of the next generation. This can be useful for building a complete collection of relevant sequences for viewing as a group. Show Sequence Controls the display of sequence data in the Sequence Report. Use Timer Controls when the term list appears in Selection Mode. If on, Entrez will retrieve the term list when you pause in your typing. If off, Entrez waits for you to use the Accept button or press Return. Chars Per Line A sub-menu that controls the length of line for sequence data in the Sequence Report. Query Defaults A sub-menu that allows the user to save the Current database/field/mode and options combination, or to restore the Original settings. The Entrez Query window and Options menu will start up with the latest default settings. DOCUMENT WINDOW CONTROLS /----------------\ /------\ | Neighbor 0 | (*) MEDLINE () Protein () Nucleotide | Prev | \----------------/ \------/ /-----\ /------\ /---------\ /------\ Select: | ALL | | None | | Parents | | Next | \-----/ \------/ \---------/ \------/ There are four groups of buttons at the bottom of the Document window that control neighboring and linking: Neighbor / Lookup This button reads "Neighbor" when the "target" database is the same as the current database (indicated by the type of documents currently in the Document window). It reads "Lookup" when the target database is different. It is active when more than 0 documents can be retrieved. MEDLINE / Protein / Nucleotide This radio button group selects the target database for neighboring or linking. 

(Select) All / None / Parents
This set of buttons allows convenient checking of documents. (Select) All and (Select) None check all documents or no documents, respectively. If the Parents Persist preference is set, the parents of a neighboring are carried over to the next generation and are marked with a bullet. Pressing the (Select) Parents button will check only these documents.

Prev / Next
Each retrieval (whether Boolean, neighbor, or lookup) creates a new "generation" in the Document window. The Prev and Next buttons allow you to examine the "history" of these generations, without needing to recreate a query.

ADVANCED BOOLEAN QUERIES

 /---------------------------------\   /---------\   /--------------------\
 |      Retrieve 0 Documents       |   |  Reset  |   |  More Booleans...  |
 \---------------------------------/   \---------/   \--------------------/

Although the graphical method of query refinement is convenient, and adequate for most situations, it does not allow complex nested Boolean algebraic expressions, and it does not allow the specification of documents to be excluded from a search. The More Booleans button provides the ability to specify an arbitrarily complex Boolean algebraic query expression. An example of a simple query is:

     ("Roberts R*" [AUTH] | "Sharp P*" [AUTH]) & "Splicing*" [WORD]

The asterisk indicates truncation (useful if you do not know an author's middle initial). The field names are WORD, MESH, AUTH, JOUR, GENE, KYWD, ECNO, ORGN, ACCN, and PROT. The field [*] indicates that all fields should be examined.

Note that much of this complex expression composition can be performed using point-and-click, by using the Paste from Query Refinement button in the More Booleans window. This button pastes the logical equivalent of the Query window's Retrieve Documents button into the complex expression.
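
For readers who assemble such expressions outside of Entrez (for example, to keep a list of saved queries in a text file), the following Python sketch shows one way to build the query string shown above. It is only an illustration of the syntax described in this section: the helper names (term, any_of, all_of, group) are hypothetical and are not part of Entrez, and only the | and & operators that appear in the example are used.

```python
# Field names listed in this section; "*" means "all fields".
FIELDS = {"WORD", "MESH", "AUTH", "JOUR", "GENE", "KYWD",
          "ECNO", "ORGN", "ACCN", "PROT", "*"}


def term(text, field, truncate=False):
    """Format one quoted term with its field, e.g. "Sharp P*" [AUTH]."""
    if field not in FIELDS:
        raise ValueError("unknown field: " + field)
    suffix = "*" if truncate else ""   # the asterisk requests truncation
    return '"{}{}" [{}]'.format(text, suffix, field)


def any_of(*expressions):
    """Join expressions with |, the OR operator used in the example."""
    return " | ".join(expressions)


def all_of(*expressions):
    """Join expressions with &, the AND operator used in the example."""
    return " & ".join(expressions)


def group(expression):
    """Wrap an expression in parentheses so it is evaluated as a unit."""
    return "(" + expression + ")"


# Rebuild the example query from this section.
authors = group(any_of(term("Roberts R", "AUTH", truncate=True),
                       term("Sharp P", "AUTH", truncate=True)))
query = all_of(authors, term("Splicing", "WORD", truncate=True))
print(query)
```

The printed text can then be typed or pasted into the More Booleans window. Grouping is kept as a separate step so that nested expressions are always parenthesized explicitly rather than relying on operator precedence.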

SEARCH STRATEGIES

In using MEDLINE neighboring for pursuing a particular piece of information, as well as for exploring the literature in general, it may be more productive to select the first few relevant retrieved documents and perform an additional round of neighboring than to read through the entire list of retrieved documents. (This is in contrast to the results of a Boolean query, which are not ordered by any measure of relevance, and thus should be completely scanned.)

The Medical Subject Headings (MeSH) field contains terms applied by trained indexers at the National Library of Medicine. You can choose a text term from the Title or Abstract field, and an equivalent MeSH term from the MeSH field, and OR them together in the Query Refinement box, to get more inclusive coverage of a concept.

One reason that queries may fail is that the particular term entered is only used in one or a few documents. You can use Selection mode to peruse the list of available terms. An example term list is shown in Figure 2 of the Tutorial (on page 5). The term "binding" (which is used in numerous articles) is followed by a number of hyphenated terms, most of which are only used in one document. Typing "binding-activity" would not be a good way of finding articles that talk about binding and activity. Those terms should be entered separately. You can select "binding" in Truncation mode to collect all of the papers indexed by term under the general concept of binding.

Because there are several sequence database sources, which have different conventions and different levels of annotation, it is difficult to use a text term or keyword search to find all similar sequences. In some cases the best way to compensate for such word retrieval shortcomings is to locate one record and retrieve its neighbors. The MEDLINE abstract for the paper in which a sequence was published does contain many relevant terms.
One way to retrieve sequences by terms is to first retrieve MEDLINE articles by term, then switch to a sequence database as the target, press the (Select) All button, and Lookup the associated sequence records.

For nucleotide neighboring, note that there will be many cases where a record has no neighbors that meet the minimum significance and satisfy the criterion of overlapping sequences. In cases where a gene product is a protein, neighboring in protein space may pick up the desired neighbors, while nucleotide neighboring very likely will not, due in part to the degeneracy of the genetic code and to different codon usage in different organisms.

The statistical text retrieval method of calculating MEDLINE neighbors can only provide an approximation of biological relevance. It naturally has limitations, and is not a substitute for human intelligence in interpreting abstracts and following a train of thought. The recently discerned connection between apolipoprotein E and Alzheimer's Disease illustrates this point. Because of their emphasis (as measured by use of terms), the few articles retrieved by truncation on "apolipoprotein" and "Alzheimer" neighbor only Alzheimer articles, not those on apolipoprotein and its role in cholesterol metabolism. If you come across and recognize such a situation, you could follow the secondary topic by performing a new query specific to the other term.

The apolipoprotein and Alzheimer situation described above is also a case of interacting proteins. As mentioned in the tutorial, neighboring by BLAST sequence similarity will not find proteins that interact with one another, unless they happen to be evolutionarily related. Information on protein interaction, however, may frequently be found in the comments of a sequence record or the abstract from the associated MEDLINE record.

INSTALLATION

The Entrez databases are distributed on two discs: Entrez: Sequences and Entrez: References.
Nucleotide and protein sequence records are on the Entrez: Sequences disc, and MEDLINE records are on the Entrez: References disc. You may wish to occasionally install the latest version of the Entrez application, in order to take advantage of improvements in the software.

Entrez should be configured to use both discs with disc swapping enabled if you have a single CD-ROM drive. Although you can configure Entrez to prohibit swapping, Entrez will then not be able to link directly between sequence records and the associated MEDLINE records. Do this only if your drive is incapable of ejecting a disc in the middle of an application. (Contact NCBI for further assistance in this rare situation.)

Update Installation

Users who copy index files to their hard disk for faster performance, or who specify disc swapping, will need to rerun the EntrezCf program (see below) with each new release in order to update the hard disk images of index or cdromdat.val files.

Extraction of Entrez Software Archives

Installation of Entrez is a two-step process. In the first step, the Entrez application, EntrezCf configuration program, and accessory files and directories, are "extracted" from "archives" on the CD-ROM and placed onto your hard disk. The Entrez: Sequences disc contains self-extracting archives for the Macintosh and PC/Windows versions of Entrez. The Entrez: References disc contains an identical copy of each archive. You need only perform the extraction from one of the two discs.

EntrezCf Configuration Program

In the second step, the EntrezCf program (extracted in the first step) allows you to configure Entrez for your particular machine.
EntrezCf asks you to specify which discs you want to use (Entrez: Sequences or Entrez: References or both), where the data sources are located (on the original CD-ROMs, copied to a hard disk, or mounted on a file server), whether you want to copy the index files to your hard disk for improved performance of document retrieval, how many CD-ROM drives you have, and whether you want to allow disc ejection and swapping, so that both the Sequences and References discs may be searched in a single CD-ROM drive system. The program will then ask you to enter the paths to each data source (volume name or drive letter). It will finish by copying the index files (if requested) and writing the ncbi configuration file (discussed below).

ncbi and entrez Configuration Files

Two configuration files are necessary for running Entrez. The ncbi configuration file specifies paths for access to the sequence and MEDLINE records and to ASN.1 parse tables and accessory data and index files. The entrez configuration file maintains persistent preferences, allows built-in fonts to be overridden, and allows the save path and default names to be changed.

Although the configuration files have a common format, their names and locations are decided by platform-specific conventions. On the Macintosh, the files are of the form xxx.cnf, and reside in the System Folder or the System Folder:Preferences folder. Under Windows, they are of the form XXX.INI, and reside in the Windows directory.

The EntrezCf configuration program ensures that the ncbi configuration file is created in the proper location and is configured properly for your particular machine. The Entrez application will set up the entrez configuration file with certain default values, if one does not already exist.

MACINTOSH SOFTWARE EXTRACTION

Requirements

A Macintosh computer with at least 2 MB of memory, a hard disk drive, and a CD-ROM drive.
If Entrez gives you memory limitation warnings when opening a document, increase the memory partition for the Entrez application with the Finder's Get Info choice in the File menu.

In addition to the CD-ROM driver (e.g., Apple CD-ROM), you will also need the Foreign File Access and ISO 9660 File Access files. If you have an extension that is used to read PC diskettes (e.g., DOS Mounter), it must be disabled. Such programs preempt recognition of the ISO 9660 CD-ROM.

Software Update Installation

Update installation requires the same steps as a new installation.

New Installation

The Entrez retrieval program is supplied as a Compact Pro self-extracting archive. To extract the program, follow these steps:

*  Turn on the CD-ROM drive before starting the computer. Then start up your Macintosh.

*  Create a new folder named ENTREZ.

*  Insert the Entrez: Sequences CD-ROM disc, then select and open the SEQDATA icon.

*  Select and open the ENTREZ folder. Then select and open the MAC folder.

*  Select and open the ENTREZ.SEA application. A dialog box will appear. Press the Desktop button and select and open your hard disk. Then select and open (or double-click on) the ENTREZ folder as the destination. Use the Extract button to extract the file into this folder. (Click on the Replace ALL Duplicates button if asked.) The ENTREZ folder will now hold the Entrez and EntrezCf applications, and asnload, data, seq and ref folders.

*  Return to your ENTREZ folder, select and open the EntrezCf icon to start the configuration application, and follow the instructions in the Configuration section.

MICROSOFT WINDOWS SOFTWARE EXTRACTION

Requirements

A personal computer that is based on an Intel 286, 386 or 486 microprocessor (or compatible), at least 2 MB of memory, a hard disk drive, a graphics display, a mouse or other pointing device, and a CD-ROM drive.
MS-DOS operating system (version 3.1 or later), Microsoft Windows (version 3.1 or later, or Microsoft Windows for Workgroups, version 3.1 or later), and drivers that are supplied by the CD-ROM manufacturer.

Software Update Installation

Update installation requires the same steps as a new installation.

New Installation

The Entrez retrieval program is supplied as a PKZIP self-extracting archive. To extract the program, follow these steps:

*  Insert the Entrez: Sequences CD-ROM disc into the drive and make it the current drive (assumed to be drive D in the example below). Change the current directory to ENTREZ\WIN and type the command "install" followed by the name of the directory into which the software should be copied (C:\ENTREZ in the example). Then return to the hard disk. The ENTREZ directory will now hold the ENTREZ.EXE and ENTREZCF.EXE applications, and ASNLOAD, DATA, SEQ and REF directories.

        C:> d:
        D:> cd entrez\win
        D:> install c:\entrez
        D:> c:
        C:>

*  Start Windows. For convenience, you may want to create a program group and program items (icons) for Entrez and EntrezCf in your Program Manager window, if they have not already been created. Refer to the Microsoft Windows User's Guide for instructions.

*  Start the EntrezCf configuration application, and follow the instructions in the Configuration section.

CONFIGURATION

EntrezCf allows you to tailor your Entrez installation to suit your particular computer and network resources. Select Install in the opening window.

In most cases, EntrezCf will be able to find the folder or directory that was just extracted (in which it should reside). If it cannot, it will put up a file dialog box and ask you to find the folder or directory. If this is the case, you will need to press the Save button when you can see directories named asnload and data.
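
The manual check described above (the right folder is the one in which you can see directories named asnload and data) can be sketched as a small Python helper. This is only an illustration and not how EntrezCf itself locates the folder. The function name looks_like_entrez_folder is hypothetical, and matching the names without regard to case is an assumption made so that both the Macintosh (asnload, data) and Windows (ASNLOAD, DATA) spellings are accepted.

```python
import os

# Directory names that, per the text above, identify the extracted folder.
REQUIRED_DIRS = ("asnload", "data")


def looks_like_entrez_folder(path):
    """Return True if path contains subdirectories named asnload and data."""
    try:
        entries = os.listdir(path)
    except OSError:
        # Missing or unreadable folder: it cannot be the Entrez folder.
        return False
    # Collect the lower-cased names of the entries that are directories.
    present = {
        name.lower()
        for name in entries
        if os.path.isdir(os.path.join(path, name))
    }
    return all(required in present for required in REQUIRED_DIRS)


if __name__ == "__main__":
    # Example path from the Windows installation steps; adjust as needed.
    print(looks_like_entrez_folder(r"C:\ENTREZ"))
```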

EntrezCf then displays a window asking which discs you want to use, where the data sources are located, whether you want to copy the index files to your hard disk for faster document retrieval, how many CD-ROM drives you have, and whether to allow disc swapping on a single drive.

+-----------------------------------------------------------------------+
|                               EntrezCf                                |
+-----------------------------------------------------------------------+
|                                                                       |
|  Which data sources do you wish to use?                               |
|                                                                       |
|  +-Data Sources----------------------------------------------------+  |
|  |                                                                 |  |
|  |  [x] Entrez: Sequences           [x] Entrez: References         |  |
|  +-----------------------------------------------------------------+  |
|                                                                       |
|  On what media are these data sources?                                |
|                                                                       |
|  +-Entrez: Sequences--------+        +-Entrez: References--------+    |
|  |                          |        |                           |    |
|  |  (*) CD-ROM Drive        |        |  (*) CD-ROM Drive         |    |
|  |  ( ) Local Hard Disk     |        |  ( ) Local Hard Disk      |    |
|  |  ( ) File Server         |        |  ( ) File Server          |    |
|  +--------------------------+        +---------------------------+    |
|                                                                       |
|  Which index files do you want copied?                                |
|                                                                       |
|  +-Copy Index Files------------------------------------------------+  |
|  |                                                                 |  |
|  |  [x] Entrez: Sequences           [x] Entrez: References         |  |
|  +-----------------------------------------------------------------+  |
|                                                                       |
|  What are your hardware resources?                                    |
|                                                                       |
|  +-Number of CD drives----------+                                     |
|  |                              |                                     |
|  |  ( ) 0    (*) 1    ( ) 2     |  [x] Disc Swapping                  |
|  +------------------------------+                                     |
|                                                                       |
|                    /==========\        /--------\                     |
|                    |  Accept  |        |  Quit  |                     |
|                    \==========/        \--------/                     |
|                                                                       |
+-----------------------------------------------------------------------+

In this example the user has chosen to use both the Entrez: Sequences and Entrez: References discs in a single CD-ROM drive, with both sets of index files copied for faster performance, and with disc swapping enabled. Therefore, at times in a given session of Entrez, the software will eject one disc and request the other when necessary.

To use Entrez: Sequences and Entrez: References on a system with one CD-ROM drive, Disc Swapping should be checked (the default choice). Do not disable disc swapping unless your CD-ROM drive is incapable of ejecting a disc in the middle of an application. With disc swapping enabled, Entrez will prompt you to insert the References disc whenever the drive contains the Sequences disc and you choose the MEDLINE database. Conversely, you will be prompted to insert the Sequences disc whenever the drive contains the References disc and you choose to search the protein or nucleotide databases. If you choose this option, it is strongly recommended that you have two CD-ROM caddies available. Frequent switching of CD-ROMs between a single caddy and the CD jewel boxes is very tedious.

Once you accept the settings, EntrezCf will ask you for the paths to the desired data sources. This is generally either the volume name (e.g., SEQDATA:) or the drive letter (e.g., D:\). In most cases you can simply press Accept, although you may need to change the drive letter first. If you have copied the contents of the disc to a hard disk or file server, you need to specify the path to the cdromdat.val file at the root level of the subdirectory structure. You can enter this either in the text box or by pressing the Find cdromdat.val button to ask for a file open dialog box.

+----------------------------------------------------------------------+
|                               EntrezCf                               |
+----------------------------------------------------------------------+
|                                                                      |
|  Entrez: Sequences Volume                                            |
|                                                                      |
|               +--------------------------------------------------+   |
|  Root Path:   | SEQDATA:                                         |   |
|               +--------------------------------------------------+   |
|                                                                      |
|      /==========\      /---------------------\      /--------\       |
|      |  Accept  |      |  Find cdromdat.val  |      |  Quit  |       |
|      \==========/      \---------------------/      \--------/       |
|                                                                      |
+----------------------------------------------------------------------+

As you enter the paths for each data source, EntrezCf will copy the index files to your hard disk, if requested. If you are using a single CD-ROM drive with swapping, it will also copy the cdromdat.val file from each disc onto your hard disk. These copies are used by Entrez to determine which disc is inserted at any given point, and in the decision to eject the disc, when necessary. These files must be updated for each subsequent release of the Entrez databases by rerunning EntrezCf.

Currently, the Entrez: Sequences index files are around 3 MB, and the Entrez: References index files are around 2 MB, for a total of 5 MB if you choose to copy both sets onto your hard disk.

You can rerun the EntrezCf program if you decide to change the configuration parameters (such as if you get tired of having to swap discs with a single drive). Again, select Install in the opening window. EntrezCf will not copy the index files if identical index files had been copied in a previous configuration session.

TROUBLESHOOTING

Disc Recognition Failure

On the PC, if Entrez cannot find the CD, and the Windows File Manager also cannot open the CD icon (or indicates that there are no files on the disc), the problem may be caused by Windows overwriting the memory onto which the CD-ROM driver is mapped. You should exclude this region from being considered as available memory.
To do this, determine the memory addresses that the CD-ROM drive is using (check the manual for how to read the switch settings on the interface card or call the manufacturer). Then, configure your memory manager or DOS extender program to exclude these addresses (check the manual for what command options to use or call the manufacturer for instructions).

On the Macintosh, if the CD is rejected as not being a "Macintosh disc", you may have an extension (e.g., DOS Mounter) that preempts recognition of the ISO 9660 disc. If so, you should drag the offending file out of the Extensions folder and restart the computer.

Another possibility is that your Foreign File Access and ISO 9660 File Access drivers did not load properly. These are automatically installed along with the CD-ROM driver in the Extensions folder. However, under System 7.0, they must manually be dragged into the System Folder itself, and the computer must then be restarted, in order for them to work. System 7.1 can use them in the Extensions folder.

Application Software Problems

As a general rule, if you have a problem that occurs when you get a new release of Entrez, you should delete the old software and extract the version on the disc. The problem may have been noticed or reported, and addressed in the most recent update of the application.

If Entrez fails, the cause may be that the ncbi configuration file has become corrupted. You should try reinstalling and reconfiguring after first deleting the existing ncbi configuration file. (On the Macintosh this file is named ncbi.cnf, and resides in the System Folder or the System Folder:Preferences folder. Under Windows it is named NCBI.INI, and resides in the Windows directory.)

If there is a problem it is best to first configure without copying index files (after deleting the configuration file). If this solves the problem then you can run EntrezCf again to reconfigure with copying index files.

If you get an error like "File open error on c:\entrez\terms\ntaccn.idx", it means that the ROOT entry of the ncbi configuration file is incorrect, and is pointing Entrez to the hard disk rather than the CD for the databases. In this case you should delete the existing ncbi configuration file and reconfigure. Remember that in many cases the default paths presented by EntrezCf will be correct (e.g., SEQDATA: or D:\), and you can simply press the Accept button without any typing.

If you get an error saying that "Index files are out of date", you should rerun EntrezCf, which will copy the new index files. (If you copy index files, you need to do this when you receive each new release.) If each time you swap discs you get the same message, it probably means that the ncbi configuration file is specifying the paths to the copied indices with an IDX entry in the [NCBI] section (instead of separate SEQIDX and REFIDX entries). You should delete the existing ncbi configuration file, delete the old software, delete the old indices, extract the new software, and reconfigure. The latest EntrezCf will no longer produce the obsolete IDX entry.

Printer fonts can now be specified independently from screen fonts. See the section on Customization below. The DISPLAY font is used for displaying (and printing) records in MEDLARS, GenBank flat file, and FASTA formats, and for the MEDLINE and Sequence ASN.1 formats. These reports do not word wrap, so the printer font should be adjusted if they print past the right margin. The MEDLINE Report and Sequence Report formats do perform word wrapping.

CUSTOMIZATION

The entrez configuration file has several possible sections. This file can be edited with a text editor, after it is first created by Entrez. (On the Macintosh it is named entrez.cnf, and resides in the System Folder or the System Folder:Preferences folder, while under Windows it is named ENTREZ.INI, and resides in the Windows directory.)
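
As an illustration of the file layout, the Python sketch below reads a few entrez configuration entries with the standard configparser module. It assumes the file follows the usual [Section] and Key=Value layout. Entrez itself uses its own reader, so this is not a description of how the application parses the file. The sample values are the example values given in this section, and the helper names load_settings and split_font are hypothetical.

```python
import configparser

# Sample contents, using the example values from this section.
SAMPLE = "\n".join([
    "[PREFERENCES]",
    "MAXLOAD=200",
    "[FONTS]",
    "JOURNAL=Geneva,10,i | Times,12,i",
    "[SAVE]",
    "PATH=C:\\SAVEHERE\\",
    "DEFREFFILE=ENTREZ.REF",
    "DEFSEQFILE=ENTREZ.SEQ",
])


def load_settings(text):
    """Parse configuration text into a ConfigParser object."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str   # keep key names exactly as written
    parser.read_string(text)
    return parser


def split_font(value):
    """Split 'screen | printer' into its parts; printer may be absent."""
    screen, _, printer = value.partition("|")
    return screen.strip(), (printer.strip() or None)


settings = load_settings(SAMPLE)
max_load = settings.getint("PREFERENCES", "MAXLOAD")
screen_font, printer_font = split_font(settings["FONTS"]["JOURNAL"])
save_path = settings["SAVE"]["PATH"]
print(max_load, screen_font, printer_font, save_path)
```

To inspect a real file, the same parser could be given the file's contents instead of SAMPLE. Always work on a copy, since Entrez rewrites this file when preferences change.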

The explanations below are for those configuration file elements that are not documented elsewhere or set under program control:

[Section]                Explanation
Key=Value

[PREFERENCES]
MAXLOAD=200              Maximum number of documents that can satisfy
                         a Boolean query and still be retrieved.

[FONTS]
JOURNAL=Geneva,10,i      These entries allow you to override the
                         default font specifications used by the
                         Entrez application. A printer font can be
                         specified after the screen font and a
                         vertical bar (e.g.,
                         "Geneva,10,i | Times,12,i").

[SAVE]
PATH=C:\SAVEHERE\        Path for saving files with the Save and Save
                         All menu commands. Default is the current
                         directory, i.e., the directory in which the
                         Entrez application is located.
DEFREFFILE=ENTREZ.REF    File name for saving MEDLINE records with the
                         Save and Save All menu commands. Default is
                         entrez.ref.
DEFSEQFILE=ENTREZ.SEQ    File name for saving sequence records with
                         the Save and Save All menu commands. Default
                         is entrez.seq.

The PC/Windows version of Entrez allows you to set the TMP environment variable to specify the path to the directory in which temporary files will be created. You can add a statement setting this variable in your AUTOEXEC.BAT file. This is useful for installations that allow Entrez to be run over a local area network.

ENTREZ ON OTHER PLATFORMS

NCBI makes executables available for other computer systems as a convenience to the community. These are available on an unsupported basis by anonymous ftp. To obtain these files, ftp to ncbi.nlm.nih.gov, enter anonymous as the user name and your email address as the password, cd to entrez, and get the README file, which gives further instructions.

Will Gilbert of the University of New Hampshire makes available NCBI software on DEC computers through an arrangement with Digital Equipment Corporation. To obtain these files, ftp to ncbi.wi.mit.edu, enter anonymous as the user name and your email address as the password, cd to entrez, and get the README file, which gives further instructions.

The source code for Entrez is available for persons who may wish to port it to another platform.

CREDITS

CD-ROM Data Coordination
     Greg Schuler

Entrez Application Software
     Jonathan Kans, Greg Schuler, Jim Ostell, Jonathan Epstein, Karl Sirotkin

Entrez Documentation
     Rose Marie Woodsmall, Jonathan Kans

Data Specification (ASN.1)
     Jim Ostell

Journal Scanning Coordination
     Kenn Rudd

Data Acquisition and Preparation
     Karl Sirotkin, Tim Clark, Carolyn Tolstoshev, Mark Cavanaugh, Scott Federhen, Rand Huntzinger, Cynthia Chung, Steve Bryant, Mark Boguski, Francis Ouellette

Neighbor Analysis
     John Wilbur, Warren Gish, David States, Herve Recipon, Brandon Brylawski

Graphic Design
     Greg Schuler

Production, Distribution and Support
     Jim Fleshman, Dennis Benson, Greg Schuler, Steven Rosenthal, Barbara Rapp, Rose Marie Woodsmall, Diana Airozo, Lisa Hackett

Project Direction
     David Lipman

ACKNOWLEDGMENTS

Sequence data have been incorporated from GenBank journal scanning and direct submission (NLM's Division of Library Operations); EMBL Nucleotide Sequence Database (European Molecular Biology Laboratory Data Library); DDBJ (DNA Data Bank of Japan); PIR Protein Sequence Database (National Biomedical Research Foundation, Martinsried Institute for Protein Sequences, and Japan Institute for Protein Information Database); SWISS-PROT Protein Sequence Database (Amos Bairoch and EMBL); PRF (Protein Research Foundation, Osaka); PDB (Brookhaven National Laboratory Protein Data Bank); dbEST (NCBI); and from sequences in patents published by the European Patent Office and the U.S. Patent and Trademark Office.

GenBank contractor support has been provided by Management Systems Designers, Inc., ComputerCraft Corporation, and KEVRIC Company, Inc.

MEDLINE bibliographic data has been provided by NLM's Division of Library Operations and Office of Computer and Communications Systems.