================
This is the first public release of the Pearson lab's "seqdb_demo"
relational database for sequence data obtained from the NCBI "nr"
flatfile (or other similarly formatted flatfiles available from the
NCBI: swissprot.fa, yeast.fa, etc).  This release consists of SQL
schemas usable in either a MySQL or PostgreSQL RDBMS environment.
Note that PostgreSQL support is incomplete and highly experimental;
please let us know if you run into problems.

We have tested this package locally, but undoubtedly there will be
problems when it is used "in the wild".  Please let us know when you
have problems.

Aaron J. Mackey (amackey@virginia.edu)
William R. Pearson (wrp@virginia.edu)

================
Requirements:

* an already functional MySQL or PostgreSQL installation.  You should
  already have the necessary user-level permissions to create new
  databases.
* Perl
* Perl modules:
	- DBI
	- DBD::mysql (or DBD::Pg for Postgres usage)
	- Inline::C
	- File::Temp
* a FASTA-formatted sequence library with description lines in the NCBI
  format.  An incredibly small test library is included (test.fa).
  You can download similar libraries from:

    ftp://ftp.ncbi.nih.gov/blast/db/

================
General installation notes:

* You may need to modify the connection information (db host, user,
  password, dbname, etc) at the top of updatenr.pl as appropriate to
  your environment (only if you're using it with a database other than
  the one automatically created for you by the included SQL scripts).

* You may provide an optional third numeric argument to "updatenr.pl"
  that limits the number of sequences read from the flatfile (e.g.
  "updatenr.pl mysql /path/to/nr.fa 1000" will stop after reading the
  first 1000 sequences).
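
To illustrate what the sequence limit does, here is a minimal Python
sketch of reading at most N records from a FASTA flatfile.  It is not
taken from updatenr.pl (which is written in Perl); the function names
read_fasta and read_limited are hypothetical and exist only for this
example.

```python
import io
from itertools import islice


def read_fasta(handle):
    """Yield (description, sequence) pairs from a FASTA-formatted handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def read_limited(handle, limit=None):
    """Return at most `limit` records; limit=None reads the whole file."""
    return list(islice(read_fasta(handle), limit))


# Hypothetical three-record library, used only for this demonstration.
demo_text = ">seq1 first\nMKV\nLAA\n>seq2 second\nGGH\n>seq3 third\nPPQ\n"
first_two = read_limited(io.StringIO(demo_text), 2)
everything = read_limited(io.StringIO(demo_text))
```

Because read_fasta is a generator, the reader stops consuming the file
as soon as the limit is reached, which is what makes a small limit
useful for quick trial runs against a large flatfile.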

================
MySQL installation:

% mysql < mysql/seqdb_demo.sql
% updatenr.pl mysql /path/to/nr.fa >> updatenr.log

================
PostgreSQL installation:

% dropdb seqdb_demo
% createdb seqdb_demo
% psql seqdb_demo < pgsql/seqdb_demo.sql
[ lots of psql notification messages ]
% updatenr.pl Pg /path/to/nr.fa >> updatenr.log

================
Testing the MySQL installation:

% mysql -u seqdb_demo -pdemo_pass seqdb_demo
mysql> select count(*) from protein;
+----------+
| count(*) |
+----------+
|       13 |
+----------+
1 row in set (0.00 sec)

mysql> quit
%

================
Other update scripts (note: these currently work only with the MySQL
database schema):

% updatetax.pl /my/directory/to/save/taxonomy/data/

[ this script connects via FTP to the NCBI taxonomy database and
downloads the necessary info; the directory you specify on the command
line is where updatetax.pl will store the info.  Note that the
directory must already exist. ]

% updatego.pl /my/cvs/checkout/go

[ this script depends on having a CVS checkout of the Gene Ontology
files; the directory you specify on the command line should contain
the subdirectories "ontology" and "gene_associations".  See
http://www.geneontology.org/#cvs for instructions.  Please note that
this script may take a very long time (> 2 hrs) to import all the GO
gene association information. ]

================
Caveats:

This database environment is not to be considered "production
quality"; rather, it exists only to demonstrate how one might set up an
extremely simple database for sequences gathered from NCBI-generated
"nr-like" FASTA-formatted flatfiles.  These flatfiles have entries
whose FASTA-formatted description line looks like:

>gi|15241446|ref|NP_196966.1| (NM_121466) putative protein [Arabidopsis thaliana]^Agi|11281152|pir||T48635 hypothetical protein T15N1.110 - Arabidopsis thaliana^Agi|7573311|emb|CAB87629.1| (AL163792) putative protein [Arabidopsis thaliana]
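
As a rough illustration of how such a description line breaks down,
the Python sketch below splits it into its individual entries.  It
assumes that "^A" above stands for a literal Control-A character
(0x01), as used in NCBI flatfiles to join the description lines of
identical sequences, and that each entry has the form
gi|number|db|accession|name description.  The function name
parse_defline and the dictionary keys are hypothetical; this is not the
parser used by updatenr.pl.

```python
def parse_defline(defline):
    """Split an nr-style description line into one dict per entry."""
    entries = []
    for part in defline.lstrip(">").split("\x01"):
        fields = part.split("|", 4)
        if len(fields) < 5 or fields[0] != "gi":
            continue  # skip anything not in gi|number|db|accession|... form
        _, gi, db, accession, rest = fields
        # The text after the fourth "|" is an optional name, a space,
        # and then the free-text description.
        name, _, description = rest.partition(" ")
        entries.append({
            "gi": int(gi),
            "db": db,
            "accession": accession,
            "name": name,
            "description": description,
        })
    return entries


# The example line from above, with Control-A written as "\x01".
example = "\x01".join([
    ">gi|15241446|ref|NP_196966.1| (NM_121466) putative protein [Arabidopsis thaliana]",
    "gi|11281152|pir||T48635 hypothetical protein T15N1.110 - Arabidopsis thaliana",
    "gi|7573311|emb|CAB87629.1| (AL163792) putative protein [Arabidopsis thaliana]",
])
entries = parse_defline(example)
for entry in entries:
    print(entry["gi"], entry["db"], entry["accession"] or "(no accession)")
```

Note that the second entry (pir) has an empty accession field and
carries its identifier in the name position instead.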

Because the "seqdb_demo" database relies on external database
accession numbers for primary keys, any SwissProt sequence that has
been "split" by the NCBI into separate records (and thus has no
SwissProt accession number) will still be imported into the database,
but will be given an accession number of "gi|123456", i.e. a GI-based
accession number.
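
The fallback described above can be sketched in a few lines of Python.
This is only an assumption about the rule's shape, not the actual code
in updatenr.pl, and the function name primary_accession is
hypothetical.

```python
def primary_accession(gi, accession):
    """Return the external accession, or a GI-based one if it is missing."""
    accession = (accession or "").strip()
    if accession:
        return accession
    # No usable external accession: build a GI-based key instead.
    return f"gi|{gi}"


with_accession = primary_accession(15241446, "NP_196966.1")
without_accession = primary_accession(123456, "")
print(with_accession, without_accession)
```

The point of the fallback is that every imported record still gets a
non-empty primary key, even when the external database supplied none.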



