Initial Database Design

From Librivox wiki
Jump to: navigation, search

This initial database design is from forum user freqmod. The comments below are his. More formatting needed to lay this out properly. -- JoeMcMahon

Hello - I have created a set of database tables based on the work I have seen earlier in this thread; these tables uses many more relations which gives more possibilities to get interesting data and statistics.

I have tried to include all data specified by the drafts of the cataloging and work registring progress, including Dublin Core. I looked at the pages of Project Gutenberg and found their metadata as GPL-licensed Dublin Core RDF/XML files that may be used to get automatic titles, etc., even for books published earlier than ISBN as long as you know the e-text number.

I have renamed a book coordinatior a project coordinator and a book a work as it is more suited for poems, constitutions, supreme court decisions, etc. The tables are written in Postgresql [good choice! - JM] format like this:

CREATE TABLE <tablename> (
    <fieldid> <fieldtype> <constraint>
);

NOT NULL means that the fields are obligatory. CHARACTER VARYING(<length>) is text that is a maximum of <length> characters.

-- lines are comments

Fields ending with "id" are unique numbers/id's for each record that may be used to match information between the tables. PRIMARY KEY implies UNIQUE. REFERENCES <tablename>(<fieldid>) means that a record with that number has to exist in the other table named here; i.e. "mcid integer REFERENCES users(uid)," gives an error if somebody tries to insert a mcid that isn't a valid user(id).

I will probably look into creating a PHP/mysql system for cataloging using this database (but no promises)

The following table describes the users of the system and maintains the roles for the users.

CREATE TABLE users (
     uid integer PRIMARY KEY,
     username  character varying(30) NOT NULL,
     email character varying(255) NOT NULL,
 --  Sha1 hashed password + a random string as salt. Sha1 allways returns 40 characters
     passwd character varying(40)  NOT NULL,
     webpageURL character varying(255),
     catalogingName character varying(30) NOT NULL,
     dbcoordinator bool DEFAULT false,
     metacoordinator bool DEFAULT false,
 --  true: female, false:male bool is ok as there are only 2 choices,
 --- no sexism intended when choosing which sex that is true and false.
     sex bool NOT NULL
 --  Add more user releated data here: geographic locations in lattitude/logitude/city/state/contry etc.
 );

[ I'm wondering - is this the best way to go, or should we consider a user to have the basic "this is who I am" data, with possibly a link to a roles table ("this is what I do")? This way we have a roles table, and a user table, and then a user_roles table, with entries that refer to both a user and a role; it's then possible to query this table to see if a given user has a given role. This allows us to add roles without redoing the user table. - JM]

CREATE TABLE works (
    wid integer PRIMARY KEY,
    title character varying(50) NOT NULL,
    creator character varying(50) NOT NULL,
    subject character varying(50) DEFAULT '',
--  Keywords e.g. for search comma(,) separated:
    keywords character varying(100) DEFAULT '',
    [ another candidate for a mapped table ]
    contributors character varying(50) DEFAULT '',
    [ as is this ]
    origPubdate date NOT NULL,
    lbvSubmitdate date,
    lbvStartdate date DEFAULT NULL,
    lbvTargetdate date DEFAULT NULL,
    lbvPubdate date DEFAULT NULL,
    sourceURL character varying(255) NOT NULL,
    gutenbergID integer DEFAULT NULL,
    isbnnum character(13) DEFAULT NULL,
    lang character(2) NOT NULL DEFAULT 'en',
    coverage character varying(255) DEFAULT '',
    description character varying DEFAULT '',
    origPublisher  character varying(50) NOT NULL,
--  Wikipdia url:
    wpURL character varying(255) NOT NULL,
--  Project coordinator:
    pcid integer REFERENCES users(uid),
--  Meta coordinator:
    mcid integer REFERENCES users(uid),
--  0: proposal 1: accepted (by metacoordinator) 2: got project coordinator 3: some chapters claimed 4: all chapters claimed 5: some parts posted 6: all parts posted 7:book coordination complete (waiting for metacoordinator) 8: book completed, released and cataloged
    status integer DEFAULT 0,
--  Copyright checked:
    copyclear boolean NOT NULL DEFAULT false,
    category integer DEFAULT 0
--  Add more work related data here: tags etc.
);
-- Parts, chapters etc.
CREATE TABLE parts (
    pid integer PRIMARY KEY,
    wid integer REFERENCES works (wid) NOT NULL,
--  Who has claimed the part (may be null)
    uid integer REFERENCES users (uid),
--  Which status has is the part?
--  0: unclaimed 1: freed (is unclaimed has been claimed) 2: claimed 3: posted 4:validated (i.e. proofheard)
    satus smallint DEFAULT 0,
--  length for possible recorders to know how long the chapters are
--  0: considered tiny - 10: considered huge
    len smallint DEFAULT 0,
--  Name of the part e.g. chapter name
    partname character varying(50) NOT NULL
);
-- Actual recordings, a part may refer to multiple recordings, and multiple parts may refer to one recording (if the recording contains multiple chapters)
CREATE TABLE recordings (
    rid integer PRIMARY KEY,
--  The recording is of whitch part:
    pid integer REFERENCES parts (pid) NOT NULL,
--  Who did the recording:
    uid integer REFERENCES users (uid) NOT NULL,
    uploadDate date NOT NULL,
    version integer DEFAULT 0,
--  Size in bytes, 0 means no file exists:
    mp3hqfile integer DEFAULT 0,
    mp3lqfile integer DEFAULT 0,
    oggfile integer DEFAULT 0,
    spxfile integer DEFAULT 0,
--  Length in seconds (the program can determine long/medium/short etc. if wanted)
    runtime integer DEFAULT 0,
--  If multiple recordings of the same part exist why? is it multiple recordings of
--  the same work or is the part split up into smaller parts (or some other black magic)
    descr character varying(255) NOT NULL
);

It is possible get data based on relations between the tables:

By joining tables new and interesting tables may be selected: for example, to select all parts administrated by a meta-coordinator with user id mcid do:

SELECT * FROM works INNER JOIN parts USING (wid) WHERE mcid=<mcid>;

Then you can access all fields from works and parts at the same time for each part:

SELECT * FROM works INNER JOIN parts USING (wid) WHERE mcid=<mcid>;

To get all recordings andministated by a meta coordinator:

SELECT * FROM works INNER JOIN (parts INNER JOIN recordings USING (pid)) USING(wid) WHERE mcid=0;

To get all works by an user:

SELECT * FROM works INNER JOIN (parts INNER JOIN (recordings INNER JOIN users USING(uid)) USING (pid)) USING(wid) WHERE users.uid=0;

And many other possibillities.

Dublin core mapping

|| "data to be added as written" || <sql table> || <field> || ||Title || works || title || || Creator || works || creator || || Subject || works || subject || || Description || works || description || || Contributor || works || contributors + users catalogingName (for each user that recorded parts of the book) || || Date || works || oribPubdate || || Identifier ("Librivox") || recordings || rid || || Source || works || sourceURL (or etext url if sourceURL does not exists and the etext url may be generated from works gutenbergID) || || Language || works || language || || Coverage || works || coverage || Relation (internal book relation) may be works wid for recordings or whatever [need better explanation - JM]

[It looks to me like we should build a special table for the fixed Dublin Core data (the items not in table entries below) - JM] Publisher: "Librivox" Type: "Sound" Format: "mono "128kbps/MP3/44100kHz, 64kbps/MP3/44100kHz, -q5/OGG/44100kHz,-q5/speex/32000kHz" (or whichever formats available) Rights: "Public Domain"

ID3/vorbis tags (in the audio files): |||||| ID3-v1: || || Title || works || title" - " parts partname || || Artists || works || creator || || Album || works || title || || Year || works || lbvPubdate || || Genre: Speech [ "Spoken Word", as in iTunes? - JM] || [For artists: solo vs. collection? - JM ]

||||||ID3-v2.4.0 || || TCOP (copyright message) |||| "Librivox/public domain" || || TDOR (original release) || works || pubDate || || TDRC (release time) || recordings || uploadDate || || TPE1 (lead singer/soloist) || users || catalogingName || || TLAN (language) || works || lang || || Album-gain |||| <from replaygain http://en.wikipedia.org/wiki/Replay_Gain > || || Track-gain |||| <from replaygain http://en.wikipedia.org/wiki/Replay_Gain > ||

Look at http://www.id3.org/id3v2.4.0-frames.txt to see if more frames/tags might be used

||||||Vorbis tags (Speex/Vorbis/Flac) (supports "user" defined tags) || || ID3v2-<id3v2 4 character code> |||| id3v2-tag || || REPLAYGAIN-ALBUM |||| <from replaygain http://en.wikipedia.org/wiki/Replay_Gain > || || REPLAYGAIN-TRACK |||| <from replaygain http://en.wikipedia.org/wiki/Replay_Gain > || || PERFORMER || users || catalogingName || || ORGANIZATION |||| Librivox ||

DATE: works lbvPubdate LOCATION: (eventual location stored in the users table) DESCRIPTION: works description (is it any point in such a big type?) LBV-RECORDID: recordings rid other LBV- fields...

  • /