tadhg.com

Fun with Books and Data Models

23:56 Sun 16 Nov 2008. Updated: 17:16 28 Jan 2009

Fun might be the wrong word.

(Also, this is long. Condensed: I’ve been using Freebase to store my reading data, I wrote an Acre app to provide a custom view, and I discovered that my data model has some shortcomings.)

I’ve been playing with Acre some more, specifically on a long-term project of mine: to store data about the books I read in some system and then create views about my reading habits. Yes, compulsive list-making combined with programming/data geekery.

Anyway, I could have used a lot of other systems, such as Delicious Library or LibraryThing or Books, to store this information, but none of them seemed to have quite what I wanted (and most of them are proprietary). I could have written my own, and planned to, but I kept tweaking the data model and generally wasn’t sure how I wanted to deal with it.

I was going down the path of having objects for Books, instances of Books, Book reading events, and Authors. Then I encountered Freebase (I did say that this has been a long-term project) and decided that I liked the data model in its Publishing domain.

The relevant Freebase types for me are:
Book
Book Edition
Written Work
Author

Written Work is an included type for Book, so whenever you assert that something is a Book, the Freebase frontend asserts that it is also a Written Work. I considered Written Work to essentially be part of Book for my purposes.

I’ve been keeping track, in text files, of the books that I read for quite some time. I’ve gradually increased the information I gather about each one, so that what was originally just date started, date finished, title, author, and whether or not I’d read the book before, has become this:

07. 29/03/2008; Matter; Iain M. Banks
593 pages; started 28/03/2008; 2008; Orbit, New York Feb 2008; ISBN-10: 0316005363; ISBN-13: 9780316005364; FBG: 9202a8c04000641f80000000091ceee6;

The last part, the most recent addition, stands for FreeBase Guid—that is, if you tack /guid/ onto the front of it, you have a Freebase identifier. One of many ways to use this is to just put the Freebase view address in front of it, like so: http://www.freebase.com/view/guid/9202a8c04000641f80000000091ceee6, which gives you the link to the Freebase page for the edition of Matter that I read in March this year.
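As a rough sketch of that step (Python purely for illustration; the function names are mine and not part of any Freebase API), pulling the FBG value out of a detail line and turning it into an identifier and a view address could look something like this:

```python
import re
from typing import Optional

# View address prefix taken from the example link above.
FREEBASE_VIEW = "http://www.freebase.com/view"


def extract_fbg(detail_line: str) -> Optional[str]:
    """Return the hex GUID that follows 'FBG:' in a detail line, or None."""
    match = re.search(r"FBG:\s*([0-9a-f]+)", detail_line)
    return match.group(1) if match else None


def freebase_id(fbg: str) -> str:
    """Tack /guid/ onto the front to get a Freebase identifier."""
    return "/guid/" + fbg


def freebase_view_url(fbg: str) -> str:
    """Put the Freebase view address in front of the identifier."""
    return FREEBASE_VIEW + freebase_id(fbg)


detail = (
    "593 pages; started 28/03/2008; 2008; Orbit, New York Feb 2008; "
    "ISBN-10: 0316005363; ISBN-13: 9780316005364; "
    "FBG: 9202a8c04000641f80000000091ceee6;"
)
fbg = extract_fbg(detail)
print(freebase_id(fbg))
print(freebase_view_url(fbg))
```

The regular expression assumes the GUID is always lowercase hex, which holds for the records shown here but is an assumption about the rest of the file.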

That edition is in there because I added it. Freebase doesn’t have fantastic coverage for book data yet (though that may change), so at the moment I manually enter the edition data, and occasionally the book data, for each book I read.

So, once all that is in the system, there should be some way for me to tag the books that I’ve read. The easiest way to do this would be to create a type, something like “Books Tadhg has read”, with the date properties. I don’t like that approach because Freebase types are a little like tags, and my addition of this type to a topic would show up for everyone, which strikes me as a little like pollution (because I assume that details on what I’ve been reading are not as fascinating to everyone as they are to me and, no doubt, the readers of this blog).

Instead, I created the types Book Edition Reader and Book Edition Reading Event. Before going into that, let me explain why I used Book Edition instead of Book. It’s mainly because I want the easiest path to the data that’s relevant to me, which is the data that I capture myself, as in those two lines above indicating when I read Matter, how long it was, where it was published, and so on. Because a Book Edition can refer to only one Book, it seemed an easier path to specify just the Book Edition, from which the Book can later be extracted, whereas if I specified Book, I would also have to specify which Book Edition of that Book I meant. Book Edition, then, was the approach.

My approach is more convoluted than adding a type to every Book Edition, but still fairly simple. The Book Edition Reader type is attached to a person, and takes Book Edition Reading Event in a property that expects a list of things. Book Edition Reading Event has the reader (me, since nobody else has used this so far, although it’s there for anyone who wants it), the start date, the end date, and the Book Edition read.
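To make the shape of those two types a bit more concrete, here is a minimal sketch of them as plain Python classes. This is not how Freebase stores anything; it just mirrors the relationships described above. The class names, the `reader` field name, and the `record` helper are my own inventions for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class BookEdition:
    guid: str
    name: str


@dataclass
class BookEditionReadingEvent:
    book_edition: BookEdition
    start_date: date
    end_date: date
    # Back-reference to the person; kept out of repr/compare because
    # the reader also holds a list of these events.
    reader: "BookEditionReader" = field(repr=False, compare=False)


@dataclass
class BookEditionReader:
    name: str
    # The property that expects a list of reading events.
    book_editions_read: List[BookEditionReadingEvent] = field(default_factory=list)

    def record(self, edition: BookEdition, start: date, end: date) -> BookEditionReadingEvent:
        """Create a reading event and attach it to this reader."""
        event = BookEditionReadingEvent(edition, start, end, self)
        self.book_editions_read.append(event)
        return event


tadhg = BookEditionReader("Tadhg")
matter = BookEdition("9202a8c04000641f80000000091ceee6", "Matter")
event = tadhg.record(matter, date(2008, 3, 28), date(2008, 3, 29))
print(event)
```

The point of the indirection is that nothing gets attached to the Book Edition itself; everything about my reading hangs off the reader and the event.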

Simple. Once it’s all in there, I can get the list of Book Editions I’ve read with a straightforward query, like this one:


{
"book_editions_read" : [
  {
    "book_edition" : {
      "id" : null,
      "name" : null,
      "type" : "/book/book_edition"
    },
    "end_date" : null,
    "sort" : "end_date",
    "start_date" : null,
    "type" : "/user/tadhg/tbooks/book_edition_reading_event"
  }
],
"id" : "/guid/9202a8c04000641f8000000004904c64",
"type" : "/user/tadhg/tbooks/book_edition_reader"
}

Using the Freebase frontend's view capabilities, I can get a simple list pretty easily.
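For anyone who would rather handle the result in code, here is a small sketch of flattening it into rows. MQL results come back in the same shape as the query, with the nulls filled in. The sample result below is hand-written for illustration rather than real output; its ISO-style date strings, the edition names, and the single Viriconium event spanning 4 to 6 August are assumptions on my part.

```python
def list_editions_read(result):
    """Flatten a reader result into (end_date, start_date, name, id) rows."""
    rows = []
    for event in result.get("book_editions_read", []):
        edition = event["book_edition"]
        rows.append(
            (event["end_date"], event["start_date"], edition["name"], edition["id"])
        )
    # The query already asks for a sort on end_date; sorting again here
    # keeps the sketch correct when fed unsorted sample data.
    return sorted(rows, key=lambda row: row[0] or "")


# Hand-written sample in the shape of the query above (not real output).
sample_result = {
    "id": "/guid/9202a8c04000641f8000000004904c64",
    "type": "/user/tadhg/tbooks/book_edition_reader",
    "book_editions_read": [
        {
            "type": "/user/tadhg/tbooks/book_edition_reading_event",
            "book_edition": {
                "id": "/guid/9202a8c04000641f8000000008dd71df",
                "name": "Viriconium",
                "type": "/book/book_edition",
            },
            "start_date": "2008-08-04",
            "end_date": "2008-08-06",
        },
        {
            "type": "/user/tadhg/tbooks/book_edition_reading_event",
            "book_edition": {
                "id": "/guid/9202a8c04000641f80000000091ceee6",
                "name": "Matter",
                "type": "/book/book_edition",
            },
            "start_date": "2008-03-28",
            "end_date": "2008-03-29",
        },
    ],
}

rows = list_editions_read(sample_result)
for end, start, name, guid in rows:
    print(f"{start} to {end}: {name} ({guid})")
```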

However, what I really want is something that exposes data about the book (as well as the edition), such as genre and subject, and about the author, such as gender and nationality. A bunch of work with Acre got me to this (incomplete) view of books I've read this year.

So, it all looked good, and while it's a little slower than I would like, it's not the kind of thing I anticipate wanting really quick results from. But, as I was entering more books from this year, I ran into a snag—my data model is flawed as a way to represent the data I capture in my text files.

The reason is that what I consider a "book" isn't necessarily a physical book (which is what Book Edition maps to), but something more conceptual. The difference is clearly exposed when I read a book that's contained in a collection, such as Viriconium, which contains three M. John Harrison novels and a book of M. John Harrison short stories, and which my text file records represent as:
37. 04/08/2008; The Pastel City; M. John Harrison
108 pages; started 04/08/2008; 1971; [Viriconium; Bantam Spectra/Random House, New York November 2005; ISBN-10: 0553383159; ISBN-13: 9780553383157; FBG: 9202a8c04000641f8000000008dd71df]
38. 05/08/2008; Storm of Wings; M. John Harrison
146 pages; started 05/08/2008; 1980; [Viriconium; Bantam Spectra/Random House, New York November 2005; ISBN-10: 0553383159; ISBN-13: 9780553383157; FBG: 9202a8c04000641f8000000008dd71df]
39. 06/08/2008; In Viriconium; M. John Harrison
86 pages; started 05/08/2008; [Viriconium; Bantam Spectra/Random House, New York November 2005; ISBN-10: 0553383159; ISBN-13: 9780553383157; FBG: 9202a8c04000641f8000000008dd71df]
40. 06/08/2008; Viriconium Nights; M. John Harrison
121 pages; started 06/08/2008; [Viriconium; Bantam Spectra/Random House, New York November 2005; ISBN-10: 0553383159; ISBN-13: 9780553383157; FBG: 9202a8c04000641f8000000008dd71df]

The Book Edition is clearly the Viriconium collection. But if I enter that as a Book Edition Read, there's no way to denote which of the parts I read, or that there are four books (by my definition) in there. Furthermore, the numbers from my text files and the Freebase data would differ, and that's clearly not acceptable.
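The mismatch is easy to see with a little arithmetic. A quick sketch, using only the entry numbers, titles, and FBG values from the four records above (the tuple layout is just for illustration):

```python
VIRICONIUM_FBG = "9202a8c04000641f8000000008dd71df"

# (entry number, title, FBG) for the four text-file records above.
records = [
    (37, "The Pastel City", VIRICONIUM_FBG),
    (38, "Storm of Wings", VIRICONIUM_FBG),
    (39, "In Viriconium", VIRICONIUM_FBG),
    (40, "Viriconium Nights", VIRICONIUM_FBG),
]

# My text files count one book per numbered entry.
books_in_text_file = len(records)

# Counting by Book Edition collapses them, since all four share one GUID.
editions_in_freebase = len({fbg for _, _, fbg in records})

print(books_in_text_file, editions_in_freebase)  # 4 1
```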

Right now, I only see one way around this, which muddies my data model somewhat: to include Written Work as one of the properties of Book Edition Reading Event, so that I'd be entering both the Book Edition and the Written Work, but could leave Written Work blank if it would just point to the Book that the Book Edition points to. Then I need some special-case code: if the Written Work is set and isn't the same as what the Book Edition points to, I go fetch the data about the Written Work instead. It's just one property, but it makes it harder for other people to use the type (and I'd like to leave that option open), and it definitely increases the code complexity. However, I don't see another answer that satisfies these self-imposed requirements.
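The special case itself is small. A minimal sketch of the fallback rule, with a hypothetical `ReadingEvent` class and field names of my own choosing (plain strings stand in for Freebase topics here):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReadingEvent:
    book_edition: str                    # the edition that was read
    edition_book: str                    # the Book that edition points to
    written_work: Optional[str] = None   # blank unless it differs from the Book


def work_read(event: ReadingEvent) -> str:
    """Return the work to fetch data about for this reading event."""
    if event.written_work is None or event.written_work == event.edition_book:
        # Ordinary case: the edition's own Book is what was read.
        return event.edition_book
    # Anthology case: a specific work inside the edition was read.
    return event.written_work


matter = ReadingEvent("Matter", "Matter")
pastel_city = ReadingEvent("Viriconium", "Viriconium", "The Pastel City")

print(work_read(matter))       # Matter
print(work_read(pastel_city))  # The Pastel City
```

The extra complexity is less in this rule than in everything downstream of it, since genre, subject, and author lookups all have to start from whichever work the rule returns.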

Therefore it's back to work on this, just at the point where I thought I'd fixed the basic data-entry, data-display, and data-model problems and could have some fun working on graphing out a variety of more-or-less pointless stuff—instead, I have to try to jam the anthology special case into the existing code framework.

Once that's done, though, there will be graphs, yes there will.

One Response to “Fun with Books and Data Models”

  1. Niall Says:

    Yeah, you make a good point about not polluting the system with frivolous “tag” types. I like your approach.
