Lately we’ve focused an awful lot on Summon when talking about all things resource discovery. Summon is the new kid on the block, after all, and we’ve needed to try to get a handle on it. It’s changing the way people use our resources, and there are still a lot of open questions about how it’s used and how it should be used.

But what about the catalog? We haven’t really been talking much about the catalog. Not that there hasn’t been any interest from my colleagues: on the contrary, I’ve heard loud and clear the desire for a catalog that can do faceted browsing and the discussions about possibly loading our catalog data into Summon. Unfortunately, as a III shop, we’ve always been at a disadvantage in this area. We’ve never had an easy time of getting our catalog data out of the catalog because our system hasn’t really supported it.

What if I told you that had changed? What if we suddenly had a way to load data from our ILS into other systems of our choosing in data formats of our choosing? And what if we could automate it, to the point that a record saved in the ILS would get extracted minutes later with no muss, no fuss into the formats and systems that we specify? If we had this, what sorts of things could we build? What new discovery tools and methods could we support?

My big project over the past five months has been exactly this. It’s all still in development and probably will be for a few more months—but it’s creeping ever closer to being production-ready. The problem—my problem—is that this thing I’m building is infrastructure. It’s invisible. There’s not really an end-user interface that I can point you to so you can look and get a feel for what it does. But it will enable things that we couldn’t have done before. It will enable us to do the things we’ve planned in our RDS Phase 2 Action Plan—and much more.

So this is the first in a series of blog posts that I wanted to write to introduce this thing in a way that might make it interesting and accessible. I’ll start out by laying some of the groundwork, examining why having control of our data is so important for the things we want to do.

 

The Importance of Data

Astute readers might now be thinking, “Wasn’t moving to Sierra supposed to help open up our data?” Yes—yes it was. With Sierra, III promised much. III billed the system as “open”—built on open technologies using a “service-oriented architecture.” Listening to their marketing, Sierra sounded almost as though it would provide an extensible ILS framework that we would be free to build upon. We were promised open APIs that would allow us to build new software components that would consume data from Sierra and be able to interact directly with Sierra.

But wait, let’s take a step back. I know we throw around terminology like “open architecture” and “open data” and “API,” but these are all kind of buzzword-y, aren’t they? We talk about these things like they’re important—but we’ve never stopped to explain why.

So What Use is Data, Anyway?

At its most basic, a computer program comprises a set of instructions that a computer follows to perform some task. Useful programs contain generalized instructions that can be reused in different contexts. Data is what a program uses to do stuff.

Say I want to program something that displays a formatted list of titles and authors for a set of books that I own. One approach would be to write a separate command to display each individual piece of information—each title and each author—like this:

    print("Title: Hamlet")
    print("Author: William Shakespeare")
    print()
    print("Title: The Sun Also Rises")
    print("Author: Ernest Hemingway")
    print()
    print("Title: The Great Gatsby")
    print("Author: F. Scott Fitzgerald")
    print()

I hope you see that this approach is pretty useless. It displays exactly what we told it to display and nothing more. What happens if we want to change the label “Title” to “The Title?” What happens if we want to change what displays between each book? We have to go through and change each instance in our code. This is clearly a very inefficient way to program, fraught with potential errors.

But we can write a more generalized program that does the same thing using fewer actual instructions. All we need to do is to store the title/author information for our book list in some sort of data construct. Then we can code a loop that will repeat one set of instructions for each item in the data.

    my_data = [{
        "t": "Hamlet",
        "a": "William Shakespeare"
    }, {
        "t": "The Sun Also Rises",
        "a": "Ernest Hemingway"
    }, {
        "t": "The Great Gatsby",
        "a": "F. Scott Fitzgerald"
    }]

    for book in my_data:
        print("Title: " + book['t'])
        print("Author: " + book['a'])
        print()

What’s happening here is that we’re storing all of the authors and titles in a nested data structure assigned to a variable called my_data. Our program loops over each element in that data structure, temporarily assigns it to the variable book, and then displays the title (book['t']) and author (book['a']). This is much better than the last version of our code: if we want to change how anything is displayed, there is only one place to change it.

In addition, now that the data is actually defined in a data structure, we can reuse it later in our program and do other things with it besides display it. We could write code to let us search it, for example, which we couldn’t do at all with the first version.
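For example, a simple search over the same structure might look like this (a minimal sketch; the find_books helper and its substring-matching behavior are my own invention, not part of the original program):

```python
def find_books(data, term):
    """Return every book whose title or author contains the search term."""
    term = term.lower()
    return [book for book in data
            if term in book['t'].lower() or term in book['a'].lower()]

my_data = [
    {"t": "Hamlet", "a": "William Shakespeare"},
    {"t": "The Sun Also Rises", "a": "Ernest Hemingway"},
    {"t": "The Great Gatsby", "a": "F. Scott Fitzgerald"},
]

for book in find_books(my_data, "william"):
    print("Title: " + book['t'])
```

None of this is possible when the titles and authors exist only as literal text inside print commands.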

But this still isn’t as good as we can make it. We’re still storing our data structure inside our program, and it’s still a little bit cumbersome to have to get the brackets and curly braces and the formatting all right when we’re editing the data. The next step might be to store the data in a separate file that’s a little more compact and a little easier to edit—say, a comma-delimited (CSV) file, like this:

    title,author
    Hamlet,William Shakespeare
    The Sun Also Rises,Ernest Hemingway
    The Great Gatsby,F. Scott Fitzgerald

To use this, we’d first have to code instructions in our program to access the file, parse the contents, create a data structure like the one in the last version, and load the data into memory. But if the resulting data structure is identical, the rest of the code works without modification.
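That loading step might look something like this—a sketch using Python’s standard csv module (the file contents are embedded in a string here so the example is self-contained; in practice you’d open the CSV file from disk):

```python
import csv
import io

# Stand-in for the CSV file on disk.
csv_text = """title,author
Hamlet,William Shakespeare
The Sun Also Rises,Ernest Hemingway
The Great Gatsby,F. Scott Fitzgerald
"""

# Parse the CSV into the same list-of-dictionaries structure as before,
# mapping the "title" and "author" columns onto our 't' and 'a' keys.
my_data = [{"t": row["title"], "a": row["author"]}
           for row in csv.DictReader(io.StringIO(csv_text))]

# The rest of the program is unchanged:
for book in my_data:
    print("Title: " + book['t'])
    print("Author: " + book['a'])
    print()
```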

Show Me the Data!

Once I have my data stored somewhere, like in a CSV file saved on my computer, I can write programs all day long that use it to do different things, and I don’t have to re-enter it anywhere. In a nutshell, that is one of the primary benefits of opening up data: it allows anyone with the necessary skills to write programs that do new things with it.

In contrast, our data in Millennium was locked away—it was both physically inaccessible to us (except what we could access through the Millennium user interface) and stored in a data format that was Millennium-specific, that we couldn’t make sense of even if we could access it. To be fair, we could export data in a few ways: as MARC via Data Exchange, as MARC via Z39.50, or as delimited data via Create Lists. But these methods were too limited or too manual to be of much real use.

With Sierra, although the reality has unsurprisingly fallen somewhat short of the marketing promises, III has cracked open the door for us. Sierra is built on open-source database and indexing software, unlike Millennium. III has given us direct read access to the database. With this little bit of extra freedom that we didn’t have with Millennium, they have opened the door to allow us to take control of our data.

But is direct database access enough?

 

Storing, Using, and Extracting Data

Turning back to our earlier example: as we use this program we’ve created, there might be more and more that we want it to be able to do. Say we want to be able to track the date that we added a book, the number of times each person in our family has read a book, the amount we spent on a book, and the person in the family that bought a book. Say we notice that we have multiple books by the same author and we want to start recording information about authors somewhere so that we don’t have to replicate it for each book. As we want to do more with our program, we need the data that supports that functionality, and we need our program to be able to read that data in order to work with it. And, as our data grows—both in size and in complexity—the format in which the data is physically stored begins to impact how smoothly and efficiently the program runs.
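Splitting authors out into their own records, for instance, might look like this (a sketch; every field name and value here is invented for illustration):

```python
# Authors stored once, keyed by an identifier, instead of repeated per book.
authors = {
    "shakespeare": {"name": "William Shakespeare", "born": 1564},
    "fitzgerald": {"name": "F. Scott Fitzgerald", "born": 1896},
}

# Each book points at its author record and carries the new fields we
# wanted to track: date added, price, and per-person read counts.
books = [
    {"title": "Hamlet", "author_id": "shakespeare",
     "date_added": "2014-01-15", "price": 9.99, "times_read": {"me": 2}},
    {"title": "The Great Gatsby", "author_id": "fitzgerald",
     "date_added": "2014-02-01", "price": 12.50, "times_read": {"me": 1}},
]

for book in books:
    print(book["title"] + " / " + authors[book["author_id"]]["name"])
```

Notice that the author’s name now lives in exactly one place, no matter how many of that author’s books we own.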

How a program stores data and what it stores are highly individual to that program. It has to be able to read stored data into variables as appropriate, use those variables to carry out its tasks, and then update the data in storage as needed. It’s best to store data in a way that requires as little processing as possible to translate to internal data structures: inefficient data access can slow down your program and make your code harder to understand and extend.

The upshot is that one system’s internal data store isn’t going to translate 1:1 to any other system or program—nor should it.

Let’s consider the ILS again and what I want to be able to accomplish. I don’t necessarily want to use the exact data as it’s stored in the ILS database. Each of the applications I want to write needs a different subset of data from the ILS; for each application I may need to store similar (or the same!) data quite differently.

At the moment Sierra allows us direct access to its internal database using SQL. SQL is the standard language used for querying relational databases: most programming languages have functionality that will let you query a database using SQL and pull data into internal programming structures with relative ease. With this, we can actually write programs that can read Sierra’s internal data.
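The basic pattern looks like this. Sierra’s underlying database is PostgreSQL, so in practice you’d connect with a PostgreSQL driver such as psycopg2 and query Sierra’s read-only views; the sketch below uses Python’s built-in sqlite3 as a stand-in, with invented table and column names:

```python
import sqlite3

# Stand-in database; with Sierra you would instead connect to its
# PostgreSQL server and query its read-only views.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (barcode TEXT, call_number TEXT, location TEXT)")
conn.executemany("INSERT INTO item VALUES (?, ?, ?)", [
    ("1002285505", "M128 .A016 2002", "mus"),
    ("1002189053", "M128 .A05 2005", "mus"),
    ("1002420293", "QA76 .C65 2010", "sci"),
])

# The core pattern: run an SQL query and pull the rows into ordinary
# in-memory programming structures.
rows = conn.execute(
    "SELECT barcode, call_number FROM item WHERE location = ? ORDER BY call_number",
    ("mus",),
).fetchall()

items = [{"barcode": b, "call_number": c} for b, c in rows]
for item in items:
    print(item["call_number"], item["barcode"])
```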

But you know what? As nice as that is, it isn’t good enough. Sierra’s database is designed to support the Sierra ILS; it wasn’t written to support my applications. III made design decisions when it built the database that make it work well for some purposes and not others. Let me give you an example that illustrates what I mean.

An Example: A Shelflist Application

One of the applications I started working on back in August is a simple shelflist builder to help with doing inventory. It lets you enter a location code and a call number range and pull back a list of items, sorted by call number, that you can then browse. You can search for call numbers and barcodes. You can mark items as on or not on the shelf.

The shelflist itself is a table of items, like this:

    Row  Call Number      Volume  Copy  Status     Barcode     Suppressed?  Marked
    1    M128 .A016 2002          1     AVAILABLE  1002285505  false        On Shelf
    2    M128 .A05 2005           1     AVAILABLE  1002189053  false        Not On Shelf
    3    M128 .A09 2007           1     AVAILABLE  1002420293  false

When you click on a row, it expands to show you the title, author, bib and item record number, and a link to the record in the WebPAC.

This data all comes from Sierra initially, of course. When you first create a shelflist, my application submits an SQL query to Sierra, gets the information back, and stores it as a flat file so that my code can access the data quickly and efficiently. But the data as stored in Sierra is anything but flat. The database structure is quite complex: item information and bib information live in separate tables, and variable-length fields in yet another table. Once I convert the data to the flat structure, access is instantaneous, but the initial query to build the shelflist takes 4 to 5 minutes, and sometimes even longer, to run. The SQL query is as efficient as I can make it; given the database structure and which fields III has and has not indexed, the query still takes forever. Waiting that long is unacceptable.
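The flattening step looks roughly like this (a much-simplified sketch: the nested bibs and items dictionaries stand in for Sierra’s separate tables, and all names and values are invented for illustration):

```python
import csv
import io

# Stand-ins for the separate tables Sierra uses: bib data in one place,
# item data in another (the real structures are far more complex).
bibs = {
    "b1000001": {"title": "Piano Sonatas", "author": "Beethoven, Ludwig van"},
    "b1000002": {"title": "String Quartets", "author": "Haydn, Joseph"},
}
items = [
    {"item_id": "i2000001", "bib_id": "b1000001",
     "call_number": "M23 .B4 2002", "barcode": "1002285505"},
    {"item_id": "i2000002", "bib_id": "b1000002",
     "call_number": "M452 .H4 2005", "barcode": "1002189053"},
]

# Join the pieces once, up front, and write one flat row per item;
# afterward the application reads this file directly instead of
# re-running the expensive join.
out = io.StringIO()  # stand-in for a file on disk
writer = csv.writer(out)
writer.writerow(["barcode", "call_number", "title", "author"])
for item in sorted(items, key=lambda i: i["call_number"]):
    bib = bibs[item["bib_id"]]
    writer.writerow([item["barcode"], item["call_number"],
                     bib["title"], bib["author"]])

flat = out.getvalue()
print(flat)
```

The expensive part (the join) happens once at build time; every subsequent lookup hits the flat file.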

So even though we have direct access to the Sierra database, we still don’t have control of our data. To have control, we need to be able to pull the data out of Sierra. We need to be able to extract it and put it into other storage mediums that we do control, ones that we can configure to support fast, efficient access to the data that our applications need. The direct access we have to the database allows us to do this, but we’ve had to build the functionality ourselves.

 

What About APIs?

When one system shares data with other systems, there are at least two potential gotchas. One we saw in the last section: a system’s internal data won’t necessarily translate well to a system for which it wasn’t designed and optimized. Another is that a system has to maintain the integrity of its data. If we allow multiple programs—even ones that we’ve written—to access and write to the same data store, we risk that one of the programs might write data that renders the store (or parts of it) unreadable or invalid for other programs. This is why even open systems don’t generally allow write access to their internal data.

One way to take care of these problems is to enable access to your system via an API, or Application Programming Interface. If someone has created an API for their system, it means that they have defined a set of commands that I, for example, could use in my own programs, allowing mine to interact with theirs in predefined ways. Maybe this allows my program to submit a query to their system to get back data that it can then use to do something. Or maybe it submits a call to their API with a data value and their API performs some calculation on the value and returns the result. Having access to APIs helps me extend the capabilities of my program so I don’t have to reinvent something that someone else has done. And APIs allow my programs to interact with other systems in a controlled, predetermined way—i.e., so that they don’t have to give me carte blanche access to everything in their systems and I don’t have to care about their internals.

APIs can be read-only or read-write. A read-only API only allows other programs to send commands for reading data, whereas a read-write API allows other programs to send commands that write data to the system as well.
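As a toy illustration of that distinction (the class and method names are invented; a real API would usually be exposed over HTTP, but the principle is the same):

```python
class CatalogAPI:
    """A toy API surface over an internal data store. Callers use only
    these methods; they never touch _store directly, so the system can
    guarantee its data stays valid."""

    def __init__(self):
        self._store = {
            "1002285505": {"title": "Piano Sonatas", "status": "AVAILABLE"},
        }

    # Read-only operation: safe to expose widely.
    def get_item(self, barcode):
        item = self._store.get(barcode)
        return dict(item) if item else None  # hand back a copy, not the internals

    # Read-write operation: validated, so callers can't corrupt the store.
    def set_status(self, barcode, status):
        if status not in ("AVAILABLE", "CHECKED OUT", "MISSING"):
            raise ValueError("invalid status: " + status)
        self._store[barcode]["status"] = status

api = CatalogAPI()
print(api.get_item("1002285505")["status"])
api.set_status("1002285505", "MISSING")
print(api.get_item("1002285505")["status"])
```

A read-only API would simply omit (or reject) anything like set_status; a read-write API includes it, with validation standing guard over the data.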

In a way, you can think of an API as being a counterpart to a UI (User Interface). A UI allows a person to input commands, enter data, and work with a system to accomplish some task. An API is the same, but for a computer program. It’s just as important to design an API well, so that it’s easy to understand and use, for the same reasons that it’s important to design a UI well.

The work I’ve been doing the past five months initially began as a desire to create an API for our catalog data. In fact, that’s still the major component. Having a “programming interface” built on top of our catalog data that can serve up different views of that data on request will make application development easier. (And I will come back to this in a later post.) 

At the same time, you may have seen recent announcements from III about their own “Sierra API,” which has been in the works for a long time. They’ve finally announced that they will be releasing the initial version of their API in April.

So you might be wondering: how does III’s API affect mine? Have I just wasted five months of work building something that’s now redundant?

  1. Consider my earlier point about why having direct access to Sierra’s database isn’t good enough. Just like the Sierra database, III has designed their APIs for some purpose. I haven’t seen them yet, so I can’t yet say how well-designed they are or how applicable they will be to the types of things we want to do. But they’re still their APIs. They are in control of what data they serve up and how that data is modeled. I want our applications to be built on top of our own APIs, because I want the option to serve data to our applications in a way that’s custom-tailored to what we need. The applications I build this way will be more fully functional and will perform better.
  2. Speaking of performance—if API access is slow, all the applications that use that API will be slow. I can ensure that our APIs are fast and responsive; I can’t ensure that III’s will be.
  3. Let’s think of our catalog data as separate from Sierra. III is building an API for Sierra. I’m building an API for our catalog data. Now that we can extract data from Sierra, and as we build more and more of our own applications, that distinction will become more and more meaningful. For instance: data in Sierra isn’t FRBRized. The data extraction process I’ve built would allow us to write an extraction routine that would analyze and FRBRize our catalog data as it gets extracted, loading it into a system as separate Work, Expression, Manifestation, and Item entities. Our API would then serve up this data as such to applications we build to use it. Sierra’s API couldn’t do this, because that’s not how Sierra stores data.
  4. I’m almost certain that there will be ways that my API will be able to use III’s. For instance, one thing mine won’t be able to do on its own is write data to Sierra, whereas some of III’s APIs will be read-write. The systems underlying my APIs could make calls to III’s to enable us to write to Sierra while still letting us retain all the advantages of controlling our own APIs.

Ultimately, building an API lets you give other programs access to data using standardized queries and commands. From the consuming program’s perspective, it’s not too different from storing data on the filesystem or in a database and using commands to read it into an in-memory data structure. Just as with every method for storing and accessing data, there are practical implications and considerations that affect how you build and design your API, depending on the needs of the applications that use the data. Building and maintaining our own API for catalog data means that these factors are under our control: we can build better applications and generally do more than we could otherwise.

Hopefully it’s now a little clearer why we’re building what we’re building. But I still haven’t given you a great idea of what exactly it is. In the next installment we’ll take a closer look, specifically at how the extraction process I’ve developed works and how it will run in a production environment. There are a lot of practical questions, like: how do we keep extracted data fresh and in sync with Sierra? Will it require any extra work from library staff to maintain? Stay tuned!
