{"id":15,"date":"2014-03-26T09:22:53","date_gmt":"2014-03-26T09:22:53","guid":{"rendered":""},"modified":"2015-03-27T15:06:51","modified_gmt":"2015-03-27T15:06:51","slug":"15","status":"publish","type":"post","link":"https:\/\/blogs.library.unt.edu\/rds\/2014\/03\/26\/15\/","title":{"rendered":"Discovery, the Catalog, Data, and APIs: Part 1"},"content":{"rendered":"<p>Lately we\u2019ve focused an awful lot on Summon when talking about all things resource discovery. Summon <em>is<\/em> the new kid on the block, after all, and we\u2019ve needed to try to get a handle on it. It\u2019s changing the way people use our resources, and there are still a lot of open questions about how it\u2019s used and how it should be used.<\/p><p>But what about the catalog? We haven\u2019t really been talking much about the catalog.&nbsp;<span style=\"line-height: 1.538em;\">Not that there hasn\u2019t been any interest from my colleagues: on the contrary, I\u2019ve heard loud and clear the desire for a catalog&nbsp;<\/span><span style=\"line-height: 1.538em;\">that can do faceted browsing and the discussions about possibly loading our catalog data into Summon. Unfortunately, as a III shop, we\u2019ve always been at a disadvantage in this area.&nbsp;<\/span><span style=\"line-height: 1.538em;\">We\u2019ve never had an easy time of getting our catalog data out of the catalog because our system hasn\u2019t really supported it.<\/span><\/p><p><span style=\"line-height: 1.538em;\">What if I told you that had&nbsp;<\/span><em style=\"line-height: 1.538em;\">changed<\/em><span style=\"line-height: 1.538em;\">? What if we suddenly had a way to load data from our ILS into other systems of our choosing in data formats of our choosing? And what if we could automate it, to the point that a record saved in the ILS would get extracted minutes later with no muss, no fuss into the formats and systems that we specify? If we had this, what sorts of things could we build? 
What new discovery tools and methods could we support?<\/span><\/p><p><span style=\"line-height: 1.538em;\">My big project over the past five months has been exactly this. It\u2019s all still in development and probably will be for a few more months\u2014but it\u2019s creeping ever closer to being production-ready. The problem\u2014my problem\u2014is that this thing I\u2019m building is <em>infrastructure<\/em>. It\u2019s invisible. There\u2019s not really an end-user interface that I can point you to so you can look and get a feel for what it does. But it will enable things that we couldn\u2019t have done before. It will enable us to do the things we\u2019ve planned in our RDS Phase 2 Action Plan\u2014and much more.<\/span><\/p><p><span style=\"line-height: 1.538em;\">So this is the first in a series of blog posts that I wanted to write to introduce this thing in a way that might make it interesting and accessible. I\u2019ll start out by laying some of the groundwork, examining why having control of our data is so important for the things we want to do.<\/span><\/p><p>&nbsp;<\/p><h2>The Importance of Data<\/h2><p>Astute readers might now be thinking, \u201cWasn\u2019t moving to Sierra supposed to help open up our data?\u201d Yes\u2014yes it was. With Sierra, III promised much. III billed the system as \u201copen\u201d\u2014built on open technologies using a \u201cservice-oriented architecture.\u201d Listening to their marketing, Sierra sounded almost as though it would provide an extensible ILS framework that we would be free to build upon. We were promised open APIs that would allow us to build new software components that would consume data from Sierra and be able to interact directly with Sierra.<\/p><p><span style=\"line-height: 1.538em;\">But wait, let\u2019s take a step back. 
I know we throw around terminology like&nbsp;<\/span><span style=\"line-height: 1.538em;\">\u201c<\/span><span style=\"line-height: 1.538em;\">open architecture<\/span><span style=\"line-height: 1.538em;\">\u201d<\/span><span style=\"line-height: 1.538em;\">&nbsp;and&nbsp;<\/span><span style=\"line-height: 1.538em;\">\u201c<\/span><span style=\"line-height: 1.538em;\">open data<\/span><span style=\"line-height: 1.538em;\">\u201d<\/span><span style=\"line-height: 1.538em;\">&nbsp;and&nbsp;<\/span><span style=\"line-height: 1.538em;\">\u201c<\/span><span style=\"line-height: 1.538em;\">API,<\/span><span style=\"line-height: 1.538em;\">\u201d<\/span><span style=\"line-height: 1.538em;\">&nbsp;but these are all kind of buzzword-y, aren\u2019t they? We talk about these things like they\u2019re important\u2014but we<\/span><span style=\"line-height: 1.538em;\">\u2019<\/span><span style=\"line-height: 1.538em;\">ve never stopped to explain why.<\/span><\/p><h3><span style=\"line-height: 1.538em;\">So What Use is Data, Anyway?<\/span><\/h3><p><span style=\"line-height: 1.538em;\">At its most basic, a computer program comprises a set of instructions that a computer follows to perform some task. Useful programs contain&nbsp;<\/span><em style=\"line-height: 1.538em;\">generalized<\/em><span style=\"line-height: 1.538em;\"> instructions that can be reused in different contexts. Data is what a program uses to do stuff.<\/span><\/p><p><span style=\"line-height: 1.538em;\">Say I want to program something that displays a formatted list of titles and authors for a set of books that I own. 
One approach would be to write a separate command to display each individual piece of information\u2014each title and each author, like this (where&nbsp;\u201c\\n\u201d is a new line):<\/span><\/p><pre><span style=\"line-height: 1.538em;\">    print(\"Title: Hamlet\\n\")<br \/>    print(\"Author: William Shakespeare\\n\")<br \/>    print(\"\\n\")<br \/>    print(\"Title: The Sun Also Rises\\n\")<br \/>    print(\"Author: Ernest Hemingway\\n\")<br \/>    print(\"\\n\")<br \/>    print(\"Title: The Great Gatsby\\n\")<br \/>    print(\"Author: F. Scott Fitzgerald\\n\")<br \/>    print(\"\\n\")<br \/><\/span><\/pre><p>I hope you see that this approach is pretty useless. It displays exactly what we told it to display and nothing more. What happens if we want to change the label \u201cTitle\u201d to \u201cThe Title?\u201d What happens if we want to change what displays between each book? We have to go through and change each instance in our code. This is clearly a very inefficient way to program, fraught with potential errors.<\/p><p>But we can write a more generalized program that does the same thing using fewer actual instructions. All we need to do is to store the title\/author information for our book list in some sort of data construct. Then we can code a loop that will repeat one set of instructions for each item in the data.<\/p><pre>    my_data = [{<br \/>        \"t\": \"Hamlet\",<br \/>        \"a\": \"William Shakespeare\"<br \/>    }, {<br \/>        \"t\": \"The Sun Also Rises\",<br \/>        \"a\": \"Ernest Hemingway\"<br \/>    }, {<br \/>        \"t\": \"The Great Gatsby\",<br \/>        \"a\": \"F. 
Scott Fitzgerald\"<br \/>    }]<br \/><br \/>    for book in my_data:<br \/>        print(\"Title: \" + book['t'] + \"\\n\")<br \/>        print(\"Author: \" + book['a'] + \"\\n\")<br \/>        print(\"\\n\")<\/pre><p>What\u2019s happening here is that we\u2019re storing all of the authors and titles in a nested data structure that we\u2019re putting in a variable called&nbsp;<em>my_data<\/em>. Our program loops over each element in that data structure, temporarily assigns it to the variable <em>book<\/em>, and then&nbsp;displays the title (<em>book[&#8216;t&#8217;]<\/em>) and author (<em>book[&#8216;a&#8217;]<\/em>). This is much better than the last version of our code. If we want to change how anything is displayed we only have one place to change it.<\/p><p>In addition, now that the data is actually defined in a data structure, we can reuse it later in our program and do other things with it besides display it. We could write code to let us search it, for example, which we couldn\u2019t do at all with the first version.<\/p><p>But this still isn\u2019t as good as we can make it. We\u2019re still storing our data structure inside our program, and it\u2019s still a little bit cumbersome to have to get the brackets and curly braces and the formatting all right when we\u2019re editing the data. The next step might be to store the data in a separate file that\u2019s a little more compact and a little easier to edit\u2014say, a comma-delimited (CSV) file, like this:<\/p><pre>    title,author<br \/>    Hamlet,William Shakespeare<br \/>    The Sun Also Rises,Ernest Hemingway<br \/>    The Great Gatsby,F. Scott Fitzgerald<\/pre><p>Now to use this, we\u2019d have to code instructions in our program first to access the file, parse the contents, create a data structure like the one in the last version, and load the data into memory. 
But if the resulting data structure is identical, the rest of the code works without modification.<\/p><h3>Show Me the Data!<\/h3><p>Once I have my data stored somewhere, like in a CSV file that I\u2019ve saved on my computer, I can write programs all day long that use it to do different things, and I don\u2019t have to re-enter it anywhere. In a nutshell, that is one of the primary uses of opening up data: it allows anyone that has the necessary skills to write programs that can do new things with it.<\/p><p>In contrast, our data in Millennium was locked away\u2014it was both physically inaccessible to us (except what we could access through the Millennium user interface) and stored in a data format that was Millennium-specific, that we couldn\u2019t make sense of even if we could access it. To be fair, we <em>could<\/em> export data in a few ways: as MARC via Data Exchange, as MARC via Z39.50, or as delimited data via Create Lists. But these methods were too limited or too manual to be of much real use.<\/p><p><span style=\"line-height: 1.538em;\">With Sierra, although the reality has unsurprisingly fallen somewhat short of the marketing promises,&nbsp;III has cracked open the door for us. Sierra&nbsp;<\/span><em style=\"line-height: 1.538em;\">is<\/em><span style=\"line-height: 1.538em;\">&nbsp;built on open-source database and indexing software, unlike Millennium. III&nbsp;<\/span><em style=\"line-height: 1.538em;\">has<\/em><span style=\"line-height: 1.538em;\">&nbsp;given us direct read access to the database. With this little bit of extra freedom that we didn\u2019t have with Millennium, they have opened the door to allow us to take control of our data.<\/span><\/p><p><span style=\"line-height: 1.538em;\">But is direct database access enough?<\/span><\/p><p>&nbsp;<\/p><h2>Storing, Using, and Extracting Data<\/h2><p>Turning back to our earlier example: as we use this program we\u2019ve created, there might be more and more that we want it to be able to do. 
Say we want to be able to track the date that we added a book, the number of times each person in our family has read a book, the amount we spent on a book, and the person in the family who bought a book. Say we notice that we have multiple books by the same author and we want to start recording information about authors somewhere so that we don\u2019t have to replicate it for each book. As we want to do more with our program, we need the data that supports that functionality, and we need our program to be able to read that data in order to work with it. And, as our data grows\u2014both in size and in complexity\u2014the format in which the data is physically stored begins to impact how smoothly and efficiently the program runs.<\/p><p>How a program stores data and what it stores are highly individual to that program. It has to be able to read stored data into variables as appropriate, use those variables to carry out its tasks, and then update the data in storage as needed. It<span style=\"line-height: 1.538em;\">\u2019<\/span><span style=\"line-height: 1.538em;\">s best to store data in a way that requires as little processing as possible to translate to internal data structures\u2014if your data access methods are inefficient, it can slow down your program and make your code harder to understand and extend.&nbsp;<\/span><\/p><p><span style=\"line-height: 1.538em;\">The upshot is that one&nbsp;<\/span><span style=\"line-height: 1.538em;\">system\u2019s internal data store isn\u2019t going to translate 1:1 to any other system or program\u2014nor should it.<\/span><\/p><p><span style=\"line-height: 1.538em;\">Let\u2019s consider the ILS again and what I want to be able to accomplish. I don\u2019t necessarily want to use the exact data as it\u2019s stored in the ILS database. Each of the applications I want to write needs a different subset of data from the ILS; for each application I may need to store similar (or the same!) 
data quite differently.<\/span><\/p><p><span style=\"line-height: 1.538em;\">At the moment Sierra allows us direct access to its internal database using <a href=\"http:\/\/en.wikipedia.org\/wiki\/SQL\" target=\"_blank\">SQL<\/a>. SQL is the standard language used for querying relational databases: most programming languages have functionality that will let you query a database using SQL and pull data into internal programming structures with relative ease. With this, we <em>can<\/em> actually write programs that can read Sierra\u2019s internal data.<\/span><\/p><p><span style=\"line-height: 1.538em;\">But you know what? As nice as that is, it isn\u2019t good enough. Sierra\u2019s database is designed to support the Sierra ILS\u2014it wasn\u2019t designed to support my applications. III made design decisions when it built the database that make it work okay for some purposes and not others. Let me give you an example that illustrates what I mean.<\/span><\/p><h3><span style=\"line-height: 1.538em;\">An Example: A Shelflist Application<\/span><\/h3><p><span style=\"line-height: 1.538em;\">One of the applications I started working on back in August is a simple shelflist builder to help with doing inventory. It lets you enter a location code and a call number range and pull back a list of items, sorted by call number, that you can then browse. You can search for call numbers and barcodes. 
You can mark items as on or not on the shelf.<\/span><\/p><p><span style=\"line-height: 1.538em;\">The shelflist itself is a table of items, like this:<\/span><\/p><table border=\"0\"><tbody><tr><td><strong>Row<\/strong><\/td><td><strong>Call Number<\/strong><\/td><td><strong>Volume<\/strong><\/td><td><strong>Copy<\/strong><\/td><td><strong>Status<\/strong><\/td><td><strong>Barcode<\/strong><\/td><td><strong>Suppressed?<\/strong><\/td><td><strong>Marked<\/strong><\/td><\/tr><tr><td>1<\/td><td>M128 .A016 2002<\/td><td>&nbsp;<\/td><td>1<\/td><td>AVAILABLE<\/td><td>1002285505<\/td><td>false<\/td><td>On Shelf<\/td><\/tr><tr><td>2<\/td><td>M128 .A05 2005<\/td><td>&nbsp;<\/td><td>1<\/td><td>AVAILABLE<\/td><td>1002189053<\/td><td>false<\/td><td>Not On Shelf<\/td><\/tr><tr><td>3<\/td><td>M128 .A09 2007<\/td><td>&nbsp;<\/td><td>1<\/td><td>AVAILABLE<\/td><td>1002420293<\/td><td>false<\/td><td>&nbsp;<\/td><\/tr><\/tbody><\/table><p><span style=\"line-height: 1.538em;\">When you click on a row, it expands to show you the title, author, bib and item record number, and a link to the record in the WebPAC.<\/span><\/p><p><span style=\"line-height: 1.538em;\">This data of course all comes from Sierra initially. When you first create a shelflist, my application submits an SQL query to Sierra, gets the information, and then stores it as a flat file so that my code can access the data quickly and efficiently. But the data as stored in Sierra is anything but flat. The database structure is quite complex, with item information and bib information being stored in separate tables and variable length fields stored in yet a separate table. Once I convert the data to the flat structure, access is instantaneous\u2014but the initial query to build the shelflist takes 4 to 5 minutes, and sometimes even longer, to run. The SQL query is as efficient as I can make it\u2014but, based on the database structure and what fields III has and has not indexed, the query still takes forever. 
And waiting 4 to 5 minutes for this query to run is unacceptable.<\/span><\/p><p><span style=\"line-height: 1.538em;\">So even though we&nbsp;<em>have<\/em> direct access to the Sierra database, we still don\u2019t have control of our data. To have control, we need to be able to pull the data out of Sierra. We need to be able to extract it and put it into other storage mediums that we <em>do<\/em> control, ones that we can configure to support fast, efficient access to the data that our applications need. The direct access we have to the database allows us to do this, but we\u2019ve had to build the functionality ourselves.<\/span><\/p><p><span style=\"line-height: 1.538em;\">&nbsp;<\/span><\/p><h2><span style=\"line-height: 1.538em;\">What About APIs?<\/span><\/h2><p>When one system shares data with other systems, there are at least two potential gotchas. One we saw in the last section: a system\u2019s internal data won\u2019t necessarily translate well to a system for which it wasn\u2019t designed and optimized. Another is that a system has to maintain the integrity of its data. If we allow multiple programs\u2014even ones that we\u2019ve written\u2014to access and write to the same data store, we risk that one of the programs might write data that renders the store (or parts of it) unreadable or invalid for other programs. This is why even open systems don\u2019t generally allow write access to their internal data.<\/p><p>One way to take care of these problems is to enable access to your system via an API, or<span style=\"line-height: 1.538em;\">&nbsp;<\/span><em style=\"line-height: 1.538em;\">Application Programming Interface<\/em><span style=\"line-height: 1.538em;\">. If someone has created an API for their system, it means that they have defined a set of commands that I, for example, could use in my own programs that allows mine to interact with theirs in predefined ways. 
Maybe this allows my program to submit a query to their system to get back data that it can then use to do something. Or maybe it submits a call to their API with a data value and their API performs some calculation on the value and returns the result. Having access to APIs helps me extend the capabilities of my program so I don\u2019t have to reinvent something that someone else has done. And APIs allow my programs to interact with other systems in a controlled, predetermined way\u2014i.e., so that they don\u2019t have to give me carte blanche access to everything in their systems and I don\u2019t have to care about their internals.<\/span><\/p><p><span style=\"line-height: 1.538em;\">APIs can be&nbsp;<em>read-only<\/em> or&nbsp;<em>read-write<\/em>. A read-only API only allows other programs to send commands for reading data, whereas a read-write API allows other programs to send commands that write data to the system as well.<\/span><\/p><p><span style=\"line-height: 1.538em;\">In a way, you can think of an API as being a counterpart to a UI (User Interface). A UI allows a person to input commands, enter data, and work with a system to accomplish some task. An API is the same, but for a computer program. It\u2019s just as important to design an API well, so that it\u2019s easy to understand and use, for the same reasons that it\u2019s important to design a UI well.<\/span><\/p><p><span style=\"line-height: 1.538em;\">The work I\u2019ve been doing the past five months initially began as a desire to create an API for our catalog data. In fact, that\u2019s still the major component. Having a \u201cprogramming interface\u201d built on top of our catalog data that can serve up different views of that data on request will make application development easier. 
(And I will come back to this in a later post.)&nbsp;<\/span><\/p><p><span style=\"line-height: 1.538em;\">At the same time, you may have seen recent announcements from III about their own \u201cSierra API,\u201d which has been in the works for a long time. They\u2019ve finally announced that they will be releasing the initial version of their API in April.<br \/><\/span><\/p><p><span style=\"line-height: 1.538em;\">So you might be wondering: how does III\u2019s API affect mine? Have I just wasted five months of work building something that\u2019s now redundant?<\/span><\/p><ol><li><span style=\"line-height: 1.538em;\">Consider my earlier point about why having direct access to Sierra\u2019s database isn\u2019t good enough. Just like the Sierra database, III has designed their APIs for some purpose. I haven\u2019t seen them yet, so I can\u2019t yet say how well-designed they are or how applicable they will be to the types of things we want to do. But they\u2019re still <\/span><em style=\"line-height: 1.538em;\">their<\/em><span style=\"line-height: 1.538em;\"> APIs. They are in control of what data they serve up and how that data is modeled. I want our applications to be built on top of <\/span><em style=\"line-height: 1.538em;\">our own<\/em><span style=\"line-height: 1.538em;\">&nbsp;APIs, because I want the option to serve data to our applications in a way that\u2019s custom-tailored to what we need. The applications I build this way will be more fully functional and will perform better.<\/span><\/li><li><span style=\"line-height: 1.538em;\">Speaking of performance\u2014if API access is slow, all the applications that use that API will be slow. I can ensure that our APIs are fast and responsive; I can\u2019t ensure that III\u2019s will be.<\/span><\/li><li><span style=\"line-height: 1.538em;\">Let\u2019s think of our catalog data as separate from Sierra. III is building an API for&nbsp;<em>Sierra<\/em><em>.<\/em> I\u2019m building an API for our catalog data. 
Now that we can extract data from Sierra, and as we build more and more of our own applications, that distinction will become more and more meaningful. For instance: data in Sierra isn\u2019t FRBRized. The data extraction process I\u2019ve built would allow us to write an extraction routine that would analyze and FRBRize our catalog data as it gets extracted, loading it into a system as separate Work, Expression, Manifestation, and Item entities. Our API would then serve up this data as such to applications we build to use it. Sierra\u2019s API couldn\u2019t do this, because that\u2019s not how Sierra stores data.<\/span><\/li><li><span style=\"line-height: 1.538em;\">I\u2019m almost certain that there will be ways that my API will be able to use III\u2019s. For instance, one thing mine won\u2019t be able to do on its own is write data to Sierra, whereas some of III\u2019s APIs will be read-write. The systems underlying my APIs could make calls to III\u2019s to enable us to write to Sierra while still letting us retain all the advantages of controlling our own APIs.<\/span><\/li><\/ol><p>Ultimately, building an API lets you give other programs access to data using standardized queries and commands. From the consuming program\u2019s perspective it\u2019s not too different from storing data on the filesystem or in a database and using commands to read it into an in-memory data structure. Just as with every method for storing and accessing data, there are practical implications and considerations that affect how you build and design your API depending on the needs of the application that uses the data. 
Building and maintaining our own API for catalog data means that these factors are under our control<span style=\"line-height: 1.538em;\">\u2014<\/span><span style=\"line-height: 1.538em;\">it means we can build better applications and generally do more than we could otherwise.<\/span><\/p><p><strong>Hopefully<\/strong> it\u2019s now a little bit clearer why we\u2019re building what we\u2019re building. But I still haven\u2019t given you a great idea of what exactly it is we\u2019re building. In the next installment we\u2019ll take a closer look, specifically at how the extraction process I\u2019ve developed works and how it will run in a production environment. There are a lot of practical implications\u2014like, how do we keep extracted data fresh and in sync with Sierra? Will it require any extra work from library staff to maintain? Stay tuned!<\/p>","protected":false},"excerpt":{"rendered":"Lately we\u2019ve focused an awful lot on Summon when talking about all things resource discovery. Summon is the new kid on the block, after all, and we\u2019ve needed to try to get a handle on it. 
It\u2019s changing the way people use our resources, and there are still a lot of open questions about how&#8230;  <a href=\"https:\/\/blogs.library.unt.edu\/rds\/2014\/03\/26\/15\/\" class=\"more-link\" title=\"Read Discovery, the Catalog, Data, and APIs: Part 1\">Read more &raquo;<\/a>","protected":false},"author":9,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[],"tags":[],"class_list":["post-15","post","type-post","status-publish","format-standard","hentry"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/s5tRT7-15","_links":{"self":[{"href":"https:\/\/blogs.library.unt.edu\/rds\/wp-json\/wp\/v2\/posts\/15","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blogs.library.unt.edu\/rds\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.library.unt.edu\/rds\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.library.unt.edu\/rds\/wp-json\/wp\/v2\/users\/9"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.library.unt.edu\/rds\/wp-json\/wp\/v2\/comments?post=15"}],"version-history":[{"count":1,"href":"https:\/\/blogs.library.unt.edu\/rds\/wp-json\/wp\/v2\/posts\/15\/revisions"}],"predecessor-version":[{"id":33,"href":"https:\/\/blogs.library.unt.edu\/rds\/wp-json\/wp\/v2\/posts\/15\/revisions\/33"}],"wp:attachment":
[{"href":"https:\/\/blogs.library.unt.edu\/rds\/wp-json\/wp\/v2\/media?parent=15"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.library.unt.edu\/rds\/wp-json\/wp\/v2\/categories?post=15"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.library.unt.edu\/rds\/wp-json\/wp\/v2\/tags?post=15"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}