elliot's blog

Example of how to put RDFa into an HTML list

I'm not going to try to explain RDF and/or RDFa here, but thought any poor suckers looking for RDFa examples might benefit from me posting what I finally worked out, with help from my colleague Rob. Namely, how to annotate an HTML ordered list (<ol>) with RDFa attributes; and how to put RDFa attributes onto form elements.

Here's the HTML page with RDFa embedded in it. What I'm representing here is a sequence of collections, and the individual collections within it:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN" "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Collections</title>
  </head>
  <body>
    <h1>Collections</h1>
    <form method="post" action="http://receptacular.org/collections">
      <ol xmlns="http://www.w3.org/1999/xhtml" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:xsd="http://www.w3.org/2001/XMLSchema#" xmlns:rec="http://receptacular.org/schema#" typeof="rdf:Seq" about="http://receptacular.org/collections">
        <li rel="rdf:_1" resource="http://receptacular.org/collections/1">
          <span style="display:none;" rel="rdf:type" resource="http://receptacular.org/schema#Collection"></span>
          <div class="collection-label" property="rdfs:label">Vague Collection</div>
          <input type="checkbox" id="collections-1-hidden" property="rec:hidden" datatype="xsd:boolean" content="false"/>
          <label for="collections-1-hidden">hidden</label>
          <input type="checkbox" id="collections-1-defaultSearch" property="rec:defaultSearch" datatype="xsd:boolean" content="false"/>
          <label for="collections-1-defaultSearch">use for searches</label>
        </li>
        <li rel="rdf:_2" resource="http://receptacular.org/collections/2">
          <span style="display:none;" rel="rdf:type" resource="http://receptacular.org/schema#Collection"></span>
          <div class="collection-label" property="rdfs:label">Archive Collection</div>
          <input type="checkbox" id="collections-2-hidden" property="rec:hidden" datatype="xsd:boolean" content="false"/>
          <label for="collections-2-hidden">hidden</label>
          <input type="checkbox" id="collections-2-defaultSearch" property="rec:defaultSearch" datatype="xsd:boolean" content="false"/>
          <label for="collections-2-defaultSearch">use for searches</label>
        </li>
        <li rel="rdf:_3" resource="http://receptacular.org/collections/3">
          <span style="display:none;" rel="rdf:type" resource="http://receptacular.org/schema#Collection"></span>
          <div class="collection-label" property="rdfs:label">Main Collection</div>
          <input type="checkbox" id="collections-3-hidden" property="rec:hidden" datatype="xsd:boolean" content="true" checked="checked"/>
          <label for="collections-3-hidden">hidden</label>
          <input type="checkbox" id="collections-3-defaultSearch" property="rec:defaultSearch" datatype="xsd:boolean" content="true" checked="checked"/>
          <label for="collections-3-defaultSearch">use for searches</label>
        </li>
      </ol>
      <p>
        <input type="button" value="Save" id="save-collections"/>
      </p>
    </form>
  </body>
</html>

Available online here: http://receptacular.org/collections

Things of note:

  • The doctype declaration. This is the W3C-sanctioned doctype for XHTML+RDFa pages. By the way, the W3C validator will correctly validate this page, but standard XHTML validators don't (e.g. this one: http://nutrun.com/weblog/xhtmlvalidator-validate-xhtml-in-java/). That's another story...
  • Namespace declarations on the <ol> element wrapping the list items. This is what causes standard XHTML validation approaches to fail.
  • The <ol> element is defined with typeof="rdf:Seq" and about="http://receptacular.org/collections". This sets it up as the RDF Seq resource.
  • The <li> elements inside the <ol> are the Seq items within the wrapping Seq resource. Each is defined as a resource using the resource attribute, and linked back to the enclosing <ol> element using the rel attribute. Note that the value of each rel attribute is an RDF Seq membership property (rdf:_1, rdf:_2, ...), which orders the items within the enclosing Seq resource.
  • The RDF type of each Seq item is set using a hidden <span> element. Note that these elements have no text in them, but still have separate opening and closing tags: if you just use a self-closing tag for this element, the HTML doesn't display properly.
  • The <span> elements use the rel attribute to mark their RDF type relationship to the outer list item; and the resource attribute to specify the location of the resource representing their type.
  • Each Seq item has an enclosed <div> which represents its rdfs:label property. Note that the property RDFa attribute is used to specify which property of the enclosing resource is being defined. Also note that the value of the property is inlined between the start and end tags of the <div>.
  • The two checkboxes define two more properties for each Seq resource: rec:hidden and rec:defaultSearch. (The semantics of the properties aren't discussed here, as I'm concentrating on syntax.) Each is defined on a standard XHTML <input> element: the relationship to the enclosing Seq item is defined with the property attribute; the value of the property is defined using the content attribute; and the datatype of the literal value is defined via the datatype attribute. Any of the standard XML Schema datatypes could be used here, or types from other schemas.
  • When working with form elements which represent property values for RDF resources, you may need to change the content attribute in response to UI changes. (In the application from which this example was extracted, we use jQuery to respond to checkbox changes and update the content attribute to match.)

To see the RDF which can be extracted from this page, you can use the W3C's RDFa Distiller. Here's the resulting RDF:

<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF
  xmlns:dist="http://www.w3.org/2007/08/pyRdfa/distiller#"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
  xmlns:rec="http://receptacular.org/schema#"
  xmlns:xhv="http://www.w3.org/1999/xhtml/vocab#"
  xmlns:xml="http://www.w3.org/XML/1998/namespace"
  xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
>
  <rdf:Seq rdf:about="http://receptacular.org/collections">
    <rdf:_1>
      <rec:Collection rdf:about="http://receptacular.org/collections/1">
        <rec:hidden rdf:datatype="http://www.w3.org/2001/XMLSchema#boolean">false</rec:hidden>
        <rec:defaultSearch rdf:datatype="http://www.w3.org/2001/XMLSchema#boolean">false</rec:defaultSearch>
        <rdfs:label>Vague Collection</rdfs:label>
      </rec:Collection>
    </rdf:_1>
    <rdf:_2>
      <rec:Collection rdf:about="http://receptacular.org/collections/2">
        <rec:hidden rdf:datatype="http://www.w3.org/2001/XMLSchema#boolean">false</rec:hidden>
        <rec:defaultSearch rdf:datatype="http://www.w3.org/2001/XMLSchema#boolean">false</rec:defaultSearch>
        <rdfs:label>Archive Collection</rdfs:label>
      </rec:Collection>
    </rdf:_2>
    <rdf:_3>
      <rec:Collection rdf:about="http://receptacular.org/collections/3">
        <rec:hidden rdf:datatype="http://www.w3.org/2001/XMLSchema#boolean">true</rec:hidden>
        <rec:defaultSearch rdf:datatype="http://www.w3.org/2001/XMLSchema#boolean">true</rec:defaultSearch>
        <rdfs:label>Main Collection</rdfs:label>
      </rec:Collection>
    </rdf:_3>
  </rdf:Seq>
</rdf:RDF>

Saving changes to an RDFa-enabled form like this is another challenge, for which we used rdfQuery, an RDF library for jQuery. (I recommend you use the latest svn HEAD version of this library, as older versions have a bug where they ignore RDFa elements nested inside elements without RDFa attributes.) Maybe I'll get round to that another time.

Script to copy random mp3s from a directory to an mp3 player

This is a Ruby script which randomly copies mp3 files from one directory to an mp3 player. I wrote it so I could easily fill up my mp3 player from the 9000-odd mp3s I have on a different external drive.

To run it, you'll need the sys-filesystem gem (see http://rubyforge.org/projects/sysutils):

$ gem install sys-filesystem

Next, edit these variables in the script (near the top):

  • source_dir: the directory containing the mp3s you want to select from
  • dest_dir: the directory on your mp3 player you want to copy to

Be a bit careful: this will attempt to fill the dest_dir you specify with mp3 files from source_dir, so you might end up filling up the wrong disk.

Then just run it with ruby from the command line:

$ ruby mp3s_random.rb

Note that it won't delete anything from the destination drive, and will attempt to fill all the space available. Also note that it doesn't keep trying mp3s until it finds one which will fit the last remaining space: once it tries to copy a file which won't fit, it stops. You can always run it again to see whether the next run finds a file small enough to fit.

I've only tested it on Linux, but, who knows, it might work on Windows too. (No operating-system-specific commands are used; the script uses Ruby for all file operations.)

The code is below, but I've attached it as well.

require 'rubygems'
require 'sys/filesystem'
require 'ftools' # provides File.copy on Ruby 1.8

source_dir = '/media/disk/music'
dest_dir = '/media/disk-1/music'

files = Dir[File.join(source_dir, '/**/*.mp3')]
num_files = files.size

stat = Sys::Filesystem.stat(dest_dir)

# to_kb is added to numerics by the sys-filesystem gem
disk_free_space_kb = (stat.blocks_free * stat.fragment_size).to_kb

files_selected = []

while disk_free_space_kb > 0 and num_files > 0 do
  # choose an mp3 at random (rand(n) returns an integer from 0 to n - 1)
  file_num_to_copy = rand(num_files)
  file_path = files.delete_at(file_num_to_copy)

  num_files = num_files - 1

  # work out how big file is
  file_size_kb = File.stat(file_path).size.to_kb

  # subtract from free space
  if (disk_free_space_kb - file_size_kb) > 0
    files_selected << file_path
    disk_free_space_kb = disk_free_space_kb - file_size_kb
  else
    break
  end
end

files_selected.each do |f|
  copy_to_path = File.join(dest_dir, File.basename(f))
  puts "Copying #{f} to device"
  File.copy(f, copy_to_path)
end

How I worked out that curl is doing bad things with MARC

I work on a system at Talis which posts MARC records from customer library databases into a MARC-to-RDF transformer. The RDF generated from the MARC is sent into the Talis Platform, where it's used to power Prism.

Over the last day or so I've been working on a bug which has prevented some records going correctly through this process. Along the way, I noticed another bug occurring somewhere between the post from the customer site and our MARC-to-RDF transformer. It looked as if line break characters in the original MARC record were being lost somewhere in the process. Consequently, when the MARC was pushed into the transformer, the record got spat out as invalid: the length specified in the MARC leader didn't correspond to the length of the record, now that it had lost its line break characters. (By the way, working directly with byte streams is the only way to work with MARC, for precisely this reason.)

I had a sudden insight on the way home, triggered by remembering issues I'd had with curl (the command line HTTP client) working on another personal project. On that project, I'd been trying to post RDF triples in ntriple format into my application using curl. However, the application only seemed to recognise the first RDF triple in the posted file. I couldn't understand why.

Then, when I echoed the body of the HTTP request, as received by my app from curl, I realised the issue: curl was sending the body of the request WITHOUT LINE BREAKS. As line break characters act as the delimiter between triples in RDF ntriple format, my app was only seeing a single RDF ntriple. When I tried an alternative tool to send the posts (the extremely useful Poster add-on for Firefox), the ntriples were received correctly.

Once I remembered this, I decided to do some debugging of the kind of requests curl would send if it were posting MARC records. My hypothesis was that curl was stripping line break characters from the MARC record (which is bad, as they are valid characters in MARC), and hence causing the record to be shorter than the leader said it should be.

First step was to put together something to echo and/or save HTTP request bodies. Rack is ideal for this sort of thing, so I used this little Rack web server program:

require 'rubygems'
require 'rack'

def save_body(body)
  File.open('last_raw_request', 'w') {|f| f.write(body)}
  body
end

Rack::Handler::WEBrick.run(lambda {|e| [200, {}, save_body(e['rack.input'].read)]}, :Port=>7777)

This saves the raw request body to a file called "last_raw_request".

I first posted a MARC file with line breaks in it (attached) using Poster (with Content-Type set to application/marc21) through Firefox. The MARC file came through intact and still valid.

I then posted a MARC file with line breaks in it using curl:

curl -d @marcfile.mrc -H "Content-Type:application/marc21" http://localhost:7777/

This produced an invalid MARC file, with the line breaks missing.

The solution is to use the --data-binary switch when using curl to send binary data; that's what we weren't doing when sending MARC from the customer site. Mostly this doesn't matter, but it does when the MARC record contains line break characters.

Namely:

curl --data-binary @marcfile.mrc -H "Content-Type:application/marc21" http://localhost:7777/
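If you're scripting the post rather than calling curl, Ruby's Net::HTTP sends the request body byte-for-byte, so the record survives intact. A minimal sketch (the helper name, file name, and URL are my own placeholders):

```ruby
require 'net/http'
require 'uri'

# Post a file's raw bytes without any reformatting, preserving the
# line break characters that `curl -d` strips out.
def post_raw(path, url)
  body = File.open(path, 'rb') { |f| f.read }
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port) do |http|
    http.post(uri.path, body, 'Content-Type' => 'application/marc21')
  end
end
```

Because the body goes out verbatim, the record length in the MARC leader still matches what the server receives.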

Open source at work

It's taken a while, but a feature request I logged 2 years ago has finally made it to Drupal trunk. (The basic idea was to put a timeout on Drupal HTTP requests to other systems, to prevent a whole Drupal site timing out if one of its requests to another site hung - prompted by working on AllConsuming and Last.fm modules for Drupal.) My original patch was promptly rejected, but it's been fascinating watching the discussion around the idea over the months, culminating in a well-rounded, properly-tested patch landing in CVS.

Some basic hardware testing on Linux

I've got a slightly unstable computer at the moment which I've been trying to diagnose. Still haven't worked out exactly what's wrong (it freezes randomly in both Windows and Linux), but I have found some useful testing tools on the way (for Ubuntu Intrepid Ibex unless otherwise stated).

Memory

Recent Ubuntu Linux distros include MemTest86+, a memory testing tool. You just select this option from the grub boot menu when your computer starts, and it boots into a dedicated memory testing OS. The tests are fairly simple to get going, but take hours, literally. You need to run them overnight.

Hard disks

The smartmontools package includes some testing tools for hard disks which have S.M.A.R.T. capability (most modern motherboards and hard disks support this). Once you've installed the package, you can use the smartctl command line tool to run diagnostics on your hard disks.

I used this tool like this:

$ sudo smartctl -t long /dev/sda

This starts the test, which will take a fair amount of time (mine took around 30 minutes for a 40GB disk). Once it's finished, you can run:

$ sudo smartctl -H /dev/sda

to see the results. Mine looked like this:

smartctl version 5.37 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

CPU

A simple test to max out your CPU (and exercise your graphics card) is to run GLX gears:

$ glxgears -info

However, that doesn't really stress your system. For that, I used a tool called CPU Burn-in. It's ostensibly an overclocker's tool, but what it does is attempt to push your CPU to its maximum operating temperature so you can see whether it's stable. It's a binary download with a Linux version, so it's very easy to use. Unzip it, cd to the directory, and run:

$ ./cpuburn-in 10

where 10 is the number of minutes you want to run the tests for. This one scared me a bit, as I watched the temperature of my CPU and system slowly climb. Read the caveats and warnings on the web site before running this tool.

Monitoring

While you're doing all this, you'll want to watch the system temperature etc. For this, you can use the xsensors tool. This is a simple apt-get on Ubuntu, but for some reason the default config file is in the wrong place (it's called /etc/sensors3.conf, but the app expects /etc/sensors.conf). You can tell it where the config file is using the -c switch, e.g.

$ xsensors -c /etc/sensors3.conf

This displays a graphical readout of various system temperatures, fan speeds, etc.

You can also watch various aspects of system usage by adding the hardware monitor applet to your panel (right-click on the panel and select Hardware Monitor). This lets you watch how much your CPU, memory, disks etc. are being utilised.

Decoding FLV files with ffmpeg

I'm using Ubuntu Intrepid Ibex, but the ffmpeg build it comes with doesn't support recent FLV encodings (like those used by some YouTube videos). You get an error like this when you try to do anything with them:

[flv @ 0xb800e4c8]Unsupported video codec (7)

My solution was to check out ffmpeg from its Subversion repository and compile it myself:

$ svn checkout svn://svn.ffmpeg.org/ffmpeg/trunk ffmpeg-svn
$ cd ffmpeg-svn
$ ./configure --prefix=~/apps/ffmpeg-svn --enable-libmp3lame
$ make install

To do the compile, you'll probably need build-essential, as well as libmp3lame-dev, and the *-dev versions of any other codecs you want to use.

Use the resulting binary to do the conversion:

$ ~/apps/ffmpeg-svn/bin/ffmpeg -i infile.flv outfile.mpg

It turns out this version does support those recent FLV files, happily.

Describing (finding) subjects which don't have a particular predicate in SPARQL

If you want to do something like a SQL NOT EXISTS in SPARQL, here's what the query looks like:

PREFIX rs: <http://schemas.talis.com/2006/recordstore/schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

DESCRIBE ?tenancy {
  ?tenancy rdf:type rs:Tenancy .
  OPTIONAL { ?tenancy rs:platformStoreUri ?o } .
  FILTER ( !bound(?o) )
}

Here I'm looking for subjects with an RDF type of http://schemas.talis.com/2006/recordstore/schema#Tenancy which don't have a http://schemas.talis.com/2006/recordstore/schema#platformStoreUri predicate. The important bit is that you make the predicate which could potentially not be "set" OPTIONAL, and add a FILTER which keeps only those subjects where the predicate's object variable is not bound to a value. This effectively screens out any subjects where the predicate has been set. This pattern is basically Negation as Failure (according to the SPARQL recommendation), which derives from logic programming. Feels a bit like being back at university.

Installing Windows on the second hard disk of a Linux machine

I recently upgraded the hardware of my old desktop PC, with the aim of providing the house with a new-ish Linux machine for watching movies and using the internet, and a Windows machine for writing music and playing (old) games. My plan was to use two hard disks: one for Linux, another for Windows, and choose which to use at boot time.

Normal procedure is to install Windows first, then install Linux into a spare partition on the same hard drive (Windows tends to overwrite the bootloader of any disk you put it on). But it's easier to get a Linux machine up and running first: you can see what hardware you've got and get a decent system without needing to go and find loads of old drivers. So I decided to install Linux first. I plugged in a drive for it as the primary IDE drive, and installed Ubuntu Linux onto it.

Then, I unplugged the Linux drive, plugged the other drive in, and installed Windows 2000 onto the second drive (just to make sure Windows couldn't overwrite Linux). Got that working too.

Then I plugged the Linux drive in, as the first drive on the IDE cable; and the Windows disk as the second.

The trick then is to get grub (the Linux bootloader I'm using) to present you with both disks as options at boot time. There's a sample configuration in /boot/grub/menu.lst, but that didn't work for me: it looked like it was working, then just hung. I tried a couple of other things, but nothing worked.

Finally, I found this blog entry and used the configuration there. The trick is to make Windows think it's installed on the first disk on the IDE cable. I added this to the bottom of menu.lst:

title Windows 2000
rootnoverify (hd1,0)
map (hd0) (hd1)
map (hd1) (hd0)
chainloader +1

which does the trick! Now I get a working Windows 2000 option in my grub boot menu.

Creating a self-signed SSL certificate for Apache on Linux

(This is extracted from my Apache course materials, but it's a useful howto in its own right.)

To generate a self-signed SSL certificate, you will need openssl installed first.

Then follow these steps:

  1. Generate the server's private key; we'll use a 1024-bit key using the RSA algorithm:
    openssl genrsa -out server.key 1024
  2. Generate a certificate-signing request:
    openssl req -new -key server.key -out server.csr
  3. Fill in the required information at the prompts:
       Country Name (2 letter code) [GB]:GB
       State or Province Name (full name) []:.
       Locality Name (eg, city) [Newbury]:Birmingham
       Organization Name (eg, company) [My Company Ltd]:Talis
       Organizational Unit Name (eg, section) []:Library Products
       Common Name (eg, your name or your server's hostname) []:prism.talis.com
       Email Address []:.
    
       Please enter the following 'extra' attributes to be sent with your certificate request
    
       A challenge password []:.
       An optional company name []:.
    
    The really important one is the Common Name: this must match the domain name which will serve the SSL site; otherwise connecting clients will get a prompt about a mismatch between the certificate's host name and the actual host name of the server.

    Note that we left the challenge password blank. (Similarly, if the private key itself were encrypted with a passphrase, Apache would prompt you for it each time you start the server, which is a pain in the arse.)
  4. Create a self-signed certificate from the certificate-signing request (.csr file):
    openssl x509 -req -days 3650 -in server.csr -signkey server.key -out server.crt
  5. rm server.csr (you don't need it any more)
  6. Put the .crt and .key files into Apache's SSL directory and configure Apache to use them
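
By way of a sketch of that last step (the paths and ServerName here are assumptions; adjust them for your own layout), the relevant Apache directives look something like this:

```apache
Listen 443
<VirtualHost *:443>
    ServerName prism.talis.com
    SSLEngine on
    SSLCertificateFile /etc/apache2/ssl/server.crt
    SSLCertificateKeyFile /etc/apache2/ssl/server.key
</VirtualHost>
```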

If I get round to it I'll do another entry explaining how to make Apache use them.

FRBR explained pretty well

I've been struggling for a while to understand FRBR. It's basically (I quote) a conceptual model for the bibliographic universe. At its core are concepts describing bibliographic "things": books, works, scores, audio books, novels, all that lot. But there are two odd, vague things sitting in between Items (physical things you can hold) and Works (the broad idea of "a work of art", separate from how it occurs in the world): Manifestations and Expressions. I kind of understood the difference, but they seem to have smudged boundaries.

This comment on the futurelib wiki by jrochkind cleared up some of the confusion for me:

An item, is an actual individual concrete book in your hand.

A manifestation is the set of all items that are identical (or close enough) in physical form as well as content.

An expression is the set of all manifestations that are identical in textual or information content. (or close enough for our purposes; an archeologist would consider the coffee stain on the back to be distinguishing information content; we do not).

And a work is the set of all expressions that, well, consist of the same intellectual work. This is definitely a cultural concept, but it's one we have and find useful. We consider the audio book version of a book to be the same book, just a different version. That's a work.

Thanks Jonathan.

Also ran across Ian Davis' translation of FRBR concepts to RDF. He's my boss.

And the Resource Description and Access cataloguing standard, which I hadn't encountered before. And by coincidence, a recent UKOLN guest lecture on RDA just appeared in one of my RSS feeds.

Most of this was triggered by a colleague tipping me off to eXtensibleCatalog, a new open source discovery layer for bibliographic data, built on Drupal (amongst other things). It has its own metadata format, plus tools for translating out of common library metadata formats (like MARC) into their own format.

It's quite fascinating, this whole library metadata lark, once you get your teeth into it.
