The Prevalence of Book Properties in the Wild

Published: 2013-01-24 23:00

Updated: 2013-01-25 12:10

The Web Data Commons project is extracting the structured data discovered in the Common Crawl corpus, and is making the extracted data and some high-level analysis available for free to all. I took a look at which properties of schema.org/Book were actually used in the wild in the August 2012 corpus. My hope is to inform, in a small way, the discussion around extending schema.org to better accommodate bibliographic data, which is happening through the W3C Schema Bib Extend Community Group. By seeing what is actually being used, we might make better decisions about how schema.org/Book could be extended.

I looked at the Class-Property-Co-occurrence Matrixes spreadsheets. The “Properties” worksheet alone shows the number of Pay Level Domains (PLDs) that use each property. It appears that schema.org/Book only shows up in the Microdata spreadsheet and not in the RDFa one, though the two spreadsheets may differ in ways I don’t understand.

Of the 23 properties listed, only 18 are valid schema.org/Book properties (the 5 that are not are marked with an asterisk below). About 32 properties of schema.org/Book are not used at all (I may have counted some properties that have deprecated plural duplicates), so not every property is being used so far. These were quick counts, so my numbers might be off a bit.
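The split between valid and invalid properties can be double-checked with a few lines of Python. This is only a sketch: the counts are copied from the list at the end of this post, the asterisk convention is the one described above, and the names `RAW_COUNTS` and `tally` are made up for the illustration.

```python
# PLD counts copied from the list at the end of this post.
# A trailing "*" marks a property that is not a valid property of the type.
RAW_COUNTS = (
    "375 298 244 212 157 141 113 87 83 81 75 52 "
    "48* 46* 45 31 25 23 20* 17 15* 11* 10"
)


def tally(raw):
    """Count how many listed properties are valid and how many are starred."""
    tokens = raw.split()
    starred = [t for t in tokens if t.endswith("*")]
    return {
        "listed": len(tokens),
        "invalid": len(starred),
        "valid": len(tokens) - len(starred),
    }


summary = tally(RAW_COUNTS)
print(summary)
```

The three numbers it reports should match the 23 listed, 18 valid, and 5 invalid properties mentioned above.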

Update: Jodi Schneider makes a good point on the public-schemabibex list: terms not defined by schema.org “indicate either problems in understanding (e.g. surely there’s some other price property that could be pulled) or actual needs.” So I thought I would go into those properties in a bit more detail.

The “price” and “priceCurrency” properties are a misunderstanding and should probably be moved into a nested Offer item, which does have price and priceCurrency properties. The “numPages” property should be “numberOfPages,” though I don’t know if there was an older version of schema.org that included this form of the property. “publishDate” should probably just be “datePublished.” Finally, “ratingValue” should be within an AggregateRating item. This is the same problem as with “price” and “priceCurrency”: some publishers do not seem to understand that items ought to be nested, or that they need to be nested in a particular way.
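To make the corrections above concrete, here is a rough Python sketch that rewrites a flat set of Book properties into the suggested shape. It rests on several assumptions: the `offers` and `aggregateRating` container properties and the `@type` key are my choice for illustrating the nesting, and the names `SUGGESTED_FIX`, `CONTAINER`, and `normalize` are hypothetical. It is not a tool that anyone in this discussion has proposed.

```python
# Map each misused property to (item type it belongs in, correct property name).
# None as the type means the property only needs renaming, not nesting.
SUGGESTED_FIX = {
    "price": ("Offer", "price"),
    "priceCurrency": ("Offer", "priceCurrency"),
    "numPages": (None, "numberOfPages"),
    "publishDate": (None, "datePublished"),
    "ratingValue": ("AggregateRating", "ratingValue"),
}

# Assumed container property on the Book for each nested item type.
CONTAINER = {"Offer": "offers", "AggregateRating": "aggregateRating"}


def normalize(book):
    """Return a copy of a flat property dict with renames and nesting applied."""
    fixed = {}
    for name, value in book.items():
        nested_type, target = SUGGESTED_FIX.get(name, (None, name))
        if nested_type is None:
            # Valid or merely misnamed property: keep it at the top level.
            fixed[target] = value
        else:
            # Misplaced property: move it into a nested item of the right type.
            item = fixed.setdefault(CONTAINER[nested_type], {"@type": nested_type})
            item[target] = value
    return fixed


flat = {"name": "Example", "price": "9.99", "priceCurrency": "USD", "numPages": 200}
print(normalize(flat))
```

Properties that are not in the mapping pass through unchanged, so the sketch only touches the five problem properties discussed here.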

So while we might expect to find some expressed desires for new properties, what we are finding instead are problems in understanding or typos. I wonder if there also needs to be some education about the extension mechanism. Extending schema.org or adding new properties is allowed. While the partners may not completely understand a new property at first, if a property gains use, it may be accepted into the schema. At the very least, using non-standard properties is one way to advertise the desire for these properties.

As new data comes out of Web Data Commons, I will try to report on it.

So here are the properties that are actually used followed by the number of PLDs that use that property:

Property    Pay Level Domain Count
            375
            298
            244
            212
            157
            141
            113
            87
            83
            81
            75
            52
            48*
            46*
            45
            31
            25
            23
            20*
            17
            15*
            11*
            10