Linked Data and a new Browser API event
bfrancis at mozilla.com
Thu Jun 4 17:19:55 UTC 2015
On 3 June 2015 at 19:42, Benjamin Francis <bfrancis at mozilla.com> wrote:
> This is what I'd really like to get more of, particularly usage data.
I've reached out to a few people at Yahoo, Google and a couple of
universities and have managed to turn up a few studies with useful data.
My conclusions so far are:
- Microformats are used on a large number of web sites but are limited
by their case-by-case syntax and more fixed vocabulary and are less
- Microdata and RDFa are vocabulary agnostic, which makes them inherently
more extensible. They're increasing in popularity due to schema.org and
consumption by major search engines, whilst the use of Microformats has
remained relatively constant over time.
- Microdata is a bit more concise than RDFa but doesn't allow for the
mixing of vocabularies.
- Open Graph is a simplistic form of RDFa with a limited vocabulary and
limited usefulness in comparison to other formats, but is very widely used
due to Facebook and Twitter being major consumers.
- Microformats is used by more websites (domains) but Microdata is used
by more web pages (more URLs, more typed entities and more triples) and is
growing the fastest. Microformats has the breadth, but Microdata has the
depth. In our case I think what we care about is the latter - the amount of
- JSON-LD is the newest format, the main difference being that it isn't
intended to be embedded within HTML markup, but is included separately in
a script tag. It's also useful as a canonical JSON-based format to
represent all of the other formats.
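
To illustrate the last point, here is a rough sketch of how an item that
has already been parsed out of Microdata might be re-expressed as JSON-LD.
The MicrodataItem shape and the toJsonLd() helper are hypothetical names
made up for this example, not part of Gecko or any library. The sketch also
assumes a slash-terminated vocabulary such as schema.org and that nested
items share the top-level vocabulary.

```typescript
// Hypothetical, simplified shape for an item already parsed from Microdata.
interface MicrodataItem {
  type: string; // full itemtype URL, e.g. "http://schema.org/Person"
  properties: Record<string, Array<string | MicrodataItem>>;
}

type JsonLdValue = string | JsonLdNode | Array<string | JsonLdNode>;
interface JsonLdNode {
  [key: string]: JsonLdValue;
}

// Convert one item into a JSON-LD node. Assumption: the vocabulary URL is
// everything before the last "/" of the itemtype, and nested items reuse
// the same vocabulary, so only the top-level node carries "@context".
function toJsonLd(item: MicrodataItem, topLevel: boolean = true): JsonLdNode {
  const slash = item.type.lastIndexOf("/");
  const node: JsonLdNode = {};
  if (topLevel) {
    node["@context"] = item.type.slice(0, slash);
  }
  node["@type"] = item.type.slice(slash + 1);

  for (const name of Object.keys(item.properties)) {
    const values = item.properties[name] ?? [];
    const converted = values.map((v) =>
      typeof v === "string" ? v : toJsonLd(v, false)
    );
    // Collapse single-element lists to a bare value for readability.
    const first = converted[0];
    node[name] = converted.length === 1 && first !== undefined ? first : converted;
  }
  return node;
}
```

A real converter would also need to handle itemid, multiple itemtypes and
hash-terminated vocabularies, which this sketch leaves out.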
That leads me to recommend that we do the following:
- Parse Microdata and RDFa (including Open Graph) from web pages in Gecko
- Expose all of this data to Gaia via a single getLinkedData() or
getStructuredData() method on the Browser API which returns a Promise that
resolves with the data in a canonical JSON-LD format
- Also consider supporting JSON-LD directly as no parsing is required,
we just need to detect a script tag
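
As a minimal sketch of the last two points, the code below pulls JSON-LD
out of script tags in an HTML string and returns it through a Promise. It
is only an illustration: this getLinkedData() takes an HTML string and is a
hypothetical stand-in for the proposed Browser API method, extractJsonLd()
and JsonLdDocument are names invented here, and a regular expression is
used in place of a real HTML parser to keep the example self-contained.

```typescript
// Hypothetical type for one JSON-LD document found in a page.
interface JsonLdDocument {
  [key: string]: unknown;
}

// Find every <script type="application/ld+json"> block and parse its body.
// Malformed blocks are skipped rather than failing the whole page.
function extractJsonLd(html: string): JsonLdDocument[] {
  const pattern =
    /<script\b[^>]*type\s*=\s*["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  const results: JsonLdDocument[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    try {
      const parsed: unknown = JSON.parse(match[1] ?? "");
      if (Array.isArray(parsed)) {
        // A script block may hold a list of documents.
        results.push(...(parsed as JsonLdDocument[]));
      } else if (parsed !== null && typeof parsed === "object") {
        results.push(parsed as JsonLdDocument);
      }
    } catch {
      // Ignore blocks that are not valid JSON.
    }
  }
  return results;
}

// Stand-in for the proposed method: resolves with the page's linked data.
// In this sketch only embedded JSON-LD is covered; Microdata and RDFa
// would be converted and appended to the same list.
function getLinkedData(html: string): Promise<JsonLdDocument[]> {
  return Promise.resolve(extractJsonLd(html));
}
```

A caller would then use something like
getLinkedData(pageSource).then((docs) => ...) and work with one format
regardless of how the page originally marked up its data.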
If anyone finds any more usage data, or has a different interpretation of
the data below, then please do share.
1. Web Data Commons website based on Common Crawl corpus (2009-2014)
2. Web Data Commons paper based on Common Crawl corpus (2009-2012)
3. Yahoo post based on Yahoo corpus (2011)
4. Yahoo paper based on Bing corpus (2012)