Python functions for applied use of schema.org
This is early development of the schemaorg Python module, and in this document I will detail its purpose.
The package is provided on pip, but you can also install from source.

```bash
pip install schemaorg
```

```bash
git clone https://www.github.com/openschemas/schemaorg
cd schemaorg
python setup.py install
```
The high-level goal is to make it easy to tag datasets, containers, and other software to be accessible via Google Search as a dataset (or a similar type, as Google develops these search types) or programmatically via an API. This means that:
- If I’m a researcher
- If I’m a developer
The goals for this early development are simple: to define a “Container” in schema.org so that we can then discover and query the (currently) expansive and disorganized universe of containers. This comes down to:
1. Container Definition in Schema.org
Defining “ContainerRecipe” and “ContainerImage” in schema.org. While imperfect, after discussion with the OCI community I am making an early proposal:
```
Thing > ContainerImage
Thing > CreativeWork > SoftwareSourceCode > ContainerRecipe
```
I summarize the discussion and rationale here.
2. Tagging of Dockerfiles as ContainerRecipe
Next, I want to be able to tag these Dockerfiles as ContainerRecipe. Since we don’t have this entity added to schema.org yet and I’m too impatient to wait for meetings, I will give this a first shot by just calling them “Datasets” and use the exercise to develop the codebase here.
This is brief usage; for complete examples, see the next section. For this example, let’s prepare metadata for a SoftwareSourceCode.
The recipe defines which properties are needed for a specific use case. This might be the minimum set for a registry, for example. The Schema object represents the schema itself.
```python
from schemaorg.main.parse import RecipeParser
from schemaorg.main import Schema
```
First let’s read in our recipe.yml file. This file looks like this:
```yaml
version: 1
schemas:
  SoftwareSourceCode:
    recommended:
      - softwareVersion: version
      - citation
      - identifier
      - keywords
      - license
      - url
      - sameAs
      - spatialCoverage
      - temporalCoverage
      - variableMeasured
    required:
      - description
      - name
  Person|Organization:
    required:
      - description
      - name
```
You can see it tells us the required and recommended fields that we need, along with the actual specification types.
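The idea behind required versus recommended fields can be sketched in plain Python. This is a minimal, hypothetical illustration of what a recipe like the one above encodes (it is not the schemaorg API; `check_fields` is a made-up helper): required fields must be present, and missing recommended fields are merely noted.

```python
# Toy sketch of required/recommended field checking (NOT the schemaorg API).
def check_fields(metadata, required, recommended=None):
    """Return (missing_required, missing_recommended) for a metadata dict."""
    missing_required = [f for f in required if f not in metadata]
    missing_recommended = [f for f in (recommended or []) if f not in metadata]
    return missing_required, missing_recommended

metadata = {"name": "vanessa/sregistry", "description": "A Dockerfile build recipe"}
missing_req, missing_rec = check_fields(
    metadata,
    required=["description", "name"],
    recommended=["citation", "license", "url"],
)
print(missing_req)  # []
print(missing_rec)  # ['citation', 'license', 'url']
```

A real recipe additionally knows the expected types (e.g. `Person|Organization`), which the library validates against the specification.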
We can first read in our specification, which is packaged and provided with the Python library.
```python
spec = Schema("SoftwareSourceCode")
# Specification base set to http://www.schema.org
# Using Version 5.0
# Found http://www.schema.org/SoftwareSourceCode
# SoftwareSourceCode: found 104 properties
```
It’s pretty straightforward: we read in the specification from the library, and it tells us the version and the number of properties. Now let’s read in our recipe.
```python
recipe = RecipeParser("recipe.yml")
print(recipe.loaded)
```
Once the recipe is loaded, you can see the required properties at `recipe.loaded`. For the entire list of properties defined for our SoftwareSourceCode, look at `spec._properties`. For those that you’ve extracted and added, look at `spec.properties`.
At this point, you want to extract your information from the Dockerfile until the recipe validates against the schema. To do this, I used the Singularity Python Dockerfile parser.
```bash
pip install spython
```

```python
from spython.main.parse.parsers import DockerParser
parser = DockerParser("Dockerfile").parse()
```
Now here is how I add a property. Let’s add the obvious ones from the Dockerfile.
```python
spec.add_property('environment', parser.environ)  # currently a list
spec.add_property('entrypoint', parser.entrypoint)
spec.add_property('description', 'A Dockerfile build recipe')

# This would be extracted at build -> push time, so we know the uri
spec.add_property('name', "vanessa/sregistry")
spec.add_property('ContainerImage', parser.fromHeader)
```
Depending on where you are doing this (a CI server, your computer, or elsewhere) this is where you can do interesting things like use Google’s container-diff to get dependencies, or any other kind of parsing of the container guts. The metadata that you add here will help with search, so add meaningful things.
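To make the extraction step concrete, here is a deliberately naive sketch of the kind of fields a Dockerfile parser pulls out, using only the standard library. This is a toy illustration, not spython's actual implementation (which handles multi-line instructions, JSON forms, and much more); use `DockerParser` as shown above for real work.

```python
# Toy Dockerfile field extraction (NOT spython) for illustration only.
dockerfile = """\
FROM alpine:3.4
ENV LANG=C.UTF-8
ENTRYPOINT ["/bin/sh"]
"""

def parse_dockerfile(text):
    """Pull a few simple fields from single-line Dockerfile instructions."""
    fields = {"environment": [], "entrypoint": None, "from": None}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("FROM "):
            fields["from"] = line[len("FROM "):]
        elif line.startswith("ENV "):
            fields["environment"].append(line[len("ENV "):])
        elif line.startswith("ENTRYPOINT "):
            fields["entrypoint"] = line[len("ENTRYPOINT "):]
    return fields

print(parse_dockerfile(dockerfile)["from"])  # alpine:3.4
```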
When you are done, validate your specification.
```python
recipe.validate(spec)
```
The module includes a simple HTML template to embed the metadata and make a pretty web page to view it. You can change the template argument, or just use one of the templates provided here.
```python
from schemaorg.templates.google import make_dataset
dataset = make_dataset(spec, "index.html")
print(dataset)
```
You’ll see an HTML page returned and also written to index.html. By setting the template variable in the make_dataset function, you can control the output. You have a choice of the following templates:
For example:
```python
dataset = make_dataset(spec, "vue-table.html", template="google/dataset-vue-table.html")
```
For the pretty templates, see the examples folder below.
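Whichever template you choose, the essential output that search engines read is the metadata embedded as JSON-LD in a `<script>` tag. The embedding step can be sketched with the standard library alone; this is a simplified illustration of the idea, not the actual template code, and the `metadata` dict here is a made-up example.

```python
# Sketch of JSON-LD embedding in an HTML script tag (not the template code).
import json

metadata = {
    "@context": "http://www.schema.org",
    "@type": "SoftwareSourceCode",
    "name": "vanessa/sregistry",
    "description": "A Dockerfile build recipe",
}

# Google Search and similar consumers look for exactly this script type.
snippet = '<script type="application/ld+json">\n%s\n</script>' % json.dumps(
    metadata, indent=2
)
print(snippet)
```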
If you’ve generated an embedded json-ld, how do you load it again? You can actually use the recipe’s base parser (RecipeParser) to do this.
```python
result = RecipeParser('SoftwareSourceCode.html')
# [schemaorg-recipe][SoftwareSourceCode.html]

result.loaded
{'@context': 'http://www.schema.org',
 '@type': 'SoftwareSourceCode',
 'about': 'This is a Dockerfile provided by the Dinosaur Dataset collection.',
 'codeRepository': 'https://www.github.com/openschemas/dockerfiles',
 'creator': {'@type': 'Person',
  'contactPoint': {'@type': 'ContactPoint'},
  'name': '@vsoch'},
 'description': 'A Dockerfile build recipe',
 'name': 'deforce/alpine-wxpython:latest',
 'runtime': 'Docker',
 'sameAs': 'ImageDefinition',
 'schemas': {},
 'thumbnailUrl': 'https://vsoch.github.io/datasets/assets/img/avocado.png',
 'version': '3.4'}
```
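Loading embedded json-ld essentially means finding the `application/ld+json` script tag and parsing its contents. A stdlib-only sketch of that idea (the library's parser handles this for you; the `html` string here is a made-up example):

```python
# Sketch of extracting embedded JSON-LD from an HTML page (stdlib only).
import json
import re

html = """<html><body>
<script type="application/ld+json">
{"@context": "http://www.schema.org", "@type": "SoftwareSourceCode", "name": "deforce/alpine-wxpython:latest"}
</script>
</body></html>"""

match = re.search(
    r'<script type="application/ld\+json">(.*?)</script>', html, re.DOTALL
)
loaded = json.loads(match.group(1))
print(loaded["@type"])  # SoftwareSourceCode
```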
If there is interest, we could easily extend the library to look at the type and the version, and load the initial schema with it. Please open an issue if this would be useful to you!
If you want to use a version of schemaorg that is not provided, you can download the schemaorg/schemaorg repository and move the content of a version subfolder in `data/releases/<version>` into `schemaorg/data/releases/<version>` here. The same can be done for the entire content of the current extensions folder (`data/ext`) into `schemaorg/data/ext`.