schemaorg Python is in early development, and in this document I will detail the purpose of this Python module.
The package is provided on pip but you can also install from source.
pip install schemaorg
git clone https://www.github.com/openschemas/schemaorg
cd schemaorg
python setup.py install
The high-level goal is to make it easy to tag datasets, containers, and other software so that they are accessible via Google Search as a dataset (or similar, as Google develops these search types) or programmatically via an API. This means that:
If I’m a researcher
If I’m a developer
The goals for this early development are simple - to define a “Container” in schema.org so that we can then discover and query the (currently) expansive and disorganized universe of containers. This comes down to:
1. Container Definition in Schema.org
Defining “ContainerRecipe” and “ContainerImage” in schema.org. While imperfect, after discussion with the OCI community I am doing an early proposal:
Thing > ContainerImage
Thing > CreativeWork > SoftwareSourceCode > ContainerRecipe
I summarize the discussion and rationale here.
2. Tagging of Dockerfiles as ContainerRecipe
Next I want to be able to tag these Dockerfiles as ContainerRecipe. Since this entity has not been added to schema.org yet and I’m too impatient to wait for meetings, I will give this a first shot, just call them “Datasets”, and use this exercise to develop the codebase here.
This is brief usage. For complete examples, see the next section. For this
example, let’s prepare metadata for a
The recipe defines which properties are needed for a specific use case. This might be the minimum set for a registry, for example. The Schema object represents the schema itself.
from schemaorg.main.parse import RecipeParser
from schemaorg.main import Schema
First let’s read in our recipe.yml file. This file looks like this:
version: 1
schemas:
  SoftwareSourceCode:
    recommended:
      - softwareVersion: version
      - citation
      - identifier
      - keywords
      - license
      - url
      - sameAs
      - spatialCoverage
      - temporalCoverage
      - variableMeasured
    required:
      - description
      - name
  Person|Organization:
    required:
      - description
      - name
You can see that it tells us the required and recommended fields that we need, along with the actual specification types.
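To make that structure concrete, here is a minimal sketch of how the required and recommended fields could be pulled out of a recipe once it has been loaded into a Python dictionary. The dictionary mirrors part of recipe.yml by hand so the example needs no YAML library, and `fields_for` is a hypothetical helper for illustration, not part of schemaorg.

```python
# Hand-written mirror of part of recipe.yml, shaped as a YAML loader would return it.
recipe = {
    "version": 1,
    "schemas": {
        "SoftwareSourceCode": {
            "recommended": [
                {"softwareVersion": "version"},  # "key: value" entries load as one-key mappings
                "citation",
                "identifier",
            ],
            "required": ["description", "name"],
        },
        "Person|Organization": {
            "required": ["description", "name"],
        },
    },
}


def fields_for(recipe, schema_type, level):
    """Return the property names listed under a level ("required" or "recommended").

    Hypothetical helper for illustration; not part of the schemaorg package.
    """
    entries = recipe["schemas"].get(schema_type, {}).get(level, [])
    names = []
    for entry in entries:
        if isinstance(entry, dict):
            # Keep only the property name from a mapping entry.
            names.extend(entry.keys())
        else:
            names.append(entry)
    return names


required = fields_for(recipe, "SoftwareSourceCode", "required")
recommended = fields_for(recipe, "SoftwareSourceCode", "recommended")
print(required)
print(recommended)
```

A schema type with no entry for a level (here, Person|Organization has no recommended list) simply yields an empty list.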
We can first read in our specification. This is the production specification, and it is provided with the Python library.
spec = Schema("SoftwareSourceCode")
Specification base set to http://www.schema.org
Using Version 3.4
Found http://www.schema.org/SoftwareSourceCode
SoftwareSourceCode: found 101 properties
It’s pretty straightforward: we read in the specification from the library, and it tells us the version and the number of properties. Now let’s read in our recipe.
recipe = RecipeParser("recipe.yml")
print(recipe.loaded)
Once the recipe is loaded, you can see the required properties at “recipe.loaded”. For the entire list of properties that are defined for our SoftwareSourceCode, look at “spec._properties”. For those that you’ve extracted and added, look at “spec.properties”.
At this point, you want to extract your information from the Dockerfile until the recipe validates against the schema. To do this, I used the Singularity Python Dockerfile parser.
pip install spython
from spython.main.parse import DockerRecipe
parser = DockerRecipe("Dockerfile")
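To give a rough sense of what this kind of parser pulls out of a Dockerfile, here is a standard-library sketch. It is not spython's implementation: `parse_dockerfile` is a hypothetical function, it ignores line continuations and multi-stage builds, and its dictionary keys are only chosen to mirror the attribute names used in this walkthrough.

```python
import json
import shlex


def parse_dockerfile(text):
    """Collect FROM, ENV, and ENTRYPOINT values from Dockerfile text.

    Simplified on purpose: no line continuations, ARG substitution,
    or multi-stage handling. Hypothetical, not spython's implementation.
    """
    parsed = {"fromHeader": None, "environ": [], "entrypoint": None}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        instruction = parts[0].upper()
        rest = parts[1].strip() if len(parts) > 1 else ""
        if instruction == "FROM":
            parsed["fromHeader"] = rest
        elif instruction == "ENV":
            parsed["environ"].append(rest)
        elif instruction == "ENTRYPOINT":
            # Exec form is a JSON array; shell form is a plain command string.
            if rest.startswith("["):
                parsed["entrypoint"] = json.loads(rest)
            else:
                parsed["entrypoint"] = shlex.split(rest)
    return parsed


example = """\
# A tiny Dockerfile for illustration
FROM ubuntu:18.04
ENV PATH=/opt/bin:$PATH
ENV LANG C.UTF-8
ENTRYPOINT ["/bin/bash", "run.sh"]
"""

result = parse_dockerfile(example)
print(result["fromHeader"])
print(result["environ"])
print(result["entrypoint"])
```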
Now here is how I add a property. Let’s add the obvious ones from the Dockerfile.
# "containerRecipe" was not defined in this walkthrough; this assumes the Schema object exposes its version as spec.version
spec.add_property('version', spec.version)
spec.add_property('environment', parser.environ) # currently a list
spec.add_property('entrypoint', parser.entrypoint)
spec.add_property('description', 'A Dockerfile build recipe')
This would be extracted at build -> push time, so we know the URI.
spec.add_property('name', "vanessa/sregistry")
spec.add_property('ContainerImage', parser.fromHeader)
Depending on where you are doing this (a CI server, your computer, or elsewhere) this is where you can do interesting things like use Google’s container-diff to get dependencies, or any other kind of parsing of the container guts. The metadata that you add here will help with search, so add meaningful things.
When you are done, validate your specification.
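As a rough illustration of what such a check involves, the sketch below compares the recipe's required fields against the properties collected so far. It uses plain Python data rather than the library objects, and `missing_required` is a hypothetical helper, so treat it as a picture of the idea and not as the module's validation code.

```python
def missing_required(required, properties):
    """Return the required property names that are absent or empty."""
    return [name for name in required if properties.get(name) in (None, "", [])]


# Required fields as listed for SoftwareSourceCode in recipe.yml.
required = ["description", "name"]

# Stand-in for the properties added so far (a plain dict, for illustration).
properties = {
    "description": "A Dockerfile build recipe",
    "entrypoint": ["/bin/bash", "run.sh"],
}

missing = missing_required(required, properties)
if missing:
    print("Invalid, missing required properties: " + ", ".join(missing))
else:
    print("Valid")

# Adding the missing property makes the check pass.
properties["name"] = "vanessa/sregistry"
print(missing_required(required, properties))
```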
The module includes a simple HTML template to embed the metadata and make a pretty web page to view it. You can change the template argument, or just use one of the templates provided here.
from schemaorg.templates.google import make_dataset
dataset = make_dataset(spec, "index.html")
print(dataset)
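For readers curious what embedding metadata in a page looks like in general, here is a small standard-library sketch. It wraps a few properties as schema.org JSON-LD and places them in a script tag of type application/ld+json, which is the markup form that search engines read. The `PAGE` template and `render_page` function are hypothetical and much simpler than the templates that ship with the module, and the sketch does no HTML escaping.

```python
import json
from string import Template

# Hypothetical minimal page template; the module's own templates are richer.
PAGE = Template("""<!DOCTYPE html>
<html>
<head>
<title>$title</title>
<script type="application/ld+json">
$metadata
</script>
</head>
<body><h1>$title</h1></body>
</html>
""")


def render_page(properties, schema_type="Dataset"):
    """Wrap properties as JSON-LD and embed them in the HTML template."""
    metadata = {"@context": "http://schema.org", "@type": schema_type}
    metadata.update(properties)
    return PAGE.substitute(
        title=properties.get("name", schema_type),
        metadata=json.dumps(metadata, indent=2),
    )


page = render_page({
    "name": "vanessa/sregistry",
    "description": "A Dockerfile build recipe",
})
print(page)
```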
For the pretty templates, see the examples folder below.