We present use cases that encompass two distinct models for validation. In the first use case, validation is performed on clinical data in an institutional context. In the second group of use cases, validation is performed via the Wikidata Query Service, a public SPARQL endpoint maintained as part of the Wikidata infrastructure.
3.1 Domain-Specific ShEx Validation in Medical Informatics
The Yosemite Project started in 2013 as a response to a 2010 report by the President’s Council of Advisors on Science and Technology calling for a universal exchange language for healthcare. As part of its initial efforts, this project released the “Yosemite Manifesto”Footnote 18, a position statement signed by over 100 thought leaders in healthcare informatics which recommended RDF as the “best available candidate for a universal healthcare exchange language” and stated that “electronic healthcare information should be exchanged in a format that either: (a) is in RDF format directly; or (b) has a standard mapping to RDF”.
Around the same time as the Yosemite Project meeting, a new collection of standards for the exchange of clinical data was beginning to gather momentum. “Fast Healthcare Interoperability Resources (FHIR)” defined a modeling environment, framework, community and architecture for REST-oriented access to clinical resources. The FHIR specification defines some 130+ healthcare- and modeling-related “resources” and describes how they are represented in XMLFootnote 19 and JSONFootnote 20. One of the outcomes of the Yosemite Project was the formation of the FHIR RDF/Health Care Life Sciences (FHIR/HCLS) working groupFootnote 21, tasked with defining an RDF representation format for FHIR resources.
ShEx played a critical role in the development of the FHIR RDF specification. Prior to its introduction to ShEx, the community tried to use a set of representative examples as the basis for discussion. This was a slow process, as the actual rules for the underlying transformation were implicit. There was no easy way to verify that the examples covered all possible use cases and that they were internally self-consistent. Newcomers to the project faced a steep learning curve. The introduction of ShEx helped to streamline and formalize the process. Instead of talking in terms of examples, the group could address how instances of entire FHIR resource models would be represented as RDF. Edge cases that seldom appeared received the same scrutiny as did everyday usage examples. The proposed transformation rules could be implemented in software, with the entire FHIR specification being automatically transformed to its ShEx equivalent.
ShEx allowed the participants to finalize discussions and settle on a formal model and first specification draft in less than three months. A formal transformation was created to map the (then) 109 FHIR resource definitions into schemas for the RDF binding. This transformation uncovered several issues with the specification itself and provided a template for the bidirectional transformation between RDF and the abstract FHIR model instances. The documentation production pipeline was additionally extended to transform the 511 JSON and XML examples into RDF, which were then tested against the generated ShEx schemas. These tests both caught multiple errors in the transformation software and uncovered a number of additional issues in the specification itself, ensuring that the user-facing documentation was accurate and comprehensive. In early 2017, the FHIR documentation production framework, written in Java, switched from using the shex.js implementation to natively calling the Shaclex implementation. As a testament to the quality of the standard, both implementations agreed on the validity of all 511 examples. The first official version of the FHIR RDF specification was released in the FHIR Standard for Trial Use (STU3) release in April of 2017.
3.2 Domain-Generic ShEx Validation in Wikidata
What Wikipedia is to text, Wikidata is to data: an open collaboratively curated resource that anyone can contribute to. In contrast to the language-specific Wikipedias, Wikidata is Semantic Web-compatible, and most of the edits are made using automated or semi-automated tools. This ‘data commons’ provides structured public data for Wikipedia articles and other applications. For each Wikipedia article, in any language, there is an item in Wikidata, and if the same concept is described in more than one Wikipedia, then Wikidata maintains the links between them.
In contrast to language-specific Wikipedias, and to most other sites on the web, Wikidata does not assume that users who collaborate have a common natural language. In fact, consecutive editors of a given Wikidata entry often do not share a language other than some basic knowledge about the Wikidata data model. Using ShEx to make those data models more explicit can improve such cross-linguistic collaboration.
Wikidata is hosted on Wikibase, a non-relational database maintained by the Wikimedia Foundation. The underlying infrastructure also contains a SPARQL engine (https://query.wikidata.org) that feeds on a triplestore which is continuously synchronized with Wikibase. This synchronization, which occurs in seconds, makes data in Wikidata available as Linked Data almost immediately, so that it becomes part of the Semantic Web. In effect, Wikidata acts as an “edit button” for the Semantic Web and as an entry point for users who otherwise do not have the technical background to use Semantic Web infrastructure. While Wikidata and its RDF dump are technically separate, they can be perceived as one from a user perspective. Content negotiation presents either the Wikibase form or the RDF form, creating a sense of unity between the two. For instance, https://wikidata.org/entity/Q54872 (which identifies RDF) points to the Wikibase entry at https://www.wikidata.org/wiki/Q54872, while http://www.wikidata.org/entity/Q54872.ttl provides the Turtle representation and http://www.wikidata.org/entity/Q54872.json a JSON export.
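The URL forms quoted above follow a regular pattern, which the following minimal Python sketch makes explicit. It performs no network access; the helper name `item_urls` is hypothetical, and the patterns are simply those shown in the example for Q54872, so the sketch is restricted to items (identifiers starting with Q).

```python
def item_urls(item_id: str) -> dict:
    """Return the URL forms quoted in the text for one Wikidata item.

    Assumption: only items (Q...) are handled; other entity types may use
    different page URLs.
    """
    if not item_id.startswith("Q"):
        raise ValueError("this sketch only handles item identifiers (Q...)")
    concept = "http://www.wikidata.org/entity/" + item_id
    return {
        "concept": concept,                                  # identifies the entity
        "page": "https://www.wikidata.org/wiki/" + item_id,  # Wikibase entry
        "turtle": concept + ".ttl",                          # Turtle representation
        "json": concept + ".json",                           # JSON export
    }


urls = item_urls("Q54872")
print(urls["turtle"])
print(urls["json"])
```

A validation tool could use the `turtle` form as the location of the RDF data to be checked for a given item.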
The Wikidata data model currently consists of two entity types: items and properties (a third one, for lexemes, is about to be introduced). All entities have persistent identifiers composed of single-letter prefixes (Q for items, P for properties, L for lexemes) plus a string of numbers and are allotted a page in Wikidata. For instance, the entity Q1676669 is the item for JPEG File Interchange Format, version 1.02. Properties like instance of (P31) and part of (P361) are used to assert claims about an item. A claim, its references and qualifiers form a statement. Currently, Wikidata’s RDF graph comprises about 5 billion triples (with millions added per day), which reflects about 500 million statements involving about 50 million items and roughly 5000 properties.
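The identifier scheme described above (a single-letter prefix followed by digits) can be recognized with a short regular expression. The Python sketch below is illustrative only: the function name `classify` is hypothetical, and the pattern assumes that the numeric part has no leading zeros, which the text does not state.

```python
import re

# Prefix letters as given in the text: Q (items), P (properties), L (lexemes).
ENTITY_ID = re.compile(r"([QPL])([1-9][0-9]*)")
KINDS = {"Q": "item", "P": "property", "L": "lexeme"}


def classify(entity_id: str):
    """Return (entity type, numeric part), or None if the string is not an identifier."""
    match = ENTITY_ID.fullmatch(entity_id)
    if match is None:
        return None
    return KINDS[match.group(1)], int(match.group(2))


print(classify("Q1676669"))  # ('item', 1676669)
print(classify("P31"))       # ('property', 31)
print(classify("X12"))       # None
```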
Besides serving Wikipedia and its sister projects, Wikidata also acts as a data backend for a complex ecosystem of tools and services. Some of these are general-purpose semantic tools like search engines or personal assistants, while others are tailored for specific scientific communities, e.g. Wikigenomes for curating microbial genomes, WikiDP for digital preservation of software, or Scholia for exploring scholarly publications. Through such tools, communities that are not active on Wikidata can engage with the Wikidata RDF graph. ShEx can facilitate that.
Non-ShEx Validation Workflows for Wikidata. Wikidata uses constraints for validation in multiple ways. For instance, some edits are rejected by the user interface or the API, e.g. certain formats or values for dates cannot be saved. Some of the quality control also involves patrolling individual edits.
Most of the quality control, however, takes place on the data itself. Initially, the primary mechanism for this was a system of Mediawiki templatesFootnote 22, similar to the infobox templates on Wikipedia. These templates express a range of constraints like “items about movies should link to the items about the actors starring in it” or “this property should only be used on items that represent human settlements” or a regular expression specifying the format of allowed values for a given property. For more complex constraints, some SPARQL functionality is available through such templates. In addition, an automated tool goes through the data dumps on a daily basis, identifies cases where such template-based constraints have been violated, and posts notifications on dedicated wiki pages where Wikidata editors can review and act on themFootnote 23. This template-based validation infrastructure, while still largely functional, has been superseded by a parallel one, built later, that uses dedicated propertiesFootnote 24 to express constraints on individual properties or their values, or on relationships involving several properties or specific classes of items. For instance, P1793 is for “format as a regular expression”, P2302 more generally for “property constraint”, and P2303 for “exception to constraint” (used as a qualifier to P2302). This way, the constraints themselves become part of the Wikidata RDF graph. This arrangement is further supported by dedicated Mediawiki extensionsFootnote 25, one of which also contains a gadget that logged-in users can enable in their preferences in order to be notified through the user interface if a constraint violation has been detected on the item or statement they are viewing. The reports generated by these constraint violation systems support inspection on a per-property basis.
This system of validation requires community members to create and apply constraint properties on each of the Wikidata properties, of which there are more than five thousand. Constraints have not yet been added to all properties.
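To make the idea of a per-property format constraint concrete, the following Python sketch checks the values of one property against a regular expression and skips items listed as exceptions, loosely mirroring the roles of P1793 and P2303. It is not the implementation used by Wikidata: the function name, the item labels and the pattern are all invented for illustration.

```python
import re


def format_violations(statements, pattern, exceptions=frozenset()):
    """Return the (item, value) pairs whose value does not match `pattern`.

    statements: iterable of (item_id, value) pairs for a single property.
    exceptions: items exempt from the constraint.
    """
    regex = re.compile(pattern)
    return [
        (item, value)
        for item, value in statements
        if item not in exceptions and regex.fullmatch(value) is None
    ]


# Hypothetical data: three items using the same identifier-like property.
statements = [("item_a", "AB1234"), ("item_b", "ab1234"), ("item_c", "AB12")]
print(format_violations(statements, r"[A-Z]{2}\d{4}", exceptions={"item_c"}))
# prints [('item_b', 'ab1234')]
```

Because the check is attached to the property alone, it applies to every item that uses the property, regardless of what kind of item it is.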
ShEx is a context-sensitive grammar with algebraic operations, while Property Validation is context-free. Unlike in ShEx Validation, where properties have context-sensitive constraints, Property Validation constraints must be permissive enough to permit all current or expected uses of the property. For example, a ShEx constraint that every human gene use the common property P31 “instance of” to declare itself an instance of a human gene must not be expressed as a property constraint, as P31 is used for 56,000 other classes. Property Validation is additionally problematic because the author of a constraint may not be aware of the property’s use in other classes. ShEx allows us to write schemas that describe multiple properties, their constraints, and their permissible values in combinations for which property constraints have not yet been created in Wikidata’s infrastructure. This allows us to test conformance to schemas that describe features that may not yet be relevant for the Wikidata community, but may be necessary for an external application.
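The difference between the two approaches can be illustrated with a toy example in Python. This is a deliberately simplified sketch and does not implement ShEx semantics: the graph, the node names and the class labels are invented, and a “shape” is reduced to a mapping from a property to one required value.

```python
# Toy data: node -> property -> set of values. All identifiers are made up.
GRAPH = {
    "gene_a": {"P31": {"human_gene"}},
    "gene_b": {"P31": {"protein"}},
    "city_a": {"P31": {"city"}},
}


def property_violations(graph, prop, allowed):
    """Context-free check: every node that uses `prop` must use an allowed value."""
    return sorted(
        node
        for node, props in graph.items()
        if prop in props and not props[prop] <= allowed
    )


def conforms(graph, node, shape):
    """Context-sensitive check: only the chosen `node` is tested against `shape`."""
    props = graph.get(node, {})
    return all(value in props.get(prop, set()) for prop, value in shape.items())


HUMAN_GENE_SHAPE = {"P31": "human_gene"}

# A property-level rule "P31 must be human_gene" wrongly flags the city as well.
print(property_violations(GRAPH, "P31", {"human_gene"}))  # ['city_a', 'gene_b']

# A shape is applied only to the nodes that are expected to be human genes.
print(conforms(GRAPH, "gene_a", HUMAN_GENE_SHAPE))  # True
print(conforms(GRAPH, "gene_b", HUMAN_GENE_SHAPE))  # False
```

In the property-level check the unrelated node `city_a` is reported as a violation, which is why such a constraint has to be loosened until it accepts every use of the property; the shape-level check never looks at `city_a` unless it is explicitly selected.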
Generic ShEx Validation Workflow for Wikidata. One issue with the existing template-based constraint and validation mechanisms for Wikidata is that they are usually very specific to the Wikidata platform or to the tools used for interacting with it. ShEx provides a way to link Wikidata-based validation with validation mechanisms developed or used elsewhere. Getting there from the RDF representation of the Wikidata constraints is a relatively small step.
Efforts around the usage of ShEx on Wikidata are coordinated by WikiProject ShExFootnote 26. The ShEx-based validation workflow for Wikidata consists of:
1. writing a schema for the data type in question, or choosing an existing one;
2. transferring that schema into the Wikidata model of items, statements, qualifiers and references;
3. writing a ShEx manifest for the Wikidata-based schema;
4. testing entity data from Wikidata for conformance to the ShEx manifest.
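The final step of this workflow can be pictured as a loop over candidate entities. The Python sketch below is only an outline under stated assumptions: the function names are hypothetical, the validator is passed in as a parameter rather than taken from any particular ShEx implementation, and the stub used in the demonstration merely stands in for a real validator. The `wd:` and `wdt:` prefixes in the query are those predefined by the Wikidata Query Service, and Q5 is the item for “human”.

```python
ENTITY_PREFIX = "http://www.wikidata.org/entity/"


def focus_query(class_id, limit=10):
    """Build a SPARQL query selecting items that are instances (P31) of `class_id`."""
    return "SELECT ?item WHERE { ?item wdt:P31 wd:%s } LIMIT %d" % (class_id, limit)


def check_entities(entity_ids, schema, validate):
    """Apply validate(focus_iri, schema) to each entity; return id -> result."""
    return {eid: validate(ENTITY_PREFIX + eid, schema) for eid in entity_ids}


def stub_validate(focus_iri, schema):
    # Placeholder only: a real validator would fetch the entity's RDF and
    # test it against the ShEx schema. Here a single IRI is accepted.
    return focus_iri == ENTITY_PREFIX + "Q1676669"


print(focus_query("Q5", limit=3))
print(check_entities(["Q1676669", "Q54872"], "<schema text>", stub_validate))
```

Keeping the validator as a parameter means the same loop could be run against different ShEx implementations and their results compared.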
Initially, Wikidata may be missing some properties for adequately representing such a schema. Such missing properties can be proposed and, after a process involving community input, created. Once they appear in the Wikidata RDF graph, ShEx can be used to validate the corresponding RDF shapes.
At present, the ShEx manifests for Wikidata are hosted on GitHub, but they could be included in the Wikidata infrastructure, e.g. through a dedicated property similar to format as a regular expression (P1793).