Parsing OpenAPI v3 specifications

OpenAPI is a global standard for describing REST APIs. The descriptions are written in either YAML or JSON format. These description files are immensely useful, with a rich ecosystem of tools built around them, and because they follow a defined standard they are easy to work with. They give the consumer of an API the power to explore it even before writing any code against it.

But in real-life production scenarios, there is sometimes a requirement to customise the specification files. For example, a use case can be:

Some operations / fields should be visible only to some consumers of the API, and not all.

One way of doing this is to parse the specification file, based on the (logged-in) consumer downloading it, and filter out the operations/parameters they should not see. This is an interesting design problem, as the resulting file must still be a valid OpenAPI schema. I will describe such a design for an OpenAPI v3 JSON specification file.

First things first, we should understand the input we need to work with. There are several possible scenarios:

  1. we have complete JSON paths to all the operations/parameters/schemas that need to be removed
  2. we have partial JSON paths, or maybe only the names of the operations/parameters/schemas that need to be removed
  3. we have some other indicator, for example a user-defined property ‘x-hidden’ = ‘true’, from which we have to find the operations/parameters/schemas that need to be removed
  4. we have the paths in some other format, which we might need to convert into valid JSON paths

Next, we need a parsing library that is efficient and suits our purpose. I worked with Jayway JsonPath, which provides filter searches that will be useful in this case. The only drawbacks:

  • filter searches and recursive searches are not very efficient (they slow down performance)
  • the library does not provide parent-child navigation on JSON nodes

Design:

We will try to handle all the input scenarios listed above in our design, and make it generic enough to handle more.

For scenarios 3 and 4, some preparation work is required to gather the required JSON paths (absolute or partial). The JsonPath library’s recursive (deep-scan) search can be helpful here:

import static com.jayway.jsonpath.JsonPath.using;
import com.jayway.jsonpath.Configuration;
import com.jayway.jsonpath.DocumentContext;
import com.jayway.jsonpath.Option;
import java.util.List;

// AS_PATH_LIST makes read() return the matching paths rather than the values
Configuration conf = Configuration.builder().options(Option.AS_PATH_LIST).build();
DocumentContext document = using(conf).parse(json);
List<String> jsonPaths = document.read("$..x-hidden"); // path of every x-hidden property

Once we have a list of JSON paths (absolute or partial), they can be fed into the parser. The paths look like:

$..address
$..requestBody..billing..city
$.paths.dummy
$.paths.dummy..leaf
$..requestBody..billing..city.SYDNEY
$..responses..address..AUSTRALIA

The parser proceeds as described below:

Step 1: Read the JSON path against the provided JSON file. Take “$..requestBody..billing..city” for example. If some nodes are found, great: store the information as a ParserNode object in the queue. If not, extract the last segment “city” as a subpath and try reading the shortened path “$..requestBody..billing”. This handles the cases where the “billing” schema is defined as a reference, and the node “city” is actually present inside that referenced schema. Keep doing this until either the read result is non-empty, or only “$” is left as a path, in which case we have exhausted the path and found nothing.
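The shortening loop of Step 1 can be sketched as follows. This is a minimal, library-free sketch under assumptions: the `reader` function stands in for a JsonPath read, the `Result`/`shorten` names are mine, and only recursive-descent (`..`) segments like the running example are handled.

```java
import java.util.*;
import java.util.function.Function;

public class PathShortener {
    // Outcome of Step 1: the path that finally matched, plus the segments
    // peeled off along the way (outermost-first).
    static class Result {
        final String rootPath;
        final List<String> subpaths;
        Result(String rootPath, List<String> subpaths) {
            this.rootPath = rootPath;
            this.subpaths = subpaths;
        }
    }

    // Peel trailing ".." segments off the path until the reader finds
    // something, or only "$" is left (path exhausted, nothing found).
    static Result shorten(String jsonPath, Function<String, List<?>> reader) {
        Deque<String> peeled = new ArrayDeque<>();
        String path = jsonPath;
        while (!path.equals("$") && reader.apply(path).isEmpty()) {
            int cut = path.lastIndexOf("..");
            peeled.addFirst(path.substring(cut + 2)); // e.g. "city"
            path = cut <= 1 ? "$" : path.substring(0, cut);
        }
        return new Result(path, new ArrayList<>(peeled));
    }
}
```

With a reader that only matches “$..requestBody..billing”, shortening “$..requestBody..billing..city” leaves that root path with “city” as the single remaining subpath.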

Let’s walk through the flow using “$..requestBody..billing..city”. Say that at the end of Step 1 the queue contains one ParserNode object looking like:

rootPath = “$..requestBody..billing”

subpath = [“city”]

nodes = the JSONArray of all “billing” nodes found under all “requestBody” nodes
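That per-path bookkeeping can be captured in a small class. A sketch (field names follow the article, but the class itself is my reconstruction; `Object` stands in for whatever node type the JsonPath library returns):

```java
import java.util.List;

// One unit of parser work, queued between Step 1 and Step 2.
public class ParserNode {
    final String rootPath;       // where the read matched, e.g. "$..requestBody..billing"
    final List<String> subpaths; // segments still unresolved, e.g. ["city"]
    final List<Object> nodes;    // read results for rootPath

    ParserNode(String rootPath, List<String> subpaths, List<Object> nodes) {
        this.rootPath = rootPath;
        this.subpaths = subpaths;
        this.nodes = nodes;
    }

    // Leaf: nothing left to resolve, so the nodes themselves can be actioned.
    boolean isLeaf() { return subpaths.isEmpty(); }
}
```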

Step 2: Poll the queue. For the polled ParserNode object, loop over its child nodes and evaluate each node against the following strategies.

Strategy a — check whether the node contains the next subpath as a direct child. This may seem like a redundant check, but in some cases it is required. For our example flow, this is false. When it is true, “rootPath.latest-subpath” should be read and a new ParserNode object created from the read results, with one less element in its subpaths list. This object is then added to the queue.

Strategy b — this will be true for our example: the node contains a reference, something like ‘$ref’ = ‘#/components/schemas/billing_address’. Here we read the JSON path “$.components.schemas.billing_address” in the file, gather the results into a ParserNode object, and add the object to the queue.

Strategy c — this will be true for enum cases, like “$..requestBody..billing..city.SYDNEY”. As the enum value will always be the last subpath left, we simply check whether the node contains an ‘enum’ child.

Strategy d — by this time we have exhausted the other cases, and either we have the leaf node in our hands (subpaths will be empty) or we have to try a recursive search with the subpath as a descendant, i.e. with the JSON path “rootPath..latest-subpath”.
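Putting the four strategies together, the Step 2 evaluation of each child node looks roughly like this. A sketch under assumptions: a minimal ParserNode holder is defined inline so the snippet is self-contained, the node is modelled as a plain Map, and in a real implementation each enqueued path would be read first (as in Step 1) before the new ParserNode is built.

```java
import java.util.*;

public class StrategyDispatch {
    // Minimal holder for queued parser work (names follow the article).
    static class ParserNode {
        final String rootPath;
        final List<String> subpaths;
        ParserNode(String rootPath, List<String> subpaths) {
            this.rootPath = rootPath;
            this.subpaths = subpaths;
        }
    }

    // Evaluate one child node against strategies a-d; `toRemove` collects
    // the nodes (or enum values) marked for removal.
    static void evaluate(ParserNode pn, Map<String, Object> node,
                         Deque<ParserNode> queue, List<Object> toRemove) {
        String next = pn.subpaths.isEmpty() ? null : pn.subpaths.get(0);
        List<String> rest = next == null
                ? List.of() : pn.subpaths.subList(1, pn.subpaths.size());

        if (next != null && node.containsKey(next)) {
            // Strategy a: next subpath is a direct child -> read "rootPath.next"
            queue.add(new ParserNode(pn.rootPath + "." + next, rest));
        } else if (node.containsKey("$ref")) {
            // Strategy b: follow the reference, e.g. "#/components/schemas/billing_address"
            String ref = node.get("$ref").toString();
            String refPath = "$." + ref.substring(2).replace('/', '.');
            queue.add(new ParserNode(refPath, pn.subpaths));
        } else if (next != null && pn.subpaths.size() == 1 && node.containsKey("enum")) {
            // Strategy c: the last subpath names an enum value (e.g. SYDNEY)
            toRemove.add(next);
        } else if (next == null) {
            // Strategy d, leaf case: the node itself is actioned
            toRemove.add(node);
        } else {
            // Strategy d, fallback: recursive search "rootPath..next"
            queue.add(new ParserNode(pn.rootPath + ".." + next, rest));
        }
    }
}
```

For the running example, a “billing” node carrying a $ref falls into strategy b, and the reference target becomes the new root path on the queue.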

These are the points at which we can take our required action (removing the node). For huge JSON files, this sequential process may prove time-consuming, and a parallel stream might be required. In such cases, it’s essential that the editing/removing of nodes happens AFTER the reading (as DocumentContext is not thread-safe). Hence we can hold these nodes in an array for processing later.

Another thing to note: while removing a node, we also have to remove all direct references to it. Otherwise the resulting JSON file may not be valid, as it may contain references pointing to non-existent nodes.

Configuration conf = Configuration.builder().options(Option.AS_PATH_LIST).build();
DocumentContext document = using(conf).parse(json);
// filter search: the path of every node that directly references the removed schema
List<String> jsonPaths = document.read("$..*[?(@.$ref=='#/components/schemas/billing_address/city')]");

Continuing the same example, let’s say we are ready to remove the “city” leaf node (which is inside the “billing_address” schema). Then we should look for, and remove, all references to “#/components/schemas/billing_address/city” as well.
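One way to model this clean-up without the JsonPath library is a recursive sweep over the parsed Map/List tree, deleting every object whose $ref equals the removed node’s pointer. A sketch (the tree layout mirrors what a typical JSON parser produces; this is not the author’s code):

```java
import java.util.*;

// After deleting a schema node, strip every object whose "$ref" points at
// it, so no dangling references remain in the document.
public class RefCleaner {
    static void removeRefsTo(Object tree, String targetRef) {
        if (tree instanceof Map) {
            Collection<?> values = ((Map<?, ?>) tree).values();
            values.removeIf(v -> refEquals(v, targetRef));   // drop direct references
            values.forEach(v -> removeRefsTo(v, targetRef)); // recurse into the rest
        } else if (tree instanceof List) {
            List<?> list = (List<?>) tree;
            list.removeIf(v -> refEquals(v, targetRef));
            list.forEach(v -> removeRefsTo(v, targetRef));
        }
    }

    private static boolean refEquals(Object v, String targetRef) {
        return v instanceof Map && targetRef.equals(((Map<?, ?>) v).get("$ref"));
    }
}
```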

While processing a leaf node, we also need to process the references listed inside its definition. This ensures we don’t end up with schema definitions that are never used. But this should be done in a NEXT ROUND of parsing: before removing these references, we need to ensure they are not being used by any other node in the JSON.

Configuration conf = Configuration.builder().options(Option.ALWAYS_RETURN_LIST).build();
DocumentContext document = using(conf).parse(json);
// every $ref used inside the definition that is about to be removed
List<String> refList = document.read("$.components.schemas.billing_address.city..$ref");

Continuing the same example: let’s say we are ready to remove the “city” leaf node, but its schema definition contains a $ref node for “#/components/schemas/area”. Then, while removing “city”, we take note of this reference and later check whether it is safe to remove.

Each time we remove a node, we have to look for references to it, and for the references used inside it. Hence, depending on the complexity of the JSON file, multiple rounds of parsing might be required to clean it up. Note that the number of rounds does not depend on the number of JSON paths fed into the parser; it depends on the structure of the JSON file itself, i.e. how many levels deep the references go.
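The round structure can be sketched as a fixed-point loop: keep removing the pending references and collecting the references discovered inside their definitions, until a round discovers nothing new. `removeAndCollect` below is a hypothetical stand-in for one full parse-and-remove pass, not the author’s API.

```java
import java.util.*;
import java.util.function.Function;

public class RoundRunner {
    // Repeat clean-up rounds until a round discovers no new references.
    // `removeAndCollect` removes the given references and returns the
    // references found inside their definitions.
    static int runRounds(Set<String> initialRefs,
                         Function<Set<String>, Set<String>> removeAndCollect) {
        int rounds = 0;
        Set<String> seen = new HashSet<>(initialRefs);
        Set<String> pending = new HashSet<>(initialRefs);
        while (!pending.isEmpty()) {
            rounds++;
            Set<String> discovered = new HashSet<>(removeAndCollect.apply(pending));
            discovered.removeAll(seen); // only genuinely new refs trigger another round
            seen.addAll(discovered);
            pending = discovered;
        }
        return rounds;
    }
}
```

A reference chain three levels deep (city referencing area, area referencing a third schema) would take three rounds to clean up, which is exactly how the round count tracks the depth of the reference graph rather than the number of input paths.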

Steps 1 and 2 can be done in parallel for each JSON path we need to filter. This results in two lists: one, the final set of nodes to act upon; and two, the final set of references to look for in the next round. The “next rounds” will always contain absolute reference paths only.

My implementation of this design produces a valid OpenAPI v3 JSON file. For an original file of 1.5 MB, the algorithm takes around 5 seconds to filter 130 (all partial) JSON paths, in 3 rounds of parsing.
