Grammar
The formal grammar of the codescan annotation surface. This document specifies the language a Go comment must conform to so that the scanner classifies it, dispatches it to the right builder, and populates the OpenAPI spec deterministically.
Audience. Implementers — anyone porting, extending, or debugging the parser. Annotation authors typically need annotations.md and keywords.md instead.
The grammar is layered:
- Preprocess: comment-marker stripping (see §preprocess).
- Lex: terminal token emission, including multi-line body accumulation (see §lexer).
- Parse: block construction, family dispatch, keyword classification (see §parser).
- Walk: typed dispatch through grammar.Walker callbacks to the builders (see §walker).
The productions below operate on the lexer’s terminal alphabet, not raw text. Per-terminal lexical detail (how the lexer recognises a number, a string, an annotation, …) is described in §lexer; the EBNF that follows consumes pre-classified terminals.
The grammar is rigorous ISO-14977 EBNF. Required vs. optional arguments, value typing, and family membership are grammar-visible — every legality constraint expressible by token sequencing is expressed that way.
Preprocess
Input is a *ast.CommentGroup from go/parser. Each *ast.Comment
in the group is one source-level comment node (either // … or
/* … */). The preprocessor produces a flat sequence of Line
structs, each one source line with:
- Line.Text: content after comment-marker stripping and leading content-prefix trim.
- Line.Raw: content after comment-marker stripping only (preserves leading whitespace).
- Line.Pos: token.Position of the first content byte.
Stripping rules:
- For // comments: drop the // marker. Line.Text runs trimContentPrefix (strips leading \t*/|); Line.Raw keeps the post-marker spacing.
- For /* */ block comments: split body on newlines.
  - First line: drop the /* marker.
  - Continuation lines: run stripBlockContinuation (strips leading whitespace + optional * continuation marker + one following space), then trimContentPrefix.
  - Last line: drop the trailing */.
trimContentPrefix strips \t*/ and a single trailing | from
the line head. It does NOT strip - (so YAML list markers and
markdown dash items survive intact).
For synthetic per-line comments produced by upstream tooling
(notably parsers.ParseRoutePathAnnotation), a // prefix is
prepended before stripping so the // branch fires and the leading
whitespace gets shed correctly.
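The stripping rules above can be sketched as follows. This is a minimal illustration, not the package's implementation: the function bodies are hypothetical, the exact character set shed by trimContentPrefix (space, tab, `*`, `/`, then one `|`) is an assumption read off the description, and block-continuation handling is folded into the same helper for brevity.

```go
package main

import (
	"fmt"
	"strings"
)

// trimContentPrefix approximates the documented rule: shed leading spaces,
// tabs, '*' and '/' characters, then at most one '|'. The exact character
// set is an assumption; '-' is deliberately left alone.
func trimContentPrefix(s string) string {
	s = strings.TrimLeft(s, " \t*/")
	return strings.TrimPrefix(s, "|")
}

// stripLineComment handles one "// ..." comment.
func stripLineComment(c string) string {
	return trimContentPrefix(strings.TrimPrefix(c, "//"))
}

// stripBlockComment handles one "/* ... */" comment and returns one
// stripped string per source line. Continuation-marker handling is folded
// into trimContentPrefix here (a simplification of the two-step rule).
func stripBlockComment(c string) []string {
	body := strings.TrimSuffix(strings.TrimPrefix(c, "/*"), "*/")
	var out []string
	for _, l := range strings.Split(body, "\n") {
		out = append(out, trimContentPrefix(l))
	}
	return out
}

func main() {
	fmt.Printf("%q\n", stripLineComment("// swagger:model Foo"))
	fmt.Printf("%q\n", stripLineComment("// - list item"))
	fmt.Printf("%q\n", stripBlockComment("/*\n * swagger:meta\n */"))
}
```

Note that the dash in the second call survives, which is the property the preprocessor relies on for YAML list markers.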
Lexer
The lexer turns a []Line into a []Token ending in TokenEOF.
Pipeline:
- Line classifier: emit one preliminary token per line (annotation / keyword / fence / blank / text).
- Body accumulator: fold multi-line bodies (OPAQUE_YAML, RAW_BLOCK_*, RAW_VALUE_*) into single body tokens.
- Prose classifier: re-type surviving text tokens as TokenTitle / TokenDesc.
Terminal vocabulary
Annotation name terminals (TokenAnnotation)
Each recognises an annotation name only — positional arguments are emitted as separate terminals.
| Terminal | Annotation |
|---|---|
| ANN_MODEL | swagger:model |
| ANN_RESPONSE | swagger:response |
| ANN_PARAMETERS | swagger:parameters |
| ANN_ROUTE | swagger:route |
| ANN_OPERATION | swagger:operation |
| ANN_META | swagger:meta |
| ANN_STRFMT | swagger:strfmt |
| ANN_ALIAS | swagger:alias |
| ANN_NAME | swagger:name |
| ANN_ALLOF | swagger:allOf |
| ANN_ENUM | swagger:enum |
| ANN_IGNORE | swagger:ignore |
| ANN_DEFAULT | swagger:default |
| ANN_TYPE | swagger:type |
| ANN_FILE | swagger:file |
Argument terminals
| Terminal | Recognises |
|---|---|
| IDENT_NAME | Identifier-shaped token. Used for every named arg and reference. |
| JSON_VALUE | RFC-8259 JSON literal (string / number / boolean / null / array / object). |
| RAW_VALUE | Verbatim non-LF text; fallback when JSON_VALUE recognition fails. |
| TYPE_REF | Closed vocab: string / integer / number / boolean / array / object / file / null. |
| HTTP_METHOD | GET / POST / PUT / PATCH / HEAD / DELETE / OPTIONS / TRACE (case-insensitive). |
| URL_PATH | RFC-3986 URL path token (used as the second positional arg of OperationArgs). |
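
As a rough illustration of how three of these terminals could be recognised, the sketch below uses closed-set lookups for HTTP_METHOD and TYPE_REF and the standard library's `json.Valid` for the JSON_VALUE / RAW_VALUE fallback. The helper names are hypothetical, and treating TYPE_REF as case-sensitive is an assumption; the real lexer may differ on both.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Closed vocabularies copied from the terminal table.
var httpMethods = map[string]bool{
	"GET": true, "POST": true, "PUT": true, "PATCH": true,
	"HEAD": true, "DELETE": true, "OPTIONS": true, "TRACE": true,
}

var typeRefs = map[string]bool{
	"string": true, "integer": true, "number": true, "boolean": true,
	"array": true, "object": true, "file": true, "null": true,
}

// isHTTPMethod matches case-insensitively, as the table states.
func isHTTPMethod(s string) bool { return httpMethods[strings.ToUpper(s)] }

// isTypeRef matches exactly (case sensitivity is an assumption).
func isTypeRef(s string) bool { return typeRefs[s] }

// valueTerminal tries JSON first and falls back to RAW_VALUE.
func valueTerminal(s string) string {
	if json.Valid([]byte(s)) {
		return "JSON_VALUE"
	}
	return "RAW_VALUE"
}

func main() {
	fmt.Println(isHTTPMethod("get"), isHTTPMethod("CONNECT"))
	fmt.Println(isTypeRef("integer"), isTypeRef("float"))
	fmt.Println(valueTerminal(`{"a": 1}`), valueTerminal("hello world"))
}
```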
Keyword head terminals (TokenKeyword)
Each recognises the keyword name only. See keywords.md for the complete keyword surface.
Inline value terminals
The lexer types values per their lexical shape; semantic coercion against the Go target happens in the analyzer.
| Terminal | Recognises |
|---|---|
| NUMBER_VALUE | Signed decimal literal (integer or fractional). |
| INT_VALUE | Unsigned decimal integer. |
| BOOL_VALUE | true / false (case-insensitive). |
| STRING_VALUE | Verbatim non-LF text. |
| COMMA_LIST_VALUE | Comma-separated list of strings, trim-stripped. |
| ENUM_OPTION_VALUE | One of a closed token set declared per keyword (query/path/… for in:, csv/ssv/… for collectionFormat). |
When the lexer fails to type a value against its keyword’s expected
shape, the property reaches the analyzer with Property.Typed.Type == ShapeNone and a CodeInvalidNumber / CodeInvalidInteger /
CodeInvalidBoolean diagnostic is emitted.
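A minimal sketch of lexical typing for the three shapes that can fail, assuming the shapes are exactly as tabulated (signed decimal, unsigned decimal integer, case-insensitive boolean). The function names and the error-string convention are hypothetical; only the diagnostic code names come from this document.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

var (
	numberRe = regexp.MustCompile(`^[+-]?[0-9]+(\.[0-9]+)?$`) // signed decimal
	intRe    = regexp.MustCompile(`^[0-9]+$`)                 // unsigned integer
)

// lexNumber types a NUMBER_VALUE or reports CodeInvalidNumber.
func lexNumber(raw string) (float64, error) {
	if !numberRe.MatchString(raw) {
		return 0, fmt.Errorf("CodeInvalidNumber: %q", raw)
	}
	return strconv.ParseFloat(raw, 64)
}

// lexInt types an INT_VALUE or reports CodeInvalidInteger.
func lexInt(raw string) (int64, error) {
	if !intRe.MatchString(raw) {
		return 0, fmt.Errorf("CodeInvalidInteger: %q", raw)
	}
	return strconv.ParseInt(raw, 10, 64)
}

// lexBool types a BOOL_VALUE or reports CodeInvalidBoolean.
func lexBool(raw string) (bool, error) {
	switch strings.ToLower(raw) {
	case "true":
		return true, nil
	case "false":
		return false, nil
	}
	return false, fmt.Errorf("CodeInvalidBoolean: %q", raw)
}

func main() {
	fmt.Println(lexNumber("-1.5"))
	fmt.Println(lexInt("-1"))
	fmt.Println(lexBool("TRUE"))
}
```

On failure a caller would leave the property untyped (ShapeNone in the document's terms) and record the diagnostic rather than abort.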
Multi-line body terminals
Single tokens spanning multiple source lines. The lexer absorbs the head and the body lines.
| Terminal | Parent keyword | Body shape |
|---|---|---|
| RAW_BLOCK_CONSUMES | consumes | Flat token list (see sub-languages §flex-list) |
| RAW_BLOCK_PRODUCES | produces | Flat token list |
| RAW_BLOCK_SCHEMES | schemes | Flat token list |
| RAW_BLOCK_SECURITY | security | Security requirements (see sub-languages §security-requirements) |
| RAW_BLOCK_SECURITY_DEFINITIONS | securityDefinitions | YAML map |
| RAW_BLOCK_RESPONSES | responses | Response sub-language (see sub-languages §responses) |
| RAW_BLOCK_PARAMETERS | parameters | Parameter chunk sub-language (see sub-languages §parameters) |
| RAW_BLOCK_EXTENSIONS | extensions | YAML map of x-* entries |
| RAW_BLOCK_INFO_EXTENSIONS | infoExtensions | YAML map of x-* entries |
| RAW_BLOCK_TOS | tos | Free-form prose paragraph |
| RAW_BLOCK_EXTERNAL_DOCS | externalDocs | YAML map |
| RAW_VALUE_DEFAULT | default | Raw value text |
| RAW_VALUE_EXAMPLE | example | Raw value text |
| RAW_VALUE_ENUM | enum | Comma list, JSON array, or YAML dash list |
Body accumulation
A raw-block / raw-value keyword opens a body. The body terminates at
the next sibling structural token in the same family — either
another TokenAnnotation, another body-keyword head whose context
makes it a sibling, or TokenEOF.
Blank lines do NOT terminate the body. They are absorbed as visual separators inside list-shaped bodies.
For raw-block heads, the inline post-colon value (when non-empty)
is prepended to the body as its first line. This means
Consumes: application/json (inline single value) and
Consumes:\n - application/json (multi-line body) both yield the
same body content; consumers don’t need to special-case the inline
form.
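The inline-value rule can be sketched as below. Both helpers are hypothetical: `bodyLines` models the prepend step described above, and `flatItems` stands in for the flex-list reading of the body (dropping dash markers and blank lines), which is an assumption about how consumers normalise it.

```go
package main

import (
	"fmt"
	"strings"
)

// bodyLines prepends a non-empty inline post-colon value to the
// accumulated body lines, so both spellings share one code path.
func bodyLines(inline string, following []string) []string {
	lines := make([]string, 0, len(following)+1)
	if strings.TrimSpace(inline) != "" {
		lines = append(lines, inline)
	}
	return append(lines, following...)
}

// flatItems reads a body as a flat token list: blank lines are absorbed
// and a leading "-" list marker is dropped (assumed flex-list behaviour).
func flatItems(lines []string) []string {
	var items []string
	for _, l := range lines {
		l = strings.TrimSpace(l)
		l = strings.TrimSpace(strings.TrimPrefix(l, "-"))
		if l != "" {
			items = append(items, l)
		}
	}
	return items
}

func main() {
	inline := flatItems(bodyLines("application/json", nil))
	multi := flatItems(bodyLines("", []string{"  - application/json", ""}))
	fmt.Println(inline, multi)
}
```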
YAML fence handling
A line whose trimmed content is exactly --- opens (or closes) a
YAML fence. While the cursor sits between matching fences:
- Annotation and keyword recognition is suspended; every line emits as tokenRawLine carrying the verbatim source text.
- The body accumulator captures the fenced region as a single OPAQUE_YAML token attached to the surrounding annotation (typically swagger:operation or a fenced extensions body).
- A missing closing fence emits a CodeUnterminatedFence diagnostic; the OPAQUE_YAML token is marked truncated and the builder degrades gracefully.
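A minimal sketch of the fence scan, assuming a single fenced region per input. The `fenced` type and `scanFence` function are hypothetical names; the Truncated flag models the "marked truncated" state a caller would pair with a CodeUnterminatedFence diagnostic.

```go
package main

import (
	"fmt"
	"strings"
)

// fenced is a stand-in for an OPAQUE_YAML token.
type fenced struct {
	Body      []string // verbatim lines between the fences
	Truncated bool     // true when the closing fence is missing
}

// scanFence captures the first fenced region. ok is false when no
// opening fence is present at all.
func scanFence(lines []string) (f fenced, ok bool) {
	inside := false
	for _, l := range lines {
		isFence := strings.TrimSpace(l) == "---"
		switch {
		case isFence && !inside:
			inside, ok = true, true
		case isFence && inside:
			return f, true // matching close found
		case inside:
			f.Body = append(f.Body, l) // no keyword recognition in here
		}
	}
	f.Truncated = inside
	return f, ok
}

func main() {
	closed, _ := scanFence([]string{"---", "responses:", "  200: ok", "---", "after"})
	open, _ := scanFence([]string{"---", "responses:"})
	fmt.Println(len(closed.Body), closed.Truncated)
	fmt.Println(len(open.Body), open.Truncated)
}
```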
Prose classification
Surviving tokenText tokens (not consumed by a body, not an
annotation or keyword head) re-type as either TokenTitle or
TokenDesc per three heuristics evaluated in order:
- Blank-line split: a blank line inside the prose run ends the title and starts the description.
- Closing punctuation: if the first prose line ends with Unicode punctuation, the title is just that one line.
- Markdown ATX heading: if the first prose line matches markdown’s # Heading shape, the # markers are stripped and the line becomes the title.
When no heuristic fires, the entire prose run is title.
See sub-languages.md §prose-classification for the author-facing description.
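The three heuristics and the fallback can be sketched as one function. `splitProse` is a hypothetical name, the input is assumed to be a prose run that starts with a non-blank line, and "Unicode punctuation" is read here as `unicode.IsPunct`, which may be broader or narrower than the lexer's actual test.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

func joinLines(lines []string) string {
	return strings.TrimSpace(strings.Join(lines, "\n"))
}

// splitProse applies the heuristics in the documented order and falls
// back to "everything is title" when none fires.
func splitProse(lines []string) (title, desc string) {
	if len(lines) == 0 {
		return "", ""
	}
	// 1. Blank-line split.
	for i, l := range lines {
		if strings.TrimSpace(l) == "" {
			return joinLines(lines[:i]), joinLines(lines[i+1:])
		}
	}
	first := strings.TrimSpace(lines[0])
	// 2. Closing punctuation on the first line.
	if r, _ := utf8.DecodeLastRuneInString(first); unicode.IsPunct(r) {
		return first, joinLines(lines[1:])
	}
	// 3. Markdown ATX heading: one or more '#' followed by a space.
	if h := strings.TrimLeft(first, "#"); h != first && strings.HasPrefix(h, " ") {
		return strings.TrimSpace(h), joinLines(lines[1:])
	}
	return joinLines(lines), ""
}

func main() {
	fmt.Println(splitProse([]string{"User is a user.", "More detail here"}))
	fmt.Println(splitProse([]string{"# Pets API", "details"}))
}
```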
Parser
The parser consumes the lexer’s terminal stream and produces typed
Block values, one per *ast.CommentGroup. A single comment group
may produce MORE than one Block when multiple annotations appear
(each annotation closes the preceding Block and opens a fresh one).
Top-level dispatch
The dispatcher reads the first ANN_* terminal; its identity
selects the family. If no annotation appears, the input is an
UnboundBlock — typically a Go struct field with description-only
documentation.
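The block-splitting rule and the unbound fallback can be sketched together. The `token` and `block` types are hypothetical stand-ins for the grammar's own, and keeping tokens that precede the first annotation in the same block is an assumption this document does not settle.

```go
package main

import "fmt"

// token is a simplified terminal: Kind is "annotation" or anything else.
type token struct{ Kind, Text string }

// block groups one annotation with the tokens that follow it.
// An empty Annotation models an UnboundBlock.
type block struct {
	Annotation string
	Body       []token
}

// splitBlocks opens a fresh block at every annotation after the first.
func splitBlocks(toks []token) []block {
	var blocks []block
	var cur *block
	for _, t := range toks {
		if t.Kind == "annotation" {
			if cur != nil && cur.Annotation != "" {
				blocks = append(blocks, *cur) // close the preceding block
				cur = nil
			}
			if cur == nil {
				cur = &block{}
			}
			cur.Annotation = t.Text
			continue
		}
		if cur == nil {
			cur = &block{}
		}
		cur.Body = append(cur.Body, t)
	}
	if cur != nil {
		blocks = append(blocks, *cur)
	}
	return blocks
}

func main() {
	two := splitBlocks([]token{
		{"annotation", "swagger:model Foo"}, {"text", "A foo."},
		{"annotation", "swagger:response bar"}, {"text", "A bar."},
	})
	unbound := splitBlocks([]token{{"text", "Just a field description"}})
	fmt.Println(len(two), two[1].Annotation)
	fmt.Println(len(unbound), unbound[0].Annotation == "")
}
```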
Block.AnnotationKind() returns the family discriminator.
Block.AnnotationArg() returns the leading IDENT argument (if any)
without requiring the caller to type-assert on the typed Block
kind.
Schema family
Bodies of swagger:model, swagger:parameters, swagger:response,
swagger:name.
Operation family
swagger:route and swagger:operation are distinct block productions
because their bodies differ structurally — swagger:route accepts
the structured keyword surface; swagger:operation accepts an
OPAQUE_YAML body.
The <GoIdent> swagger:route ... godoc-prefix exception (which
allows a leading Go identifier on the route annotation line) is
absorbed by the lexer; the EBNF sees a plain ANN_ROUTE.
Meta family
swagger:meta defines top-of-spec metadata.
Classifier family
Single-purpose annotations that classify the surrounding declaration without carrying their own body.
Classifiers are stateless markers — they carry no validation body of their own. The surrounding declaration’s other annotations (or the absence thereof) determine where the classification lands.
Cross-cutting productions
These appear in multiple families and share a single production.
Vendor extensions (ExtensionsBlock, InfoExtensionsBlock) accept
YAML map bodies; non-x-* keys emit CodeInvalidAnnotation and
drop. The lexer additionally surfaces them via
Block.Extensions() with an Extension.Source discriminator
(KwExtensions vs KwInfoExtensions) so consumers can route to
the correct spec field (spec.extensions vs info.extensions).
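A small sketch of the x-* filter on an already-decoded YAML map. `filterExtensions` is a hypothetical helper; matching the prefix as lowercase `x-` only is an assumption, and the dropped keys are returned so a caller could raise one CodeInvalidAnnotation per key.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// filterExtensions keeps x-* entries and reports every other key as
// dropped. Dropped keys are sorted so the result is deterministic.
func filterExtensions(in map[string]interface{}) (kept map[string]interface{}, dropped []string) {
	kept = make(map[string]interface{})
	for k, v := range in {
		if strings.HasPrefix(k, "x-") {
			kept[k] = v
		} else {
			dropped = append(dropped, k)
		}
	}
	sort.Strings(dropped)
	return kept, dropped
}

func main() {
	kept, dropped := filterExtensions(map[string]interface{}{
		"x-internal": true,
		"owner":      "platform",
	})
	fmt.Println(len(kept), dropped)
}
```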
Walker
Block.Walk(grammar.Walker{...}) dispatches Properties through
typed callbacks. The Walker maps a Property to a callback by
Keyword.Shape:
| Shape | Callback | Payload |
|---|---|---|
| ShapeNumber | Number | (p, float64, exclusive bool) |
| ShapeInt | Integer | (p, int64) |
| ShapeBool | Bool | (p, bool) |
| ShapeString | String | (p, string); value on p.Value |
| ShapeEnumOption | String | (p, string); closed-vocab token on p.Typed.String |
| ShapeRawBlock | Raw | (p); caller reads p.Body / p.Raw |
| ShapeRawValue | Raw | (p) |
| ShapeCommaList | Raw | (p); caller splits via Property.AsList |
| ShapeNone (failed typing) | Raw | (p); diagnostic fired separately |
Additional callbacks fire outside the per-Property dispatch:
- Title(s string): once, before any property, if non-empty.
- Description(s string): once, before any property, if non-empty.
- Extension(ext grammar.Extension): once per typed extension.
- Diagnostic(d grammar.Diagnostic): block-level diagnostics fire before Title; per-property diagnostics fire immediately before the property’s main callback.
Walker.FilterDepth gates property callbacks by Property.ItemsDepth.
Pass 0 for level-0 properties (default); pass N for items-level
N; pass AllDepths (-1) for every depth.
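The shape-to-callback mapping and the depth gate can be sketched with local stand-in types. Everything below is hypothetical and self-contained: the `Property` and `Walker` structs only mirror the names in the table, they are not the grammar package's definitions, and only a subset of shapes is modelled.

```go
package main

import "fmt"

type Shape int

const (
	ShapeNone Shape = iota
	ShapeNumber
	ShapeInt
	ShapeBool
	ShapeString
	ShapeRawValue
)

const AllDepths = -1

// Property is a simplified stand-in carrying pre-typed payloads.
type Property struct {
	Name       string
	Shape      Shape
	Value      string
	Num        float64
	Int        int64
	Flag       bool
	Exclusive  bool
	ItemsDepth int
}

// Walker holds optional callbacks; nil callbacks are skipped.
type Walker struct {
	FilterDepth int
	Number      func(p Property, v float64, exclusive bool)
	Integer     func(p Property, v int64)
	Bool        func(p Property, v bool)
	String      func(p Property, v string)
	Raw         func(p Property)
}

// walk gates by depth, then dispatches on shape; unlisted shapes
// (including ShapeNone) fall through to Raw.
func walk(props []Property, w Walker) {
	for _, p := range props {
		if w.FilterDepth != AllDepths && p.ItemsDepth != w.FilterDepth {
			continue
		}
		switch p.Shape {
		case ShapeNumber:
			if w.Number != nil {
				w.Number(p, p.Num, p.Exclusive)
			}
		case ShapeInt:
			if w.Integer != nil {
				w.Integer(p, p.Int)
			}
		case ShapeBool:
			if w.Bool != nil {
				w.Bool(p, p.Flag)
			}
		case ShapeString:
			if w.String != nil {
				w.String(p, p.Value)
			}
		default:
			if w.Raw != nil {
				w.Raw(p)
			}
		}
	}
}

func main() {
	props := []Property{
		{Name: "maximum", Shape: ShapeNumber, Num: 10},
		{Name: "minLength", Shape: ShapeInt, Int: 3, ItemsDepth: 1},
	}
	walk(props, Walker{
		Number:  func(p Property, v float64, _ bool) { fmt.Println(p.Name, v) },
		Integer: func(p Property, v int64) { fmt.Println(p.Name, v) },
	})
}
```

With the zero-value FilterDepth only the level-0 property is delivered; setting FilterDepth to AllDepths would deliver both.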
For the full Walker contract, see the grammar package README.
Diagnostics
The grammar emits typed diagnostics for malformed input, recovered where possible:
| Code | Severity | Trigger |
|---|---|---|
| CodeInvalidAnnotation | Warning | Unknown tag, malformed annotation arg, dropped malformed property |
| CodeInvalidNumber | Warning | Number-typed value failed lexical parse |
| CodeInvalidInteger | Warning | Integer-typed value failed lexical parse |
| CodeInvalidBoolean | Warning | Boolean-typed value failed lexical parse |
| CodeShapeMismatch | Warning | Keyword applied to a schema type that doesn’t accept it (e.g. minLength on a number) |
| CodeContextInvalid | Warning | Keyword used outside its legal annotation context |
| CodeUnsupportedInSimpleSchema | Warning | Full-schema-only keyword used in SimpleSchema (non-body param, header) |
| CodeInvalidYAMLExtensions | Warning | YAML parse failed inside an extensions body |
| CodeUnterminatedFence | Warning | YAML fence opened but not closed before EOF |
All diagnostics drop the offending property / annotation /
extension and continue the build. The accumulator on
common.Builder collects them in source order; the consumer’s
OnDiagnostic callback (if wired) fires inline.
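The drop-and-continue behaviour can be sketched with a small collector. The `collector` type and `keepNumeric` helper are hypothetical; they only illustrate source-order accumulation with an optional inline callback, not the common.Builder API.

```go
package main

import (
	"fmt"
	"strconv"
)

type Diagnostic struct{ Code, Message string }

// collector accumulates diagnostics in source order and, when wired,
// forwards each one inline.
type collector struct {
	diags        []Diagnostic
	OnDiagnostic func(Diagnostic)
}

func (c *collector) report(d Diagnostic) {
	c.diags = append(c.diags, d)
	if c.OnDiagnostic != nil {
		c.OnDiagnostic(d)
	}
}

// keepNumeric drops properties whose value fails to parse as a number
// and keeps building with the rest.
func keepNumeric(c *collector, props [][2]string) map[string]float64 {
	out := make(map[string]float64)
	for _, p := range props {
		v, err := strconv.ParseFloat(p[1], 64)
		if err != nil {
			c.report(Diagnostic{Code: "CodeInvalidNumber", Message: p[0] + ": " + p[1]})
			continue // offending property is dropped
		}
		out[p[0]] = v
	}
	return out
}

func main() {
	c := &collector{OnDiagnostic: func(d Diagnostic) { fmt.Println("warn:", d.Code, d.Message) }}
	got := keepNumeric(c, [][2]string{{"maximum", "10"}, {"minimum", "ten"}})
	fmt.Println(len(got), len(c.diags))
}
```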
What this grammar does not describe
The grammar’s job ends at producing typed Property and Block
values. The analyzer (builders / spec orchestrator) owns:
- Type coercion: the property default: 1.5 against an integer schema is a lexical success and an analyzer rejection. validations.CoerceValue and validations.ParseDefault apply the schema-type-aware coercion at write time.
- Cross-reference resolution: $ref targets, alias-chain resolution, post-decl discovery. The grammar emits the names; the analyzer resolves them.
- Schema-shape gating: validations.IsLegalForType decides whether minLength applies to the resolved schema type. The grammar always emits the property; the handler dispatch decides whether to write it.
- Ordering & merging across multiple comment groups: when several swagger:parameters Foo Bar Baz declarations contribute to the same operation, the spec builder merges them.
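To make the grammar/analyzer split concrete, here is a sketch of schema-type-aware coercion applied after lexing. `coerceDefault` is a hypothetical function built on strconv; it is not the signature or behaviour of validations.CoerceValue or validations.ParseDefault, which this document does not spell out.

```go
package main

import (
	"fmt"
	"strconv"
)

// coerceDefault converts raw default text to the Go value expected by
// the resolved schema type, rejecting text that lexes but does not fit.
func coerceDefault(schemaType, raw string) (interface{}, error) {
	switch schemaType {
	case "integer":
		v, err := strconv.ParseInt(raw, 10, 64)
		if err != nil {
			return nil, fmt.Errorf("default %q is not an integer: %w", raw, err)
		}
		return v, nil
	case "number":
		v, err := strconv.ParseFloat(raw, 64)
		if err != nil {
			return nil, fmt.Errorf("default %q is not a number: %w", raw, err)
		}
		return v, nil
	case "boolean":
		v, err := strconv.ParseBool(raw)
		if err != nil {
			return nil, fmt.Errorf("default %q is not a boolean: %w", raw, err)
		}
		return v, nil
	case "string":
		return raw, nil
	}
	return nil, fmt.Errorf("unsupported schema type %q", schemaType)
}

func main() {
	// "1.5" is a fine number literal, so lexing succeeds either way;
	// only the integer target rejects it.
	_, err := coerceDefault("integer", "1.5")
	fmt.Println(err != nil)
	v, _ := coerceDefault("number", "1.5")
	fmt.Println(v)
}
```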
The grammar is also deliberately single-pass — it never
revisits a *ast.CommentGroup after Parse(cg) returns. The
common.Builder blockCache memoises results across the analyzer’s
recursive type descent (see
common README §blockcache).