DOC AWK 

A quick-and-dirty literate-programming-style documentation generator inspired by docco.

by Case Duckworth acdw@acdw.net

Source available under the Good Choices License.

There's a lot of quick-and-dirty "literate programming tools" out there, many of which were inspired by, and also borrowed from, docco. I was particularly interested in shocco, written in POSIX shell (of which I am a fan).

Notably missing, however, was a converter of some kind written in AWK. Thus, DOC AWK was born.

This page is the result of DOC AWK working on itself. Not bad for < 250 lines including commentary! You can pick up the raw source code of doc.awk in its git repository to use it yourself.

Code 

BEGIN {

All the best awk scripts start with a BEGIN block. In this one, we set a few variables from the environment, with defaults. I use the convenience function getenv, further down this script, to make it easier.

First, the comment regex. This regex detects a comment line, not an inline comment. By default, it's set up for awk, shell, and other languages that use # as a comment delimiter, but you can make it whatever you want.

	COMMENT = getenv("DOCAWK_COMMENT", COMMENT, "^[ \t]*#+[ \t]*")

You can set DOCAWK_TEXTPROC to any text processor you want, but the default is the vendored mdown.awk script in this repo. It's from d.awk.

	TEXTPROC = getenv("DOCAWK_TEXTPROC", TEXTPROC, "./mdown.awk")

You can also set the processor for code sections of the source file; the included htmlsafe.awk simply escapes <, &, and >.

	CODEPROC = getenv("DOCAWK_CODEPROC", CODEPROC, "./htmlsafe.awk")

Usually, a file header and footer are enough for most documents. The defaults here are the included header.html and footer.html, since the default output type is html.

Each of these documents are actually templates, with keys that can expand to variables inside of @@VARIABLE@@. This is mostly for title expansion.

	HEADER = getenv("DOCAWK_HEADER", HEADER, "./header.html")
	FOOTER = getenv("DOCAWK_FOOTER", FOOTER, "./footer.html")
}

Because FILENAME is unset during BEGIN, template expansion that attempts to view the filename doesn't work. Thus, I need a state variable to track whether we've started or not (so that I don't print a header with every new file).

! begun {

The template array is initialized with the document's title.

	TV["TITLE"] = get_title()

Print the header here, since if multiple files are passed to DOC AWK they'll all be concatenated anyway.

	file_print(HEADER)
}

doc.awk is multi-file aware. It also removes the shebang line from the script if it exists, because you probably don't want that in the output.

It wouldn't be a bad idea to make a heuristic for determining the type of source file we're converting here.

FNR == 1 {
	begun = 1
	if ($0 ~ COMMENT) {
		lt = "text"
	} else {
		lt = "code"
	}
	if ($0 !~ /^#!/) {
		bufadd(lt)
	}
	next
}

The main logic is quite simple: if a given line is a comment as defined by DOCAWK_COMMENT, it's in a text block and should be treated as such; otherwise, it's in a code block. Accumulate each part in a dedicated buffer, and on a switch-over between code and text, print the buffer and reset.

$0 !~ COMMENT {
	lt = "code"
	bufprint("text")
}

$0 ~ COMMENT {
	lt = "text"
	bufprint("code")
	sub(COMMENT, "", $0)
}

{
	bufadd(lt)
}

Of course, at the end there might be something in either buffer, so print that out too. I've decided to put text last for the possibility of ending commentary.

END {
	bufprint("code")
	bufprint("text")
	file_print(FOOTER)
}

Functions 

bufadd: Add a STR to buffer TYPE. STR defaults to $0, the input record.

function bufadd(type, str)
{
	buf[type] = buf[type] (str ? str : $0) "\n"
}

bufprint: Print a buffer of TYPE. Automatically wrap the code blocks in a little HTML code block. I could maybe have a DOCAWK_CODE_PRE/POST and maybe even one for text too, to make it more extensible (to other markup languages, for example).

function bufprint(type)
{
	buf[type] = trim(buf[type])
	if (buf[type]) {
		if (type == "code") {
			printf "<pre><code>"
			printf(buf[type]) | CODEPROC
			close(CODEPROC)
			print "</code></pre>"
		} else if (type == "text") {
			print(buf[type]) | TEXTPROC
			close(TEXTPROC)
		}
		buf[type] = ""
	}
}

file_print: Print FILE line-by-line. The > 0 check here ensures that it bails on error (-1).

function file_print(file)
{
	if (file) {
		while ((getline l < file) > 0) {
			print template_expand(l)
		}
		close(file)
	}
}

get_title: get the title of the current script, for the expanded document. If variables are set, use those; otherwise try to figure out the title from the document's basename.

function get_title()
{
	title = getenv("DOCAWK_TITLE", TITLE)
	if (! title) {
		title = FILENAME
		sub(/.*\//, "", title)
	}
	return title
}

getenv: a convenience function for pulling values out of the environment. If an environment variable ENV isn't found, test if VAR is set (i.e., doc.awk -v var=foo.) and return it if it's set. Otherwise, return the default value DEF.

function getenv(env, var, def)
{
	if (ENVIRON[env]) {
		return ENVIRON[env]
	} else if (var) {
		return var
	} else {
		return def
	}
}

template_expand: expand templates of the form @@template@@ in the text. Currently it only does variables, and works by line.

Due to the way awk works, template variables need to live in their own special array, TV. I'd love it if awk had some kind of eval functionality, but at least POSIX awk doesn't.

function template_expand(text)
{
	if (match(text, /@@[^@]*@@/)) {
		var = substr(text, RSTART + 2, RLENGTH - 4)
		new = substr(text, 1, RSTART - 1)
		new = new TV[var]
		new = new substr(text, RSTART + RLENGTH)
	} else {
		new = text
	}
	return new
}

trim: remove whitespace from either end of a string.

function trim(str)
{
	sub(/^[ \n]*/, "", str)
	sub(/[ \n]*$/, "", str)
	return str
}