A pelican plugin to avoid publishing drafts

So I messed up a few days ago and accidentally published a draft that got picked up by various RSS aggregators and I felt pretty dumb about it.

Turns out those aggregators routinely check for dead links and suppress them, so the entry disappeared on its own after a few hours.

But still, knowing myself, I'm sure it'll happen again and I'd rather avoid spamming unfinished crap into people's feeds if I can, so I started thinking about setting up some guard rails.

First thing that came to mind was to use git hooks, or maybe Forgejo's equivalent to GitHub Actions, and basically only deploy the site when articles get pushed to the remote repository.

But thinking about it more, I realized that:

1. It's way too hot to read documentation.
2. Generating the site on a push means I would have to do it on the server.

The second point is not such a big deal, but right now I'm building the site on my local machine and uploading the results via a simple rsync, and I love the simplicity of this setup.

That's the whole beauty of static sites.

So if I ever need something fancier for deployment, I'll get to it, but until then I'd rather keep my current process.

Well. Let's turn the problem around. Rather than building the site when content gets pushed, let's instead check the local repository as we generate the articles.

Easiest way to do that was to write a pelican plugin, so that's what I did. Here's what I ended up with if anyone cares:

import logging
import subprocess

from pelican import signals


def warn_and_skip(msg, skip, generator, article):
    if skip:
        msg += "\nSkipping article generation"
        generator.articles.remove(article)
    logging.warning(msg)


def skip_outta_git(generator):

    # NOTE: this gets called *after* filtering out articles that pelican already
    # skips (like those with a draft status), so no need to explicitly check
    # for the "published" flag.

    # Bail out if the plugin is disabled
    if not generator.settings.get("GIT_CHECK_ENABLED", True):
        return

    skip_ignored = generator.settings.get("GIT_SKIP_IGNORED", True)
    skip_modified = generator.settings.get("GIT_SKIP_MODIFIED", False)
    skip_untracked = generator.settings.get("GIT_SKIP_UNTRACKED", True)

    for article in generator.articles.copy():
        src_path = article.source_path
        # Check for ignored files. This may be overkill, but better to err on
        # the side of caution.
        if skip_ignored:
            proc_check_ignore_return_code = subprocess.call(
                ["git", "check-ignore", "-q", src_path])
            if not proc_check_ignore_return_code:
                warn_and_skip(
                    f"{src_path} is set to published, but is ignored by git.",
                    skip_ignored, generator, article)
                continue
        # Check for untracked files or uncommitted changes.
        proc_check_status = subprocess.run(
            ["git", "status", "--porcelain=1", src_path],
            capture_output=True,
            text=True)
        if (output := proc_check_status.stdout):
            if output.startswith(" M"):
                warn_and_skip(
                    f"{src_path} contains unstaged modifications.",
                    skip_modified, generator, article)
            elif output.startswith("??"):
                warn_and_skip(
                    f"{src_path} is set to published but is not tracked in git.",
                    skip_untracked, generator, article)
            # Log and skip any other output, just to be safe.
            else:
                warn_and_skip(
                    f"git status --porcelain=1 {src_path}: {output}",
                    True, generator, article)


def register():
    """ Pelican plugin registration. """
    signals.article_generator_pretaxonomy.connect(skip_outta_git)

So now the blog will simply skip generating an article if its markdown source file is not tracked by git yet, hopefully ensuring that I won't publish a draft again.
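If the status prefixes look cryptic, here's the same branching pulled out into a tiny standalone function. It's only a sketch for illustration (classify_porcelain is a made-up name, not part of the plugin), and it assumes the two-character codes of git's porcelain v1 format: a space followed by M for unstaged changes, and ?? for untracked files.

```python
def classify_porcelain(output: str) -> str:
    """Map the output of `git status --porcelain=1 <path>` to a coarse state.

    Hypothetical helper: it mirrors the plugin's if/elif/else chain but only
    looks at the two-character status code at the start of the output.
    """
    if not output:
        # No output at all: the file is tracked and has no pending changes.
        return "clean"
    if output.startswith(" M"):
        # Modified in the working tree, not staged yet.
        return "modified"
    if output.startswith("??"):
        # Git has never heard of this file.
        return "untracked"
    # Staged changes, renames, conflicts and so on.
    return "other"
```

Anything that ends up in "other" is what the plugin logs and skips unconditionally, just to be safe.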

I can disable it in my local pelicanconf.py to check stuff on my machine, and as long as I override it in publishconf.py, I should be fine if I decide to deploy minor edits in the meantime.
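Concretely, that could look something like the fragment below. The setting names are the ones the plugin reads; the values are only an example, and it assumes the usual layout where publishconf.py pulls in everything from pelicanconf.py before overriding it.

```python
# pelicanconf.py (local builds): switch the check off while tinkering.
GIT_CHECK_ENABLED = False

# publishconf.py (deployment): switch it back on and pick what gets skipped.
# These lines would sit after the import of the pelicanconf settings.
GIT_CHECK_ENABLED = True
GIT_SKIP_IGNORED = True
GIT_SKIP_MODIFIED = False  # warn only, so minor edits can still ship
GIT_SKIP_UNTRACKED = True
```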

But truly, the real highlight of this afternoon project was that I got to use the walrus (:=) operator, which always makes me happy.

So it works, but as I started writing this note, I realized it had one pretty big flaw: what if I decide to commit a draft?

Now that the file is checked in, the current code will let it pass. I'll get a warning if I make a modification afterwards, but the more likely scenario looks more like:

1. Commit and push some changes, planning to finish the article later.
2. Go do something else.
3. Get back to the site, get distracted and start tweaking something unrelated, and decide to publish that tweak, confident that my system has my back.
4. Get burned. Again.

Sure, I'll be fine as long as I make sure to check the article's status. But then I'm back to square one. Catching that very mistake is the whole point of writing the damned code.

Up until now, I've always edited my posts on the same machine and haven't felt the need to track changes until an article is basically done. The only reason I check them in to begin with is so the repo can serve as a "backup" of what's been published.

So this isn't a problem for me right now, but what if I change my mind?

Guardrails that only work under specific conditions are effectively useless, and now that I've spotted the problem, this false sense of security is making me nervous.

Guess I'll have to think of something else.

An exercise in simplification

This post is getting long already, so I'll spare you the details of what happened next.

I got the idea of checking the live site's RSS feed and using that to tell whether an article is already public as pelican generates them. It worked, but as I started cleaning up the code I realized: why bother?
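For illustration, that feed check could be sketched roughly as below. This is not the code from that experiment, only a minimal stand-in: published_links and the sample feed are made up, and it assumes a plain RSS 2.0 feed where every item carries a link element.

```python
import xml.etree.ElementTree as ET


def published_links(feed_xml: str) -> set[str]:
    """Return the set of <link> values found in the items of an RSS 2.0 feed."""
    root = ET.fromstring(feed_xml)
    links = set()
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            links.add(link.strip())
    return links


# Hypothetical feed content; a real run would download the live feed first.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>example</title>
    <item><title>First post</title><link>https://example.org/blog/first/</link></item>
    <item><title>Second post</title><link>https://example.org/blog/second/</link></item>
  </channel>
</rss>"""

already_live = published_links(SAMPLE_FEED)
```

An article whose URL is missing from already_live would then be treated as not yet published.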

I could just keep a list of all published URLs in a local file. As the site gets generated, I can check for any article not already listed and ask for confirmation, adding it to the list if it gets approved.

It's dumb and simple, reliable, and doesn't depend on specific habits that may or may not fit the situation. No magic, no trying to be clever. Just a quick explicit prompt to avoid getting caught with my pants down. The published cache can be checked in version control with everything else, and correcting any mistake is a simple edit.
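Here's a minimal sketch of that idea, stripped of anything Pelican-specific. The names (load_published, confirm, filter_unpublished) and the one-URL-per-line file format are assumptions for the sake of the example, not necessarily what the actual plugin does.

```python
from pathlib import Path
from typing import Callable, Iterable


def load_published(cache: Path) -> set[str]:
    """Read one URL per line; a missing file means nothing is published yet."""
    if not cache.exists():
        return set()
    return {line.strip() for line in cache.read_text().splitlines() if line.strip()}


def confirm(url: str) -> bool:
    """Ask on the terminal; anything not starting with "y" counts as a no."""
    answer = input(f'Article "{url}" has not yet been published.\nShip it [y/n] (n) ? ')
    return answer.strip().lower().startswith("y")


def filter_unpublished(
    urls: Iterable[str], cache: Path, ask: Callable[[str], bool] = confirm
) -> list[str]:
    """Keep URLs that are already listed or freshly approved, then save the list."""
    published = load_published(cache)
    kept = []
    for url in urls:
        # Known URLs pass silently; new ones need an explicit approval.
        if url in published or ask(url):
            kept.append(url)
            published.add(url)
    # Rewrite the cache sorted, so it diffs nicely in version control.
    cache.write_text("".join(f"{u}\n" for u in sorted(published)))
    return kept
```

Passing the prompt in as a parameter is only there to make the sketch easy to try without a terminal; rejected URLs are simply left out of the returned list and out of the cache file.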

So that's what I went with. Generating the initial list is a bit annoying, as you have to manually approve every single post that is already live,¹ but once that's done pelican will systematically ask me to confirm whether or not I want to include any new article whenever I build the site. Here's what it looks like in practice:

(rgaz) raphi@raphi-MS-7821:~/dev/web/rgaz$ make html
"pelican" "/home/raphi/dev/web/rgaz/content" -o "/home/raphi/dev/web/rgaz/output" -s "/home/raphi/dev/web/rgaz/pelicanconf.py"
Article "test_ignored" has not yet been published.
Ship it [y/n] (n) ? nope
Article "test_untracked" has not yet been published.
Ship it [y/n] (n) ? go away
Article "A pelican plugin to avoid publishing drafts" has not yet been published.
Ship it [y/n] (n) ? yup
[09:45:16] WARNING  Skipping: blog/test_ignored/                                                                   plugin.py:98
           WARNING  Skipping: blog/test_untracked/                                                                 plugin.py:98
Done: Processed 37 articles, 22 drafts, 0 hidden articles, 2 pages, 0 hidden pages and 0 draft pages in 0.85 seconds.

It's a bit annoying to use locally, as the devserver loop messes up the input and basically breaks everything, so I've disabled it in my local pelicanconf.py and made sure to activate it in publishconf.py, so that I get the prompts when it actually matters.

Also, the code is a lot simpler than what I had before.² Sadly the walrus operator is gone, though =(

Maybe I'll polish it a bit and upload it to PyPI. I tend to avoid doing so, as most of the stuff I've been hacking together lately is pretty quick and dirty and is usually tailored to my specific needs, and I'd rather avoid polluting the official index with code that may very well be useless to anyone but me.³

But this thing seems generic enough that it just might be useful to someone else, so I dunno. I'll test drive it for a little and see how it goes.

In the meantime, the repository is public, so you can pip install git+https://git.rgaz.fr/raphi/pelican_check_published if you really want to.

See ya, Internets. Can't wait to see what my next stupid goof will look like.

¹ Adding a script to automatically generate it should be easy enough, though.
² I do have to do some shenanigans to ensure the code only runs when I want it to, but nothing too bad.
³ The proliferation of useless libraries is a real problem. Many of them are written by eager beginners who don't realize that the problems they're trying to solve just aren't that interesting, or much of a real problem to begin with.

Of course I've been that beginner. Hopefully I've learned some restraint since then.