Sunday, February 5, 2017

Migrating Old Website - Part 3

In this third part of the migration of my old website, I add even more functionality, as well as clean up some of the code.

One of the largest issues I had with the code overall was that it was pretty noisy. I decided that I would split out the Golang standard logging with the go-logging library. This is a pretty straight forward logger, at work we use a custom logger that integrates into our systems and provided line numbers and package/file names, but I didn’t feel that the extent of that system was neccessary in this, and go-logging is pretty powerful out of the box. The absolute key being, I could change the log verbosity overall of the output, since while doing the import, every action was pushed to the stdout.

You could say, why not just remove the logging from the code? I always prefer to log as much as possible about everything that is going on, and have the ability to tweak the verbosity it actually outputs, versus remove logging because it is “noisy”, when troubleshooting something, you would be drastically surprised how that “noisy” stuff, will help you find that really tricky bug in production.

logger := logging.MustGetLogger("importer")
var loggerFormat = logging.MustStringFormatter(
    `%{color}%{time:15:04:05.000} ▶ %{level:.4s} %{color:reset} %{message}`,
)
logging.SetFormatter(loggerFormat)

stdOutBackend := logging.NewLogBackend(os.Stdout, "", 0)

// Only errors and more severe messages should be sent to backend1
levelLogger := logging.AddModuleLevel(stdOutBackend)
levelLogger.SetLevel(logging.WARNING, "")

// Set the backends to be used.
logger.SetBackend(levelLogger)

Overall, configuring the logger is pretty simple, the above code is stting up some formatting, setting an output (io.Writer), and then setting the default level to warning, when something is bad, for this small project, I set that to Debug and re-run the application, but that is normally a good candidate for command line arguments, or some mechanism to change at run time.

I don’t like globals either, here I setup a logging instance, so that it can be passed into the the methods and/or structs that need it. I remember in the VB6 days and the “global module” and the pains that this always caused, experience teaches us that the global pattern not only causes static bounderies, but also removes isolation of settings, what if I wanted a different logger for this one method, if it is using a global, there is no way to change that, that’s actually called Inversion of Control btw.

The next significant change to the codebase was extracting all of the static methods out to be attached to a new BlogImporter struct. This separates all of the concerns for importing the blog out of the main, however there is only still one file, which is some of my next steps, the code is very clearly more modularized for putting it in a package.

The BlogImporter exposes two public methods (Parse() and PostCount()) and a constructor method (NewBlogImporter(logger, config)). By calling the Parse method, it will block the calling stack until completed. Idealistically I would like to move this to a method that returned int, error and get rid of the PostCount() all together, however, a lot of other refinements where made instead in this spike.

When concurrency is bad, was a question I got to answer during this spike as well, the first approach was to run go routines for each blog, and then run a go routine for each post. However, since I am actually using a third party API to convert my HTML to Markdown, I quickly realized that I was getting throttled with too many requests per second. Instead, I opted to only run a go routine per post, but only do one feed at a time, this turned out to get me just under the throttling, as well as drastically speed up my import time. Roughly, 1.5 minutes to about 6 seconds.

func (this *BlogImporter) parseFeed(feed string) {
    now := time.Now()

    parsed, err := this.parser.ParseURL(feed)

    if err != nil {
        this.logger.Fatalf("Failed to parse feed: %v", err)
    }

    this.logger.Infof(parsed.Title)

    feedUrl, _ := url.Parse(feed)
    feedBasePath, _ := path.Split(feedUrl.Path)
    completion := make(chan *gofeed.Item)
    defer close(completion)

    for _, item := range parsed.Items {
        go func(thisItem *gofeed.Item) {
            now := time.Now()
            if err := this.parseFeedItem(feedBasePath, thisItem); err != nil {
                this.logger.Error(err.Error())
            } else {
                atomic.AddUint32(&this.postCount, 1)
            }
            this.logger.Debugf("Parse Feed Item %s took: %v", thisItem.Title, time.Since(now))
            completion <- thisItem
        }(item)
    }

    for i := 0; i < len(parsed.Items); i++ {
        <-completion
    }

    this.logger.Debugf("Parse Feed %s took: %v", feed, time.Since(now))
}

I use a pretty rudementary way of running multiple go routines and then waiting for completion here, I know how many items I need to gather, so I iterate and block waiting on the completion channel. This has an effect of having a long wait on the first one, and usually the rest happen almost immediately.

You will also see in here how I increment my post counts using the sync/atomic package to increment the counter. This is a pretty common thing in concurrent programming. And there are some minor timer loggings, I would prefer to switch this to something more contextual where I can just do a defer at the beginning of the statement, but this was the quick way to implement it.

Another huge update to this spike, was the addition of downloading images, this was “mostly” straight forward and involved the following steps:

  • Scan the markdown for image tags ![][#]
  • Lookup the image URL from the markdown [#] <url>
  • Attempt to download the image and save it to disk, however, use a mime parser to find the actual file extension (e.g. .aspx might serve image/png, need to make sure to save it as .png instead of .aspx.
  • Update the URL in the markdown with the new relative address to the image
  • Delete any invalid images from the post.

It ended up looking like this:

// detect and download images
postImages := getImageList(bodyString)
if len(postImages) > 0 {
    for _, image := range postImages {
        if strings.Contains(image.url, "aggbug.aspx?PostID=") {
            bodyString = removeImage(image, bodyString)
            continue // skipping these things
        }
        os.MkdirAll(filePrefix, 0644)
        this.logger.Debugf("Found Image [%s]: %s @ url: %s", image.id, image.title, image.url)
        outputPath, err := downloadImage(image, filePrefix)
        if err != nil {
            this.logger.Errorf("Post: '%s': '%s' failed to download image: %v", item.Title, item.Link, err)
            continue
        }
        if len(outputPath) == 0 {
            // image doesn't exist, prune it from existance
            bodyString = removeImage(image, bodyString)
        } else {
            bodyString = updateImage(image, dir, outputPath, bodyString)
            // image exists, update the markdown to point to the relative dir
            this.logger.Infof("Image: %s", outputPath)
        }
        }
    }

For the next spike, what I hope to do is spend time pulling the importer logic outside of the main package, rework the code to use existing html files, instead of having to constantly download them, rework the code to separate out the input markdown from the output markdown (again, so it doesn’t have to constantly hit that poor guys html-> markdown converter), render a report per blog (missing images, low value posts, etc..).

The fun parts of this entire project is that while I am building the software to convert my site to something else, I am also building the hardware and infrastructure to host it.

To see the results of this tool, or to check out the latest version of the above code, I will be playing with this quite a bit in the github.com/renevo/renevo.com repository, and I will make sure to branch by post… is that a new thing… as I update, so the post stays up to date :)

Stay tuned sometime soon™ for an update!

No comments: