In this second part of the migration of my old website, I take some time to add a number of needed features to my importer application.
One of the first issues I noticed is that the paths are going to be troublesome to map if I don’t maintain the same file names. While the slug library that I found did a great job, I need to keep the names consistent with the previous posts. So I modified the code to use the path from the post itself: the net/url package parses the link from the RSS feed, and then I parse out the directory and file name.
// parse the link from the RSS item, then split the URL path into directory and file name
postUrl, _ := url.Parse(item.Link)
postPath, postFileName := path.Split(postUrl.Path)
From there, for now, I just dump it into the YAML front matter.
yamlFrontMatter += "original: " + item.Link + "\n"
yamlFrontMatter += "file: " + postFileName + "\n"
yamlFrontMatter += "path: " + postPath + "\n"
The output now looks like this:
---
title: IOC vs. DI vs. Composition
published: true
date: 2010-12-05 07:57:51 +0000 UTC
tags: imported
original: http://renevo.com/blogs/developer/archive/2010/12/05/ioc-vs-di-vs-composition.aspx
file: ioc-vs-di-vs-composition.aspx
path: /blogs/developer/archive/2010/12/05/
author: tom anderson
---
However, I don’t really want to keep that “archive” portion of the path, since it is kind of silly, and I don’t want the .aspx file extension either; I can use rewrite rules in my web server to handle those old URLs.
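If I end up stripping those in the importer instead of leaning entirely on rewrite rules, a minimal sketch using the variables from above could look like this (the exact replacements are my guess at it, not something the importer does yet):
// drop the "/archive" segment from the directory and the .aspx extension from the file name (sketch only)
cleanPath := strings.Replace(postPath, "/archive", "", 1)
cleanFileName := strings.TrimSuffix(postFileName, ".aspx")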
Now the fun part: because I have multiple blogs, I want to set up a list of feeds to snag the content from. This really just involves wrapping the entire thing in a for loop. The beginning of the application now looks like this:
output := path.Clean("./imported")
parser := gofeed.NewParser()

feeds := []string{
	"http://renevo.com/blogs/developer/atom.aspx",
}

for _, feed := range feeds {
	parsed, err := parser.ParseURL(feed)
As you can see, I have shifted a few things around, because I don’t think I need to create some of those objects on each iteration. The underscore in the for loop is where the index of the element would go; I don’t need it, and using an underscore _ is the way to just ignore it. The next step is to add another feed and see how it works; to keep this simple, I want it to be another blog feed.
I don’t really want to keep the posts of people who aren’t part of my site anymore. Not that I don’t like those guys, but I don’t feel right keeping their content, and from what I can tell, most of those posts don’t actually get any traffic. At some point I will also prune a few of my own posts from this list, but for now I can programmatically get rid of any post where the author isn’t me.
for _, item := range parsed.Items {
	if !strings.EqualFold(item.Author.Name, "tom anderson") {
		log.Printf("Skipping %v's post %s", item.Author.Name, item.Title)
		continue
	}
strings.EqualFold is a really awesome function that case-folds both strings under Unicode and compares them; basically, a case-insensitive comparison. After filtering everything out and adding all three blogs, this gave me 146 items. That seems excessive; if I remember correctly, some of these posts were generated by a plugin that allowed me to “share” a page. I think I want to automatically detect short posts and prune them.
Community Server has a way to track blog views: it inserts a 1x1 blank image at the bottom of every post, and with the markdown converters, I basically get an “end of post” anchor.
![][2]
[1]: http://www.renevo.com/blogs/community_blogs/WindowsLiveWriter/Friendsband_1248D/image_thumb.png "image"
[2]: http://renevo.com/aggbug.aspx?PostID=2324
At some point in a future post, when I do some content downloading from inside the posts, I want to remove those tracker images. Right now, though, what I want to do is find that ![][2] marker and count the number of words before it. There are many ways to count words; I am going to do a simple split and len for this one.
bodyString := string(body)
markerLocation := strings.LastIndex(bodyString, "![][")
if markerLocation > 0 {
	wordCount := len(strings.Split(bodyString[:markerLocation], " "))
	log.Printf("Post has %v words", wordCount)
}
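As an aside, splitting on a single space counts runs of spaces and newlines a little oddly; if the count ever needs to be more accurate, strings.Fields splits on any whitespace. That would be a small variation on the above, not what the importer currently uses:
// count whitespace-separated words instead of space-separated chunks
wordCount := len(strings.Fields(bodyString[:markerLocation]))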
As I said, not very sophisticated. I use LastIndex to capture the marker from the end, rather than matching the first image without a title. From a quick look, about 250 words seems like a good cutoff between “crap” posts and posts with some value. Adding that to my rule checks, I now end up with 55 posts; not horrible. At this point, though, I have my first candidate for a command line argument: I might want to tweak that number, and instead of changing the code, I just want to rerun the tool with a different value (looking at the results, that is definitely the case).
var minWordCount = flag.Int("min-words", 100, "The minimum amount of words to keep")

func main() {
	flag.Parse()
Golang makes this really easy with native support for basic command line arguments via the flag package. The gotcha is that flag.Int returns a pointer to an int, so it has to be dereferenced when using it:
if wordCount < *minWordCount {
	log.Printf("Post seemed unworthy of keeping: %v/%v words", wordCount, *minWordCount)
	continue
}
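Tweaking the threshold is then just a matter of passing a different value on the command line; assuming the importer is run with go run against a main.go (the file name is my guess here), something like:
go run main.go -min-words=250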
With some tweaking, I have found that this takes way too long; downloading one post at a time is just too slow. In the next post we will set up some concurrent downloading with channels to speed this up.
To see the results of this tool, or to check out the latest version of the above code, head over to the github.com/renevo/renevo.com repository. I will be playing with this quite a bit, and I will make sure to branch by post (is that a new thing?) as I update, so the post stays up to date :)
Stay tuned sometime soon™ for an update!