I have an ancient website that I pay a hosting company for monthly, and it isn't cheap, since I never downgraded my plan. It currently runs on a Windows VM with an MS SQL Server backend. Unfortunately, I haven't modified this website in ages, due to the site just being… outdated. I want to migrate it off the expensive host and serve it from either GitHub Pages or a tiny Amazon EC2 instance to get my price down, and to make upgrading the site in place a bit more manageable.
RenEvo.com has gone through a lot of evolutions over the years in both site design and technology. Fortunately, there is content on the current site that is still regularly viewed and downloaded, and rather than blow those pages up, I want to keep hosting them, just statically and with a better theme.
To start off, I want to download the entire rendered site. It runs Community Server circa 2006 on ASP.NET 2.0; like I said, this thing is ancient. I did a lot of customization on it, and published a few tutorials on how to hack the code.
The first thing I want to try is downloading the site using Website Downloader, which advertises that it can download an entire site, assets and all, into a zip file. My first attempt told me that "Big sites like this can take a while, but… it is still downloading".
While waiting for the download, I decided to go ahead and set up the git repository; you can find it at github.com/renevo/renevo.com. This repository will simply be a place to edit the static site, refactor it into something more useful, remove a lot of dead and invalid content, and push it to GitHub Pages if need be.
After waiting about 30 minutes for my site to download, I opted into the “email me when it is done” option.
While it was downloading, I started looking through the actual site for things I want to salvage and keep links to. Thankfully, in addition to self-pruning content, I have had Google Analytics hooked up to the site since launch, so I was immediately able to identify a few key blog posts I want to keep, as well as a few file downloads.
Community Server had a feature to create downloads as redirects; all of those can be tossed in the garbage, as most of those links have either died over the last 10 years or are completely irrelevant now (MS SQL 2005 Express links…). However, there are a lot of hosted game modding tools that are still downloaded regularly, and some game forums link directly to them, so those are things I definitely want to keep available.
The most-hit blog items are the ones that still show up in top searches from my active Stack Overflow days, such as removing the My namespace in VB.NET, loading external configuration files, and creating a WPF application with the official Microsoft Ribbon control. Oddly, those still get regular hits from Google searches, with a decent view time on them, meaning people are actually reading the posts.
We are now one hour into the website download, and I am cautiously pessimistic that this is going to work. My backup plan is to go into the hosting company, FTP everything down, and back up the database locally; however, that would require setting up IIS and an MS SQL server to host it so I can pull the content, which, honestly, will suck.
Well, it's the next day; I gave up waiting on the downloader and figured I would just check it in the morning. The good part is that it finished. The bad part is that when I looked at the summary, it had duplicated a lot of pages based on URL parameters, specifically login.aspx? and createUser.aspx?, to the extreme of telling me I had roughly 31k files to download. Oh, and they also wanted $4.99 for the download. It's not that I don't see the value in that, but with that many errors, it isn't worth the five bucks to sift through all of it.
So, Plan… C? I did a bit of research; remember that my website is basically a content management system, and everything is available via RSS. Let's play with some golang:
package main

import (
	"bytes"
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
	"net/url"
	"os"
	"path"
	"strings"

	"github.com/metal3d/go-slugify"
	"github.com/mmcdole/gofeed"
)

func main() {
	feed := "http://renevo.com/blogs/developer/atom.aspx"
	output := path.Clean("./output")

	// parse the atom feed from the old Community Server blog
	parser := gofeed.NewParser()
	parsed, err := parser.ParseURL(feed)
	if err != nil {
		log.Fatalf("Failed to parse feed: %v", err)
	}

	log.Println(parsed.Title)

	for _, item := range parsed.Items {
		// directory structure: output/<blog title>/<year>/<month>/
		dir := path.Join(output, strings.ToLower(parsed.Title), fmt.Sprintf("%d", item.PublishedParsed.Year()), fmt.Sprintf("%d", item.PublishedParsed.Month()))
		if err := os.MkdirAll(dir, 0755); err != nil {
			log.Printf("ERR: Failed to create directory: %v", err)
			continue
		}

		slug := slugify.Marshal(item.Title, true)
		filePrefix := path.Join(dir, slug)

		// save html version
		log.Printf("Post: '%v' by '%v' posted on '%v'", item.Title, item.Author.Name, item.PublishedParsed)
		if err := ioutil.WriteFile(filePrefix+".html", []byte(item.Content), 0644); err != nil {
			log.Printf("ERR: Failed to write file: %v", err)
			continue
		}

		// create markdown version, starting with yaml front matter
		yamlFrontMatter := "---\n"
		yamlFrontMatter += "title: " + item.Title + "\n"
		yamlFrontMatter += "published: true\n"
		yamlFrontMatter += "date: " + item.PublishedParsed.String() + "\n"
		yamlFrontMatter += "tags: imported " + strings.ToLower(strings.Join(item.Categories, " ")) + "\n"
		yamlFrontMatter += "---\n"

		// post the html content to heckyesmarkdown.com to convert it to markdown
		req, err := http.NewRequest("POST", "http://heckyesmarkdown.com/go/", bytes.NewBuffer([]byte("html="+url.QueryEscape(item.Content))))
		if err != nil {
			log.Printf("Failed to create request: %v", err)
			continue
		}
		req.Header.Set("Content-Type", "application/x-www-form-urlencoded")

		client := &http.Client{}
		resp, err := client.Do(req)
		if err != nil {
			log.Printf("Failed to make request: %v", err)
			continue
		}

		body, _ := ioutil.ReadAll(resp.Body)
		resp.Body.Close()

		// prepend the front matter and write out the markdown file
		if err := ioutil.WriteFile(filePrefix+".md", append([]byte(yamlFrontMatter), body...), 0644); err != nil {
			log.Printf("ERR: Failed to write file: %v", err)
			continue
		}
	}
}
This is some really quick and dirty code, but what it does is use a couple of golang packages I found through some research to load the atom feed from the site, iterate through each item, slugify the post title, and save the html content into a directory structure of blog title/year/month/. For the markdown, I found the heckyesmarkdown.com site, which lets you give it a URL and converts the page to markdown; looking closer, I found an API that is "kinda" documented, so I do a form-url-encoded POST with the html content, take the results, and prepend some yaml front matter (the same thing gh-pages uses to generate posts and StackEdit uses for post data). I tack on an "imported" tag, because I want to be able to see that later.
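For reference, the front matter this produces at the top of each generated .md file looks roughly like the block below. The title, date, and categories here are illustrative values I made up based on the fields the code writes, not actual exported data:

---
title: Loading External Configuration Files
published: true
date: 2009-06-15 00:00:00 +0000 UTC
tags: imported vb.net configuration
---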
This is just a start. I now have a way to generate markdown from my previous website by passing it a feed; however, I still have downloads and pictures to take care of, as well as blog post assets (related images, etc.), which I sketch out below.
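As a rough idea of how I might handle those related images, here is a minimal sketch that scans one of the saved html files for img tags and pulls the assets down next to the post. The regular expression, the assets directory name, and the downloadImages helper are all my own assumptions for illustration, not part of the tool yet, and it only handles absolute image URLs:

package main

import (
	"io"
	"io/ioutil"
	"log"
	"net/http"
	"os"
	"path"
	"regexp"
)

// imgSrc is a deliberately naive pattern; a real pass would parse the html properly.
var imgSrc = regexp.MustCompile(`<img[^>]+src="([^"]+)"`)

// downloadImages finds image URLs in a saved post and copies each one into an
// assets directory beside the post. Only absolute URLs will resolve here.
func downloadImages(htmlFile, assetDir string) error {
	content, err := ioutil.ReadFile(htmlFile)
	if err != nil {
		return err
	}

	if err := os.MkdirAll(assetDir, 0755); err != nil {
		return err
	}

	for _, match := range imgSrc.FindAllStringSubmatch(string(content), -1) {
		src := match[1]

		resp, err := http.Get(src)
		if err != nil {
			log.Printf("ERR: Failed to fetch %v: %v", src, err)
			continue
		}

		out, err := os.Create(path.Join(assetDir, path.Base(src)))
		if err != nil {
			resp.Body.Close()
			log.Printf("ERR: Failed to create file for %v: %v", src, err)
			continue
		}

		if _, err := io.Copy(out, resp.Body); err != nil {
			log.Printf("ERR: Failed to save %v: %v", src, err)
		}

		out.Close()
		resp.Body.Close()
	}

	return nil
}

func main() {
	// hypothetical example path; this would come from walking the ./output tree
	post := "./output/developer/2009/6/loading-external-configuration-files"
	if err := downloadImages(post+".html", path.Join(path.Dir(post), "assets")); err != nil {
		log.Fatalf("Failed to download images: %v", err)
	}
}

A smarter version would also rewrite the img src attributes in the saved html (and the generated markdown) to point at the local copies, but that can wait for the next pass.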
To see the results of this tool, or to check out the latest version of the above code, keep an eye on the github.com/renevo/renevo.com repository; I will be playing with this quite a bit, and I will make sure to branch by post… is that a new thing?… as I update, so this post stays up to date :)
Stay tuned sometime soon™ for an update!