#41 - Overview of My Blogging Systems
As mentioned in previous posts, it was so much fun to develop shell scripts which generate static webpages for the website.
I don't programme in C but use many C blobs (programmes) in my scripts.
After almost one year developing my own tools, I don't regret not learning Jekyll or WordPress. I am sure these scripts will work for a long time, but let me get into the details.
I started by writing specific scripts that would each accomplish one task. Some tasks depend on each other, so I wrote scripts which run them in the right order in each section or directory of the website.
As such, it is enough to run a single script at the website root and all sections of the website will be updated!
The good thing about these scripts is I tried to keep each performing a single function. As such, I can reuse these scripts for creating independent sections in future websites.
However, these scripts depend on specific directory hierarchies and templates. I did my best to keep minor or accidental changes to the templates from breaking the processing, and setting the major variables correctly should make the processing work with different file structures.
I don't expect anyone to actually use these scripts for their own blog, but I am sure there are some good ideas and chunks anyone can find useful for their own set of poor man's webmaster tools.
Just a note: the scripts could be much, much simpler had I chosen not to strictly follow standards and validation tests as a means of learning HTML, CSS and management.
Surprisingly, running all the scripts takes less than 20 seconds, or less than 40 seconds when regenerating all buffers and pages. What may take some time is pushing large data to git (for example, audio), and now the new script which archives some of the repos for distribution, but those do not need to run every time.
System Overview
The Blog System
This was the first system to be created. It finds all blog entries matching /blog/[0-9]+/, but with zsh extended globbing, which really makes life easier.
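As a rough illustration, here is a minimal sketch of that kind of match with zsh extended globbing; the directory layout and file names are assumptions, not the actual script.

    #!/usr/bin/env zsh
    # Minimal sketch: list blog entry directories whose name is one or more
    # digits, e.g. blog/1/ .. blog/41/ (the layout is an assumption).
    setopt extended_glob null_glob
    for entry in blog/[0-9]##(/); do
        print -r -- $entry
    done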
There is a template file copy called i.html inside the entry directory which holds the author's original text of the blog entry in HTML. It would be OK to use Markdown; however, Markdown is very limited compared with the structure and formatting HTML allows.
The script will read many HTML tags, such as [DESCRIPTION], [KEYWORDS], [DATE] and [TITLE], generate title lists for the blog homepage and of the latest 10 posts for the website homepage, and generate a webpage with all posts concatenated. For this last task, the script leaves some buffer files behind to speed up the following runs. As long as the original i.html and buffer modification dates are the same, the old buffer is reused.
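A minimal sketch of that kind of buffer check could look like the following; the buffer file name and the use of touch -r to synchronise the timestamps are assumptions.

    #!/usr/bin/env zsh
    # Sketch of the buffer-reuse idea: regenerate only when the mtimes of
    # i.html and the buffer differ, then copy the mtime for the next run.
    entry=$1                        # e.g. blog/41
    src=$entry/i.html
    buf=$entry/.buffer.html

    if [[ ! -e $buf || $src -nt $buf || $buf -nt $src ]]; then
        tidy -quiet -indent -output $buf $src
        touch -r $src $buf          # make the mtimes match again
    fi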
That was a very important decision, to start using tidy, the granddaddy of HTML tools, for processing HTML. We can expect tidy to produce a definite output once you configure it to your needs. That allows us to work reliably on obtaining and modifying data from specific fields with sed and other tools.
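For instance, once tidy has normalised the markup, a field such as the meta description can be pulled out with a single sed expression. This is only a sketch; the exact attribute names and paths are assumptions.

    #!/usr/bin/env zsh
    # Sketch: normalise the entry with tidy, then extract the description.
    tidy -quiet -indent -wrap 0 blog/41/i.html |
        sed -n 's/.*<meta name="description" content="\(.*\)".*/\1/p'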
Recently, an option was added which generates a new post from a template directory and automates most required tasks, such as setting the number and date of the post. Before, I had to manually copy the template directory, which really only holds the template i.html file, and set all properties manually.
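In spirit, that option can be sketched as follows; the template path, the [DATE] placeholder and the numbering scheme are assumptions.

    #!/usr/bin/env zsh
    # Sketch: copy the template, number the new post after the newest entry
    # and stamp today's date into the [DATE] placeholder.
    setopt extended_glob null_glob
    last=( blog/[0-9]##(/nOn) )            # newest entry first (numeric sort)
    new=$(( ${${last[1]:t}:-0} + 1 ))
    cp -r blog/template blog/$new
    sed -i "s/\[DATE\]/$(date -R)/" blog/$new/i.html    # GNU sed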
Sitemap System
Soon after the website was holding a little interesting and original content, I signed my website up with search engine crawlers in order to try and appear among their search results.
The biggest ones to sign up for (even though they should reach your website in due course automatically) were Bing and Google. Both offer a console for checking search data, and as far as I know that is the only data currently collected from their users and shared with me about my website.
With this need, I developed scripts to generate three types of sitemaps, two for crawling engines and one for humans.
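The crawler-facing kind boils down to a plain list of URLs in XML. A minimal sketch could look like this; the domain and the assumption that every page is an index.html are mine, not the actual script's.

    #!/usr/bin/env zsh
    # Sketch: build a minimal sitemap.xml from every index.html in the site.
    site=https://example.com
    {
        print '<?xml version="1.0" encoding="UTF-8"?>'
        print '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        for page in **/index.html; do
            print "  <url><loc>$site/${page%index.html}</loc></url>"
        done
        print '</urlset>'
    } > sitemap.xml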
Blog RSS Feed System
As page development progressed, my attention was caught by RSS feeds. Luke Smith was one of the first I remember recommending setting up your own RSS feeds.
RSS feed networks work in a very peculiar way in which one peer can relay audio content and data to other peers. Essentially, you are covered as copies of your podcast are broadcast and reach the end point, the listener. Having a copy of your content delivered to your listeners without depending too much on third-party services is really satisfying.
Plus, working with XML files, which are like HTML files with a very strict syntax, turned out to be very useful. Mostly, we can work with XML with tidy and xmlstarlet. xmlstarlet is probably the only decent XML editor available, AFAIK. Otherwise, editing XML with sed is not hard in a shell script, either.
In the end, I use xmlstarlet to edit a basic and valid RSS feed (remember, RSS stands for Really Simple Syndication) with entry descriptions taken from the [DESCRIPTION] tags of blog posts and some more metadata such as the date.
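As an illustration of the kind of edit involved, here is a hedged sketch that appends one item to a feed with xmlstarlet; the variable values and feed path are assumptions.

    #!/usr/bin/env zsh
    # Sketch: append a new <item> with title, description and date to feed.xml.
    title="Post #41"
    desc="Overview of my blogging systems"
    xmlstarlet ed -L \
        -s '/rss/channel' -t elem -n item -v '' \
        -s '/rss/channel/item[last()]' -t elem -n title       -v $title \
        -s '/rss/channel/item[last()]' -t elem -n description -v $desc \
        -s '/rss/channel/item[last()]' -t elem -n pubDate     -v "$(date -R)" \
        feed.xml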
There is also an alternative RSS feed with the full content of blog entries, which some people may prefer. However, for adding full-HTML content in XML [DESCRIPTION] tags, sed seems a better choice as it does not care about <![CDATA[]]> constructs nor does it try to parse HTML.
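The trick is essentially to let sed splice a ready-made CDATA block into the feed without interpreting it. A sketch under assumed file names and an assumed placeholder comment, using GNU sed:

    #!/usr/bin/env zsh
    # Sketch: wrap the full entry HTML in CDATA and splice it in place of a
    # placeholder line of the full-content feed.
    body=$(<blog/41/i.html)
    print -r -- "<description><![CDATA[${body}]]></description>" > /tmp/desc.xml
    sed -e '/<!-- FULL-CONTENT-41 -->/r /tmp/desc.xml' \
        -e '/<!-- FULL-CONTENT-41 -->/d' feed-full.xml > feed-full.new.xml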
Podcast Tumblelog and RSS Feed Systems
The real professional way of streaming a podcast is through RSS networks. As XML was not a monster to me anymore, I decided to write my own audiocast RSS system.
The main thing with podcast RSS is that you need to have at least the initial broadcast. Spotify, Blubrry and other services may eventually limit podcast storage and charge for premium services.
I could write the podcast topics in a blog entry and link back to it. As podcast distribution should require including only a (short) episode description beyond lots of metadata, I decided to have my podcast entries manually written as XML files.
There are two types of template files. Each audio file needs an associated XML file with its title, description and metadata, which is created by copying an entry XML template. In the next step, the script finds all episode XML files, fills in all other required metadata, such as audio duration, size and timestamps in different formats, and injects the episode entries, in the correct order, into the final XML feed (built from another XML template), which already contains all the static information about my channel.
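A hedged sketch of the derivable part of that fill-in step, assuming the element and attribute names and using ffprobe for the duration:

    #!/usr/bin/env zsh
    # Sketch: derive size and duration from the audio file and write them
    # into the matching episode XML (element names are assumptions).
    audio=podcast/ep01.mp3
    episode=podcast/ep01.xml

    size=$(stat -c %s $audio)                                        # bytes
    secs=$(ffprobe -v quiet -show_entries format=duration -of csv=p=0 $audio)

    xmlstarlet ed -L \
        -u '//enclosure/@length' -v $size \
        -u '//duration'          -v ${secs%.*} \
        $episode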
The automation of XML podcast entries was hard and time-consuming, but after struggling with it for some days, it started to work as expected and stabilise.
Even though I tried to automate filling in most of the meta tags, it remains a pain in the arse to fill in the required fields of an XML template. Still, what can be done with the XML file after filling in that metadata is so powerful that the manual work is worth it.
Once my RSS broadcast system was in place, I set up accounts in many (about 14) podcast directories to make it available to their users.
Creating a decent homepage for the podcast episodes was possible using John Bokma's tumblelog engine almost as an extra, because extracting data from the podcast XML files to use with tumblelog.py was a breeze.
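That extraction can be sketched roughly as follows; the element names are assumptions and the output format is just one plausible input for tumblelog:

    #!/usr/bin/env zsh
    # Sketch: pull date and title out of each episode XML file.
    for episode in podcast/*.xml; do
        xmlstarlet sel -t -m '//item' \
            -v 'pubDate' -o ' ' -v 'title' -n $episode
    done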
Homepage System
Eventually, I decided my homepage (landing page) was crowded and cloned the content of some sections, such as links and quotes, into separate pages.
The homepage system grabs the latest links and quotes and injects them into the homepage. If the visitor wants the complete list of links and quotes, she can visit dedicated pages with the full content and nice CSS styles.
Here, tidy is essential to work with well-formatted HTML for further processing and injections.
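The injection itself can be pictured like this; the comment markers, file names and the use of GNU sed are assumptions:

    #!/usr/bin/env zsh
    # Sketch: take the newest five <li> items from the links page and splice
    # them into the homepage between two marker comments (GNU sed).
    tidy -quiet -indent -wrap 0 links/index.html |
        grep '<li>' | head -n 5 > /tmp/latest-links.html

    sed -e '/<!-- LINKS-START -->/,/<!-- LINKS-END -->/{//!d}' \
        -e '/<!-- LINKS-START -->/r /tmp/latest-links.html' \
        index.html > index.html.new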
Repository System
Just yesterday, I had this idea of hosting my repos inside my website. Perhaps that is not the best idea as there is GitHub infrastructure exactly for that, but I decided to have a go.
It turns out there is one feature of GitHub I cannot implement, that is, cloning an entire repo (directory/folder). One way to offer similar functionality is to offer archives of some of my repos. It is, however, a shame that GitHub does not host files over 100MB, and thus I can only serve archives of repos under 100MB.
This is the script which takes the longest time, and the time depends on the compression method. But this should not be run every time anyway, so it is not a problem speed-wise.
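The archiving step can be sketched like this; the directory layout, the tar.gz format and the 100MB check are assumptions:

    #!/usr/bin/env zsh
    # Sketch: archive each repo with git archive and skip any result that
    # GitHub would refuse to host as a regular file.
    limit=$(( 100 * 1024 * 1024 ))
    mkdir -p archives
    for repo in repos/*(/); do
        out=archives/${repo:t}.tar.gz
        git -C $repo archive --format=tar.gz -o $PWD/$out HEAD
        if (( $(stat -c %s $out) > limit )); then
            print "skipping ${repo:t}: archive larger than 100MB"
            rm -f $out
        fi
    done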
Vlog System
This is the latest system I developed, to take care of publishing some videos from my old YouTube channel -- X GNU Bio.
Before removing my YouTube channel, I decided to back up all my videos with youtube-dl. The option to add metadata to the videos is really generous, and that enabled me to automatically make pages for them for republication. All I have got to do is get metadata for fields such as title, description, comment (the full YouTube video description), date, etc. with shell scripting tools and make HTML pages for them.
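A sketch of that page generation, assuming the metadata was embedded into the files and can be read back with ffprobe (the tag names, template and placeholders are assumptions; real descriptions would need escaping before the sed step):

    #!/usr/bin/env zsh
    # Sketch: read embedded title/description and fill an HTML template.
    for video in vlog/*.mkv; do
        title=$(ffprobe -v quiet -show_entries format_tags=title \
                        -of default=noprint_wrappers=1:nokey=1 $video)
        desc=$(ffprobe -v quiet -show_entries format_tags=description \
                       -of default=noprint_wrappers=1:nokey=1 $video)
        sed -e "s/\[TITLE\]/$title/" -e "s/\[DESCRIPTION\]/$desc/" \
            vlog/template.html > ${video:r}.html
    done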
I gave making a vlog section under my own website a try. GitHub does not allow files larger than 100MB, so only about 20% of my YouTube videos can be published over here (the smallest and shortest videos).
I am happy with the vlog portal, and you can check some of my YouTube videos.
This system will only work with my YouTube videos, because I downloaded them all with the same settings and with the same metadata available. Eventually I may start a new vlog with more videos, but I will need a new vlog system then.
One thing is for sure, we shall not return to YouTube anytime soon.