
by Mountaineerbr

#22 - Website Improvements, Shell RSS Feed Generator


Working on the website code, improving, for example, the navigation menu of the main page, and also thinking about how to maintain an RSS news feed generator.


In the past few days I have been working on the redesign of my website's main navigation bar. Finally, I am satisfied with the results. The blog proper uses a navigation bar from the W3.CSS framework, so I will just leave that as it is for now.

I have been sending a lot of CVs to many private companies and I got no answer back, except one saying I was removed from their selection but would still be in the data bank for another opportunity. However, the e-mail was from a third-party contractor, which omitted the name of the company I applied to. 😐


Cool/inspiring blog generators

I am using my own shell solution to manage my blog and that is working fine. The code looks big at this point, but I am only making small modifications and improvements as time goes by.

I got inspired by John Bokma's tumblelog scripts, which generate RSS and JSON feeds. A JSON feed is over the top for me at this point, and RSS feeds are widely adopted, so I thought I would try to make my own RSS generator script. There is plenty of code I was able to reuse from my blog.sh shell script, plus I found this nice Stack Overflow thread and decided to go with xmlstarlet, even though it is not too difficult to generate my sitemap.xml file with shell and sed.

I am now trying to understand some more about XML in general.

RSS, Atom and Namespaces

Specs/Standards

RSS 1.0
http://web.resource.org/rss/1.0/spec
RSS 2.0
https://cyber.harvard.edu/rss/rss.html
https://www.rssboard.org/rss-2-0-1-rv-6
Atom
https://www.ietf.org/rfc/rfc4287.txt
https://tools.ietf.org/rfc/rfc4287.txt

Some RSS Feeds

I am studying some RSS feeds from some high-quality people.

I shall clone Richard Stallman's RSS feed template, including the namespaces defined in the XML document, but I will try to avoid using <![CDATA[...]]> sections for pasting HTML code under the item's description tag, as in Luke Smith's RSS feed.

Here is the catch, though: I will not paste unescaped HTML code and will follow the Arch RSS feed scheme instead. Luckily, xmlstarlet can do that escaping for us, meaning that HTML code will be translated into HTML entities. However, that will generate parsing errors if I write HTML entities in the raw blog post myself. I need to think more about this. Just use <![CDATA[...]]> sections as in Stallman's feed?
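To make the escaping step concrete, here is a minimal sketch in plain shell and sed. It is not what xmlstarlet does internally; the function name xml_escape is made up for illustration. It only shows the idea of turning HTML markup into entities so that it can sit inside the description element as text.

```shell
#!/bin/sh
# Hypothetical helper (not part of blog.sh or xmlstarlet): escape the five
# XML special characters in the text read from stdin.
# The ampersand rule must come first, otherwise the "&" of the entities
# produced by the later rules would be escaped a second time.
xml_escape() {
    sed -e 's/&/\&amp;/g' \
        -e 's/</\&lt;/g' \
        -e 's/>/\&gt;/g' \
        -e 's/"/\&quot;/g' \
        -e "s/'/\&apos;/g"
}

# Example: a fragment that already contains a hand-written entity.
printf '%s\n' '<p>Fish &amp; chips</p>' | xml_escape
# prints: &lt;p&gt;Fish &amp;amp; chips&lt;/p&gt;
```

Note how the entity that was already in the post is escaped again, so it survives one round of unescaping by the feed reader.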

I reckon our RSS feed should stay compatible with the standards this way, but of course it will comply better over time as I adjust it.

I am just not sure whether only a brief summary of the blog article or the whole article should be added to the RSS feed. Luke Smith does that sloppily because he cannot use all HTML elements, just simple ones such as <p> and <em> according to this reference, so let's see what sort of HTML he injects in his new feed.

Note: Luke has just changed his RSS feed and is using <![CDATA[...]]> now. Indeed, he has been redesigning his whole website recently.

Fig 1. Just like my own RSS feed, Luke's feed is also having some heavy design updates..

The <summary> tag is reserved for the blog posting system; however, the way it is planned to be used is to optionally write a summary for each article in a language other than that of the post itself (so content can reach a wider range of visitors that way, even if partially).

That will not do for the RSS feed system, though. If I cannot include the full blog post content (with the exception of graphics), then I will just take the head of the post content and add that to the RSS feed entry, or grep the description metatag of each blog article, which I always need to fill in manually (for search engines) anyway.

Hopefully, there will soon be an RSS feed button on the blog homepage.
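The metatag option could look roughly like the sketch below. The function name meta_description is hypothetical, and the sketch assumes the tag sits on a single line and is written exactly as <meta name="description" content="..."> with double quotes; other attribute orders would need a real HTML parser.

```shell
#!/bin/sh
# Hypothetical helper: print the content of the first description metatag
# found in the HTML read from stdin.
# Assumes: one tag per line, name before content, double-quoted values.
meta_description() {
    sed -n 's/.*<meta name="description" content="\([^"]*\)".*/\1/p' | head -n 1
}

# Example with a made-up page:
meta_description <<'EOF'
<html><head>
<title>Post</title>
<meta name="description" content="Notes on building an RSS feed with shell.">
</head><body><p>Text</p></body></html>
EOF
# prints: Notes on building an RSS feed with shell.
```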


It seems that the right way to do this is to use xmlstarlet and comply with its bidding. The following quote made me realise this:

It comes down to this: XML is not a string. Don't treat it as one. Don't use or create tools that treat XML as a string. XML requires a parser - and all conforming parsers will do the right thing in this situation.
--Tomalak

In reality, we should not be adding much more than a simple description to the RSS description tag. You can escape some HTML code, but xmlstarlet will throw some errors because of strings such as =&gt in URLs, which really should be encoded in the first place, but we are not so perfect.

Stallman bypasses that by using a <![CDATA[...]]> section and adding unescaped HTML code, such as URLs, to it. The Arch News feed translates HTML tags to HTML entities, which should (hopefully) be treated as proper HTML code by the RSS parser when reading the <description> field.

Either we can go simple and use xmlstarlet's help, or we can go nuclear, the Stallman way, and use sed to generate our RSS document.

There does not seem to be a single way of encoding data in the description element in RSS 2.0, as explained in the last references above.


For simple descriptions with little and simple HTML tagging, TrinitronX's answer at Stack Overflow suggests using tidy to transform named entities into numeric entities and to convert HTML to XHTML, which seems to work for injecting almost any HTML data into the <description> element. However, that will fail if you have escaped HTML entities in your HTML code. Also, xmlstarlet does not care about <![CDATA[...]]> sections and will escape/convert all HTML code to entities.

I was considering creating two channels in a single RSS feed, one with just a description of each post and another one with the full content of posts, which is generally more useful.

Instead of using two channels, there is an alternative using Atom namespaces, but I don't see any major benefit, plus I need to study Atom more.

In case of Atom, you could even provide both in the same feed: a content element for the full content, and a summary element for the excerpt. (I guess other feed formats allows this, too.)
--unor

I installed newsboat, available from the official repos. After configuring it, it seems to be a really cool CLI news aggregator; I recommend it! I am also testing QuiteRSS (a GUI reader which, on reconsideration, is very impressive), the Thunderbird add-on evolution-rss (very good!), the Firefox add-on Livemarks (rather simple), and RSS Reader (for Android). They all behave in different ways.

I added <link rel="alternate" type="application/rss+xml" title="Biology Blog RSS" href="rss.xml"> to the <head> of index.html to reference the RSS Feed.

After a lot of trying, I decided to go the Stallman way and release two feeds: the Default RSS Feed, which shall contain only short descriptions of posts, and the Alternative RSS Feed, which shall contain the full content of posts injected within <![CDATA[...]]> sections (and is bigger in size).
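As a rough illustration of the full-content variant, the sketch below wraps raw HTML in a CDATA section and prints one item. The function names cdata_wrap and rss_item are made up here, and the title and link are assumed to be XML-safe already. The one subtle point is that a literal "]]>" inside the post would close the section early, so it is split across two adjacent sections.

```shell
#!/bin/sh
# Hypothetical helper: wrap the text read from stdin in a CDATA section.
# Any "]]>" in the content is split as "]]" + ">" across two sections,
# so the content cannot terminate the section by accident.
cdata_wrap() {
    printf '<![CDATA['
    sed 's/]]>/]]]]><![CDATA[>/g'
    printf ']]>\n'
}

# Hypothetical helper: print one RSS item with the post HTML from stdin.
# usage: rss_item TITLE LINK < post.html
# Assumes TITLE and LINK contain no XML special characters.
rss_item() {
    printf '<item>\n<title>%s</title>\n<link>%s</link>\n<description>' "$1" "$2"
    cdata_wrap
    printf '</description>\n</item>\n'
}

# Example with made-up values:
printf '%s\n' '<p>Full <em>post</em> content</p>' |
    rss_item 'Example post' 'https://example.com/blog/22/'
```

The short-description feed would skip the wrapper and escape the text instead.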


I have been testing my feeds with newsboat, and it seems that some bad or complex HTML code in my posts may make the newsboat parser go nuts at times.

As RSS really stands for Really Simple Syndication, and it is so easy to open the full content in a proper (CLI) web browser from newsboat and other RSS feed readers, I have decided that the main feed will contain short descriptions of blog post entries and there will be an alternative feed with full content, for those who want it. But I cannot guarantee your news reader will process the alternative feed correctly.

Even though delivering full content by default would seem the most useful approach for the visitor, my posts do contain a lot of HTML code and thus really are not supposed to be parsed by the RSS feed reader. By contrast, if I wrote posts in markdown, that could easily be converted to simple HTML, encoded appropriately and injected in the RSS feed. When received, it would be parsed correctly by the RSS feed reader.

Another factor involved is size. The short description feed is more than 10 times smaller in size than the full-content feed (16.28KB vs 171.15KB).

If you are subscribed to the main RSS feed already, there may unfortunately be some problems on your side when I make the switch. If so, please remove and re-add my feeds.

Especially now that I am also setting proper guids (globally unique identifiers) for my feed items. Eventually, I will need guids unrelated to my website address to make sure items are unique for podcast items and permanent whenever I move to another website host (currently GitHub)!

The feeds should be stable within a few days. Sorry about any inconvenience while I get all that sorted out.
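One possible way to get guids that do not depend on the website address is to derive them from data that never changes, such as the publication date and a post slug. The sketch below is only an assumption about how that could be done, not what the feed currently uses; make_guid and guid_element are hypothetical names, and sha256sum is assumed to be available (GNU coreutils). The isPermaLink="false" attribute tells readers the guid is an opaque string rather than a URL.

```shell
#!/bin/sh
# Hypothetical helper: derive a stable, host-independent identifier from
# the publication date and the post slug (both assumed never to change).
# usage: make_guid DATE SLUG
make_guid() {
    printf '%s|%s' "$1" "$2" | sha256sum | cut -c1-32
}

# Hypothetical helper: print the guid element for one item.
guid_element() {
    printf '<guid isPermaLink="false">%s</guid>\n' "$(make_guid "$1" "$2")"
}

# Example with a made-up date and slug:
guid_element 2020-01-01 website-improvements-rss
```

The same date and slug always give the same guid, so moving the site to another host would not make readers see old items as new.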

More references