#22 - Website Improvements, Shell RSS Feed Generator
Trabalhando com o código do site, melhorando por exemplo, o menu de navegação da página principal do site, e também pensando em como fazer para manter um gerador de distribuição de notícias em RSS..
In the past few days I have been working on my website layout redesign of the main navigation bar. Finally, I am satisfied with the results. The blog proper uses a navigation bar from W3.CSS framework so I will just leave that as it is for now.
I have been sending a lot of curriculums for many private companies and i got no answer back, except one saying I was removed from their selection but would still be in the data bank for another opportunity.. However, the e-mail was for a 3rd-party contractor which omitted the company's name I applied for.. 😐
Cool/inspiring blog generators
I am using my own shell solution to manage my blog and that is working fine, although the code looks big at this point, but I am only doing small modification for improvements as time goes by.
I got inspired by John Bokma's
tumblelog
scripts that generate an RSS
and JSON feeds. A JSON feed is over
the top for me at this point, plus RSS feeds are well
adopted so I thought I would try and make my own RSS generator
script. There is plenty of code I was able to use from my
blog.sh
shell script, plus I found this nice
Stack Overflow thread and decided to go with
xmlstarlet
, even though it is not too difficult to
generate my sitemap.xml file with
shell and sed
.
I am now trying to understand some more about XML in general.
RSS, Atom and Namespaces
- XML 1.0
- RSS 2.0
- Atom, (b)
- Dublin Core Metadata Element (DC), (b)
- W3 wiki on RssContent
- A few examples of encoding & item-level descriptions in RSS 2.0
- RSS.ppt
- RSS and Atom Compared
- Phil Ringnalda's post
Specs/Standards
- RSS 1.0
- http://web.resource.org/rss/1.0/spec
- RSS 2.0
- https://cyber.harvard.edu/rss/rss.html
https://www.rssboard.org/rss-2-0-1-rv-6 - Atom
- https://www.ietf.org/rfc/rfc4287.txt
https://tools.ietf.org/rfc/rfc4287.txt
Some RSS Feeds
I am studying some RSS feeds from some high-quality people.
- Stallman's
- Luke Smith's
- Arch News Forum
- John Bokma's RDF
feed
(which is another beast completely)
I shall clone Richard Stallman RSS feed template,
including namespaces define in the XML document, but I will
try and avoid using <![[CDATA]]>
arrays for
pasting HTML code under the item's description
tag
as in Luke Smith RSS feed.
Although here is the catch, I will not paste unescaped
HTML code and will follow Arch RSS feed scheme.
Luckily, xmlstarlet
can do that escaping for us!
Meaning that HTML code will be translated into HTML
entities, but that will generate parsing errors if I wrote
HTML entities in the raw blog post myself.. Need to think
more about this.. Just use <![[CDATA]]>
arrays
as in Stallman's feed?
I reckon our RSS feed should stay compatible with standards this way but of course it will comply better over time as I adjust it..
I am just not sure if only a brief summary of the blog
article or the whole blog article should de added in the RSS
feed. Luke Smith does that sloppily because he cannot use all
HTML elements, just simple ones such as
<p>
and <em>
according to
this reference, so
lets see what sort of HTML he injects in his new feed..
Note: Luke has just changed his RSS feed and is using
<![[CDATA]]>
now.. Indeed, he has been
redesigning his whole website
recently.
The <summary>
tag is
reserved for the blog posting system, however the way that is
planned to be used is to optionally write a summary for
each article in a different language other than the post
itself (so content can reach a more different visitors that way,
even if partially).
That will not do for the RSS feed system, though.. If I
cannot include the full blog post content (with the exception of
graphics), then I will just head
the post content and
add that to the RSS feed entry, OR,
grep
the description metatag of each blog
article, which I need always fill manually (for Search Engines)
anyway..
Hopefully, there will soon be an RSS feed button in the blog homepage..
It seems that the right way to do this is to use
xmlstarlet
and comply with its bidding. The following
quote made me realise this:
It comes down to this: XML is not a string. Don't treat it as one. Don't use or create tools that treat XML as a string. XML requires a parser - and all conforming parsers will do the right thing in this situation.
--Tomalak
In reality we should not be adding much more than a simple
description in the RSS description
tag, you can
escape some HTML code but xmlstarlet
will throw
some errors because of strings such as =>
in
URLs, which really should be encoded in the first place but
we are not so perfect..
Stallman bypasses that using <![[CDATA]]>
array and adding unescaped HTML code, such as URLs,
etc to it. Arch New feed translates HTML code tags to
HTML entities, which should be treated as proper HTML
code by the RSS parser when reading the
<![[CDATA]]>
field (hopefully)..
Either we can go simple and use xmlstarlet
help or
we can go nuclear the Stallman's way and use sed
to
generate our RSS document.
There does not seem to be a single way of encoding data in description in RSS 2.0 as explained in the last references above.
For simple descriptions with few and simple HTML tagging,
TrinitronX's answer at Stack Overflow suggests using
tidy
and transform named entities to numeric entities
plus converting HTML to XHTML, which seems to work to
inject almost any HTML data in the
<description>
element.. However, that will fail
if you have escaped HTML entities in you HTML codes.
Also, xmlstarlet
does not care about
<![[CDATA]]>
tags and will escape/convert all
HTML code to entities..
I was considering creating two channels in a single RSS feed, one with just a description of each post and another one with the full content of posts, which is generally more useful.
Instead of using two channels, there is an alternative using Atom namespaces but I don't see any major benefit plus I need studying Atom more..
In case of Atom, you could even provide both in the same feed: a content element for the full content, and a summary element for the excerpt. (I guess other feed formats allows this, too.)
--unor
I installed newsboat
, available from
the official repos. After configuring it,
that seems a really cool cli news aggregator, I recommend it! I am
also testing QuiteRSS
(GUI, on
reconsideration, it is a very impressive reader), Thunderbird
add-on evolution-rss
(very good!), Firefox add-on
Livemarks (rather simple), and
RSS Reader (for Android). They all behave in different
ways..
I added <link rel="alternate"
type="application/rss+xml" title="Biology Blog RSS"
href="rss.xml">
to the <header>
of
index.html to reference the RSS Feed.
After a lot of trying, I decided to go the Stallman way and
release two feeds, the
Default RSS Feed which shall contain only short
descriptions of posts and the
Alternative RSS Feed which shall contain full content of
posts injected within <![[CDATA]]>
arrays (and
is bigger in size).
Testing my feeds with newsboat
and there seems that
some bad or complex HTML code in my posts may make
newsboat
parser go nuts at times.
As RSS really stands for Really Simple
Syndication, and that is so easy to open the full content in a
proper (cli) web browser with newsboat
and other
RSS feed readers, I have decided that the main feed will
contain short descriptions of blog post entries and there will be
an alternative feed for full content, for those who want it. But I
cannot guarantee your news reader will process the alternative feed
correctly..
Even though delivering full content by defaults would seem the
most useful approach for the visitor, my posts do contain a lot of
HTML code and thus really are not supposed to be
parsed by the RSS feed reader.. For example, if I wrote
posts in markdown
, that could easily be converted to
simple HTML, encoded appropriately and injected in the RSS
feed. When received, it would be parsed correctly by the RSS
feed reader.
Another factor involved is size. The short description feed is more than 10 times smaller in size than the full-content feed (16.28KB vs 171.15KB).
If you are subscribed to the main RSS feed already, hopefully there may be some problems on your side when I do the switch. If so, please remove and re-add my feeds.
Specially now that I am also setting proper guids (global identifiers)) to my feeds. Eventually, I will need guids unrelated to my website address to make sure items are unique for podcast items and permanent whenever I move from website host (currently github)!
The feeds should be stable in a matter of few days.. Sorry about any inconvenience while I get all that sorted out..
More references
- Podget -- A simple podcast aggregator. Check his favourite podcast list.
- BashPodder (podcast
aggregator) Check his
Where can I find podcasts?
list. - the Linux Action Show -- Best of Bash Scripts
- Making RSS Pretty
- Custom Style RSS Feed
- Adding a CSS StyleSheet to your RSS Feed
- Beginning to Style Your RSS Feed
- Improving an XML feed display through CSS and XSLT
- CSS in RSS feed
- Making Your RSS Feed Look Pretty in a Browser
- css and rss
- Atom Info Proposal
- Note: if you decide to go exploring the following links, make sure to retrieve older pages referenced from them with the WayBack Machine, as many of them have got their only surviving copy available there
- Phil Ringnalda's blog (again, very very good)..
- Atom <title>
- Title Conformance Tests
- Matt Mower's Looking for a helping hand... blog entry on his own blog disappering from Google results when searching for with his name
- Phil Ringnalda questions the value of letting Google and MSN index his pages...
- Search Engines as Leeches on the Web
- Feeds As Attack Delivery Systems
- Feed Injection in Web 2.0 (Web Feeds as Attack Vectors)
- Luke Hutteman on Java, .NET, J2EE, RSS and whatever else comes to mind...
- Hixie's
Natural Log, however beware he has got some nasty ideas about
eugenics (which he calls
Humanitarian eugenics
💀💀💀 ~laughings~) I have seen other people with poor understanding of biology and natural selection law as well, including Luke Smith. Be warned. - Deep Linking
- Berners-Lee's draft-ietf-iiir-html-00.txt