In this article we describe the current state of RSS clients, present our solution, and discuss the deeper technical details.
RSS may seem like an old technology, but it is still heavily used by a lot of people. There are many RSS clients that come with pre-registered sources or allow users to add their own links. Even with the large variety of solutions on the market, some aspects have not been covered yet.
When the user clicks an article, RSS clients redirect the user to the article's source. The user lands on the origin page and usually waits a couple of seconds for the content to load, closes the popup ads, accepts the page's policy, and dismisses the notifications prompt. Moreover, less popular languages like Greek have no dedicated RSS clients; Latin-oriented clients try to cover the gap.
What we really wanted to build is a very fast RSS application that provides a simple, minimalistic, and modern experience. An application that delivers all the news in near real-time, divided into categories and sources, so the user can easily choose exactly the type of news they want to read. An integrated article reader assists the user, so they do not have to wait for the original article to load. All of this is implemented in a cutting-edge, mobile-first application. So, let's list the required ingredients:
- Backend app
- RSS crawler
- Article extractor
- SSR (server-side rendering) for the social media SEO
- Image proxy
- Frontend app
- Services orchestrator
The separation of roles between the services was clear from the beginning. There are some initial observations regarding the implementation. Our needs differ from those of a classic backend-frontend app, so we had to build our own infrastructure and avoid a software-as-a-service (SaaS) solution. Some of the main reasons are:
- The RSS crawler is an infinite loop that never stops fetching and processing external resources. Deploying it on a SaaS infrastructure would therefore be enormously expensive.
- The RSS crawler is completely different from the backend app. Putting both in the same monolithic application would cause performance problems across the entire infrastructure.
Considering the above-mentioned, we decided to create a stack with 4 microservices, a frontend app, and an orchestrator.
The RSS crawler

This is the application that does the core job of RSS fetching. It iterates over the list of RSS registrations; for each registration, it fetches the RSS link, extracts the necessary fields, and applies some transformations. The process is an infinite loop with a delay of several seconds between fetches, so we do not get banned by the sources. We also track and log the acceptance and parse rate for every registration.
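As a rough sketch, one crawl pass over the registrations might look like the following. The registration shape, the naive regex-based item parsing, and the injected `fetchFeed` function are illustrative assumptions; the real service uses a proper RSS parser and asynchronous HTTP fetches.

```javascript
// Naive <item><title> extraction, for illustration only.
function parseItems(xml) {
  const items = [];
  const re = /<item>[\s\S]*?<title>([\s\S]*?)<\/title>[\s\S]*?<\/item>/g;
  let m;
  while ((m = re.exec(xml)) !== null) items.push({ title: m[1].trim() });
  return items;
}

// One pass over all registrations, recording the per-source parse rate.
function crawlOnce(registrations, fetchFeed) {
  const stats = {};
  for (const reg of registrations) {
    try {
      const xml = fetchFeed(reg.url);            // a network call in reality
      const items = parseItems(xml);
      // ...transform and persist the items here...
      stats[reg.url] = { ok: true, items: items.length };
    } catch (err) {
      // Failures feed the acceptance/parse-rate metric per registration.
      stats[reg.url] = { ok: false, items: 0 };
    }
  }
  return stats;
}
```

In production this pass runs in an endless loop, with a multi-second pause between fetches so the sources do not ban the crawler.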
The article extractor

This is functionality we do not usually see in the industry. Our goal is to extract only the useful article content from the page, without the ads, sidebars, and popups. We define the useful content of an article page as the combination of text, headings, images, and videos. After a lot of research, we concluded that this functionality already exists in modern browsers: it is called article “readability”, and it is available as an extension for Chrome and Firefox, running inside the browser execution context.
What we had to do is fetch the article on the server, load it into a server-side DOM implementation (jsdom), and use the readability library to extract the useful segments of the page. After some experiments, the result was amazing. The success rate was quite acceptable, despite the fact that the language of the pages is non-Latin. There are occasional false positives, but the result is not annoying. There are three observations about this process:
- This is a heavy process, so we had to add a caching mechanism.
- The success of the process is not guaranteed. Many steps can go wrong, from downloading the article page to extracting the article content. So, a nice error page has to be implemented in the frontend.
- There are too many feeds to run this procedure for every feed, so we do it on demand.
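Putting the pieces together, the extraction step looks roughly like this. It assumes the `jsdom` and `@mozilla/readability` npm packages and Node 18+ for the built-in `fetch`; the function name and error handling are illustrative, not the production code.

```javascript
const { JSDOM } = require("jsdom");
const { Readability } = require("@mozilla/readability");

async function extractArticle(url) {
  // Download the article page on the server (this step can also fail).
  const res = await fetch(url);
  const html = await res.text();

  // Build a server-side DOM; passing the url lets relative links resolve.
  const dom = new JSDOM(html, { url });

  // Readability returns null when it cannot find article content.
  const article = new Readability(dom.window.document).parse();
  if (!article) throw new Error("extraction failed"); // frontend shows the error page

  return { title: article.title, content: article.content };
}
```

Because the process is heavy, the result would be cached and the whole call only made on demand, as noted above.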
Image Proxy Architecture

Images are a basic requirement for this kind of application. They are a fundamental feature of both the article preview and the article content. All the images of the sources live on their own domains, and we cannot download every article's assets in the crawler service because there are too many. So, once again, we have to do it on the fly. The steps here are:
1. The frontend app requests the source image from our image proxy service, optionally passing width and height dimensions.
2. Our service downloads the image to a temporary folder and applies the requested size transformations.
3. It caches the image using an in-memory LRU policy.
4. It sends the image back to the frontend application.
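The caching in step 3 can be sketched with a tiny LRU built on `Map`'s insertion order. This is a simplification: a real image cache would also bound memory by total image size, not just by entry count.

```javascript
// Minimal LRU cache: Map preserves insertion order, so the first key
// is always the least recently used entry.
class LRUCache {
  constructor(maxEntries) {
    this.maxEntries = maxEntries;
    this.map = new Map();
  }

  get(key) {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key);
    // Re-insert so this key becomes the most recently used.
    this.map.delete(key);
    this.map.set(key, value);
    return value;
  }

  set(key, value) {
    if (this.map.has(key)) this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.maxEntries) {
      // Evict the least recently used entry.
      this.map.delete(this.map.keys().next().value);
    }
  }
}
```

The proxy would key entries by the image URL plus the requested dimensions, so the same source image resized to different sizes is cached separately.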
SSR for social media SEO

A must-have functionality for every modern application; unfortunately, we cannot live without it. It is the only way to make a SPA reachable by Google's bots and the preview mechanisms of Twitter, Facebook, etc. We did a lot of research into how to make this functionality as lightweight as possible. None of the production-ready solutions worked as expected:
- The renderToString method only renders the initial state. There are alternatives that yield the suspended loaders, but they need too much customization.
- The third-party libraries that offer async renderToString methods do not work as expected. They either do not update the state properly or do not support some native browser functions.
We concluded that the best solution is to use a headless browser, evaluate the page in the headless context, and send back the rendered HTML. [Puppeteer](https://github.com/puppeteer/puppeteer) is by far the best solution for this scenario because it uses the real Chrome engine under the hood. The steps of the procedure are:
- Use react-helmet in the frontend to inject all the SEO tags on every page: title, image, description, etc.
- Start up a Puppeteer page instance in the SSR service.
- Open the URL in the page and wait until the network is idle. Then we assume the page is ready for the snapshot.
- Send back the inner HTML of the page and cache it.
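The snapshot step can be sketched as follows, assuming the `puppeteer` npm package. The function shape is illustrative; the caching layer and the Express wiring are left out.

```javascript
const puppeteer = require("puppeteer");

async function snapshot(url) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Wait until the SPA has finished its network activity before snapshotting.
    await page.goto(url, { waitUntil: "networkidle0" });
    // Rendered HTML, including the tags injected by react-helmet.
    return await page.content();
  } finally {
    await browser.close();
  }
}
```

In production the browser instance would be kept warm across requests, and the resulting HTML cached per URL, since launching a browser for every bot request is far too expensive.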
The backend app

A straightforward implementation that provides all the necessary information to the frontend. Nothing tricky here.
We use MongoDB as the primary database for our needs. The entire application processes sequences and streams of data: the RSS feeds are stored sequentially, and the timeline of the frontend app depicts the same structure as it is stored in the database. A document-oriented database like MongoDB fits these needs well.
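For illustration, a stored feed entry might look like this (the field names are assumptions, not the actual schema); the timeline then becomes a simple query sorted by publication date.

```javascript
// Hypothetical shape of a stored feed entry.
const entry = {
  title: "Example article",
  link: "https://source.example/article",
  category: "sports",
  source: "example-source",
  published: new Date("2020-01-01T10:00:00Z"),
};

// Timeline page (MongoDB shell / driver syntax), newest first:
//   db.entries.find({ category: "sports" })
//             .sort({ published: -1 })
//             .skip(page * 20)
//             .limit(20);
```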
The frontend app

There was a lot of work here to make it fast, progressive, and easy to use. This part is an iteration that never stops; we can always find something to improve. The technologies:
- Create React App, scaffolding
- Grommet, UI library
- React context, state management
- react-virtualized, infinite scroll / window scroll
- react-helmet, SEO tags injection
The services orchestrator
Application Architecture

There are many parts of the services we have not explained yet: how all of these services communicate with each other and how they are deployed in production. We use [Nginx](https://www.nginx.com/) as our services orchestrator. All the services are NodeJS applications, and those that act as public servers use the ExpressJS framework. Nginx does the following:
- The public services are protected behind a reverse proxy.
- The image proxy service lives under the img.wiregoose.com domain.
- The backend services live under the api.wiregoose.com domain.
- The frontend app is a static bundle of files. Nginx redirects all requests to the index.html of the static bundle if there is no matching static file.
- Nginx checks the request's user agent and decides whether it is a crawler/bot or a normal request. Normal requests are served from the static files; bots are redirected to the SSR service. This technique is quite similar to the prerender.io Nginx configuration.
- The domains are certified via the [Let's Encrypt plugin](https://www.nginx.com/blog/using-free-ssltls-certificates-from-lets-encrypt-with-nginx/) dedicated to Nginx.
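An illustrative fragment of the bot-detection routing, loosely following the prerender.io recipe; the SSR service port and the user-agent list are assumptions:

```nginx
location / {
    try_files $uri @prerender;
}

location @prerender {
    set $prerender 0;
    if ($http_user_agent ~* "googlebot|bingbot|facebookexternalhit|twitterbot") {
        set $prerender 1;
    }
    if ($prerender = 1) {
        # Bots get the snapshot from the SSR service (hypothetical port).
        proxy_pass http://127.0.0.1:4000;
    }
    if ($prerender = 0) {
        # Normal requests fall back to the SPA entry point.
        rewrite .* /index.html break;
    }
}
```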
We have set up the configuration so that NodeJS handles the core job and Nginx handles all the trivial parts of the services' communication. We also wrap the NodeJS instances in the PM2 process manager. That way we keep the load on the NodeJS instances as light as possible and prevent downtime. All the services are launched in production using a docker-compose configuration.
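A sketch of the docker-compose wiring; the service names, build paths, and ports are assumptions, not the actual production file:

```yaml
version: "3"
services:
  backend:
    build: ./backend
    command: pm2-runtime start ecosystem.config.js
  crawler:
    build: ./crawler
  extractor:
    build: ./extractor
  ssr:
    build: ./ssr
  imgproxy:
    build: ./imgproxy
  nginx:
    image: nginx:stable
    ports: ["80:80", "443:443"]
    depends_on: [backend, ssr, imgproxy]
```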
Facebook page https://www.facebook.com/wiregoose/
- RSS technology, https://en.wikipedia.org/wiki/RSS
- SaaS, https://en.wikipedia.org/wiki/Software_as_a_service
- Web Crawler, https://en.wikipedia.org/wiki/Web_crawler
- Firefox Reader, https://addons.mozilla.org/el/firefox/addon/reader-view/
- Chrome Just Read, https://chrome.google.com/webstore/detail/just-read/dgmanlpmmkibanfdgjocnabmcaclkmod
- Chrome Reader View, https://chrome.google.com/webstore/detail/reader-view/ecabifbgmdmgdllomnfinbmaellmclnh
- Virtual dom implementation in NodeJS, https://www.npmjs.com/package/jsdom
- Mozilla Readability library, https://github.com/mozilla/readability
- LRU Cache Policy, https://en.wikipedia.org/wiki/Cache_replacement_policies#Least_recently_used_(LRU)
- SPA, https://en.wikipedia.org/wiki/Single-page_application
- Facebook Sharing Debugging Tool, https://developers.facebook.com/tools/debug/
- React renderToString method, https://reactjs.org/docs/react-dom-server.html
- Headless browser, https://en.wikipedia.org/wiki/Headless_browser
- Puppeteer, https://github.com/puppeteer/puppeteer
- React Helmet for head tags injection, https://github.com/nfl/react-helmet
- Grommet UI Library, https://v2.grommet.io/
- Virtualized list, https://github.com/bvaughn/react-virtualized
- Nginx lets encrypt, https://www.nginx.com/blog/using-free-ssltls-certificates-from-lets-encrypt-with-nginx/
- PM2 process manager, https://pm2.io/