Announcing website snapshot for Trandoshan

Go · Oct 14, 2019

When I wrote my article about Trandoshan, I didn't expect so much positive feedback: more than 15k views on the article, a hundred stars on the GitHub repository... I am very happy about it.

My initial plan was to stop development right after 1.0.0 was working and the article was published, in order to move on to new project(s). But despite that, I had several cool features in mind.

Since the project seems to interest some people, I have decided to implement one of them: website snapshots.

What the hell is a website snapshot?

Currently Trandoshan only indexes the text found in the crawled resources. The text is stored in a database, and a text search can be performed using the API.

The problem with this implementation is that you cannot actually render the website from Trandoshan: sure, you can view the raw HTML, but you cannot access any image, you lose the external CSS, etc...

The idea behind the feature is to create a snapshot of the website at time T and store it somewhere, to allow complete viewing from Trandoshan (including images, relative URLs, etc...).

But wait, there's more

Creating a website snapshot at time T adds an extra dimension to Trandoshan: time. And this is where the fun begins...

Benefits of using time as an extra dimension

Adding a time dimension allows a lot of features to be built:

  • We can imagine viewing the evolution of a given website over time (something similar to the Internet Archive)
  • Searching the number of occurrences of a given term across all hidden services at a given period of time (something like a graph: the evolution of mentions of the word bitcoin over time)
  • Creating statistics on the whole set of hidden services: how many new websites are created, how long they stay alive, etc...

How to implement the feature

Such a big feature will require significant refactoring, because Trandoshan wasn't designed for it in the beginning. I have started implementing small things, such as crawling relative URLs and customizing forbidden content types, that will help build the feature.

But there are some other things to do...

The scheduler

Currently the scheduler prevents duplicate crawling: it will not crawl an already-crawled resource. This has to change if we want to crawl a resource at different times.

There will be a threshold between two crawl requests for the same resource, to prevent an infinite loop if two websites reference each other.
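To illustrate, here is a minimal sketch of what that check could look like, assuming the scheduler keeps track of the last crawl time of each resource (the names and the threshold value are hypothetical, not the final design):

    package main

    import (
        "fmt"
        "time"
    )

    // shouldCrawl decides whether a resource is due for a new crawl.
    // lastCrawl is when the URL was last crawled (zero value if never);
    // threshold is the minimum delay between two crawls of the same resource.
    func shouldCrawl(lastCrawl time.Time, threshold time.Duration) bool {
        if lastCrawl.IsZero() {
            return true // never crawled before
        }
        return time.Since(lastCrawl) >= threshold
    }

    func main() {
        lastCrawl := time.Now().Add(-6 * time.Hour)
        fmt.Println(shouldCrawl(lastCrawl, 24*time.Hour))   // false: crawled too recently
        fmt.Println(shouldCrawl(time.Time{}, 24*time.Hour)) // true: never crawled
    }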

The persister

Now it's time to stop storing everything in a database and use something more convenient. The persister process will now store raw resources on disk with the following pattern:

resource-url/64bit-timestamp

For example, the following resource:

http://login.google.com/secure/createAccount.html

will be stored like this on disk:

login.google.com/secure/createAccount.html/1570788418
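For illustration, here is a small Go sketch of how such a path could be built from a URL, assuming the protocol is simply stripped and the crawl time is appended as a 64-bit Unix timestamp (function names are hypothetical):

    package main

    import (
        "fmt"
        "net/url"
        "path/filepath"
        "time"
    )

    // resourcePath builds the on-disk path for a crawled resource:
    // the URL without its protocol, followed by a Unix timestamp.
    func resourcePath(rawURL string, crawledAt time.Time) (string, error) {
        u, err := url.Parse(rawURL)
        if err != nil {
            return "", err
        }
        // Drop the protocol, keep host + path, append the timestamp.
        return filepath.Join(u.Host, u.Path, fmt.Sprintf("%d", crawledAt.Unix())), nil
    }

    func main() {
        p, _ := resourcePath("http://login.google.com/secure/createAccount.html", time.Unix(1570788418, 0))
        fmt.Println(p) // login.google.com/secure/createAccount.html/1570788418
    }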

But there is a drawback: by storing everything on disk, there will be significant latency when searching for text occurrences... That's why a search engine will be added to the project: Elasticsearch.

The flow of the Persister will change a bit:

[Figure: the new interaction between the Persister and the API]

The API

The API will be refactored to use Elasticsearch when a text search is performed, and to access the raw resources on disk when needed.

The GET /resources endpoint will be refactored to take two more query parameters:

  • url: the full URL of the resource (without the protocol)
  • date: the wanted crawl date for the resource

and will read the filesystem to get the desired content.
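For example, fetching a specific snapshot could look like this (hypothetical values, reusing the resource stored above):

GET /resources?url=login.google.com/secure/createAccount.html&date=1570788418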

If the endpoint receives the search query parameter, a text search will be performed on Elasticsearch to get matching content.
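As a rough idea, such a text search could look like the sketch below, using the official Go Elasticsearch client; the index name (resources) and field name (body) are assumptions, not the final design:

    package main

    import (
        "context"
        "fmt"
        "log"
        "strings"

        "github.com/elastic/go-elasticsearch/v7"
    )

    func main() {
        // Connect to Elasticsearch (default: http://localhost:9200).
        es, err := elasticsearch.NewDefaultClient()
        if err != nil {
            log.Fatal(err)
        }

        // Full-text query; index and field names are illustrative.
        query := `{"query": {"match": {"body": "bitcoin"}}}`
        res, err := es.Search(
            es.Search.WithContext(context.Background()),
            es.Search.WithIndex("resources"),
            es.Search.WithBody(strings.NewReader(query)),
        )
        if err != nil {
            log.Fatal(err)
        }
        defer res.Body.Close()
        fmt.Println(res.Status())
    }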

The API will search for the closest available resource in time for the specified URL. If no resource is found (or the existing resources are too far in time from the wanted date), the API will return nothing.
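Here is a minimal sketch of how the closest snapshot could be picked from the storage layout described above; the tolerance value and function names are hypothetical:

    package main

    import (
        "errors"
        "fmt"
        "io/ioutil"
        "path/filepath"
        "strconv"
        "time"
    )

    // closestSnapshot returns the path of the snapshot whose timestamp is
    // the closest to "wanted", or an error if none is within "tolerance".
    // dir is the directory holding the timestamped snapshots of one URL.
    func closestSnapshot(dir string, wanted time.Time, tolerance time.Duration) (string, error) {
        entries, err := ioutil.ReadDir(dir)
        if err != nil {
            return "", err
        }
        best, bestDiff := "", tolerance
        for _, e := range entries {
            ts, err := strconv.ParseInt(e.Name(), 10, 64)
            if err != nil {
                continue // not a timestamped snapshot
            }
            diff := wanted.Sub(time.Unix(ts, 0))
            if diff < 0 {
                diff = -diff
            }
            if diff <= bestDiff {
                best, bestDiff = e.Name(), diff
            }
        }
        if best == "" {
            return "", errors.New("no snapshot close enough to the wanted date")
        }
        return filepath.Join(dir, best), nil
    }

    func main() {
        path, err := closestSnapshot("login.google.com/secure/createAccount.html",
            time.Unix(1570788418, 0), 24*time.Hour)
        if err != nil {
            fmt.Println(err)
            return
        }
        fmt.Println("closest snapshot:", path)
    }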

The dashboard

To be honest, the dashboard has to be completely rewritten from scratch. The current implementation was more of a proof-of-concept than anything else, so that's not a big deal.

I want the new dashboard to be dynamic: I want users to build their own interface using pre-defined customizable components, to create alerts based on word mentions, to graph the evolution of words, websites, etc...

Each user will have their own custom dashboard tailored to their own needs.

Cron Job

Since there is a new time dimension, it would be great to create a cron-like system that triggers automatic refreshes of existing content, in order to have a day-to-day evolution of all resources.

The implementation of the cron is not clear yet: it could be a background task running on the scheduler, or even a new process that reads all content from the filesystem and puts a crawl request for each existing entry into the scheduler's doneQueue. This process could be run as a Kubernetes cron job. This needs more thought.
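As a rough sketch (nothing final), such a process could walk the snapshot tree and emit one crawl request per stored resource; publishCrawlRequest and the /data root are stand-ins for whatever queue (e.g. a publish on the scheduler's doneQueue) and storage location end up being used:

    package main

    import (
        "fmt"
        "log"
        "os"
        "path/filepath"
        "strconv"
    )

    // publishCrawlRequest stands in for pushing a URL onto the queue
    // the scheduler consumes.
    func publishCrawlRequest(url string) {
        fmt.Println("requesting crawl of", url)
    }

    func main() {
        root := "/data" // hypothetical root directory of all stored snapshots
        seen := map[string]bool{}

        err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
            if err != nil {
                return err
            }
            if info.IsDir() {
                return nil
            }
            // Snapshot files are named after their 64-bit Unix timestamp.
            if _, perr := strconv.ParseInt(info.Name(), 10, 64); perr != nil {
                return nil
            }
            // The file's directory, relative to the root, is the resource URL.
            rel, rerr := filepath.Rel(root, filepath.Dir(path))
            if rerr != nil || seen[rel] {
                return rerr
            }
            seen[rel] = true
            publishCrawlRequest("http://" + filepath.ToSlash(rel))
            return nil
        })
        if err != nil {
            log.Fatal(err)
        }
    }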

Conclusion

I think this feature will be really interesting to build and to use; however, it will take time, and I'll need help from you guys, especially for the UI/UX since I'm more of a backend developer (I suck at designing interfaces).

That's all for the moment. For those who are interested in contributing to this open source project, feel free to contact me at alois@micard.lu; it would be very cool to work with other devs.

You can view the backlog of the website-snapshot feature here

Happy hacking!

Aloïs Micard

You can contact me at alois@micard.lu. PGP fingerprint: F733 E871 0859 FCD2