Automatically run sorter every night #91
Comments
Looks like this will entail a bit of templatizing, since the db values are currently read from |
Also, it appears that the db connector logic in |
Note that if the sorter runs concurrently with the crawler, it will badly pollute the traces. So either the crawler needs to be stopped while the sorter runs, or we need to run the sorter on a dedicated server.
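If both scripts did end up on the same host, one way to keep them from overlapping is a shared lock file that each script takes before starting. The following is only a sketch under that assumption: `exclusive_run` and the lock path are hypothetical names, not anything in the existing code, and it relies on `fcntl.flock`, so it is Unix-only.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def exclusive_run(lock_path):
    """Hold an exclusive, non-blocking lock on lock_path for the block's duration.

    Raises BlockingIOError if another run already holds the lock.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        # LOCK_NB makes flock raise instead of waiting for the other holder.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        yield
    finally:
        # Closing the descriptor releases the lock if we held it.
        os.close(fd)
```

Both the sorter and the crawler entry points would wrap their main routine in `with exclusive_run(...)` using the same path, and whichever starts second would exit early on `BlockingIOError`. This does nothing for scripts on different machines, where staggered schedules or a dedicated server are still needed.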
Since we're not just running a single service, and we want to carefully control which script runs when, maybe we should simply use cron to manage the script runs and stagger the times accordingly. Running the crawler 24/7 doesn't seem to produce more valuable results than running it once or twice per day.
I think we should just use another VM for this, with a systemd timer (see https://wiki.archlinux.org/index.php/Systemd/Timers, https://www.freedesktop.org/software/systemd/man/systemd.timer.html, and https://coreos.com/os/docs/latest/scheduling-tasks-with-systemd-timers.html). We could save on VPS money by re-using our database VM as the sorter host. Note that the sorter only searches for strings in the HTML body text of pages and does not render anything.
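To illustrate what "searching for strings in the body text without rendering" can look like, here is a minimal sketch using only the standard library. It is not the sorter's actual code: `BodyText`, `find_onions`, and the onion-address pattern are assumptions made for the example, including the guess that the strings of interest are onion hostnames.

```python
import re
from html.parser import HTMLParser

# Assumed pattern: 16-character (v2) or 56-character (v3) base32 onion hostnames.
ONION_RE = re.compile(r"\b(?:[a-z2-7]{56}|[a-z2-7]{16})\.onion\b")


class BodyText(HTMLParser):
    """Collect text inside <body>, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Text is only tokenized and stored; nothing is executed or rendered.
        if self.in_body and not self.skip_depth:
            self.chunks.append(data)


def find_onions(html):
    """Return the sorted, de-duplicated onion hostnames found in the body text."""
    parser = BodyText()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    return sorted(set(ONION_RE.findall(text)))
```

Because the page is only tokenized and matched against a regular expression, no JavaScript runs and no subresources are fetched, which is the property the safety argument here rests on.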
@conorsch What do you think of re-using the database server for this purpose? The minimal processing of content means the sorter should be quite safe to run on the database server (and the crawlers, which have full access to the database, are already executing untrusted JavaScript).
The "sorter" tells us the state of onionspace based on public directories of onion services. We should ensure that it is executed every night (via cron) so that the crawlers have a lengthy list of onion services that are currently up to collect data from. Otherwise, as time marches on, we will be crawling an increasingly limited and biased set of onion services...