This repository has been archived by the owner on Sep 10, 2020. It is now read-only.

Automatically run sorter every night #91

Open
redshiftzero opened this issue Dec 1, 2016 · 6 comments

@redshiftzero
Contributor

redshiftzero commented Dec 1, 2016

The "sorter" tells us the state of onionspace based on public directories of onion services. We should make sure it runs every night (via cron) so that the crawlers have a long list of onion services that are currently up to collect data from. Otherwise, as time marches on, we will be crawling an increasingly limited and biased set of onion services.

@conorsch
Contributor

conorsch commented Dec 1, 2016

Looks like this will entail a bit of templating, since the db values are currently read from config.ini, but the values in that file are not written by Ansible.

@conorsch
Contributor

conorsch commented Dec 1, 2016

Also, it appears that the db connector logic in database.py defaults to the test database, which isn't correct for the crawlers running in prod.
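One way to avoid silently falling back to the test database is to make the caller name the environment explicitly. This is only a sketch, not the project's actual database.py: the function name `load_db_settings`, the section names `production` and `test`, and the keys `pghost` and `pgdatabase` are all hypothetical, since the real layout of config.ini isn't shown in this thread.

```python
import configparser


def load_db_settings(config_text, environment):
    """Return the DB settings for an explicitly named environment.

    `environment` has no default on purpose: a caller that forgets to
    choose gets a TypeError instead of quietly using the test database.
    Section and key names are hypothetical, for illustration only.
    """
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    if not parser.has_section(environment):
        raise KeyError("no [%s] section in config" % environment)
    return dict(parser[environment])


# Hypothetical config.ini contents, e.g. as rendered from an Ansible template.
EXAMPLE_CONFIG = """
[test]
pghost = localhost
pgdatabase = fpsd_test

[production]
pghost = db.internal
pgdatabase = fpsd
"""

if __name__ == "__main__":
    print(load_db_settings(EXAMPLE_CONFIG, "production"))
```

The same idea works if the environment name comes from an environment variable or a command-line flag, as long as there is no implicit default.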

@psivesely
Contributor

Note that if the sorter runs concurrently with the crawler, it will badly pollute the traces. So either the crawler needs to be stopped while the sorter runs, or we need to run the sorter on a dedicated server.
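If both scripts end up on the same host, the "crawler must be stopped" option could be enforced with an advisory lock that both the sorter and the crawler take before doing any network activity. A minimal sketch, assuming a Unix host; the helper name `exclusive_lock` and the lock path in the usage note are hypothetical and not part of the existing code.

```python
import contextlib
import fcntl
import os


@contextlib.contextmanager
def exclusive_lock(path):
    """Hold an exclusive, non-blocking advisory lock on `path`.

    Raises BlockingIOError if another holder (for example, a running
    crawler) already has the lock. Unix only, since it relies on flock.
    """
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        yield
    finally:
        # Closing the descriptor also releases the lock if it was taken.
        os.close(fd)
```

Usage would look like `with exclusive_lock("/var/lock/fpsd.lock"): run_sorter()`, where `run_sorter` stands in for whatever the sorter's entry point is. Whichever script starts second fails fast instead of polluting the traces. This does nothing for the dedicated-server option, where no lock is needed.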

@conorsch
Contributor

conorsch commented Dec 2, 2016

Since we're not just running a single service, and we want to carefully control which script runs when, maybe we should simply use cron to manage the script runs and stagger the times accordingly. Running the crawler 24/7 doesn't seem to produce more valuable results than running it once or twice per day.

@psivesely
Contributor

I think we should just use another VM for this, scheduled with a systemd.timer (see https://wiki.archlinux.org/index.php/Systemd/Timers, https://www.freedesktop.org/software/systemd/man/systemd.timer.html, & https://coreos.com/os/docs/latest/scheduling-tasks-with-systemd-timers.html). We could save on VPS money by re-using our database VM to run the sorter. Note that the sorter only searches for strings in the HTML body text of pages and does not render anything.
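To make "searches for strings in the HTML body text and does no rendering" concrete, here is a rough sketch of that kind of check using only the standard library parser. It is not the sorter's actual code; the class `BodyTextExtractor`, the function `body_contains`, and the example search strings are made up for illustration.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collect text inside <body>, skipping <script> and <style> contents.

    Nothing is rendered or executed; the markup is only tokenized.
    """

    def __init__(self):
        super().__init__()
        self._in_body = False
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self._in_body = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_body and not self._skip_depth:
            self.chunks.append(data)


def body_contains(html, needles):
    """Map each search string to whether it occurs in the body text."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    return {needle: needle.lower() in text for needle in needles}
```

For example, calling `body_contains(page_html, ["wiki", "forum"])` on a fetched page reports which of those strings appear in the visible body text, without ever interpreting the page's JavaScript.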

@psivesely
Contributor

@conorsch What do you think of re-using the database server for this purpose? The minimal processing of content means the sorter should be quite safe to run on the database server (and the crawlers, which have full access to the database, are already executing untrusted JavaScript anyway).

@psivesely psivesely added the ops label Jan 9, 2017

3 participants