This is a rewrite of the first Ruby scraper. It saves text files useful for searching instead of whole sites in export format, and it incrementally refetches pages that have changed based on dates in the sitemaps. github
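The incremental refetch can be sketched as a filter over sitemap entries. This is a minimal illustration, not the scraper's code: it assumes each sitemap entry carries a `slug` and a `date` in epoch milliseconds, and the function name `pages_to_refetch` is hypothetical.

```python
def pages_to_refetch(sitemap, last_scrape_ms):
    """Return slugs whose sitemap date is newer than the last scrape.

    Assumption: each entry is a dict with "slug" and an optional "date"
    in epoch milliseconds.
    """
    changed = []
    for entry in sitemap:
        date = entry.get("date")
        # Entries without a date are refetched to be safe.
        if date is None or date > last_scrape_ms:
            changed.append(entry["slug"])
    return changed


sitemap = [
    {"slug": "welcome-visitors", "date": 1400000000000},
    {"slug": "search-index", "date": 1500000000000},
    {"slug": "undated-page"},
]
print(pages_to_refetch(sitemap, last_scrape_ms=1450000000000))
# ['search-index', 'undated-page']
```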
The scrape runs every six hours on a schedule that shifts with daylight saving time. The scrape is built from scripts that manipulate files in directories. Some files are rolled up from similarly named files in subdirectories.
We have a modest search application. site
These files are updated when a page is discovered to have changed. A post-scrape process rolls up these page-level files to per-site files, and then rolls those up to three files describing the whole visible federation.
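The two-stage rollup might look something like the sketch below. The directory layout (`<root>/<site>/<slug>/words.txt`), the one-word-per-line file format, and the helper name `roll_up` are assumptions made for illustration; the real scripts may differ.

```python
from pathlib import Path
import tempfile


def roll_up(parent: Path, name: str) -> set:
    """Merge every <parent>/*/<name> into <parent>/<name>.

    Lines are deduplicated and written back sorted, one per line.
    """
    merged = set()
    for child in sorted(parent.glob(f"*/{name}")):
        merged.update(line for line in child.read_text().splitlines() if line)
    (parent / name).write_text("".join(f"{line}\n" for line in sorted(merged)))
    return merged


# Build a tiny hypothetical scrape tree: <root>/<site>/<slug>/words.txt
root = Path(tempfile.mkdtemp())
for site, slug, words in [
    ("a.example", "welcome-visitors", "wiki\nfederation\n"),
    ("a.example", "about", "wiki\nscrape\n"),
    ("b.example", "welcome-visitors", "search\n"),
]:
    page_dir = root / site / slug
    page_dir.mkdir(parents=True)
    (page_dir / "words.txt").write_text(words)

# Stage 1: page-level files to per-site files.
for site_dir in sorted(p for p in root.iterdir() if p.is_dir()):
    roll_up(site_dir, "words.txt")

# Stage 2: per-site files to one federation-level file.
federation_words = roll_up(root, "words.txt")
print(sorted(federation_words))
# ['federation', 'scrape', 'search', 'wiki']
```

The same helper serves both stages because each stage only reads files one directory level below where it writes.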
Search Index Downloads as flat files.
Sitemap Failures as sites disappear or what?
Our scrape can miss content that is edited out of view and then forked or rsync'd into view well after the scrape's moving time window has passed. A fix would be to remember page names and last seen edit in the scraper.
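A rough sketch of that fix: keep a map from page name to the last edit date we saw, and treat any mismatch as a change regardless of the time window. The entry shape (`slug`, `date` in epoch milliseconds) and the name `find_late_arrivals` are assumptions for illustration.

```python
def find_late_arrivals(seen, sitemap, window_start_ms):
    """Update `seen` from the sitemap and return slugs a window-only scrape would miss.

    `seen` maps slug -> last seen edit date. A page is a late arrival when
    its date differs from what we remember yet falls before the window.
    """
    late = []
    for entry in sitemap:
        slug, date = entry["slug"], entry["date"]
        if seen.get(slug) != date:
            if date < window_start_ms:
                late.append(slug)
            seen[slug] = date
    return late


seen = {"welcome-visitors": 1000}
sitemap = [
    {"slug": "welcome-visitors", "date": 1000},  # unchanged
    {"slug": "forked-page", "date": 1200},       # old edit, newly visible
    {"slug": "fresh-page", "date": 5000},        # inside the window anyway
]
print(find_late_arrivals(seen, sitemap, window_start_ms=3000))
# ['forked-page']
```

Running it a second time against the same sitemap returns nothing, since every page now matches what was remembered.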
Scrape dates from actions and roll them up to sites. Possibly render this as sparklines when reporting participating sites in a query.
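One way the sparkline idea could be prototyped: bucket the rolled-up action dates into fixed-width intervals and map each count to a block character. The bucket width, the helper names, and the assumption that action dates are epoch-style integers are all illustrative.

```python
BARS = "▁▂▃▄▅▆▇█"


def bucket_counts(dates, start, bucket_width, buckets):
    """Count dates falling into each of `buckets` fixed-width intervals."""
    counts = [0] * buckets
    for d in dates:
        i = (d - start) // bucket_width
        if 0 <= i < buckets:
            counts[i] += 1
    return counts


def sparkline(counts):
    """Render non-negative counts as a row of block characters."""
    if not counts:
        return ""
    top = max(counts)
    if top == 0:
        return BARS[0] * len(counts)
    return "".join(BARS[(c * (len(BARS) - 1)) // top] for c in counts)


counts = bucket_counts([0, 5, 12, 13, 14, 35], start=0, bucket_width=10, buckets=4)
print(counts, sparkline(counts))
```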
Scrape titles (slugs) from pages and roll them up to sites. This is similar to sitemaps but simpler and could be searched to find all twins in the federation.
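Finding twins from rolled-up slugs amounts to inverting the site-to-slugs mapping. A minimal sketch, assuming we already have the slugs for each site in memory; `find_twins` is a hypothetical name.

```python
from collections import defaultdict


def find_twins(site_slugs):
    """Return {slug: [sites]} for slugs that appear on more than one site."""
    sites_by_slug = defaultdict(set)
    for site, slugs in site_slugs.items():
        for slug in slugs:
            sites_by_slug[slug].add(site)
    return {
        slug: sorted(sites)
        for slug, sites in sites_by_slug.items()
        if len(sites) > 1
    }


site_slugs = {
    "a.example": ["welcome-visitors", "search-index"],
    "b.example": ["welcome-visitors", "field-notes"],
}
print(find_twins(site_slugs))
# {'welcome-visitors': ['a.example', 'b.example']}
```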
Roll up page names to federation level and compare with links.txt to see what is missing. Create pages.txt files in the rollups for sites we've successfully scraped. Add `search the federation` as an alternative for missing pages.
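The comparison is a set difference between linked slugs and known page slugs. This sketch assumes both links.txt and pages.txt hold one slug per line, which may not match the real file formats; `missing_pages` is a hypothetical helper.

```python
def missing_pages(links_text, pages_text):
    """Return slugs that are linked to but not present in the page rollup."""
    links = {line.strip() for line in links_text.splitlines() if line.strip()}
    pages = {line.strip() for line in pages_text.splitlines() if line.strip()}
    return sorted(links - pages)


links_text = "welcome-visitors\nsearch-index\nlost-page\n"
pages_text = "welcome-visitors\nsearch-index\nfield-notes\n"
print(missing_pages(links_text, pages_text))
# ['lost-page']
```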
Create a site like sites.fed.wiki.org from information we find in the scrape. Manufacture a synopsis from the who and what links on the welcome page. Would this site be a good neighbor? Would neighborhood search be useful for finding sites?
We seem to have polluted the index with sfw.c2.com words that should have been ignored because they are part of tags. We've fixed this in the scraper but how to clean the index?
Add CSS to make flags gray until their true colors arrive. Results can be very confusing otherwise.
Add title and synopsis words to words.txt.
See Sitemap Scrape Improvements we have completed.
# See also