February 27, 2009
The service I want to provide is taking daily or weekly snapshots of high-profile websites like www.whitehouse.gov and keeping an archive that will allow us to easily yell gotcha when they disappear a statement that is inconvenient to their current policies.
Now, I just need help understanding how web crawlers work and how I'd be able to pull that data down. The rest of the presentation won't be a problem.
Anybody want to help? Alice H., I'm looking in your direction.
Update: Okay, something is there. I'll be putting it together in my very limited spare time this weekend. Basically, I figure I'll use a web crawler to scrape the pages at various websites* of interest and then archive versions of the pages by date. That's all it will do initially, but eventually we could even have a tool that locates differences in websites over time. That would save having to explore them by hand.
* - I will need to look into the legality of doing this for private/corporate websites. I imagine anything political or governmental is pretty much fine under fair use, but I'm not a lawyer.
*ahem*. I'm not a lawyer... ...
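For anyone curious what the snapshot-and-compare plan from the update might look like, here is a rough sketch in Python using only the standard library. Everything in it is my own assumption rather than a finished design: the function names (snapshot_path, save_snapshot, fetch, diff_snapshots), the one-folder-per-host layout, and the choice to store only a single page per site. A real crawler would also follow links and check each site's robots.txt, which this does not do.

```python
import difflib
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def snapshot_path(archive_dir, url, day):
    """Return archive_dir/<host>/<YYYY-MM-DD>.html (hypothetical layout)."""
    host = urlparse(url).netloc
    return Path(archive_dir) / host / f"{day.isoformat()}.html"


def save_snapshot(archive_dir, url, html, day):
    """Write one page's HTML to its dated location and return the path."""
    path = snapshot_path(archive_dir, url, day)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return path


def fetch(url):
    """Download a single page as text. Needs network access."""
    with urllib.request.urlopen(url, timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def diff_snapshots(old_html, new_html):
    """Return unified-diff lines showing what changed between two snapshots."""
    return list(difflib.unified_diff(
        old_html.splitlines(), new_html.splitlines(),
        fromfile="old", tofile="new", lineterm=""))
```

A daily job could call save_snapshot("archive", url, fetch(url), date.today()) for each site of interest, then run diff_snapshots on yesterday's and today's files. Lines starting with "-" in the result are text that was there before and is now gone, which is exactly the part worth yelling about.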