February 27, 2009

Down the Memory Hole

So, I was reading Instapundit this morning, and this post motivated me to register the domain www.thememoryhole.us.

The service I want to provide takes daily or weekly snapshots of high-profile websites like www.whitehouse.gov and keeps an archive, so we can easily yell "gotcha" when they disappear a statement that is inconvenient for their current policies.

Now I just need help understanding how web crawlers work and how I'd be able to pull that data down. The rest of the presentation won't be a problem.
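For anyone wondering what "pulling that data down" might look like, here is a minimal sketch in Python using only the standard library. It is one possible approach, not a finished crawler: it grabs a single page rather than following links, it does not check robots.txt, and the function names, the archive/ directory layout, and the User-Agent string are all made up for illustration.

```python
import datetime
import pathlib
import urllib.parse
import urllib.request


def snapshot_path(archive_dir, url, day):
    """Build archive_dir/<host>/<YYYY-MM-DD>/<page name>.html for one page."""
    parts = urllib.parse.urlparse(url)
    # Flatten the URL path into a single file name; the site root becomes "index".
    name = parts.path.strip("/").replace("/", "_") or "index"
    return pathlib.Path(archive_dir) / parts.netloc / day.isoformat() / (name + ".html")


def fetch(url, timeout=30):
    """Download the raw bytes of one page (no link following, no robots.txt check)."""
    # The User-Agent value is a placeholder chosen for this sketch.
    request = urllib.request.Request(url, headers={"User-Agent": "memoryhole-sketch/0.1"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def save_snapshot(archive_dir, url, body, day=None):
    """Write the page bytes under the given date (default: today) and return the path."""
    day = day or datetime.date.today()
    path = snapshot_path(archive_dir, url, day)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(body)
    return path


def archive_once(url, archive_dir="archive"):
    """Fetch one page and store it under today's date."""
    return save_snapshot(archive_dir, url, fetch(url))
```

Running archive_once("http://www.whitehouse.gov/") once a day from a scheduler such as cron would build up one dated folder per day for that page. A real crawler would also need to follow links, throttle its requests, and handle pages that fail to load.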

Anybody want to help?  Alice H., I'm looking in your direction.

Update: Okay, something is there. I'll be putting it together in my very limited spare time this weekend. Basically, I figure I'll use a web crawler to scrape the pages at various websites* of interest and then archive versions of the pages by date. That'll be all it does initially, but eventually we could even have a tool that locates differences in websites over time. That would save us from having to explore them by hand.
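The "locates differences over time" part could be sketched with Python's standard difflib module, which compares two texts line by line. This is only an illustration under simple assumptions: it compares raw page text as-is, so markup-only changes would show up as differences too, and the function names are hypothetical.

```python
import difflib


def diff_snapshots(old_text, new_text, old_label="old", new_label="new"):
    """Return unified-diff lines describing what changed between two snapshots."""
    return list(difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile=old_label,
        tofile=new_label,
        lineterm="",
    ))


def removed_lines(old_text, new_text):
    """Return the lines present in the old snapshot that were dropped or rewritten."""
    old = old_text.splitlines()
    new = new_text.splitlines()
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    gone = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "delete" means the old lines vanished; "replace" means they were changed.
        if tag in ("delete", "replace"):
            gone.extend(old[i1:i2])
    return gone
```

Feeding removed_lines the text of two dated snapshots of the same page would list exactly the statements that have been disappeared between them, which is the "gotcha" material.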

 

* - I will need to look into the legality of doing this for private/corporate websites. I imagine anything political or governmental is pretty much fine under fair use, but I'm not a lawyer.

*ahem*. I'm not a lawyer... ...

Posted by: Moron Pundit at 09:51 AM