February 27, 2009

Down the Memory Hole

So, I was reading Instapundit this morning and this post motivated me to register the domain www.thememoryhole.us.

The service I want to provide is taking daily or weekly snapshots of high-profile websites like www.whitehouse.gov and keeping an archive that will allow us to easily yell gotcha when they disappear a statement that is inconvenient to their current policies.

Now, I just need help with understanding how web-crawlers work and how I'd be able to pull that data down.  The rest of the presentation won't be a problem.

Anybody want to help?  Alice H., I'm looking in your direction.

Update: Okay, something is there.  I'll be putting it together in my very limited spare time this weekend.  Basically, I figure I'll use a webcrawler to scrape the pages at various websites* of interest and then archive versions of the pages by date.  That'll be all it does initially but eventually we could even have a tool that locates differences in websites over time.  That would save having to hand-explore them. 

 

* - I will need to look into the legality of doing this for private/corporate websites.  I imagine anything political or governmental is pretty much fine under fair use laws but I'm not a lawyer. 

*ahem*.  I'm not a lawyer...   ...

Posted by: Moron Pundit at 09:51 AM | Comments (38)

1 Great idea. I have no skills in this direction but I do have a password to Moron Central. I put up a headline about this so maybe some of the great unwashed over there will be motivated to help.

Good luck.

Posted by: DrewM. at February 27, 2009 10:02 AM (hlYel)

2 What?! You want me to do something technical?!

That sounds like something hubby would know how to do.  Lemme ask him tonight.

Posted by: Alice H at February 27, 2009 10:06 AM (jRtPb)

3 I went to www.thememoryhole.us in Firefox, Opera and IE and it was always listed as a server that could not be found.  Maybe the memory hole is much bigger than previously imagined.

Posted by: snaggletoothie at February 27, 2009 10:06 AM (9m+gW)

4 I am a technical Luddite, but I'll be happy to send links to items you may want to memorize. I'll pass on the word to McGoo et al.

Posted by: cbullitt at February 27, 2009 10:07 AM (M/WbE)

5 Do you want to grab a screenshot of each page, or do you want to download all of the HTML and images for the site(s)?

If I can help, let me know. Unfortunately, I have a _lot_ of time on my hands...

Posted by: Dysfunctional at February 27, 2009 10:11 AM (FTiH0)

6

The Memory Hole is not currently active.  I just registered it today.

So...  umm...  Wait a few days.  I'm awesome at this but I'm not a miracle worker.

Posted by: Moron Pundit at February 27, 2009 10:20 AM (83gRI)

7

I wonder if I can get Slublog to do a logo for the page.  I was thinking something with goatse.  Kidding.

But there are two "O's" in the memory hole that could be useful for decorations.

Posted by: Moron Pundit at February 27, 2009 10:22 AM (83gRI)

8

Dysfunctional : I intend to, as exactly as is possible, reproduce the website in question.  This would be scraping the markup and links down, creating files and building a version of the page. 

Screenshots would be easier to store/maintain but I'm not sure exactly how I'd do something like that.  Then all we'd have to do is crawl through the site one page at a time and create the images.  Either one meets the product requirements.

Posted by: Moron Pundit at February 27, 2009 10:51 AM (83gRI)

9 I have some answers for you, I'll email you tonight.

Posted by: Alice H at February 27, 2009 10:52 AM (jRtPb)

10 *ahem*.  I'm not a lawyer...   ...

HEY!  We're not all horrible evil black souled scum on the bottom of the shoe of humanity.  Oh.  Wait.  Hmmm.  Maybe I need to reconsider my position. 

Posted by: alexthechick at February 27, 2009 11:01 AM (SHHaV)

11 Moron Pundit, please email me when you get a chance.

Posted by: Gabriel Malor at February 27, 2009 11:02 AM (XQywO)

12 Well, you would need to pull down the HTML pages, any CSS style sheets, any Javascript, the images, and crawl the links to the next pages on the site.

And stay cognizant of links off-site, so you don't start scraping the known universe.
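As a rough sketch of that on-site rule, assuming Python and only its standard library (nobody in the thread picked a language for this step): collect every href and src on a page, resolve each against the page's URL, and keep only the ones on the same host. The names LinkCollector and on_site_links, and the sample paths, are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects href/src values from every tag on a page."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # href covers <a> and <link>; src covers <img> and <script>.
            if name in ("href", "src") and value:
                self.found.append(value)


def on_site_links(page_url, html):
    """Return absolute URLs found in html that share page_url's host."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    result = []
    for raw in collector.found:
        absolute = urljoin(page_url, raw)
        parts = urlparse(absolute)
        same_site = parts.scheme in ("http", "https") and parts.netloc == host
        if same_site and absolute not in result:
            result.append(absolute)
    return result


# Hypothetical markup; the paths are invented for the example.
sample = """
<a href="/briefing-room/">Briefings</a>
<img src="images/seal.png">
<a href="http://example.com/elsewhere">Off-site</a>
"""
links = on_site_links("http://www.whitehouse.gov/index.html", sample)
print(links)
```

The off-site link is dropped, so a crawl queue fed from on_site_links never leaves the starting host.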

I could play around with visual basic a bit and see what I could do for you. It would be a local app that would dump to a specified directory. You would have to upload all of the data to a server of your choice.

If you want something automatic that would pull from their server to yours - well, I just started teaching myself php, so I wouldn't be able to help there. Moron for a student, fool for a teacher...

The whitehouse DOT gov website, correct?

Posted by: Dysfunctional at February 27, 2009 11:54 AM (FTiH0)

13

That would be the first place I'd want to watch.  Probably recovery.gov and so on.  I found the way to pull the exact text of the URI down.  I suppose it wouldn't be too hard to request the JS listed in those URIs and reproduce them locally.

If their page contents are dynamic it gets A LOT hairier but most government sites are informational and therefore not OVERLY dynamic.  At least, the URLs are "friendly" and can be scraped.

Posted by: Moron Pundit at February 27, 2009 12:09 PM (83gRI)

14 I've done a bit of this professionally.  The real risk is that sites may block http requests from your crawler if they figure out it's a robot and it's violating the terms set out in their robots.txt files (which you should look for and read).  There are ways to circumvent detection, but most of it comes down to making the http requests from the crawler "look like" a person browsing.
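For the "look for and read robots.txt" part, a minimal sketch using Python's standard urllib.robotparser. The robots.txt text, the example.com URLs and the crawler name below are all invented for the example; a real run would load the file from the site being crawled.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, standing in for a file fetched from a real site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

AGENT = "memoryhole-archiver"  # hypothetical crawler name

# Ask before fetching: True means the rules permit this agent to request the URL.
allowed = parser.can_fetch(AGENT, "http://www.example.com/news/index.html")
blocked = parser.can_fetch(AGENT, "http://www.example.com/private/notes.html")
print(allowed, blocked)
```

Checking can_fetch before every request is the cheap way to stay inside the terms the site has published.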

I'd be happy to help if you need it.

Posted by: leoncaruthers at February 27, 2009 12:12 PM (PH0UW)

15 Did someone say logo?  To photoshop!  Away!

Posted by: TheUnrepentantGeek at February 27, 2009 12:28 PM (0U0+T)

16 I don't know too much specifically about web crawlers myself, but I am a computer scientist.  So I'd be willing to help you in whatever way I can. 

Also, I know some people who work at the Internet Archive (archive.org), and they obviously know a thing or two about web crawlers.  They use Alexa technology for that.  I could perhaps ask some of them for advice if you can't find anybody who knows what they are talking about.

I wouldn't go about trying to re-invent the wheel as far as creating a web crawler.  I'm sure there are plenty of open source tools out there that would be a good start. 

I put my email in the box here, feel free to ask me any questions and I'll let you know if I'm able to help find answers.  (I'm assuming you can see the email I entered in the box above).

Posted by: dan-O at February 27, 2009 12:41 PM (teb/C)

17 I have to say, this is a pretty damn smart bunch of Morons.  

Posted by: alexthechick at February 27, 2009 01:35 PM (SHHaV)

18 Impressive bunch, indeed. So why do y'all let me play?

Posted by: conservativebelle at February 27, 2009 01:47 PM (JXpnx)

19 dan-O is right - there are a lot of website scrapers / archivers out there already. I've seen a few in my travels on the web, but I don't have any experience with them.

Anyone have any recommendations?

Sorry to run off on a 'let's build it from scratch' bent - when you are a hammer, everything looks like a nail.

Good luck!

Posted by: Dysfunctional at February 27, 2009 02:11 PM (FTiH0)

20 There's software to do it.  No need to build it from scratch.

Posted by: Alice H at February 27, 2009 02:27 PM (jRtPb)

21 Excellent idea. This campaign (and it continues to be a campaign) has abused the Internet from the word "Go." It obviously doesn't matter to most people, but it needs to be documented anyway.

Posted by: Jim Treacher at February 27, 2009 03:33 PM (cvmgB)

22

Well, I need a piece of software that can run from the web on a cron and is preferably open source AND in a language I know AND like.  I'll be damned if I'm spending my free time working in Java.  Just sayin.

So, preferably a PHP based solution.  For now we will be limiting the archiving to government websites for legal reasons. 

Posted by: Moron Pundit at February 27, 2009 03:38 PM (83gRI)

23 MP, Java is my primary language.  If you find a Java-only tool that you want to use, ping me and I'll bring it to heel.

Posted by: leoncaruthers at February 27, 2009 04:22 PM (PH0UW)

24

Well, I was tipped to Heritrix by archive.org and it looks like it does the bulk of what we want it to do but I don't know how I'd run it on my server considering my hosting provider doesn't allow root access on the shared server.

I don't have enough pure server admin experience to know the answer to that conundrum.  I'm a pure programmer and mostly do php/.net/os/400 work. 

If the solution is running Java from cron and then accessing the documents created by that Java app from php/.net/html then that's a winner as far as I'm concerned.  I just don't know if the architecture allows it.

Posted by: Moron Pundit at February 27, 2009 04:53 PM (83gRI)

25 The web crawler does not necessarily need to be on the same computer as the website.  So for example you could have a home computer crawling the .gov sites and uploading stuff to your hosting-service server.

Posted by: dan-O at February 27, 2009 05:54 PM (teb/C)

26 Yes, having it run from another machine is definitely an alternative but not one I would consider elegant.  If there isn't another way to do it, no big deal but I'd like to see if it could be automated all from the same origin point.  Particularly when I can't be sure which machine will be running this crawl on any given day.  I need to retask machines on a regular basis around here.

Posted by: Moron Pundit at February 27, 2009 06:07 PM (2uped)

27 Moron Pundit,

I'd suggest using some advanced features of wget. (Also, make sure the client header says something other than "wget.")

Most servers will have this utility, even shared ones, so you won't need root access to install it. You might need a limited shell to run it if your account is sandboxed for security reasons.

If your webhost doesn't allow the access to or use of wget, I can suggest an economical one that will... even on a shared server.

The open source wget is free to use and quite versatile. With a limited shell, you could readily log in to your remote account and run wget via CLI and pull down, i.e. GET, a page with all formatting components intact. Since it's government sites you're tracking, there are no copyright infringement issues that you'd need to be concerned with. Private sites, however, are a completely different issue and I'd discourage its use with prejudice if you decide to expand your scope.

I suppose you could integrate wget with PHP, but you really wouldn't need to unless you wanted to make an interface that anyone could use to add to your site's archived content. Bad idea if visitors can load up archives (via your script) to your server without tight controls. The easiest way would be to run the single-line CLI command after you or someone with a tip finds a page worth tracking. I imagine you could set up a cron job, too, to archive a specific page at regular intervals. From there just create links to the archived pages from your blog.
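One way that single command might look, sketched as a small Python helper that only builds the wget argument list and prints it. The wget options used are standard ones (--page-requisites, --convert-links, --no-parent, --user-agent, --directory-prefix); the archive path, the agent string and the function name build_wget_command are assumptions made up for the example.

```python
import datetime
import shlex


def build_wget_command(url, archive_root, day, user_agent):
    """Build a wget argument list that saves url under archive_root/YYYY-MM-DD."""
    target_dir = f"{archive_root}/{day.isoformat()}"
    return [
        "wget",
        "--page-requisites",  # also fetch the images, CSS and scripts the page needs
        "--convert-links",    # rewrite links so the saved copy browses locally
        "--no-parent",        # never climb above the starting directory
        f"--user-agent={user_agent}",
        f"--directory-prefix={target_dir}",
        url,
    ]


command = build_wget_command(
    "http://www.whitehouse.gov/",
    "/home/archive/snapshots",                        # hypothetical path
    datetime.date(2009, 2, 27),
    "Mozilla/5.0 (compatible; memoryhole-archiver)",  # hypothetical agent string
)
# Print a shell-safe version of the command for pasting into a terminal.
print(shlex.join(command))
```

Passing the same list to subprocess.run(command, check=True) would actually run it, and a daily cron job calling the script with datetime.date.today() would give one dated folder per day.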

Posted by: AnonymousDrivel at February 27, 2009 06:20 PM (swuwV)

28

You're probably getting plenty of advice via email already, but I'll throw in my 2 cents:

Screenshots would be easier to store/maintain but I'm not sure exactly how I'd do something like that.

NO. Do not use screenshots. You want this to be easily searchable, and that means the code. And technically that would be easier anyway.

But most importantly, for a project like this, you don't want to just download the data every day. What you're looking for is the difference day-to-day of the website. There are a lot of tools out there that can do that sort of thing*, that's what you want. Trying to look through screenshots with just the human eye would be totally impractical.

I remember reading about how McCain's campaign already implemented something like this, and caught Obama changing something on his website, maybe something about a timetable for Iraq a couple of months before Election day? Maybe you should track someone from the campaign down, see who designed his website or was on his online team.

*Computer programmers know this sort of system as "source control". Git is an example of that type of program. You can easily compare the changes in code, keep track of when they were made, and so on.

Posted by: dorkfork at February 27, 2009 07:05 PM (LIM+2)

29 I'd also be a little surprised if nobody did that sort of thing for the WH website while Bush was in office. If they did and were open about how they did it...

Posted by: dorkfork at February 27, 2009 07:08 PM (LIM+2)

30 I agree with dorkfork about the screenshots.  Text will be smaller for storage and actually searchable.  If you ever wanted a screenshot, you could make one from an html page, but you can't go from a screenshot to an html file.

And I think you are right about only caring about the changes made to the website.  This would mainly be for storage purposes.  Also it would be nice to know when changes are made to the site.  But I don't think that a source control system would quite work.  Assuming that Git works like CVS, I don't think that is exactly the correct tool for the job.  Doing complex searches through a repository like that is really tough.  A lot of the web pages will be dynamically generated with different content in different frames so that could mess up the tagging as well.  The only way I can think of to keep track of changes would be to do a simple "diff". 
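A minimal sketch of that simple "diff", assuming the snapshots are already saved as text and using Python's standard difflib. The two snapshot strings and the function name page_diff are invented for the example.

```python
import difflib


def page_diff(old_text, new_text, old_label, new_label):
    """Return a unified diff between two saved versions of a page."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=old_label,
            tofile=new_label,
            lineterm="",
        )
    )


# Invented snapshot text, purely for illustration.
monday = "The plan takes effect in March.\nMore details soon."
friday = "The plan takes effect later this year.\nMore details soon."

changes = page_diff(monday, friday, "2009-02-23", "2009-02-27")
for line in changes:
    print(line)
```

Lines starting with "-" were in the older snapshot only and lines starting with "+" are new. An empty result means nothing changed, so that day's copy need not be stored separately.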

So @ Moron Pundit: how serious are you about setting this up?  How much time do you want to spend on it?  What resources do you have?  What are you technically capable/not capable of doing?  Seems like you know quite a bit about this stuff.  Keep us updated on whatever help you need.

Posted by: dan-O at February 27, 2009 07:39 PM (teb/C)

31 I will be contacting the people that commented to this thread shortly and setting up a plan.  I've decided the front page will be custom and developed in PHP/Zend Framework with Dojo for the AJAX and MySQL for the database.  Standard LAMP system.

I figure I'll do what we can call full reproductions of certain important government websites and then smaller deltas of other websites that are searchable by date and keyword. 

Technically, I'm a little fuzzy on how we get the scrapes pulled down but I'm a dynamo at database manipulation and UI/Web design.  I'd need help with the search algorithms and the web scraping. 

I'll be in touch soon.  There should be something rudimentary up on the site by tonight.  I'm pretty quick with php and dojo these days.

Posted by: Moron Pundit at February 27, 2009 08:26 PM (2uped)

32 Crawlers? CSS? HTML? JavaScript? AJAX? Does all that mean pr0n? If so, I there.

Posted by: FishFearMe at February 27, 2009 08:38 PM (DL7CT)

33 *I'm

Posted by: FishFearMe at February 27, 2009 08:38 PM (DL7CT)

34 Crawlers? CSS? HTML? JavaScript? AJAX? Does all that mean pr0n?

I'll help find the pr0n!  Hey, we all have our contributions to make. 

Posted by: alexthechick at February 27, 2009 09:01 PM (CW7CI)

Posted by: me at February 27, 2009 10:50 PM (MKY+0)

36

I heard the weekly prez address this morning. One statement claimed that skyrocketing health costs were bankrupting "one American every thirty seconds".

Maybe there's a transcript?

Posted by: rob at February 28, 2009 09:14 AM (oUZea)

37 I've already been doing this with a crawler I created. My crawler reads the page text and saves it to a database, then it makes a large screenshot image of the entire page and saves that as well.
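The database and schema aren't named here, so the following is only a guess at the general shape: page text stored per URL and date, then searched by keyword. It uses Python's built-in sqlite3 as a stand-in, and the table, column and function names are all hypothetical.

```python
import sqlite3


def open_archive(path=":memory:"):
    """Open (or create) the snapshot store; the default lives only in memory."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS snapshots ("
        " url TEXT NOT NULL,"
        " taken_on TEXT NOT NULL,"
        " body TEXT NOT NULL,"
        " PRIMARY KEY (url, taken_on))"
    )
    return conn


def save_snapshot(conn, url, taken_on, body):
    """Store one page's text for one day, replacing an earlier save that day."""
    conn.execute(
        "INSERT OR REPLACE INTO snapshots (url, taken_on, body) VALUES (?, ?, ?)",
        (url, taken_on, body),
    )
    conn.commit()


def search(conn, keyword):
    """Return (url, date) pairs whose saved text contains keyword, oldest first."""
    rows = conn.execute(
        "SELECT url, taken_on FROM snapshots WHERE body LIKE ? ORDER BY taken_on",
        (f"%{keyword}%",),
    )
    return rows.fetchall()


# Invented page text for the example.
archive = open_archive()
save_snapshot(archive, "http://www.whitehouse.gov/", "2009-02-27", "Budget overview")
save_snapshot(archive, "http://www.whitehouse.gov/", "2009-02-28", "Agenda overview")
hits = search(archive, "budget")
print(hits)
```

Keying on (url, date) keeps one row per page per day, which is enough to answer "what did this page say on that date" and to feed a keyword search on the front end.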

Contact me at the email address I provided with this post. I would be happy to send you the crawler and database, and help you get everything rolling. I'm a software developer and a web developer so creating a website to show the crawler's data will be easy.

Posted by: CR at March 01, 2009 05:44 PM (Ow4Af)

