Internet Archive Opens Crawler Code Under LGPL

Posted by Cliff on Wednesday January 07, 2004 @11:40AM from the preserving-our-digital-culture dept.

ramakant writes: "It looks like the Internet Archive, which hosts the infamous Wayback Machine, has opened its newest in-development crawler code under the LGPL. From the announcement: 'Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. Heritrix (sometimes spelled heretrix, or misspelled or missaid as heratrix / heritix / heretix / heratix) is an archaic word for inheritess. Since our crawler seeks to collect the digital artifacts of our culture for the benefit of future researchers and generations, this name seemed apt.'"

Mr peabody! (Score:5, Funny)

by Anonymous Coward writes: on Wednesday January 07, 2004 @11:40AM (#7902902)

They've open sourced your wayback machine! Now you've lost the monopoly!

- Re:Mr peabody! (Score:1)
  
  by dukeluke ( 712001 ) * writes:
  
  The Way Back machine - one way that we as a species of tech savvy gurus can travel back in time...now, if only they could figure out how to reverse the technology and travel forward ;-)
  
  no sig needed to make this message unique
  - Re:Mr peabody! (Score:2, Funny)
    
    by Anonymous Coward writes:
    
    I don't know about you but I have no problem traveling forward in time. It is getting back that is the real trick.
- Re:Mr peabody! (Score:1)
  
  by ackthpt ( 218170 ) * writes:
  
  They've open sourced your wayback machine! Now you've lost the monopoly!
  Mr. Peabody never makes a mistake. Didn't you learn anything, Sherman? It was the right thing to do.
  Trivia: I bought the season 1 DVD of Rocky and Bullwinkle and saw the original spelling was 'WAYBAC'
- - Re:[OT] Gnome 2 question (Score:1)
    
    by fatwreckfan ( 322865 ) writes:
    
    http://gnomesupport.org/ [gnomesupport.org]
gpl vs. lgpl? (Score:3, Interesting)

by Anonymous Coward writes: on Wednesday January 07, 2004 @11:41AM (#7902906)

could someone summarize the differences?

fp?

- Re:gpl vs. lgpl? (Score:2, Insightful)
  
  by Anonymous Coward writes:
  
  this ain't OT. The guy asked what the difference was between the GPL and LGPL. LGPL being the license the wayback code is being placed under, the opening of the code being the topic of discussion. Therefore, the post couldn't be any more on-topic.
  
  For chrissakes moderators! It says that the code is LGPL in the freakin' article HEADLINE!! We already have enough trouble with people not RTFA, an occasional someone who didn't read the submitter's post, and now we have moderators not RTFH to deal with too!!
- Re:gpl vs. lgpl? (Score:2, Funny)
  
  by Anonymous Coward writes:
  
  One is communist, the other is socialist.
- Re:gpl vs. lgpl? (answered) (Score:3, Informative)
  
  by DonGar ( 204570 ) writes:
  
  I'm quite certain that people will correct me (at length) if I'm wrong, but here goes.
  
  The GPL says that you can use source and code any way that you want, but if you release modified versions, you must release the modified source under GPL.
  
  The LGPL is intended for libraries that are released under the GPL. It says that commercial and other non-GPL projects can use this library without becoming GPL, but that changes to the library itself must be released under the LGPL.
  
  LGPL is generally considered a lighte
- Re:gpl vs. lgpl? (Score:2)
  
  by TheSpoom ( 715771 ) * writes:
  
  From the GNU LGPL Preamble [gnu.org]:
  Most GNU software, including some libraries, is covered by the ordinary GNU General Public License. This license, the GNU Lesser General Public License, applies to certain designated libraries, and is quite different from the ordinary General Public License. We use this license for certain libraries in order to permit linking those libraries into non-free programs.
  
  When a program is linked with a library, whether statically or using a shared library, the combination of the two
Cultural artifacts? (Score:2, Funny)

by SexyKellyOsbourne ( 606860 ) writes:

You mean works of art like this?

B1FF#S K3WL H0M3 PAG3!!! [panix.com]
- Re:Cultural artifacts? (Score:4, Funny)
  
  by Lev13than ( 581686 ) writes: on Wednesday January 07, 2004 @11:50AM (#7903004) Homepage
  
  What I want to know is, how do they keep it from crashing when it reaches here [shibumi.org]?
  
- Re:Cultural artifacts? (Score:1, Funny)
  
  by JPelorat ( 5320 ) writes:
  
  Holy buckets. More like a cultural fartifact.
- Biff (Score:2)
  
  by rs79 ( 71822 ) writes:
  
  I know BIFF. BIFF is my friend. SexyKellyOsbourne you are no BIFF.
  
  (BIFF never used numbers)
- Re:Cultural artifacts? (Score:2)
  
  by bgarcia ( 33222 ) writes:
  
  Take a look at the source HTML for that page. It's actually very organized & easy to read.
  It's a shame that the resulting page hurts my eyes so much!
In case of /.ing... (Score:4, Informative)

by Dave2 Wickham ( 600202 ) * writes: on Wednesday January 07, 2004 @11:41AM (#7902915) Journal

The source download is available on sourceforge [sourceforge.net].

I doubt it'll get slashdotted, but you never know...

- Re:In case of /.ing... (Score:4, Funny)
  
  by Anonymous Coward writes: on Wednesday January 07, 2004 @12:15PM (#7903210)
  
  Don't you mean: I doubt it'll get slashdotted, but I needed the Karma.
  
  - Re:In case of /.ing... (Score:2)
    
    by Dave2 Wickham ( 600202 ) * writes:
    
    Not really, I already have excellent karma, and even if I didn't, who cares about it?
- Re:In case of /.ing... (Score:1)
  
  by Dave2 Wickham ( 600202 ) * writes:
  
  Ah, typical, it's sourceforge which has decided to slow down, not crawler.archive.org.
  
  *sigh*
Then maybe (Score:4, Insightful)

by caston ( 711568 ) writes: on Wednesday January 07, 2004 @11:42AM (#7902921)

OSDN can decide to open source source forge...

- SourceForge *IS* open source (Score:2)
  
  by TheSpoom ( 715771 ) * writes:
  
  As said above, OSDN *HAS* open sourced SourceForge. You can obtain it at the Alexandria Development Project on SourceForge [sourceforge.net]. Please try to do some research prior to saying things like this. That said, it is true that like many open source projects, SourceForge can only be used for open source software development. For commercial, closed source development using the SourceForge system, try SourceForge Enterprise Edition [vasoftware.com] from VA Software [vasoftware.com], the original developers of SourceForge.
Oldest /. emtry (Score:5, Interesting)

by Anonymous Coward writes: on Wednesday January 07, 2004 @11:44AM (#7902945)

Look, ma - no trolls!! But anti-MS comments in da hizzouse!! [archive.org]

I much prefer the current /.

- Re:Oldest /. emtry (Score:1)
  
  by CmdrTostado ( 653672 ) writes:
  
  The oldest /. entry has a link to older [archive.org] articles? That's too weird, even for me.
- Re:Oldest /. emtry (Score:1)
  
  by eraserewind ( 446891 ) writes:
  
  Wow, slashdot used to look much nicer than its current ugly bloated mess.
- Re:Oldest /. emtry (Score:1, Funny)
  
  by Anonymous Coward writes:
  
  But anti-MS comments in da hizzouse!!
  Yea, Slashdot was great before the Microsoft fanboys showed up. Those were the days.
- Even better! (Score:4, Funny)
  
  by Inoshiro ( 71693 ) writes: on Wednesday January 07, 2004 @01:19PM (#7903836) Homepage
  
  " Ooopsies...
  Tim
  Sat Dec 20 at 6:37PM EST
  
  Guess I should read the article before I post. I was under the impression that the next release of IE4 *would* support HTML 4.0...Oh well."
  
  Guess I should read the article before I post? What a crazy, upside-down world it was back then!
  
score (Score:5, Funny)

by TedCheshireAcad ( 311748 ) writes: <ted@fUMLAUTc.rit.edu minus punct> on Wednesday January 07, 2004 @11:45AM (#7902954) Homepage

Score! Now I can run my own wayback machine!

I only have a 30G hard drive though, what do you guys think, bzip should take care of it?

- Re:score (Score:5, Funny)
  
  by bamf ( 212 ) writes: on Wednesday January 07, 2004 @11:49AM (#7902986)
  
  If you limit yourself to only archiving the useful parts of the interweb, you should be able to fit it all on a floppy disk or two.
  
- Re:score (Score:2)
  
  by mahdi13 ( 660205 ) writes:
  
  That's a great idea!
  I'm sure you can find 4-5 terabytes of drive space lying around somewhere!
  I have about 60GB I can donate! =P
  - Re:score (Score:2)
    
    by netsharc ( 195805 ) writes:
    
    I'll take your offer of donation. :)
- Re:score (Score:1)
  
  by Elendil ( 11919 ) writes:
  
  You can't. SCO now claims ownership of every line of GPL code. Barely stretching it, the Internet Archive (and thus Internet itself) can be seen as SCO's IP as "derivative work". You'll send a $699 check to the order of D. McBride, Salt Lake City UT 84101 every time you connect to your ISP. Ka-ching!
- Re:score (Score:5, Interesting)
  
  by corebreech ( 469871 ) writes: on Wednesday January 07, 2004 @01:14PM (#7903786) Journal
  
  I'll use it if you promise not to delete shit that doesn't hew to your ideology.
  
  That's what really sucks about the Wayback Machine.
  
  Ever try reading articles from the aftermath of 9/11? It's a great big hole, so much stuff has been deleted.
  
The code is pretty clean, too... (Score:5, Informative)

by tcopeland ( 32225 ) * writes: <tom AT thomasleecopeland DOT com> on Wednesday January 07, 2004 @11:47AM (#7902968) Homepage

...some unused variables [infoether.com] and such-like in there, though, as reported by PMD [sf.net].

That sounds like a good working app. (Score:5, Funny)

by DeKoNiNG ( 597077 ) writes: <p.d.de.koning@fr[ ]er.nl ['eel' in gap]> on Wednesday January 07, 2004 @11:49AM (#7902990) Homepage

From their FAQ: if you are comfortable grabbing code directly from CVS, wrestling with incomplete documentation, and running into undocumented limitations, would you want to use the current software.
Undocumented limitations? That sounds like a lot of fun!

old torrents (Score:3, Funny)

by kyoko21 ( 198413 ) writes: on Wednesday January 07, 2004 @11:50AM (#7902997)

Nothing like crawling for old, recycled, and dead torrents.

This is great news (Score:2, Informative)

by CompWerks ( 684874 ) writes:

Open source that handles over 300tb of data!
- Stop giving open source movement undeserved credit (Score:3, Insightful)
  
  by jbn-o ( 555068 ) writes:
  
  Open source that handles over 300tb of data!
  
  Please don't be like Mark Webbink, Red Hat's general counsel [slashdot.org], and give the open source movement undeserved credit. Adding a license to a list of approved licenses is trivial compared to writing the license and creating a community. The Lesser General Public License (formerly the Library General Public License) was written by the Free Software Foundation well before the open source movement was formed. The LGPL was written as a compromise in order to spre [gnu.org]
Gordon Mohr (Score:4, Informative)

by Orasis ( 23315 ) writes: on Wednesday January 07, 2004 @11:50AM (#7903007)

Congrats Gojomo!

This project was written by the brains behind bitzi [bitzi.com] and some really cool P2P [open-content.net] stuff [yahoo.com].

He's one of those guys who's going to be working on important stuff for years to come.

What about... (Score:4, Insightful)

by herrvinny ( 698679 ) writes: on Wednesday January 07, 2004 @11:51AM (#7903012)

Heritrix (sometimes spelled heretrix, or misspelled or missaid as heratrix / heritix / heretix / heratix) is an archaic word for inheritess.

I know some grammar nazi is going to see this, so I might as well get it first. What about heretic: [m-w.com] one who dissents from an accepted belief or doctrine.

- Re:What about... (Score:1)
  
  by FrankoBoy ( 677614 ) writes:
  
  I've heard it will be available on Unique-based systems soon, stay tuned.
Fortune cookie (Score:2)

by __aahlyu4518 ( 74832 ) writes:

Beneath this article I noticed this fortune cookie:

"Insanity is hereditary. You get it from your kids. "
Maaaaamories... (Score:5, Funny)

by Dorf on Perl ( 738169 ) writes: on Wednesday January 07, 2004 @11:54AM (#7903036)

This is a great step forward, I welcome our archiving overlords, etc. Right now when I want to share some of my history (the good stuff, natch) with my kids, I have to dig out an old, musty shoebox full of junk. When they want to share theirs with their kids, they'll just beam a URL into my grandkids' in-skull HUDs. While in their flying cars. "Oh look, here's another stupid post to Slashdot by Grandpa..."

- - Re:Maaaaamories... (Score:2)
    
    by TheSpoom ( 715771 ) * writes:
    
    With the amount of pr0n out there, I think he's hit it head on ;^)
Infamous? (Score:4, Interesting)

by BitchAss ( 146906 ) writes: on Wednesday January 07, 2004 @11:59AM (#7903081) Homepage

the infamous Wayback Machine

Why is it infamous? I haven't heard anything bad about it.

- Re:Infamous? (Score:4, Funny)
  
  by hey ( 83763 ) writes: on Wednesday January 07, 2004 @12:06PM (#7903141) Journal
  
  Just wait 20 years when you are trying to get a CEO job and somebody produces your embarrassing old weblog.
  
  - Re:Infamous? (Score:2)
    
    by BitchAss ( 146906 ) writes:
    
    So, don't use your real name :)
  - Re:Infamous? (Score:1)
    
    by powlow ( 197142 ) writes:
    
    ha haha :)
    
    just checked it out and it is kind of scary that it's all there... my old site versions...
    
    new slogan :
    
    way back machine : your permanent record, online, all day, everyday!
  - Re:Infamous? (Score:2)
    
    by acceleriter ( 231439 ) writes:
    
    Then you just DMCA them, like the (few) savvy companies that had embarrassing information in archive.org. They'll take it down.
    - Re:Infamous? (Score:2)
      
      by MushMouth ( 5650 ) writes:
      
      It's easier than that, you just ask them nicely and they take it down.
  - Re:Infamous? (Score:3, Funny)
    
    by marnanel ( 98063 ) writes:
    
    Beware the Ghost of Usenet^H^H^H^H^HBlog Postings Past! [ibiblio.org]</gratuitous>
- Re:Infamous? (Score:3, Funny)
  
  by Lester67 ( 218549 ) writes:
  
  The batting cage that I frequent with the kids hates the fact their web-coupon (with no expiration date) is still stored in the Wayback.
  
  I think they might agree with "infamous". :-)
  - Re:Infamous? (Score:2)
    
    by kevcol ( 3467 ) writes:
    
    Shouldn't they be happy that it is still driving business to them or does the coupon offer totally free service?
    - - Re:Infamous? (Score:2)
        
        by kevcol ( 3467 ) writes:
        
        That's too funny- "Hi- I'm an archive.org customer- I'd like my usual, please! And easy on the scowl, if you don't mind." I'm going to have to scour for other old coupons just to be pain in the ass. :-)
        
        Is the '5' five minutes in the cage? Not that it matters- I just haven't gone to a batting cage in more years than I care to admit and I was just curious what they get.
- Re:Infamous? (Score:1)
  
  by glaHHg ( 468427 ) writes:
  
  Infamous is when you're more than famous. This wayback machine is not just famous, it's INfamous.
  - Re:Infamous? (Score:1)
    
    by BitchAss ( 146906 ) writes:
    
    Not so much...here's some dictionary.com action:
    
    - Having an exceedingly bad reputation; notorious.
    
    - Causing or deserving infamy; heinous: an infamous deed.
    
    Don't mean to be all geeky, but, this *IS* slashdot :)
- Cause it doesn't work half the time? (Score:2)
  
  by rs79 ( 71822 ) writes:
  
  It's a great (cough) offsite backup, but very frustrating when you can't get all the pieces.
  - Re:Cause it doesn't work half the time? (Score:2)
    
    by smitty45 ( 657682 ) writes:
    
    the web frontend is not so great, but rest assured once you get ssh access, everything works excellently, actually.
Uh Oh (Score:1)

by ResQuad ( 243184 ) writes:

I think we /.'d sf.net... either that or it's conveniently not accessible right after I see it linked from slashdot.
Heritrix? (Score:3, Funny)

by elgrinner ( 472922 ) writes: on Wednesday January 07, 2004 @12:02PM (#7903109)

Sounds a bit like Asterix' grandfather.

Uh? (Score:5, Funny)

by Zog The Undeniable ( 632031 ) writes: on Wednesday January 07, 2004 @12:02PM (#7903110)

Heritrix (sometimes spelled heretrix, or misspelled or missaid as heratrix / heritix / heretix / heratix) is an archaic word for inheritess.
WTF is inheritess? I think we have recursive typos here...my head is going to explode!

- Re:Uh? (Score:3, Informative)
  
  by gojomo ( 53369 ) writes:
  
  'Inheritess' is the female form of 'inheritor' -- 'someone who inherits' (female). AKA 'heiress'.
- Re:Uh? (Score:2, Informative)
  
  by phiala ( 680649 ) writes:
  
  The OED online is my friend!
  As a confirmed sesquipedalian, and obsessive research-addict, how could I overlook the opportunity to learn new words? And of course, share my newfound knowledge with you all...
  The OED would like us all to know:
  heritrix, heretrix: A female heir or heritor; an heiress.
  heritress: An heiress, an inheritress.
  inheritress: A female inheritor; an heiress. (Less technical than inheritrix.)
  inheritrix: Latinized fem. of INHERITOR
  inheritess: not a word
  And there you have it, co
- Re:Uh? (Score:1)
  
  by jdavidb ( 449077 ) writes:
  
  Inheritess is not a typo for inheritance. It means a female who inherits.
Old slashdot news (Score:5, Interesting)

by AyeFly ( 242460 ) writes: on Wednesday January 07, 2004 @12:04PM (#7903129)

here is a slashdot story from wayback i just found.

"IBM announces a 25 gigger

Posted by Hemos on Wednesday November 11, @10:11AM
from the why-i-could-put-3/4-my-cd-collection dept.
Booker writes "So IBM announces a 25 gig hard drive... does the world need this yet? Unless this is in a RAID, would you really want to trust 25 gigs on a single drive? What would you use this for? 400+ hours of MP3s comes to mind... "
Read More...
64 comments"

Just thought it was interesting to see, since we now have 200gig HDs

- Re:Old slashdot news (Score:2)
  
  by WuphonsReach ( 684551 ) writes:
  
  Just thought it was interesting to see, since we now have 200gig HDs
  
  Check your rear-view mirrors more closely... that's a 300Gb drive passing you by (Maxtor 300GB Ultra ATA/133 [pricescan.com] for only ~$275-$290). Price is falling pretty nicely for them too (when they came out in September they were $350).
  
  Of course, we saw the same arguments that you quoted there when the 300Gb drives came out... does the world need this yet? Unless this is in a RAID, would you really want to trust 300 gigs on a single drive? What w
Slashdot wayback then... (Score:5, Funny)

by OpCode42 ( 253084 ) writes: on Wednesday January 07, 2004 @12:04PM (#7903130) Homepage

Just been looking at some slashdot pages from 1997... quote from the "Post your comments here!" form : "If you don't have anything worthwhile to say, don't say it. If people continue to abuse this feature, I will have to remove it."

Oh how different things could have been... ;-)

If the trolls had time machines... [archive.org]

- Kinda scary.... (Score:2)
  
  by imsabbel ( 611519 ) writes:
  
  Slashdot without comments would have around the same information density as a book without letters...
  - - Re:Not at all. (Score:2)
      
      by cyt0plas ( 629631 ) writes:
      
      When I'm in need of directions, I find the trolls (slightly) more useful than the bridge. Not that I come to slashdot for directions. Talk about the blind leading the blind.
- Re:Slashdot wayback then... (Score:2)
  
  by stevesliva ( 648202 ) writes:
  
  Oh how different things could have been
  
  Notice the unattributed slashdot quote of the day today, "I'm not proud."
  - Re:Slashdot wayback then... (Score:2)
    
    by adpowers ( 153922 ) writes:
    
    And right now it is "Spelling is a lossed art."
    
    Maybe later today it'll become "Duplicates are unavoidable."
I probably would have done this differently... (Score:5, Insightful)

by Rahga ( 13479 ) writes: on Wednesday January 07, 2004 @12:06PM (#7903140) Journal

Ever since the wayback machine started making waves, I'd guess about 2 years ago, I've noticed 2 things: There are far less updates of the archives, and it seems that the archive is regularly unable to keep up with the client load we impose on it.

To be honest, I don't have a great answer for the second problem. The only thing that could help there is the passage of time and advancement of technology, really. For the first problem, though, perhaps a SETI-ish distributed "Heritrix" could help make regularly archiving all of these sites a manageable affair. IA sends marching orders out to the distributed volunteer network, each client downloads, compares MD5 of the pages with other clients, compresses them, and sends them back to a master archive. Sounds great in theory, at least at first, to me...

Then again, would I do this, or even continue the project if I was in charge? No, I wouldn't. While, ideally, every page on the internet would be in XHTML, striking a major blow against signal:noise (hey, my own page is XHTML validated, how about yours?), the vast majority of time spidering is undoubtedly wasted on re-downloading several dozen kilobytes of dynamically generated junk surrounding the content on sites such as CNN.com... While it's a noble cause, it's also a futile one.
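
The volunteer scheme described in this comment (several clients fetch the same page, compare MD5 sums, compress, and return one copy) could be sketched roughly as below. This is only an illustration under stated assumptions: `consensus_copy` and the client names are hypothetical, the majority rule is one possible way to "compare", and none of it comes from Heritrix or the Internet Archive.

```python
import hashlib
import zlib
from collections import Counter


def consensus_copy(fetches):
    """Pick the copy that a strict majority of clients agree on.

    fetches maps a (hypothetical) client id to the bytes that client downloaded.
    Returns (zlib-compressed body, md5 hex digest), or None when no copy has
    a strict majority, for example when every client saw different dynamic junk.
    """
    if not fetches:
        return None
    digests = {cid: hashlib.md5(body).hexdigest() for cid, body in fetches.items()}
    digest, votes = Counter(digests.values()).most_common(1)[0]
    if votes * 2 <= len(fetches):
        return None  # no strict majority, so nothing trustworthy to archive
    body = next(b for cid, b in fetches.items() if digests[cid] == digest)
    return zlib.compress(body), digest


fetches = {
    "client-1": b"<html>page</html>",
    "client-2": b"<html>page</html>",
    "client-3": b"<html>page with rotating ad</html>",
}
result = consensus_copy(fetches)
```

The sketch also shows the weakness the comment itself raises: pages wrapped in dynamically generated content hash differently on every fetch, so clients would rarely agree on them.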

- Re:I probably would have done this differently... (Score:2, Interesting)
  
  by benja ( 623818 ) writes:
  
  Ever since the wayback machine started making waves, I'd guess about 2 years ago, I've noticed 2 things: There are far less updates of the archives, and it seems that the archive is regularly unable to keep up with the client load we impose on it.
  
  I think that they possibly intentionally limit their bandwidth, so that it's faster to browse the real Web than them (because they don't want to become Google cache when a site is slashdotted, for example).
  (Although they only would if the page in question is
  - Re:I probably would have done this differently... (Score:2)
    
    by adpowers ( 153922 ) writes:
    
    I thought the reason they don't get the pages for 6 months is because Alexa (in exchange for sponsorship) gets the exclusive rights to the archive for the first 6 months. I'm too lazy to look it up now, but I think I read that.
    - - Re:I probably would have done this differently... (Score:2)
        
        by adpowers ( 153922 ) writes:
        
        I found this in the FAQ:
        
        Why are there no recent archives in Wayback?
        
        Wayback does not add pages less than 6 months after they are collected. Updates can take up to 12 months in some cases.
        
        There is no access to files before they appear in Wayback.
        -------------------
        
        I couldn't find exactly what I was looking for, but I am pretty sure that is how it works. However, this quote is interesting:
        
        "The Internet Archive contains over 100 Terabytes of compressed data. This data is collected in collaboration with Al
Wayback = Genealogy of AI Minds (Score:3, Interesting)

by Mentifex ( 187202 ) writes: on Wednesday January 07, 2004 @12:06PM (#7903147) Homepage Journal

The Internet Archive [archive.org] serves the hidden purpose of preserving the AI source-code DNA of artificial Minds.
Each AI Mind [virtualentity.com] leaves a source code trace of itself as it evolves and proliferates across the 'Net and the parsecs of nearby meatspace.
Robot Minds [scn.org] will be able to look up their ancestors in the Internet archive, just as we humans do. However, when the Joint Stewardship of Earth by man and cyborg has arrived in the form of the Technological Singularity, [caltech.edu] robots will be able to resurrect their AI Mind ancestors and bring them back to alife from the Internet Archive.

- Re:Wayback = Eternal life for geeks (Score:2)
  
  by Dusabre ( 176445 ) writes:
  
  And many a geek without a RL will achieve eternal life when their personality (as expressed through pointed comments), experiences (as expressed through pointless anecdotes) and knowledge (as expressed through worthless advice) and thus their consciousness and LIVING MIND ITSELF, is painstakingly put back together by the same future race which will unfreeze the richer geeks from their cryogenic deathsleeps, from the myriad holographic shreds on the archived internet.
  
  Think about it...
  
  Everything you've ever
Clone (Score:1)

by RoC MasterMind ( 576689 ) writes:

I wonder how long it will be till we see a new site open using the code...
Redundancy? (Score:3, Interesting)

by Anonymous Coward writes: on Wednesday January 07, 2004 @12:11PM (#7903180)

The Internet is huge. But get rid of all the redundancy and the size goes down by a huge factor. How many copies of the Linux kernel and distros are there? How many copies of Matrix Reloaded? Do an MD5 sum and store pointers in order to recreate the structure of the net, keeping only one copy of what is unique. Terabyte servers are cheap these days. Wouldn't need more than a few at the most to archive everything.
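
A toy version of the "MD5 sum plus pointers" idea, in Python. The `DedupStore` class and the mirror URLs are made up for illustration; this is a sketch of the general technique, not how the Archive stores anything.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store: one blob per unique digest, plus URL pointers."""

    def __init__(self):
        self.blobs = {}     # md5 hex digest -> bytes, stored once
        self.pointers = {}  # url -> md5 hex digest

    def add(self, url, content):
        digest = hashlib.md5(content).hexdigest()
        self.blobs.setdefault(digest, content)  # keep only the first copy
        self.pointers[url] = digest
        return digest

    def get(self, url):
        return self.blobs[self.pointers[url]]


store = DedupStore()
store.add("http://mirror-a.example/linux.tar", b"kernel bytes")
store.add("http://mirror-b.example/linux.tar", b"kernel bytes")
store.add("http://mirror-c.example/linux.tar", b"kernel bytez")  # one byte differs
print(len(store.pointers), "URLs,", len(store.blobs), "unique blobs")
# prints: 3 URLs, 2 unique blobs
```

Only byte-identical copies collapse into one blob. A copy that differs by a single byte gets its own digest and is stored again.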

- Ah, but the thing is... (Score:2)
  
  by Kjella ( 173770 ) writes:
  
  ...while there may be unique content, there's certainly not unique versions. I'm sure there's many different rips of Matrix Reloaded. First off, there's all the various screener / preview dvd / telesync / DVD releases.
  
  Then there's all the corrupted versions (a single unnoticable bit error = different MD5). Different rips (Macrovision removed/not removed, inverse telecine, PAL/NTSC versions, different resizing (bicubic/bilinear/Lanczos3).
  
  Some made using XviD, some DivX, some WMV, different versions of the
  - Gr. 350 Tb and 15 Tb, respectively. And 1 petabyte (Score:2)
    
    by Kjella ( 173770 ) writes:
    
    Did the math using mb, when I thought I was operating in gb. So I was off by a factor of 1000. So the correct guesstimate would be 1 petabyte (1000 Tb).
    
    Kjella
Unless the Archive caves in... (Score:5, Informative)

by turambar386 ( 254373 ) writes: <turambar386 AT routergod DOT com> on Wednesday January 07, 2004 @12:25PM (#7903309) Homepage

"Since our crawler seeks to collect the digital artifacts of our culture for the benefit of future researchers and generations..."

That is, unless the digital artifacts in question are, like Operation Clambake [xenu.net], opposed to rich and powerful sects. In which case, they are blocked [archive.org] by the Wayback machine after the Archive caves in to DMCA notices [yale.edu].

- Re:Unless the Archive caves in... (Score:2)
  
  by burtonator ( 70115 ) writes:
  
  Not true... they are just dark archives.
  
  The content is still there it's just not available to the CURRENT generation.
  
  Future researchers and generations will still have this data.
  
  If you want the latest just go to xenu.net..
  
  For the record I support Brewster's and the Archive's position on this. It's hard to know who is more evil... the CoS or the anti-CoS folks ;)
  
  (quick answer... the CoS is pure evil! ;)
  
  I've had a few fights with the CoS myself:
  
  http://www.peerfear.org/rss/permalink/2002/12/1 4 /1 03990
- Re:Unless the Archive caves in... (Score:2)
  
  by jesterzog ( 189797 ) writes:
  
  In which case, they are blocked by the Wayback machine after the Archive caves in to DMCA notices.
  
  As upsetting as this is, I don't think it's fair to blame the Wayback Machine for this. They have to protect their own interests first to keep the service going at all. Becoming a martyr in a costly legal battle for political ideals may not fit into that. Companies don't have the freedom or flexibility of individuals, and this is the same reaction that nearly every other business and organisation wou
What if there's another archive.org (Score:4, Funny)

by British ( 51765 ) writes: <british1500@gmail.com> on Wednesday January 07, 2004 @12:27PM (#7903323) Homepage Journal

...and archive.org tries to archive it? Will it go into an infinite loop, or just have 2 copies of the interweb?

"Heritrix" explained (Score:2, Informative)

by skidoo2 ( 650483 ) writes:

Sheesh. Let me put this one to bed before it snowballs into a big cloud of impenetrable Times New Roman.

I'm tempted to shout, but I won't. Don't make me shout!

"Heretrix" is a term most often seen in a geneaology context. It denotes a chick who is designated to inherit (or has already inherited) the estate of someone. Example sentence: "Captain Dork married Jack Dipstick's heretrix Gassy Lucy."

In most cases the word "heretrix" connotes that there was something significant about the inherited estate, e.g.
finally! (Score:3, Funny)

by badansible ( 630677 ) writes: on Wednesday January 07, 2004 @12:30PM (#7903353)

I will be able to look at that exciting gopher site everybody was talking about! Yes?

Do it yourself archiving? (Score:2)

by TheRedHorse ( 559375 ) writes:

Guess this solves this guy's problem [slashdot.org].
How long? (Score:1, Redundant)

by Raven42rac ( 448205 ) writes:

How long until SCO claims that the code is theirs?
Why use this crawler? (Score:1)

by glinden ( 56181 ) writes:

There's a huge number of open source web crawlers available already on SourceForge [sourceforge.net] and elsewhere. Anyone know the advantages and disadvantages of this one over the others?
- Because it's top notch (Score:2)
  
  by JohnQPublic ( 158027 ) writes:
  
  Brewster Kahle and Alexa Internet are the real deal. This isn't some undergrad's CS-101 project, it's a tool designed from the very start to archive the entire web. And it does it on a regular basis. Even if there's a really good SourceForge project (you didn't cite any of them), Alexa's should be a first stop for anyone interested in the task.
LGPL from Wikipedia (GFDL typo?) (Score:2)

by Famatra ( 669740 ) writes:

I went to the GNU main site [ttp] to try and figure out what the LGPL was about, and no luck at all getting a coherent explanation.

Wikipedia has a good explanation [wikipedia.org] (below), although I am confused as to why the Wayback Machine chose this particular licence, since it seems to be specifically for software libraries. Perhaps they meant the GFDL [wikipedia.org] (GNU Free Documentation License).

P.S. You're allowed to copy all the stuff you want from Wikipedia; it's copylefted [wikipedia.org] with the GFDL itself! :)

--- Wikipedia Article on LG
What will happen if... (Score:2, Funny)

by balbord ( 447248 ) writes:

...wayback inadvertently archives itself?!?!

That reminds me... once I thought of googling for "google"... but I didn't, since it would no doubt create a black hole or something!
Important clarifications (!!!) (Score:5, Informative)

by gojomo ( 53369 ) writes: on Wednesday January 07, 2004 @01:26PM (#7903897) Homepage

Heritrix is just a crawler for collecting web resources recursively, within some defined parameters -- it doesn't offer Internet Archive Wayback Machine (IA WM) functionality.

FYI, there is a GPL'd web access tool that's very much like the IA WM, and even surpasses it in some ways: the NWA (Nordic Web Archive) Toolset 1.0 [nwa.nb.no]. It doesn't do crawling, but if you can coerce what you've crawled into its input format, it offers URL-based, date-based, and full-text search plus "back-in-time" viewing of an archive. (Check out their demo [nwa.nb.no], but remember it's only got a small number of pages from www.nb.no, so confine your searches to things like "Norway".)

Heritrix release 0.2.0 was mainly a test of our new release procedure; we would not recommend the code for outside use yet. We use it for crawls of up to hundreds of sites, taking a week or more to complete, but it still requires expert attention to crawl well.

We intend to improve its stability and scalability until it is capable of web-scale crawls -- billions of pages -- but that requires many incremental improvements, including extension to run on networks of cooperating crawling machines -- not planned until later in the year. (Heritrix currently crawls from a single machine.)

We are eager for contributors who would like to extend Heritrix in various ways, especially ways that would make it more valuable to researchers, librarians, and archivists. Optional modules for new fetch protocols, new media format link-extractors, or on-the-fly content-analysis to help direct further crawling would all be very interesting to us.
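
To make the idea of optional per-format link-extractor modules a little more concrete, here is a small sketch. It assumes nothing about Heritrix's actual module interface: the registry `EXTRACTORS`, the `extractor` decorator, and `extract_links` are hypothetical names, and the regular expressions are deliberately crude stand-ins for real parsers.

```python
import re

# Hypothetical registry: media type -> function(body) returning a list of links.
EXTRACTORS = {}


def extractor(media_type):
    """Decorator that registers a link extractor for one media type."""
    def register(func):
        EXTRACTORS[media_type] = func
        return func
    return register


@extractor("text/html")
def html_links(body):
    # Crude: only double-quoted href attributes are recognised.
    return re.findall(r'href="([^"]+)"', body)


@extractor("text/css")
def css_links(body):
    # Matches url(...) references, with or without quotes.
    return re.findall(r'url\(\s*["\']?([^"\')\s]+)', body)


def extract_links(media_type, body):
    """Dispatch to the registered extractor; unknown formats yield no links."""
    func = EXTRACTORS.get(media_type)
    return func(body) if func else []
```

Supporting a new media format then means registering one more function, without touching the dispatch code.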

IA currently receives almost all of its full-web collection via an agreement with Alexa Internet, who have been crawling the web for the Internet Archive since 1996.

(P.S.: Yes, 'inheritess' should be 'inheritRess'/'heiress'. Oops.)

This is not the Wayback Machine code. (Score:2, Interesting)

by InvisiBill ( 706958 ) writes:

A friend from another messageboard is working on this project, and just posted to let us know that he's been /.ed (which is sort of a cool thing in the geek world).
And of course they got it all wrong. Heritrix != WayBackMachine.
Heritrix gathers web pages (harvests)
The WayBackMachine gives access to harvested material.
Also Heritrix is a new web crawler meant to replace the one that IA has been using (which is owned by Alexa Internet).

That's what he had to say about it. The post and the article both
spam (Score:2, Insightful)

by krokodil ( 110356 ) writes:

I am afraid spammers may use this code
to harvest web pages for email addresses.
- Re:spam (Score:3, Informative)
  
  by elemental23 ( 322479 ) writes:
  
  Don't lose any sleep over it, spammers have had tools to harvest the web for e-mail addresses for years.
  
  Insightful?
i thought i saw... (Score:2)

by burns210 ( 572621 ) writes:

I first read the headline and I thought it said the Internet Archive would be archiving L/GPL code.

That would be cool actually, like a one-stop shop for all the open source CVS servers... get to see the Linux kernel from 0.01 to 2.6.0 and a couple thousand other applications too. Oh well, the real story is neat too.
- Re:Google's IPO (Score:1)
  
  by agentforsythe ( 696066 ) writes:
  
  in english?
- Re:Heritrix (Score:3, Funny)
  
  by hplasm ( 576983 ) writes:
  
  And what, pray tell, is "inheritess" ?
  A Heritrix.
- Re:no articles for 4 hours on a weekday morning? (Score:2, Funny)
  
  by skidoo2 ( 650483 ) writes:
  
  I was wondering the same thing. Last night I posted a cool article about weird slime on Mars [washingtonpost.com], and it hasn't even been rejected yet.
