ArchiveTeam wants a copy of all our sites
Posted: Tue Apr 24, 2012 5:20 pm
by koitsu
I have just been contacted by a member of a group called "ArchiveTeam" who asked for, verbatim quote:
"we want a full archive of everything on the parodius network"
My response was as follows:
Plain and simple:
Not going to happen, and I'm not going to give approval for this.
Each and every hosted person here has a right to define whether or not they want their content archived. Some may have robots.txt in place, others may not but might have other conditionals (some technical, some via footers/agreements on their page). Their data is their data; I am not the owner of their data. I cannot decide for them if they are comfortable with that.
In fact, given our highly moral and ethical values, I'm a little surprised you'd even ask for this. I hope I'm misunderstanding your request, otherwise I'm actually a bit offended by it.
I state this here, publicly, given that many of our hosted users are members here on the forum.
If you feel comfortable with someone archiving your data like this, who is a third-party, I would recommend you contact them and offer/work out something. Otherwise, hosted users' data, as I said, is their own data. I will never, ever agree to such requests on behalf of people we host. Like I said: your data is your data, and the decision is not mine to make.
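For hosted users wondering how a robots.txt rule would be read by a crawler that chooses to honor it, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the `ExampleArchiver` user-agent name, and the `example.com` URLs are all made up for illustration; nothing here reflects what any particular crawler actually sends or respects.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out one named crawler entirely,
# and keep every other crawler out of /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleArchiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(may_fetch("ExampleArchiver", "http://example.com/index.html"))
    print(may_fetch("OtherBot", "http://example.com/index.html"))
    print(may_fetch("OtherBot", "http://example.com/private/notes.txt"))
```

Keep in mind that robots.txt is only a request; it does nothing against a client that ignores it.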
Posted: Tue Apr 24, 2012 5:46 pm
by 3gengames
I think all the tech docs here on the site also should be saved, but yeah, my PMs and stuff won't be useful to people, honestly. I don't like the idea of having someone else have them either, as some stuff is supposed to be kept a little closer to the vest.
Posted: Tue Apr 24, 2012 7:54 pm
by Kit Sniper
I was expecting them to say something.
They basically take site archives and distribute them via torrent / Archive.org / other sites. They do good stuff but I've always been iffy about the legality of the thing. And the privacy part.
At least in my regard the answer is a resounding no. My site is my site and it's not going to go down until I die.
Edit: Oh goodie.
http://www.archiveteam.org/index.php?ti ... Networking
This does not look good.
Posted: Tue Apr 24, 2012 8:04 pm
by koitsu
Kit Sniper wrote:I was expecting them to say something.
They basically take site archives and distribute them via torrent / Archive.org / other sites. They do good stuff but I've always been iffy about the legality of the thing. And the privacy part.
At least in my regard the answer is a resounding no. My site is my site and it's not going to go down until I die.
Edit: Oh goodie.
http://www.archiveteam.org/index.php?ti ... Networking
This does not look good.
Well, I'm fine with them archiving the home page, our FAQ, etc. -- sure, that's all public anyway. But our home page/etc. != users' content. They will need to get every individual site owner's permission to archive stuff. And I will be very, very pissed if they run some kind of scraping bot against everything without talking to me first. Bandwidth doesn't grow on trees.
I'm not sure they should bother at this point anyway -- if they were fair, they'd simply wait until shortly before October to talk to me. Most of the site owners will be moving their stuff to other URLs, which means all the content/etc. will be available on the Internet just at a new URL. Thus I cease to see the point in archiving it. For things that don't get moved, that's something that can be discussed later.
Posted: Tue Apr 24, 2012 8:08 pm
by Kit Sniper
koitsu wrote:Well, I'm fine with them archiving the home page, our FAQ, etc. -- sure, that's all public anyway. But our home page/etc. != users' content. They will need to get every individual site owner's permission to archive stuff.
I'm not sure they should bother at this point anyway -- if they were fair, they'd simply wait until shortly before October to talk to me. Most of the site owners will be moving their stuff to other URLs, which means all the content/etc. will be available on the Internet just at a new URL. Thus I cease to see the point in archiving it. For things that don't get moved, that's something that can be discussed later.
I've followed them for a little while now and they'll be making copies of everything hosted at Parodius, not just the index, with or without permission. Even sites that won't go away, like mine.
They won't be able to get many of the files kept under directories with an index, but I still don't want them to archive my site. I'm not going away anytime soon.
Edit:
3gengames - when they mean archive they basically want mirrors of sites. For example, they'd back up the forum posts, but not the private data like PMs.
Posted: Tue Apr 24, 2012 8:18 pm
by koitsu
Well, if what you say is true, then I hope my previous statement to them in PM is a sufficient deterrent. I said no. If they do it anyway and it violates an individual site's terms, then it's up to the individual to contact them + deal with it in some way. But overall I've given them my statement: do not do this. Bandwidth = not free. The last thing I need is to find my 95th percentile at 20 Mbit because someone thinks bandwidth grows on trees.
Posted: Tue Apr 24, 2012 8:48 pm
by LocalH
Perhaps it may be worth investigating a proactive block, if you can identify any IP ranges that they use to scrape such content? Figure out which ranges of addresses need blocking and then block them server wide, so that they can't eat up valuable bandwidth. Just an idea, and if it's possible to do then each client can still voluntarily provide their site content to ArchiveTeam if they so choose.
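As a rough illustration of the range-matching step in that idea, here is a minimal sketch using Python's standard `ipaddress` module. The CIDR ranges are reserved documentation blocks standing in for whatever ranges might be identified; they are assumptions for the example, not addresses known to belong to anyone. A real server-wide block would normally live in the firewall or web server configuration rather than in application code.

```python
import ipaddress

# Hypothetical blocklist: RFC 5737 documentation ranges used as placeholders.
BLOCKED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_blocked(client_ip, blocked=BLOCKED_RANGES):
    """Return True if client_ip falls inside any blocked network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in blocked)


if __name__ == "__main__":
    print(is_blocked("192.0.2.17"))   # inside the first placeholder range
    print(is_blocked("203.0.113.5"))  # outside both placeholder ranges
```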
Posted: Tue Apr 24, 2012 8:55 pm
by Kit Sniper
LocalH wrote:Perhaps it may be worth investigating a proactive block, if you can identify any IP ranges that they use to scrape such content? Figure out which ranges of addresses need blocking and then block them server wide, so that they can't eat up valuable bandwidth. Just an idea, and if it's possible to do then each client can still voluntarily provide their site content to ArchiveTeam if they so choose.
That won't work.
They basically run wget scripts on sites from various locations across the world by coordinating volunteers via their IRC channel. So even if you block one address range from Topeka, someone from France might go at it.
The good thing is, they don't really put ten people to download the same site at the same time. They get people to do segments and once they're done, they're done. There are no redundant scrapes. So while they may be downloading everything... they won't do it repeatedly. :\
Posted: Wed Apr 25, 2012 1:29 pm
by Tormenter
What's the big deal about having an archive of all of this information, instead of letting it go offline to never be seen or used again? IMO, that's pretty much a kick in the ass to everyone in the community.
Posted: Wed Apr 25, 2012 1:58 pm
by koitsu
Tormenter wrote:What's the big deal about having an archive of all of this information, instead of letting it go offline to never be seen or used again? IMO, that's pretty much a kick in the ass to everyone in the community.
1. What makes you think the information is being "let go offline never to be seen or used again?" Neither you nor these ArchiveTeam folks have any insight into that. You don't know what our hosted users are doing, and neither do they.
2. The problem -- from my perspective, and that's why I made the post sticky -- is that the ArchiveTeam folks asked me for permission to download all of our hosted sites.
**I** am not the person to ask when it comes to other people's data. If they want to archive (for example) Kitsune's sites then they need to talk to him, not me. If they want to archive NESWorld, then they need to talk to Martin. If they can't figure out who to ask (i.e. owner doesn't disclose contact information), then asking me won't solve that either.
The point is: I don't own our hosted users' data. They own their data. Decisions like this need to be made by the hosted users on a per-user basis and not by me. Nothing gives me the right to make decisions for them.
3. From a technical level, the "big deal" has to do with network traffic. I think this is the 2nd or 3rd time I've brought up this point in recent threads where you've commented. I will repeat, and make bold: bandwidth/network traffic is expensive. It is not free. You may want to read up on what 95th-percentile billing is about -- because it's what datacenters/co-location providers use. It may not be something you've seen before because most low-end "hosting" environments look at things from a volumetric point of view, but no datacenter does (or carrier/transport provider, for that matter). 95th-percentile can screw a person out of tens of thousands of dollars in bandwidth overage fees.
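For anyone who hasn't run into 95th-percentile billing, here is a minimal sketch of how the calculation is commonly described: throughput is sampled at regular intervals over the month, the top 5% of samples are thrown away, and the highest remaining sample is the billable rate. The 5-minute interval, the 30-day month, the nearest-rank method, and all the traffic figures below are assumptions for illustration; actual provider contracts differ in the details.

```python
def percentile_95(samples_mbps):
    """Return the billable rate: the highest sample left after discarding the top 5%."""
    if not samples_mbps:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_mbps)
    # Nearest-rank 95th percentile, computed with integers to avoid float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


SAMPLES_PER_MONTH = 30 * 24 * 12  # 5-minute samples over 30 days = 8640


def month_with_burst(baseline_mbps, burst_mbps, burst_hours):
    """Build a hypothetical month: flat baseline plus one burst of the given length."""
    burst_samples = burst_hours * 12
    return ([baseline_mbps] * (SAMPLES_PER_MONTH - burst_samples)
            + [burst_mbps] * burst_samples)


if __name__ == "__main__":
    # A 24-hour burst fits inside the discarded top 5% (36 hours of samples).
    print(percentile_95(month_with_burst(2.0, 20.0, 24)))
    # A 48-hour burst does not, so the whole month is billed at the burst rate.
    print(percentile_95(month_with_burst(2.0, 20.0, 48)))
```

In this made-up month the discarded 5% amounts to 36 hours of samples, so a scrape that sustains high throughput for longer than that sets the rate for the entire month, which is why a volumetric view of traffic understates the cost.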
Is there anything constructive you can add to any of the threads you've posted in? Sorry for getting combative, but all I've seen is peanut-gallery comments passing judgement and asking "why" in a smarmy way. What do you have that's positive that you can bring to the table? Because I welcome such.
Posted: Wed Apr 25, 2012 2:02 pm
by tepples
Perhaps I should get with them and tell them what's already being done to archive the nesdev subdomain and wiki.nesdev.com domain.
EDIT: I summarized koitsu's posts in this topic into AT's page about Parodius.
Posted: Wed Apr 25, 2012 9:07 pm
by koitsu
tepples wrote:Perhaps I should get with them and tell them what's already being done to archive the nesdev subdomain and wiki.nesdev.com domain.
EDIT: I summarized koitsu's posts in this topic into AT's page about Parodius.
Thanks much man. I appreciate the effort; right now I have too much going on (with all of this stuff -- you should see my inbox -- and with doctor's visits, work chaos (probably the most chaos I've ever seen), etc... I get to have an endoscopy tomorrow, for example. Hooray...)
Posted: Wed Apr 25, 2012 9:59 pm
by Kit Sniper
tepples wrote:Perhaps I should get with them and tell them what's already being done to archive the nesdev subdomain and wiki.nesdev.com domain.
EDIT: I summarized koitsu's posts in this topic into AT's page about Parodius.
Um... it's foxhack.net, not .com

Could you please fix that?
I own the .net / .com / .org domains but only use the .net one. Thanks for letting them know about it, and I'm writing a post about that at my site too.
Posted: Thu Apr 26, 2012 9:20 am
by Tormenter
koitsu wrote:Tormenter wrote:What's the big deal about having an archive of all of this information, instead of letting it go offline to never be seen or used again? IMO, that's pretty much a kick in the ass to everyone in the community.
1. What makes you think the information is being "let go offline never to be seen or used again?" Neither you nor these ArchiveTeam folks have any insight into that. You don't know what our hosted users are doing, and neither do they.
2. The problem -- from my perspective, and that's why I made the post sticky -- is that the ArchiveTeam folks asked me for permission to download all of our hosted sites.
**I** am not the person to ask when it comes to other people's data. If they want to archive (for example) Kitsune's sites then they need to talk to him, not me. If they want to archive NESWorld, then they need to talk to Martin. If they can't figure out who to ask (i.e. owner doesn't disclose contact information), then asking me won't solve that either.
The point is: I don't own our hosted users' data. They own their data. Decisions like this need to be made by the hosted users on a per-user basis and not by me. Nothing gives me the right to make decisions for them.
3. From a technical level, the "big deal" has to do with network traffic. I think this is the 2nd or 3rd time I've brought up this point in recent threads where you've commented. I will repeat, and make bold: bandwidth/network traffic is expensive. It is not free. You may want to read up on what 95th-percentile billing is about -- because it's what datacenters/co-location providers use. It may not be something you've seen before because most low-end "hosting" environments look at things from a volumetric point of view, but no datacenter does (or carrier/transport provider, for that matter). 95th-percentile can screw a person out of tens of thousands of dollars in bandwidth overage fees.
Is there anything constructive you can add to any of the threads you've posted in? Sorry for getting combative, but all I've seen is peanut-gallery comments passing judgement and asking "why" in a smarmy way. What do you have that's positive that you can bring to the table? Because I welcome such.
I host many sites; I know how much traffic costs. This is just a message board, and could easily get by on a $100-a-year plan with no problems.
Posted: Thu Apr 26, 2012 10:22 am
by tepples
Yeah, perhaps part of the difference is that one of you is talking about "*.parodius.com" and the other about "nesdev.com and wiki.nesdev.com".