We accidentally burned through 200GB of proxy bandwidth in 6 hours

(blog.skyvern.com)

96 points by suchintan 10 months ago

patmcc 10 months ago

I'm now expecting we'll see a couple things in the next few years:

1. An explosion of residential proxy networks and other stuff to circumvent blocking of cloud IP ranges, for all the various AI scraping tools to use.

2. A corresponding explosion of countermeasures to the above. Instead of blocking suspicious IPs, maybe they get a 3GB file on their request to /scrape-target.html


bdcravens 10 months ago

Perhaps an explosion of usage. There's already a few very large residential proxy networks.

- cute_boi 10 months ago
  
  And I frequently get contacted by Bright Data to install their code in my repo.
  
mrtesthah 10 months ago

I think that may be against the ToS of most residential ISPs.

- bdcravens 10 months ago
  
  Perhaps, but it's already fairly prodigious. Among "ethical" providers, it's often bundled as a background service in a lot of clickwrap "freeware". (To say nothing of compromised computers in a botnet)
  

metadat 10 months ago

200GB is nothing since 2018 when AT&T mass introduced their 1-gig symmetric fiber. Any single common gigabit link can move 200GB in under half an hour.

On any gig link, over the course of 6 hours you can transmit about 2.7TB one way, which is roughly 13x more.
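
As a rough check on this kind of arithmetic, here is a small sketch of how much a saturated link can move. It assumes decimal units (1 GB = 10^9 bytes), a 1 Gbps line rate, and no protocol overhead; the function names are illustrative only.

```python
GIGABIT_PER_SECOND = 1_000_000_000  # assumed line rate, in bits per second


def transfer_seconds(volume_gb: float, link_bps: float = GIGABIT_PER_SECOND) -> float:
    """Seconds needed to move volume_gb decimal gigabytes over a saturated link."""
    bits = volume_gb * 1e9 * 8
    return bits / link_bps


def volume_gb_in(hours: float, link_bps: float = GIGABIT_PER_SECOND) -> float:
    """Decimal gigabytes a saturated link can carry one way in the given hours."""
    bytes_per_second = link_bps / 8
    return bytes_per_second * hours * 3600 / 1e9


if __name__ == "__main__":
    print(f"200 GB at 1 Gbps: {transfer_seconds(200) / 60:.1f} minutes")
    print(f"6 hours at 1 Gbps: {volume_gb_in(6):.0f} GB")
```

Under these assumptions, 200 GB takes a little under 27 minutes, and six hours of saturation carries about 2.7 TB. Real links will do somewhat worse because of protocol overhead.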


Johnny555 10 months ago

Too bad AWS didn’t get that memo, 200GB would cost $18 there, and somehow the company in the original post is paying $500 for that bandwidth with whoever their proxy host is.

- suchintan 10 months ago
  
  Haha unfortunately we use residential proxies under the hood to simulate real users (as you'd expect from AI agents), where bandwidth is significantly more expensive!
  
  
  donmcronald 10 months ago
  
  How does a residential proxy work? Do people rent out their internet connections to commercial services?
  
- audrey1 10 months ago
  
  The screenshot is from webshare
  

omoikane 10 months ago

The discussion linked in the post is from 2022, and the corresponding issue has already been fixed:

https://issues.chromium.org/issues/40220332

I wonder if there is a more recent bug related to this?


TRiG_Ireland 10 months ago

I think that Chrome is still doing a lot of downloads. They're just no longer showing them to the user.

- fijiaarone 10 months ago
  
  And uploads.
  

sam0x17 10 months ago

Gosh I regularly burn through that much just updating games in steam :D. Not proxy bandwidth of course, but isn't it funny that the line between regular usage and $$$ can be what is using the bandwidth? Or rather, isn't it funny that regular consumers expect to be able to use multiple terabytes of data for < $100/mo, but the same can still cost thousands in other enterprise domains?


Suppafly 10 months ago

>Gosh I regularly burn through that much just updating games in steam
Same, and I get angry letters from comcast about abusing my 'unlimited' bandwidth.

- sam0x17 10 months ago
  
  My condolences, I no longer live in an area controlled by the comcast empire, but I am familiar with your pain
  

perks_12 10 months ago

200GB for $500? What cloud is this?


jsnell 10 months ago

I don't think it's a cloud. It's more likely a residential proxy network, which are typically created by installing malware on users' machines.
The operators of these proxy networks want to avoid detection by both the users whose bandwidth they're stealing, and by the companies whose data is being scraped. So they want to make the bandwidth very expensive. And that expensive bandwidth in turn means that their only clients are dodgy as well. Either people looking to scrape data without consent and monetize it, or outright criminals.

- iforgotpassword 10 months ago
  
  I use one. I run a bot on IRC that extracts the <title> of every link posted (or downloads the image/whatever and extracts metadata) and announces that to the channel. It has become more and more pointless to run this on a VPS. Google/YouTube block the IP range, a lot of websites return the Cloudflare security check, Amazon works on some days and doesn't on others... Ever since I started proxying via residential proxies, it just works. I'm a smooth criminal. :>
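
For readers wondering what the title-extraction step of such a bot might look like, here is a minimal sketch using only Python's standard library. It assumes the HTML has already been fetched (directly or through a proxy); `TitleParser` and `extract_title` are hypothetical names, not the commenter's actual code.

```python
from html.parser import HTMLParser
from typing import Optional


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element of a page."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; only the first <title> is honoured.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self) -> Optional[str]:
        # Collapse runs of whitespace so the title fits on one chat line.
        text = " ".join("".join(self._parts).split())
        return text or None


def extract_title(html: str) -> Optional[str]:
    """Return the page title, or None when the document has none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title
```

A bot would call `extract_title(body)` on each fetched page and announce the result when it is not `None`. Fetching, timeouts, and size limits are deliberately left out of the sketch.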
  
  
  morkalork 10 months ago
  
  So much for the open internet.
  
  
  derekzhouzhen 10 months ago
  
  I feel your pain, but I refuse to cave. Say, 10% of the links fail to load, so what? It is their loss, not mine.
  
- bscphil 10 months ago
  
  It's kind of surprising that a presumptively legitimate company (and YC-funded startup) would out themselves as buying black market residential proxy bandwidth, isn't it?
  
  
  jsheard 10 months ago
  
  Their frontpage also advertises the ability to pass CAPTCHAs, whether by automation or more likely by delegating them to third-world CAPTCHA farms. If that's a major selling point for your automation service then your target market probably ranges from dubious (e.g. data scrapers trying to get around limits) to extremely dubious (e.g. ticket scalpers, spammers, click fraud, etc).
  
  
  mrguyorama 10 months ago
  
  How long have you been here? It's not surprising at all. HN and YC have not demonstrated an aversion to "uh, greyhat" activity.
  If it were 2000, people would be sharing their ad clicking startups.
  YC has funded a looooooot of sketchy companies.
  
  
  dewey 10 months ago
  
  Residential proxies are not necessarily "black market".
  
- dewey 10 months ago
  
  There are many reputable residential proxy networks too; usually there's a lot of vetting involved, as they don't want people running illegal activities through their network.
  It's almost a necessity these days to have access to that due to how much datacenter ranges are blocked.
  
- floam 10 months ago
  
  It’s not necessarily malware. There are services that are pretty upfront and pay cash money for residential US bandwidth. That said, naive people might be surprised when their IP starts getting blocked.
  e.g. https://www.honeygain.com/ (something like 100GB = $20).
  
  
  Saris 10 months ago
  
  >That said, naive people might be surprised when their IP starts getting blocked.
  Or law enforcement shows up at their door because their IP is involved in a bunch of illegal stuff.
  
- miohtama 10 months ago
  
  Here's more on "free VPNs":
  https://www.kaspersky.com/blog/what-is-wrong-with-free-vpn-s...
  Usually such proxy networks are outright criminal (even if users are not).
  
- peab 10 months ago
  
  How does expensive bandwidth equate to dodgy clients? There are lots of valid use cases for scraping data, and it's legal to scrape publicly available data, even if the websites hosting it try to block it (try a curl request to reddit, for example)
  
  
  patmcc 10 months ago
  
  >>>and it's legal to scrape publicly available data, even if the websites hosting it try to block it
  Is that something that's been fully decided? https://en.wikipedia.org/wiki/Craigslist_Inc._v._3Taps_Inc. is the most relevant case I'm aware of, and it suggests it might actually be illegal (if you know you've been blocked, at least).
  
  
  suchintan 10 months ago
  
  https://techcrunch.com/2024/01/24/court-rules-in-favor-of-a-...
  This is another interesting example where it was allowed
  
- morkalork 10 months ago
  
  Aren't there also some suspiciously cheap VPNs that do that in the background?
  
  
  duskwuff 10 months ago
  
  An employee of one proxy network describes that exact business model here:
  https://news.ycombinator.com/item?id=41597315
  
  
  ThePowerOfFuet 10 months ago
  
  Yes
  
- hypeatei 10 months ago
  
  Yeah, the author confirmed it in this thread actually:
  https://news.ycombinator.com/item?id=41594713
  
bdcravens 10 months ago

Residential proxy service
https://smartproxy.com/proxies/residential-proxies/pricing
(may not be this service, but this is an example, and the price is consistent with their larger commitments)

tux3 10 months ago

Absolutely wild. A normal price for bandwidth before volume discounts is 1c/GB, or 10 bucks per TB

- jsheard 10 months ago
  
  They're in the business of scraping/botting sites that don't want to be scraped/botted, and bandwidth that looks "legit" comes at a premium.
  
hooverd 10 months ago

api.skyvern.com is a CNAME to an EC2 ALB, but even using a NAT Gateway ($$$) I can't make more than $1/GB add up.

intunderflow 10 months ago

Looks like webshare from the screenshot

baq 10 months ago

I downloaded World of Warcraft the other day, 100GB, took less than 3 hours and you can be sure it didn’t cost Blizzard $0.05.

- roywiggins 10 months ago
  
  Blizzard quite famously used BitTorrent to save bandwidth, dunno if they still do:
  https://wowpedia.fandom.com/wiki/Blizzard_Downloader
  

tristor 10 months ago

I would have liked to see a bit more of 5 Whys here. It seems like a consistent lesson that startups have to learn over and over is how to manage external dependencies, and particularly the dangers of having Google as a dependency. This is new Chrom(e|ium) behavior, and it has a real cost, both for this company and for users, which may or may not be worth the ROI, but this is what happens when you have a large scale external dependency: stuff moves without your knowledge, consent, or control.

Instead of Always. Be. Closing. it should be Always. Be. Mitigating. Dependencies. for startups.


suchintan 10 months ago

This is a great callout.
We had an internal discussion about how to manage dependencies effectively, and we made the decision to accept the risk that comes with blindly relying on Chrome for now, instead of investing heavily in mitigating that risk today.
The main motivator was for us to continue moving fast, and accept that we have a few hard dependencies in our business.
The goal is to find product market fit, then allocate time to de-risk some of these hard dependencies. If we fail to find product market fit, this may not matter at all

- tristor 10 months ago
  
  I think that's a fair strategy. Strong PMF generally overcomes weak execution; the challenge is that when you have hard dependencies on entities like Google or Apple, it can easily become existential. Even if you choose to move forward with this dependency, you should establish guard rails within your system to ensure you catch potentially impactful shifts faster, and have a plan for mitigation. For instance, you should identify key points of integration and possible alternatives even if you choose not to migrate now, so that a future migration is better understood and can be discussed intelligently in the heat of the moment. Even internal documentation can assist as a mitigation for dependency risk.
  
  
  suchintan 10 months ago
  
  Yeah exactly. One action item from this is that we need to add anomaly detection to our proxy usage metrics so we can catch this in 15 minutes instead of 6 hours :)
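
One simple form such a check could take is a comparison of the latest usage bucket against a rolling baseline. This is a minimal sketch, assuming usage is sampled into fixed intervals (for example 15-minute buckets, in MB); the function name, the factor of 5, and the minimum sample count are illustrative assumptions, not Skyvern's actual setup.

```python
from statistics import median


def is_usage_anomalous(history_mb, latest_mb, factor=5.0, min_samples=4):
    """Return True when latest_mb exceeds factor times the median of history_mb.

    history_mb holds per-interval usage from earlier buckets. With fewer than
    min_samples points there is no trustworthy baseline, so nothing is flagged.
    """
    if len(history_mb) < min_samples:
        return False
    baseline = median(history_mb)
    if baseline <= 0:
        # Any traffic at all is unusual when the baseline is zero.
        return latest_mb > 0
    return latest_mb > factor * baseline


if __name__ == "__main__":
    recent = [100, 120, 90, 110]  # made-up sample data
    print(is_usage_anomalous(recent, 8000))
    print(is_usage_anomalous(recent, 150))
```

The median is used instead of the mean so that one earlier spike does not inflate the baseline and mask the next one. A production version would also need alert routing and some handling for legitimate traffic growth.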
  

8organicbits 10 months ago

What infrastructure is this using? Bandwidth seems pretty pricy


tobyjsullivan 10 months ago

No kidding. AWS's notoriously expensive data transfer is only $0.09/GB. Who's charging $2.50/GB? Are they running on a cellular SIM with no data plan?
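
The per-GB comparison can be worked through in a few lines. This sketch uses only figures quoted in the thread ($500 for 200GB, and $0.09/GB for AWS data transfer); the helper names are illustrative.

```python
def cost_per_gb(total_usd: float, volume_gb: float) -> float:
    """Effective price per gigabyte for a given bill and volume."""
    return total_usd / volume_gb


def transfer_cost(volume_gb: float, rate_usd_per_gb: float) -> float:
    """Total cost of moving volume_gb at a flat per-gigabyte rate."""
    return volume_gb * rate_usd_per_gb


if __name__ == "__main__":
    proxy_rate = cost_per_gb(500, 200)  # rate implied by the post
    aws_rate = 0.09                     # rate quoted in the thread
    print(f"proxy: ${proxy_rate:.2f}/GB")
    print(f"200 GB on AWS: ${transfer_cost(200, aws_rate):.2f}")
    print(f"proxy is {proxy_rate / aws_rate:.1f}x the AWS rate")
```

So the implied proxy rate is $2.50/GB, roughly 28 times the quoted AWS rate, and the same 200GB would come to $18 at $0.09/GB.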

- ronsor 10 months ago
  
  Residential rotating proxy providers charge very high rates for data, on the order of $1 - $10 per GB. (These providers often do run their proxies through the cellular network, actually.)
  
  
  SteveNuts 10 months ago
  
  Is this something where end users can get paid for doing nothing other than proxying some traffic through their ISP?
  
- mikeocool 10 months ago
  
  Sounds like they are running a web scraping business -- so maybe? Using a cellular connection would be one way to help not get immediately captcha-ed by every site using Cloudflare.
  
  
  blitzar 10 months ago
  
  They should really set up their scraper (and exfil the data) via regular connections.
  

dusted 10 months ago

" 200GB of proxy bandwidth was approximately $500 burned over the course of 6 hours"

The fuck? So Internet is literally more expensive than buying a drive at Amazon, paying for shipping, filling it up, and putting it on a truck towards a destination anywhere in the world.


chrisandchris 10 months ago

Well, one part of the source of the problem is this, where I don't even understand all of the words (exaggerating a bit):
> Skyvern is an AI agent that helps companies automate workflows in the browser. We run leverage proxy networks and run headful browser instances in the cloud to facilitate most of our automations.
So you're doing Selenium, just with Cloud, AI and some other buzzwords you found while Googling?

HaZeust 10 months ago

I mean, yeah, bandwidth costs aren't just about bytes, they're about energy, infrastructure, and routing complexity too.

- dusted 10 months ago
  
  Yes, but, honestly, so is the production of rust spinners :)
  

hkon 10 months ago

Literally means cloudprotection in Norwegian. Thought for a second we had gotten our own cloudflare.


keremyilmaz 10 months ago

I was so surprised to read this, especially since we definitely didn't know about it when naming Skyvern.
Out of curiosity, I checked both Cambridge's Norwegian-English dictionary and a few other Norwegian sources, but I couldn't find 'Skyvern' listed anywhere. Makes me wonder if Google translate just hallucinated the meaning.

- hkon 10 months ago
  
  No, in Norwegian we make words by joining multiple words together (closed compound), in this case sky (cloud) and vern (protection).
  https://en.m.wikipedia.org/wiki/Compound_(linguistics)
  
  
  keremyilmaz 10 months ago
  
  Ah, now that makes sense! Thanks for sharing it. We definitely didn’t have that in mind when naming it Skyvern, but it's cool to see how it could be interpreted that way
  

tcfhgj 10 months ago

Please, gigabyte isn't a unit of bandwidth.

Bandwidth is measured in data/time


Saris 10 months ago

Tell that to every single ISP and Cell provider.

- tcfhgj 10 months ago
  
  Data volume makes more sense than bandwidth: https://www.telekom.de/prepaid-aktivierung/en/start
  Their explanation of bandwidth looks fine as well: https://dih.telekom.com/en/glossary/bandwidth
  

bradley13 10 months ago

"We run leverage proxy networks and run headful browser instances"

Um...say what? I'm pretty broadly based in IT, and I have no idea what that means.


suchintan 10 months ago

Haha, apologies for the language!
We use residential proxy networks when running Skyvern to help simulate real human behaviour (because that's what Skyvern is trying to do).
We run headful browser instances (meaning a real chrome instance running with a real viewport) for the same cause!


elphinstone 10 months ago

Skyvern is a great name, very evocative. Typical arrogant Google, downloading trash to the user without consent.


[removed] 10 months ago

[deleted]


olliej 10 months ago

Honestly given many of these stories, $500 seems to be getting off pretty lightly.

It’s still absurd to me that many (most?) of these hosting/bandwidth providers don’t seem to allow automatic cut-offs and such.
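
Where a provider offers no cut-off, a client-side guard is one possible stopgap. The sketch below is hypothetical: `BandwidthBudget` is an invented name, and it assumes the application can count the bytes it sends through the metered proxy itself.

```python
class BandwidthBudget:
    """Tracks bytes sent through a metered proxy and refuses work past a cap."""

    def __init__(self, cap_bytes: int) -> None:
        if cap_bytes <= 0:
            raise ValueError("cap_bytes must be positive")
        self.cap_bytes = cap_bytes
        self.used_bytes = 0

    def record(self, n_bytes: int) -> None:
        """Add the size of a completed transfer to the running total."""
        self.used_bytes += n_bytes

    @property
    def exhausted(self) -> bool:
        return self.used_bytes >= self.cap_bytes

    def check(self) -> None:
        """Raise before starting new work once the cap has been reached."""
        if self.exhausted:
            raise RuntimeError("bandwidth budget exhausted; refusing further proxy traffic")
```

Calling `check()` before each proxied request and `record()` after it gives a hard stop at the cap. It cannot see traffic the application does not account for itself, such as background downloads made by the browser, so it complements provider-side limits rather than replacing them.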


suchintan 10 months ago

It definitely could have been much worse. We burned through our monthly allocation in 6 hours HAHA, I'm grateful that our allocation wasn't something like 10TB

- olliej 10 months ago
  
  yeah, that could have been "exciting" :-O
  

tim_at_ping 10 months ago

Hello,

A (different) proxy company owner here. This sucks! Sorry that you lost out on so much bandwidth.

Feel free to reach out to me at tim@pingproxies.com and I'd be happy to get you set up on our service and credit you with 100GB of free bandwidth to help soften the blow. I'll also be able to get you pricing a little better than you're currently on if you are interested ;)

Within the next few months we're also releasing a bunch of tools to help stop things like this happening on our residential network such as some intelligent routing logic, spend controls and a few other things.

You may also want to look into Static Residential ISP Proxies - we charge these per IP address rather than bandwidth and they often end up more economical. We work with carriers like Spectrum, Comcast & AT&T directly to get IP addresses on their networks so they look like residential connections but host them in datacenters - this way you get 99.99%+ availability, 1G+ throughput, stable IP addresses and have unlimited bandwidth.

@ everyone else in the thread; if you run a start-up and need proxies then email me - happy to credit you with 50GB free residential bandwidth + give some advice on infra if needed.

Cheers, Tim at Ping


SteveNuts 10 months ago

I’m interested to know how your residential connections are sourced.
It says they’re “ethically sourced”, but it seems like malware/botnet like behavior.
Are these residential users aware their traffic is siphoned off for this purpose?

- tim_at_ping 10 months ago
  
  Our main business is Static ISP Proxies; here we liaise directly with datacenters and carriers such as ATT, Comcast and others to bring subnets to their network and we'll then purchase IP transit from them.
  We do also have residential peer proxies available - you're right to have ethical concerns, as there are bad actors out there that effectively build botnets and spread malware to get their nodes, but the industry has developed a lot over the last few years and there are numerous companies, including ourselves, which have pretty strict ethical guidelines. There are three main ways to ethically source real residential nodes:
  1. Direct payment to peers for traffic sent through their devices. There are several networks like EarnApp, Honey, Pawns and others where people can sign up and earn money for bandwidth sent through their devices. We liaise with these networks to add nodes to our pool.
  2. Quid pro quo with peers through providing free apps in return for the ability to route traffic through their devices. We don't currently engage in this method but we are planning on doing so within the next 12 months through a free VPN - the important thing here is that peers have to understand what they're signing up for in return for the free service - as long as you're upfront, then it is my belief that there is informed consent and it is therefore ethical; there is often a good value proposition to the customer in these cases, i.e. spend $7 a month on a paid VPN service or get a free one in return for exchanging a small amount of bandwidth which has zero marginal cost.
  3. Offer an SDK to developers to monetize applications - this is pretty common and while it is similar to 2., the ability to distribute the SDK to various developers makes it easier to get a large number of peers online. Again though, it's important that app developers provide notice of this to their users, and most reputable SDK providers have strict guidelines and mandatory screens that must be shown to end users prior to registering them as a residential proxy node.
  There are also a lot of other things involved in making an ethical network - a big thing is to just signal that bad actors and criminals aren't welcome on your network. This is usually done by banning certain domains; for example, we ban all .edu and .gov domains as well as most banking/finance websites + are a member of the Internet Watch Foundation and block their listed domains. This stops bad actors from using our proxy network for evil + protects peers in the network from bad activity going through their devices.
  Happy to answer any other questions if you have them :)
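
The domain-banning rule described above could be sketched roughly as follows. This is an illustration only, not the provider's implementation: the `.edu` and `.gov` suffixes come from the comment, while `bank.example` is a placeholder entry and `is_banned` is a hypothetical helper.

```python
from urllib.parse import urlsplit

BANNED_SUFFIXES = (".edu", ".gov")  # suffixes named in the comment
BANNED_DOMAINS = {"bank.example"}   # placeholder for a real blocklist


def is_banned(url: str) -> bool:
    """Return True when the URL's host falls under a banned suffix or domain."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    if not host:
        # Fail closed: refuse anything whose host cannot be determined.
        return True
    if host.endswith(BANNED_SUFFIXES):
        return True
    # Match the listed domain itself and any of its subdomains.
    return any(host == d or host.endswith("." + d) for d in BANNED_DOMAINS)
```

Matching on the parsed hostname rather than the raw URL string avoids false hits from paths or query strings that merely contain a banned name. A real deployment would also need to handle country-specific variants such as `.gov.uk`, which this sketch does not cover.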
  
  
  oefrha 10 months ago
  
  Apparently you consider both 2 and 3 ethical, and your ethical company is at least expanding to 2. In that case, your ethical standard is just very different from many (most?) of us; we classify 2 and 3 as “shady as fuck”, and 1 as questionable.
  
  
  nikau 10 months ago
  
  > there is often a good value proposition to the customer in these cases i.e spend $7 a month on a paid VPN service or get a free one in return for exchanging a small amount of bandwidth which has zero marginal cost.
  Until someone sends bomb threats or downloads child porn via your IP....
  
  
  FusspawnUK 10 months ago
  
  Hey, any experience with running bots for games on your network? Most of them will block signups/auto-ban datacenter IPs at this point. Curious if you might be a valid alternative.
  
  
  greyface- 10 months ago
  
  Are you concerned with this activity being prohibited by the AUP of your users' ISP? Do you allow eyeball ASes to opt out of having their network resold in this way?
  
  
  [removed] 10 months ago
  
  [deleted]
  
- 9cb14c1ec0 10 months ago
  
  Literally everyone says they use ethical sourcing, but I never believe that about any residential proxy service without solid proof.
  
- chimen 10 months ago
  
  They are never ethically sourced. "Ethically" for them means placing a phrase in a 10k word TOS when victims install app X or game Y, which loads their SDK. "Ethically" here means "we warned them in a TOS".
  
  
  LargoLasskhyfv 10 months ago
  
  Huh?
  > We work with carriers like Spectrum, Comcast & AT&T directly to get IP addresses on their networks so they look like residential connections but host them in datacenters - this way you get 99.99%+ availability, 1G+ throughput, stable IP addresses and have unlimited bandwidth.
  
- Suppafly 10 months ago
  
  It can't be ethically sourced, because these residential users have to be tricked into violating the TOS of their ISP to be able to provide the bandwidth. Whether they are aware of it or not, they are risking losing their internet access.
  
[removed] 10 months ago

[deleted]


ang_cire 10 months ago

Blocking Google from downloading anything onto your computer without consent is always a good idea.


suchintan 10 months ago

We were pretty careful about what we were blocking here -- had the exact same concern. Hopefully it doesn't come back to bite us in the future (new blogpost incoming?)

hypeatei 10 months ago

Especially if you're using expensive bandwidth from botnets.

[removed] 10 months ago

[deleted]


monyasau 10 months ago

[dead]


[removed] 10 months ago

[deleted]


rustdeveloper 10 months ago

[flagged]


suchintan 10 months ago

This is really cool! I'll check it out :)

KomoD 10 months ago

You're clearly associated with scrapingfish and not just a customer, your entire comment history is just shilling for them.


meindnoch 10 months ago

>200GB of proxy bandwidth

Gigabyte is a measure of information.

Bandwidth is information transmitted over time.


keepamovin 10 months ago

you shouldn’t be paying by the terabyte. Colocate and just pay for the maximum throughput. Far better rates


skeeter2020 10 months ago

doesn't work when the sites you're scraping block the IPs/range of your server. They're using a proxy botnet that costs a premium

- keepamovin 10 months ago
  
  makes sense but you shouldn't be paying a premium for a non-premium service. IP blocks and bandwidth have low unit cost at scale.
  