Evolving Web Technology for the Public Interest
Presented by: Lucas Pardue, Matt Hobbs, Neil Craig
Originally aired on February 11, 2021 @ 3:00 AM - 4:00 AM EST
New web technologies bring exciting new opportunities but it's not as simple as "just switch them on". When do you turn off the older stuff?
Join Matt Hobbs (@TheRealNooshu), Neil Craig (@tdp_org) and Lucas Pardue (@SimmerVigor) for a discussion about the unique requirements and challenges for public service focused organisations.
Transcript (Beta)
The web, a digital frontier. I tried to picture bundles of HTTP requests as they flow through the Internet.
What do they look like? Important public information, news, video, other stuff.
I don't know. I kept thinking about these things and I realized there's some people within my professional network that could give me way more insight than I have. So I invited them on.
I'm Lucas Pardue, I'm an engineer at Cloudflare.
I'm joined by two guests this week, on an episode that's not only to do with HTTP/3, but looking at other web tech evolution.
And so I've retitled the show slightly differently this week, but it's generally the same theme.
If you're fed up with hearing about QUIC and HTTP/3, please do stay tuned.
We're going to take a kind of different view of things, not get into the technical weeds, but look at architecture requirements, challenges, how these things actually get deployed and affect real people with real problems and real devices.
So enough about me. I want to briefly introduce my guests this week and then I'll hand over to them to really talk to me about what they do.
So I'll start with Matt Hobbs. He is an engineering manager and former developer who used to code.
He's interested in web performance, accessibility and utilizing technology to improve people's lives.
You are an engineering manager at the Government Digital Service, that's the UK government.
We've got our worldwide viewers here. And I've also got Neil Craig joining us, who works for the BBC's Online Technology Group as a lead technical architect.
Neil didn't give me a sentence to describe what he does, so that is the sentence to describe what you didn't do.
But anyway, so I used to work briefly with Neil on and off a bit when I was at the BBC many moons ago.
But I don't get the inside track anymore on what he's up to.
So I get to see some public blog posts from time to time, but I'm really interested to hear maybe what we can find out.
So yeah, I think, which one of you would like to go first? And maybe introduce a bit about who you are in more detail, how technology in your organization works, those kinds of things.
I'm happy for Neil to go first. We're all gentlemen here.
All right, cool. Thanks, Matt. Thanks, Lucas. Yeah, so as Lucas said, my name's Neil and I work in a part of the BBC, which is the National Broadcaster in the UK, just for anyone who's not familiar.
We're a pretty old organization, been going for nearly 100 years now, actually.
And we were quite early on the Internet.
So there's lots of history within the org and I work in a part which is called the Online Technology Group, which doesn't do a great job of explaining what we do.
We're kind of core Internet services for the BBC. So there's lots of teams that we call product teams.
So they make things like the news websites, sport websites, children's, weather, you know, all those kinds of things.
There's tons and tons and tons. And we sort of look after the shared core services which put all of that on the Internet.
So we can probably go into a little bit more detail a bit later.
But that's sort of essentially what I do. And I'm in an architectural role.
So, you know, I kind of wave my hands and draw diagrams and things and hope that it all gets built in the way that was in my head and that it will all work.
Yeah, and... yes, so, myself.
Yeah, I'm Matt Hobbs. I'm head of front end and a lead developer at the Government Digital Service, and that is essentially part of the Cabinet Office, and we've built and created a number of services.
The main service that people know us for is gov.uk but we also do things like gov.uk pay and the design system and gov.uk notify and we run those core services.
And in terms of the tech stack.
It's quite varied. So, for example, gov.uk works on Ruby on Rails, Notify is built with Python.
We have bits of Scala, we have bits of different tech stacks all over the shop.
So yeah, it's varied work. It's interesting work and a lot of people get to see it and a lot of people come to it for guidance.
So it's for me, it's all about essentially making it as quick as possible and making it as accessible as possible to reduce the barrier to entry, really.
And so, yeah, that's it from me.
Thank you very much for that. And thanks for your time, just for making this hour available.
I want to stress to people, this is live TV, so we're not making it up as we go along.
We do have a plan, but people can be put on the spot, like me, who's fumbling right now.
But I wanted to say we've enabled a feature this week called the live viewer call-in, which is a bit experimental.
I haven't tried it myself, but I believe people who are watching now can hit a button on the page and leave us effectively a voicemail that if we deem is fine by the moderation, we can play it out live.
So you can become a piece of history today, if you so wish.
And I'm monitoring that, which is why I keep looking up to this corner.
Please get involved if you'd like to. But anyway, back to the point here.
I think from my understanding of the different responsibilities of both of you, there's kind of a difference here in maybe how teams within the organizations approach deploying services to the web and to users, I guess.
So for Neil, it's maybe a bit more of a common process and like a centralized thing.
You know, if I wanted to stand up a new web page, well, I think you talk about product like sports and those kinds of things.
They're all effectively, most of them in the UK are after the slash.
So bbc.co.uk slash thing is a product. And so there's different benefits, a commonality here.
There's a benefit of a technology stack. Sometimes that can be a bit constraining.
Sometimes teams progress at different rates, but there's also a commonality of look and feel, which is what's perceived more by an end user.
So maybe, I don't know how much detail you can go into, but maybe you could like expand a little bit on how the BBC approaches those two aspects.
Yeah, for sure.
So I guess I'll set a little bit of historical context, not too much, but over the last, I don't know, probably six, seven years or so, roughly, maybe a little bit more, we've moved away from quite a rigid and controlled sort of web system, which was called, well, it was on a kind of set of infrastructure called Forge.
And there was a thing called PAL, which is a page assembly layer.
So that was all kind of built and centrally managed.
And teams were sort of customers of that.
And they had to work within those constraints. The advantages of that is that a lot of things were done for the teams.
A lot of things could be sort of centrally managed and you get economies of scale and that kind of thing.
It is a little bit restrictive and with the sort of, the growth of the cloud, we're sort of increasingly seeing teams wanting to have a little bit more freedom and control and push things out.
And that's gradually happened. And now actually we're just in the process of turning PAL off, which will happen later this year.
And that sort of meant that our world has become sort of that much looser and freer.
So we've sort of needed to bring in some services, which sort of, you know, sort of sit in the middle and sort of straddle those two worlds.
So what we have is a, a deployment system called Cosmos, and that is run by one of the teams in OTG.
And our various product teams on news and sport and weather and so on, they use that system to deploy their websites.
So it's fairly build agnostic. Lots of our systems are now Node.js, but we certainly have other languages around as well.
We have certainly some Java and Python and so on, increasing amounts of static publishing as well.
But that generally all gets, it doesn't have to, but it generally all gets pushed through Cosmos.
And then that gets published and that goes out to AWS. So teams have control over their CloudFormation stack, so they can deploy what they want, ELBs, NLBs, ALBs, API gateways, all those kinds of things.
And they sort of build their own systems and Cosmos kind of manages the deployment for them and does lots of those kind of really awkward things that, you know, you just don't want teams to be repeating.
And so just to interrupt a bit there, the, that's great. Thank you.
I want to phrase my question correctly here. I shouldn't have interrupted you.
I should have written it down. Sorry. Those services that are running to host those websites, are they, like, public facing on the Internet?
Like the focus of this show is to talk about Internet and web tech.
So I think it's like we, knowing the back ends is very important.
We have this whole thing about client-side rendering or server-side rendering.
Those things have a big role in web performance, but when it comes to the Internet plumbing of TCP and IPv6 and those kinds of things, in my experience, typically no one cares about those.
They're a bit boring, but we care. And so, like, how does that work? Does everything kind of get funneled into you?
And then are you running your own content distribution?
Are you doing other things? You mentioned AWS. Is it like all of these things are in the mix?
Yep. Yep. So yeah, absolutely. So as the sort of team that runs the bulk of the core kind of infrastructure, we, so we sort of operate a fair bit of the stack that's sort of at the lower level, as you say, the kind of stuff that people don't want to have to care about.
So the networks, you know, the DNS, the traffic management and that kind of stuff.
So the way that it works is we have our sort of, in terms of talking about the websites, we have our two sort of main domains, which is www.bbc.co.uk and bbc.com.
So when you hit those, you'll come into either a CDN or one of our in-house traffic managers, which is one of my projects, I think, called GTM.
And that's based on NGINX. And what we do inside of there is we do path-based routing.
So GTM knows how to route the particular sort of content classes.
So generally the top-level directory, so slash news or slash sport or wherever, but it's very flexible and can route lots of things under there and specific matches and all those kinds of things.
So it has the routing logic and it knows where the origin for the particular service being requested sits.
So GTM is also the origin for the CDN. So we use CDN sort of strategically, in places where it doesn't really make financial sense to deploy, you know, individual traffic managers or groups of traffic managers.
So you can go CDN to GTM or straight to GTM, depending on where you are in the world.
And then GTM knows where the origin is.
So what GTM is doing for you is it's doing the TCP connections, obviously, it does the TLS termination, it does routing, it does caching, and it does HA for origins.
So it is basically a CDN really. We've sort of built our own in-house CDN and, you know, there's, well, I probably shouldn't go into too much detail, but the reason we built rather than bought was because it's quite tricky to find something that can manage the routing in the way that we want.
We've got an awful lot of sort of verticals, as we call them, news and sport and weather and so on, and just all of that routing logic is quite tricky to manage in a kind of, not in a belittling way, but a sort of off-the-shelf solution, you know, something that's relatively generic.
Yeah, that sounds bad, but I don't mean to.
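As an editorial aside, the path-based routing Neil describes can be sketched roughly as a longest-prefix match from URL path to origin. This is only an illustration under stated assumptions: the routing table, the origin hostnames and the function name `pick_origin` are all hypothetical, and the sketch says nothing about how GTM itself is implemented on NGINX.

```python
from typing import Dict, Optional

# Hypothetical routing table: path prefix -> origin hostname.
# None of these names come from the BBC; they are placeholders.
ROUTES: Dict[str, str] = {
    "/news": "news-origin.example.internal",
    "/sport": "sport-origin.example.internal",
    "/sport/live": "sport-live-origin.example.internal",
    "/weather": "weather-origin.example.internal",
}


def pick_origin(path: str, routes: Dict[str, str] = ROUTES) -> Optional[str]:
    """Return the origin for the longest prefix that matches whole path segments."""
    best: Optional[str] = None
    for prefix in routes:
        # Match "/sport" and "/sport/...", but not "/sportsday".
        if path == prefix or path.startswith(prefix + "/"):
            if best is None or len(prefix) > len(best):
                best = prefix
    return routes[best] if best is not None else None
```

With this table, `pick_origin("/sport/live/football")` picks the more specific live origin, while `pick_origin("/sportsday")` matches nothing and returns `None`.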
That's fascinating. Like, we could probably spend three days, I mean, I spent a lot of time doing this, but I want to give some air time to Matt.
We can always come back to this. And I think the reason I want to dig in a little bit into that is just because I believe, Matt, that yours is quite a different approach in comparison.
So, do you want to shed some light on that? Yeah, so, really, it depends what service you're talking about.
So, essentially, GDS manage a number of services that they've built, but the main domain that we look after on gov.uk is www.gov.uk.
So, that's the guidance pages, start pages, and things like that.
So, that itself is built on Ruby, and then there is a CDN that sits in front of it, and a lot of the actual TLS termination and all the optimizations are done at the CDN level, and there are backups going on between those as well.
The CDN's with a competitor, I don't think I should mention their name, I don't know whether I should, but with a competitor of yours. And then we have different services that are on different CDNs as well, and different origins, and because we have such a different stack between the teams, it really depends on which team you're talking about as to how things have been set up. And then if you go broader out into government, there's different services on the service domain, and it's important to realize that when a user hits www.gov.uk, they are hitting the GDS service.
They go through the guidance system, and that sounded like a weapon. When it goes through all the guidance and you press the start button, you could then be offloaded to the, say, the tax website, HMRC or DWP, and essentially you're going to a completely separate department.
You're going to a separate, essentially a separate website.
It's on a different stack. It could essentially have a different front end, and that is purely managed by a completely different department, so even though the design and the look and feel is meant to be very similar, it's completely separate in lots of different ways. So yeah, it's a very different setup, and from a head of front-end point of view, one of our key strings that ties it all together is that branding and the commonality between it. The reason it has been built in that way is because we want to remove the complexity of a user needing to know, oh, for this particular topic, I need to go to this particular department.
As a user, do they need to know that?
Potentially not. All they want to do is get from A to B, complete the service, complete the guidance, and then exit from it at the other end and go, well, I've done what I needed to do.
The fact that they've gone through two or three different departments, it doesn't really matter.
So, Matt, if I can just ask a quick question.
I think I remember from the times I've used various government websites, so whereas we're kind of path-based groups, so slash news or slash sport, you're subdomains, right? Is that what lets you have those completely separate stacks, I guess?
Yes, exactly. So you have gov.uk, which is a top-level domain, and then you have a service domain as well, which is an eTLD as well, they're both eTLDs, actually, and then you have your eTLD+1, and there's about 200 and something, 217, maybe more services, and they've all flown out of my head.
I can't remember a single one as an example, so I apologise for that.
But yes, that's essentially how it's managed. So you go to a different service domain and you are then with potentially a completely separate department with a different stack and a different front-end, and that's how that works, yeah.
Yeah, so that's really interesting to me, because I think, how do you manage change there?
This was about evolving web tech, and say we want to, I don't know, put some new kind of CSS stuff in.
I'm not a web developer, right, so everything is wishy-washy there, I apologise profusely.
That's all within your own control, because you give it to the client and it will process what it wants.
When it comes to adding something like TLS, right, TLS 1.3 is new, but it's not brand new, it's been around for a while, it was in its draft stages for quite a while.
We're seeing big deployments, getting the support for it rolled out to different edge operations and different people running stacks, different servers, different kind of things, but then there's quite a difference between just turning it on, right, and people actually being able to use that.
One of the things that I found really important when I worked at the BBC was, for them there's, if I'm not mistaken, a mandate on accessibility of information and reach. Quite often, in the traditional sense, that is reach of broadcast signals, so how big and how tall and how strong are your antennae and how much coverage can you get there. But the transition of the organisation from just a TV and radio broadcaster to online services meant those things too, so it's as much about being accessible on the Internet, like from an IP routing perspective, as also providing technologies that clients can use and not marching too quickly with the drumbeat of progression. Although it's great, and you know disruption is a big word in some parts of the world, I think the difference I had from that time is understanding that it can really affect other people, especially parts of society that maybe wouldn't increase your bottom line, in a profit margin kind of view of stuff, but you still have effectively a duty to make sure that they can continue using those services, and accessibility is a big thing.
I don't want to dig into that this time because I think it would need its own show and have experts in that field to give it due credit. But in terms of the technology, look at things like old SSL versions that we know are insecure compared to newer ones, and how that affects the ability of people to access something like their car tax thing that they need to fill out, otherwise they'd get a fine. I believe you've both got some experience in these fields of looking at transitioning from legacy TLS. Maybe we want to define legacy TLS here, I'll nominate Neil to do that, and say, well, you know, are you doing 1.3 on the tech stacks that you've talked about, have you decided what the minimum threshold version is, are there special services you provide? For instance, with the BBC, set-top boxes were like the usual long pole that stuck out, because they would be stuck on a baked-in version of a TLS stack, on a piece of equipment that wouldn't get any more vendor updates, that kind of thing.
I think you've written a blog post on this, Neil.
Yeah I have, it's kind of both painful and near and dear to my heart.
There's quite a lot there actually, I'll steer slightly clear of TV, that's quite a different beast and I'm absolutely not an expert, what I would say though is it's very surprising sometimes how even very modern TVs have, what's the polite way to phrase it, surprisingly retro TLS stacks, but yeah I mean in terms of the web you're absolutely right, we've got, so we've got both the duty to serve the UK audience and actually also a mandate from the UK government to produce some sites which are called, they're under a sort of banner called World Service, so the idea of that is to try to spread, this is going to sound really terrible and imperialistic and I hope it isn't, but it's kind of the sort of UK's perspective on news and current events around the world, with I guess the intention that that is, you know, just sort of representing some of the world and kind of giving people perhaps an alternative view than they'd see in their own country.
So a lot of that is on the web these days. There are the World Service radio stations and so on, but we have some language-specific sites, quite a lot of them actually, around the world, and they're generally sort of centered on news and sport. What we tend to find is that the audience demographics in certain regions of the world, say Sub-Saharan Africa for one, have really, really different tech patterns to what you might see in the UK or the US or Europe, and it can be really quite tricky to sort of balance the two. So we have that sort of global shared host name, or pair of host names, and that means that we have effectively one TLS configuration for the world.
We could slice it and dice it geographically but, you know, that brings a lot of complexity and potentially a lot of problems.
So we present one uniform config in terms of TLS and we see, say in the UK, we're about 85% TLS 1.3, so that's pretty cool.
Big chunk of TLS 1.2, that's pretty cool. 1.0 is something like 0.7%, so it's pretty low, but then if you were to look at perhaps one of the countries in Sub-Saharan Africa, and you might look at, say, our Amharic website, you could see the percentage of TLS 1.0 jumping to perhaps 20%.
So if we just took the decision based on a sort of global aggregated view of, oh, not many people are using 1.0, let's turn it off, we'd have created a complete hard fail for 20% of our World Service audience in, say, Eritrea or Ethiopia or, you know, lots of countries where things like feature phones are quite prevalent. They're really good at recycling tech and keeping things running for years and years, which is amazing, it's brilliant, but what it means is that it's quite tricky for software.
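As an editorial aside, the gap between a global aggregate and a per-country view can be sketched in a few lines. The sample rows below are made up to echo the rough proportions mentioned in the conversation; they are not BBC data, and the function names and the 5% threshold are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (country, TLS version, request count). Entirely made-up sample data.
Row = Tuple[str, str, int]
SAMPLE: List[Row] = [
    ("GB", "TLSv1.3", 8500), ("GB", "TLSv1.2", 1430), ("GB", "TLSv1.0", 70),
    ("ET", "TLSv1.2", 80), ("ET", "TLSv1.0", 20),
]


def global_share(rows: List[Row], version: str) -> float:
    """Fraction of all requests, across every country, using `version`."""
    total = sum(count for _, _, count in rows)
    hits = sum(count for _, v, count in rows if v == version)
    return hits / total if total else 0.0


def share_by_country(rows: List[Row], version: str) -> Dict[str, float]:
    """Fraction of each country's own requests using `version`."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for country, v, count in rows:
        totals[country] += count
        if v == version:
            hits[country] += count
    return {c: hits[c] / totals[c] for c in totals if totals[c] > 0}


def countries_over(rows: List[Row], version: str, threshold: float) -> List[str]:
    """Countries where `version` exceeds `threshold` of local traffic."""
    shares = share_by_country(rows, version)
    return sorted(c for c, s in shares.items() if s > threshold)
```

On this sample the global TLS 1.0 share is under 1%, yet `countries_over(SAMPLE, "TLSv1.0", 0.05)` still flags one country where a fifth of requests would hard fail.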
So, yeah, that's certainly sort of a big chunk of what we see as a challenge and actually, you know, I think lockdown's really highlighted that there are a lot of families and folks within the UK who have got, you know, older hardware and we certainly don't want to be cutting them off either, so we're kind of walking this tightrope of trying to maintain the sort of older tech that's sort of well established and works on loads of kit for as long as possible whilst it's not either a huge maintenance burden or a horrible security hole.
So, yeah, that's kind of where we find ourselves and we're sort of struggling with how to sort of manage that deprecation and removal process in a way that's not completely jarring and awkward for the end user because if we pop up a message on your phone or your computer that says, hey, you need to upgrade to, you know, an iPhone 11, then, you know, surprise, tons of people around the world are going to say, yeah, you can't do that, you know, that's a hard cut off for the BBC and our relationship with them is gone.
So, you know, that's no good. So, yeah, I'm going to stop talking now because I've been talking for a long time.
I was just going to say I very meanly pulled up your blog post from, I think, February 2020.
Maybe I might be wrong, but when we were just talking about TLS 1.0, like we see, we've got a breakdown.
I presume some of these numbers are a bit changed even in the meantime.
So, you can see, like, that very low percentage. When you look in the round, you know, a version, like you say, the impact to a population can be quite significant.
So, I think measuring those things is quite tricky.
It depends on what your service is and how you measure those, which we might come on to, like, in a bit once we've heard from Matt.
But the thing I really found interesting from this blog post was the Bosnia and Herzegovina statistics here, where, like, I don't know if you can go into that or if we have the time, but, you know, of the number of requests measured there, 100% of them were TLS 1.0.
Yeah, I'm at a bit of a loss to explain it, I'll be completely honest.
To the extent that the data is correct, which, you know, for most of the rest of it is pretty well, it's pretty solid.
I don't have any, no reason to doubt it. You know, software caveats aside.
Yeah, it's incredible. And you look at the change in those 15 months or so, it's virtually disappeared.
So, whether that's a huge prevalence of TLS proxies in the country, I don't know, I'm afraid.
But yes. I'm not asking for an explanation.
I just want to, like, use it as an example to highlight to people that, like, the Internet's weird, man.
It's just, there's things you can assume and then there's things you see and you don't understand and they can come and go.
So, you know, maybe that was a fleet of devices all getting firmware update, I don't know.
I wouldn't want to speculate too far. But, you know, those crescendo moments can kind of become the green light to actually transitioning some technology that previously you might have said you can't do.
And it's not a case that you haven't done the homework and you haven't set everything up.
But often it's the opposite, in that you're kind of chomping at the bit to be able to roll out something and everyone's asking you where it is and you have to kindly say, well, we have to hold back the people at the leading edge because the sliding window, kind of the weakest link in the chain, can't be let go.
But I think, Matt, have you got anything to say on TLS and SSL?
Yeah. So, in terms of ours, when I last checked, 1.0 was at 0.08% and 1.1 was at 0.17%.
Most people were on 1.2. And then from March this year, 1.3 was enabled.
Unfortunately, I don't have what level of 1.3 we're at at the moment.
But, yes, we still have 1.0 and 1.1 enabled. And even though it's way past the point where lots of CDNs are starting to disable them and lots of services are starting to disable them, you've touched on the point exactly.
The problem we have from a government website is that, as a user, you should be able to get to this guidance.
And, as Neil says, with 1.0 and 1.1, if we were to disable them, on an older device it is a hard fail.
There's no error message saying you need to upgrade the device.
They're just going to get a standard error page. It's just going to be nothing there.
And the problem, even though it's such a small percentage, that small percentage could be coming from a particular demographic.
Say, for example, it's a library somewhere in the country where they're running very old hardware, and people go to a library to use the Internet and they need to get to government services.
If we were to disable 1.0 and 1.1, suddenly, without warning, that is completely removed.
And that is, it's a fine balance.
It's a fine balance between actually we have the technical knowledge to say we shouldn't be using 1.0 and we shouldn't be using 1.1 now because there are technical issues with them.
They're not as secure. But there's that balance between the security aspect and the accessibility aspect.
And it's really hard to figure out where to sit exactly.
Because if you were on a transactional service, for example, there'd be PCI compliance.
You'd have to be on 1.2 plus to do that.
But if you're serving static content, is it then okay to be having 1.0 and 1.1?
It's really difficult and it's something that we still have enabled. And I'm not too sure when exactly we will look at disabling them.
Just simply because of how do you identify where that group of people are?
So, unless it gets down to zero, which I can't see ever happening, personally, it's very difficult to know where to switch it off.
And from an actual stats point of view, it's around 90 odd percent.
Actually, I think it's higher than that. 90 plus percent of our users are from the UK.
So, yeah, it's an interesting problem and difficult problem at the same time.
Yeah, and I think as you're talking, I was just thinking that sometimes it's not even the fact that failure is always bad, right?
But there can be points in time when failures are particularly bad, moments in history, defining moments of a country, where people's access to information is paramount and it could be a matter of life and death.
So, I don't want to get too serious on that. But the other thing that came into my head is that it's not just about TLS on its own, but the other technologies around it that try to move on from this old-school style of having two different schemes, HTTP and HTTPS. You have things like HSTS, or HTTP Strict Transport Security, which probably is the acronym expansion, and there's cases where people have turned that on and then something's gone wrong and then they can't turn it off because of the interactions, kind of the symbiosis of clients and servers, and good, well-intentioned mechanisms like OCSP and other technologies that augment the security going a bit awry.
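As an editorial aside, one common way to limit the "can't turn it off" risk with HSTS is to start with a short `max-age` and raise it in stages once nothing has broken. The sketch below only builds the `Strict-Transport-Security` header value; the function name and the ramp values are assumptions for illustration, not anything the speakers described deploying.

```python
from typing import List


def hsts_header(max_age_seconds: int, include_subdomains: bool = False) -> str:
    """Build a Strict-Transport-Security header value."""
    if max_age_seconds < 0:
        raise ValueError("max-age must not be negative")
    value = f"max-age={max_age_seconds}"
    if include_subdomains:
        value += "; includeSubDomains"
    return value


# Hypothetical staged rollout: five minutes, a day, a week, then a year.
RAMP_SECONDS: List[int] = [300, 86400, 604800, 31536000]
```

A `max-age` of 0 tells a client to forget the policy, but the client still has to reach the site successfully over HTTPS to receive it, which is why a long-lived policy is hard to back out of when HTTPS itself is what broke.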
And these things are so tricky to get right, but I am far from an expert in them as well and so I just see like you planning and planning for migration and evolution is one of these important things and I think there's some good resources out there, but yeah, it's pretty scary and I think the difficult thing I also have seen in the past is the observability of failure, like how when things go wrong what do people do?
Do they ring their ISP and say the web page doesn't load? Because the kind of trained and learned behavior is, if you see a white page in your browser, or like a blank page with a dinosaur on it or whatever, that maybe the Wi-Fi is off, maybe DNS isn't working, if you understand what DNS is.
But these kind of different factors, there's a reporting into your web page to say well if you got a bad status code you might be able to from your server logs detect that and start to figure out what might be happening like misdirected requests, these kinds of things.
But at the layer before you've established like a level seven connection to a web server, there's a whole bunch of other pipes that could be broken and burst and leaking.
So I believe there's some technologies that are out there that can help some of those aspects.
Network error logging is one of them which I'm going to pass over to Neil to explain.
Yeah, so I came across network error logging in a chat with my friend Scott Helme when it was still like super, super new, and I mean I remember him showing me the draft spec and, you know, the ink wasn't dry on it.
But just looking at it and thinking wow this is just like the missing thing.
So you know we've, as Lucas was saying, we've had our server access logs, our traffic manager access logs, our CDN access logs for years and years and you know you can inspect them and find when stuff breaks.
But you can only see stuff that's broken in those if somebody's got to that.
So if they haven't you know you've just got this huge blank empty space and you're kind of you know none the wiser.
So what network error logging does is attempt to sort of fill that gap so it gives you the client's view of errors effectively with your website.
What you'll do is configure a response header from your service and that instructs a client which has support to contact an endpoint which you tell it whenever it sees an error and it splits those errors into a variety of classes.
So they kind of broadly walk up the stack from you know sort of network up through DNS and to HTTP and you can receive reports either sampled or the whole lot when something goes wrong at any one of those levels.
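As an editorial aside, the response headers Neil describes can be sketched as follows. The field names (`group`, `max_age`, `endpoints`, `report_to`, `failure_fraction`) follow the W3C Network Error Logging and Reporting API drafts as implemented in Chromium-based browsers; the endpoint URL, group name, default values and the helper name `nel_headers` are assumptions for illustration.

```python
import json
from typing import Dict


def nel_headers(report_url: str,
                group: str = "network-errors",
                max_age: int = 86400,
                failure_fraction: float = 0.01) -> Dict[str, str]:
    """Build Report-To and NEL header values for a sampled NEL policy."""
    if not 0.0 <= failure_fraction <= 1.0:
        raise ValueError("failure_fraction must be between 0 and 1")
    # Report-To names a group of endpoints the client may send reports to.
    report_to = {
        "group": group,
        "max_age": max_age,
        "endpoints": [{"url": report_url}],
    }
    # NEL points at that group and sets the share of failures to report.
    nel = {
        "report_to": group,
        "max_age": max_age,
        "failure_fraction": failure_fraction,
    }
    return {"Report-To": json.dumps(report_to), "NEL": json.dumps(nel)}
```

The default `failure_fraction` of 0.01 mirrors the one percent sampling mentioned later in the conversation; raising it trades more report traffic for fewer missed failures.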
So if your edge service is completely flat out rejecting TCP connections you'll get a ton of those in.
If you've broken your DNS you'll get a ton of those reports in. TLS as well and as I say HTTP which has a lot of crossover with your access logs because obviously you've got a fair way up the stack by that point.
But depending on what you have or haven't got access to in your stack that can still be really quite useful and perhaps you could tie it in with other reports.
So network error logging is really that client's view.
So you have to build a reporting endpoint but it's really quite a simple thing to do and then you just instruct the client.
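As an editorial aside, the core of such a reporting endpoint can be sketched as a function that takes the JSON body a browser posts and counts reports by layer. The report shape (a JSON array of objects with a `type` of `network-error` and a `body.type` such as `tcp.refused` or `dns.name_not_resolved`) follows the Network Error Logging draft; the function name and the sample payload are assumptions, and a real endpoint would also need an HTTP server, validation and storage.

```python
import json
from collections import Counter


def tally_reports(payload: str) -> Counter:
    """Count network-error reports by the layer prefix of body.type."""
    counts: Counter = Counter()
    for report in json.loads(payload):
        # The same endpoint may receive other report types; skip those.
        if report.get("type") != "network-error":
            continue
        error_type = report.get("body", {}).get("type", "unknown")
        # "tls.cert.authority_invalid" -> "tls", "tcp.refused" -> "tcp".
        counts[error_type.split(".", 1)[0]] += 1
    return counts


# Made-up payload in the shape a supporting browser would send.
SAMPLE_PAYLOAD = json.dumps([
    {"type": "network-error", "url": "https://www.example.com/",
     "body": {"type": "tcp.refused", "phase": "connection"}},
    {"type": "network-error", "url": "https://www.example.com/",
     "body": {"type": "tls.cert.authority_invalid", "phase": "connection"}},
    {"type": "network-error", "url": "https://www.example.com/",
     "body": {"type": "dns.name_not_resolved", "phase": "dns"}},
    {"type": "deprecation", "body": {}},
])
```

Grouping by prefix gives the "walk up the stack" view described above: a spike in `tcp` or `dns` counts points at problems that never reach the access logs.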
So it's kind of Chromium-based browsers right now, as far as I know.
I don't think there's Firefox or Safari support.
I'm probably going to be told I'm wrong by somebody but yeah it's kind of super cool.
We've certainly caught some issues that way. So we have found things like, from some perspectives, slightly incomplete TLS chains.
As anybody who's configured a TLS cert chain in the past will know, there's quite often more than one way you can route through a chain, and a big part of what you're concerned with, and a barrier sometimes, is the age of the roots that everything chains back to.
So, the absolute root of trust. We actually caught an error when we deployed our new traffic managers, when we deployed GTM: it turned out that we had effectively configured just the shortest route (R-O-U-T-E) to the root (R-O-O-T), and we'd forgotten to allow the slightly longer path to an older root.
So we had a load of Android 4 and 5 devices which, really weirdly, had a really old operating system but were running modern Chrome, so they could do network error logging and they knew how to do all this modern TLS stuff. But they were on an old OS, so their root store was quite old, and they just didn't have a new enough root in there from our CA.
So we caught that. We've caught some TCP issues as well.
We've seen some countries around the world blocking our website. We've detected that with network error logging.
So yeah, it's pretty cool and, as I say, it fills that big void that we've just had for years, which, until somebody points it out, you might not necessarily spot.
But yeah I think it's a cool bit of tech and we've been running that a little while now.
We're just about to crank up our report percentage.
We're only on one percent right now so it means we miss a little bit.
It's kind of hard to sort out the signal from the noise but yeah very cool very cool thing and if you've not had a look at it I would definitely recommend it.
It's pretty useful. Thank you.
I must admit I've never actually looked at that, so that sounds really interesting. But you touched on it at the end.
How do you actually filter it? Because the amount of data you must be getting through must be ginormous.
So how are you actually filtering that data? Have you built your own tools internally, or are they third-party tools that you're using?
So there are some third-party tools now. I believe some of the RUM providers are providing this sort of service, so they'll have a reporting endpoint that you can use, and then a load of reporting, filtering, graphing and alerting that can be driven off of that.
We've built our own because, it's probably similar for you, Matt, but procurement is quite hard.
We have a lot of hurdles and hoops and all that kind of stuff, so if it's something small like that, particularly something that's not really a big burden to run, you'll quite often see that built.
So I built the one that we use. It's pretty easy: it effectively takes a block of JSON in and we push it through a little pipeline.
It does a bit of enrichment and then it goes into BigQuery, as quite a lot of our access logging stuff is in BigQuery, and then we drive Grafana out of that through Influx.
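The "block of JSON in, bit of enrichment, rows out" step might look something like the sketch below. It assumes the batch format defined by the Reporting API (a JSON array of reports, each with `age` in milliseconds, `type`, `url`, `user_agent` and a `body`); the output column names and the choice of enrichments are hypothetical, and the BBC's actual pipeline is not described in any more detail than above.

```python
import json
from urllib.parse import urlsplit


def flatten_reports(payload, received_at):
    """Turn one POSTed batch of reports into flat rows ready for a data store.

    payload     -- the request body, a JSON array of report objects
    received_at -- Unix timestamp (seconds) at which the batch arrived
    """
    rows = []
    for report in json.loads(payload):
        # The same endpoint can receive other report types; keep only NEL ones.
        if report.get("type") != "network-error":
            continue
        body = report.get("body", {})
        rows.append({
            # 'age' is how long (ms) the client held the report before sending.
            "occurred_at": received_at - report.get("age", 0) / 1000.0,
            "host": urlsplit(report.get("url", "")).hostname,
            "error_type": body.get("type"),
            "phase": body.get("phase"),
            "sampling_fraction": body.get("sampling_fraction"),
            "user_agent": report.get("user_agent"),
        })
    return rows
```

Each row could then be written to whatever store drives the dashboards; the HTTP handler around this function only needs to accept the POST and return a success status.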
So yeah, it's been a bit of a struggle to work out a sensible way to present this.
I definitely haven't got it perfect, but it's a reasonable first blast and, as I say, we are detecting things.
We need to crank up the, so, there are two sampling rates you get with network error logging.
There is a failure fraction and a success fraction, and you can set them independently and you can set them to anything you want.
I don't know if it's some sort of a float.
I haven't investigated how long, but yeah, we've got it at one percent of failures currently, so we don't get tons and tons of reports.
It's a bit hard to sort the signal out, but if you know the right levers to pull within Grafana or a BigQuery query, which is always a weird thing to say, then yeah, you can find what you need.
We're just trying to increase the signal strength a little bit, but you definitely spot issues, so it would be worth a look if you're in the market for that sort of thing.
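One consequence of sampling at one percent is that each report received stands in for roughly a hundred real failures. Each NEL report body carries the `sampling_fraction` that was in effect, so a rough estimate of the true volume can be made by weighting every report by its inverse. The sketch below assumes rows shaped as dicts with hypothetical `error_type` and `sampling_fraction` keys; it is an illustration, not the BBC's method.

```python
from collections import defaultdict


def estimate_totals(rows):
    """Estimate real error counts per type from sampled reports."""
    totals = defaultdict(float)
    for row in rows:
        # Treat a missing or zero fraction as unsampled (weight of 1).
        fraction = row.get("sampling_fraction") or 1.0
        totals[row["error_type"]] += 1.0 / fraction
    return dict(totals)


if __name__ == "__main__":
    sampled = [{"error_type": "tcp.refused", "sampling_fraction": 0.01}] * 3
    print(estimate_totals(sampled))
```

The estimate gets noisier as the fraction shrinks, which matches the point above about raising the report percentage to strengthen the signal.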
That sounds awesome. We've had some questions come in, yay, thanks folks, and I think these questions transition nicely to the next bit of stuff we wanted to talk about. So if you don't mind me interjecting, and then Matt, I think you could probably take a significant fraction of the remaining time, I'll start with a written question that's been sent in by Michael Kennedy, and they say: a question for Neil and Matt.
At which point do you enable new features such as H3, ChaCha20-Poly1305 etc.? And then, a kind of follow-on to the network error logging stuff: would the BBC make the NEL solution open source?
Oh, I don't even want my code. Matt, do you want to go for the enablers?
Yeah, so very recently, at the start of the year, we enabled HTTP/2, and there's a few blog posts around about that, because it wasn't as easy as just enabling it.
For H3, again, we will be looking to the CDN that we are with, at least for gov.uk, to actually have it available.
Now, I would be very wary of enabling a beta sort of service on a government website, I must be honest, so it may take a little while for that to be enabled, for it to be battle tested and whatnot, but that's certainly something I'm excited about doing in the future.
In terms of when, that's a good question. I don't know. When will it be finalised, Lucas? Do you have any ideas?
So I'll get my hat on. Where has it gone? I can't find it. So yeah, I'm the co-chair of the QUIC working group, which is working on the QUIC transport protocol and its mapping to HTTP, which is known as HTTP/3.
We hit a big milestone the other day of submitting the family of documents to the IESG, or to our area director, to do some review and then forward them on.
So we're getting there. Some of them have been submitted, some are still getting some review. It's very much a "oh, you've referenced this thing and you've done it in a way that's a bit weird according to some formal thing".
So yeah, these are very much in-development tools. I don't want to go too much into the protocol nitty-gritty, but I do want to say something about just turning things on. The features are there to give you the potential to have improved things, whatever you might think "improved" means: yes, performance, but there's also reliability, so better quality of experience, or other use cases.
With H3, or even H2, some of the motivation there was to save server resources by having connection reuse through multiplexing, but those things don't just come for free.
And I think, Matt, some of the work you did in your blog posts, which I'll share a picture of in a moment, was to really understand what's going on. You've got these different domains that are maybe not sharded in the traditional sense of how people approach web performance optimizations, but are there by virtue of the way that the UK service is architected. You thought you understood H2, but in practice people or browsers do things slightly differently, and understanding that correlation with web perf was quite a journey, I believe.
Yeah, definitely. It was a really interesting learning curve, and I specifically remember I pitched it internally saying, "H2, it's the future, it's going to speed everything up, it'll be so simple, and we just need to enable it." Literally, in the document I
put together, it was something stupid like "I see no downsides to actually enabling HTTP/2". How naive I was.
I remember sitting downstairs after it was enabled, I was sitting just over there, and I looked at the graphs that came back and they'd all gone in the wrong direction, and I could not understand it. I was like, "But this is meant to help everyone. What has gone wrong here?"
I was digging into many, many different features and trying different things, pre-connecting and so on. I tried everything for over six weeks and we didn't manage to solve it at the time.
At that point I understood that, by enabling H2, for a significant fraction of people we were essentially slowing down the performance of gov.uk.
Now, if we hadn't been testing the web performance, it would have just been enabled and we'd have gone, "Yes, everything's good, we're on H2, everything works." But luckily we were very rigorous in what we were doing, and we noticed the issue and we turned it off. We went back to HTTP/1.1, because in my head, and for us as an organization, we can't justify turning something on that is actually making things worse for users.
To cut a long story short, there's lots in the blog post, but there was an issue with subresource integrity and how it was actually being used for JavaScript and for CSS. It was creating anonymous connections, which was basically adding time at the start of the actual connection.
So all of our metrics were out by a certain amount, because with H2 everything is trying to go down a single TCP connection, and if you break that TCP connection you've suddenly got a massive bottleneck. That was the bottleneck that pushed it over the edge and broke everything.
So once we fixed a few bits and pieces, removed subresource integrity from the CSS and JavaScript, and now we've removed the shard domain for the assets, it's now all over a single TCP connection on H2, and
we enabled it permanently, and all the graphs went in the right direction this time, which I was so, so happy about. That was a very good day when that happened.
That's great. And I think it's important to stress, well, maybe not stress, but to say that performance isn't just "I'm on a fiber connection on a big beefy desktop, it's great". It's to realize that people's sole access to vital services can be on a very cost-effective phone, on a cost-effective plan with data that they want to minimize the usage of.
The more you can minimize the spending, and the faster you can do that, the more you really make a service accessible, in a way that's not about web accessibility but in terms of access. "Accessibility" is a weird phrase there.
So yeah, all of this is a trade-off, obviously, but it's about where you can cut those lines and improve stuff. The thing that I really appreciate as a user of UK services is how quick and responsive they are, to the point where, obviously, there are things that have to happen on the server side, processing payments or filling in forms, and even those are responsive, or at least give feedback as to what's happening: a "please wait there, don't hit reload again, otherwise all sorts of bad stuff could happen".
Got the dog in the background. So this is a great time: we have a phone-in question with a voice, and I'm going to try to hit a magic button and see if it'll play live into this thing. It's maybe addressing something we talked about a little bit earlier, but why not, let's give it a go.
We've got nine minutes on air. I know there's some other topics, but I see a button. There might be a slight delay so I'll fill some time. I'm going to hit play again; we might have it twice, a replay attack. Uh-oh. Okay, that's odd.
Well, the technology to make this work... In case people are curious, Cloudflare TV runs via a Zoom teleconference meeting, and the fourth participant should enable, like, a play
out of that audio, but it didn't work. Oh well. If one of the team is listening and watching, our remote call-in participant is not there, so if they ping me and fix it we'll play this question at the end. Such is the march of progress that things fail.
Right, sorry for that interlude, chaps. Where do we go on to, with eight minutes remaining? Anything you feel you haven't said that you want to get off your chest?
I was going to mention one thing which we've just started to work on, which that's just kind of reminded me of. I guess there's been a fair amount of noise about Google's Web Vitals: effectively a boiled-down set of metrics that you can use to understand how well your web pages are performing.
We've been quite interested in that. I've been speaking to Google pretty regularly about it, to their team, and one of the things we're just about to start doing is reusing our network error logging report infrastructure to receive anonymous reports from browsers about Web Vitals metrics.
So we're pretty keen to get that in, and to do that before we do some migrations. Exactly as Matt was saying, we want to be able to measure the impact of what we're doing. We want to do the old going-back-to-school thing: hypothesize what's going to happen, do it, measure it, improve it, that kind of stuff.
So I was just making a quick mention of that because I think it's quite an interesting thing. I don't know how well it's going to work out on the real Internets, because, you know, they're different to what you think. But hopefully we'll get that enabled over the next few weeks or so and we'll see how that goes, because we've got a big DNS migration underway, so that should make things a bit faster, especially for folks further from London. So I'm quite excited about that. It's been a lot of months of work.
Oh,
I see, Matt, you've got a Brotli T-shirt on. Is that a hint at something you're working on, or just a coincidence?
So eventually I would like to enable Brotli. I'd imagine we would do it at the CDN level rather than at the origin. I know Brotli is quite resource intensive to encode and decode and whatnot, so that's something that we would like to enable eventually.
I'm also trying to pitch and move forwards with the collection of real user monitoring data, because I think it's vital to understand. We've touched on it a little bit: there are certain areas of the country which are very rural. In a past, well, it seems like a previous life, I used to commute to London, and that was always so frustrating on the train, where you go in and out of hot spots. You're wanting to read, you're wanting to browse, and there are a lot of people in similar situations all over the country.
So the actual collection of this real data from users will allow us to say we have a certain demographic of users who are really struggling with this, and it's because of a certain device, or a certain connection, or because we have particular pages where there's a bottleneck on that page. We don't have the visibility at the moment to actually identify that.
I think that's the key to improving performance: to actually identify the issue, verify the issue and be able to iterate. If you're iterating in the dark, who knows what you're doing?
And to be able to measure these things now, to see what effect the changes had, but also in the future to track regressions, and not just the evolution of the technology, the protocols, let's call them Internet protocols, but also the evolution of hardware, things like that, I think.
We're getting close to the top of the hour; it's going to be over soon. I'm going to have one more attempt to play this
question and see if that works, and then I'll give both of you an opportunity for a final closing statement, if we have the time. We've got three minutes 30. Let's see. That's the dog, that wasn't the calling user, if you heard that.
No, I don't hear anything.
Oh, okay. Well, I'll read the question, because we might not get it, but it's been transcribed. It's a bit weird in that it's been transcribed slightly oddly, but it said: "This is Jason from San Francisco calling. This is a fantastic discussion so far." One question to the folks here is about using old versions of TLS. He understands not disabling the connections if possible, but would it make sense to flag them, and tell people that they're using an old version and that maybe there's something they could do? Because unless you're informed, it's hard to know that you're doing something, not wrong, but where maybe there's a better option. Is that anything that either of you have tried?
Yes, just to jump in quickly, we used to do that on gov.uk. We still have an old browser cookie banner to tell you when you're using a deprecated or an older browser, essentially, and we used to do the same with TLS as well. For an old version of TLS we utilized the same space. But we've changed our cookie policy and cookie guidance and things like that, and we've actually removed that functionality now. So yes, that's something that we did try for a time.
So you say you removed it. Was that because it wasn't working, or was it just not enough benefit to maintain?
There was a question around how many people were seeing it and what feedback we were actually getting, and there were other things to do with the use of cookies on the website. That was one of the main reasons to remove it, because we were using cookies to actually enable the functionality, to track it in GA and whatnot, and
we sort of removed that for that reason as well.
Okay. The reason I ask is we have discussed the idea of doing effectively that: trying to use our traffic managers to flag deprecated cipher suites and that kind of thing, and to inform the user, but to try and frame that in user-compatible language. So not "you're using an old cipher suite"; it'd be more like "come this date, you won't be able to access this website with the current browser; here's a list of ones that we're going to support". But it's a lot of work and we haven't quite got there yet, we haven't, for users in Africa particularly, actually, as I mentioned earlier. So we'd love to move on from that, but it's super tough to do.
Well, okay, thank you very much. Sorry to interject there, but we've got 25 seconds left, so I just want to take the opportunity to thank you both very much for making the time to come on the show and to talk about not just boring H3 but the wider context, the things that are important to some people, and some challenges that you have. I'd like to thank you for providing the service for me and others around the world. Bye.
Awesome. Excellent. Thank you.