Comment by maxbond
If you can run inference in real time (while doing a video call), and you can extract text through your operating system's accessibility APIs (e.g., the application isn't doing its own bespoke text rendering), then probably. You'll still need to figure out where the entity appears on screen in order to censor it. (Accessibility APIs typically do expose that, as per-element bounding boxes.) And you'll need some way to get in between the OS and the screen share, like a virtual display or virtual camera; sketches of both pieces follow.
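
Here's a minimal sketch of the text-plus-position part, assuming a Linux desktop with AT-SPI and the `pyatspi` bindings (other platforms have equivalents, e.g. `AXUIElement` on macOS or UI Automation on Windows). It walks each application's accessibility tree and yields (text, screen rectangle) pairs, which is the raw material you'd feed to an entity detector. The function name `find_text_regions` is mine, not from any library:

```python
import pyatspi

def find_text_regions(root):
    """Recursively yield (text, (x, y, w, h)) for accessible nodes
    that expose both a Text and a Component interface."""
    for node in root:
        if node is None:
            continue
        try:
            text_iface = node.queryText()
            comp_iface = node.queryComponent()
        except NotImplementedError:
            # Node doesn't expose text or geometry; skip to children.
            pass
        else:
            text = text_iface.getText(0, text_iface.characterCount)
            ext = comp_iface.getExtents(pyatspi.DESKTOP_COORDS)
            if text.strip():
                yield text, (ext.x, ext.y, ext.width, ext.height)
        yield from find_text_regions(node)

desktop = pyatspi.Registry.getDesktop(0)
for app in desktop:
    if app is None:
        continue
    for text, rect in find_text_regions(app):
        print(rect, repr(text[:60]))
```

Note this only works for apps that actually implement the accessibility interfaces; anything rendering text itself (games, some Electron apps, remote desktops) won't show up, which is the caveat above.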
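
And a hedged sketch of the "get in between the OS and the screen share" part, assuming the `pyvirtualcam` package (which drives OBS Virtual Camera on Windows/macOS or a v4l2loopback device on Linux) and `mss` for screen capture, on a display at least 1920x1080. You'd then share the virtual camera in the call instead of your real screen. `censor` here is a hypothetical stand-in for whatever blurs or blacks out the rectangles found above:

```python
import numpy as np
import mss
import pyvirtualcam

def censor(frame, rects):
    # Hypothetical: black out each (x, y, w, h) region in place.
    for x, y, w, h in rects:
        frame[y:y + h, x:x + w] = 0
    return frame

with mss.mss() as grabber, pyvirtualcam.Camera(
        width=1920, height=1080, fps=30) as cam:
    monitor = grabber.monitors[1]  # primary display
    while True:
        raw = np.array(grabber.grab(monitor))  # BGRA screenshot
        # Crop to the output size and reorder BGRA -> RGB.
        frame = np.ascontiguousarray(raw[:1080, :1920, 2::-1])
        rects = []  # plug in rectangles from the accessibility pass
        cam.send(censor(frame, rects))
        cam.sleep_until_next_frame()
```

The awkward bit is latency: the capture, inference, and re-encode all sit in the frame path, so the entity detector has to keep up with the frame rate or you'll need to drop frames.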