Phillip already gave a great overview of how most bots operate, but I wanted to go into a little more detail, since I have some personal experience developing each of the types of bots he covered.
In Runescape, there was a large project (RSBot) that would copy the contents of memory from the Runescape client into its own local memory, where it could then inspect the entire state of the game with no risk of the client catching on. Finding where in memory to look for the pointers to the data took a bit of reverse engineering, but once the developers had done so, they exposed an API to take advantage of the information. The bot knew exactly what was where in the world: it would read an object's world coordinates and then transform them with the camera transform matrix to get the object's on-screen position. The hit masks were also readable, so it was trivial to determine exactly which region to move the mouse into in order to get the desired result.
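To make the projection and hit-mask steps concrete, here is a minimal Python sketch. It assumes a generic 4x4 row-major view-projection matrix and OpenGL-style normalized device coordinates in [-1, 1], and it models a hit mask as a screen-space polygon. Runescape's actual camera math and RSBot's API almost certainly differ, and the function names (world_to_screen, point_in_polygon, random_point_in_polygon) are hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple

Matrix4 = Sequence[Sequence[float]]
Point2 = Tuple[float, float]


def mat_vec(m: Matrix4, v: Sequence[float]) -> List[float]:
    """Multiply a 4x4 row-major matrix by a 4-component column vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]


def world_to_screen(point: Tuple[float, float, float], view_proj: Matrix4,
                    width: int, height: int) -> Optional[Point2]:
    """Project a world-space point to pixel coordinates, or None if not visible."""
    x, y, z = point
    cx, cy, _cz, cw = mat_vec(view_proj, [x, y, z, 1.0])
    if cw <= 0:
        return None  # point is behind the camera
    ndc_x, ndc_y = cx / cw, cy / cw  # perspective divide
    if not (-1.0 <= ndc_x <= 1.0 and -1.0 <= ndc_y <= 1.0):
        return None  # point falls outside the viewport
    sx = (ndc_x + 1.0) * 0.5 * width
    sy = (1.0 - ndc_y) * 0.5 * height  # screen y grows downward
    return (sx, sy)


def point_in_polygon(px: float, py: float, poly: Sequence[Point2]) -> bool:
    """Even-odd ray-casting test: is (px, py) inside the polygon?"""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > py) != (y2 > py):  # edge straddles the horizontal ray
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside


def random_point_in_polygon(poly: Sequence[Point2], rng: random.Random,
                            max_tries: int = 1000) -> Optional[Point2]:
    """Rejection-sample a uniformly random point inside the polygon."""
    xs = [p[0] for p in poly]
    ys = [p[1] for p in poly]
    for _ in range(max_tries):
        x = rng.uniform(min(xs), max(xs))
        y = rng.uniform(min(ys), max(ys))
        if point_in_polygon(x, y, poly):
            return (x, y)
    return None  # polygon is degenerate or extremely thin
```

Rejection sampling inside the polygon's bounding box keeps the click point random, so repeated clicks on the same object don't all land on the same pixel.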
The bot also provided a lot of debugging information, such as annotations that showed the developer which tile coordinates were where, which ID a given object had, which ID belonged to a given item, and so on. Bot scripts were built on top of this information, and writing them was actually quite simple. The framework provided many utility functions, such as move_to(world_coordinates) or mouse_move(x, y), which would perform the specified action in a somewhat believable way (moving the mouse along a random spline, mixing movement via the minimap with movement via the main screen, and so on).
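A mouse_move-style helper along the lines described above could be sketched as follows. This is not RSBot's implementation: humanlike_mouse_path is a hypothetical name, and using a cubic Bezier curve with randomly offset control points plus smoothstep easing is just one common way to get a curved path with non-uniform speed.

```python
import math
import random
from typing import List, Tuple

Point = Tuple[float, float]


def cubic_bezier(p0: Point, p1: Point, p2: Point, p3: Point, t: float) -> Point:
    """Evaluate a cubic Bezier curve at parameter t in [0, 1]."""
    u = 1.0 - t
    x = u**3 * p0[0] + 3 * u * u * t * p1[0] + 3 * u * t * t * p2[0] + t**3 * p3[0]
    y = u**3 * p0[1] + 3 * u * u * t * p1[1] + 3 * u * t * t * p2[1] + t**3 * p3[1]
    return (x, y)


def humanlike_mouse_path(start: Point, end: Point, rng: random.Random,
                         steps: int = 30, spread: float = 0.3) -> List[Point]:
    """Return steps + 1 cursor positions from start to end along a random curve."""
    dx, dy = end[0] - start[0], end[1] - start[1]
    dist = math.hypot(dx, dy)
    if dist == 0:
        return [start]  # already at the target
    # Unit vector perpendicular to the straight line, used to bend the path.
    nx, ny = -dy / dist, dx / dist

    def control(frac: float) -> Point:
        # A point partway along the line, pushed sideways by a random amount.
        offset = rng.uniform(-spread, spread) * dist
        return (start[0] + dx * frac + nx * offset,
                start[1] + dy * frac + ny * offset)

    c1 = control(rng.uniform(0.2, 0.4))
    c2 = control(rng.uniform(0.6, 0.8))
    path = []
    for i in range(steps + 1):
        t = i / steps
        eased = t * t * (3 - 2 * t)  # smoothstep: slow start, fast middle, slow end
        path.append(cubic_bezier(start, c1, c2, end, eased))
    return path
```

Feeding the returned points to whatever actually moves the cursor, with a short delay between them, gives a path that bends slightly differently on every call while always starting and ending exactly where requested.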
Runescape also lets you drastically reduce the quality of the graphics. That made it fairly easy to build certain kinds of bots by capturing the screen and applying some basic computer vision techniques to construct a model of the world. I made both a curse bot and a smelting bot using this technique, and both worked quite well. The bot would take each frame, increase the saturation as much as it could, and then try to extract patterns from it, from which it generated a probability map of click zones.
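The saturate-then-score pipeline could look roughly like the sketch below. It is an illustration under stated assumptions, not the original bot's code: frames are plain lists of rows of (r, g, b) tuples, the target object is identified by a single hue, and the "probability map" is simply each matching pixel's count of matching neighbours, normalised to sum to 1. All function names are hypothetical.

```python
import colorsys
import random
from typing import List, Optional, Tuple

RGB = Tuple[int, int, int]
Frame = List[List[RGB]]  # frame[y][x] -> (r, g, b), each 0-255


def boost_saturation(frame: Frame, factor: float = 3.0) -> Frame:
    """Multiply every pixel's HSV saturation by factor, clamped to 1.0."""
    out = []
    for row in frame:
        new_row = []
        for r, g, b in row:
            h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
            nr, ng, nb = colorsys.hsv_to_rgb(h, min(1.0, s * factor), v)
            new_row.append((round(nr * 255), round(ng * 255), round(nb * 255)))
        out.append(new_row)
    return out


def hue_distance(h1: float, h2: float) -> float:
    """Distance between two hues on the [0, 1) colour wheel."""
    d = abs(h1 - h2) % 1.0
    return min(d, 1.0 - d)


def probability_map(frame: Frame, target_hue: float, hue_tol: float = 0.05,
                    min_sat: float = 0.5, radius: int = 1) -> List[List[float]]:
    """Score matching pixels by how many matching neighbours they have."""
    height, width = len(frame), len(frame[0])
    match = [[0] * width for _ in range(height)]
    for y, row in enumerate(frame):
        for x, (r, g, b) in enumerate(row):
            h, s, _v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
            if s >= min_sat and hue_distance(h, target_hue) <= hue_tol:
                match[y][x] = 1
    scores = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if not match[y][x]:
                continue  # never click a pixel that isn't the target colour
            scores[y][x] = float(sum(
                match[yy][xx]
                for yy in range(max(0, y - radius), min(height, y + radius + 1))
                for xx in range(max(0, x - radius), min(width, x + radius + 1))))
    total = sum(map(sum, scores))
    if total == 0:
        return scores  # nothing matched
    return [[v / total for v in row] for row in scores]


def sample_click(prob: List[List[float]], rng: random.Random) -> Optional[Tuple[int, int]]:
    """Pick an (x, y) pixel with probability proportional to its score."""
    cells = [(x, y) for y, row in enumerate(prob) for x in range(len(row))]
    weights = [prob[y][x] for x, y in cells]
    if sum(weights) == 0:
        return None
    return rng.choices(cells, weights=weights, k=1)[0]
```

Weighting by neighbour count favours pixels deep inside a blob of the target colour over isolated stray pixels, which makes clicks on the edge of an object (or on noise) less likely.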