We don't have a manual for this, I'm sorry.
The good news is that a large share of websites can be read and processed.
The bad news is that every case is unique. In many cases it cannot be done without some knowledge of HTML and basic programming skills, and sometimes there is no way at all (a login is required, etc.).
The problem starts at the very beginning: somebody forgot to provide a suitable machine-to-machine output from the broadcast automation. As a result, some information may be lost, and a huge package of 'information' has been added on top of it: text formatting, page layout, ads and so on. You end up with 1 MB of rubbish which somewhere contains the 64 characters of the original text, and the rest of the data is entirely useless for recovering the information we want to collect. It is therefore preferable to pick up the text as close to the original source as possible. A good webmaster should be able to provide appropriate help and access to the information.
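If you do have to dig the text out of the web page yourself, the usual approach is: download the page, then extract only the small fragment you need. Below is a minimal Python sketch of that idea. The URL and the <span class="nowplaying"> element are purely hypothetical placeholders; you have to look at the real page's HTML source to find where the text actually lives, and the pattern will be different for every site.

# Minimal sketch: fetch a page and pull out one small piece of text.
# URL and class name are placeholders, not a real station page.
import re
import urllib.request

URL = "http://example.com/onair.html"   # hypothetical page address

def fetch_now_playing(url=URL):
    # Download the whole page (the "1 MB of rubbish")...
    with urllib.request.urlopen(url, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    # ...and keep only the few characters we actually want.
    match = re.search(r'<span class="nowplaying">(.*?)</span>', html, re.S)
    return match.group(1).strip() if match else None

if __name__ == "__main__":
    print(fetch_now_playing())

This only works as long as the page keeps the same structure; if the webmaster changes the layout, the pattern has to be adjusted again, which is exactly why a proper machine-to-machine output from the automation is always the better solution.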