We tested this hypothesis using earlier copies of the LinuxQuestions threads crawled by the Internet Archive in 2002. These pages contain the same thread content as the current site, but structured differently. Unfortunately, we found that the Internet Archive had only stored the sticky threads (administrator threads "stuck" to the top of the thread list), so we could not reuse our earlier data. Instead, we copied the post strings from the sticky threads into a training database, and were able to successfully train models for both the old and new forum versions from the same content. Annotating 20 threads with these new models gave perfect precision of 1.0, with recall of 0.91 for the 2002 threads and 0.94 for the 2008 threads. This recall is higher than in the original experiment (see Table 5), because the sticky threads are administrator discussions about forum policy and therefore contain fewer code snippets, which were the source of the recall errors previously. The 2002 model performed worse than the 2008 model because the initial post sometimes had a different set of attributes and so could not be matched. This experiment showed that SiteScraper can be retrained automatically across a structural change, so long as the same content is available in the new structure.
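The retraining idea can be illustrated with a minimal sketch (not SiteScraper's actual implementation): given a set of known content strings and a page in either the old or the new markup, locate the elements containing those strings and record their tree paths. The HTML snippets and the `train_paths` helper below are invented for illustration.

```python
from lxml import html

def train_paths(page_source, examples):
    """Return the tree paths of leaf elements whose text matches a training example."""
    tree = html.fromstring(page_source)
    return [tree.getroottree().getpath(el)
            for el in tree.iter()
            if not len(el) and el.text_content().strip() in examples]

# The same sticky-thread string appears in both page versions,
# but under different markup.
old_page = "<html><body><b><font color='red'>Forum rules</font></b></body></html>"
new_page = "<html><body><span class='title'>Forum rules</span></body></html>"
examples = {"Forum rules"}

print(train_paths(old_page, examples))  # ['/html/body/b/font']
print(train_paths(new_page, examples))  # ['/html/body/span']
```

Because the content strings are identical across versions, the same training database yields a working model for each layout.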
We found that the tag structure in the 2002 SiteScraper model was almost identical to that in the 2008 model, but the attributes were completely different. When LinuxQuestions first launched in 2002, external CSS files were rarely used, so many style attributes were embedded directly in the HTML elements. By 2008, class was the most commonly used attribute, with the style definitions stored in external CSS files. This is good news for SiteScraper: since re-styling can now be done outside the core document markup, LinuxQuestions, and presumably many other websites, will not need to change their HTML structure as often as in the past.
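This observation, that the tag skeleton stayed the same while attributes changed, can be checked mechanically. The sketch below (an assumed helper, not code from the paper, with invented HTML) flattens each page into its sequence of tag names, which ignores attributes entirely:

```python
from lxml import html

def tag_skeleton(page_source):
    """Flatten a document into its document-order sequence of tag names, ignoring attributes."""
    return [el.tag for el in html.fromstring(page_source).iter()]

# Inline styling attributes (2002) vs. class attributes with external CSS (2008).
page_2002 = "<html><body><table bgcolor='#cccccc' border='1'><tr><td width='100'>post text</td></tr></table></body></html>"
page_2008 = "<html><body><table class='threadlist'><tr><td class='post'>post text</td></tr></table></body></html>"

print(tag_skeleton(page_2002) == tag_skeleton(page_2008))  # True
```

The two versions compare equal on tags alone, which is exactly why a model keyed to structure rather than attributes survives this kind of re-styling.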
Nevertheless, while developing SiteScraper we found structural updates to be common, despite the separation of content from structure that CSS encourages. During our two-month development period alone, theaustraliandailycmu.com, amazon and all three stock sites updated their structure. That is roughly a quarter of our dataset in just two months, which indicates that coping with structural updates is a genuine problem that needs to be addressed.
In principle it would also be possible to automatically retrain the stock and weather websites, although this is harder because these sites generally do not preserve historical data.
On the other hand, stock and weather data are not tied to the websites that display them. If models were trained for a set of websites that all draw on the same data source, then when one model was broken by an update, it could be retrained using current data extracted via the other models.
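A hypothetical sketch of this cross-site retraining: a value scraped moments ago from a still-working site is used to relocate the same content in the redesigned site, yielding a new extraction path. The page, value, and `relocate` helper are all illustrative assumptions, not part of SiteScraper.

```python
from lxml import html

def relocate(page_source, known_value):
    """Return the path of the deepest element whose text equals the known value."""
    tree = html.fromstring(page_source)
    for el in tree.iter():
        if not len(el) and el.text_content().strip() == known_value:
            return tree.getroottree().getpath(el)
    return None

current_price = "42.17"  # e.g. a quote just extracted via another site's working model
redesigned_page = "<html><body><div id='quote'><span class='last'>42.17</span></div></body></html>"

print(relocate(redesigned_page, current_price))  # '/html/body/div/span'
```

The recovered path can then seed a fresh model for the redesigned site, with no historical archive required.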
Commerce websites and search engines pose a harder retraining challenge. These kinds of website do not preserve reliable historical data (prices fluctuate and search ranking algorithms change), and the data they serve is generated by the site itself, so it cannot be independently validated.
If SiteScraper is fortunate, a structural change will not coincide with a content update, in which case the model can be retrained in the interim. Otherwise, these websites fall into the third update category defined in Section 1, where both content and structure change, which is beyond the scope of SiteScraper. The only alternative is retraining from a specially-crafted query that yields an anticipated result, such as the Scrubyt example of searching for ruby on Google, but this approach is clearly not sustainable in the long term.
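The crafted-query approach can be sketched as follows: issue a query whose top result is known in advance, then test whether the stored extraction path still yields that anticipated result; if not, the model is broken and must be retrained. The query/result pair, the results page, and the XPaths below are invented for illustration, not taken from Scrubyt.

```python
from lxml import html

EXPECTED = "Ruby Programming Language"  # anticipated top hit for the query "ruby"

def model_is_valid(page_source, xpath, expected):
    """Check whether a stored extraction path still yields the expected hit."""
    tree = html.fromstring(page_source)
    return expected in [el.text_content().strip() for el in tree.xpath(xpath)]

results_page = (
    "<html><body><ol class='results'>"
    "<li><a href='https://www.ruby-lang.org'>Ruby Programming Language</a></li>"
    "</ol></body></html>"
)

print(model_is_valid(results_page, "//div[@class='r']/a", EXPECTED))  # False: model broken, retrain
print(model_is_valid(results_page, "//ol/li/a", EXPECTED))            # True
```

The weakness noted above remains: the check depends on the engine continuing to rank the anticipated result for the crafted query, which cannot be guaranteed over time.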