Kanneganti, D. (2022). Using recurrent neural networks and web crawlers to scrape open data from the Internet. The Young Researcher, 6(1), 60-71. http://www.theyoungresearcher.com/papers/kanneganti.pdf
Web crawling techniques in conjunction with Recurrent Neural Networks (RNNs) have been applied to several areas of data mining on the Internet, but how best to apply them to searching for open datasets has not yet been studied. Open data are data generated by the community, and they serve as effective alternatives to more centralized data collection, such as the US Census. Since the individuals and small organizations that collect open data often lack the infrastructure to make them widely available, open data portals serve as key access points that centralize open data. Unfortunately, due to a lack of funding, open data portals struggle to efficiently scrape large amounts of open data from the Internet. The purpose of this experiment is to bridge the gap between open data sources and open data portals by creating an algorithm that can quickly find open data on the Internet. Using an RNN and a focused web crawler, a search algorithm was developed that could scrape 1000 web pages per minute and identify open datasets with 85% accuracy; both metrics suggest that the algorithm is a significant improvement over existing open data collection methods. For future research in this field, this work suggests studying the application of automated open data collection and the implications of the proliferation of open data portals.
Keywords: web crawling, neural networks, open data, internet
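As a rough illustration of the focused-crawling idea summarized in the abstract, the following minimal Python sketch visits pages in order of estimated relevance and records those that look like open datasets. It is not the paper's implementation: the keyword-based `keyword_score` function is a stand-in assumption for the RNN classifier, the `WEB` dictionary is a hypothetical in-memory set of pages used in place of real HTTP requests, and the `threshold` and `max_pages` parameters are illustrative choices.

```python
import heapq
from typing import Callable, Dict, List, Optional, Tuple

# A page is (text, outgoing links). Hypothetical in-memory stand-in for the web.
Page = Tuple[str, List[str]]

WEB: Dict[str, Page] = {
    "https://example.org/": (
        "Portal home with open data links",
        ["https://example.org/data", "https://example.org/about"],
    ),
    "https://example.org/data": (
        "Download the dataset as CSV",
        ["https://example.org/file"],
    ),
    "https://example.org/about": ("About our team", []),
    "https://example.org/file": ("Open data dataset download csv", []),
}


def keyword_score(text: str) -> float:
    """Assumed stand-in for the RNN: fraction of cue phrases found in the text."""
    cues = ("dataset", "csv", "download", "open data")
    lowered = text.lower()
    return sum(1 for cue in cues if cue in lowered) / len(cues)


def focused_crawl(
    seed: str,
    fetch: Callable[[str], Optional[Page]],
    score: Callable[[str], float],
    threshold: float = 0.5,
    max_pages: int = 100,
) -> List[str]:
    """Best-first crawl: links found on higher-scoring pages are visited first."""
    # heapq is a min-heap, so priorities are stored negated.
    frontier: List[Tuple[float, str]] = [(-1.0, seed)]
    seen = {seed}
    found: List[str] = []
    visited = 0
    while frontier and visited < max_pages:
        _, url = heapq.heappop(frontier)
        page = fetch(url)
        if page is None:
            continue  # unreachable page; skip it
        visited += 1
        text, links = page
        relevance = score(text)
        if relevance >= threshold:
            found.append(url)
        # Each outgoing link inherits the relevance of the page that links to it.
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-relevance, link))
    return found


if __name__ == "__main__":
    print(focused_crawl("https://example.org/", WEB.get, keyword_score))
```

In this sketch the classifier and the crawler are decoupled through the `score` argument, so a trained model could be substituted for `keyword_score` without changing the crawl loop.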