Beyond Web Scraping: Unlocking Hard-to-Reach Data with Advanced Strategies

Web scraping is an integral step on the road to obtaining insight. But it’s only one step. To obtain the type of intelligence needed for improved corporate and governmental decision making, organizations need a comprehensive data strategy. Yes, you have to scrape hard-to-reach publicly available information (PAI). But you also have to know where to find PAI that meets your specific intelligence needs. You have to confirm its accuracy. Then you must structure it, enrich it, store it, and audit it. It’s a complicated process, made easier using advanced intelligence solutions.  

With these solutions, what type of insight can you expect to obtain?  

Consider Babel Street Data’s work in retail. Retailers and retail analysts know that data drives major decisions on purchasing, staffing, advertising, and more. However, collecting data from every retail site now in operation would be a virtually impossible task. Babel Street Data has found that the top-seller lists of a single major online retailer can serve as a useful proxy for the entire retail market. Data culled from this retailer provides insight into sales volumes of specific products, their sales velocity, initial pricing versus sale pricing, and more. After scraping this site daily for more than 10 years — collecting information on the top 100 products in a variety of categories — Babel Street Data can provide clients with significant insight into the retail sector, covering everything from day-to-day changes in segment-level sales to trends for specific brands. Babel Street also collects retail data from hundreds of brand sites, big box stores, and specialty sellers for beauty products, cosmetics, handbags, apparel, and more.  

Government agencies — notably defense and intelligence organizations — can obtain equally significant insight. For example, presidential Executive Orders, national security strategies, defense strategies, and numerous legislative mandates attest to the criticality of protecting the United States supply chain from the presence of companies subject to Foreign Ownership, Control or Influence (FOCI). Babel Street Data can help analysts and intelligence officers vet vendors to protect supply chains. In doing so, analysts can produce their own watchlists of risk-indicated organizations — lists that go deeper than just naming ultimate beneficial owners.  

Six steps to implementing an effective data strategy

Effective data strategies — whether home-grown or developed by a third-party provider offering advanced intelligence solutions — require at least six essential actions. Let’s look at each.  

1. Find and scrape pertinent information

Smart data collection strategies should be designed to meet an organization’s specific information needs. Maybe you work for a hedge fund that invests in private companies operating in the food-service industry. Understanding that robotics has the potential to automate up to 82 percent of restaurant roles, including 31 percent of roles devoted to food preparation,[1] your company wants to learn more about automated chain restaurants. A data collection strategy for obtaining insight into these chains may include scraping corporate websites and other sources to study variables including restaurant locations, menu prices, availability and cost of ingredients, food service delivery statistics, restaurant reviews, reservation availability, competitive sales figures, and more.
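
As an illustration of this first step, here is a minimal Python sketch of collecting menu data from a single page. It assumes a hypothetical chain website (example-restaurant.com) whose menu items are rendered as div.menu-item elements with name and price spans; real sites require their own selectors and far more robust handling.

```python
# A minimal sketch of step 1, assuming a hypothetical site and markup.
# URL and CSS selectors are illustrative only.
import requests
from bs4 import BeautifulSoup

def scrape_menu(url: str) -> list[dict]:
    """Fetch one menu page and return a list of {item, price} records."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    items = []
    for node in soup.select("div.menu-item"):          # assumed markup
        name = node.select_one("span.name")
        price = node.select_one("span.price")
        if name and price:
            items.append({
                "item": name.get_text(strip=True),
                "price": price.get_text(strip=True),
                "source_url": url,
            })
    return items

if __name__ == "__main__":
    records = scrape_menu("https://example-restaurant.com/menu")  # hypothetical URL
    print(f"Scraped {len(records)} menu items")
```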

Or maybe you represent a government agency that, as discussed above, needs to understand which companies doing business in or for the United States are potentially connected to adversarial or hostile regimes. Your data collection strategy must certainly consider which websites to scrape. Additionally, because the PAI presented by some countries to domestic users varies wildly from the information presented to foreign users, you’ll need in-country IPs to scrape the most accurate information possible. Privacy is an additional concern. Managing the attribution of your crawlers is essential. A robust proxy network is required to avoid detection. Otherwise, you risk crawlers being blocked, or a hostile government discovering your crawler and deliberately presenting misinformation.
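
The sketch below illustrates the in-country access point in Python, assuming a hypothetical proxy gateway address and credentials. A production crawler would rely on a managed proxy network that handles rotation, attribution, and failover automatically.

```python
# A minimal sketch of fetching a page as it appears to in-country users.
# The proxy endpoint below is a hypothetical placeholder.
import requests

IN_COUNTRY_PROXIES = {
    "http": "http://user:pass@proxy.example-country.net:8080",   # hypothetical gateway
    "https": "http://user:pass@proxy.example-country.net:8080",
}

def fetch_as_domestic_user(url: str) -> str:
    """Fetch a page through an in-country exit IP to see the domestic version."""
    response = requests.get(
        url,
        proxies=IN_COUNTRY_PROXIES,
        headers={"User-Agent": "Mozilla/5.0"},  # avoid an obviously bot-like default UA
        timeout=30,
    )
    response.raise_for_status()
    return response.text
```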

2. Prepare scraped data

Scraped data must typically be cleansed, normalized, aggregated, and enriched. Why? Because it is usually unstructured and may contain errors, along with incomplete, conflicting, and biased information. Specialized teams are needed to identify and correct errors and inconsistencies. They must also organize data into a standardized format — aggregating information while eliminating redundancies. They may choose to enrich data with additional information for deeper insight. They may also need to resolve entities — that is, to identify and match records that refer to the same entity, even if those records have variations or inconsistencies. (See Entity resolution, below.)
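
A minimal pandas sketch of this preparation step might look like the following. The sample records, column names, and cleaning rules are purely illustrative.

```python
# A minimal sketch of step 2: cleanse, normalize, and aggregate scraped records.
import pandas as pd

raw = [
    {"item": " Veggie Burger ", "price": "$9.99", "store": "Chicago"},
    {"item": "veggie burger", "price": "9.99", "store": "Chicago"},
    {"item": "Margherita Pizza", "price": "$12.50", "store": "Austin"},
]

df = pd.DataFrame(raw)

# Cleanse: strip whitespace and normalize casing.
df["item"] = df["item"].str.strip().str.title()
df["store"] = df["store"].str.strip().str.title()

# Normalize: convert inconsistent price strings to a single numeric column.
df["price_usd"] = df["price"].str.replace("$", "", regex=False).astype(float)

# Aggregate: drop duplicates created by repeated crawls of the same listing.
df = df.drop_duplicates(subset=["item", "store", "price_usd"]).drop(columns=["price"])

print(df)
```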

3. Audit it

In alignment with best practices and new citation and verification standards from the Office of the Director of National Intelligence,[2] organizations should retain a record of every fetch made by web crawlers. Audit information should note if the web crawler was blocked — an indication that potentially valuable data was left unscraped. Audit logs should maintain a cached version of every web page scraped. Automated quality assurance tests should be run regularly.  
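
One way to approach such auditing is sketched below in Python: every fetch appends a record to an audit log, likely block responses are flagged, and successful page bodies are cached. The file layout, status-code heuristics, and record schema are assumptions for illustration.

```python
# A minimal sketch of step 3: audit every fetch, flag blocks, cache page bodies.
import hashlib
import json
import pathlib
from datetime import datetime, timezone

import requests

CACHE_DIR = pathlib.Path("cache")
AUDIT_LOG = pathlib.Path("audit_log.jsonl")
CACHE_DIR.mkdir(exist_ok=True)

def audited_fetch(url: str) -> str | None:
    response = requests.get(url, timeout=30)
    blocked = response.status_code in (403, 429)   # common "blocked" signals (assumed heuristic)

    # Cache the raw page under a content-addressed filename for later verification.
    cache_path = None
    if response.ok:
        digest = hashlib.sha256(response.content).hexdigest()
        cache_path = CACHE_DIR / f"{digest}.html"
        cache_path.write_bytes(response.content)

    # Append one audit record per fetch.
    record = {
        "url": url,
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "status_code": response.status_code,
        "blocked": blocked,
        "cached_copy": str(cache_path) if cache_path else None,
    }
    with AUDIT_LOG.open("a") as fh:
        fh.write(json.dumps(record) + "\n")

    return response.text if response.ok else None
```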

4. Store it

In support of data governance and security, data should be stored in a centralized data warehouse. A centralized warehouse provides a unified view of information stored, and a single source of truth for the organization. It also enables data to be easily queried by users for enhanced analytics and for use in delivery pipelines. This centralized infrastructure should align with the security requirements of financial and governmental entities.  
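
The sketch below uses SQLite as a lightweight stand-in for a centralized warehouse table, just to show prepared records landing in one queryable place. In production the same pattern would target the organization’s actual warehouse platform with its security controls, and the table schema shown is illustrative.

```python
# A minimal sketch of step 4: load prepared records into one queryable table.
import sqlite3

conn = sqlite3.connect("warehouse.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS retail_prices (
        item        TEXT,
        store       TEXT,
        price_usd   REAL,
        scraped_at  TEXT,
        source_url  TEXT
    )
""")

rows = [("Veggie Burger", "Chicago", 9.99, "2024-06-01", "https://example.com/menu")]
conn.executemany("INSERT INTO retail_prices VALUES (?, ?, ?, ?, ?)", rows)
conn.commit()

# A single source of truth that analysts can query directly.
for row in conn.execute("SELECT store, AVG(price_usd) FROM retail_prices GROUP BY store"):
    print(row)
```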

5. Deliver it

Technology professionals must deliver data in a format that corporate and governmental decision makers can use and understand. Common delivery methods include APIs and raw data files in CSV or JSON format. In addition, data should be capable of being integrated into third-party systems, including the Snowflake data platform, the AWS cloud computing platform, and others.
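
A minimal sketch of file-based delivery, reusing the illustrative retail-price records from the earlier steps, might look like this:

```python
# A minimal sketch of step 5: export prepared records as CSV and JSON files.
import csv
import json

records = [
    {"item": "Veggie Burger", "store": "Chicago", "price_usd": 9.99},
    {"item": "Margherita Pizza", "store": "Austin", "price_usd": 12.50},
]

# CSV delivery for spreadsheet and BI users.
with open("retail_prices.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=["item", "store", "price_usd"])
    writer.writeheader()
    writer.writerows(records)

# JSON delivery for API consumers and downstream pipelines.
with open("retail_prices.json", "w") as fh:
    json.dump(records, fh, indent=2)
```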

6. Comply with legal and ethical standards for scraping data

To comply with the European Union’s General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA), and other mandates, web scrapers must refrain from collecting any personally identifiable information. In addition, ethics and best practices require web scrapers to refrain from harming the organization whose data is being scraped. This means web scrapers should avoid sending so many requests that a site becomes overloaded. In short, companies offering web scraping technologies should be good stewards of the internet.  
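
The sketch below illustrates a few of these good-stewardship practices in Python: checking robots.txt, pacing requests, and stripping fields that could hold personally identifiable information. The field list and the five-second delay are illustrative policies, not prescriptions.

```python
# A minimal sketch of step 6: respect robots.txt, throttle requests, drop PII fields.
import time
from urllib import robotparser
from urllib.parse import urlparse

import requests

PII_FIELDS = {"email", "phone", "full_name", "address"}   # assumed schema fields
REQUEST_DELAY_SECONDS = 5                                 # illustrative pacing policy

def allowed_by_robots(url: str, user_agent: str = "*") -> bool:
    parts = urlparse(url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(user_agent, url)

def polite_fetch(url: str) -> str | None:
    if not allowed_by_robots(url):
        return None                      # respect the site's scraping policy
    time.sleep(REQUEST_DELAY_SECONDS)    # avoid overloading the target site
    return requests.get(url, timeout=30).text

def strip_pii(record: dict) -> dict:
    """Remove fields that could identify a person before the record is stored."""
    return {k: v for k, v in record.items() if k not in PII_FIELDS}
```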

The limitations of commercial web scraping technologies

Advanced intelligence solutions offer datasets, technology, and services — or any combination thereof. They can help organizations at every step of their data strategy. This is not true for much of the commercial web scraping technology now on the market.  

The low cost and ease of use associated with many commercial web scrapers often tempt organizations. Still, add-ons offered by popular web browsers and lightweight commercial software will likely prove inadequate for enterprise use. These technologies have limited reach, often proving unable to scrape needed data. They cannot scrape data at scale. They may limit user customization capabilities. Output may require significant cleansing and normalization, along with reconciliation of incomplete or conflicting data.

Understanding these limitations, some companies try to engineer their own web scrapers. But this is a challenging process, requiring significant technical expertise, large financial outlays, and construction of storage capacity to retain scraped data.  

Exacerbating these challenges is the fact that most companies don’t want their data scraped. They implement anti-scraping measures that identify bot-like activity. They ban requests from flagged IP addresses. They implement CAPTCHA tests. Companies seeking insight can find it very difficult to work around these roadblocks.  

For these reasons, many organizations prefer to work with third-party providers of advanced intelligence solutions. Established providers offer high-quality data, technology, and services, typically at a lower cost than building data solutions in-house.  

Why Babel Street?

Babel Street Data provides the global IP network infrastructure and secure access needed to scrape the type of hard-to-reach data that can help businesses and government agencies obtain vital insight.  

While we offer hundreds of ready-to-use datasets covering virtually every industry, our customization capabilities are important to many clients. We develop customized datasets (and build the web crawlers necessary to compile them) to meet the distinct needs of specific organizations.  

In some cases, clients come to us having identified data sources, instructing us on how they want their web crawlers to operate and how they want their data delivered. Alternatively, clients come to us with no more than a specific question or series of questions to be answered. (What intelligence is available on Myanmar’s deputy prime minister? Would it be more profitable to invest in robot fast-food restaurants, robot pizzerias, or high-tech fine dining?) In these cases, Babel Street does everything from data scraping through delivery.

Babel Street is one of very few technology partners to offer all the capabilities discussed in the “Six steps to implementing an effective data strategy” section of this article. We also provide:

A seasoned global-proxy network for ethical data scraping  

Our data scraping is based on a seasoned proxy network. We don’t just have IPs in 95 countries, enabling us to access hard-to-find data. We have redundant IPs in those countries. If we lose one, work continues. The scope of our IP network allows us to run multiple data requests concurrently, speeding the scraping process.
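
As a simplified illustration of redundancy and concurrency (not a depiction of Babel Street’s actual infrastructure), the Python sketch below tries each of two hypothetical proxy endpoints in turn and fetches multiple URLs in parallel.

```python
# A minimal sketch of concurrent scraping with failover across redundant proxies.
# Proxy addresses are hypothetical placeholders.
from concurrent.futures import ThreadPoolExecutor

import requests

REDUNDANT_PROXIES = [
    "http://proxy-a.example-country.net:8080",   # primary exit IP (hypothetical)
    "http://proxy-b.example-country.net:8080",   # backup exit IP (hypothetical)
]

def fetch_with_failover(url: str) -> str | None:
    """Try each redundant proxy in turn; if one is lost, work continues on the next."""
    for proxy in REDUNDANT_PROXIES:
        try:
            response = requests.get(
                url, proxies={"http": proxy, "https": proxy}, timeout=30
            )
            response.raise_for_status()
            return response.text
        except requests.RequestException:
            continue
    return None

urls = ["https://example.com/page1", "https://example.com/page2"]
with ThreadPoolExecutor(max_workers=8) as pool:   # concurrent requests speed the crawl
    pages = list(pool.map(fetch_with_failover, urls))
```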

Wherever we operate, we scrape data ethically. We comply with GDPR and CCPA mandates and prohibitions. Our work does not overwhelm the websites being scraped. We are a member of the Financial Information Services Division (FISD) of the Software & Information Industry Association (SIIA) and comply with its principles and best practices.

Regularly updated data from hard-to-reach regions

New data is created every moment. Much of it either updates or supplants previously captured information. Our Elite Regional Access (ERA) data subscription service augments datasets with a continuous stream of regional information from hard-to-reach online sources. This helps provide organizations with the most up-to-date data possible from regions such as the Middle East, Europe, and Asia, and from individual countries including Iran, Russia, and China. This information helps defense and security users uncover critical information on military postures, strategic developments, and security policies. Corporate users can identify and monitor supply chain risks, market dynamics, emerging risks, and other variables.

Entity resolution  

“Mark A. Murphy” is listed in one database as the president of a manufacturing company seeking to produce fuel systems for the U.S. Navy. “Marcus Andrew Murphy” is named elsewhere as a suspected domestic terrorist. Are they the same person?

You are considering working with a supplier, China Moves, to produce components for your new smart device. Is China Moves the same entity as China Mobile, a telecommunications company owned by the People’s Republic of China?

Do you know?

Working with Babel Street Data gives organizations access to the Babel Street Ecosystem of technologies. Among the ecosystem’s capabilities is entity resolution: the process of reviewing publicly available information associated with a person, corporation, address, or other variable, and appending it to a record being examined. This sharpens the insight governments and corporations need for critical decision-making and is vital for compliance with DoD mandates requiring the vetting of corporations for potential FOCI.
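
To make the idea concrete, the Python sketch below shows one small building block of entity resolution: fuzzy comparison of name variants across records. The similarity threshold is an assumption, and production entity resolution weighs many more signals, such as addresses, registration data, and ownership links.

```python
# A minimal sketch of fuzzy name matching between two records.
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Return a 0-1 similarity score between two name strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

record_a = {"name": "Mark A. Murphy", "role": "President, fuel-systems manufacturer"}
record_b = {"name": "Marcus Andrew Murphy", "role": "Watchlist entry"}

score = name_similarity(record_a["name"], record_b["name"])
if score > 0.7:                        # assumed threshold for flagging a candidate match
    # Append attributes from the matched record for analyst review.
    resolved = {
        **record_a,
        "possible_alias": record_b["name"],
        "match_score": round(score, 2),
    }
    print(resolved)
else:
    print(f"No candidate match (similarity={score:.2f})")
```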

Web scraping is only one step on the journey to intelligence. For maximum insight, organizations need a complete data strategy for collecting, structuring, enriching, auditing, and delivering information, and for resolving entities. Babel Street’s advanced intelligence solutions can help.

Endnotes

1. Ewing-Chow, Daphne, “Here Are Five Global Restaurants Staffed By Robot Chefs,” Forbes, March 2024, https://www.forbes.com/sites/daphneewingchow/2024/03/31/here-are-five-global-restaurants-staffed-by-robot-chefs/ 

2. Office of the Director of National Intelligence, “Citation and Reference for Publicly Available Information, Commercially Available Information, and Open Source Intelligence,” December 2024, https://www.dni.gov/files/documents/ICD/ICS-206-01.pdf 

Disclaimer:

All names, companies, and incidents portrayed in this document are fictitious. No identification with actual persons (living or deceased), places, companies, or products is intended or should be inferred.
