10 Questions to Ask before Starting Your Web Scraping Project

published on 12 December 2022

Answer these 10 key questions before proceeding with your web scraping project.

Start your web scraping project only after considering 10 key questions.

Are you ready to kick off a web scraping project?

Before you get too far in, it is important to pause and consider these 10 key questions.

Web scraping is a process that uses software to automatically extract data from websites.

Web scraping is an increasingly popular method of gathering large amounts of data from multiple online sources quickly and efficiently.

However, web scraping can quickly become complex, and overlooking any one of these key considerations could end up costing you both time and money.

It's best to be prepared for maximum efficiency and effectiveness before kicking off a web scraping project.

So let's go over 10 key questions that you need to consider before kicking off your web scraping project!

#1. Is web scraping legal?

Many people shy away from web scraping because they assume it is illegal.

However, web scraping of public data is legal in the US.

In April 2022, the US Court of Appeals for the Ninth Circuit reaffirmed its original decision that scraping data that is publicly-accessible on the internet is not a violation of the Computer Fraud and Abuse Act (CFAA), which governs what constitutes computer hacking under US law.

However, you should examine the Terms of Service of your source website to determine if the site permits data extraction before making a decision on whether or not to use the website for your web scraping project.

For example, certain websites state that scraping without permission is a violation of the website terms.

In such cases, you must get the website owner's consent before extracting any data from the site. Violating a website's policies and guidelines could get you in trouble with the company that owns the website.

Therefore, it is best to ensure that you are adhering to the Terms of Service of any website from which you extract data.

Let's take a look at two recent cases.

Microsoft-owned LinkedIn has had a long-running case with employment analytics firm hiQ Labs, which has been scraping publicly-accessible data for use in hiQ's recruitment insights app.

The case started in 2017 when LinkedIn filed a lawsuit against hiQ Labs to stop hiQ from harvesting public profile data from LinkedIn.

LinkedIn argued that hiQ's use of LinkedIn data violated LinkedIn's User Agreement and that hiQ was in violation of the Computer Fraud and Abuse Act (CFAA).

In April 2022, the U.S. Court of Appeals for the Ninth Circuit ruled in hiQ v. LinkedIn that scraping publicly-accessible data from a public website does not violate the Computer Fraud and Abuse Act (CFAA).

However, in November 2022, the court ruled that LinkedIn may enforce its User Agreement against data scraping.

In October 2020, Facebook (now Meta) filed a lawsuit against two companies that had been scraping from Facebook and Instagram, "in order to sell marketing intelligence and other services," according to Facebook.

The lawsuit named Israeli-based BrandTotal Ltd. and Delaware-incorporated Unimania Inc.

In October 2022, Facebook parent Meta settled the lawsuit with both companies, which agreed to a permanent injunction banning them from scraping Facebook or Instagram and from profiting from the data they had collected.

Both companies also agreed to settle for an undisclosed sum.

However, the district court overseeing the case ruled that BrandTotal did not violate the Computer Fraud and Abuse Act (CFAA).

In both the Meta and LinkedIn cases, courts ruled that the companies named in the lawsuits did not violate the Computer Fraud and Abuse Act (CFAA).

So, again, web scraping of publicly-accessible data is not illegal.

However, courts ruled in favor of Meta and LinkedIn on the basis of both companies' User Agreements.

Therefore, although web scraping is legal, you must adhere to the User Agreements and Terms of Service of whatever source website from which you are extracting data.

Interested in learning more about the legality of web scraping?

Check out Is Web Scraping Legal?

To ensure that your web scraping project adheres to legal requirements and websites' terms, it is key that you work with an experienced web scraping service provider that will effectively guide you through each step of the process.

#2. Why do I want to scrape data?

With the hype around terms like "big data", "data analytics", "data science" and even "alternative data", it is too easy to jump headfirst into collecting massive amounts of data with no clear plan or roadmap for how you will use the data to generate business value.

Before you start collecting any data, you should define your business goals.

Are you looking to get more customers? Price your offerings lower than your competitors? Make optimized investment decisions? Strengthen your brand value? Do something else?

Next, outline the steps you will implement to achieve your business goals.

Do you want to generate leads? Track competitor pricing? Collect stock prices? Perform sentiment analysis for your brand? Execute some other use case?

Finally, list out potential websites that contain the data you require to execute the steps that will drive your business forward.

Once you are clear on what you want to do with the data, web scraping is an invaluable tool that enables you to collect data from websites efficiently and quickly to power data-driven decisions that drive your business forward.

#3. From which websites do I scrape data?

Based on the goals you have outlined for collecting data, you must then decide on the types of websites from which you will source the data you need.

It is important that you source your data carefully from only websites that are reliable sources of information.

Several websites make it hard to scrape their data because web crawlers increase the load on their servers.

One way to check what kinds of web crawlers a website permits is to look at the website's robots.txt file. To view it, append /robots.txt to the site's root URL, e.g. https://website.com/robots.txt.
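
As a quick illustration, Python's standard library can check robots.txt rules programmatically. This is a minimal sketch; the rules and URLs below are hypothetical examples:

```python
from urllib.robotparser import RobotFileParser

def crawl_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Check whether a site's robots.txt rules permit crawling the given URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Example robots.txt content; in practice you would fetch it from
# https://website.com/robots.txt before crawling.
rules = """
User-agent: *
Disallow: /private/
Allow: /
"""

print(crawl_allowed(rules, "https://website.com/products"))   # True
print(crawl_allowed(rules, "https://website.com/private/x"))  # False
```

Running a check like this before each crawl helps you stay within the boundaries the website owner has published.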

Examples of the kinds of websites from which you can source data for your web scraping project are:

  1. Ecommerce sites for product listings, SKUs, availability, prices, etc.
  2. Web directory sites for person and company information.
  3. Government websites, such as the US Federal Reserve Bank's site, for interest rates and currency data.
  4. Stock exchange sites for stock prices, trading volume and company reference data.
  5. Social media sites for publicly-accessible profile data and posts.

Web scraping from social media sites has become increasingly difficult due to aggressive technical and legal measures that social media companies have taken to safeguard user privacy by restricting the harvesting of publicly-accessible data from their sites.

Rightfully so, a key priority for social media sites is to protect the privacy of their users.

Therefore, social media sites have been demanding that web scraping companies treat social media data as restricted or private data, even though such data is publicly-accessible on search engines, like Google.

On the other hand, social media companies would like to continue benefiting from the massive visibility on and traffic from the search engines.

Social media companies are playing a balancing act: profiting from Google search while demanding that their data be treated as if it were not publicly available on Google search.

#4. What tool do I use to scrape data?

There is a dizzying number of tools, services, programming languages and frameworks you can use to scrape data.

Where do you start? The answer depends on several factors, such as your technical expertise and whether you have the desire and ability to commit the resources, time and effort to executing a web scraping project.

Do you want to use an off-the-shelf tool? There are several on the market. If so, keep in mind that you will need to invest the time to learn how to use the tool.

Then you will need to commit time to actually operate the tool to collect the data you need.

Finally, you will have to shape the data you collect into a consumable structure and format from which you can effectively perform your data analytics.

Do you want to build your own custom tool using your favorite programming language? You can build your web scraping tool using most popular languages, such as Python, Java, JavaScript, C# or C++.

In fact, many of these languages have libraries and frameworks that can do the heavy lifting for your web scraping project.

You will need to commit the time to build and refine your capabilities to build a custom web scraping tool using your favorite programming language and framework.

Furthermore, given that companies can change the structure of their websites, you will need to be prepared to continuously update your code to work with the latest structure of the websites from which you scrape data.

Using web scraping to generate quality data on a sustained basis is no easy feat.

You and your team will need to invest the time and effort to build mastery of the technologies you use to scrape data.

You will need to continuously update your web scraping tool to address changes that website owners can make to their websites.

Also, if you want to scrape data on a frequent basis, such as daily, you will need to staff your project with people who can fix and maintain your web scraping application on an ongoing basis.

The tech ecosystem is evolving rapidly, and there is always faster, cheaper technology on its way to market.

Therefore, whatever you build can very well become outdated whenever a more effective or efficient web scraping technology lands on the market.

Finally, keep in mind that collecting data from the web is typically the first step in the long journey of data-driven decision-making.

After you collect the data, do you want to store the data in Excel, in Google Sheets or in a database?

Do you want to use a SQL relational or NoSQL database?

Do you need to clean the data? Standardize the data? Enhance and enrich your data?

Do you have the big data engineering expertise to process millions or even billions of records?

Can you handle semi-structured or unstructured data?

Additionally, do you want your data to be available on the cloud?

Do you want to perform business intelligence or other kinds of analytics on your data?

Do you want to build machine learning models to power predictive analytics?

Performing web scraping then leveraging the data to power data-driven decision-making can quickly become an expensive and time-consuming effort.

You can offload such work to an experienced web scraping services provider. This way, you can focus on driving your business forward, while the web scraping service extracts and prepares the web data you require to power your data-driven decision-making.

We have built an AI-powered, cloud-based web scraping service that is pay-per-use. So you only pay for data we deliver to you.

Our web scraping service handles all the data extraction work for you, especially for when you require frequent data extracts, custom data transformation or cloud integration.

Our highly-experienced professional services staff work with you to craft and execute a custom web scraping and data engineering solution that shapes data we extract to fit your exact needs.

We perform whatever kind of data cleansing, matching or transformation you require.

We shape your data to the optimal state you want, to power the analytics you want to perform.

We have built powerful cloud integrations for AWS, Google Cloud, Azure, Snowflake and Databricks, so we can make your data available for you in whatever cloud service you want.

#5. Will the website from which I am scraping data change its structure?

The short answer is yes.

Most companies are constantly optimizing their websites to be more effective at converting website visitors into customers, subscribers or users, or at driving some other specific action.

Many digital-native companies have teams dedicated full-time to website analytics, implementing methods such as A/B testing with the key goal of Conversion Rate Optimization (CRO).

Conversion Rate Optimization refers to the changes and enhancements a company makes to a webpage, in order to maximize the number of visitors that take a specific action, such as to place an order, subscribe to a newsletter, follow on social media or click a button.

Therefore, many companies will continuously change their websites.

Changes to a website from which you are scraping data can impact your web scraping tools.

Web scraping routines parse specific elements of the webpage. Web scraping tools identify such elements using identifiers or other attributes.

Successful web scraping depends on the identifiers of key webpage elements remaining consistent, keeping static formats and staying in the same position on the web page.

Therefore, a simple cosmetic change to the structure of the web page from which you are scraping data can cause your web scraping tool to break.

Continuously having to update your web scraping tool can very quickly become a prohibitive cost.

You can engage the services of an experienced web scraping service provider to ensure that your web scraping routines work seamlessly as the structure of the source websites change.

#6. How often do I need to scrape data?

How often you need to scrape data depends on the requirements for the freshness or recency of data to drive your business decisions.

If you are making decisions on a quarterly basis, then monthly or even quarterly data extracts might suffice.

For example, if an investment fund is reallocating its portfolio based on companies' quarterly earnings reports, such a fund would not require earnings data more frequently than once a quarter.

For use cases in which you make decisions on a daily basis, you require data that is not more than a day old.

Some use cases even require fresh data multiple times a day, such as every hour or even multiple times in an hour.

For example, an ecommerce company that updates its prices in near-real-time, based on competitor prices, would require data to be extracted on an almost instantaneous basis from competitors.

Consequently, several web scraping use cases require recurring data extracts.

Therefore, you will need to schedule your web scraping tools to run as frequently as the use case demands for the data that drives decision-making.

Keep in mind that frequent web scraping can cause the source site to block your IP address from making more requests. Several websites implement such blocking measures to protect their sites from being overloaded with traffic or to prevent outright site abuse, such as distributed denial-of-service (DDoS) attacks.

An experienced web scraping service will implement measures, such as using VPNs and rotating IP addresses, to avoid being blocked by source websites.
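
One simple such measure is inserting a randomized delay between consecutive requests, so the target server is not hammered with rapid traffic. A minimal sketch; the delay values are illustrative and should be tuned per site:

```python
import random
import time

def polite_request(fetch, urls, base_delay=2.0, jitter=1.0):
    """Call fetch(url) for each URL, sleeping a randomized interval
    between requests to reduce load on the target server.

    fetch is any callable that performs the actual HTTP request
    (e.g. a wrapper around your HTTP client of choice).
    """
    results = []
    for url in urls:
        results.append(fetch(url))
        # Sleep base_delay plus up to `jitter` extra seconds, so the
        # request pattern does not look machine-regular.
        time.sleep(base_delay + random.uniform(0, jitter))
    return results
```

Randomizing the interval, rather than using a fixed sleep, makes the traffic pattern less obviously automated.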

You can ensure that your recurring web data extracts execute reliably and on-time by subscribing to a web scraping service that delivers your data extracts to you on schedule.

#7. How can I ensure the data quality of my web data extracts?

Data-driven decision-making is only as good as the quality of data feeding such decisions.

A machine learning model powering predictive analytics is only as good as the quality of data with which the model was trained.

Bad data can wreak havoc on your business strategy. As the saying goes, "Garbage in, garbage out."

Therefore, it is of utmost importance that you implement robust data quality standards on your web data extracts.

You will need to encode data quality tests that validate the fidelity and veracity of your data.

Robust data quality testing will validate the following:

  • Completeness: Did you capture all the data expected for the extract? Is there missing data? Do you have the expected row counts?
  • Correctness: Is the data accurate? Are there spelling errors? Does your extract contain duplicate data? Empty fields? Invalid or nonsensical values? Does your data contain outliers or anomalies?
  • Freshness / Recency: Did the data extract finish on time? Did the extract capture data for the most recent period? Did your daily extract capture data for yesterday?
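
Checks like these are straightforward to automate. Here is a minimal Python sketch; the field names, thresholds and error messages are illustrative and should be adapted to your own extracts:

```python
from datetime import date

def validate_extract(rows, expected_count, required_fields, extract_date):
    """Run basic completeness, correctness and freshness checks on an
    extract (a list of dicts), returning a list of error messages."""
    errors = []

    # Completeness: expected row count, no missing or empty fields
    if len(rows) != expected_count:
        errors.append(f"expected {expected_count} rows, got {len(rows)}")
    for i, row in enumerate(rows):
        for field in required_fields:
            if not row.get(field):
                errors.append(f"row {i}: missing or empty '{field}'")

    # Correctness: no duplicate records
    seen = set()
    for i, row in enumerate(rows):
        key = tuple(sorted(row.items()))
        if key in seen:
            errors.append(f"row {i}: duplicate record")
        seen.add(key)

    # Freshness: the extract should cover the most recent period
    if extract_date < date.today():
        errors.append(f"stale extract dated {extract_date}")

    return errors

issues = validate_extract(
    rows=[{"sku": "A1", "price": ""}],
    expected_count=2,
    required_fields=["sku", "price"],
    extract_date=date.today(),
)
print(issues)  # -> ["expected 2 rows, got 1", "row 0: missing or empty 'price'"]
```

An empty list means the extract passed; anything else should block the downstream pipeline until a human looks at it.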

You can automate most data quality tests.

However, it is important that you manually spot check your data frequently to ensure that your automated tests are performing as expected.

The profile and distribution of your data can change with time.

For example, the majority of your orders could gradually shift from coming from one country (e.g., the US) to multiple countries (e.g., the UK, France, Switzerland), potentially requiring you to update or create data quality tests to address the different data profile of orders from other countries (e.g., different address formats, multiple currencies).

You must stay on top of your data quality monitoring to ensure that your data quality tests address the potentially evolving state of your data. 

Much of the data on the web is partially structured and sometimes unstructured.

Therefore, you will be able to use such web data extracts only after you have structured the data for your consumption.

Therefore, in addition to validating the data quality of your web extracts, you will likely perform downstream data cleansing, standardization, enrichment and transformation to ensure that your data is in the precise state you require.

For example, you might have to convert empty values to zeros, standardize addresses, geocode address data or perform currency conversions.
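
Transformations like these are easy to express in code. A minimal Python sketch, with hypothetical field names and illustrative exchange rates:

```python
def transform_order(order, fx_rates, target_currency="USD"):
    """Normalize one scraped order record: convert empty amounts to zero
    and convert the amount into a single target currency.

    The field names and exchange rates here are hypothetical examples.
    """
    amount = float(order.get("amount") or 0)   # empty string / None -> 0
    currency = order.get("currency", target_currency)
    rate = fx_rates.get(currency, 1.0)         # rate to the target currency
    return {
        "amount": round(amount * rate, 2),
        "currency": target_currency,
    }

rates = {"EUR": 1.05, "GBP": 1.21, "USD": 1.0}  # illustrative rates only

print(transform_order({"amount": "100", "currency": "EUR"}, rates))
# -> {'amount': 105.0, 'currency': 'USD'}
print(transform_order({"amount": "", "currency": "GBP"}, rates))
# -> {'amount': 0.0, 'currency': 'USD'}
```

In a real pipeline, steps like this run after the extract and before your data quality checks and analytics.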

There is a virtually unlimited number of potential transformations you might be required to perform on your web data extracts to ensure that your data is in the optimal state to power your data-driven decision-making.

Data quality is a journey, not a destination.

To implement robust data quality mechanisms and customized data transformations on your web data extracts, it is best to work with a web scraping services provider that has deep expertise in data engineering, in addition to extensive experience in web scraping.

#8. How can I prevent my web scraping program from being blocked by CAPTCHA tests?

Many websites implement CAPTCHA protection to prevent bots from submitting junk or spam data to forms and to limit web crawling.

CAPTCHA tests on a website ask the visitor to perform tasks that only a human can pass. Websites implement such tests to confirm that the visitor is a human and not a bot.

Examples of such tests are: Type out characters presented on the screen; Select all pictures that contain a bridge.

Most websites that implement CAPTCHA will only present CAPTCHA tests to a website visitor when the visit triggers certain conditions.

Examples of conditions that can trigger a CAPTCHA test are: Multiple form submissions from the same visitor or IP address; Visits from IP addresses or IP address ranges that have been identified as high-risk.

The best way to avoid being blocked by CAPTCHA tests is to avoid triggering such tests altogether.

Experienced web scraping service providers implement several techniques to avoid triggering CAPTCHA tests.

Examples of techniques to avoid triggering CAPTCHA tests include: Inserting delays in between web scraping requests; and rotating IP addresses by using proxy servers.
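
Rotating IP addresses typically means cycling requests through a pool of proxy servers. A minimal round-robin sketch; the proxy addresses are placeholders, and with the requests library you would pass the chosen proxy via its proxies argument:

```python
from itertools import cycle

# Hypothetical proxy pool; in practice these would be addresses supplied
# by your proxy provider.
PROXY_POOL = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]

proxy_cycle = cycle(PROXY_POOL)

def next_proxy():
    """Return the next proxy in round-robin order, so consecutive
    requests originate from different IP addresses."""
    return next(proxy_cycle)

# Each call hands back a different proxy, wrapping around the pool:
print([next_proxy() for _ in range(4)])
```

Combined with the randomized delays discussed earlier, rotation makes any single IP address far less likely to cross a site's CAPTCHA or rate-limit thresholds.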

#9. How much will a web scraping project cost me?

You can purchase an off-the-shelf web scraping tool or build a do-it-yourself Python script to take care of simple web scraping requirements.

However, you will quickly run into limitations doing it yourself, if you require complex, high-volume and/or high-frequency extracts.

Additionally, you will run into limitations going down the do-it-yourself route, if your source web page is anything beyond basic; has CAPTCHA protection; or uses dynamic code, such as JavaScript.

Getting a web scraping service to do the heavy lifting for you can be a great time and money saver. This way, you can focus on running your business, while the web scraping service handles running reliable data extracts on your behalf.

The cost of a web scraping project varies depending on the number of web pages to scrape, the data you want, and the complexity of the web page structure.

Additional factors that will impact the cost of your web scraping project include the degree of data engineering work to perform after extracting data; the use of cloud storage and compute services; the use of IP rotation and proxy servers to avoid CAPTCHA triggers; and the frequency with which you will be scraping data.

Oftentimes, a web scraping service will provide a free consultation to assess your project requirements, after which they will provide a service estimate.

Such a consultation can give you a proper understanding of how much your web scraping project will cost before you proceed.

Your best bet is to work with an experienced web scraping service that charges you on a pay-per-use basis. This way, you are paying only for the web data extracts that the service provider delivers to you.

Most web scraping projects will require some data cleansing, standardization, enrichment and transformation of the extracted data.

Therefore, you want to ensure your web scraping service has the experience to perform the data engineering you might require.

We have built an AI-powered, cloud-based, pay-per-use web scraping service that can scrape data from any website.

Schedule a free consultation with us.

#10. What is the output format of web scraping data?

Most off-the-shelf tools and programming languages will generate web scraping output data in Excel, CSV and JSON formats.
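
For instance, Python's standard library can emit both CSV and JSON from the same scraped records; the field names below are illustrative:

```python
import csv
import io
import json

# A couple of scraped records (hypothetical field names)
records = [
    {"sku": "A1", "name": "Widget", "price": 9.99},
    {"sku": "B2", "name": "Gadget", "price": 19.99},
]

# JSON output
json_output = json.dumps(records, indent=2)

# CSV output; use open("products.csv", "w", newline="") to write a file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["sku", "name", "price"])
writer.writeheader()
writer.writerows(records)
csv_output = buffer.getvalue()

print(csv_output)
```

CSV suits spreadsheet users, while JSON preserves nesting and types better for downstream programmatic consumption.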

If you want to cleanse, standardize or transform your data before using it, then you will need to perform data engineering work after the data extract.

In addition to transforming your data, you might desire to store your data in a database or use a cloud service.

Our AI-powered web scraping service can perform extensive data engineering on your data to transform it into whatever structure or format you need. We can create customized Excel sheets. We can enhance JSON data. We can enrich CSV files.

Additionally, we have built robust cloud integrations to deliver your data on whatever cloud service you desire, on AWS, Google Cloud, Azure, Snowflake or Databricks.

We can upload your data to cloud storage (Amazon S3/Azure Blob Storage/Google Cloud Storage), export your data to Google Sheets or push your files to an FTP server.

Finally, we can schedule the extraction and delivery of your data to fit your exact requirements: hourly, daily, weekly, monthly, quarterly or whatever custom schedule you want.

Conclusion

Web scraping is an effective method to collect data for your business or research requirements.

However, before you start a web scraping project, it is important to ask yourself some key questions, in order to ensure that web scraping is the right solution for you and that you will be able to get the value you need from web scraping.

Do you have any other questions about web scraping?

Check out our Frequently Asked Questions, for answers to more questions you might have.

Ready to Get Started with Web Scraping?

Our team of experts can enable you to implement a web scraping service tailored specifically to your needs.

Contact us today to learn more.

WSaaS is your go-to solution to quickly extract useful data from websites to help you grow your business.

Leverage AI-powered technology and take your business to the next level with our cloud-based web scraping service.

Join over 1,000 customers that have already trusted us to drive their business forward.
