How to extract all links from a webpage in Python

Published on Aug. 22, 2023, 12:15 p.m.

To extract all links from a webpage using Python, you can use the urllib and BeautifulSoup libraries.

To install the BeautifulSoup package, you can use the pip tool in Python. Here are the steps:

  1. Open a terminal or command prompt and activate your Python virtual environment (if you are using one).
  2. Run the following command to install Beautiful Soup:
pip install beautifulsoup4

If you want to install an older version of Beautiful Soup, you can run the following command:

pip install beautifulsoup==3.2.2
  3. Once the installation is complete, you can import the Beautiful Soup module in Python and start using it:
from bs4 import BeautifulSoup

If you encounter any issues during installation, make sure that your pip tool is up-to-date and compatible with your operating system and Python version.
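For example, upgrading pip before retrying the install usually looks like this (the exact command can vary slightly depending on how Python is installed on your system):

python -m pip install --upgrade pip
python -m pip install beautifulsoup4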

Here’s an example code:

import urllib.request
from bs4 import BeautifulSoup

# Specify the website URL
url = input("Enter website URL: ")

# Open the URL and read its HTML content
html = urllib.request.urlopen(url).read()

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(html, features='html.parser')

# Find all <a> tags that contain a href attribute
links = soup.find_all('a', href=True)

# Print out all the links
for link in links:
    print(link.get('href'))

In this code, we first prompt the user to enter the website URL. We then use the urllib library to open the URL and read its HTML content. Next, we use BeautifulSoup to parse the HTML content and find all <a> tags that contain an href attribute. Finally, we loop through the links and print out the URLs.
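Note that the href values printed this way can be relative (for example, /about). If you need absolute URLs, one option is to resolve each href against the page URL with urllib.parse.urljoin. Here is a minimal sketch that reuses the url and links variables from the example above:

from urllib.parse import urljoin

for link in links:
    # Resolve relative hrefs (e.g. "/about") against the page URL
    print(urljoin(url, link.get('href')))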

Note that this code only extracts the links from a single webpage. If you want to recursively crawl multiple pages and extract all links from a website, you’ll need to implement additional logic to keep track of visited pages and avoid visiting the same page multiple times.
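As a rough illustration, a minimal same-domain crawler might look like the sketch below. It keeps a set of visited pages and only follows links on the starting domain; it is only a sketch and leaves out error handling, politeness delays, and robots.txt checks:

import urllib.request
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

def crawl(start_url, max_pages=10):
    visited = set()
    to_visit = [start_url]
    domain = urlparse(start_url).netloc

    while to_visit and len(visited) < max_pages:
        page_url = to_visit.pop()
        if page_url in visited:
            continue
        visited.add(page_url)

        # Fetch and parse the current page
        html = urllib.request.urlopen(page_url).read()
        soup = BeautifulSoup(html, features='html.parser')

        # Print every link and queue same-domain pages for later visits
        for link in soup.find_all('a', href=True):
            absolute_url = urljoin(page_url, link['href'])
            print(absolute_url)
            if urlparse(absolute_url).netloc == domain:
                to_visit.append(absolute_url)

    return visited

# Example usage (replace with a real URL)
# crawl("https://example.com")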

Extracting all links from a webpage with the requests library

To extract all links from a webpage using the requests library in Python, you can use the re module to perform regular expression matching. Here’s an example:

import re
import requests

# Specify the website URL
url = input("Enter website URL: ")

# Send a GET request and get the webpage content
response = requests.get(url)
content = response.text

# Use regular expression to match all links
links = re.findall('"((http|ftp)s?://.*?)"', content)

# Print out all the links (each match is a tuple; its first element is the full URL)
for link in links:
    print(link[0])

In this example, we first send a GET request using the requests library and get the webpage content. Then we use a regular expression to find all the links in the webpage content. Finally, we loop through the matches and print out each link.

Please note that this code only extracts links from a single webpage. If you want to recursively crawl multiple pages and extract all links from a website, you’ll need to implement additional logic to keep track of visited pages and avoid visiting the same page multiple times.
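Also keep in mind that the regular expression above only matches absolute http, https, and ftp URLs enclosed in double quotes, so relative links such as /about are skipped. If you need those as well, one option is to parse the response with BeautifulSoup instead of a regular expression, as in this minimal sketch that combines the two libraries used above:

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Specify the website URL
url = input("Enter website URL: ")

# Send a GET request and parse the webpage content
response = requests.get(url)
soup = BeautifulSoup(response.text, features='html.parser')

# Print every link, resolving relative hrefs to absolute URLs
for link in soup.find_all('a', href=True):
    print(urljoin(url, link['href']))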