Commit 1c9ead3

add video download crawler

5 files changed: +264, -0 lines changed

.gitignore

+1
@@ -0,0 +1 @@
venv/

README.md

+78
@@ -0,0 +1,78 @@
# Web Crawler for Video Downloads

A Python web crawler script designed to explore web pages, find video files, and download them. The script supports customisation of the HTML tags and attributes used for discovering videos and links, and it allows for parallel URL exploration.

## Features

- **Download Videos**: Finds and downloads video files from specified HTML tags and attributes.
- **Explore Links**: Follows links on the page to find more videos, with support for recursive crawling.
- **Parallel Processing**: Uses threading to explore and download in parallel for faster execution.
- **Configurable**: Allows customisation of HTML tags and attributes for video sources and links.

## Prerequisites

- Python 3.6 or higher
- `requests` library
- `beautifulsoup4` library

## Installation

1. **Clone the Repository**

```bash
git clone https://github.com/yourusername/web-crawler.git
cd web-crawler
```

2. **Install Required Libraries**

You can install the necessary Python libraries using `pip`:

```bash
pip install -r requirements.txt
```

## Usage

To run the web crawler script, use the following command:

```bash
python crawler.py [START_URL] [FOLDER_PATH] [KEYWORDS] [--max_depth MAX_DEPTH] [--download_tag TAG:ATTRIBUTE] [--explore_tag TAG:ATTRIBUTE]
```

### Arguments

- `START_URL`: The starting URL for the web crawler.
- `FOLDER_PATH`: The folder path where downloaded videos will be saved.
- `KEYWORDS`: Space-separated keywords used to filter links and videos (matched case-insensitively against URLs).
- `--max_depth`: (Optional) Maximum depth to crawl. Default is `2`.
- `--download_tag`: (Optional) Tag and attribute used to find video sources (format: `tag:attribute`). Default is `source:src`.
- `--explore_tag`: (Optional) Tag and attribute used to find links to explore (format: `tag:attribute`). Default is `a:href`.

### Example

To start crawling from `https://example.com` and download videos to the `./videos` folder, with a maximum depth of `3` and using default tags and attributes:

```bash
python crawler.py https://example.com ./videos "video" --max_depth 3
```

To specify custom tags and attributes for finding videos and links:

```bash
python crawler.py https://example.com ./videos "video" --max_depth 3 --download_tag "source:src" --explore_tag "a:href"
```
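
Because `crawler.py` splits the `--download_tag` and `--explore_tag` values on commas, several `tag:attribute` pairs can be supplied in one run. A sketch, assuming a page that marks up videos with both `<source src>` and `<video src>` elements:

```bash
python crawler.py https://example.com ./videos "video" --download_tag "source:src,video:src" --explore_tag "a:href"
```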

## Contributing

Contributions are welcome! Please follow these steps:

1. Fork the repository.
2. Create a feature branch (`git checkout -b feature/YourFeature`).
3. Commit your changes (`git commit -am 'Add new feature'`).
4. Push to the branch (`git push origin feature/YourFeature`).
5. Create a new Pull Request.

## License

This project is licensed under the MIT License.

crawler.py

+154
@@ -0,0 +1,154 @@
import requests
from bs4 import BeautifulSoup
import os
import argparse
from urllib.parse import urljoin
from threading import Thread, Lock
import queue

# Global queue for URLs to be explored and lock for thread-safe operations
url_queue = queue.Queue()
url_lock = Lock()

# Function to download a video file
def download_video(url, folder_path, downloaded_urls):
    if url in downloaded_urls:
        print(f"Already downloaded: {url}")
        return
    try:
        print(f"Attempting to download: {url}")
        response = requests.get(url, stream=True)
        response.raise_for_status()  # Check for HTTP errors
        file_name = url.split('/')[-1]
        with open(os.path.join(folder_path, file_name), 'wb') as file:
            for chunk in response.iter_content(chunk_size=1024):
                if chunk:
                    file.write(chunk)
        downloaded_urls.add(url)
        print(f"Downloaded {file_name}")
    except requests.RequestException as e:
        print(f"Error downloading {url}: {e}")

# Function to check if any keyword is in the URL (case-insensitive)
def contains_keyword(url, keywords):
    url_lower = url.lower()
    return any(keyword.lower() in url_lower for keyword in keywords)

# Function to find video files in specified tags and attributes
def find_videos(page_url, folder_path, visited_urls, keywords, downloaded_urls, depth, max_depth, download_tag, explore_tag):
    if depth > max_depth:
        return
    if page_url in visited_urls:
        return
    visited_urls.add(page_url)

    try:
        print(f"Fetching page: {page_url}")
        response = requests.get(page_url)
        response.raise_for_status()  # Check for HTTP errors
        soup = BeautifulSoup(response.content, 'html.parser')

        # Print a snippet of HTML content for debugging
        print("HTML Snippet:")
        print(soup.prettify()[:1000])  # Print the first 1000 characters for a quick overview

        # Find video files in specified tags and attributes
        video_found = False
        for tag, attr in download_tag:
            for element in soup.find_all(tag, {attr: True}):
                video_url = element[attr]
                if contains_keyword(video_url, keywords):
                    full_url = urljoin(page_url, video_url)
                    print(f"Found video URL: {full_url}")
                    download_video(full_url, folder_path, downloaded_urls)
                    video_found = True

        # Find video files in iframes
        for iframe in soup.find_all('iframe', src=True):
            iframe_url = iframe['src']
            iframe_url = urljoin(page_url, iframe_url)
            print(f"Found iframe URL: {iframe_url}")
            try:
                iframe_response = requests.get(iframe_url)
                iframe_response.raise_for_status()
                iframe_soup = BeautifulSoup(iframe_response.content, 'html.parser')
                video_found = find_videos_in_iframe(iframe_soup, folder_path, keywords, iframe_url, downloaded_urls, download_tag) or video_found
            except requests.RequestException as e:
                print(f"Error fetching iframe content {iframe_url}: {e}")

        if not video_found:
            print("No videos found on this page.")

        # Recursively follow links on the page with specific keywords
        for tag, attr in explore_tag:
            for link in soup.find_all(tag, {attr: True}):
                link_url = link[attr]
                link_url = urljoin(page_url, link_url)
                if contains_keyword(link_url, keywords):
                    with url_lock:
                        if link_url not in visited_urls:
                            print(f"Adding link to queue: {link_url}")
                            url_queue.put((link_url, depth + 1))

    except requests.RequestException as e:
        print(f"Error fetching {page_url}: {e}")

def find_videos_in_iframe(soup, folder_path, keywords, page_url, downloaded_urls, download_tag):
    video_found = False
    for tag, attr in download_tag:
        for element in soup.find_all(tag, {attr: True}):
            video_url = element[attr]
            if contains_keyword(video_url, keywords):
                full_url = urljoin(page_url, video_url)
                print(f"Found video URL in iframe: {full_url}")
                download_video(full_url, folder_path, downloaded_urls)
                video_found = True
    return video_found

# Worker function for exploring URLs
def explore_urls(start_url, folder_path, keywords, max_depth, download_tag, explore_tag):
    visited_urls = set()
    downloaded_urls = set()

    # Start with the initial URL
    url_queue.put((start_url, 0))

    while not url_queue.empty():
        current_url, current_depth = url_queue.get()
        find_videos(current_url, folder_path, visited_urls, keywords, downloaded_urls, current_depth, max_depth, download_tag, explore_tag)
        url_queue.task_done()

def crawl(start_url, folder_path, keywords, max_depth, download_tag, explore_tag):
    # Create folder path if it does not exist
    if not os.path.exists(folder_path):
        os.makedirs(folder_path)

    # Create and start worker threads for parallel exploration
    num_threads = 4  # Number of threads for parallel exploration
    threads = []
    for _ in range(num_threads):
        thread = Thread(target=explore_urls, args=(start_url, folder_path, keywords, max_depth, download_tag, explore_tag))
        thread.start()
        threads.append(thread)

    # Wait for all threads to complete
    for thread in threads:
        thread.join()

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Web crawler to download video files from web pages.')
    parser.add_argument('start_url', type=str, help='The starting URL for the web crawler.')
    parser.add_argument('folder_path', type=str, help='The folder path where videos will be downloaded.')
    parser.add_argument('keywords', type=str, nargs='+', help='Keywords to filter links.')
    parser.add_argument('--max_depth', type=int, default=2, help='Maximum depth to crawl.')
    parser.add_argument('--download_tag', type=str, default='source:src', help='Tag and attribute used to find video sources (format: tag:attribute).')
    parser.add_argument('--explore_tag', type=str, default='a:href', help='Tag and attribute used to find links to explore (format: tag:attribute).')

    args = parser.parse_args()

    # Parse the tag and attribute arguments
    download_tag = [tuple(tag_attr.split(':')) for tag_attr in args.download_tag.split(',')]
    explore_tag = [tuple(tag_attr.split(':')) for tag_attr in args.explore_tag.split(',')]

    crawl(args.start_url, args.folder_path, args.keywords, args.max_depth, download_tag, explore_tag)

requirements.txt

+2
@@ -0,0 +1,2 @@
requests
beautifulsoup4

run.sh

+29
@@ -0,0 +1,29 @@
#!/bin/bash

# Prompt for the keywords
read -p "Enter keywords (space-separated): " keywords

# Prompt for the folder to save downloads in
read -p "Enter the folder location to save (default is '$keywords'): " folder
folder=${folder:-$keywords}

# Prompt for the maximum depth
read -p "Enter the maximum depth (default is 2): " max_depth

# Set default value for max_depth if not provided
max_depth=${max_depth:-2}

# Prompt for the HTML tag and attribute used to find video sources
read -p "Enter the HTML tag and attribute used to find video sources (format: tag:attribute, default is 'source:src'): " download_tag
download_tag=${download_tag:-source:src}

# Prompt for the HTML tag and attribute used to find links to explore
read -p "Enter the HTML tag and attribute used to find links to explore (format: tag:attribute, default is 'a:href'): " explore_tag
explore_tag=${explore_tag:-a:href}

# Run the Python script with the provided inputs
python crawler.py "https://deephot.link/?s=$keywords" "$folder" $keywords --max_depth "$max_depth" --download_tag "$download_tag" --explore_tag "$explore_tag"
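
For reference, answering the prompts above is equivalent to calling `crawler.py` directly. A sketch with hypothetical answers (keyword `nature`, the default depth and tags, and the download folder falling back to the keyword):

```bash
# Hypothetical direct invocation matching run.sh's final line for the keyword "nature"
python crawler.py "https://deephot.link/?s=nature" "nature" nature --max_depth 2 --download_tag "source:src" --explore_tag "a:href"
```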
