Introduction
Extracting video URLs, image URLs, and text content from a web page can be done easily with Selenium and Beautiful Soup in Python. If the src attribute holds a URL like “https://…video.mp4”, we can access the video directly.
However, many websites use blob-format URLs such as src=”blob:https://video_url”. We can extract them using Selenium + bs4, but we cannot access them directly because they are generated internally by the browser.
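To make the distinction concrete, here is a minimal sketch that collects the src attributes of video and source tags and separates blob URLs from directly accessible ones. It uses Python's built-in html.parser as a stand-in for Beautiful Soup so that it runs without extra packages. The names VideoSrcCollector and split_video_sources and the sample markup are hypothetical, chosen only for illustration.

```python
from html.parser import HTMLParser


class VideoSrcCollector(HTMLParser):
    """Collects the src attribute of every <video> and <source> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag in ("video", "source"):
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def split_video_sources(html):
    """Return (direct_urls, blob_urls) found in the given HTML string."""
    collector = VideoSrcCollector()
    collector.feed(html)
    blob_urls = [s for s in collector.sources if s.startswith("blob:")]
    direct_urls = [s for s in collector.sources if not s.startswith("blob:")]
    return direct_urls, blob_urls


# Illustrative markup, not taken from a real site
sample_html = """
<video src="blob:https://example.com/0f1e2d3c"></video>
<video><source src="https://example.com/clip.mp4"></video>
"""
direct, blobs = split_video_sources(sample_html)
print("direct:", direct)
print("blob:", blobs)
```

Only the URLs in the first list can be fetched outside the browser; the blob ones need one of the approaches described in the rest of this article.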
What are BLOB URLs?
Blob URLs can only be generated internally by the browser. URL.createObjectURL() creates a special reference to a Blob or File object, which can later be released using URL.revokeObjectURL(). These URLs can only be used locally, in a single instance of the browser and in the same session.
BLOB URLs are typically used to display or play multimedia content, such as videos, directly in a web browser or media player, without the need to download the content to the user’s local device. They are often used together with HTML5 video elements, which allow web developers to embed video content directly into a web page using a simple <video> tag.
To overcome the above issue we have found two methods that can help to extract the video URL directly:
- yt-dlp
- Selenium + Network logs
yt-dlp
yt-dlp is a very useful module for downloading YouTube videos. It also extracts other attributes of YouTube videos such as titles, descriptions, tags, etc. We have found a way to extract videos from normal web pages (non-YouTube) using some additional options with it. Below are the steps and sample code for using it.
Install the yt-dlp module on Ubuntu
sudo snap install yt-dlp
Below is simple code for video URL extraction using yt-dlp with the Python subprocess module. We are using additional options like -f, -g, -q, etc. The description of these options can be found on the GitHub page of yt-dlp.
import subprocess


def get_video_urls(url):
    videos_url = []
    # -f all: consider every format, -g: print media URLs only, -q: quiet output
    youtube_subprocess = subprocess.Popen(
        ["yt-dlp", "-f", "all", "-g", "-q", "--ignore-errors", "--no-warnings", url],
        stdout=subprocess.PIPE)
    try:
        # yt-dlp prints one URL per line; give up after 15 seconds
        video_url_list = youtube_subprocess.communicate(timeout=15)[0].decode("utf-8").split("\n")
        # Prefer direct media files
        for video in video_url_list:
            if video.endswith((".mp4", ".mp3", ".mov", ".webm")):
                videos_url.append(video)
        # Fall back to streaming playlists when no direct file was found
        if len(videos_url) == 0:
            for video in video_url_list:
                if video.endswith(".m3u8"):
                    videos_url.append(video)
    except subprocess.TimeoutExpired:
        youtube_subprocess.kill()
    return videos_url


print(get_video_urls(url="https://edition.cnn.com/videos/world/2022/12/06/china-beijing-covid-restrictions-wang-dnt-ebof-vpx.cnn"))
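One thing to watch for: media URLs often carry a query string (for example a signed token), in which case a plain endswith check on the whole URL misses them. The following is a minimal sketch of a stricter filter that inspects only the URL path. It is an optional refinement rather than part of the code above, and the helper names has_extension and pick_media_urls as well as the sample URLs are hypothetical.

```python
from urllib.parse import urlparse

MEDIA_EXTENSIONS = (".mp4", ".mp3", ".mov", ".webm")


def has_extension(url, extensions):
    """Check the URL path only, so '?token=...' does not hide the extension."""
    return urlparse(url).path.lower().endswith(extensions)


def pick_media_urls(candidates):
    """Prefer direct media files; fall back to .m3u8 playlists if none exist."""
    cleaned = [c.strip() for c in candidates if c.strip()]
    direct = [c for c in cleaned if has_extension(c, MEDIA_EXTENSIONS)]
    if direct:
        return direct
    return [c for c in cleaned if has_extension(c, (".m3u8",))]


# Illustrative lines, shaped like the one-URL-per-line output of "yt-dlp -g"
sample_output = [
    "https://cdn.example.com/v/clip.mp4?token=abc",
    "https://cdn.example.com/v/master.m3u8",
    "",
]
picked = pick_media_urls(sample_output)
print(picked)
```

The same two-stage preference is kept: direct files first, streaming playlists only when nothing else is available.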
Selenium + Network logs
Whenever blob-format URLs are used on a website and the video is being played, we can access the streaming URL (.m3u8) for that video in the browser’s network tab. We can use the network and performance logs to find the streaming URLs.
What is M3U8?
M3U8 is a text file that uses UTF-8-encoded characters to specify the locations of one or more media files. It is commonly used to specify a playlist of audio or video files for streaming over the internet, using a media player that supports the M3U8 format, such as VLC, Apple’s iTunes, and QuickTime. The file typically has the “.m3u8” file extension and contains a list of one or more media files along with attribute information lines. Each entry in an M3U8 file typically specifies a single media file, together with its title and length, or a reference to another M3U8 file for streaming a playlist of media files.
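As a rough illustration of that structure, the sketch below parses a small playlist into its entries. The playlist text is made up for the example, and parse_m3u8 is a hypothetical helper that handles only #EXTINF lines (length and optional title) followed by a media location; real playlists can carry many more tags.

```python
def parse_m3u8(text):
    """Return a list of {'uri', 'duration', 'title'} dicts from playlist text."""
    entries = []
    pending = None  # (duration, title) from the most recent #EXTINF line
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#EXTINF:"):
            duration, _, title = line[len("#EXTINF:"):].partition(",")
            pending = (float(duration), title.strip())
        elif not line.startswith("#"):
            # Any non-comment line is a media file or a nested playlist
            duration, title = pending if pending else (None, "")
            entries.append({"uri": line, "duration": duration, "title": title})
            pending = None
    return entries


# Made-up playlist for illustration
sample_playlist = """#EXTM3U
#EXT-X-TARGETDURATION:10
#EXTINF:9.009,Intro
segment0.ts
#EXTINF:9.009,
segment1.ts
#EXTINF:3.003,
segment2.ts
#EXT-X-ENDLIST
"""
entries = parse_m3u8(sample_playlist)
for entry in entries:
    print(entry["uri"], entry["duration"], entry["title"])
```

Entries without a preceding #EXTINF line, such as references to other .m3u8 files, come back with a duration of None.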
We can extract the network and performance logs using Selenium with some advanced options. Perform the following steps to install all the required packages:
pip install selenium
pip install webdriver_manager
Below is the sample code for getting the streaming URL (.m3u8) using Selenium and network logs:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import time
import json

options = webdriver.ChromeOptions()
options.add_argument("--no-sandbox")
options.add_argument("--headless")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("start-maximized")
options.add_argument("--autoplay-policy=no-user-gesture-required")
options.add_argument("disable-infobars")
options.add_argument("--disable-extensions")
options.add_argument("--ignore-certificate-errors")
options.add_argument("--mute-audio")
options.add_argument("--disable-notifications")
options.add_argument("--disable-popup-blocking")
# Enable Chrome's performance log so that network events are recorded.
# Recent Selenium 4 releases no longer accept desired_capabilities,
# so the logging preference is set on the options object instead.
options.set_capability("goog:loggingPrefs", {"performance": "ALL"})

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()),
                          options=options)


def get_m3u8_urls(url):
    driver.get(url)
    # Scroll down so lazily loaded players start, then wait for playback requests
    driver.execute_script("window.scrollTo(0, 10000)")
    time.sleep(20)
    logs = driver.get_log("performance")
    url_list = []
    for log in logs:
        network_log = json.loads(log["message"])["message"]
        if ("Network.response" in network_log["method"]
                or "Network.request" in network_log["method"]
                or "Network.webSocket" in network_log["method"]):
            request_url = network_log["params"].get("request", {}).get("url", "")
            # Keep streaming playlist URLs and skip browser-internal blob URLs
            if ".m3u8" in request_url and "blob" not in request_url:
                url_list.append(request_url)
    driver.close()
    return url_list


if __name__ == "__main__":
    url = "https://fruitlab.com/video/aTUqTrJrMtj6FgO5?ntp=ggm"
    url_list = get_m3u8_urls(url)
    print(url_list)
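The filtering step can also be tried without launching a browser. The sketch below is a standalone variant that works on entries shaped like those returned by the performance log (a dict whose "message" field is a JSON string). The sample entries are hand-written for illustration rather than captured from a real page, extract_m3u8_urls is a hypothetical helper, and removing duplicate URLs is an addition not present in the code above.

```python
import json


def extract_m3u8_urls(log_entries):
    """Return unique .m3u8 request URLs found in performance log entries."""
    urls = []
    for entry in log_entries:
        message = json.loads(entry["message"])["message"]
        if not message.get("method", "").startswith("Network."):
            continue
        request = message.get("params", {}).get("request", {})
        url = request.get("url", "")
        if ".m3u8" in url and not url.startswith("blob:") and url not in urls:
            urls.append(url)
    return urls


def make_entry(method, params):
    """Build one log entry in the same shape as a performance log record."""
    return {"message": json.dumps({"message": {"method": method, "params": params}})}


# Hand-written sample entries, for illustration only
sample_logs = [
    make_entry("Network.requestWillBeSent",
               {"request": {"url": "https://cdn.example.com/stream/master.m3u8"}}),
    make_entry("Network.requestWillBeSent",
               {"request": {"url": "https://cdn.example.com/stream/master.m3u8"}}),
    make_entry("Network.requestWillBeSent",
               {"request": {"url": "https://example.com/app.js"}}),
    make_entry("Page.loadEventFired", {"timestamp": 1.0}),
]
found = extract_m3u8_urls(sample_logs)
print(found)
```

Keeping the filter separate from the browser session makes it easier to check the logic against saved logs.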
Once you get the streaming URL, it can be played in the VLC media player using the stream option.
The m3u8 URL can also be downloaded as a .mp4 file using FFmpeg. It can be installed on Ubuntu using:
sudo apt install ffmpeg
After installing FFmpeg we can easily download the video using the below command:
ffmpeg -i http://..m3u8 -c copy -bsf:a aac_adtstoasc output.mp4
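To run the same command from Python, for example right after collecting the URLs, a small wrapper along these lines could be used. This is a minimal sketch: build_ffmpeg_command and download_stream are hypothetical helper names, the example URL is a placeholder, and the arguments simply mirror the command shown above.

```python
import shutil
import subprocess


def build_ffmpeg_command(m3u8_url, output_path):
    """Return the argument list for copying an HLS stream into an .mp4 file."""
    return ["ffmpeg", "-i", m3u8_url, "-c", "copy",
            "-bsf:a", "aac_adtstoasc", output_path]


def download_stream(m3u8_url, output_path):
    """Run ffmpeg and raise an error if it is missing or exits with a failure."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg was not found on PATH")
    subprocess.run(build_ffmpeg_command(m3u8_url, output_path), check=True)


# Only builds and prints the command; call download_stream() to actually run it
command = build_ffmpeg_command("https://example.com/stream/master.m3u8", "output.mp4")
print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting problems when the URL contains characters such as & or ?.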
We hope you like these two approaches to advanced video scraping. Do let us know if you have any queries.