How can I use awk to extract URLs from an HTML file?

I have an HTML file with JavaScript and CSS in the source. The JS contains a series of URLs embedded among other metadata. I want to use awk to extract the URLs (all enclosed in double quotes, with the http:// prefix) and dump them to stdout. I do not know how to use awk, but it seems to be the right tool for the job.

title: "Dsssat",
artist: "cxpl djij awsoj e",
mp3: "",

You can use grep. To include the double quotes:

grep -o '"http://[^"]*"' myfile.html

To exclude the double quotes:

grep -o 'http://[^"]*' myfile.html


You may want to do some further filtering to ensure that you only match the URLs in the JavaScript objects:

grep -o 'mp3: "http://[^"]*"' myfile.html | grep -o '"http://[^"]*"'

grep -o 'mp3: "http://[^"]*"' myfile.html | grep -o 'http://[^"]*'
Answered By: TachyonVortex

Why use awk? sed is better at this:

sed -ne 's/.*\(http[^"]*\).*/\1/p' < foo.js
Answered By: Dennis Kaarsemaker
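
Since the question asks specifically about awk, here is a minimal sketch of an awk equivalent. It assumes a POSIX awk and that each URL sits on a single line inside double quotes; `extract_urls` is a hypothetical helper name, not something from the answers above. Unlike a single substitution per line, the loop prints every quoted http:// URL on a line, not just one.

```shell
# Hypothetical helper: print every double-quoted http:// URL, one per line,
# without the surrounding quotes.
extract_urls() {
    awk '{
        line = $0
        # match() sets RSTART and RLENGTH for the leftmost match in line
        while (match(line, /"http:\/\/[^"]*"/)) {
            # skip the opening quote and drop the closing quote
            print substr(line, RSTART + 1, RLENGTH - 2)
            # keep scanning the rest of the line after this match
            line = substr(line, RSTART + RLENGTH)
        }
    }' "$@"
}
```

Usage would be `extract_urls myfile.html`. For this particular task the grep one-liners are shorter; awk becomes worthwhile mainly if the URLs need further per-line processing.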