I recently saw a giant list of links to Udemy courses, in the form below. I found it unwieldy and nearly impossible to read, so I wrote a Python script to extract the titles and add formatting.
Coupons are valid for a limited time only, so grab them while they last.
WEB DEVELOPMENT
www.udemy.com/ultimate-web/learn/v4/?couponCode=LRNWEB
www.udemy.com/responsive-website-template-from-scratch-html-css/?couponCode=FREEFB
www.udemy.com/web-design-creating-websites-from-scratch/?couponCode=WEBFREE
My script added a readable title above each link and a dashed separator under each topic heading to mark where the topic changes.
Coupons are valid for a limited time only, so grab them while they last.
WEB DEVELOPMENT
------------------------------------------------------------------------------
ultimate web
www.udemy.com/ultimate-web/learn/v4/?couponCode=LRNWEB
responsive website template from scratch html css
www.udemy.com/responsive-website-template-from-scratch-html-css/?couponCode=FREEFB
web design creating websites from scratch
www.udemy.com/web-design-creating-websites-from-scratch/?couponCode=WEBFREE
The Code
I used regular expressions for the extraction, and then wrote several output formats for the links, including HTML anchor tags, Markdown format, and the currently shown format, where the URLs are tabbed in under their titles. I settled on the last one because Pastebin wouldn’t accept links with alternate text.
import re


def ProcessLine(pattern, line):
    match = re.search(pattern, line)
    if match is None:
        # Not a course link (e.g. a topic heading): follow it with a separator row
        return line + "*" * 79 + "\n"
    else:
        words = match.group(1).replace("-", " ")
        # This one simply puts a line between text and link so pastebin can use it
        return f'{words}\n\t\t{line}\n'


# www.udemy.com/applewatchcourse/?couponCode=EnrollFREE
if __name__ == '__main__':
    # Extract the text after '.com/' and the next slash
    pattern = re.compile(r'^www[.]udemy[.]com[/]([^/]+)[/]')
    with open('links.txt', 'r') as read:
        with open('fixedlinks.txt', 'w') as write:
            for line in read:
                write.write(ProcessLine(pattern, line))
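The HTML anchor and Markdown output formats mentioned earlier are not shown in the script above. As a rough sketch of what they might look like (not the original code), the helper names `extract_title`, `as_html`, and `as_markdown` below are made up for illustration, and prepending `https://` is an assumption, since the links in the list have no scheme:

```python
import re
from html import escape

# Same pattern as the script above: capture the text between '.com/' and the next slash.
PATTERN = re.compile(r'^www[.]udemy[.]com[/]([^/]+)[/]')


def extract_title(line):
    """Return (title, url) for a course link, or None for any other line."""
    url = line.strip()
    match = PATTERN.search(url)
    if match is None:
        return None
    return match.group(1).replace("-", " "), url


def as_html(title, url):
    # Assumption: the bare links need a scheme to be clickable, so https:// is added.
    # escape() guards against '&' or quotes in the query string breaking the attribute.
    return f'<a href="https://{escape(url)}">{escape(title)}</a>'


def as_markdown(title, url):
    # Same https:// assumption as above.
    return f'[{title}](https://{url})'


if __name__ == '__main__':
    sample = 'www.udemy.com/ultimate-web/learn/v4/?couponCode=LRNWEB\n'
    title, url = extract_title(sample)
    print(as_html(title, url))
    print(as_markdown(title, url))
```

Either formatter could be swapped in for the tabbed format on sites that do accept links with alternate text.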