Why can’t Python sockets resolve URLs with http in them?

My script basically looks like this:

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(('http://goal.com', 80))

I am getting an error like this:

Traceback (most recent call last):
  File "/home/danidee/PycharmProjects/b&h/dang.py", line 4, in <module>
    s.connect(('http://goal.com', 80))
socket.gaierror: [Errno -2] Name or service not known

This means that Python can’t resolve that particular host.

If I remove the http:// and leave it as goal.com, or make the request through my browser, it works fine: neither the script nor the browser throws an error. One of the answers on Stack Overflow suggested removing proxies, which I’ve done, but the error doesn’t go away. From the terminal I’ve tried

host http://goal.com

but that does not resolve either. I’m guessing this is something in my system configuration, but I can’t figure it out. I’ve looked into /etc/hosts, but I did not see anything weird there.

I know I could just do without the http://, but I would like to know why it doesn’t work that way, and also what to do if I want to specify another protocol such as https or ftp.

For FTP I’ve seen code that uses ftp.goal.com, for example, and it seems to work fine, but I tried the same thing with http and it still failed.

Answer

For this to be answered, you need to understand how the TCP/IP stack works. BTW, I’ll just ignore the OSI model because it is mostly useless in most real-world situations.

At the bottom of it all, we have the protocols used to transfer bits and bytes over physical/wireless links. This includes things such as Ethernet, 802.11, MAC stuff, etc… It’s direct communication between one machine and another machine directly connected to it. Period.

Then, over that, we have one of the gems of protocol design, the Internet Protocol. Unlike what you may think, it’s a relatively simple protocol. Its purpose is to define the concepts of hosts, addresses, routing, quality of service, and a few other things. It’s very minimalistic in perspective. Thanks to the IP, one machine can indirectly connect to another through a whole arrangement of networks and gateways (that usually means routers).

The IP by itself, however, has certain pitfalls. Namely…

  • There’s no concept of ports.
  • Everything must be represented by internet addresses, that is, there are no thingslikethis.com, also known as domains.
  • The IP is unreliable, meaning that packets can get lost (with no notification whatsoever), duplicated, corrupted, etc… This does happen in modern networks. There’s no concept of “connection” whatsoever. Just packets, period.

So, to solve (most of) these problems, comes a not-so-bright gem in protocol design, the Transmission Control Protocol. (Note: I’m ignoring UDP for simplicity’s sake.) The TCP’s purpose is to allow a reliable and stateful connection to be established over an unreliable routing protocol, usually the IP. To do so, it adds considerable overhead to the packets, which is sometimes undesirable. It also has some nice extra features, such as ports. The idea is that a port represents a “service” or “application” that runs inside a host, along with other applications. This is a fundamental concept of multitasking systems. A pair made up of a host’s address and a port is referred to as a “socket”. A pair of sockets, one in host A pointing to host B, and another in host B pointing to host A, is called a connection once the three-way handshake completes. Thanks to the TCP, we can now say things such as 192.168.1.123:8080, send data there, and be confident that the data either never reaches the destination, or reaches it successfully and correctly. However, still no domains!
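The “pair of sockets” idea can be seen on a single machine. The following is a minimal sketch, assuming a loopback address of 127.0.0.1 and a port chosen by the operating system; the function name loopback_demo and the payload are made up for illustration.

```python
import socket

def loopback_demo(payload=b"ping"):
    """Open a TCP connection over loopback and report how each end sees it."""
    # Server side: bind to 127.0.0.1 and let the OS pick a free port (port 0).
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))
        server.listen(1)
        server_addr = server.getsockname()

        # Client side: connect() is what triggers the three-way handshake.
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as client:
            client.connect(server_addr)
            conn, _peer = server.accept()
            with conn:
                client.sendall(payload)
                received = conn.recv(1024)
                return {
                    "client_local": client.getsockname(),
                    "client_remote": client.getpeername(),
                    "server_local": conn.getsockname(),
                    "server_remote": conn.getpeername(),
                    "received": received,
                }

info = loopback_demo()
print(info)
```

Each end holds an (address, port) pair for itself and one for its peer, and the two views mirror each other. Note that nothing in here is a name: only numeric addresses and ports are involved.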

Enter the Domain Name System. It defines a hierarchical structure of “domains”, symbolic names representing either a host or another hierarchical structure of the same kind. For instance, we have the top-level domain com, and its subdomain google, which refers to some IP address, say 201.191.202.174. We refer to domains in reverse-order notation, separated by dots, in the style of google.com. With the IP plus the TCP plus the DNS, we can now say things such as google.com:21, and get a reliable connection to it. Hurray!
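From Python, the name-to-address step can be exercised on its own with socket.getaddrinfo. This is only a sketch: the helper name resolve is invented here, and localhost is used as the example name so that it works without Internet access.

```python
import socket

def resolve(hostname, port):
    """Return the distinct IP addresses the system resolver reports for hostname."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})

addresses = resolve("localhost", 80)
print(addresses)
```

The port is passed along only because getaddrinfo builds complete socket addresses; the lookup itself is about the name.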

It’s now worth noting that when Python talks about “sockets”, like most libraries/languages/operating systems, it’s talking about sockets in the sense of the TCP. And, as we already know, TCP can only handle things of the style 192.168.1.123:8080. However, Python’s socket.socket.connect, though mostly a wrapper around it, gives you a little abstraction over C/POSIX’s connect(3): it performs the appropriate dance with the DNS if a hostname is provided instead of an actual address. Nonetheless, the abstraction ends there. connect expects a bare hostname or an IP address, and http://goal.com is neither, so the lookup fails with the gaierror shown in the question.
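To see where that abstraction stops, here is a small sketch that connects to a throwaway listener on the local machine, once by hostname and once with a URL-style string. The helper try_connect and the use of localhost are assumptions made for the example, not part of the original script.

```python
import socket

def try_connect(host, port):
    """Return True if connect() succeeds, False if the name cannot be resolved."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.connect((host, port))
        except socket.gaierror:
            return False
        return True

# Throwaway listener on loopback so the example needs no Internet access.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(2)
port = listener.getsockname()[1]

by_name = try_connect("localhost", port)         # a hostname: resolved, then connected
by_url = try_connect("http://localhost", port)   # a URL: not a hostname at all
listener.close()

print(by_name, by_url)
```

The second call fails before any TCP traffic is sent, because the string never gets past name resolution.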

However, what about funky things such as https://qwe.rty.uio/asd/fgh.html? To solve this, enters one of the most complex parts of the equation. Understood by none but glorified by all, the Hypertext Transfer Protocol. HTTP is a somewhat vaguely multipurpose protocol. It builds on Uniform Resource Identifiers, which are most of the time Uniform Resource Locators. This allows you to use the slash (/) after a domain in order to address a “resource” inside it, such as an image or webpage. URIs also define the way the resource is accessed (http:// means “through the HTTP”, https:// means “through the HTTPS”, ftp:// means “through the FTP”, etc…). HTTP adds an uncountable amount of extra funky things that are necessary for the World Wide Web (often incorrectly called “the Internet”) to work the way it does, such as sessions, authentication, encryption, status codes, caching, proxies, file downloads, etc…
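If you start from a URL and still want to use a raw socket, one approach is to split the URL yourself and hand only the host and port to connect. The sketch below uses urllib.parse.urlsplit from the standard library; the helper names split_url and build_get_request, and the DEFAULT_PORTS table, are invented for illustration.

```python
from urllib.parse import urlsplit

# Well-known default ports for the schemes mentioned in the question.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}

def split_url(url):
    """Split a URL into the (host, port, path) pieces a raw socket client needs."""
    parts = urlsplit(url)
    if parts.hostname is None:
        raise ValueError(f"no host in URL: {url!r}")
    port = parts.port or DEFAULT_PORTS[parts.scheme]
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return parts.hostname, port, path

def build_get_request(host, path):
    """Build a minimal HTTP/1.1 GET request as bytes, ready for sendall()."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: close",
        "",  # blank line terminates the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")

host, port, path = split_url("http://goal.com")
request = build_get_request(host, path)
print(host, port, path)
print(request)
```

With these pieces, s.connect((host, port)) followed by s.sendall(request) speaks plain HTTP over the TCP connection. For https the socket would additionally have to be wrapped with TLS (for example via the ssl module), and ftp is a different protocol altogether, so the scheme decides what you say after connecting, not how the name is resolved.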

tl;dr: Python’s socket library is a thin wrapper around C’s that happens to add a DNS resolution step up front. Excluding this, it works with vanilla TCP concepts, so it wants a hostname or an address, not a URL.

User contributions licensed under: CC BY-SA