I had a colleague approach me the other day asking:

“Can you teach me some Python?”

Naturally, I said:

“Sure, no problem”, and I sent him a one-hour meeting request.

The meeting came, and I found myself uneasy. Where does one start explaining Python in 1 hour? Data types? For loops? Anaconda? Capital letter conventions? Pandas? Numpy? Jupyter notebooks? Machine Learning?

The guy was interested in data ingestion, manipulation and visualisation, but showing him that you can easily read in a CSV with pandas and create visualisations with seaborn didn’t give him or me a warm fuzzy feeling.
It felt as if I was skipping over the basics. How do you know that you have to use matplotlib to save your seaborn figures? How do you explain the difference between a Jupyter notebook and a script? Where does Anaconda (a bigger snake) fit into the whole picture? Going off on a tangent to explain each of these topics seemed to cause more confusion.

So here we are, with a basic layout of tools and tips used in Python in general, focusing a bit on data-related tasks. If you are already a seasoned Pythoner, you might not gain that much from this blog post, as it is targeted more at the basics. However, if you’ve also pondered some of the topics mentioned above, keep reading.

The Language

First released in 1991, Python is an object-oriented, interpreted, high-level, general-purpose programming language, or as Jake VanderPlas, author of the Python Data Science Handbook, once tweeted:

Python is the second-best language for everything.

As a side note, all those underlined concepts are hyperlinks, if you don’t know any of them, or you are unsure, take the time to click them and read up a bit. However, I digress.

Being the second-best language for everything makes Python an excellent all-purpose Swiss Army knife to have in your back pocket, one that can ease 90% of the problems and tasks you’ve got in your life: from recording the temperature in your home, to automatically retweeting Trump, to writing a script to update your budget on Google Sheets, to wrangling data to gain novel insights in your workplace. Python is a one-stop shop that brings all of this and more to the table.

Python puts a great deal of emphasis on code readability, which can be seen in the language’s use of whitespace to indent code. See the example of a for loop in C# below.

for (int i = 0; i < 10; i++)
{
    Console.WriteLine("Value of i: {0}", i);
}

Notice all the brackets and verbosity? Now compare that to a for loop in Python.

for i in range(10):
    print('Value of i: ' + str(i))

Cleaner, right? Much cleaner! Here we run into the first golden rule of Python: whitespace has meaning. This is different from most other programming languages, where curly braces “{}” are used to delimit blocks of code.
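To make the whitespace rule concrete, here is a small sketch of my own (the variable names are made up for illustration). The only thing telling Python which lines belong to the if, which belong to the loop and which come after the loop is how far each line is indented.

```python
# Indentation alone decides which statements belong to which block.
even_numbers = []
for i in range(6):
    if i % 2 == 0:
        # Indented twice: runs only when the condition holds.
        even_numbers.append(i)
    # Indented once: runs on every pass through the loop.
    last_seen = i

# Not indented: runs once, after the loop has finished.
print(even_numbers)  # [0, 2, 4]
print(last_seen)     # 5
```

Move a line one level in or out and the program still runs, but it means something different.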

Python is thought of as a “batteries included” language with the base installation of Python shipping with most of the necessary packages/libraries/modules you would need. Here we run into the second golden rule of Python: everything is an object. Packages, libraries, modules, functions, classes, data types — you name it, and it’s an object.
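A quick way to convince yourself of golden rule number two is to ask Python directly. The sketch below uses a made-up function called greet, together with a number, a string, the str type and the math module, and checks that every one of them is an object.

```python
import math


def greet(name):
    """Return a short greeting."""
    return 'Hello, ' + name


# Integers, strings, functions, classes and modules are all objects.
things = [42, 'text', greet, str, math]
all_objects = all(isinstance(thing, object) for thing in things)
print(all_objects)  # True

# Because a function is an object, it has attributes of its own...
print(greet.__name__)  # greet
print(greet.__doc__)   # Return a short greeting.

# ...and it can be stored under another name and called through it.
say_hello = greet
print(say_hello('World'))  # Hello, World
```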

No wonder it is called an object-oriented language 🤯. However, the magic of everything being an object isn’t always that apparent until you start exploiting the object-ness of Python and reusing knowledge about objects gained elsewhere in Python to solve your current problem. I’ll explain with an example.

Say we’ve got a string: “The brown dog jumped over the lazy dog.”

Let’s assign this sentence to a variable, my_string, and use some of the out-of-the-box methods available for the base Python string type (or string object). Here I’ll use the convention of showing executed code prefixed with >>>; the output of the executed command then appears below the command.

>>> my_string = 'The brown dog jumped over the lazy dog.'
>>> type(my_string)
<class 'str'>
>>> my_string.split()
['The', 'brown', 'dog', 'jumped', 'over', 'the', 'lazy', 'dog.']
>>> my_string.replace('brown', 'hazel')
'The hazel dog jumped over the lazy dog.'
>>> my_string.title()
'The Brown Dog Jumped Over The Lazy Dog.'
>>> my_string.isdigit()
False

Just because our variable my_string is of type str, we get all of these, and many more, methods working out of the box. No imports, no weird configs, no inheritance — it just always works out of the box on any Python interpreter. If you have Python open you can see all of the attributes/methods that you get with the str type by calling the dir() function that tries to return the list of valid attributes for an object.
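As a rough sketch of what that looks like, the snippet below calls dir() on our string and filters out the names wrapped in underscores, which are internals you can ignore for now. The name public_names is just my choice for the example.

```python
my_string = 'The brown dog jumped over the lazy dog.'

# dir() lists the names of the attributes and methods of an object.
# Names that start with an underscore are internals, so hide them here.
public_names = [name for name in dir(my_string) if not name.startswith('_')]

print('split' in public_names)    # True
print('replace' in public_names)  # True
print('title' in public_names)    # True

# Peek at the first few names; the exact list depends on your Python version.
print(public_names[:5])
```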

Rather than having all of its functionality built into its core, Python was designed to be highly extensible. Here we run into golden rule number 3 for Python: Don’t reinvent the wheel! Chances are, someone has made something that solves what you are trying to do. Moreover, as Python is open source and the community is very sharing, it is highly encouraged to borrow, tweak, break and improve other people’s code.

Ok, basic overview of the language covered, but please go and read the wiki page of Python if you are further interested in what makes Python tick, where it came from and where it is used. You might be surprised to find that a lot of big applications like Instagram, Spotify, Netflix, Reddit and YouTube all have Python somewhere in the mix.

Data Types

Python has a few core data types which everything else builds on, and I like to think of data types as layers in the OSI model. If you are not familiar with the OSI model, it’s a conceptual abstraction of layers that enable protocol and interface compatibility.

A quick example of the OSI model is a fibre cable serving as layer 1 (the physical layer) and the Ethernet protocol (layer 2) running on top of layer 1. The abstraction works because the Ethernet protocol doesn’t care what layer 1 is: it could be fibre, microwave or even smoke signals. The idea is that higher-level layers assume that the lower layers do their job well, and you can build upwards from there. Similarly, we can build much more complex data types from the base data types in Python. And take a guess: all of these data types are also objects with a suite of built-in methods, as with str.

The basic Python data types are:

  • Integers
  • Floats
  • Complex Numbers
  • Strings
  • Booleans

Below is an example of each.

>>> my_int = 300
>>> my_float = 300.3
>>> my_complex = 1 + 3j
>>> my_string = "We've seen this one before"
>>> my_bool = True
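If you want to check what Python made of each of these, type() will tell you. The sketch below repeats the five assignments and then calls a method or attribute on some of them, to show that even a plain number is an object.

```python
my_int = 300
my_float = 300.3
my_complex = 1 + 3j
my_string = "We've seen this one before"
my_bool = True

# Print the type name next to each value.
for value in (my_int, my_float, my_complex, my_string, my_bool):
    print(type(value).__name__, value)

# Each value is an object with its own methods and attributes.
print(my_int.bit_length())    # 9
print(my_float.is_integer())  # False
print(my_complex.imag)        # 3.0
```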

We also get composite data types, some of which are:

  • Lists
  • Dictionaries
  • Sets
  • Tuples

These composite data types are all collections of values, each slightly different. Unlike arrays in many other languages, Python’s collections don’t mind if the values inside them have different types. In other words, we can put strings, booleans and complex numbers all in the same list and iterate over them. This automatic handling of types brings us to golden rule number 4: Python infers data types. That can be a blessing or a curse, but you never have to tell Python that you are about to declare an integer; it will infer it. For the times when the inferred type is not the one you want, you can be explicit about the type.
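Here is a small sketch of both ideas: a list holding several different types, and the difference between letting Python infer a type and asking for one explicitly. The variable names are mine, picked for the example.

```python
# One list can hold values of several different types.
mixed = [300, 300.3, 1 + 3j, 'some words', True]
type_names = [type(item).__name__ for item in mixed]
print(type_names)  # ['int', 'float', 'complex', 'str', 'bool']

# Python infers the type from the value on the right-hand side...
inferred = 7 / 2
print(type(inferred).__name__, inferred)  # float 3.5

# ...and you can be explicit by converting to the type you want.
as_int = int('42')
as_float = float(3)
as_text = str(3.5)
print(as_int, as_float, as_text)  # 42 3.0 3.5
```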

So composite data types… Below is an example of each of the 4 mentioned above.

>>> my_list = [1, True, 'some words']
>>> my_tuple = tuple(my_list)
>>> my_set = set(my_list)
>>> my_dictionary = {'elephant': 'A large, five-toed animal.',
...                  'go-cart': 'A small carriage for young children to ride in.'}

The difference between these 4 collection types is the following:

  • Lists are the most generally used, and as the name suggests they are just a list of things. More importantly, lists are mutable, which means that you can change the third value (index 2, since Python counts from 0) if you so please: my_list[2] = 'other words'.
  • Tuples are almost identical to lists, except for one critical difference: they are immutable. Immutable means that you cannot change a value in the tuple once it has been created. Executing my_tuple[2] = 'other words' will raise a TypeError.
  • Sets are different as they only contain unique values; in other words, if we had a list with 4 values, [1,2,3,1], converting it to a set results in {1,2,3}. Note that the order of values in a set isn’t guaranteed and might change when you convert a list to a set.
  • Lastly, dictionaries are key-value stores. They have a key, in our example the word, and an associated value, in our example the definition of the word. Note that you can’t have multiple entries of the same key as only the last declared key-value pair is stored.
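The four differences above can be sketched in a few lines. The dictionary values here are placeholders of my own. One thing to watch for with the earlier my_list example: Python treats 1 and True as equal, so a set built from it ends up with two members, not three.

```python
my_list = [1, True, 'some words']
my_tuple = tuple(my_list)

# Lists are mutable: the value at index 2 can be replaced in place.
my_list[2] = 'other words'
print(my_list)  # [1, True, 'other words']

# Tuples are immutable: the same assignment raises a TypeError.
try:
    my_tuple[2] = 'other words'
    tuple_changed = True
except TypeError as error:
    tuple_changed = False
    print('TypeError:', error)

# Sets keep only unique values.
unique_values = set([1, 2, 3, 1])
print(unique_values == {1, 2, 3})  # True

# 1 == True in Python, so they count as the same set member.
print(len(set([1, True, 'some words'])))  # 2

# Dictionaries keep only the last value declared for a repeated key.
my_dictionary = {'go-cart': 'first definition', 'go-cart': 'second definition'}
print(my_dictionary)  # {'go-cart': 'second definition'}
```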

Ok, that wraps up most of the basics regarding base data structures in Python. Again, all of these are objects and have their associated attributes and methods (the nice built-in functions like str.split()), and you can inspect these attributes by running dir(my_list), for example.

Nomenclature

So I’ve been irritating some seasoned Pythoners by calling some objects that are attributes methods and also calling libraries packages. So this section elaborates a bit on what the “correct” names for objects are, but when you are just starting with Python — think of everything as an object — no seriously! Everything!

A function is a named, reusable block of code that you declare with the def keyword, for example:

def my_function(argument_1, another_argument):
    """
    A function that adds two arguments together
    """
    return argument_1 + another_argument

Note the whitespace grouping all the internals of the function together.
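Declaring a function does nothing until you call it. The sketch below repeats the definition so that it runs on its own, then calls it a few different ways. Because Python infers types, the same function happily adds numbers or joins strings.

```python
def my_function(argument_1, another_argument):
    """
    A function that adds two arguments together
    """
    return argument_1 + another_argument


# Call it with numbers...
print(my_function(2, 3))  # 5

# ...or with strings, since + also works on strings.
print(my_function('data ', 'science'))  # data science

# Arguments can be passed by name, in any order.
print(my_function(another_argument=1, argument_1=10))  # 11
```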

A class is a blueprint created by a programmer for an object. A class defines a set of attributes that characterise an object that gets instantiated from this class, for example:

class Person:
    def __init__(self, name, surname):
        self.name = name
        self.surname = surname

    def get_full_name(self):
        return self.name + ' ' + self.surname

Here we’ve created a blueprint for a person in the form of the Person class; notice the capital letter, which I’ll touch on a bit later. We can now create various people using our Person class; each instance of the Person class is called an object.

An object is the realised version of the class, where the class is just the blueprint defined in the program, for example:

p1 = Person('John', 'Doe')
p2 = Person('Miley', 'Cyrus')

In this terminology, p1 and p2 are the objects, also called instances of Person. We can now do interesting things with these instantiated objects, like ask for their name, surname or full name, for example:

>>> print(p1.name)
John
>>> print(p2.get_full_name())
Miley Cyrus

Now for some other technical jargon. See the get_full_name() function that we declared in our class? We’re not allowed to call that a function anymore. Because it is inside a class, the correct term for it is a method. Similarly, we don’t say name and surname are variables; because they live inside a class, we refer to them as attributes.
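If the distinction feels fuzzy, the sketch below shows one practical way to tell the two apart. It repeats the Person class so that it runs on its own, and uses the built-in callable() and vars() functions to inspect an instance.

```python
class Person:
    def __init__(self, name, surname):
        self.name = name
        self.surname = surname

    def get_full_name(self):
        return self.name + ' ' + self.surname


p1 = Person('John', 'Doe')

# Attributes hold data: access them without parentheses.
print(p1.name)  # John

# Methods are functions bound to the object: call them with parentheses.
print(p1.get_full_name())  # John Doe

# callable() tells the two apart.
print(callable(p1.name))           # False
print(callable(p1.get_full_name))  # True

# vars() shows the attributes stored on this particular object.
print(vars(p1))  # {'name': 'John', 'surname': 'Doe'}
```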

Some other terms you might run into are module, package, library and framework which I briefly discuss below.

  • A module is a file which contains Python functions, global variables etc. It is nothing but a .py file containing executable Python code.
  • A package is a namespace which contains multiple packages/modules. It is a directory which contains a particular file: __init__.py.
  • A library is a collection of various packages. Conceptually, there is no difference between a package and a Python library.
  • A framework is a collection of various libraries that also defines the overall structure and flow of your code.
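You can see the module versus package distinction from inside Python. The sketch below assumes the standard CPython layout, where json is a package (a directory with an __init__.py) and math is a single module. Both show up as objects of type module, but only a package carries a __path__ attribute.

```python
import json          # a package in the standard library
import json.decoder  # one of the modules inside that package
import math          # a single module

# Once imported, modules and packages are objects of the same type.
print(type(json).__name__, type(math).__name__)  # module module

# Only packages have a __path__ attribute, pointing at their directory.
print(hasattr(json, '__path__'))  # True
print(hasattr(math, '__path__'))  # False

# A module inside a package is addressed with a dotted name.
print(json.decoder.__name__)  # json.decoder
```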

However, when you are starting, don’t worry about all of this jargon, call everything an object and learn what is essential, but now you know what you should call things if you’d like to be technically correct.

Packages

I mentioned above that someone has probably built whatever you need in a package or library somewhere. But where? Moreover, how can you leverage all the hard work other people have put in? Well, by installing packages. Installing packages can be done in two ways: using pip or using conda.

The former is the pure Python way to install packages. pip is the package installer for Python, and it is itself a Python package that you can install. See how everything is just an object?

Conda, on the other hand, is an attempt to aid package management. It is the package manager that ships with Anaconda, a project that aims to bundle commonly used Python resources together into a bigger snake. If you are starting, I’d advise you to download Anaconda, as it comes with all the things you’ll need to get going out of the box and you don’t have to fiddle with environment variables and such.
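Before installing anything, it can be handy to check what is already there. The sketch below is one way to do that from Python using the standard library’s importlib; the helper name is_installed is my own invention and not part of pip or conda.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the named top-level package can be imported."""
    return importlib.util.find_spec(package_name) is not None


# json ships with Python; the others have to be installed with pip or conda.
for name in ['json', 'numpy', 'pandas', 'seaborn']:
    status = 'installed' if is_installed(name) else 'missing'
    print(name, status)
```

Anything reported as missing is a candidate for a pip or conda install.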

PEPs

Python Enhancement Proposals (PEPs) are design documents for the Python community, and some of them lay down conventions that I advise you to follow from the start. You can name your variables or your classes anything, but if everybody sticks to the conventions, then it is obvious what you are talking about without any comments.

One of these PEPs is PEP 20, which is called The Zen of Python, and you can print it out in any Python interpreter by running the following code:

>>> import this

Executing the above command prints out PEP 20, which reads:

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

For me, PEP 20 summarises the culture surrounding Python and gives some guidance when you have to make a difficult design choice. You can find all the PEPs here, but one I’d like to take some time with is PEP 8, titled the Style Guide for Python Code.

Without going into too much detail, the gist of PEP 8 is:

code is read much more often than it is written.

The guidelines provided in PEP 8 are intended to improve the readability of code and make it consistent across the broad spectrum of Python code. As PEP 20 says, “Readability counts”. Below are some highlights of code styling conventions:

  • Indentation: use 4 spaces per indentation level.
  • Tabs or Spaces? Spaces are the preferred indentation method.
  • Maximum Line Length? Limit all lines to a maximum of 79 characters.
  • Blank Lines? Surround top-level function and class definitions with two blank lines.
  • Names to Avoid? Never use the characters ‘l’ (lowercase letter el), ‘O’ (uppercase letter oh), or ‘I’ (uppercase letter eye) as single character variable names.
  • Package and Module Names? Modules should have short, all-lowercase names.
  • Class Names? Class names should normally use the CapWords convention.
  • Function and Variable Names? Function names should be lowercase, with words separated by underscores as necessary to improve readability.
  • Constants? Constants are usually defined on a module level and written in all capital letters with underscores separating words. Examples include MAX_OVERFLOW and TOTAL.
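To see several of these conventions side by side, here is a small made-up example. The temperature log and all the names in it are hypothetical and chosen purely to show the naming and layout rules.

```python
"""A tiny module laid out along the PEP 8 highlights listed above."""

# Constant: defined at module level, in capitals with underscores.
MAX_OVERFLOW = 3


class TemperatureLog:
    """Class names use the CapWords convention."""

    def __init__(self):
        self.readings = []

    def add_reading(self, degrees_celsius):
        """Method and argument names are lowercase with underscores."""
        self.readings.append(degrees_celsius)


def count_overflows(temperature_log, limit):
    """Count the readings above the limit, capped at MAX_OVERFLOW."""
    overflows = [value for value in temperature_log.readings if value > limit]
    return min(len(overflows), MAX_OVERFLOW)


home_log = TemperatureLog()
for reading in [19.5, 24.0, 31.2, 35.8]:
    home_log.add_reading(reading)

print(count_overflows(home_log, limit=30))  # 2
```

Note the 4 spaces per indentation level and the two blank lines around the top-level class and function definitions.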

Data Science Stack

Right, you’ve come a long way. Hopefully, by now you are more comfortable with Python and can see how its subtle conventions tell you more about what you are working with, without anything being spelled out. For example, if I see from sklearn.preprocessing import StandardScaler, I intuitively know that StandardScaler is a class that I have to instantiate, simply because it is written in CapWords.

Moving along to the tools you’ll need to do some data ingestion, manipulation and visualisation. Shockingly there are packages for all of these tasks, and the 5 main packages I want to mention here are:

  • numpy
  • matplotlib
  • pandas
  • seaborn
  • pickle

NumPy, short for Numerical Python, is the fundamental package for scientific computing with Python. It contains among other things:

  • a powerful N-dimensional array object
  • sophisticated (broadcasting) functions
  • tools for integrating C/C++ and Fortran code
  • useful linear algebra, Fourier transform, and random number capabilities
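As a small taste of the array object and broadcasting, here is a sketch that assumes NumPy is installed. The array names grid and offsets are just for illustration.

```python
import numpy as np

# A 2-dimensional array with 2 rows and 3 columns.
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])
print(grid.shape)  # (2, 3)

# Broadcasting: the single row of offsets is applied to both rows of grid.
offsets = np.array([10, 20, 30])
shifted = grid + offsets
print(shifted)
# [[11 22 33]
#  [14 25 36]]

# Arithmetic applies element-wise, with no explicit loop.
total = (grid * 2).sum()
print(total)  # 42

# Linear algebra: multiplying by the identity matrix leaves grid unchanged.
identity = np.eye(3)
print(np.array_equal(grid @ identity, grid))  # True
```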

Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits.

Pandas is a library providing high-performance, easy-to-use data structures and data analysis tools in Python. The primary data type in pandas is a DataFrame which is basically a table consisting of rows and columns. Moreover, as always, DataFrames have some fantastic methods that are bound to change the way you manipulate data forever.
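To give you a feel for a DataFrame, the sketch below builds a tiny table of made-up pet data, filters it and summarises it by group. It assumes pandas is installed, and the column names are my own.

```python
import pandas as pd

# A DataFrame is a table: each dictionary key becomes a column.
pets = pd.DataFrame({
    'animal': ['dog', 'cat', 'dog', 'cat'],
    'weight_kg': [30.0, 4.0, 20.0, 5.0],
})
print(pets.shape)  # (4, 2)

# Keep only the rows where a condition holds.
heavy = pets[pets['weight_kg'] > 10]
print(len(heavy))  # 2

# Group the rows by animal and average the weights in each group.
mean_weight = pets.groupby('animal')['weight_kg'].mean()
print(mean_weight['dog'])  # 25.0
print(mean_weight['cat'])  # 4.5
```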

Seaborn is a data visualization library built on top of matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. As Seaborn is a pretty wrapper built on top of matplotlib, you can always use lower level matplotlib functions to fine tune your Seaborn if you require some funky additions.
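This layering is also why saving a Seaborn figure goes through matplotlib. The sketch below assumes seaborn, matplotlib and pandas are installed, uses made-up data, and writes the image to your system’s temporary folder. The file name is arbitrary.

```python
import os
import tempfile

import matplotlib
matplotlib.use('Agg')  # draw to files only, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

pets = pd.DataFrame({
    'age_years': [1, 3, 5, 7],
    'weight_kg': [4.0, 12.0, 20.0, 30.0],
})

# seaborn draws the plot and hands back a matplotlib Axes object.
ax = sns.scatterplot(data=pets, x='age_years', y='weight_kg')

# Fine-tuning uses plain matplotlib methods on that Axes.
ax.set_title('Weight against age')

# Saving is also done through matplotlib.
output_path = os.path.join(tempfile.gettempdir(), 'pets_scatter.png')
plt.savefig(output_path)
plt.close()

print(os.path.exists(output_path))  # True
```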

The pickle module implements binary protocols for serialising and de-serialising a Python object structure. “Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation. Say you’ve worked a lot on cleaning a DataFrame and you would like to make a checkpoint and save the current state of your DataFrame — datatypes of each column included — then you’d pickle up the DataFrame and reuse it later on.
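Here is a minimal sketch of a pickle round trip using a plain dictionary as a stand-in checkpoint; the contents are made up. For an actual DataFrame, pandas offers DataFrame.to_pickle() and pd.read_pickle(), which do the same job with a file on disk.

```python
import pickle

# A stand-in for some cleaned-up state you want to checkpoint.
checkpoint = {
    'columns': ['animal', 'weight_kg'],
    'rows': [('dog', 30.0), ('cat', 4.0)],
    'cleaned': True,
}

# Pickling: turn the object hierarchy into a byte stream.
payload = pickle.dumps(checkpoint)
print(type(payload).__name__)  # bytes

# Unpickling: rebuild an equal object from the byte stream.
restored = pickle.loads(payload)
print(restored == checkpoint)  # True

# The data types survive the round trip, tuples included.
print(type(restored['rows'][0]).__name__)  # tuple
```

One caution: only unpickle data you trust, because loading a pickle can execute arbitrary code.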

The gist of this section is: if you are interested in data ingestion, manipulation and visualisation, familiarise yourself with the following package imports and watch a few tutorial videos on these packages.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import pickle as pck

IDE

An integrated development environment (IDE) is a software application that provides comprehensive facilities to computer programmers for software development, and there are many options for Python, but what would I advise?

Anaconda ships with 2 great IDEs, Jupyter notebooks and Spyder, and I’d recommend you start using them to get coding right away. There are, however, times when one is better suited than the other. Ultimately it comes down to personal preference and workflow, but I like to do the exploratory part of coding in a notebook and later on copy and reduce the notebook sandbox code into a production-ready script. I prefer to use Atom instead of Spyder, but Spyder will get the job done.

Concluding Thoughts

I hope that this blog has given you some insights into why Python is such a popular language nowadays. This blog hasn’t even scratched the surface of all the subtle easter eggs locked away in Python, but these are best discovered on your own.

Ultimately, Python is more than just a programming language; it is a means to express your creativity through code by abstracting away a lot of the repetitive, tedious coding tasks found in lower-level languages. Don’t get me wrong, there is a place for optimised C code, but what Python loses in performance, it makes up tenfold in ease of use.

Data Scientist, Co-Founder Automator Plus, Musician
