For the school on chemoinformatics (BIGCHEM project). Munich, 17-21 October, 2016.

Dr. Pavel Polishchuk


The basic elements of Python 3

  1. Python is an object-oriented language, but it also supports procedural programming, which makes it well suited to fast development of simple scripts.
  2. Python is open-source
  3. Python is cross-platform
  4. Python is powerful:
    • dynamic typing of variables - no need to declare variable types
    • automatic memory management (garbage collector)
    • many built-in and third party libraries
  5. Python has APIs to many other languages
  6. Python is easy to learn and easy to use

My Python experience

measured in lines of code:
In [265]:
!find ~/Python -type f -name '*.py' -exec cat {} \; | sed '/^\s*#/d;/^\s*$/d;/^\s*\/\//d' | wc -l
23243
measured in the number of developed open-source tools:
  1. SiRMS - https://github.com/DrrDom/sirms
    Simplex Representation of Molecular Structure
    A tool for calculating fragment descriptors for single compounds, mixtures, "quasi"-mixtures and reactions, with atoms labeled by different user-defined properties (charge, lipophilicity, H-bonding, etc.).
  2. SPCI - https://github.com/DrrDom/spci
    Structural and Physico-Chemical Interpretation of QSAR models
    A tool with a GUI for automatic mining of chemical datasets; it performs model building, validation and interpretation and provides chemically meaningful output.

More details are here: http://qsar4u.com/



PEP 8 -- Style Guide for Python Code

https://www.python.org/dev/peps/pep-0008/

  1. Indentation - four spaces.
  2. Lines should contain up to 79 characters.
  3. Blank lines to separate functions, classes, logical blocks in code, etc.
  4. Import each module on a separate line.
  5. Use whitespace around operators in expressions.
  6. Use single or double quotes for strings consistently.
  7. Leave useful comments: in-line, block or docstrings.
  8. Name classes in CapitalizedWords and functions in lowercase_with_underscores.
  9. Avoid name conflicts.
    etc.

Built-in data types

Immutable:

  • Numbers
  • Strings
  • Tuples

Mutable:

  • Lists
  • Dictionaries
  • Sets
  • Files
  • Classes

Numbers

  • integer
  • float
  • complex
In [266]:
n1 = 1
type(n1)
Out[266]:
int
In [267]:
n2 = 1.0
type(n2)
Out[267]:
float
In [268]:
n3 = 3 + 3j
type(n3)
Out[268]:
complex

Python integers have arbitrary (unlimited) precision

In [269]:
n4 = 2 ** 1000
n4
Out[269]:
10715086071862673209484250490600018105614048117055336074437503883703510511249361224931983788156958581275946729175531468251871452856923140435984577574698574803934567774824230985421074605062371141877954182153046474983581941267398767559165543946077062914571196477686542167660429831652624386837205668069376

Python supports all common math operations on numbers

In [270]:
5 + 2  # addition
Out[270]:
7
In [271]:
5 * 2  # multiplication
Out[271]:
10
In [272]:
5 / 2  # division
Out[272]:
2.5
In [273]:
5 // 2  # floor division (integer part of the quotient)
Out[273]:
2
In [274]:
5 % 2  # remainder of division
Out[274]:
1
In [275]:
5 ** 2  # exponentiation
Out[275]:
25
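
A small sketch of how floor division and the remainder behave for negative operands, which often surprises newcomers. The values are chosen only for illustration; the built-in divmod() returns both results at once.

```python
# Floor division rounds towards negative infinity, not towards zero.
q = -5 // 2
# The remainder takes the sign of the divisor.
r = -5 % 2

# divmod() returns the quotient and the remainder as a tuple.
pair = divmod(-5, 2)

# For integers the identity a == (a // b) * b + (a % b) always holds.
identity_holds = (q * 2 + r == -5)

print(q, r, pair, identity_holds)
```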

Strings

are sequences of characters (surrounded by single or double quotations)

In [276]:
s1 = 'Olomouc'
s2 = "Olomouc"
s1 == s2
Out[276]:
True

Using double quotes you may represent the string with apostrophes

In [277]:
s3 = "Mom's son"
print(s3)
Mom's son

You may create multiline strings with triple quotes

In [278]:
s4 = """This is
a very long
comment"""
s4
Out[278]:
'This is\na very long\ncomment'
In [279]:
print(s4)
This is
a very long
comment

Strings may be concatenated

In [280]:
s1 + " is a nice city"
Out[280]:
'Olomouc is a nice city'

or repeated

In [281]:
s1 * 3
Out[281]:
'OlomoucOlomoucOlomouc'

There are a lot of methods which can be applied to strings

In [282]:
s1.find("uc")  # beware! indexing starts from 0
Out[282]:
5
In [283]:
s1.replace("omou", "ympi")
Out[283]:
'Olympic'
In [284]:
s5 = "  String with heading and trailing spaces   \n"
s5
Out[284]:
'  String with heading and trailing spaces   \n'
In [285]:
s5.strip()   # remove leading and trailing whitespace
Out[285]:
'String with heading and trailing spaces'
In [286]:
s4.split("\n")
Out[286]:
['This is', 'a very long', 'comment']
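
Because strings are immutable, the methods above return new strings and never change the original. A minimal sketch of this behaviour (the variable names are illustrative):

```python
city = 'Olomouc'

# String methods return new strings; the original object is unchanged.
upper_city = city.upper()
replaced = city.replace('omou', 'ympi')

# Item assignment is not allowed because strings are immutable.
try:
    city[0] = 'o'
    assignment_failed = False
except TypeError:
    assignment_failed = True

print(city, upper_city, replaced, assignment_failed)
```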

Lists

are ordered collections of items

In [287]:
ls1 = ['abc', 3, s1]
ls1
Out[287]:
['abc', 3, 'Olomouc']
In [288]:
ls2 = [3, 4, 5]
In [289]:
ls1 + ls2  # concatenation of lists returns new list
Out[289]:
['abc', 3, 'Olomouc', 3, 4, 5]
In [290]:
ls1.extend(ls2)  # extending a list with items from another one changes the original list
ls1
Out[290]:
['abc', 3, 'Olomouc', 3, 4, 5]
In [291]:
ls1.append(10)  # append to the list
ls1
Out[291]:
['abc', 3, 'Olomouc', 3, 4, 5, 10]
In [292]:
ls1.append(ls2)  # nested lists
ls1
Out[292]:
['abc', 3, 'Olomouc', 3, 4, 5, 10, [3, 4, 5]]
In [293]:
[0, 1] * 5   # repeat of list items
Out[293]:
[0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
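
Repetition copies references rather than objects, which matters when the repeated items are themselves mutable. A short sketch of the difference, with illustrative names:

```python
# Repetition copies references, so all three rows are the same inner list.
shared = [[0, 0]] * 3
shared[0][0] = 1

# A list comprehension builds an independent inner list on every iteration.
independent = [[0, 0] for _ in range(3)]
independent[0][0] = 1

print(shared)        # every row shows the change
print(independent)   # only the first row shows the change
```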

Tuples

are ordered collections of items like lists, but they are immutable. This is particularly useful when you exchange data between classes or modules and want to make sure that it is not changed accidentally.

In [294]:
t1 = (2, 3, 4)
t1
Out[294]:
(2, 3, 4)

Ordered collections (lists and tuples) of string items can be joined into one string with a specified separator, which is very useful when you store data in text files.

In [295]:
"\t".join(['1', '2', '3'])
Out[295]:
'1\t2\t3'
In [296]:
" ".join(('1', '2', '3'))
Out[296]:
'1 2 3'
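
Note that join() accepts only string items. A minimal sketch of what happens with mixed types and how to convert them first (the sample values are made up):

```python
values = [1, 2.5, 'Mol_1']

# join() raises TypeError when an item is not a string.
try:
    "\t".join(values)
    join_failed = False
except TypeError:
    join_failed = True

# Convert every item with str() before joining.
line = "\t".join(str(v) for v in values)

print(join_failed, repr(line))
```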

Slicing and indexing of strings, lists and tuples

In [297]:
s1
Out[297]:
'Olomouc'
In [298]:
s1[0]  # access items by index
Out[298]:
'O'
In [299]:
s1[1]
Out[299]:
'l'
In [300]:
s1[-1]  # access last item
Out[300]:
'c'
In [301]:
s1[len(s1) - 1]  # the same
Out[301]:
'c'
In [302]:
s1[0:3]
Out[302]:
'Olo'
In [303]:
s1[:3]  # the first and the last indices may be omitted
Out[303]:
'Olo'
In [304]:
s1[4:]
Out[304]:
'ouc'
In [305]:
s1[2:-1]
Out[305]:
'omou'
In [306]:
s1[:]  # creates a copy of an object, which is particularly useful when working with lists
Out[306]:
'Olomouc'
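
Slices also accept an optional third value, the step. A brief sketch on the same word (variable names are illustrative):

```python
word = 'Olomouc'

every_second = word[::2]     # characters at indices 0, 2, 4, 6
reversed_word = word[::-1]   # a negative step walks backwards
middle = word[1:6:2]         # characters at indices 1, 3, 5

# Unlike indexing, slicing never raises IndexError: bounds are clipped.
clipped = word[4:100]

print(every_second, reversed_word, middle, clipped)
```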

Dictionaries

consist of key-value pairs, like hash tables or associative arrays.
Keys can be of any immutable (hashable) type: number, string or tuple.
Values are items of any type without restrictions.

Dictionaries are very fast and efficient. They can be accessed only by keys.

In [307]:
d = {1: 'Olomouc', 2: 'nice', 3: "city"}
print(d)
{1: 'Olomouc', 2: 'nice', 3: 'city'}
In [308]:
d[1]   # get item with key 1 
       # (you cannot use slices like in lists, you need to iterate all keys to return corresponding values)
Out[308]:
'Olomouc'
In [309]:
# d[0]   # get item with key 0 which is absent and thus it leads to error
In [310]:
list(d.keys())
Out[310]:
[1, 2, 3]
In [311]:
list(d.values())
Out[311]:
['Olomouc', 'nice', 'city']
In [312]:
list(d.items())
Out[312]:
[(1, 'Olomouc'), (2, 'nice'), (3, 'city')]
In [313]:
if 0 in d:     # check for key existence (equivalent to 0 in d.keys())
    print("Success")
else:
    print("Failure")
Failure
In [314]:
d['list'] = [1, 2, 4]  # add a new item; this overwrites the value if the key already exists
d
Out[314]:
{1: 'Olomouc', 2: 'nice', 3: 'city', 'list': [1, 2, 4]}
In [315]:
d[3] = 3  # replace with new value
d
Out[315]:
{1: 'Olomouc', 2: 'nice', 3: 3, 'list': [1, 2, 4]}
In [316]:
d[3] = d[3] + 4   # update existing item
d
Out[316]:
{1: 'Olomouc', 2: 'nice', 3: 7, 'list': [1, 2, 4]}
In [317]:
d[3] += 4   # the same
d
Out[317]:
{1: 'Olomouc', 2: 'nice', 3: 11, 'list': [1, 2, 4]}
In [318]:
del d[3]   # remove item from dict
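
Accessing a missing key raises KeyError, as noted above. A sketch of two ways to avoid the exception using the standard dict methods get() and pop(); the default values are arbitrary:

```python
d = {1: 'Olomouc', 2: 'nice', 3: 'city'}

# Indexing a missing key raises KeyError.
try:
    d[0]
    key_error_raised = False
except KeyError:
    key_error_raised = True

# get() returns a default instead (None unless another value is given).
value = d.get(0, 'unknown')

# pop() removes a key and returns its value; the default avoids KeyError.
removed = d.pop(3, None)
removed_again = d.pop(3, None)

print(key_error_raised, value, removed, removed_again, d)
```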

Sets

are unordered collections of unique immutable (hashable) items

In [319]:
s1 = set([1, 2, 3])   # set can be created from iterable
s1
Out[319]:
{1, 2, 3}
In [320]:
s2 = {4, 5, 1, 2}  # set can be created from separate items
s2
Out[320]:
{1, 2, 4, 5}
In [321]:
s1 & s2   # intersection
Out[321]:
{1, 2}
In [322]:
s1 | s2   # union
Out[322]:
{1, 2, 3, 4, 5}
In [323]:
s1 - s2   # difference
Out[323]:
{3}
In [324]:
s2 - s1   # difference is not symmetrical
Out[324]:
{4, 5}
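
A few more set operations, sketched with the same values as above: the symmetric difference, subset and membership tests, and the error raised for unhashable items.

```python
s1 = {1, 2, 3}
s2 = {4, 5, 1, 2}

sym_diff = s1 ^ s2          # items present in exactly one of the sets
is_subset = {1, 2} <= s1    # subset test
has_three = 3 in s1         # membership test

# Lists are mutable and therefore unhashable, so they cannot be set items.
try:
    set([[1, 2], [3, 4]])
    unhashable_error = False
except TypeError:
    unhashable_error = True

print(sym_diff, is_subset, has_three, unhashable_error)
```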

Data type conversion

In [325]:
int('12')   # string to integer
Out[325]:
12
In [326]:
int(12.2)   # float to integer
Out[326]:
12
In [327]:
float('12')   # string to float
Out[327]:
12.0
In [328]:
str(12)   # number to string
Out[328]:
'12'
In [329]:
int('10001101', 2)   # convert string to integer with base 2
Out[329]:
141
In [330]:
a = [1, 1, 2, 3, 4]
a
Out[330]:
[1, 1, 2, 3, 4]
In [331]:
tuple(a)   # converts to tuple
Out[331]:
(1, 1, 2, 3, 4)
In [332]:
set(a)   # converts to a set and keeps only unique items
Out[332]:
{1, 2, 3, 4}
In [333]:
list(set(a))   # converts to a set and back to a list - can be used to remove duplicates (the original order is not guaranteed)
Out[333]:
[1, 2, 3, 4]
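
Sets are unordered, so list(set(a)) does not promise to keep the original order of items. A sketch of an order-preserving alternative, assuming Python 3.7 or later, where dicts keep insertion order:

```python
a = [3, 1, 3, 2, 1]

# dict.fromkeys() keeps the first occurrence of every item, in order
# (relies on insertion-ordered dicts, guaranteed since Python 3.7).
unique_in_order = list(dict.fromkeys(a))

# If a sorted result is wanted, sort the set explicitly.
unique_sorted = sorted(set(a))

print(unique_in_order, unique_sorted)
```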

List comprehensions

simplify generation of iterable objects (lists, dicts, sets, tuples)

Let's generate a list containing the number of characters in each word of the sentence

In [334]:
s = "Chemoinformatics is a bright star on in the scientific universe" 

How can this be done? Solution 1.

In [335]:
output = []
for word in s.split(' '):
    output.append(len(word))
output
Out[335]:
[16, 2, 1, 6, 4, 2, 2, 3, 10, 8]

Solution 2 using list comprehensions.

In [336]:
output = [len(word) for word in s.split(' ')]
output
Out[336]:
[16, 2, 1, 6, 4, 2, 2, 3, 10, 8]

It is possible to create a tuple instead of a list

In [337]:
output = tuple(len(word) for word in s.split(' '))
output
Out[337]:
(16, 2, 1, 6, 4, 2, 2, 3, 10, 8)

or even a dict with words as keys and their lengths as values

In [338]:
output = {word: len(word) for word in s.split(' ')}
output
Out[338]:
{'Chemoinformatics': 16,
 'a': 1,
 'bright': 6,
 'in': 2,
 'is': 2,
 'on': 2,
 'scientific': 10,
 'star': 4,
 'the': 3,
 'universe': 8}

or set

In [339]:
output = {len(word) for word in s.split(' ')}
output
Out[339]:
{1, 2, 3, 4, 6, 8, 10, 16}
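
Comprehensions can also filter items or choose between values. A short sketch on the same sentence; the length threshold of 3 is arbitrary:

```python
s = "Chemoinformatics is a bright star on in the scientific universe"

# A trailing "if" filters items.
long_words = [word for word in s.split(' ') if len(word) > 3]

# A conditional expression before "for" selects the value for each item.
labels = ['long' if len(word) > 3 else 'short' for word in s.split(' ')]

print(long_words)
print(labels)
```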

Generators

are functions which return an iterator that produces items one at a time, on demand.

In [340]:
def gen_subseq(seq, length):
    for i in range(len(seq) - length + 1):   # + 1 to include the last subsequence
        yield seq[i:i+length]
        
s = 'AGTGGTCA'
gen_subseq(s, 3)
Out[340]:
<generator object gen_subseq at 0x7f8026b258e0>
In [341]:
list(gen_subseq(s, 3))
Out[341]:
['AGT', 'GTG', 'TGG', 'GGT', 'GTC', 'TCA']
In [342]:
for subseq in gen_subseq(s, 3):
    if subseq == "GGT":
        break
    else:
        print(subseq)
AGT
GTG
TGG

Recursive generators are very simple to write starting from Python 3.3, which introduced yield from. Below is a generator of integers starting from the specified one.

In [343]:
def infinity(start):
    yield start
    yield from infinity(start + 1)

However, recursion has a maximum depth. If a program reaches it, an error is raised. You may increase the recursion limit with sys.setrecursionlimit() or reimplement the procedure without recursion.
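
A sketch of the non-recursive variant of the same generator. It uses a plain loop, so the recursion limit does not apply; itertools.islice is used here only to take a finite number of items.

```python
from itertools import islice


def infinity(start):
    """Yield start, start + 1, start + 2, ... without recursion."""
    n = start
    while True:
        yield n
        n += 1


# Take the first five items of the infinite sequence.
first_five = list(islice(infinity(10), 5))

# Skip 5000 items and take the next one; no recursion is involved,
# so the depth limit is not an issue.
value_5000 = next(islice(infinity(0), 5000, None))

print(first_five, value_5000)
```

The standard library function itertools.count(start) provides the same sequence.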


Variable assignment, shallow and deep copy of objects

Assignment binds a name to an object (a reference); it does not copy the value. This may lead to unexpected behaviour with mutable data types. Compare the following cases.

In [344]:
a = 4
b = a
a = 5
print(a)
print(b)
5
4
In [345]:
L = [1, 2, 3]
M = L
L[0] = 9
print(L)
print(M)
[9, 2, 3]
[9, 2, 3]
In [346]:
M is L   # check identity of referenced objects
Out[346]:
True
In [347]:
N = L[:]
N is L
Out[347]:
False
In [348]:
L[0] = 'p'
print(L)
print(N)
['p', 2, 3]
[9, 2, 3]
In [349]:
L = [1, [2, 3]]
M = L[:]
M is L
Out[349]:
False
In [350]:
print(L)
print(M)
[1, [2, 3]]
[1, [2, 3]]
In [351]:
L[1][1] = 5
print(L)
print(M)
[1, [2, 5]]
[1, [2, 5]]
In [352]:
from copy import deepcopy
L = [1, [2, 3]]
M = deepcopy(L)
L[1][1] = 5
print(L)
print(M)
[1, [2, 5]]
[1, [2, 3]]

Some built-in functions

min(), max(), sum()

In [353]:
ls = [1, 2, 3, 4]
In [354]:
min(ls)
Out[354]:
1
In [355]:
max(ls)
Out[355]:
4
In [356]:
sum(ls)
Out[356]:
10

zip(*iterables) - makes an iterator that aggregates elements from each of the iterables

In [357]:
s = 'ABCD'
In [358]:
zip(ls, s)
Out[358]:
<zip at 0x7f8026b38148>
In [359]:
list(zip(ls, s))
Out[359]:
[(1, 'A'), (2, 'B'), (3, 'C'), (4, 'D')]
In [360]:
d = dict(zip(ls, s))   # useful for creating dict from separate lists of keys and values
d
Out[360]:
{1: 'A', 2: 'B', 3: 'C', 4: 'D'}

enumerate()

In [361]:
enumerate(s)
Out[361]:
<enumerate at 0x7f8026ba1e58>
In [362]:
list(enumerate(s))
Out[362]:
[(0, 'A'), (1, 'B'), (2, 'C'), (3, 'D')]

They are particularly useful in loops:

In [363]:
for i, (number, letter) in enumerate(zip(ls, s)):    # unpacking zipped values is not necessary
    print('Iteration %i: number %i is assigned to letter %s' % (i, number, letter))
Iteration 0: number 1 is assigned to letter A
Iteration 1: number 2 is assigned to letter B
Iteration 2: number 3 is assigned to letter C
Iteration 3: number 4 is assigned to letter D
In [364]:
for i, item in enumerate(zip(ls, s)):                # item is a tuple
    print('Iteration %i: number %i is assigned to letter %s' % (i, item[0], item[1]))
Iteration 0: number 1 is assigned to letter A
Iteration 1: number 2 is assigned to letter B
Iteration 2: number 3 is assigned to letter C
Iteration 3: number 4 is assigned to letter D
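
Two details that are easy to miss: zip() stops at the shortest iterable, and enumerate() can start counting from any number. A sketch with illustrative data; itertools.zip_longest is the standard way to pad the shorter input.

```python
from itertools import zip_longest

numbers = [1, 2, 3, 4]
letters = 'AB'

# zip() silently truncates to the shortest iterable.
truncated = list(zip(numbers, letters))

# zip_longest() pads the shorter iterable with fillvalue instead.
padded = list(zip_longest(numbers, letters, fillvalue='-'))

# enumerate() accepts a start value for the counter.
numbered = list(enumerate(letters, start=1))

print(truncated)
print(padded)
print(numbered)
```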

Functions

are blocks of organized and reusable code which perform a particular action. They help to keep your code modular and flexible. There are a lot of built-in functions like print, len, sum, etc. Let's create our own function which calculates the mean value of a list.

In [365]:
def mean(lst):
    if lst:    # check if list is not empty
        return sum(lst) / len(lst)
    else:
        return None
In [366]:
mean([1, 2, 3, 4])
Out[366]:
2.5

The same using exception handling

In [367]:
def mean(lst):
    try:
        return sum(lst) / len(lst)
    except ZeroDivisionError:
        return None

print(mean([]))
None
In [368]:
def sd(lst):
    if not lst:
        return None        # if list is empty return None
    if len(lst) == 1:
        return 0
    else:
        m = mean(lst)
        # note: this expression is the sample variance; the standard deviation is its square root
        return sum((item - m) ** 2 for item in lst) / (len(lst) - 1)
In [369]:
sd([1, 2, 3, 4])
Out[369]:
1.6666666666666667
In [370]:
sd([5])
Out[370]:
0
In [371]:
sd([])
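
The value returned by sd() above is the sample variance (the sum of squared deviations divided by n - 1); the standard deviation is its square root. A sketch of both quantities, checked against the standard library. The names sample_variance and sample_sd are hypothetical, and returning None for fewer than two values is a design choice of this sketch, not of the original function.

```python
import math
import statistics


def sample_variance(lst):
    """Sample variance (n - 1 in the denominator); None for fewer than two values."""
    if len(lst) < 2:
        return None
    m = sum(lst) / len(lst)
    return sum((item - m) ** 2 for item in lst) / (len(lst) - 1)


def sample_sd(lst):
    """Sample standard deviation: the square root of the sample variance."""
    var = sample_variance(lst)
    return None if var is None else math.sqrt(var)


values = [1, 2, 3, 4]
print(sample_variance(values))
print(sample_sd(values))
print(statistics.stdev(values))   # reference implementation from the standard library
```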

Arguments in Python are passed to functions by reference: the function receives the same object, so changes to a mutable argument are visible to the caller, which may cause errors.

In [372]:
def func(lst):
    lst.append(mean(lst))   # add mean value to the list
    return sum(lst)         # calc sum and return the value

ls = [1, 2, 3, 4]
s = func(ls)
print(s)
print(ls)                   # list was changed!
12.5
[1, 2, 3, 4, 2.5]

To avoid such behaviour, copy (or deepcopy) the object inside the function before modifying it.
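
A minimal sketch of this advice: the function works on a shallow copy, so the caller's list stays intact. The name func_safe is hypothetical; a shallow copy is enough here because the list contains only numbers.

```python
def mean(lst):
    return sum(lst) / len(lst) if lst else None


def func_safe(lst):
    tmp = list(lst)          # shallow copy; the caller's list is not touched
    tmp.append(mean(tmp))    # add mean value to the copy
    return sum(tmp)


ls = [1, 2, 3, 4]
total = func_safe(ls)
print(total)
print(ls)                    # the original list is unchanged
```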

Arguments may be passed to a function by position and by name (keyword). However, a positional argument cannot follow a keyword argument.

In [373]:
def div(x, y):
    return x / y

print(div(2, 5))
print(div(5, 2))
print(div(y=5, x=2))
0.4
2.5
0.4

You may set default values of function arguments

In [374]:
def div(x, y=10):
    return x / y

print(div(2))
print(div(2, 5))
print(div(y=2, x=5))
0.2
0.4
2.5
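
One caveat about default values: they are evaluated once, when the function is defined, so a mutable default is shared between calls. A sketch of the pitfall and the usual None-based workaround; both function names are hypothetical.

```python
def add_item_shared(item, bucket=[]):
    # The same default list object is reused on every call.
    bucket.append(item)
    return bucket


def add_item(item, bucket=None):
    # Create a fresh list on each call when no bucket is given.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


first = add_item_shared(1)
second = add_item_shared(2)    # the same list object as first
print(first is second, second)

print(add_item(1), add_item(2))
```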

You may pass an arbitrary number of positional (not named) and keyword (named) arguments. Extra positional arguments are collected by a parameter prefixed with *, extra keyword arguments by a parameter prefixed with **. Positional arguments are passed as a tuple, keyword arguments as a dict.

In [375]:
def func(arg1, *args, **kargs):
    print("arg1 = ", arg1)
    print("not named args = ", args)
    print("named args = ", kargs)
In [376]:
func(1, 2, 3)
arg1 =  1
not named args =  (2, 3)
named args =  {}
In [377]:
func(1, arg2 = 2, arg3 = 3)
arg1 =  1
not named args =  ()
named args =  {'arg2': 2, 'arg3': 3}
In [378]:
func(1, 3, key1=10, key2=2)
arg1 =  1
not named args =  (3,)
named args =  {'key1': 10, 'key2': 2}
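
The same * and ** symbols work in the opposite direction at the call site, where they unpack a sequence or a dict into separate arguments. A short sketch using a two-argument division function like the one defined earlier:

```python
def div(x, y=10):
    return x / y


positional = (5, 2)
named = {'x': 5, 'y': 2}

r1 = div(*positional)   # same as div(5, 2)
r2 = div(**named)       # same as div(x=5, y=2)

print(r1, r2)
```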

File I/O

Let us read a file which has a header and in which each line contains a compound name and an activity value separated by a tab (\t), calculate the average and standard deviation of the activity values for each compound, and store the results in another text file.

Compound_name pIC50
Mol_1 8.6
Mol_1 8.7
Mol_2 7.2
Mol_3 6.5
Mol_3 6.5
Mol_1 9
Mol_4 7.5
Mol_5 6.9
Mol_6 8.1
Mol_7 9.2
Mol_2 4.1

There are several file modes: r - read, w - write, a - append, t - text, b - binary.

In [379]:
f = open("data/activity.txt", 'rt')   # open file for reading in text mode

The file object has several attributes:

In [380]:
print("Name of the file: ", f.name)
print("File closed?: ", f.closed)
print("File mode : ", f.mode)
Name of the file:  data/activity.txt
File closed?:  False
File mode :  rt

Iterate over lines and save them in a dict

In [381]:
d = {}   # create dict where we will store reading results as a list of values for each compound
         # because some compounds can have several values
f.readline()                             # read the first line from file (header) to skip it
for line in f:
    if line.strip():                     # check if line is not empty (skip empty lines)
        tmp = line.strip().split('\t')   # remove whitespaces and split line on tabs (\t) 
                                         # this will avoid errors if compound names contain spaces
        if tmp[0] not in d.keys():
            d[tmp[0]] = [float(tmp[1])]
        else:
            d[tmp[0]].append(float(tmp[1]))
f.close()                                # close file descriptor
                                         # otherwise it may be inaccessible by other applications
In [382]:
d
Out[382]:
{'Mol_1': [8.6, 8.7, 9.0],
 'Mol_2': [7.2, 4.1],
 'Mol_3': [6.5, 6.5],
 'Mol_4': [7.5],
 'Mol_5': [6.9],
 'Mol_6': [8.1],
 'Mol_7': [9.2]}

The full text of the above commands with handling of possible exceptions:

In [383]:
f = open("data/activity.txt", 'rt')
d = {}
try:
    f.readline()
    for line in f:
        if line.strip():
            tmp = line.strip().split('\t')
            if tmp[0] not in d.keys():
                d[tmp[0]] = [float(tmp[1])]
            else:
                d[tmp[0]].append(float(tmp[1]))
finally:
    f.close()   # if you use the f = open(...) statement you need a try-finally block to be sure
                # that in the case of exceptions your file will be closed and the file descriptor released,
                # otherwise the file can be blocked for access by other applications
                # (this is true if you open the file for editing)

Alternative solution with several improvements:

In [384]:
from collections import defaultdict   # import classes, functions, etc from a module

d = defaultdict(list)   # create a dict whose default value is an empty list:
                        # accessing a non-existent key creates it with an empty list
    
with open("data/activity.txt", 'rt') as f:           # files opened using with statement 
                                                     # will be closed automatically
    f.readline()                                     # skip header
    for line in f:
        if line.strip():                             # skip empty lines
            name, value = line.strip().split('\t')   # since we know that only two elements are in each line
                                                     # we may use such unpacking
            d[name].append(float(value))
In [385]:
d
Out[385]:
defaultdict(list,
            {'Mol_1': [8.6, 8.7, 9.0],
             'Mol_2': [7.2, 4.1],
             'Mol_3': [6.5, 6.5],
             'Mol_4': [7.5],
             'Mol_5': [6.9],
             'Mol_6': [8.1],
             'Mol_7': [9.2]})
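
The unpacking name, value = ... raises ValueError when a line does not contain exactly two fields, and float() does the same for a non-numeric value. A sketch that records the numbers of such lines instead of stopping; the sample data and the temporary file are assumptions made only to keep the example self-contained.

```python
import os
import tempfile
from collections import defaultdict

# Hypothetical sample with an empty line and a malformed line.
sample = "Compound_name\tpIC50\nMol_1\t8.6\n\nMol_2\t7.2\nbroken line\nMol_1\t9\n"

# Write the sample to a temporary file so the sketch does not depend on data/.
with tempfile.NamedTemporaryFile('wt', suffix='.txt', delete=False) as tmp:
    tmp.write(sample)
    fname = tmp.name

d = defaultdict(list)
skipped = []                                  # numbers of malformed lines
with open(fname, 'rt') as f:
    next(f)                                   # skip header (line 1)
    for line_no, line in enumerate(f, start=2):
        if not line.strip():
            continue                          # skip empty lines
        try:
            name, value = line.strip().split('\t')
            d[name].append(float(value))
        except ValueError:                    # wrong field count or non-numeric value
            skipped.append(line_no)

os.remove(fname)
print(dict(d))
print(skipped)
```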

Now let's calculate average and standard deviation of our values

In [386]:
output = {}
for k, v in d.items():    # iterate over pairs of keys and values
    avg = mean(v)
    std = sd(v)
    output[k] = (avg, std)
In [387]:
output
Out[387]:
{'Mol_1': (8.766666666666666, 0.04333333333333344),
 'Mol_2': (5.65, 4.8050000000000015),
 'Mol_3': (6.5, 0.0),
 'Mol_4': (7.5, 0),
 'Mol_5': (6.9, 0),
 'Mol_6': (8.1, 0),
 'Mol_7': (9.2, 0)}

One-liner solution using a dict comprehension

In [388]:
output = {k: (mean(v), sd(v)) for k, v in d.items()}
In [389]:
output
Out[389]:
{'Mol_1': (8.766666666666666, 0.04333333333333344),
 'Mol_2': (5.65, 4.8050000000000015),
 'Mol_3': (6.5, 0.0),
 'Mol_4': (7.5, 0),
 'Mol_5': (6.9, 0),
 'Mol_6': (8.1, 0),
 'Mol_7': (9.2, 0)}

Save results to a text file

In [390]:
with open("data/activity_stat.txt", "wt") as f:             # if the file exists it will be 
                                                            # silently overwritten
    for k, v in output.items():
        f.write(k + "\t" + "\t".join(map(str, v)) + "\n")   # map applies the specified function 
                                                            # over all items of the given iterable
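
The code above writes floats with full precision. If a fixed number of decimals is preferred, str.format() can be used instead of map(str, ...). A sketch that writes to an in-memory buffer (io.StringIO) rather than a real file; the two decimals and the sample values are arbitrary choices.

```python
import io

output = {'Mol_1': (8.766666666666666, 0.04333333333333344),
          'Mol_4': (7.5, 0)}

buffer = io.StringIO()                 # stands in for an opened text file
for name, (avg, std) in sorted(output.items()):
    # {:.2f} formats a number with two digits after the decimal point
    buffer.write("{}\t{:.2f}\t{:.2f}\n".format(name, avg, std))

text = buffer.getvalue()
print(text, end="")
```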

The whole text of the script:

In [391]:
from collections import defaultdict   # import classes, functions, etc from a module

d = defaultdict(list)
    
with open("data/activity.txt", 'rt') as f:
    f.readline()                                     
    for line in f:
        if line.strip():                             
            name, value = line.strip().split('\t')   
            d[name].append(float(value))
            
output = {k: (mean(v), sd(v)) for k, v in d.items()}

with open("data/activity_stat.txt", "wt") as f:
    for k, v in output.items():
        f.write(k + "\t" + "\t".join(map(str, v)) + "\n")

Create scripts

Let's create a script which takes a text file with compounds and their activities as input and returns a text file with the average and standard deviation of activity values for each compound, as shown in the example above.

We already have the backbone of our script, which does I/O and all calculations. However, to use it we would need to edit the file names each time.

In [392]:
from collections import defaultdict  


def mean(lst):
    if lst:   
        return sum(lst) / len(lst)
    else:
        return None
    
    
def sd(lst):
    if not lst:
        return None       
    if len(lst) == 1:
        return 0
    else:
        m = mean(lst)
        # note: this expression is the sample variance; the standard deviation is its square root
        return sum((item - m) ** 2 for item in lst) / (len(lst) - 1)
    
    
d = defaultdict(list)
    
with open("data/activity.txt", 'rt') as f:
    f.readline()                                     
    for line in f:
        if line.strip():                             
            name, value = line.strip().split('\t')   
            d[name].append(float(value))
            
output = {k: (mean(v), sd(v)) for k, v in d.items()}

with open("data/activity_stat.txt", "wt") as f:
    for k, v in output.items():
        f.write(k + "\t" + "\t".join(map(str, v)) + "\n")

To run the script from the command line we need to pass input and output file names to it and parse these command line arguments. Below is a backbone for such a script:

#!/usr/bin/env python3
# shebang string to specify path to the Python interpreter

import argparse   # import module to parse command line args and create a help message

if __name__ == '__main__':   # entry point of the script

    parser = argparse.ArgumentParser(description='Calculate average and standard deviation of compound activity.')
    parser.add_argument('-i', '--input', metavar='input.txt', required=True,
                        help='text file with header and two tab-separated column with compound name and '
                             'activity value.')
    parser.add_argument('-o', '--out', metavar='output.txt', required=False, default=None,
                        help='output text file in tab separated format with average and sd for each compound. '
                             'If the name will be omitted output will be in stdout.')

    args = vars(parser.parse_args())
    for k, v in args.items():
        if k == "input": input_fname = v
        if k == "out": output_fname = v

Add functions from above:

#!/usr/bin/env python3

import argparse
from collections import defaultdict


def mean(lst):
    if lst:
        return sum(lst) / len(lst)
    else:
        return None


def sd(lst):
    if not lst:
        return None
    if len(lst) == 1:
        return 0
    else:
        m = mean(lst)
        return sum((item - m) ** 2 for item in lst) / (len(lst) - 1)


def load_file(fname):
    d = defaultdict(list)
    with open(fname, 'rt') as f:
        f.readline()
        for line in f:
            if line.strip():
                name, value = line.strip().split('\t')
                d[name].append(float(value))
    return d


def save_file(fname, data):
    """
    fname - file name
    data - dict, keys are compound names, values are tuples of average and sd activity values
    """
    with open(fname, "wt") as f:
        for k, v in data.items():   # use the function argument, not the global variable output
            f.write(k + "\t" + "\t".join(map(str, v)) + "\n")


if __name__ == '__main__':   # entry point of the script

    parser = argparse.ArgumentParser(description='Calculate average and standard deviation of compound activity.')
    parser.add_argument('-i', '--input', metavar='input.txt', required=True,
                        help='text file with header and two tab-separated column with compound name and '
                             'activity value.')
    parser.add_argument('-o', '--out', metavar='output.txt', required=False, default=None,
                        help='output text file in tab separated format with average and sd for each compound. '
                             'If the name will be omitted output will be in stdout.')

    args = vars(parser.parse_args())
    for k, v in args.items():
        if k == "input": input_fname = v
        if k == "out": output_fname = v

    d = load_file(input_fname)
    output = {k: (mean(v), sd(v)) for k, v in d.items()}

    if output_fname is None:   # print to stdout (may also be done with sys.stdout.write)
        for k, v in output.items():
            print(k, *v, sep="\t")   # * unpacks iterable
    else:
        save_file(output_fname, output)

Save the script, make it executable and run it from the command line with different arguments

./calc.py -h
./calc.py -i input_file_name.txt
./calc.py --input input_file_name.txt > output_file_name.txt
./calc.py -i input_file_name.txt -o output_file_name.txt
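
For quick testing without a shell, parse_args() also accepts an explicit list of arguments. The sketch below additionally reads the values as attributes of the returned namespace (args.input, args.out), which is an alternative to looping over vars(args). The helper name build_parser and the shortened help strings are hypothetical.

```python
import argparse


def build_parser():
    """Create the command line parser with the same two options as the script."""
    parser = argparse.ArgumentParser(
        description='Calculate average and standard deviation of compound activity.')
    parser.add_argument('-i', '--input', metavar='input.txt', required=True,
                        help='input text file.')
    parser.add_argument('-o', '--out', metavar='output.txt', default=None,
                        help='output text file; stdout is used if omitted.')
    return parser


# Pass the arguments as a list instead of reading them from sys.argv.
args = build_parser().parse_args(['-i', 'input_file_name.txt'])

input_fname = args.input     # attribute names come from the long option names
output_fname = args.out

print(input_fname, output_fname)
```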

Classes

provide better modularity and flexibility of code.

Class: A user-defined prototype for an object that defines a set of attributes that characterize any object of the class. The attributes are data members (class variables and instance variables) and methods, accessed via dot notation.
Instance: An individual object of a certain class.
Class variable: A variable that is shared by all instances of a class. Class variables are defined within a class but outside any of the class's methods. Class variables are not used as frequently as instance variables are.
Instance variable: A variable that is defined inside a method and belongs only to the current instance of a class.
Method: A special kind of function that is defined in a class definition.
Method overloading: The assignment of more than one behavior to a particular method. The operation performed varies by the types of objects or arguments involved.
Inheritance: The transfer of the characteristics of a class to other classes that are derived from it.

Create class

In [393]:
class A:
    def __init__(self, name, value=0):
        self.name = name                # public instance attribute
        self.__value = value            # "private" instance attribute (name-mangled to _A__value)
    def get_attr(self):
        return "parent class value: " + str(self.__value)
    def set_attr(self, value):
        self.__value = value
    
class B(A):
    def __init__(self, name, value):
        super(B, self).__init__(name)   # init parent class if necessary
        self.__value = value            # stored as _B__value, so it does not override the parent value
    def get_attr(self):                 # override parent method
        return "child class value: " + str(self.__value)
In [394]:
a = A('Main class', 24)
b = B('Derived class', 42)
In [395]:
print(a.name)
print(a.get_attr())
Main class
parent class value: 24
In [396]:
print(b.name)
print(b.get_attr())
print(super(B, b).get_attr())
Derived class
child class value: 42
parent class value: 0
In [397]:
b.set_attr(55)
print(b.get_attr())
child class value: 42
In [398]:
b.__dict__    # look at the namespace
Out[398]:
{'_A__value': 55, '_B__value': 42, 'name': 'Derived class'}
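
The namespace above shows name mangling at work: an attribute written as __value inside class A is stored as _A__value. A minimal sketch showing that such attributes are hidden by convention only; the class and method names are illustrative.

```python
class A:
    def __init__(self, value=0):
        self.__value = value        # stored in the instance as _A__value

    def get_value(self):
        return self.__value         # inside the class the short name works


a = A(24)

# The mangled name is still reachable from outside, so this is not real privacy.
mangled = a._A__value

# Outside the class no mangling is applied, so the plain name does not exist.
try:
    a.__value
    direct_access_failed = False
except AttributeError:
    direct_access_failed = True

print(a.get_value(), mangled, direct_access_failed)
```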

Multiprocessing

Create a script which runs a long calculation function in a single process, as usual.

#!/usr/bin/env python3

import argparse


def long_calc(n):
    i = 0
    while i < n * 1000:
        i += 1
    return n * 2


if __name__ == '__main__':

    parser = argparse.ArgumentParser(description='Read text file with numbers, calculate results and save them.')
    parser.add_argument('-i', '--input', metavar='input.txt', required=True,
                        help='input text file with single numbers on separate lines.')
    parser.add_argument('-o', '--out', metavar='output.txt', required=True,
                        help='output text file.')

    args = vars(parser.parse_args())
    for o, v in args.items():
        if o == "input": in_fname = v
        if o == "out": out_fname = v

    with open(in_fname) as f:
        with open(out_fname, "wt") as f_out:
            for line in f:
                v = line.strip()
                if v:
                    f_out.write(str(long_calc(int(v))) + "\n")

$ time ./single_process.py -i rand_int.txt -o output_single.txt

real 0m22.080s
user 0m22.060s
sys 0m0.016s

Implement the same script using the multiprocessing module.
Objects which are passed to a process should be picklable (manual serialization may be required).

#!/usr/bin/env python3

import os
import argparse
from multiprocessing import Pool, cpu_count


def long_calc(line):
    if line.strip():
        n = int(line.strip())
        i = 0
        while i < n * 500:
            i += 1
        return n * 2


if __name__ == '__main__':

    parser = argparse.ArgumentParser(description='Multiprocessing example. Read text file with numbers, '
                                                 'calculate results and save them.')
    parser.add_argument('-i', '--input', metavar='input.txt', required=True,
                        help='input text file with single numbers on separate lines.')
    parser.add_argument('-o', '--out', metavar='output.txt', required=True,
                        help='output text file.')
    parser.add_argument('-c', '--ncpu', metavar='NCPU', required=False, default=None,
                        help='number of cores used for calculation. By default all but one will be used.')

    args = vars(parser.parse_args())
    for o, v in args.items():
        if o == "input": in_fname = v
        if o == "out": out_fname = v
        if o == "ncpu":   # at least one process, at most the number of cores
            ncpu = max(cpu_count() - 1, 1) if v is None else max(min(int(v), cpu_count()), 1)

    if os.path.isfile(out_fname):
        os.remove(out_fname)

    # context managers close the pool and both files when the work is done
    with Pool(ncpu) as p, open(in_fname) as f_in, open(out_fname, "wt") as f:
        for res in p.imap_unordered(long_calc, f_in, chunksize=10):
            if res is not None:   # blank input lines produce no result
                f.write(str(res) + "\n")

$ time ./multi_process_2.py -i rand_int.txt -o multi_output_2.txt -c 2

real 0m13.403s
user 0m26.552s
sys 0m0.028s
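
The choice of the number of worker processes can be isolated in a small helper, which makes the rule easy to test. This is only a sketch: the name resolve_ncpu and its parameters are hypothetical, and it guarantees at least one worker even on a single-core machine, where "all but one" would otherwise give zero.

```python
import os


def resolve_ncpu(requested=None, available=None):
    """Return a worker count between 1 and the number of available cores."""
    if available is None:
        available = os.cpu_count() or 1   # os.cpu_count() may return None
    if requested is None:
        return max(available - 1, 1)      # all but one core, never fewer than one
    return max(min(int(requested), available), 1)


print(resolve_ncpu(None, 8))    # default: all but one core
print(resolve_ncpu('2', 8))     # value as it arrives from the command line
print(resolve_ncpu(100, 4))     # clipped to the number of cores
```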

The obtained files are identical after sorting them, because the implemented multiprocessing does not guarantee the same order of output results.

diff <(sort output_single.txt) <(sort multi_output_2.txt)

Explanation of differences between imap/imap_unordered and map/map_async
http://stackoverflow.com/questions/26520781/multiprocessing-pool-whats-the-difference-between-map-async-and-imap

An alternative implementation which uses Queue to pass data to processes.

#!/usr/bin/env python3

import os
import argparse
from multiprocessing import Process, Manager, cpu_count


def long_calc(n):
    i = 0
    while i < n * 500:
        i += 1
    return n * 2


def long_calc_queue(q, output_file, lock):
    while True:
        value = q.get()
        if value is None:   # None is the signal to stop the worker
            break
        res = long_calc(value)
        lock.acquire()      # only one process may write to the file at a time
        try:
            with open(output_file, "at") as f_out:   # the file is closed after each write
                f_out.write(str(res) + "\n")
        finally:
            lock.release()


if __name__ == '__main__':

    parser = argparse.ArgumentParser(description='Multiprocessing example. Read text file with numbers, '
                                                 'calculate results and save them.')
    parser.add_argument('-i', '--input', metavar='input.txt', required=True,
                        help='input text file with single numbers on separate lines.')
    parser.add_argument('-o', '--out', metavar='output.txt', required=True,
                        help='output text file.')
    parser.add_argument('-c', '--ncpu', metavar='NCPU', required=False, default=None,
                        help='number of cores used for calculation. By default all but one will be used.')

    args = vars(parser.parse_args())
    for o, v in args.items():
        if o == "input": in_fname = v
        if o == "out": out_fname = v
        if o == "ncpu":   # at least one process, at most the number of cores
            ncpu = max(cpu_count() - 1, 1) if v is None else max(min(int(v), cpu_count()), 1)

    if os.path.isfile(out_fname):
        os.remove(out_fname)

    manager = Manager()
    lock = manager.Lock()
    q = manager.Queue(10 * ncpu)

    pool = []
    for _ in range(ncpu):
        p = Process(target=long_calc_queue, args=(q, out_fname, lock))
        p.start()
        pool.append(p)

    with open(in_fname) as f:
        for line in f:
            v = line.strip()
            if v:
                q.put(int(v))

    for _ in range(ncpu):   # one stop signal per worker
        q.put(None)

    for p in pool:
        p.join()

$ time ./multi_process.py -i rand_int.txt -o multi_output.txt -c 2

real 0m14.779s
user 0m28.624s
sys 0m0.160s

The speed-up does not depend linearly on the number of cores, for several reasons:

  • some overhead is always present, so specifying more cores does not necessarily increase the speed
  • synchronization between processes reduces efficiency
  • reading and writing files may be a bottleneck

Literature and other knowledge sources