PROGRAMMING: BEST PRACTICES

CSCAR WORKSHOP

08/22/2017

Marcio Mourao, Michael Clark, Alex Gaenko

Workshop setup

Go to the page https://marcio-mourao.github.io/

Download "Workshop.ipynb" and/or "Workshop.html" under "Best Practices in Scientific Computing" to your "username/Documents"

For the notebook:

Click the Windows button (Bottom Left Corner)

Click "All apps"

Click "Anaconda3 (64-bit)"

Click "Jupyter Notebook"

Click "Workshop.ipynb" (this should open a new tab in the browser)

Introduction

Please sign the sign-up sheet!
Don't forget to visit http://cscar.research.umich.edu/ to see what we're offering!
For questions or feedback, send an email to Marcio, Michael, or Alex.

Summary of this workshop

Use sensible identifier names

Use proper and consistent indentation

Avoid deep nesting

Limit line length

Comment and Document

Use Modularization (DRY)

Run tests

Run performance analysis

Use an Integrated Development Environment (IDE)

Use a Version Control System (VCS)

The KISS principle

References

https://www.continuum.io/anaconda-overview

https://code.tutsplus.com/tutorials/top-15-best-practices-for-writing-super-readable-code--net-8118

https://salfarisi25.wordpress.com/2010/12/22/advantage-and-disadvantage-of-using-ide/

https://www.python.org/dev/peps/pep-0008/

https://wiki.python.org/moin/

http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001745

https://davidhamann.de/2017/04/22/debugging-jupyter-notebooks/

https://powerfulpython.com/blog/automated-tests-types-for-python/

Import relevant packages for this workshop

import numpy as np

Use sensible identifier names

Identifier names must begin with either a letter or an underscore character. Following that, you can use an unlimited sequence of letters, numbers, or underscore characters

The letters can be uppercase or lowercase, and case is significant. In other words, the identifier **``Ax``** is not the same as the identifier **``aX``**

Numbers can be any of the digit characters between and including 0 and 9

Underscores can have a special meaning, so be careful when using them (this is especially true for leading, trailing, and double underscores).

Identifier names should be descriptive and contain one or more words

Even if your language permits Unicode (non-English) identifiers, *do not use them*!
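The rules above can be checked from Python itself. The following is a minimal sketch, assuming the built-in ``str.isidentifier()`` method and the standard ``keyword`` module; the candidate names and the helper ``is_usable_name`` are made up for illustration.

```python
import keyword

# Hypothetical candidate names, chosen only to illustrate the rules.
candidates = ["max_value", "_cache", "Ax", "aX", "2fast", "class"]

def is_usable_name(name):
    """Return True if name is a valid identifier and not a reserved word."""
    return name.isidentifier() and not keyword.iskeyword(name)

for name in candidates:
    print("%-10s %s" % (name, is_usable_name(name)))

# Case is significant: these are two different variables.
Ax = 1
aX = 2
print(Ax == aX)
```

Names that start with a digit and reserved words such as ``class`` are rejected, while ``Ax`` and ``aX`` are both accepted and refer to different variables.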

Consistent naming scheme

There are two popular options:

camelCase: First letter of each word is capitalized, except the first word, like: ``getMaxValue()``

underscores_in_names: Underscores between words, like: ``get_max_value()``

Important special case of camelCase: "Hungarian notation"

``nTotalItems``: this variable has a meaning of "size" (and is presumably integer)
``iCurrentItem``: this variable has a meaning of "index" (and is also integer)
``fReflectivity``: this variable is a floating point

The issue is not without controversies, and choosing proper names is a skill. See SE-Radio Episode 278: Peter Hilton on Naming (http://www.se-radio.net/2016/12/se-radio-episode-278-peter-hilton-on-naming/) for some discussion.

Important: whatever approach you choose, be consistent.

def get_min_value(lst):
    out=min(lst)
    return out

def get_max_value(lst):
    out=max(lst)
    return out

myList=np.array([0.3,4,1.5,5.7])
minValue=get_min_value(myList)
maxValue=get_max_value(myList)

print(minValue)
print(maxValue)

0.3
5.7

Consistent temporary names

Temporary variables can be as short as a single character. Be consistent with names that have the same kind of role.

Important: Be consistent

#i for loop counters
for i in range(0,2):    
    #j for nested loop counters
    for j in range(2,4):
        print("Value i,j: %d,%d" %(i,j))

Value i,j: 0,2
Value i,j: 0,3
Value i,j: 1,2
Value i,j: 1,3

#out for returning variables
def create_zeros(n):
    out=np.zeros(n)
    return out

def create_ones(n):
    out=np.ones(n)
    return out

nZeros=create_zeros(2)
nOnes=create_ones(3)

print(nZeros)
print(nOnes)

[ 0.  0.]
[ 1.  1.  1.]

Avoid "magic numbers"

def BusinessLogic(d):
    # ....
    return d*8*5*52 # WHAT ARE THOSE NUMBERS???

def BusinessLogic(work_days):
    #.....
    HOURS_PER_WORKDAY=8
    WORKDAYS_PER_WEEK=5
    WEEKS_PER_YEAR=52
    return HOURS_PER_WORKDAY*WORKDAYS_PER_WEEK*WEEKS_PER_YEAR*work_days

Use proper and consistent indentation

# ...
def factorial(l):
if l==0:
return 1
else:
return l*factorial(l-1)
    
factValue=factorial(5)
print(factValue)

  File "<ipython-input-14-d943fe242193>", line 3
    if l==0:
     ^
IndentationError: expected an indented block

...even if your language does not enforce indentation.

Compare:

#include <iostream>
int main(int argc, char** argv) { const char* name="world"; if (argc>1) { name=argv[1]; } std::cout << "Hello, " << name << "!" << std::endl; return 0; }

with:

#include <iostream>

int main(int argc, char** argv)
{
    const char* name="world";
    if (argc>1) {
        name=argv[1]; 
    } 

    std::cout << "Hello, "
              << name 
              << "!"
              << std::endl;

    return 0; 
}

Although Python does enforce some indentation, you are still responsible for readability:

# No: Further indentation required as indentation is not distinguishable.
def long_function_name(
    var_one, var_two, var_three,
    var_four):
    print(var_one)

var_one=1
var_two=2
var_three=3
var_four=4
    
# No: Arguments on first line forbidden when not using vertical alignment.
foo = long_function_name(var_one, var_two,
    var_three, var_four)

1

# Yes: More indentation included to distinguish this from the rest.
def long_function_name(
        var_one, var_two, var_three,
        var_four):
    print(var_one)

var_one=1
var_two=2
var_three=3
var_four=4
    
# Yes: Aligned with opening delimiter.
foo = long_function_name(var_one, var_two,
                         var_three, var_four)

# Yes: Hanging indents should add a level.
foo = long_function_name(
    var_one, var_two,
    var_three, var_four)

1
1

Avoid deep nesting

def do_stuff():
# ...
    if (logical_condition1):
        if (logical_condition2):
            if (logical_condition3):
                if (logical_condition4): 
                    return True 
                else:
                    return False
            else:
                return False
        else:
            return False
    else:
        return False

def do_stuff():
# ...
    if (logical_condition1 and logical_condition2 and logical_condition3 and logical_condition4):
        return True
    else:
        return False
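Another common way to flatten nested conditionals is to return as soon as one condition fails (often called guard clauses). The following is a minimal sketch; the function ``can_publish`` and its parameters are hypothetical and exist only for illustration.

```python
def can_publish(author, title, body, max_title_length=80):
    """Return True only if every publishing precondition holds."""
    # Each guard handles one failure case and exits immediately,
    # so the code never nests more than one level deep.
    if not author:
        return False
    if not title:
        return False
    if len(title) > max_title_length:
        return False
    if not body.strip():
        return False
    return True

print(can_publish("Ada", "Notes", "Some text"))
print(can_publish("", "Notes", "Some text"))
```

Compared with one long ``and`` expression, this form leaves room for a comment or a specific error message next to each condition.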

Limit line length

Limit all lines to a maximum of 79 characters

For flowing long blocks of text with fewer structural restrictions (docstrings or comments), the line length should be limited to 72 characters

The preferred way of wrapping long lines is by using Python's implied line continuation inside parentheses, brackets and braces. Long lines can be broken over multiple lines by wrapping expressions in parentheses. These should be used in preference to using a backslash for line continuation

# Backslashes may still be appropriate at times, e.g. in long with-statements:
with open('/path/to/some/file/you/want/to/read') as file_1, \
     open('/path/to/some/file/being/written', 'w') as file_2:
    file_2.write(file_1.read())

# No: operators sit far away from their operands
income = (gross_wages +
          taxable_interest +
          (dividends - qualified_dividends) -
          ira_deduction -
          student_loan_interest)

# Yes: easy to match operators with operands
income = (gross_wages
          + taxable_interest
          + (dividends - qualified_dividends)
          - ira_deduction
          - student_loan_interest)

Comment and Document

Comments are not parsed by the language interpreter/compiler. Therefore, the computer has no way of verifying that the comment makes sense, is correct, and is not out-of-date.

Comments that contradict the code are worse than no comments. Always make a priority of keeping the comments up-to-date when the code changes.

Comments should be complete sentences. If a comment is a phrase or sentence, its first word should be capitalized, unless it is an identifier that begins with a lower case letter (never alter the case of identifiers!).

If a comment is short, the period at the end can be omitted. Block comments generally consist of one or more paragraphs built out of complete sentences, and each sentence should end in a period.

Python coders from non-English speaking countries: please write your comments in English, unless you are 120% sure that the code will never be read by people who don't speak your language.

## Have you ever seen this?
def DoWork(arg):
    # ... do some work ...
    pass
    # CAUTION: This function should never return 0!
    return 0

Block Comments

Block comments generally apply to some (or all) code that follows them, and are indented to the same level as that code. Each line of a block comment starts with a # and a single space (unless it is indented text inside the comment).

Paragraphs inside a block comment are separated by a line containing a single # .
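A short illustration of these conventions follows. It is only a sketch: the function ``mean_of_positive`` and its behavior are invented for this example.

```python
def mean_of_positive(values):
    # Keep only the strictly positive entries. Zeros and negative
    # numbers are treated as missing measurements and are skipped.
    #
    # An empty selection has no meaningful mean, so return None in
    # that case instead of dividing by zero.
    positive = [v for v in values if v > 0]
    if not positive:
        return None
    return sum(positive) / len(positive)

print(mean_of_positive([2.0, -1.0, 4.0]))
```

The block comment is indented to the level of the code it describes, every line starts with ``# ``, and the two paragraphs are separated by a line containing a single ``#``.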

Inline Comments

x = x + 1                 # No: Increment x

x = x + 1                 # Yes: Compensate for border

Use modularization (DRY)

Anything that is repeated in two or more places is more difficult to maintain.

Every piece of data must have a single authoritative representation in the system

Physical constants ought to be defined exactly once.

Modularize code rather than copying and pasting.
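As a minimal sketch of these points, the constant below is defined once and reused by two functions. The function names are hypothetical; the value is the defined speed of light in metres per second.

```python
import math

# Defined exactly once; every function below refers to this name.
SPEED_OF_LIGHT_M_PER_S = 299792458

def lorentz_factor(speed_m_per_s):
    """Return the Lorentz factor for a speed given in m/s."""
    beta = speed_m_per_s / SPEED_OF_LIGHT_M_PER_S
    return 1.0 / math.sqrt(1.0 - beta ** 2)

def light_travel_time_s(distance_m):
    """Return the time in seconds light needs to cover distance_m."""
    return distance_m / SPEED_OF_LIGHT_M_PER_S

print(lorentz_factor(0.0))
print(light_travel_time_s(299792458))
```

If the value or its units ever need to change, there is a single line to edit instead of a number scattered through the code.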

Reinventing the wheel

"Reinventing the wheel" is usually counter-productive and inefficient.

"Stand on the shoulders of giants"! Consider using:

The standard library of your language. Chances are that the language authors have already thought of it!

Third-party libraries, especially those well known in your field (such as ``NumPy``). Chances are that you are not the first person facing this problem!

Specialized distributions (``Anaconda``, ``Intel distribution for Python``).

Note: While it is true that sometimes a sub-problem can be more naturally expressed and/or easier and faster solved in some other language, it is not for the faint of heart! We may cover multi-language development in a future workshop.

import fibonacci

fib=fibonacci.ifib

print(fibonacci.fib(3))
print(fib(3))

2
2

Run tests

An assertion is simply a statement that something holds true at a particular point in a program. Assertions can be used to ensure that inputs are valid, outputs are consistent, and so on.

The approach based on asserting input/output values is sometimes called *programming by contract*.

Assertions serve two purposes. First, they ensure that if something does go wrong, the program will halt immediately, which simplifies debugging.

Second, assertions are executable documentation, i.e., they explain the program as well as checking its behavior. This makes them more useful in many cases than comments since the reader can be sure that they are accurate and up to date.

As such, assertions help with the "impossible comment" from the example above:

def DoWork(arg):
    # ... do some work ...
    out = arg  # Placeholder for the real result
    assert out!=0, "This function should never return 0"
    return out
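The same idea extends to checking inputs as well as outputs. The following is a minimal sketch of assertions used as preconditions and a postcondition; the function ``normalize`` and its tolerance are assumptions made for this example.

```python
def normalize(weights):
    """Scale a list of non-negative weights so that they sum to 1."""
    # Preconditions: check the inputs before doing any work.
    assert len(weights) > 0, "weights must not be empty"
    assert all(w >= 0 for w in weights), "weights must be non-negative"
    total = sum(weights)
    assert total > 0, "at least one weight must be positive"

    out = [w / total for w in weights]

    # Postcondition: check the result before returning it.
    assert abs(sum(out) - 1.0) < 1e-9, "normalized weights must sum to 1"
    return out

print(normalize([1, 1, 2]))
```

Calling ``normalize([])`` stops immediately with an ``AssertionError`` that names the violated condition, instead of failing later with a less helpful division error.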

More involved example:

class MyDB:
    def __init__(self):
        self._id2name_map = {}
        self._name2id_map = {}
 
    def add(self, id, name):
        self._name2id_map[name] = id
        self._id2name_map[id] = name
 
    def by_name(self, name):
        return self._name2id_map[name]

def by_name(self, name):
    id = self._name2id_map[name]
    assert self._id2name_map[id] == name
    return id

class MyDB:
    ...
    def add(self, id, name):
        # Python 3 has no types.IntType or types.StringType; use isinstance.
        assert isinstance(id, int), "id is not an integer: %r" % id
        assert isinstance(name, str), "name is not a string: %r" % name

Fixing bugs that have been identified is often easier if you use a symbolic debugger to track them down.

import pdb

def add_to_life_universe_everything(x):
    answer = 42
    pdb.set_trace()
    answer += x    
    
    return answer

add_to_life_universe_everything(12)

> <ipython-input-3-61b8dfe8948f>(6)add_to_life_universe_everything()
-> answer += x
(Pdb) answer
42
(Pdb) x
12
(Pdb) n
> <ipython-input-3-61b8dfe8948f>(8)add_to_life_universe_everything()
-> return answer
(Pdb) answer
54
(Pdb) x
12
(Pdb) c

54

Run automated tests:

Ensure that a single unit of code returns the correct results (unit tests) and that pieces of code work correctly when combined (integration tests).

Creating and managing tests is easier if programmers use an off-the-shelf unit testing library to initialize inputs, run tests, and report their results in a uniform way.

One way of generating tests is to check to see whether the code matches the researcher's expectations of its behavior.

Another approach for generating tests is to turn bugs into test cases by writing tests that trigger a bug that has been found in the code.

Finally, there is a whole approach called **Test-Driven Development** (**TDD**). We will cover it in more detail in a future workshop.

# Expected behavior: wordcount('foo bar foo ') returns {'foo': 2, 'bar': 1}

import unittest

from wordcount import wordcount

class TestUnit(unittest.TestCase):
    def test_wordcount(self):
        self.assertDictEqual(
            {'foo': 2, 'bar': 1},
            wordcount('foo bar foo '))
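The test above imports ``wordcount`` from a module that is not shown here. The following self-contained sketch assumes a hypothetical implementation of ``wordcount`` (a stand-in, not the workshop's actual module) and runs the tests explicitly so that it also works inside a notebook.

```python
import unittest

def wordcount(text):
    """Hypothetical stand-in: count how often each word occurs in text."""
    counts = {}
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

class TestWordcount(unittest.TestCase):
    def test_repeated_words(self):
        self.assertDictEqual({'foo': 2, 'bar': 1}, wordcount('foo bar foo '))

    def test_empty_string(self):
        self.assertDictEqual({}, wordcount(''))

# Build and run the suite explicitly; in a notebook, unittest.main()
# would read the kernel's command-line arguments instead.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestWordcount)
result = unittest.TextTestRunner(verbosity=2).run(suite)
print(result.wasSuccessful())
```

When a bug is found later, adding one more ``test_...`` method that reproduces it turns the bug into a permanent test case.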

Run performance analysis

In Donald Knuth's paper "Structured Programming with go to Statements", he wrote:

"Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."

(Ref: http://wiki.c2.com/?PrematureOptimization)

Before optimizing your code, make sure it works correctly: it is better to have a correct program that runs slowly than a (subtly) incorrect program that runs fast!

Determine if it is actually worth speeding that piece of code up.

If it is, use a profiler to identify bottlenecks.

You can be more productive when you write code in the highest-level language possible.

def profileCommand(n):
    out=1
    for i in range(1, n+1):
        out=out*i
    return out

import cProfile
cProfile.run("result=profileCommand(2000)")
#print(result)

         4 function calls in 0.002 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.002    0.002    0.002    0.002 <ipython-input-3-82b9994e247b>:1(profileCommand)
        1    0.000    0.000    0.002    0.002 <string>:1(<module>)
        1    0.000    0.000    0.002    0.002 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

# For benchmark tests
from timeit import Timer

def test(x):
    "Stupid test function"
    L = [i for i in range(x)]

t = Timer("test(100)", "from __main__ import test")
print(t.timeit())

6.383903488000215
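Called without arguments, ``timeit()`` runs the statement one million times, so the figure above is the total for a million calls, not the time of a single call. The sketch below passes an explicit ``number`` and compares two ways of building the same list; the function names and the repetition counts are arbitrary choices for illustration.

```python
from timeit import repeat

def build_with_loop(n):
    """Build [0, 1, ..., n-1] by appending in a loop."""
    out = []
    for i in range(n):
        out.append(i)
    return out

def build_with_comprehension(n):
    """Build the same list with a list comprehension."""
    return [i for i in range(n)]

CALLS_PER_RUN = 10000
for func in (build_with_loop, build_with_comprehension):
    # Passing a callable avoids the "from __main__ import ..." setup string.
    # repeat() returns one total per run; the minimum is the least noisy.
    best = min(repeat(lambda: func(100), number=CALLS_PER_RUN, repeat=3))
    print("%-26s %.3f microseconds per call"
          % (func.__name__, best / CALLS_PER_RUN * 1e6))
```

The absolute numbers depend on the machine, so only the comparison between the two functions is meaningful.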

Use an Integrated Development Environment (IDE)

Core Features

Code completion
Resource management
Debugging tools
Compile and build

Advantages

Less time and effort
Enforce project or company standards
Project management

Disadvantages

Learning Curve
A sophisticated IDE may not be a good tool for beginning programmers
Will not fix bad code, practices, or design
Often heavy on resources
Enforced workflow may not be your preferred one

IDE Examples

Spyder for Python applications
RStudio for R applications
Eclipse for Java applications
Xcode for C++ applications
Many more...

Use a Version Control System (VCS)

When working with code and data, you need to keep track of the changes and collaborate on a program or dataset.

Typical solutions are to email software to colleagues or to copy successive versions of it to a shared folder, e.g., Dropbox (http://www.dropbox.com). Don't do this!

Use a VCS instead. A VCS stores snapshots of a project's files in a repository (or a set of repositories).

Crucially, if several people have edited files simultaneously, the VCS highlights the differences and requires them to resolve any conflicts before accepting the changes.

The VCS also stores the entire history of those files.

Many good VCSs are open source and freely available:

Git (http://git-scm.com)
Subversion (http://subversion.apache.org)
Mercurial (http://mercurial.selenic.com)

There are also free hosting services available:

GitHub (https://github.com)
GitLab (https://gitlab.com)
BitBucket (https://bitbucket.org)

The KISS principle

KISS stands for Keep It Simple and Stupid (figuratively speaking!)

Other conditions being equal, prefer simplicity to complexity.

Do not try to be too clever. Think of other people reading your code — or of yourself 2 weeks down the road!

If you think you need performance, first think again (see "Run performance analysis" above).

If you do need clever tricks to achieve performance, comment and document extensively!
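A small illustration of "clever" versus simple code follows. Both versions count the even numbers in a list; the data and the function name ``count_even`` are made up for this sketch.

```python
values = [3, 8, -2, 7, 10]

# Clever: relies on operator precedence (& binds tighter than not)
# and on True counting as 1 when summed.
clever_count = sum(not v & 1 for v in values)

# Simple: says what it does.
def count_even(numbers):
    """Return how many of the numbers are even."""
    count = 0
    for number in numbers:
        if number % 2 == 0:
            count += 1
    return count

simple_count = count_even(values)
print(clever_count, simple_count)
```

Both give the same answer, but the second version can be read and checked without recalling precedence rules.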
