Day 20

Tuesday, April 18, 2017

Deep learning (convolutional neural networks)

TensorFlow.

1-week delay while I prepare material.

Pandas

DataFrame

constructors

from clipboard

import pandas
df = pandas.read_clipboard()
df
   a  b  c
0  1  2  3
1  5  7  9
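
read_clipboard needs a live clipboard, so it is awkward to rerun from a script. As a rough stand-in, the sketch below feeds the same whitespace-separated text to read_csv. The text variable and the use of io.StringIO are illustrative assumptions, not part of the original session.

```python
import io
import pandas

# Hypothetical clipboard contents: a header row and two data rows.
text = "a b c\n1 2 3\n5 7 9\n"

# Parse the text as a whitespace-delimited table.
df = pandas.read_csv(io.StringIO(text), sep=r"\s+")
print(df)
```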

from Excel file

df = pandas.read_excel('foo.xlsx')

A dataset: airports

Airports (~47,000 records)

From csv file

df = pandas.read_csv('airports.csv')
df.head()
     id  ident           type                                name  latitude_deg  longitude_deg  elevation_ft  continent  iso_country  iso_region  municipality  scheduled_service  gps_code  iata_code  local_code  home_link  wikipedia_link  keywords
0  6523    00A       heliport                   Total Rf Heliport     40.070801     -74.933601            11        NaN           US       US-PA      Bensalem                 no       00A        NaN         00A        NaN             NaN       NaN
1  6524   00AK  small_airport                        Lowell Field     59.949200    -151.695999           450        NaN           US       US-AK  Anchor Point                 no      00AK        NaN        00AK        NaN             NaN       NaN
2  6525   00AL  small_airport                        Epps Airpark     34.864799     -86.770302           820        NaN           US       US-AL       Harvest                 no      00AL        NaN        00AL        NaN             NaN       NaN
3  6526   00AR       heliport  Newport Hospital & Clinic Heliport     35.608700     -91.254898           237        NaN           US       US-AR       Newport                 no      00AR        NaN        00AR        NaN             NaN       NaN
4  6527   00AZ  small_airport                      Cordes Airport     34.305599    -112.165001          3810        NaN           US       US-AZ        Cordes                 no      00AZ        NaN        00AZ        NaN             NaN       NaN

fillna()

Exercise: fix the airports.csv dataframe by replacing all the nulls with "NA". (Pandas reads the continent code NA, for North America, as "Not Available" :-/ )

df = df.fillna('NA')
df.head()
     id  ident           type                                name  latitude_deg  longitude_deg  elevation_ft  continent  iso_country  iso_region  municipality  scheduled_service  gps_code  iata_code  local_code  home_link  wikipedia_link  keywords
0  6523    00A       heliport                   Total Rf Heliport     40.070801     -74.933601            11         NA           US       US-PA      Bensalem                 no       00A         NA         00A         NA              NA        NA
1  6524   00AK  small_airport                        Lowell Field     59.949200    -151.695999           450         NA           US       US-AK  Anchor Point                 no      00AK         NA        00AK         NA              NA        NA
2  6525   00AL  small_airport                        Epps Airpark     34.864799     -86.770302           820         NA           US       US-AL       Harvest                 no      00AL         NA        00AL         NA              NA        NA
3  6526   00AR       heliport  Newport Hospital & Clinic Heliport     35.608700     -91.254898           237         NA           US       US-AR       Newport                 no      00AR         NA        00AR         NA              NA        NA
4  6527   00AZ  small_airport                      Cordes Airport     34.305599    -112.165001          3810         NA           US       US-AZ        Cordes                 no      00AZ         NA        00AZ         NA              NA        NA
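
Filling every null with "NA" also overwrites values that really are missing (iata_code, keywords, and so on). Two narrower alternatives are sketched below, under assumptions: the three-row CSV text is a made-up excerpt standing in for airports.csv, and df2 is just an illustrative name. The keep_default_na and na_values parameters are standard read_csv options.

```python
import io
import pandas

# Hypothetical excerpt standing in for airports.csv.
csv_text = (
    "ident,continent,iata_code\n"
    "00A,NA,\n"
    "EGLL,EU,LHR\n"
    "KJFK,NA,JFK\n"
)

# Default parsing: the string "NA" is read as a missing value.
df = pandas.read_csv(io.StringIO(csv_text))
print(df['continent'].isnull().sum())

# Option 1: repair only the affected column after loading.
df['continent'] = df['continent'].fillna('NA')

# Option 2: at read time, treat only empty fields as missing,
# so the literal string "NA" survives.
df2 = pandas.read_csv(io.StringIO(csv_text),
                      keep_default_na=False, na_values=[''])
print(df2)
```

Either way, the empty iata_code field stays null, which fillna('NA') on the whole frame would not preserve.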

From a numpy array:

import numpy
a = numpy.vander([3,4,5,6,7])
a
array([[  81,   27,    9,    3,    1],
       [ 256,   64,   16,    4,    1],
       [ 625,  125,   25,    5,    1],
       [1296,  216,   36,    6,    1],
       [2401,  343,   49,    7,    1]])
df = pandas.DataFrame(a)
df
      0    1   2  3  4
0    81   27   9  3  1
1   256   64  16  4  1
2   625  125  25  5  1
3  1296  216  36  6  1
4  2401  343  49  7  1

Columns and index

On construction of a dataframe, pandas will provide labels for the rows and columns, as seen above.

But we can change them if we like:

df.columns=['aa','b','c','d','z']
df
     aa    b   c  d  z
0    81   27   9  3  1
1   256   64  16  4  1
2   625  125  25  5  1
3  1296  216  36  6  1
4  2401  343  49  7  1

(Likewise, assign to df.index to relabel the rows.)
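
A minimal sketch of relabelling both axes. The row labels 10..50 and the names other and rejected are made up for illustration. It also shows that pandas rejects a label list of the wrong length.

```python
import numpy
import pandas

df = pandas.DataFrame(numpy.vander([3, 4, 5, 6, 7]))

# One label per column, one label per row.
df.columns = ['aa', 'b', 'c', 'd', 'z']
df.index = [10, 20, 30, 40, 50]   # hypothetical row labels

# A list of the wrong length raises ValueError.
other = df.copy()
rejected = False
try:
    other.columns = ['too', 'few']
except ValueError as err:
    rejected = True
    print('rejected:', err)
```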

Accessing columns

Access a column like this:

df['aa']
0      81
1     256
2     625
3    1296
4    2401
Name: aa, dtype: int64

Or a subset of the columns:

df[['aa','z']]
     aa  z
0    81  1
1   256  1
2   625  1
3  1296  1
4  2401  1
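
The two forms of column access return different types, which is easy to miss from the printed output. A short sketch (the names one, sub and same are illustrative):

```python
import numpy
import pandas

df = pandas.DataFrame(numpy.vander([3, 4, 5, 6, 7]),
                      columns=['aa', 'b', 'c', 'd', 'z'])

one = df['aa']          # single label   -> pandas.Series
sub = df[['aa', 'z']]   # list of labels -> pandas.DataFrame
same = df[['aa']]       # a one-element list still gives a DataFrame

print(type(one).__name__, type(sub).__name__, type(same).__name__)
```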

The index of the dataframe labels the rows:

df.index
Int64Index([0, 1, 2, 3, 4], dtype='int64')

We can set it to anything we like, as long as we supply one label per row. Pandas does not insist that the labels be unique, but unique labels keep row lookups unambiguous.

df.index=['Maggie','Edward','Sanjeevani','Michael','Robert']
df
              aa    b   c  d  z
Maggie        81   27   9  3  1
Edward       256   64  16  4  1
Sanjeevani   625  125  25  5  1
Michael     1296  216  36  6  1
Robert      2401  343  49  7  1
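
On the uniqueness point: pandas will accept repeated labels, but lookups then stop returning a single row. A small sketch, with made-up labels 'x' and 'y' on a 3-row frame:

```python
import numpy
import pandas

df = pandas.DataFrame(numpy.vander([3, 4, 5]), columns=['c', 'b', 'a'])
df.index = ['x', 'y', 'x']   # hypothetical labels; 'x' appears twice

print(df.index.is_unique)

dup = df.loc['x']   # two matching rows -> a DataFrame
one = df.loc['y']   # one matching row  -> a Series
```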

Accessing rows: loc, iloc, ix

loc, iloc, and ix are three accessors for rows.

loc provides access by labels:

df.loc['Sanjeevani']
aa    625
b     125
c      25
d       5
z       1
Name: Sanjeevani, dtype: int64

iloc provides access by row number, counting from 0:

df.iloc[2]
aa    625
b     125
c      25
d       5
z       1
Name: Sanjeevani, dtype: int64

Labels could be integers:

df.index=['Maggie',4,'Sanjeevani','Michael','Robert']
df
              aa    b   c  d  z
Maggie        81   27   9  3  1
4            256   64  16  4  1
Sanjeevani   625  125  25  5  1
Michael     1296  216  36  6  1
Robert      2401  343  49  7  1

Then

df.loc[4]  # gives row with label 4
aa    256
b      64
c      16
d       4
z       1
Name: 4, dtype: int64
df.iloc[4]  # gives row #4
aa    2401
b      343
c       49
d        7
z        1
Name: Robert, dtype: int64

ix provides access by either label or row number. If a row has an integer label i and we ask ix for row i, do we get the row with label i, or row number i?

df.ix[4]
aa    2401
b      343
c       49
d        7
z        1
Name: Robert, dtype: int64

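Answer: row number i. That holds here because the index is not purely integer, so ix falls back to position; with an all-integer index, ix would treat i as a label. (ix was deprecated in pandas 0.20 and later removed, so prefer loc or iloc.)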
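
Since ix is ambiguous, the two explicit accessors are the safer way to say which one is meant. A sketch that rebuilds the frame above with its mixed index (by_label and by_position are illustrative names):

```python
import numpy
import pandas

df = pandas.DataFrame(numpy.vander([3, 4, 5, 6, 7]),
                      columns=['aa', 'b', 'c', 'd', 'z'],
                      index=['Maggie', 4, 'Sanjeevani', 'Michael', 'Robert'])

by_label = df.loc[4]       # the row whose label is 4 (the second row)
by_position = df.iloc[4]   # the row at position 4 (the last row)

print(by_label.name, by_position.name)
```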

These indexers also support slicing. Beware that label slices with loc, unlike every other start:stop slicing in Python, include "stop" (position slices with iloc exclude it as usual):

df.loc['Maggie':'Michael']
              aa    b   c  d  z
Maggie        81   27   9  3  1
4            256   64  16  4  1
Sanjeevani   625  125  25  5  1
Michael     1296  216  36  6  1
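
To see the difference side by side, here is a sketch using an all-string index (an assumption made to keep the example simple; the names by_label and by_position are illustrative). Both slices run from Edward to Michael, but only the label slice keeps the endpoint.

```python
import numpy
import pandas

df = pandas.DataFrame(numpy.vander([3, 4, 5, 6, 7]),
                      columns=['aa', 'b', 'c', 'd', 'z'],
                      index=['Maggie', 'Edward', 'Sanjeevani', 'Michael', 'Robert'])

# Label slice: both endpoints are included.
by_label = df.loc['Edward':'Michael']

# Position slice: stop is excluded, as with ordinary Python slices.
by_position = df.iloc[1:3]

print(list(by_label.index))
print(list(by_position.index))
```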