pandas

  

Why Use Pandas?

  • Fast and efficient for manipulating and analyzing data.
  • Data from different file formats can be loaded easily.
  • Flexible reshaping and pivoting of data sets.
  • Provides time-series functionality.

    pip install pandas

Pandas Data Structures

Pandas generally provides two data structures for manipulating data. They are: 

  • Series
  • DataFrame

Series

Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, python objects, etc.). The axis labels are collectively called indexes.
Pandas Series is nothing but a column in an Excel sheet. Labels need not be unique but must be a hashable type. The object supports both integer and label-based indexing and provides a host of methods for performing operations involving the index.


Note: For more information, refer to Python | Pandas Series 

Creating a Series

In the real world, a Pandas Series is usually created by loading a dataset from existing storage such as a SQL database, a CSV file, or an Excel file. A Series can also be created from lists, dictionaries, scalar values, etc.

Example:

import pandas as pd 
import numpy as np
  
# Creating empty series 
ser = pd.Series() 
print("Pandas Series: ", ser) 
  
# simple array 
data = np.array(['g', 'e', 'e', 'k', 's']) 
    
ser = pd.Series(data) 
print("Pandas Series:\n", ser)

Output:

Pandas Series: Series([], dtype: float64)
Pandas Series:
0 g
1 e
2 e
3 k
4 s
dtype: object

Note: For more information, refer to Creating a Pandas Series

DataFrame

Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns. A Pandas DataFrame consists of three principal components: the data, the rows, and the columns.


Note: For more information, refer to Python | Pandas DataFrame 

Creating Data Frame

In the real world, a Pandas DataFrame is usually created by loading a dataset from existing storage such as a SQL database, a CSV file, or an Excel file. A DataFrame can also be created from lists, dictionaries, a list of dictionaries, etc.

Example:

import pandas as pd 
    
# Calling DataFrame constructor 
df = pd.DataFrame() 
print(df)
  
# list of strings 
lst = ['Geeks', 'For', 'Geeks', 'is', 'portal', 'for', 'Geeks'] 
    
# Calling DataFrame constructor on list 
df = pd.DataFrame(lst) 
print(df)

Output:

Empty DataFrame
Columns: []
Index: []
0
0 Geeks
1 For
2 Geeks
3 is
4 portal
5 for
6 Geeks

Creating a Pandas DataFrame

# import pandas as pd
import pandas as pd
 
# Calling DataFrame constructor
df = pd.DataFrame()
 
print(df)

Output : 
 

Empty DataFrame
Columns: []
Index: []

Creating a DataFrame using a list: 
A DataFrame can be created using a single list or a list of lists (a list-of-lists sketch follows the example below). 
 

# import pandas as pd
import pandas as pd
 
# list of strings
lst = ['Geeks', 'For', 'Geeks', 'is',
            'portal', 'for', 'Geeks']
 
# Calling DataFrame constructor on list
df = pd.DataFrame(lst)
print(df)

Output: 

        0
0   Geeks
1     For
2   Geeks
3      is
4  portal
5     for
6   Geeks

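The list-of-lists case is not shown above; here is a minimal sketch under that assumption (the column names are illustrative):

# import pandas as pd
import pandas as pd

# list of lists: each inner list becomes one row
lst = [['Geeks', 'For'], ['Geeks', 'is'], ['portal', 'for']]

# Calling DataFrame constructor on a list of lists;
# the column names are illustrative
df = pd.DataFrame(lst, columns=['Word1', 'Word2'])
print(df)

Each inner list becomes one row, and the columns parameter supplies the column labels.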
  
Creating a DataFrame from a dict of ndarrays/lists: 
To create a DataFrame from a dict of ndarrays/lists, all the ndarrays must be of the same length. If an index is passed, then the length of the index should be equal to the length of the arrays. If no index is passed, then by default the index will be range(n), where n is the array length (a sketch with an explicit index follows the example below). 
 

# Python code demonstrating how to create a
# DataFrame from a dict of ndarrays / lists.
# By default, the index is range(n).
 
import pandas as pd
 
# initialise data of lists.
data = {'Name':['Tom', 'nick', 'krish', 'jack'], 'Age':[20, 21, 19, 18]}
 
# Create DataFrame
df = pd.DataFrame(data)
 
# Print the output.
print(df)

Output: 

    Name  Age
0    Tom   20
1   nick   21
2  krish   19
3   jack   18

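The example above uses the default range(n) index; here is a minimal sketch of the same dictionary with an explicit index passed (the labels are illustrative):

import pandas as pd

# initialise data of lists
data = {'Name': ['Tom', 'nick', 'krish', 'jack'],
        'Age': [20, 21, 19, 18]}

# the index list must have the same length as the arrays
df = pd.DataFrame(data, index=['a', 'b', 'c', 'd'])
print(df)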
  
Create a pandas DataFrame from lists using a dictionary: 
Creating a pandas DataFrame from lists using a dictionary can be achieved in different ways. We can create the DataFrame by passing a dictionary of lists to pandas.DataFrame; with this method the dictionary of lists is transformed into a DataFrame.
 

# importing pandas as pd
import pandas as pd
 
# dictionary of lists
dict = {'name':["aparna", "pankaj", "sudhir", "Geeku"],
        'degree': ["MBA", "BCA", "M.Tech", "MBA"],
        'score':[90, 40, 80, 98]}
 
df = pd.DataFrame(dict)
 
print(df)

Output: 

     name  degree  score
0  aparna     MBA     90
1  pankaj     BCA     40
2  sudhir  M.Tech     80
3   Geeku     MBA     98
Python Pandas Series

Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, python objects, etc.).

Pandas Series Examples

# import pandas as pd
import pandas as pd
 
# simple array
data = [1, 2, 3, 4]
 
ser = pd.Series(data)
print(ser)

Output :

0    1
1    2
2    3
3    4
dtype: int64

The axis labels are collectively called index. Pandas Series is nothing but a column in an excel sheet.
Labels need not be unique but must be a hashable type. The object supports both integer and label-based indexing and provides a host of methods for performing operations involving the index.

Python Pandas Series

Below is a brief overview of the basic operations that can be performed on a Pandas Series:

Creating a Pandas Series

In the real world, a Pandas Series is usually created by loading a dataset from existing storage such as a SQL database, a CSV file, or an Excel file. A Series can also be created from lists, dictionaries, scalar values, etc. A Series can be created in different ways; here are some of them:

Creating a series from an array: In order to create a series from a NumPy array, we have to import the numpy module and use the array() function.

# import pandas as pd
import pandas as pd
 
# import numpy as np
import numpy as np
 
# simple array
data = np.array(['g','e','e','k','s'])
 
ser = pd.Series(data)
print(ser)

Output :

0    g
1    e
2    e
3    k
4    s
dtype: object
 
Creating a series from a list:
In order to create a series from a list, we first create the list and then create the series from it.

import pandas as pd
 
# a simple list
list = ['g', 'e', 'e', 'k', 's']
  
# create series form a list
ser = pd.Series(list)
print(ser)

Output :

0    g
1    e
2    e
3    k
4    s
dtype: object
For more details refer to Creating a Pandas Series
 

Accessing element of Series

There are two ways through which we can access the elements of a Series:

  • Accessing Element from Series with Position
  • Accessing Element Using Label (index)

Accessing an element from a Series by position: In order to access a Series element, refer to its index number. Use the indexing operator [ ] to access an element in a series; the index must be an integer. In order to access multiple elements from a series, we use the slice operation.

Accessing first 5 elements of Series

# import pandas and numpy
import pandas as pd
import numpy as np
 
# creating simple array
data = np.array(['g','e','e','k','s','f', 'o','r','g','e','e','k','s'])
ser = pd.Series(data)
  
  
# retrieve the first five elements
print(ser[:5])

Output :

0    g
1    e
2    e
3    k
4    s
dtype: object
Accessing an element using label (index) :
In order to access an element from a Series, we refer to it by its index label. A Series is like a fixed-size dictionary in that you can get and set values by index label.

Accessing a single element using index label

# import pandas and numpy
import pandas as pd
import numpy as np
 
# creating simple array
data = np.array(['g','e','e','k','s','f', 'o','r','g','e','e','k','s'])
ser = pd.Series(data,index=[10,11,12,13,14,15,16,17,18,19,20,21,22])
  
  
# accessing an element using its index label
print(ser[16])

Output :

o

For more details refer to Accessing element of Series
 

Indexing and Selecting Data in Series

Indexing in pandas means simply selecting particular data from a Series. Indexing could mean selecting all of the data or only some of it. Indexing is also known as subset selection.

Indexing a Series using the indexing operator [ ] :
The indexing operator refers to the square brackets following an object. The .loc and .iloc indexers also use the indexing operator to make selections. In this section, the indexing operator refers to df[ ].

# importing pandas module 
import pandas as pd 
     
# making data frame 
df = pd.read_csv("nba.csv"
   
ser = pd.Series(df['Name'])
data = ser.head(10)
data


Now we access the element of series using index operator [ ].

# using indexing operator
data[3:6]

Output :

 
Indexing a Series using .loc[ ] :
This indexer selects data by referring to the explicit index. The df.loc indexer selects data in a different way than the plain indexing operator and can select subsets of data.

# importing pandas module 
import pandas as pd 
     
# making data frame 
df = pd.read_csv("nba.csv"
   
ser = pd.Series(df['Name'])
data = ser.head(10)
data


Now we access the element of series using .loc[] function.

# using .loc[] function
data.loc[3:6]

Output :

 
Indexing a Series using .iloc[ ] :
This function allows us to retrieve data by position. In order to do that, we’ll need to specify the positions of the data that we want. The df.iloc indexer is very similar to df.loc but only uses integer locations to make its selections.

# importing pandas module 
import pandas as pd 
     
# making data frame 
df = pd.read_csv("nba.csv"
   
ser = pd.Series(df['Name'])
data = ser.head(10)
data


Now we access the element of Series using .iloc[] function.

# using .iloc[] function
data.iloc[3:6]

Output :

 

Binary Operation on Series

We can perform binary operations on Series such as addition, subtraction, and many others. In order to perform a binary operation on Series, we use functions like .add(), .sub(), etc.
Code #1:

# importing pandas module 
import pandas as pd 
 
# creating a series
data = pd.Series([5, 2, 3,7], index=['a', 'b', 'c', 'd'])
 
# creating a series
data1 = pd.Series([1, 6, 4, 9], index=['a', 'b', 'd', 'e'])
 
print(data, "\n\n", data1)


Now we add two series using .add() function.

# adding two series using
# .add
data.add(data1, fill_value=0)

Output :

a     6.0
b     8.0
c     3.0
d    11.0
e     9.0
dtype: float64
Code #2:

# importing pandas module 
import pandas as pd 
 
# creating a series
data = pd.Series([5, 2, 3,7], index=['a', 'b', 'c', 'd'])
 
# creating a series
data1 = pd.Series([1, 6, 4, 9], index=['a', 'b', 'd', 'e'])
 
print(data, "\n\n", data1)


Now we subtract two series using .sub function.

# subtracting two series using
# .sub
data.sub(data1, fill_value=0)

Output :

a    4.0
b   -4.0
c    3.0
d    3.0
e   -9.0
dtype: float64
For more details refer to Binary operation methods on series
 

Conversion Operation on Series

In conversion operations we perform various operations such as changing the datatype of a Series, converting a Series to a list, etc. In order to perform conversion operations we have various functions which help, such as .astype(), .tolist(), etc.
Code #1:

# Python program using astype
# to convert a datatype of series
 
# importing pandas module 
import pandas as pd
   
# reading csv file from url 
data = pd.read_csv("nba.csv")
    
# dropping null value columns to avoid errors
data.dropna(inplace = True)
   
# storing dtype before converting
before = data.dtypes
   
# converting dtypes using astype
data["Salary"]= data["Salary"].astype(int)
data["Number"]= data["Number"].astype(str)
   
# storing dtype after converting
after = data.dtypes
   
# printing to compare
print("BEFORE CONVERSION\n", before, "\n")
print("AFTER CONVERSION\n", after, "\n")

Output :

 
Code #2:

# Python program converting
# a series into list
 
# importing pandas module 
import pandas as pd 
   
# importing regex module
import re
     
# making data frame 
data = pd.read_csv("nba.csv")
     
# removing null values to avoid errors 
data.dropna(inplace = True)
   
# storing dtype before operation
dtype_before = type(data["Salary"])
   
# converting to list
salary_list = data["Salary"].tolist()
   
# storing dtype after operation
dtype_after = type(salary_list)
   
# printing dtype
print("Data type before converting = {}\nData type after converting = {}"
      .format(dtype_before, dtype_after))
   
# displaying list
salary_list

Output :

Binary operation methods on series:

FUNCTION    DESCRIPTION
add()       Method is used to add series or list like objects with same length to the caller series
sub()       Method is used to subtract series or list like objects with same length from the caller series
mul()       Method is used to multiply series or list like objects with same length with the caller series
div()       Method is used to divide series or list like objects with same length by the caller series
sum()       Returns the sum of the values for the requested axis
prod()      Returns the product of the values for the requested axis
mean()      Returns the mean of the values for the requested axis
pow()       Method is used to put each element of passed series as exponential power of caller series and return the results
abs()       Method is used to get the absolute numeric value of each element in Series/DataFrame
cov()       Method is used to find covariance of two series
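A couple of these methods in action, as a minimal sketch (the values and labels are illustrative):

import pandas as pd

s1 = pd.Series([2, 4, 6, 8], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([1, 2, 3, 5], index=['a', 'b', 'c', 'e'])

# element-wise multiplication; labels missing on one side use fill_value
print(s1.mul(s2, fill_value=1))

# element-wise "greater than or equal" comparison of the aligned elements
print(s1.ge(s2, fill_value=0))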

 
Pandas series method:

FUNCTION         DESCRIPTION
Series()         A pandas Series can be created with the Series() constructor method. This constructor method accepts a variety of inputs
combine_first()  Method is used to combine two series into one
count()          Returns number of non-NA/null observations in the Series
size()           Returns the number of elements in the underlying data
name()           Method allows to give a name to a Series object, i.e. to the column
is_unique()      Method returns boolean if values in the object are unique
idxmax()         Method to extract the index positions of the highest values in a Series
idxmin()         Method to extract the index positions of the lowest values in a Series
sort_values()    Method is called on a Series to sort the values in ascending or descending order
sort_index()     Method is called on a pandas Series to sort it by the index instead of its values
head()           Method is used to return a specified number of rows from the beginning of a Series. The method returns a brand new Series
tail()           Method is used to return a specified number of rows from the end of a Series. The method returns a brand new Series
le()             Used to compare every element of the caller series with the passed series. It returns True for every element which is Less than or Equal to the element in the passed series
ne()             Used to compare every element of the caller series with the passed series. It returns True for every element which is Not Equal to the element in the passed series
ge()             Used to compare every element of the caller series with the passed series. It returns True for every element which is Greater than or Equal to the element in the passed series
eq()             Used to compare every element of the caller series with the passed series. It returns True for every element which is Equal to the element in the passed series
gt()             Used to compare two series and return a Boolean value for every respective element
lt()             Used to compare two series and return a Boolean value for every respective element
clip()           Used to clip values below and above the passed least and max values
clip_lower()     Used to clip values below a passed least value
clip_upper()     Used to clip values above a passed maximum value
astype()         Method is used to change the data type of a series
tolist()         Method is used to convert a series to a list
get()            Method is called on a Series to extract values from a Series. This is alternative syntax to the traditional bracket syntax
unique()         Pandas unique() is used to see the unique values in a particular column
nunique()        Pandas nunique() is used to get a count of unique values
value_counts()   Method to count the number of times each unique value occurs in a Series
factorize()      Method helps to get the numeric representation of an array by identifying distinct values
map()            Method to tie together the values from one object to another
between()        Pandas between() method is used on a series to check which values lie between the first and second argument
apply()          Method is called and fed a Python function as an argument to use the function on every Series value. This method is helpful for executing custom operations that are not included in pandas or numpy
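A few of the methods listed above in action, as a minimal sketch (the data is illustrative):

import pandas as pd

ser = pd.Series([3, 1, 2, 3, 4, 3], index=['a', 'b', 'c', 'd', 'e', 'f'])

# sort by value and by index
print(ser.sort_values())
print(ser.sort_index())

# count how often each unique value occurs
print(ser.value_counts())

# apply a custom function to every element
print(ser.apply(lambda x: x ** 2))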

# import pandas as pd
import pandas as pd
 
# Creating empty series
ser = pd.Series()
 
print(ser)

Output :

Series([], dtype: float64)

By default, the data type of Series is float. 
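Note that newer versions of pandas warn about (and change) this default when an empty Series is created without a dtype, so it is safer to pass the dtype explicitly; a minimal sketch:

import pandas as pd

# requesting a dtype explicitly avoids relying on the default
ser = pd.Series(dtype='float64')
print(ser)   # Series([], dtype: float64)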

Creating a series from an array: In order to create a series from a NumPy array, we have to import the numpy module and use the array() function. 

# import pandas as pd
import pandas as pd
 
# import numpy as np
import numpy as np
 
# simple array
data = np.array(['g', 'e', 'e', 'k', 's'])
 
ser = pd.Series(data)
print(ser)

Output:

0    g
1    e
2    e
3    k
4    s
dtype: object

By default, the index of the series starts from 0 and goes up to the length of the series minus 1.

Creating a series from an array with an index: In order to create a series with an explicitly provided index instead of the default one, we pass a list of labels to the index parameter with the same number of elements as the array. 

# import pandas as pd
import pandas as pd
 
# import numpy as np
import numpy as np
 
# simple array
data = np.array(['g', 'e', 'e', 'k', 's'])
 
# providing an index
ser = pd.Series(data, index=[10, 11, 12, 13, 14])
print(ser)

Output:

10    g
11    e
12    e
13    k
14    s
dtype: object

Creating a series from a list: In order to create a series from a list, we first create the list and then create the series from it. 

import pandas as pd
 
# a simple list
list = ['g', 'e', 'e', 'k', 's']
 
# create series form a list
ser = pd.Series(list)
print(ser)

Output :

0    g
1    e
2    e
3    k
4    s
dtype: object

Creating a series from a dictionary: In order to create a series from a dictionary, we first create the dictionary and then make a series from it. The dictionary keys are used to construct the index of the Series.

import pandas as pd
 
# a simple dictionary
dict = {'Geeks': 10,
        'for': 20,
        'geeks': 30}
 
# create series from dictionary
ser = pd.Series(dict)
 
print(ser)

Output:

Geeks    10
for      20
geeks    30
dtype: int64

Creating a series from Scalar value: In order to create a series from scalar value, an index must be provided. The scalar value will be repeated to match the length of the index. 

import pandas as pd
 
import numpy as np
 
# giving a scalar value with index
ser = pd.Series(10, index=[0, 1, 2, 3, 4, 5])
 
print(ser)

Output:

0    10
1    10
2    10
3    10
4    10
5    10
dtype: int64

Creating a series using NumPy functions: In order to create a series using NumPy functions, we can use different functions such as numpy.linspace() and numpy.random.randn().

# import pandas and numpy
import pandas as pd
import numpy as np
 
# series with numpy linspace()
ser1 = pd.Series(np.linspace(3, 33, 3))
print(ser1)
 
# series with numpy linspace() 
ser2 = pd.Series(np.linspace(1, 100, 10)) 
print("\n", ser2)

Output:

0     3.0
1    18.0
2    33.0
dtype: float64

0      1.0
1     12.0
2     23.0
3     34.0
4     45.0
5     56.0
6     67.0
7     78.0
8     89.0
9    100.0
dtype: float64

Creating a Series using range function:

# code
import pandas as pd
ser=pd.Series(range(10))
print(ser)
Output:
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64

Creating a Series using for loop and list comprehension:

import pandas as pd
ser=pd.Series(range(1,20,3), index=[x for x in 'abcdefg'])
print(ser)

Output:

a     1
b     4
c     7
d    10
e    13
f    16
g    19
dtype: int64

Creating a Series using mathematical expressions:

import pandas as pd
import numpy as np
ser=np.arange(10,15)
serobj=pd.Series(data=ser*5,index=ser)
print(serobj)

Output:

10    50
11    55
12    60
13    65
14    70
dtype: int32
Pandas Dataframe/Series.head() method

Python is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric Python packages. Pandas is one of those packages and makes importing and analyzing data much easier.

Pandas head() method is used to return top n (5 by default) rows of a data frame or series.

Syntax: Dataframe.head(n=5)

Parameters:
n: integer value, number of rows to be returned

Return type: Dataframe with top n rows

To download the dataset used in the following examples, click here.
The data frame used in the following examples contains data of some NBA players.

Example #1:

In this example, top 5 rows of data frame are returned and stored in a new variable. No parameter is passed to .head() method since by default it is 5.

# importing pandas module
import pandas as pd
  
# making data frame
data = pd.read_csv("nba.csv")
  
# calling head() method 
# storing in new variable
data_top = data.head()
  
# display
data_top

Output:
As shown in the output image, it can be seen that the index of returned rows is ranging from 0 to 4. Hence, top 5 rows were returned.

 
Example #2: Calling on Series with n parameter()

In this example, the .head() method is called on series with custom input of n parameter to return top 9 rows of the series.

# importing pandas module
import pandas as pd
  
# making data frame
data = pd.read_csv("nba.csv")
  
# number of rows to return
n = 9
  
# creating series
series = data["Name"]
  
# returning top n rows
top = series.head(n = n)
  
# display
top

Output:
As shown in the output image, top 9 rows ranging from 0 to 8th index position were returned.

Pandas Dataframe/Series.tail() method

Pandas tail() method is used to return the bottom n (5 by default) rows of a data frame or series.

Example #1:
In this example, bottom 5 rows of the data frame are returned and stored in a new variable. No parameter is passed to the .tail() method since by default it is 5.

# importing pandas module
import pandas as pd
  
# making data frame
data = pd.read_csv("nba.csv")
  
# calling tail() method 
# storing in new variable
data_bottom = data.tail()
  
# display
data_bottom

Output:
As shown in the output image, it can be seen that the index of returned rows is ranging from 453 to 457. Hence, last 5 rows were returned.

 
Example #2: Calling on Series with n parameter()

In this example, the .tail() method is called on series with custom input of n parameter to return bottom 12 rows of the series.

# importing pandas module
import pandas as pd
  
# making data frame
data = pd.read_csv("nba.csv")
  
# number of rows to return
n = 12
  
# creating series
series = data["Salary"]
  
# returning bottom n rows
bottom = series.tail(n = n)
  
# display
bottom

Output:
As shown in the output image, the bottom 12 rows, ranging from the 446th to the 457th index position of the Salary column, were returned.


Pandas DataFrame describe()

Pandas describe() is used to view some basic statistical details like percentile, mean, std, etc. of a data frame or a series of numeric values. When this method is applied to a series of strings, it returns a different output which is shown in the examples below.

Syntax: DataFrame.describe(percentiles=None, include=None, exclude=None) 

Parameters: 

  • percentile: list like data type of numbers between 0-1 to return the respective percentile 
  • include: List of data types to be included while describing dataframe. Default is None 
  • exclude: List of data types to be Excluded while describing dataframe. Default is None 

Return type: Statistical summary of data frame.

Dataset used

To download the data set used in the following example, click here. In the following examples, the data frame used contains data from some NBA players. Let’s have a look at the data by importing it.

import pandas as pd
# reading and printing csv file
data = pd.read_csv('nba.csv')
print(data.head())

Output: 

            Name            Team  Number Position   Age Height  Weight            College     Salary
0  Avery Bradley  Boston Celtics     0.0       PG  25.0    6-2   180.0              Texas  7730337.0  
1    Jae Crowder  Boston Celtics    99.0       SF  25.0    6-6   235.0          Marquette  6796117.0 
2   John Holland  Boston Celtics    30.0       SG  27.0    6-5   205.0  Boston University        NaN 
3    R.J. Hunter  Boston Celtics    28.0       SG  22.0    6-5   185.0      Georgia State  1148640.0 
4  Jonas Jerebko  Boston Celtics     8.0       PF  29.0   6-10   231.0                NaN  5000000.0 

Using Describe function in Pandas

We can easily learn about several statistical measures, including mean, median, standard deviation, quartiles, and more, by using describe() on a DataFrame.

print(data.describe())

Output:
           Number         Age      Weight        Salary
count  457.000000  457.000000  457.000000  4.460000e+02
mean    17.678337   26.938731  221.522976  4.842684e+06
std     15.966090    4.404016   26.368343  5.229238e+06
min      0.000000   19.000000  161.000000  3.088800e+04
25%      5.000000   24.000000  200.000000  1.044792e+06
50%     13.000000   26.000000  220.000000  2.839073e+06
75%     25.000000   30.000000  240.000000  6.500000e+06
max     99.000000   40.000000  307.000000  2.500000e+07

Explanation of the description of numerical columns:

count: Total Number of Non-Empty values
mean: Mean of the column values
std: Standard Deviation of the column values
min: Minimum value from the column
25%: 25th percentile
50%: 50th percentile (median)
75%: 75th percentile
max: Maximum value from the column
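Each of these statistics can also be computed directly on a column; a minimal sketch, assuming the same nba.csv data frame:

import pandas as pd

data = pd.read_csv("nba.csv")

# the same statistics that describe() reports, computed one at a time
print(data["Age"].count())          # count of non-empty values
print(data["Age"].mean())           # mean
print(data["Age"].std())            # standard deviation
print(data["Age"].min())            # minimum
print(data["Age"].quantile(0.25))   # 25th percentile
print(data["Age"].max())            # maximum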

Pandas describe() behavior for mixed dtypes

In this example, the data frame is described with ['object', 'float', 'int'] passed to the include parameter so that object columns are described alongside the numeric ones. [.20, .40, .60, .80] is passed to the percentiles parameter to view the respective percentiles of the numeric series. 

import pandas as pd
data = pd.read_csv('nba.csv')
 
# removing null values to avoid errors
data.dropna(inplace=True)
 
# percentile list
perc = [.20, .40, .60, .80]
 
# list of dtypes to include
include = ['object', 'float', 'int']
 
# calling describe method
desc = data.describe(percentiles=perc, include=include)
 
# display
desc

Output:

                 Name                  Team      Number Position         Age  \
count             364                   364  364.000000      364  364.000000   
unique            364                    30         NaN        5         NaN   
top     Avery Bradley  New Orleans Pelicans         NaN       SG         NaN   
freq                1                    16         NaN       87         NaN   
mean              NaN                   NaN   16.829670      NaN   26.615385   
std               NaN                   NaN   14.994162      NaN    4.233591   
min               NaN                   NaN    0.000000      NaN   19.000000   
20%               NaN                   NaN    4.000000      NaN   23.000000   
40%               NaN                   NaN    9.000000      NaN   25.000000   
50%               NaN                   NaN   12.000000      NaN   26.000000   
60%               NaN                   NaN   17.000000      NaN   27.000000   
80%               NaN                   NaN   30.000000      NaN   30.000000   
max               NaN                   NaN   99.000000      NaN   40.000000  
       Height      Weight   College        Salary  
count     364  364.000000       364  3.640000e+02  
unique     17         NaN       115           NaN  
top       6-9         NaN  Kentucky           NaN  
freq       49         NaN        22           NaN  
mean      NaN  219.785714       NaN  4.620311e+06  
std       NaN   24.793099       NaN  5.119716e+06  
min       NaN  161.000000       NaN  5.572200e+04  
20%       NaN  195.000000       NaN  9.472760e+05  
40%       NaN  212.000000       NaN  1.638754e+06  
50%       NaN  220.000000       NaN  2.515440e+06  
60%       NaN  228.000000       NaN  3.429934e+06  
80%       NaN  242.400000       NaN  7.838202e+06  
max       NaN  279.000000       NaN  2.287500e+07  

As shown in the output image, the Statistical description of the Dataframe was returned with the respectively passed percentiles. For the columns with strings, NaN was returned for numeric operations. 

Describing series of strings 

In this example, the describe() method is called on the Name column to see its behavior with the object data type. 

# importing pandas module
import pandas as pd
 
# making data frame
data = pd.read_csv("nba.csv")
 
# removing null values to avoid errors
data.dropna(inplace=True)
 
# calling describe method
desc = data["Name"].describe()
 
# display
desc

Output: As shown in the output image, the behavior of describe() is different with a series of strings. Different stats were returned like count of values, unique values, top, and frequency of occurrence in this case. 

count               457
unique              457
top       Avery Bradley
freq                  1
Name: Name, dtype: object

Dealing with Rows and Columns in Pandas DataFrame

Column Selection:
In order to select a column in a Pandas DataFrame, we can access the columns by calling them by their column names.

# Import pandas package
import pandas as pd
  
# Define a dictionary containing employee data
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Age':[27, 24, 22, 32],
        'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
        'Qualification':['Msc', 'MA', 'MCA', 'Phd']}
  
# Convert the dictionary into DataFrame 
df = pd.DataFrame(data)
  
# select two columns
print(df[['Name', 'Qualification']])

Output:

     Name Qualification
0     Jai           Msc
1  Princi            MA
2  Gaurav           MCA
3    Anuj           Phd
Column Addition:
In order to add a column to a Pandas DataFrame, we can declare a new list as a column and add it to an existing DataFrame.

# Import pandas package 
import pandas as pd
  
# Define a dictionary containing Students data
data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Height': [5.1, 6.2, 5.1, 5.2],
        'Qualification': ['Msc', 'MA', 'Msc', 'Msc']}
  
# Convert the dictionary into DataFrame
df = pd.DataFrame(data)
  
# Declare a list that is to be converted into a column
address = ['Delhi', 'Bangalore', 'Chennai', 'Patna']
  
# Using 'Address' as the column name
# and equating it to the list
df['Address'] = address
  
# Observe the result
print(df)

Output:

     Name  Height Qualification    Address
0     Jai     5.1           Msc      Delhi
1  Princi     6.2            MA  Bangalore
2  Gaurav     5.1           Msc    Chennai
3    Anuj     5.2           Msc      Patna
 
Column Deletion:
In order to delete a column in a Pandas DataFrame, we can use the drop() method. Columns are deleted by dropping them using their column names.

# importing pandas module
import pandas as pd
  
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name" )
  
# dropping passed columns
data.drop(["Team", "Weight"], axis = 1, inplace = True)
  
# display
print(data)

Output:
As shown in the output images, the new output doesn’t have the passed columns. Those values were dropped since axis was set equal to 1 and the changes were made in the original data frame since inplace was True.

Data Frame before Dropping Columns-


Data Frame after Dropping Columns-


Dealing with Rows:

In order to deal with rows, we can perform basic operations on rows like selecting, deleting, adding and renaming.
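Selection, addition and deletion are shown below; renaming is not, so here is a minimal sketch using DataFrame.rename (the new labels are illustrative):

# importing pandas module
import pandas as pd

# making data frame from csv file
data = pd.read_csv("nba.csv", index_col="Name")

# renaming a column label and a row label; the new names are illustrative
data = data.rename(columns={"Team": "Franchise"},
                   index={"Avery Bradley": "A. Bradley"})

# display
print(data.head())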

Row Selection:
Pandas provides a unique method to retrieve rows from a DataFrame. The DataFrame.loc[] method is used to retrieve rows from a Pandas DataFrame. Rows can also be selected by passing an integer location to the iloc[] indexer.

# importing pandas package
import pandas as pd
  
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
  
# retrieving row by loc method
first = data.loc["Avery Bradley"]
second = data.loc["R.J. Hunter"]
  
  
print(first, "\n\n\n", second)

Output:
As shown in the output image, two series were returned since there was only one parameter both of the times.


Row Addition:
In order to add a row to a Pandas DataFrame, we can concatenate the old DataFrame with the new one.

# importing pandas module 
import pandas as pd 
    
# making data frame 
df = pd.read_csv("nba.csv", index_col ="Name"
  
df.head(10)
  
new_row = pd.DataFrame({'Name':'Geeks', 'Team':'Boston', 'Number':3,
                        'Position':'PG', 'Age':33, 'Height':'6-2',
                        'Weight':189, 'College':'MIT', 'Salary':99999},
                                                            index =[0])
# simply concatenate both dataframes
df = pd.concat([new_row, df]).reset_index(drop = True)
df.head(5)

Output:

Data Frame before Adding Row-


Data Frame after Adding Row-

Row Deletion:
In order to delete a row in a Pandas DataFrame, we can use the drop() method. Rows are deleted by dropping them by index label.

# importing pandas module
import pandas as pd
  
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name" )
  
# dropping passed values
data.drop(["Avery Bradley", "John Holland", "R.J. Hunter",
                            "R.J. Hunter"], inplace = True)
  
# display
data

Output:
As shown in the output images, the new output doesn’t have the passed values. Those values were dropped and the changes were made in the original data frame since inplace was True.

Data Frame before Dropping values-


Data Frame after Dropping values-

Pandas Extracting rows using .loc[]

Python is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric Python packages. Pandas is one of those packages and makes importing and analyzing data much easier.

Pandas provide a unique method to retrieve rows from a Data frame. DataFrame.loc[] method is a method that takes only index labels and returns row or dataframe if the index label exists in the caller data frame.

Syntax: pandas.DataFrame.loc[]

Parameters:
Index label: String or list of string of index label of rows

Return type: Data frame or Series depending on parameters

To download the CSV used in code, click here.

Example #1: Extracting single Row

In this example, Name column is made as the index column and then two single rows are extracted one by one in the form of series using index label of rows.

# importing pandas package
import pandas as pd
  
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
  
# retrieving row by loc method
first = data.loc["Avery Bradley"]
second = data.loc["R.J. Hunter"]
  
  
print(first, "\n\n\n", second)

Output:
As shown in the output image, two series were returned since there was only one parameter both of the times.

 
Example #2: Multiple parameters

In this example, Name column is made as the index column and then two single rows are extracted at the same time by passing a list as parameter.

# importing pandas package
import pandas as pd
  
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
  
# retrieving rows by loc method
rows = data.loc[["Avery Bradley", "R.J. Hunter"]]
  
# checking data type of rows
print(type(rows))
  
# display
rows

Output:
As shown in the output image, this time the data type of returned value is a data frame. Both of the rows were extracted and displayed like a new data frame.

 

Example #3: Extracting multiple rows with same index

In this example, Team name is made as the index column and one team name is passed to .loc method to check if all values with same team name have been returned or not.

# importing pandas package
import pandas as pd
  
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Team")
  
# retrieving rows by loc method
rows = data.loc["Utah Jazz"]
  
# checking data type of rows
print(type(rows))
  
# display
rows

Output:
As shown in the output image, All rows with team name “Utah Jazz” were returned in the form of a data frame.

 

Example #4: Extracting rows between two index labels

In this example, two index label of rows are passed and all the rows that fall between those two index label have been returned (Both index labels Inclusive).

# importing pandas package
import pandas as pd
  
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
  
# retrieving rows by loc method
rows = data.loc["Avery Bradley":"Isaiah Thomas"]
  
# checking data type of rows
print(type(rows))
  
# display
rows

Output:
As shown in the output image, all the rows that fall between passed two index labels are returned in the form of a data frame.

Extracting rows using Pandas .iloc[]

The difference between the loc and iloc functions is that loc selects rows using row labels (e.g. a row label such as a team name), whereas iloc selects rows using their integer positions.
The Pandas library provides a unique method to retrieve rows from a DataFrame. Dataframe.iloc[] method is used when the index label of a data frame is something other than numeric series of 0, 1, 2, 3….n or in case the user doesn’t know the index label. Rows can be extracted using an imaginary index position that isn’t visible in the Dataframe.

Pandas .iloc[] Syntax

Syntax: pandas.DataFrame.iloc[] 

Parameters: 

  • Index Position: Index position of rows in integer or list of integer. 

Return type: Data frame or Series depending on parameters

Dataset Used:

To download the CSV used in the code, click here. 

Example 1: Extracting a single row and comparing with .loc[]. In this example, the same index number row is extracted by both the .iloc[] and .loc[] methods and compared. Since the index column is numeric by default, the index labels are also integers. 

# importing pandas package
import pandas as pd
 
# making data frame from csv file
data = pd.read_csv("nba.csv")
 
# retrieving rows by loc method
row1 = data.loc[3]
 
# retrieving rows by iloc method
row2 = data.iloc[3]
 
# checking if values are equal
row1 == row2

Output: 

As shown in the output image, the results returned by both methods are the same.

 

   

Example 2: Extracting multiple rows with index In this example, multiple rows are extracted, first by passing a list and then by passing integers to extract rows between that range. After that, both values are compared. 

# importing pandas package
import pandas as pd
 
# making data frame from csv file
data = pd.read_csv("nba.csv")
 
# retrieving rows by iloc method (list of positions)
row1 = data.iloc[[4, 5, 6, 7]]
 
# retrieving rows by iloc method (slice of positions)
row2 = data.iloc[4:8]
 
# comparing values
row1 == row2

Output: 

As shown in the output image, the results returned by both methods are the same. All values are True except values in the college column since those were NaN values. 

 

Suppose we want to select columns Age, College and Salary for only the rows with the labels Amir Johnson and Terry Rozier (a combined .loc sketch for the three selections in this section follows below).

Our final DataFrame would look like this:

Selecting some rows and all columns

Let’s say we want to select rows Amir Johnson, Terry Rozier and John Holland with all columns in a dataframe.

Our final DataFrame would look like this:

Selecting some columns and all rows

Let’s say we want to select columns Age, Height and Salary with all rows in a dataframe.

Our final DataFrame would look like this:
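The rendered result tables are not reproduced here; the following minimal sketch produces each of the three selections described above (assuming the same nba.csv frame indexed by Name):

# importing pandas package
import pandas as pd

# making data frame from csv file
data = pd.read_csv("nba.csv", index_col="Name")

# some rows and some columns
print(data.loc[["Amir Johnson", "Terry Rozier"], ["Age", "College", "Salary"]])

# some rows and all columns
print(data.loc[["Amir Johnson", "Terry Rozier", "John Holland"], :])

# all rows and some columns
print(data.loc[:, ["Age", "Height", "Salary"]])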

 

Pandas Indexing using [ ], .loc[ ], .iloc[ ], .ix[ ]

There are a lot of ways to pull the elements, rows, and columns from a DataFrame. There are some indexing method in Pandas which help in getting an element from a DataFrame. These indexing methods appear very similar but behave very differently. Pandas support four types of Multi-axes indexing they are:

  • Dataframe[ ] : This function is also known as the indexing operator
  • Dataframe.loc[ ] : This function is used for label-based indexing
  • Dataframe.iloc[ ] : This function is used for position-based (integer) indexing
  • Dataframe.ix[ ] : This function is used for both label- and integer-based indexing

Collectively, they are called the indexers. These are by far the most common ways to index data. These are four function which help in getting the elements, rows, and columns from a DataFrame.
 
Indexing a Dataframe using the indexing operator [ ] :
The indexing operator refers to the square brackets following an object. The .loc and .iloc indexers also use the indexing operator to make selections. In this section, the indexing operator refers to df[ ].

Selecting a single column

In order to select a single column, we simply put the name of the column in between the brackets.

# importing pandas package
import pandas as pd
  
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
  
# retrieving columns by indexing operator
first = data["Age"]
  
  
  
print(first)

Output:

Selecting multiple columns

In order to select multiple columns, we have to pass a list of columns in an indexing operator.

# importing pandas package
import pandas as pd
  
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
  
# retrieving multiple columns by indexing operator
first = data[["Age", "College", "Salary"]]
  
  
  
first

Output:

 
Indexing a DataFrame using .loc[ ] :
This function selects data by the label of the rows and columns. The df.loc indexer selects data in a different way than just the indexing operator. It can select subsets of rows or columns. It can also simultaneously select subsets of rows and columns.

Selecting a single row

In order to select a single row using .loc[], we put a single row label in a .loc function.

# importing pandas package
import pandas as pd
  
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
  
# retrieving row by loc method
first = data.loc["Avery Bradley"]
second = data.loc["R.J. Hunter"]
  
  
print(first, "\n\n\n", second)

Output:
As shown in the output image, two series were returned since there was only one parameter both of the times.

 

Selecting multiple rows

In order to select multiple rows, we put all the row labels in a list and pass that to .loc function.

import pandas as pd
  
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
  
# retrieving multiple rows by loc method
first = data.loc[["Avery Bradley", "R.J. Hunter"]]
  
  
  
print(first)

Output:

 

Selecting two rows and three columns

In order to select two rows and three columns, we put the two row labels we want in one list and the three column labels in another list, like this:

Dataframe.loc[["row1", "row2"], ["column1", "column2", "column3"]]

import pandas as pd
  
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
  
# retrieving two rows and three columns by loc method
first = data.loc[["Avery Bradley", "R.J. Hunter"],
                   ["Team", "Number", "Position"]]
  
  
  
print(first)

Output:

 

Selecting all of the rows and some columns

In order to select all of the rows and some columns, we use single colon [:] to select all of rows and list of some columns which we want to select like this:

Dataframe.loc[:, ["column1", "column2", "column3"]]

import pandas as pd
  
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
  
# retrieving all rows and some columns by loc method
first = data.loc[:, ["Team", "Number", "Position"]]
  
  
  
print(first)

Output:

 
Indexing a DataFrame using .iloc[ ] :
This function allows us to retrieve rows and columns by position. In order to do that, we’ll need to specify the positions of the rows that we want, and the positions of the columns that we want as well. The df.iloc indexer is very similar to df.loc but only uses integer locations to make its selections.

Selecting a single row

In order to select a single row using .iloc[], we can pass a single integer to .iloc[] function.

import pandas as pd
  
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
  
  
# retrieving rows by iloc method 
row2 = data.iloc[3]
  
  
  
print(row2)

Output:

 

Selecting multiple rows

In order to select multiple rows, we can pass a list of integer to .iloc[] function.

import pandas as pd
  
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
  
  
# retrieving multiple rows by iloc method 
row2 = data.iloc[[3, 5, 7]]
  
  
  
row2

Output:

 

Selecting two rows and two columns

In order to select two rows and two columns, we create a list of 2 integer for rows and list of 2 integer for columns then pass to a .iloc[] function.

import pandas as pd
  
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
  
  
# retrieving two rows and two columns by iloc method 
row2 = data.iloc[[3, 4], [1, 2]]
  
  
  
print(row2)

Output:

 

Selecting all the rows and some columns

In order to select all rows and some columns, we use single colon [:] to select all of rows and for columns we make a list of integer then pass to a .iloc[] function.

import pandas as pd
  
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
  
  
# retrieving all rows and some columns by iloc method 
row2 = data.iloc[:, [1, 2]]
  
  
  
print(row2)

Output:

 
Indexing a Dataframe using .ix[ ] :
Early in the development of pandas, there existed another indexer, ix. This indexer was capable of selecting both by label and by integer location. While it was versatile, it caused lots of confusion because it is not explicit: sometimes integers can also be labels for rows or columns, so there were instances where it was ambiguous. Generally, ix is label based and acts just like the .loc indexer. However, .ix also supports integer-type selections (as in .iloc) when passed an integer; this only works when the index of the DataFrame is not integer based. .ix will accept any of the inputs of .loc and .iloc.
Note: The .ix indexer was deprecated and has been removed in recent versions of pandas, so the examples below only run on old versions; see the .loc/.iloc sketch at the end of this section.

Selecting a single row using .ix[] as .loc[]

In order to select a single row, we put a single row label in the .ix function. This behaves like .loc[] when we pass a row label as the argument.

# importing pandas package
import pandas as pd
   
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
   
# retrieving row by ix method
first = data.ix["Avery Bradley"]
  
   
   
print(first)
  

Output:

Selecting a single row using .ix[] as .iloc[]

In order to select a single row, we can pass a single integer to the .ix[] function. This behaves like .iloc[] when we pass an integer to .ix[].

# importing pandas package
import pandas as pd
   
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
   
# retrieving row by ix method
first = data.ix[1]
  
   
   
print(first)

Output:

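Since .ix is gone from current pandas releases, the same selections are written today with .loc (by label) and .iloc (by position); a minimal sketch:

# importing pandas package
import pandas as pd

# making data frame from csv file
data = pd.read_csv("nba.csv", index_col="Name")

# label-based selection (replaces data.ix["Avery Bradley"])
first = data.loc["Avery Bradley"]

# position-based selection (replaces data.ix[1])
second = data.iloc[1]

print(first, "\n\n\n", second)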

Pandas DataFrame pop() method

The pop() method removes the specified column from the DataFrame in place and returns it as a Series.

Syntax: DataFrame.pop(item)
Parameters: 
item: Column name to be popped in string
Return type: Popped column in form of Pandas Series

To download the CSV used in code, click here.
Example #1: 
In this example, a column has been popped and returned by the function. The new data frame is then compared with the old one.
 

# importing pandas package
import pandas as pd
 
# making data frame from csv file
data = pd.read_csv("nba.csv")
 
# storing data in new var
popped_col = data.pop("Team")
 
# display
data

Output: 
In the output images, the data frames are compared before and after using .pop(). As shown in second image, Team column has been popped out.
Dataframe before using .pop() 
 

Dataframe after using .pop() 
 

  
Example #2: Popping and pushing in other data frame 
In this example, a copy of data frame is made and the popped column is inserted at the end of the other data frame.
 

# importing pandas package
import pandas as pd
 
# making data frame from csv file
data = pd.read_csv("nba.csv")
 
# creating independent copy of data frame
new = data.copy()
 
# storing data in new var
popped_col = data.pop("Name")
 
# creating new col and passing popped col
new["New Col"]= popped_col
 
# display
new

Output: 
As shown in the output image, the new data frame has the New Col at the end, which is nothing but the Name column that was popped out earlier. 
 

Pandas DataFrame get() method

The get() method returns the column(s) for the given key from the DataFrame; if the key is not found, the default value is returned.

Syntax: DataFrame.get(key, default=None)

Parameters :
key : object

Returns : value : same type as the items contained in the object

For link to CSV file Used in Code, click here

Example #1: Use get() function to extract a column out of the dataframe

# importing pandas as pd
import pandas as pd
  
# Creating the dataframe 
df = pd.read_csv("nba.csv")
  
# Print the dataframe
df

Output :

Now apply the get() function. We are going to extract the “Salary” column from the dataframe.

# applying get() function 
df.get("Salary")

Output :

Notice, the output is not a dataframe but a pandas series object.
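A quick way to confirm this, as a minimal sketch:

import pandas as pd

df = pd.read_csv("nba.csv")

# get() with a single key returns a Series, not a DataFrame
print(type(df.get("Salary")))   # <class 'pandas.core.series.Series'>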
 
Example #2: Use get() function to extract multiple columns at a time in random order

# importing pandas as pd
import pandas as pd
  
# Creating the dataframe 
df = pd.read_csv("nba.csv")
  
df.get(["Salary", "Team", "Name"])

Output :

Pandas DataFrame isin() method

The isin() method checks whether each element is contained in the passed values and is useful for filtering data frames.

Syntax: DataFrame.isin(values)

Parameters:
values: iterable, Series, List, Tuple, DataFrame or dictionary to check in the caller Series/DataFrame.

Return Type: DataFrame of booleans with the same dimensions as the caller.

To download the CSV file used, Click Here.

Example #1: Single Parameter filtering
In the following Example, Rows are checked and a boolean series is returned which is True wherever Gender=”Male”. Then the series is passed to data frame to see new filtered data frame.

# importing pandas package
import pandas as pd
  
# making data frame from csv file
data = pd.read_csv("employees.csv")
  
# creating a bool series from isin()
new = data["Gender"].isin(["Male"])
  
# displaying data with gender = male only
data[new]

Output:
As shown in the output image, only Rows having gender = “Male” are returned.

 

Example #2: Multiple parameter Filtering
In the following example, the data frame is filtered on the basis of Gender as well as Team. Rows having Gender=”Female” and Team=”Engineering”, “Distribution” or “Finance” are returned.

# importing pandas package
import pandas as pd
  
# making data frame from csv file
data = pd.read_csv("employees.csv")
  
# creating filters of bool series from isin()
filter1 = data["Gender"].isin(["Female"])
filter2 = data["Team"].isin(["Engineering", "Distribution", "Finance" ])
  
# displaying data with both filter applied and mandatory 
data[filter1 & filter2]

Output:
As shown in the output image, Rows having Gender=”Female” and Team=”Engineering”, “Distribution” or “Finance” are returned.
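As listed in the parameters above, values can also be a dictionary mapping column names to the entries to look for. The following is only a minimal sketch of that form, reusing the employees.csv columns from these examples (the chosen values are illustrative):

# importing pandas package
import pandas as pd
  
# making data frame from csv file
data = pd.read_csv("employees.csv")
  
# dictionary form of isin(): each key is a column name,
# each value is a list of allowed entries for that column
mask = data.isin({"Gender": ["Female"], "Team": ["Finance"]})
  
# keep only the rows where both columns matched
data[mask["Gender"] & mask["Team"]]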

DataFrame.where()

Syntax:
DataFrame.where(cond, other=nan, inplace=False, axis=None, level=None, errors='raise', try_cast=False, raise_on_error=None)
 

Parameters:

cond: One or more conditions to check the data frame against.
other: Replace rows which don't satisfy the condition with a user-defined object; the default is NaN.
inplace: Boolean value; makes changes in the data frame itself if True.
axis: Axis to check (rows or columns).

For link to the CSV file used, Click here.

Example #1: Single Condition operation

In this example, rows having a particular Team name will be shown and the rest will be replaced by NaN using the .where() method.

# importing pandas package
import pandas as pd
  
# making data frame from csv file
data = pd.read_csv("nba.csv")
  
# sorting dataframe
data.sort_values("Team", inplace = True)
  
# making boolean series for a team name
filter = data["Team"]=="Atlanta Hawks"
  
# filtering data
data.where(filter, inplace = True)
  
# display
data

Output:

As shown in the output image, every row which doesn’t have Team = Atlanta Hawks is replaced with NaN.

 

Example #2: Multi-condition Operations

Data is filtered on the basis of both Team and Age. Only the rows having Team name “Atlanta Hawks” and players having age above 24 will be displayed.

# importing pandas package
import pandas as pd
  
# making data frame from csv file
data = pd.read_csv("nba.csv")
  
# sorting dataframe
data.sort_values("Team", inplace = True)
  
# making boolean series for a team name
filter1 = data["Team"]=="Atlanta Hawks"
  
# making boolean series for age
filter2 = data["Age"]>24
  
# filtering data on basis of both filters
data.where(filter1 & filter2, inplace = True)
  
# display
data

Output:
As shown in the output image, Only the rows having Team name “Atlanta Hawks” and players having age above 24 are displayed.
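The other parameter listed in the syntax above controls what the non-matching entries are replaced with. A minimal sketch, assuming the same nba.csv file and replacing non-matching entries with 0 instead of the default NaN:

# importing pandas package
import pandas as pd
  
# making data frame from csv file
data = pd.read_csv("nba.csv")
  
# making boolean series for a team name
filter = data["Team"] == "Atlanta Hawks"
  
# rows that fail the condition get 0 instead of NaN
result = data.where(filter, other = 0)
  
# display
result.head(10)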


Creating

A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns. A DataFrame can hold any number of datasets arranged in rows and columns, and we can perform many operations on these datasets, such as arithmetic operations, column/row selection, column/row addition, etc. 
 

Pandas DataFrame can be created in multiple ways. Let’s discuss different ways to create a DataFrame one by one.
Creating an empty dataframe : 
The most basic DataFrame that can be created is an empty DataFrame, made just by calling the DataFrame constructor. 
 

# import pandas as pd
import pandas as pd
 
# Calling DataFrame constructor
df = pd.DataFrame()
 
print(df)

Output : 
 

Empty DataFrame
Columns: []
Index: []

  
Creating a dataframe using List: 
DataFrame can be created using a single list or a list of lists. 
 

# import pandas as pd
import pandas as pd
 
# list of strings
lst = ['Geeks', 'For', 'Geeks', 'is',
            'portal', 'for', 'Geeks']
 
# Calling DataFrame constructor on list
df = pd.DataFrame(lst)
print(df)

Output: 
 

  
Creating DataFrame from dict of ndarray/lists: 
To create a DataFrame from a dict of ndarrays/lists, all the ndarrays must be of the same length. If an index is passed, then the length of the index should equal the length of the arrays. If no index is passed, then by default the index will be range(n), where n is the array length. 
 

# Python code to demonstrate creating a
# DataFrame from a dict of ndarrays / lists
# By default, the index is range(n).
 
import pandas as pd
 
# initialise data of lists.
data = {'Name':['Tom', 'nick', 'krish', 'jack'], 'Age':[20, 21, 19, 18]}
 
# Create DataFrame
df = pd.DataFrame(data)
 
# Print the output.
print(df)

Output: 
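An explicit index can also be passed, as noted above, provided its length matches the arrays; below is a minimal sketch with illustrative row labels:

import pandas as pd
 
# initialise data of lists.
data = {'Name':['Tom', 'nick', 'krish', 'jack'], 'Age':[20, 21, 19, 18]}
 
# the index list must have the same length as the arrays
df = pd.DataFrame(data, index = ['r1', 'r2', 'r3', 'r4'])
 
# Print the output.
print(df)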
 

  
Create pandas dataframe from lists using dictionary: 
Creating a pandas DataFrame from lists using a dictionary can be achieved in different ways. We can do it with pandas.DataFrame: with this method we transform a dictionary of lists into a DataFrame.
 

# importing pandas as pd
import pandas as pd
 
# dictionary of lists
dict = {'name':["aparna", "pankaj", "sudhir", "Geeku"],
        'degree': ["MBA", "BCA", "M.Tech", "MBA"],
        'score':[90, 40, 80, 98]}
 
df = pd.DataFrame(dict)
 
print(df)

Output: 
 

 

Series

Series Examples

# import pandas as pd
import pandas as pd
 
# simple array
data = [1, 2, 3, 4]
 
ser = pd.Series(data)
print(ser)

Output :

0    1
1    2
2    3
3    4
dtype: int64

The axis labels are collectively called index. Pandas Series is nothing but a column in an excel sheet.
Labels need not be unique but must be a hashable type. The object supports both integer and label-based indexing and provides a host of methods for performing operations involving the index.
Pandas Series

Python Pandas Series

We will get a brief insight into the basic operations which can be performed on a Pandas Series:

Creating a Pandas Series

In the real world, a Pandas Series will be created by loading datasets from existing storage, which can be a SQL database, a CSV file, or an Excel file. A Pandas Series can also be created from lists, dictionaries, scalar values, etc. A series can be created in different ways; here are some of them:

Creating a series from array: In order to create a series from an array, we have to import the NumPy module and use its array() function.

# import pandas as pd
import pandas as pd
 
# import numpy as np
import numpy as np
 
# simple array
data = np.array(['g','e','e','k','s'])
 
ser = pd.Series(data)
print(ser)

Output :


 
Creating a series from Lists:
In order to create a series from a list, we first create the list and then pass it to the Series constructor.

import pandas as pd
 
# a simple list
list = ['g', 'e', 'e', 'k', 's']
  
# create series form a list
ser = pd.Series(list)
print(ser)

Output :


For more details refer to Creating a Pandas Series
 

Accessing element of Series

There are two ways through which we can access the elements of a series:

  • Accessing Element from Series with Position
  • Accessing Element Using Label (index)

Accessing Element from Series with Position : In order to access a series element, refer to its index number. Use the index operator [ ] to access an element in a series; the index must be an integer. In order to access multiple elements from a series, we use the slice operation.

Accessing first 5 elements of Series

# import pandas and numpy
import pandas as pd
import numpy as np
 
# creating simple array
data = np.array(['g','e','e','k','s','f', 'o','r','g','e','e','k','s'])
ser = pd.Series(data)
  
  
# retrieve the first 5 elements
print(ser[:5])

Output :

 
Accessing Element Using Label (index) :
In order to access an element from a series, we refer to its index label. A Series is like a fixed-size dictionary in that you can get and set values by index label.

Accessing a single element using index label

# import pandas and numpy
import pandas as pd
import numpy as np
 
# creating simple array
data = np.array(['g','e','e','k','s','f', 'o','r','g','e','e','k','s'])
ser = pd.Series(data,index=[10,11,12,13,14,15,16,17,18,19,20,21,22])
  
  
# accessing an element using its index label
print(ser[16])

Output :

o
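Because a Series behaves like a fixed-size dictionary, a value can also be written through the same index label used for reading. A minimal sketch (the replacement value is illustrative):

# import pandas and numpy
import pandas as pd
import numpy as np
 
# creating simple array with an explicit index
data = np.array(['g','e','e','k','s'])
ser = pd.Series(data, index=[10, 11, 12, 13, 14])
 
# setting a value through its index label
ser[12] = 'E'
print(ser)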

For more details refer to Accessing element of Series
 

Indexing and Selecting Data in Series

Indexing in pandas simply means selecting particular data from a Series. Indexing could mean selecting all the data or some of the data from particular columns. Indexing is also known as subset selection.

Indexing a Series using indexing operator [] :
The indexing operator refers to the square brackets following an object. The .loc and .iloc indexers also use the indexing operator to make selections. In this section, the indexing operator refers to df[ ].

# importing pandas module 
import pandas as pd 
     
# making data frame 
df = pd.read_csv("nba.csv"
   
ser = pd.Series(df['Name'])
data = ser.head(10)
data


Now we access the element of series using index operator [ ].

# using indexing operator
data[3:6]

Output :

 
Indexing a Series using .loc[ ] :
This function selects data by referring to the explicit index. The df.loc indexer selects data in a different way than the plain indexing operator. It can select subsets of data.

# importing pandas module 
import pandas as pd 
     
# making data frame 
df = pd.read_csv("nba.csv"
   
ser = pd.Series(df['Name'])
data = ser.head(10)
data


Now we access the element of series using .loc[] function.

# using .loc[] function
data.loc[3:6]

Output :

 
Indexing a Series using .iloc[ ] :
This function allows us to retrieve data by position. In order to do that, we’ll need to specify the positions of the data that we want. The df.iloc indexer is very similar to df.loc but only uses integer locations to make its selections.

# importing pandas module 
import pandas as pd 
     
# making data frame 
df = pd.read_csv("nba.csv"
   
ser = pd.Series(df['Name'])
data = ser.head(10)
data


Now we access the element of Series using .iloc[] function.

# using .iloc[] function
data.iloc[3:6]

Output :

 

Binary Operation on Series

We can perform binary operations on series, like addition, subtraction and many others. In order to perform a binary operation on series we have to use functions like .add(), .sub(), etc.
Code #1:

# importing pandas module 
import pandas as pd 
 
# creating a series
data = pd.Series([5, 2, 3,7], index=['a', 'b', 'c', 'd'])
 
# creating a series
data1 = pd.Series([1, 6, 4, 9], index=['a', 'b', 'd', 'e'])
 
print(data, "\n\n", data1)


Now we add two series using .add() function.

# adding two series using
# .add
data.add(data1, fill_value=0)

Output :

Code #2:

# importing pandas module 
import pandas as pd 
 
# creating a series
data = pd.Series([5, 2, 3,7], index=['a', 'b', 'c', 'd'])
 
# creating a series
data1 = pd.Series([1, 6, 4, 9], index=['a', 'b', 'd', 'e'])
 
print(data, "\n\n", data1)


Now we subtract two series using .sub function.

# subtracting two series using
# .sub
data.sub(data1, fill_value=0)

Output :

For more details refer to Binary operation methods on series
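Beyond .add() and .sub(), the same pattern applies to the other binary methods mentioned above, for example .mul() and .div(). A minimal sketch using the same two series:

# importing pandas module 
import pandas as pd 
 
# creating two series with partly overlapping indexes
data = pd.Series([5, 2, 3, 7], index=['a', 'b', 'c', 'd'])
data1 = pd.Series([1, 6, 4, 9], index=['a', 'b', 'd', 'e'])
 
# element-wise multiplication and division, aligned on the index;
# labels missing on one side are filled with 1 before the operation
print(data.mul(data1, fill_value=1))
print(data.div(data1, fill_value=1))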
 

Conversion Operation on Series

In conversion operations we perform various operations like changing the datatype of a series, converting a series to a list, etc. To perform conversion operations we have various helper functions such as .astype(), .tolist(), etc.
Code #1:

# Python program using astype
# to convert a datatype of series
 
# importing pandas module 
import pandas as pd
   
# reading csv file from url 
data = pd.read_csv("nba.csv")
    
# dropping null value columns to avoid errors
data.dropna(inplace = True)
   
# storing dtype before converting
before = data.dtypes
   
# converting dtypes using astype
data["Salary"]= data["Salary"].astype(int)
data["Number"]= data["Number"].astype(str)
   
# storing dtype after converting
after = data.dtypes
   
# printing to compare
print("BEFORE CONVERSION\n", before, "\n")
print("AFTER CONVERSION\n", after, "\n")

Output :

 
Code #2:

# Python program converting
# a series into list
 
# importing pandas module 
import pandas as pd 
   
# importing regex module
import re
     
# making data frame 
data = pd.read_csv("nba.csv"
     
# removing null values to avoid errors 
data.dropna(inplace = True
   
# storing dtype before operation
dtype_before = type(data["Salary"])
   
# converting to list
salary_list = data["Salary"].tolist()
   
# storing dtype after operation
dtype_after = type(salary_list)
   
# printing dtype
print("Data type before converting = {}\nData type after converting = {}"
      .format(dtype_before, dtype_after))
   
# displaying list
salary_list

Output :

Python3

# import pandas as pd
import pandas as pd
 
# Creating empty series
ser = pd.Series()
 
print(ser)

Output :

Series([], dtype: float64)

By default, the data type of Series is float. 

 Creating a series from array: In order to create a series from NumPy array, we have to import numpy module and have to use array() function. 

Python3

# import pandas as pd
import pandas as pd
 
# import numpy as np
import numpy as np
 
# simple array
data = np.array(['g', 'e', 'e', 'k', 's'])
 
ser = pd.Series(data)
print(ser)

Output:

   

By default, the index of the series ranges from 0 to the length of the series - 1.

Creating a series from array with an index: In order to create a series with an explicitly provided index instead of the default one, we pass a list with the same number of elements as the array to the index parameter. 

Python3

# import pandas as pd
import pandas as pd
 
# import numpy as np
import numpy as np
 
# simple array
data = np.array(['g', 'e', 'e', 'k', 's'])
 
# providing an index
ser = pd.Series(data, index=[10, 11, 12, 13, 14])
print(ser)

Output:

  

 Creating a series from Lists: In order to create a series from a list, we first create the list and then pass it to the Series constructor. 

Python3

import pandas as pd
 
# a simple list
list = ['g', 'e', 'e', 'k', 's']
 
# create series form a list
ser = pd.Series(list)
print(ser)

Output :

 

Creating a series from Dictionary: In order to create a series from a dictionary, we first create the dictionary and then make a series from it. The dictionary keys are used to construct the index of the Series.  

Python3

import pandas as pd
 
# a simple dictionary
dict = {'Geeks': 10,
        'for': 20,
        'geeks': 30}
 
# create series from dictionary
ser = pd.Series(dict)
 
print(ser)

Output:

  

Creating a series from Scalar value: In order to create a series from scalar value, an index must be provided. The scalar value will be repeated to match the length of the index. 

Python3

import pandas as pd
 
import numpy as np
 
# giving a scalar value with index
ser = pd.Series(10, index=[0, 1, 2, 3, 4, 5])
 
print(ser)

Output:

   

Creating a series using NumPy functions : In order to create a series using NumPy functions, we can use different NumPy functions such as numpy.linspace() and numpy.random.randn().

Python3

# import pandas and numpy
import pandas as pd
import numpy as np
 
# series with numpy linspace()
ser1 = pd.Series(np.linspace(3, 33, 3))
print(ser1)
 
# series with numpy linspace()
ser2 = pd.Series(np.linspace(1, 100, 10))
print("\n", ser2)

Output:

 

Creating a Series using range function:

Python3

# code
import pandas as pd
ser = pd.Series(range(10))
print(ser)

Output:
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64

Creating a Series using for loop and list comprehension:

Python3

import pandas as pd
ser=pd.Series(range(1,20,3), index=[x for x in 'abcdefg'])
print(ser)

Output:

a     1
b     4
c     7
d    10
e    13
f    16
g    19
dtype: int64

Creating a Series using mathematical expressions:

Python3

import pandas as pd
import numpy as np
ser=np.arange(10,15)
serobj=pd.Series(data=ser*5,index=ser)
print(serobj)

Output:

10    50
11    55
12    60
13    65
14    70
dtype: int32


head

Example #1:

In this example, the top 5 rows of the data frame are returned and stored in a new variable. No parameter is passed to the .head() method since the default is 5.

# importing pandas module
import pandas as pd
  
# making data frame
data = pd.read_csv("nba.csv")
  
# calling head() method 
# storing in new variable
data_top = data.head()
  
# display
data_top

Output:
As shown in the output image, it can be seen that the index of returned rows is ranging from 0 to 4. Hence, top 5 rows were returned.

 
Example #2: Calling .head() on a Series with the n parameter

In this example, the .head() method is called on a series with a custom value of the n parameter to return the top 9 rows of the series.

# importing pandas module
import pandas as pd
  
# making data frame
data = pd.read_csv("nba.csv")
  
# number of rows to return
n = 9
  
# creating series
series = data["Name"]
  
# returning top n rows
top = series.head(n = n)
  
# display
top

Output:
As shown in the output image, top 9 rows ranging from 0 to 8th index position were returned.



describe

Pandas describe() is used to view some basic statistical details like percentile, mean, std, etc. of a data frame or a series of numeric values. When this method is applied to a series of strings, it returns a different output which is shown in the examples below.

Syntax: DataFrame.describe(percentiles=None, include=None, exclude=None) 

Parameters: 

  • percentiles: list-like of numbers between 0 and 1, giving the percentiles to return 
  • include: List of data types to be included while describing dataframe. Default is None 
  • exclude: List of data types to be Excluded while describing dataframe. Default is None 

Return type: Statistical summary of data frame.

Creating DataFrame for demonstration:

To download the data set used in the following example, click here. In the following examples, the data frame used contains data from some NBA players. Let’s have a look at the data by importing it.

Python3

import pandas as pd
# reading and printing csv file
data = pd.read_csv('nba.csv')
print(data.head())

Output: 

            Name            Team  Number Position   Age Height  Weight            College     Salary
0  Avery Bradley  Boston Celtics     0.0       PG  25.0    6-2   180.0              Texas  7730337.0  
1    Jae Crowder  Boston Celtics    99.0       SF  25.0    6-6   235.0          Marquette  6796117.0 
2   John Holland  Boston Celtics    30.0       SG  27.0    6-5   205.0  Boston University        NaN 
3    R.J. Hunter  Boston Celtics    28.0       SG  22.0    6-5   185.0      Georgia State  1148640.0 
4  Jonas Jerebko  Boston Celtics     8.0       PF  29.0   6-10   231.0                NaN  5000000.0 

Using Describe function in Pandas

We can easily learn about several statistical measures, including mean, median, standard deviation, quartiles, and more, by using describe() on a DataFrame.

Python3

print(data.describe())

Output:

           Number         Age      Weight        Salary
count  457.000000  457.000000  457.000000  4.460000e+02
mean    17.678337   26.938731  221.522976  4.842684e+06
std     15.966090    4.404016   26.368343  5.229238e+06
min      0.000000   19.000000  161.000000  3.088800e+04
25%      5.000000   24.000000  200.000000  1.044792e+06
50%     13.000000   26.000000  220.000000  2.839073e+06
75%     25.000000   30.000000  240.000000  6.500000e+06
max     99.000000   40.000000  307.000000  2.500000e+07

Explanation of the description of numerical columns (each of these figures is reproduced with the matching reducer in the short sketch after this list):

count: Total Number of Non-Empty values
mean: Mean of the column values
std: Standard Deviation of the column values
min: Minimum value from the column
25%: 25th percentile
50%: 50th percentile (the median)
75%: 75th percentile
max: Maximum value from the column
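These figures match what the individual reducer methods return; a minimal sketch on a small hand-made series (the values are illustrative, not taken from nba.csv):

Python3

import pandas as pd
 
# a small numeric series
ages = pd.Series([25, 25, 27, 22, 29])
 
# describe() bundles the summary statistics...
print(ages.describe())
 
# ...which match the individual methods
print(ages.count())           # number of non-empty values
print(ages.mean())            # mean of the values
print(ages.std())             # standard deviation
print(ages.quantile(0.25))    # 25th percentile
print(ages.min(), ages.max()) # minimum and maximum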

Pandas describe() behavior for mixed dtypes

In this example, the data frame is described with ['object', 'float', 'int'] passed to the include parameter, so the object columns are summarised alongside the numeric ones. [.20, .40, .60, .80] is passed to the percentiles parameter to view the respective percentiles of the numeric series. 

Python3

import pandas as pd
data = pd.read_csv('nba.csv')
 
# removing null values to avoid errors
data.dropna(inplace=True)
 
# percentile list
perc = [.20, .40, .60, .80]
 
# list of dtypes to include
include = ['object', 'float', 'int']
 
# calling describe method
desc = data.describe(percentiles=perc, include=include)
 
# display
desc

Output:

                 Name                  Team      Number Position         Age  \
count             364                   364  364.000000      364  364.000000   
unique            364                    30         NaN        5         NaN   
top     Avery Bradley  New Orleans Pelicans         NaN       SG         NaN   
freq                1                    16         NaN       87         NaN   
mean              NaN                   NaN   16.829670      NaN   26.615385   
std               NaN                   NaN   14.994162      NaN    4.233591   
min               NaN                   NaN    0.000000      NaN   19.000000   
20%               NaN                   NaN    4.000000      NaN   23.000000   
40%               NaN                   NaN    9.000000      NaN   25.000000   
50%               NaN                   NaN   12.000000      NaN   26.000000   
60%               NaN                   NaN   17.000000      NaN   27.000000   
80%               NaN                   NaN   30.000000      NaN   30.000000   
max               NaN                   NaN   99.000000      NaN   40.000000  
       Height      Weight   College        Salary  
count     364  364.000000       364  3.640000e+02  
unique     17         NaN       115           NaN  
top       6-9         NaN  Kentucky           NaN  
freq       49         NaN        22           NaN  
mean      NaN  219.785714       NaN  4.620311e+06  
std       NaN   24.793099       NaN  5.119716e+06  
min       NaN  161.000000       NaN  5.572200e+04  
20%       NaN  195.000000       NaN  9.472760e+05  
40%       NaN  212.000000       NaN  1.638754e+06  
50%       NaN  220.000000       NaN  2.515440e+06  
60%       NaN  228.000000       NaN  3.429934e+06  
80%       NaN  242.400000       NaN  7.838202e+06  
max       NaN  279.000000       NaN  2.287500e+07  

As shown in the output image, the Statistical description of the Dataframe was returned with the respectively passed percentiles. For the columns with strings, NaN was returned for numeric operations. 

Describing series of strings 

In this example, the describe() method is called on the Name column to see its behavior with the object data type. 

Python3

# importing pandas module
import pandas as pd
 
# making data frame
data = pd.read_csv("nba.csv")
 
# removing null values to avoid errors
data.dropna(inplace=True)
 
# calling describe method
desc = data["Name"].describe()
 
# display
desc

Output: As shown in the output image, the behavior of describe() is different with a series of strings. Different stats were returned like count of values, unique values, top, and frequency of occurrence in this case. 

count               457
unique              457
top       Avery Bradley
freq                  1
Name: Name, dtype: object



Dealing with Rows and Columns

Dealing with Columns

In order to deal with columns, we perform basic operations on columns like selecting, deleting, adding and renaming.

Column Selection:

In order to select a column in a Pandas DataFrame, we can access the columns by calling them by their column names.

Python3

# Import pandas package
import pandas as pd
 
# Define a dictionary containing employee data
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Age':[27, 24, 22, 32],
        'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
        'Qualification':['Msc', 'MA', 'MCA', 'Phd']}
 
# Convert the dictionary into DataFrame
df = pd.DataFrame(data)
 
# select two columns
print(df[['Name', 'Qualification']])

Output:

For more examples refer to How to select multiple columns in a pandas dataframe.

Column Addition:

In order to add a column to a Pandas DataFrame, we can declare a new list as a column and add it to an existing DataFrame.

Python3



# Import pandas package
import pandas as pd

# Define a dictionary containing Students data
data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Height': [5.1, 6.2, 5.1, 5.2],
        'Qualification': ['Msc', 'MA', 'Msc', 'Msc']}

# Convert the dictionary into DataFrame
df = pd.DataFrame(data)

# Declare a list that is to be converted into a column
address = ['Delhi', 'Bangalore', 'Chennai', 'Patna']

# Using 'Address' as the column name
# and equating it to the list
df['Address'] = address

# Observe the result
print(df)

Output:

For more examples refer to Adding new column to existing DataFrame in Pandas.

Column Deletion:

In order to delete a column in a Pandas DataFrame, we can use the drop() method. Columns are deleted by dropping them by their column names.

Python3

# importing pandas module
import pandas as pd
 
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name" )
 
# dropping passed columns
data.drop(["Team", "Weight"], axis = 1, inplace = True)
 
# display
print(data)

Output:

As shown in the output images, the new output doesn’t have the passed columns. Those values were dropped since axis was set equal to 1 and the changes were made in the original data frame since inplace was True.

Data Frame before Dropping Columns-

Data Frame after Dropping Columns-

For more examples refer to Delete columns from DataFrame using Pandas.drop()
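Renaming was listed among the basic column operations above but is not demonstrated in the examples; the following is only a minimal sketch using DataFrame.rename() (the new labels are illustrative):

Python3

# importing pandas module
import pandas as pd
 
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
 
# renaming columns with a mapping of old label -> new label;
# rename() returns a new frame unless inplace=True is passed
renamed = data.rename(columns = {"Team": "Franchise", "Salary": "Salary (USD)"})
 
# display the new column labels
print(renamed.columns)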

Dealing with Rows:

In order to deal with rows, we can perform basic operations on rows like selecting, deleting, adding and renaming.

Row Selection:

Pandas provides a unique method to retrieve rows from a data frame. The DataFrame.loc[] method is used to retrieve rows from a Pandas DataFrame. Rows can also be selected by passing an integer location to the iloc[] function.

Python

# importing pandas package
import pandas as pd
 
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
 
# retrieving row by loc method
first = data.loc["Avery Bradley"]
second = data.loc["R.J. Hunter"]
 
 
print(first, "\n\n\n", second)

Output:

As shown in the output image, two series were returned since there was only one parameter both of the times.

For more examples refer to Pandas Extracting rows using .loc[].

Row Addition:

In order to add a row to a Pandas DataFrame, we can concatenate the old dataframe with the new one.

Python3



# importing pandas module
import pandas as pd

# making data frame
df = pd.read_csv("nba.csv", index_col ="Name")

df.head(10)

new_row = pd.DataFrame({'Name':'Geeks', 'Team':'Boston', 'Number':3,
                        'Position':'PG', 'Age':33, 'Height':'6-2',
                        'Weight':189, 'College':'MIT', 'Salary':99999},
                       index =[0])
# simply concatenate both dataframes
df = pd.concat([new_row, df]).reset_index(drop = True)
df.head(5)

Output:

Data Frame before Adding Row-

Data Frame after Adding Row-

For more examples refer to Add a row at top in pandas DataFrame.

Row Deletion:

In order to delete a row in a Pandas DataFrame, we can use the drop() method. Rows are deleted by dropping them by their index labels.

Python3

# importing pandas module
import pandas as pd
 
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name" )
 
# dropping passed values
data.drop(["Avery Bradley", "John Holland", "R.J. Hunter",
                            "R.J. Hunter"], inplace = True)
 
# display
data

Output:

As shown in the output images, the new output doesn’t have the passed values. Those values were dropped and the changes were made in the original data frame since inplace was True.

Data Frame before Dropping values-

Data Frame after Dropping values-

Extracting rows using .loc[]

Python is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric Python packages. Pandas is one of those packages and makes importing and analyzing data much easier.

Pandas provides a unique method to retrieve rows from a data frame. The DataFrame.loc[] method takes only index labels and returns a row or a dataframe if the index label exists in the caller data frame.

Syntax: pandas.DataFrame.loc[]

Parameters:
Index label: String or list of string of index label of rows

Return type: Data frame or Series depending on parameters

To download the CSV used in code, click here.

Example #1: Extracting single Row

In this example, Name column is made as the index column and then two single rows are extracted one by one in the form of series using index label of rows.

# importing pandas package
import pandas as pd
  
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
  
# retrieving row by loc method
first = data.loc["Avery Bradley"]
second = data.loc["R.J. Hunter"]
  
  
print(first, "\n\n\n", second)

Output:
As shown in the output image, two series were returned since there was only one parameter both of the times.

 
Example #2: Multiple parameters

In this example, Name column is made as the index column and then two single rows are extracted at the same time by passing a list as parameter.

# importing pandas package
import pandas as pd
  
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
  
# retrieving rows by loc method
rows = data.loc[["Avery Bradley", "R.J. Hunter"]]
  
# checking data type of rows
print(type(rows))
  
# display
rows

Output:
As shown in the output image, this time the data type of returned value is a data frame. Both of the rows were extracted and displayed like a new data frame.

 

Example #3: Extracting multiple rows with same index

In this example, Team name is made as the index column and one team name is passed to .loc method to check if all values with same team name have been returned or not.

# importing pandas package
import pandas as pd
  
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Team")
  
# retrieving rows by loc method
rows = data.loc["Utah Jazz"]
  
# checking data type of rows
print(type(rows))
  
# display
rows

Output:
As shown in the output image, All rows with team name “Utah Jazz” were returned in the form of a data frame.

 

Example #4: Extracting rows between two index labels

In this example, two index label of rows are passed and all the rows that fall between those two index label have been returned (Both index labels Inclusive).

# importing pandas package
import pandas as pd
  
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
  
# retrieving rows by loc method
rows = data.loc["Avery Bradley":"Isaiah Thomas"]
  
# checking data type of rows
print(type(rows))
  
# display
rows

Output:
As shown in the output image, all the rows that fall between passed two index labels are returned in the form of a data frame.



Extracting rows using Pandas .iloc[]

Example 1: Extracting a single row and comparing with .loc[] In this example, the row at the same index number is extracted by both the .iloc[] and .loc[] methods and compared. Since the index column is numeric by default, the index labels will also be integers. 

Python3

# importing pandas package
import pandas as pd
 
# making data frame from csv file
data = pd.read_csv("nba.csv")
 
# retrieving rows by loc method
row1 = data.loc[3]
 
# retrieving rows by iloc method
row2 = data.iloc[3]
 
# checking if values are equal
row1 == row2

Output: 

As shown in the output image, the results returned by both methods are the same.

 

   

Example 2: Extracting multiple rows with index In this example, multiple rows are extracted, first by passing a list and then by passing integers to extract rows between that range. After that, both values are compared. 

Python3

# importing pandas package
import pandas as pd
 
# making data frame from csv file
data = pd.read_csv("nba.csv")
 
# retrieving rows by iloc method (passing a list)
row1 = data.iloc[[4, 5, 6, 7]]
 
# retrieving rows by iloc method (passing a slice)
row2 = data.iloc[4:8]
 
# comparing values
row1 == row2

Output: 

As shown in the output image, the results returned by both methods are the same. All values are True except values in the college column since those were NaN values. 

 



Indexing and Selecting Data

Pandas Indexing using [ ], .loc[ ], .iloc[ ], .ix[ ]

There are a lot of ways to pull elements, rows, and columns from a DataFrame. There are some indexing methods in Pandas which help in getting an element from a DataFrame. These indexing methods appear very similar but behave very differently. Pandas supports four types of multi-axes indexing; they are:

  • Dataframe[ ] : This function is also known as the indexing operator
  • Dataframe.loc[ ] : This function is used for labels.
  • Dataframe.iloc[ ] : This function is used for positions or integer-based selection.
  • Dataframe.ix[ ] : This function is used for both label- and integer-based selection.

Collectively, they are called the indexers. These are by far the most common ways to index data. They are four functions which help in getting the elements, rows, and columns from a DataFrame.
 
Indexing a Dataframe using indexing operator [] :
The indexing operator refers to the square brackets following an object. The .loc and .iloc indexers also use the indexing operator to make selections. In this section, the indexing operator refers to df[].

Selecting a single column

In order to select a single column, we simply put the name of the column in between the brackets.

# importing pandas package
import pandas as pd
  
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
  
# retrieving columns by indexing operator
first = data["Age"]
  
  
  
print(first)

Output:

Selecting multiple columns

In order to select multiple columns, we have to pass a list of columns in an indexing operator.

# importing pandas package
import pandas as pd
  
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
  
# retrieving multiple columns by indexing operator
first = data[["Age", "College", "Salary"]]
  
  
  
first

Output:

 
Indexing a DataFrame using .loc[ ] :
This function selects data by the label of the rows and columns. The df.loc indexer selects data in a different way than just the indexing operator. It can select subsets of rows or columns. It can also simultaneously select subsets of rows and columns.

Selecting a single row

In order to select a single row using .loc[], we put a single row label in a .loc function.

# importing pandas package
import pandas as pd
  
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
  
# retrieving row by loc method
first = data.loc["Avery Bradley"]
second = data.loc["R.J. Hunter"]
  
  
print(first, "\n\n\n", second)

Output:
As shown in the output image, two series were returned since there was only one parameter both of the times.

 

Selecting multiple rows

In order to select multiple rows, we put all the row labels in a list and pass that to .loc function.

import pandas as pd
  
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
  
# retrieving multiple rows by loc method
first = data.loc[["Avery Bradley", "R.J. Hunter"]]
  
  
  
print(first)

Output:

 

Selecting two rows and three columns

In order to select two rows and three columns, we put the two row labels and the three column labels in separate lists, like this:

Dataframe.loc[["row1", "row2"], ["column1", "column2", "column3"]]

import pandas as pd
  
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
  
# retrieving two rows and three columns by loc method
first = data.loc[["Avery Bradley", "R.J. Hunter"],
                   ["Team", "Number", "Position"]]
  
  
  
print(first)

Output:

 

Selecting all of the rows and some columns

In order to select all of the rows and some columns, we use a single colon [:] to select all rows and a list of the columns which we want to select, like this:

Dataframe.loc[:, ["column1", "column2", "column3"]]

import pandas as pd
  
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
  
# retrieving all rows and some columns by loc method
first = data.loc[:, ["Team", "Number", "Position"]]
  
  
  
print(first)

Output:

 
Indexing a DataFrame using .iloc[ ] :
This function allows us to retrieve rows and columns by position. In order to do that, we’ll need to specify the positions of the rows that we want, and the positions of the columns that we want as well. The df.iloc indexer is very similar to df.loc but only uses integer locations to make its selections.

Selecting a single row

In order to select a single row using .iloc[], we can pass a single integer to .iloc[] function.

import pandas as pd
  
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
  
  
# retrieving rows by iloc method 
row2 = data.iloc[3]
  
  
  
print(row2)

Output:

 

Selecting multiple rows

In order to select multiple rows, we can pass a list of integer to .iloc[] function.

import pandas as pd
  
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
  
  
# retrieving multiple rows by iloc method 
row2 = data.iloc [[3, 5, 7]]
  
  
  
row2

Output:

 

Selecting two rows and two columns

In order to select two rows and two columns, we create a list of 2 integers for the rows and a list of 2 integers for the columns, then pass them to the .iloc[] function.

import pandas as pd
  
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
  
  
# retrieving two rows and two columns by iloc method 
row2 = data.iloc [[3, 4], [1, 2]]
  
  
  
print(row2)

Output:

 

Selecting all the rows and some columns

In order to select all rows and some columns, we use a single colon [:] to select all rows, and for the columns we make a list of integers, then pass it to the .iloc[] function.

import pandas as pd
  
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
  
  
# retrieving all rows and some columns by iloc method 
row2 = data.iloc [:, [1, 2]]
  
  
  
print(row2)

Output:

 
Indexing a DataFrame using Dataframe.ix[ ] :
Early in the development of pandas, there existed another indexer, ix. This indexer was capable of selecting both by label and by integer location. While it was versatile, it caused lots of confusion because it is not explicit: sometimes integers can also be labels for rows or columns, so there were instances where it was ambiguous. Generally, ix is label based and acts just like the .loc indexer; however, .ix also supports integer-type selections (as in .iloc) when passed an integer, which only works where the index of the DataFrame is not integer based. .ix will accept any of the inputs of .loc and .iloc.
Note: The .ix indexer has been deprecated and removed in recent versions of Pandas.
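Because .ix has been removed from current pandas releases, the same selections are now written with .loc (for labels) and .iloc (for positions). A minimal equivalent sketch:

# importing pandas package
import pandas as pd
   
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
   
# label-based selection, as .ix did when given a label
first = data.loc["Avery Bradley"]
   
# position-based selection, as .ix did when given an integer
second = data.iloc[1]
   
print(first, "\n\n", second)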

Selecting a single row using .ix[] as .loc[]

In order to select a single row, we put a single row label in the .ix function. This function acts like .loc[] if we pass a row label as the argument.

# importing pandas package
import pandas as pd
   
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
   
# retrieving row by ix method
first = data.ix["Avery Bradley"]
  
   
   
print(first)
  

Output:

Selecting a single row using .ix[] as .iloc[]

In order to select a single row, we can pass a single integer to the .ix[] function. When given an integer, .ix[] behaves like the .iloc[] function.

# importing pandas package
import pandas as pd
   
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
   
# retrieving row by ix method
first = data.ix[1]
  
   
   
print(first)

Output:

 

Methods for indexing in DataFrame

Function: Description
Dataframe.head(): Return the top n rows of a data frame.
Dataframe.tail(): Return the bottom n rows of a data frame.
Dataframe.at[]: Access a single value for a row/column label pair.
Dataframe.iat[]: Access a single value for a row/column pair by integer position.
Dataframe.iloc[]: Purely integer-location based indexing for selection by position.
DataFrame.lookup(): Label-based "fancy indexing" function for DataFrame.
DataFrame.pop(): Return item and drop from frame.
DataFrame.xs(): Return a cross-section (row(s) or column(s)) from the DataFrame.
DataFrame.get(): Get item from object for given key (DataFrame column, Panel slice, etc.).
DataFrame.isin(): Return a boolean DataFrame showing whether each element in the DataFrame is contained in values.
DataFrame.where(): Return an object of the same shape as self whose corresponding entries are from self where cond is True and otherwise are from other.
DataFrame.mask(): Return an object of the same shape as self whose corresponding entries are from self where cond is False and otherwise are from other.
DataFrame.query(): Query the columns of a frame with a boolean expression.
DataFrame.insert(): Insert a column into the DataFrame at a specified location.
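A few of these methods are not demonstrated elsewhere in this article; the following is only a minimal sketch of at[], iat[], query() and insert() on a small hand-made frame (the values are illustrative):

Python3

import pandas as pd
 
df = pd.DataFrame({'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
                   'Age': [27, 24, 22, 32]})
 
# at[]: single value by row label / column label
print(df.at[0, 'Name'])
 
# iat[]: single value by row position / column position
print(df.iat[3, 1])
 
# query(): filter rows with a boolean expression
print(df.query('Age > 23'))
 
# insert(): add a column at a specific position (in place)
df.insert(1, 'City', ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'])
print(df)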


Boolean Indexing

Accessing a DataFrame with a boolean index: 

In order to access a dataframe with a boolean index, we have to create a dataframe in which the index of dataframe contains a boolean value that is “True” or “False”.

Example  

Python3

# importing pandas as pd
import pandas as pd
  
# dictionary of lists
dict = {'name':["aparna", "pankaj", "sudhir", "Geeku"],
        'degree': ["MBA", "BCA", "M.Tech", "MBA"],
        'score':[90, 40, 80, 98]}
  
df = pd.DataFrame(dict, index = [True, False, True, False])
  
print(df)

Output: 
 

Now that we have created a dataframe with a boolean index, the user can access it with the help of that boolean index. The dataframe can be accessed using three functions: .loc[], .iloc[] and .ix[]. 

Accessing a Dataframe with a boolean index using .loc[]

In order to access a dataframe with a boolean index using .loc[], we simply pass a boolean value (True or False) in a .loc[] function. 

Python3

# importing pandas as pd
import pandas as pd
  
# dictionary of lists
dict = {'name':["aparna", "pankaj", "sudhir", "Geeku"],
        'degree': ["MBA", "BCA", "M.Tech", "MBA"],
        'score':[90, 40, 80, 98]}
 
# creating a dataframe with boolean index
df = pd.DataFrame(dict, index = [True, False, True, False])
 
# accessing a dataframe using .loc[] function
print(df.loc[True])

Output: 
 

Accessing a Dataframe with a boolean index using .iloc[]

In order to access a dataframe using .iloc[], we cannot pass a boolean value (True or False), because the iloc[] function accepts only integers as arguments and will throw an error; we can only access the dataframe by passing an integer to the iloc[] function. 

Code #1:  

Python3

# importing pandas as pd
import pandas as pd
  
# dictionary of lists
dict = {'name':["aparna", "pankaj", "sudhir", "Geeku"],
        'degree': ["MBA", "BCA", "M.Tech", "MBA"],
        'score':[90, 40, 80, 98]}
 
# creating a dataframe with boolean index 
df = pd.DataFrame(dict, index = [True, False, True, False])
 
# accessing a dataframe using .iloc[] function
print(df.iloc[True])

Output: 

TypeError

Code #2:  

Python3

# importing pandas as pd
import pandas as pd
  
# dictionary of lists
dict = {'name':["aparna", "pankaj", "sudhir", "Geeku"],
        'degree': ["MBA", "BCA", "M.Tech", "MBA"],
        'score':[90, 40, 80, 98]}
 
# creating a dataframe with boolean index 
df = pd.DataFrame(dict, index = [True, False, True, False])
  
 
# accessing a dataframe using .iloc[] function
print(df.iloc[1])

Output:

 

Accessing a Dataframe with a boolean index using .ix[]

In order to access a dataframe using .ix[], we have to pass boolean value (True or False) and integer value to .ix[] function because as we know that .ix[] function is a hybrid of .loc[] and .iloc[] function. 

Code #1:  

Python3

# importing pandas as pd
import pandas as pd
  
# dictionary of lists
dict = {'name':["aparna", "pankaj", "sudhir", "Geeku"],
        'degree': ["MBA", "BCA", "M.Tech", "MBA"],
        'score':[90, 40, 80, 98]}
 
# creating a dataframe with boolean index
df = pd.DataFrame(dict, index = [True, False, True, False])
  
 
# accessing a dataframe using .ix[] function
print(df.ix[True])

Output: 


 

Code #2: 

Python

# importing pandas as pd
import pandas as pd
  
# dictionary of lists
dict = {'name':["aparna", "pankaj", "sudhir", "Geeku"],
        'degree': ["MBA", "BCA", "M.Tech", "MBA"],
        'score':[90, 40, 80, 98]}
 
# creating a dataframe with boolean index
df = pd.DataFrame(dict, index = [True, False, True, False])
  
 
# accessing a dataframe using .ix[] function
print(df.ix[1])

Output: 
 

Applying a boolean mask to a dataframe : 

In a dataframe, we can apply a boolean mask. In order to do that we can use the __getitem__ (i.e. []) accessor. We apply a boolean mask by giving a list of True and False values with the same length as the dataframe. When we apply a boolean mask, only the rows for which True is passed are printed. To download the "nba1.1" CSV file click here.

Code #1: 

Python3

# importing pandas as pd
import pandas as pd
  
# dictionary of lists
dict = {'name':["aparna", "pankaj", "sudhir", "Geeku"],
        'degree': ["MBA", "BCA", "M.Tech", "MBA"],
        'score':[90, 40, 80, 98]}
  
df = pd.DataFrame(dict, index = [0, 1, 2, 3])
  
 
 
print(df[[True, False, True, False]])

Output: 
 

Code #2: 
 

Python3

# importing pandas package
import pandas as pd
  
# making data frame from csv file
data = pd.read_csv("nba1.1.csv")
  
df = pd.DataFrame(data, index = [0, 1, 2, 3, 4, 5, 6,
                                 7, 8, 9, 10, 11, 12])
 
  
print(df[[True, False, True, False, True,
          False, True, False, True, False,
          True, False, True]])

Output: 
 

Masking data based on column value: 

In a dataframe we can filter data based on a column value. In order to filter data, we can apply certain conditions on the dataframe using different comparison operators like ==, >, <, <=, >=. When we apply these operators to the dataframe, they produce a Series of True and False values. To download the "nba.csv" CSV, click here.

Code #1: 

Python

# importing pandas as pd
import pandas as pd
  
# dictionary of lists
dict = {'name':["aparna", "pankaj", "sudhir", "Geeku"],
        'degree': ["BCA", "BCA", "M.Tech", "BCA"],
        'score':[90, 40, 80, 98]}
 
# creating a dataframe
df = pd.DataFrame(dict)
  
# using a comparison operator for filtering of data
print(df['degree'] == 'BCA')

Output: 
 

Code #2: 

Python

# importing pandas package
import pandas as pd
  
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
  
# using greater than operator for filtering of data
print(data['Age'] > 25)

Output: 
 

Masking data based on index value : 

In a dataframe we can also filter data based on the index values. In order to do so, we create a mask based on the index using different operators like ==, >, <, etc. To download the "nba1.1" CSV file click here.

Code #1:  

Python3

# importing pandas as pd
import pandas as pd
  
# dictionary of lists
dict = {'name':["aparna", "pankaj", "sudhir", "Geeku"],
        'degree': ["BCA", "BCA", "M.Tech", "BCA"],
        'score':[90, 40, 80, 98]}
  
 
df = pd.DataFrame(dict, index = [0, 1, 2, 3])
 
mask = df.index == 0
 
print(df[mask])

Output: 
 

Code #2: 
 

Python3

# importing pandas package
import pandas as pd
  
# making data frame from csv file
data = pd.read_csv("nba1.1.csv")
 
# giving a index to a dataframe
df = pd.DataFrame(data, index = [0, 1, 2, 3, 4, 5, 6,
                                 7, 8, 9, 10, 11, 12])
 
# filtering data on index value
mask = df.index > 7
 
print(df[mask])

Output: 
 

 



DataFrame.ix[ ]

Pandas DataFrame.ix[ ] is both a label- and integer-based slicing technique. Besides pure label-based and integer-based indexing, Pandas provides this hybrid method for selecting and subsetting the object using the ix[] operator. ix[] is the most general indexer and will support any of the inputs of loc[] and iloc[].

Syntax: DataFrame.ix[ ]

Parameters:
Index Position: Index position of rows in integer or list of integer.
Index label: String or list of string of index label of rows

Returns: Data frame or Series depending on parameters

Code #1:

# importing pandas package 
import pandas as geek
    
# making data frame from csv file
data = geek.read_csv("nba.csv")
# Integer slicing
print("Slicing only rows(till index 4):")
x1 = data.ix[:4, ]
print(x1, "\n")
   
print("Slicing rows and columns(rows=4, col 1-4, excluding 4):")
x2 = data.ix[:4, 1:4]
print(x2)

Output :


 
Code #2:

# importing pandas package 
import pandas as geek
    
# making data frame from csv file
data = geek.read_csv("nba.csv")  
    
# Index slicing on Height column
print("After index slicing:")
x1 = data.ix[10:20, 'Height']
print(x1, "\n")
  
# Index slicing on Salary column
x2 = data.ix[10:20, 'Salary']
print(x2)

Output:

 

Code #3:

# importing pandas and numpy
import pandas as pd
import numpy as np
   
df = pd.DataFrame(np.random.randn(10, 4),
          columns = ['A', 'B', 'C', 'D'])
  
print("Original DataFrame: \n" , df)
   
# Integer slicing
print("\n Slicing only rows:")
print("--------------------------")
x1 = df.ix[:4, ]
print(x1)
   
print("\n Slicing rows and columns:")
print("----------------------------")
x2 = df.ix[:4, 1:3]
print(x2)

Output :

 
Code #4:

# importing pandas and numpy
import pandas as pd
import numpy as np
   
df = pd.DataFrame(np.random.randn(10, 4),
          columns = ['A', 'B', 'C', 'D'])
  
print("Original DataFrame: \n" , df)
   
# Integer slicing (printing all the rows of column 'A')
print("\n After index slicing (On 'A'):")
print("--------------------------")
x = df.ix[:, 'A']
  
print(x)

Output :



Series.str.slice()

Python is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric python packages. Pandas is one of those packages and makes importing and analyzing data much easier.

Pandas str.slice() method is used to slice substrings from a string present in a Pandas series object. It is very similar to Python's basic principle of slicing objects, which works on [start:stop:step]; it takes three parameters: where to start, where to stop and how many characters to step over.
Since this is a pandas string method, .str has to be prefixed every time before calling this method. Otherwise, it gives an error.

Syntax: Series.str.slice(start=None, stop=None, step=None)

Parameters:
start: int value, tells where to start slicing
stop: int value, tells where to end slicing
step: int value, tells how much characters to step during slicing

Return type: Series with sliced substrings

To download the CSV used in code, click here.

In the following examples, the data frame used contains data of some NBA players. The image of data frame before any operations is attached below.

 
Example #1:
In this example, the salary column has been sliced to get values before decimal. For example, we want to do some mathematical operations and for that we need integer data, so the salary column will be sliced till the 2nd last element(-2 position).
Since the salary column is imported as float64 data type, it is first converted to string using the .astype() method.

# importing pandas module 
import pandas as pd 
    
# making data frame 
data = pd.read_csv("nba.csv")
    
# removing null values to avoid errors 
data.dropna(inplace = True)
  
# start stop and step variables
start, stop, step = 0, -2, 1
  
# converting to string data type
data["Salary"]= data["Salary"].astype(str)
  
# slicing till 2nd last element
data["Salary (int)"]= data["Salary"].str.slice(start, stop, step)
  
# display
data.head(10)

Output:
As shown in the output image, the string has been sliced and the part before the decimal point is stored in a new column.


Note: This method doesn’t have any parameters to handle null values, hence they were removed beforehand using the .dropna() method.
 
Example #2:
In this example, the Name column is sliced with the step parameter set to 2, so only every second character is kept during slicing.

# importing pandas module 
import pandas as pd 
    
# making data frame 
data = pd.read_csv("nba.csv")
    
# removing null values to avoid errors 
data.dropna(inplace = True) 
  
# start stop and step variables
start, stop, step = 0, -2, 2
  
# slicing till 2nd last element
data["Name"]= data["Name"].str.slice(start, stop, step)
  
# display
data.head(10)

Output:
As can be seen in the output image, the Name was sliced and only every second character was kept.



Taking column-slices of a DataFrame in Pandas

Creating Dataframe to slice columns

Python3

# importing pandas
import pandas as pd
 
# Using DataFrame() method from pandas module
df1 = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6, 7],
                    "b": [2, 3, 4, 2, 3, 4, 5],
                    "c": [3, 4, 5, 2, 3, 4, 5],
                    "d": [4, 5, 6, 2, 3, 4, 5],
                    "e": [5, 6, 7, 2, 3, 4, 5]})
 
display(df1)

Output:

 

Method 1: Slice Columns in pandas using reindex

Slicing column from ‘c’ to ‘b’.

Python3

df2 = df1.reindex(columns = ['c','b'])
print(df2)

Output:

 

Method 2: Slice Columns in pandas using loc[]

The df.loc[] indexer is part of the pandas package and can be used to slice a DataFrame by label. Pandas DataFrame.loc accesses a group of rows and columns by label(s) or a boolean array in the given DataFrame.

Syntax: [ : , first : last : step]

Example 1:

Slicing column from ‘b’ to ‘d’ with step 2.

Python3

df2 = df1.loc[:, "b":"d":2]
print(df2)

Output:

 

Example 2:

Slicing column from ‘c’ to ‘e’ with step 1.

Python3

df2 = df1.loc[:, "c":"e":1]
print(df2)

Output:

 

Method 3: Slice Columns in pandas using iloc[]

The iloc indexer is also part of the pandas package and slices a DataFrame by integer position. df.iloc[] is useful when the index labels of a data frame are something other than the numeric sequence 0, 1, 2, …, n, or when the user doesn’t know the index labels; rows and columns are addressed purely by their integer position, which is not shown as a column in the data frame.

Syntax: [ start : stop : step]

Example 1:

Slicing columns from position 1 up to (but not including) position 3 with step 1, i.e. columns ‘b’ and ‘c’.

Python3

df2 = df1.iloc[:, 1:3:1]
print(df2)

Output:

 

Example 2:

Slicing columns from position 0 up to (but not including) position 3 with step 2, i.e. columns ‘a’ and ‘c’.

Python3

df2 = df1.iloc[:, 0:3:2]
print(df2)

Output:

 



Pandas.apply()

import pandas as pd
s = pd.read_csv("stock.csv", squeeze=True)

Syntax:

s.apply(func, convert_dtype=True, args=())

Parameters:

func: .apply takes a function and applies it to all values of the pandas series.
convert_dtype: Convert the dtype as per the function’s operation.
args=(): Additional positional arguments to pass to the function, after the series value.
Return Type: Pandas Series after the function/operation has been applied.
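Note: the squeeze argument of read_csv was deprecated and later removed in newer pandas versions; on current releases the single-column file can be reduced to a Series with pd.read_csv("stock.csv").squeeze("columns") instead. A minimal sketch using a small in-memory Series (made-up prices) to illustrate the args parameter:

import pandas as pd

# made-up prices standing in for the stock.csv column
s = pd.Series([150.0, 250.0, 450.0], name = "Stock Price")

# thresholds are passed as extra positional arguments via args
def label(num, low, high):
    if num < low:
        return "Low"
    elif num < high:
        return "Normal"
    return "High"

print(s.apply(label, args = (200, 400)))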

For the dataset, click here to download.

Example #1:

The following example passes a function that checks the value of each element in the series and returns "Low", "Normal" or "High" accordingly.

import pandas as pd
  
# reading csv
s = pd.read_csv("stock.csv", squeeze = True)
  
# defining function to check price
def fun(num):
  
    if num<200:
        return "Low"
  
    elif num>= 200 and num<400:
        return "Normal"
  
    else:
        return "High"
  
# passing function to apply and storing the returned series in 'new'
new = s.apply(fun)
  
# printing first 3 element
print(new.head(3))
  
# printing elements somewhere near the middle of series
print(new[1400], new[1500], new[1600])
  
# printing last 3 elements
print(new.tail(3))

Output:

Example #2:

In the following example, a temporary anonymous function is made in .apply itself using lambda. It adds 5 to each value in series and returns a new series.

import pandas as pd
s = pd.read_csv("stock.csv", squeeze = True)
  
# adding 5 to each value
new = s.apply(lambda num : num + 5)
  
# printing first 5 elements of old and new series
print(s.head(), '\n', new.head())
  
# printing last 5 elements of old and new series
print('\n\n', s.tail(), '\n', new.tail())

Output:

0    50.12
1    54.10
2    54.65
3    52.38
4    52.95
Name: Stock Price, dtype: float64 

0    55.12
1    59.10
2    59.65
3    57.38
4    57.95
Name: Stock Price, dtype: float64

3007    772.88
3008    771.07
3009    773.18
3010    771.61
3011    782.22
Name: Stock Price, dtype: float64
 
3007    777.88
3008    776.07
3009    778.18
3010    776.61
3011    787.22
Name: Stock Price, dtype: float64

Apply function to every row in a Pandas DataFrame

One can use apply() function in order to apply function to every row in given dataframe. Let’s see the ways we can do this task.

Example 1: 

Python3

# Import pandas package
import pandas as pd
 
# Function to add
def add(a, b, c):
    return a + b + c
 
def main():
     
    # create a dictionary with
    # three fields each
    data = {
            'A':[1, 2, 3],
            'B':[4, 5, 6],
            'C':[7, 8, 9] }
     
    # Convert the dictionary into DataFrame
    df = pd.DataFrame(data)
    print("Original DataFrame:\n", df)
     
    df['add'] = df.apply(lambda row : add(row['A'],
                     row['B'], row['C']), axis = 1)
  
    print('\nAfter Applying Function: ')
    # printing the new dataframe
    print(df)
  
if __name__ == '__main__':
    main()

Output: 

Original DataFrame:
    A  B  C
0  1  4  7
1  2  5  8
2  3  6  9

After Applying Function: 
   A  B  C  add
0  1  4  7   12
1  2  5  8   15
2  3  6  9   18

Example 2: A NumPy function can also be passed directly to apply().
 

Python3

import pandas as pd
import numpy as np
  
def main():
     
    # create a dictionary with
    # three fields each
    data = {
            'A':[1, 2, 3],
            'B':[4, 5, 6],
            'C':[7, 8, 9] }
     
    # Convert the dictionary into DataFrame
    df = pd.DataFrame(data)
    print("Original DataFrame:\n", df)
     
    # applying function to each row in the dataframe
    # and storing result in a new column
    df['add'] = df.apply(np.sum, axis = 1)
  
    print('\nAfter Applying Function: ')
    # printing the new dataframe
    print(df)
  
if __name__ == '__main__':
    main()

Output: 

Original DataFrame:
    A  B  C
0  1  4  7
1  2  5  8
2  3  6  9

After Applying Function: 
   A  B  C  add
0  1  4  7   12
1  2  5  8   15
2  3  6  9   18

Example 3: Normalising Data 

Python3

# Import pandas and numpy packages
import pandas as pd
import numpy as np
 
def normalize(x, y):
    x_new = ((x - np.mean([x, y])) /
             (max(x, y) - min(x, y)))
     
    # print(x_new)
    return x_new
 
def main():
     
    # create a dictionary with three fields each
    data = {
        'X':[1, 2, 3],
        'Y':[45, 65, 89] }
     
    # Convert the dictionary into DataFrame
    df = pd.DataFrame(data)
    print("Original DataFrame:\n", df)
     
    df['X'] = df.apply(lambda row : normalize(row['X'],
                                  row['Y']), axis = 1)
  
    print('\nNormalized:')
    print(df)
  
if __name__ == '__main__':
    main()

Output: 

Original DataFrame:
    X   Y
0   1  45
1   2  65
2   3  89

Normalized:
     X   Y
0 -0.5  45
1 -0.5  65
2 -0.5  89

Example 4: Generate range

Python3

import pandas as pd
import numpy as np
  
pd.options.mode.chained_assignment = None
 
# Function to generate range
def generate_range(n):
     
    # printing the range for eg:
    # input is 67 output is 60-70
    n = int(n)
     
    lower_limit = n//10 * 10
    upper_limit = lower_limit + 10
     
    return str(str(lower_limit) + '-' + str(upper_limit))
      
def replace(row):
    for i, item in enumerate(row):
         
        # updating the value of the row
        row[i] = generate_range(item)
    return row
          
  
def main():
    # create a dictionary with
    # three fields each
    data = {
            'A':[0, 2, 3],
            'B':[4, 15, 6],
            'C':[47, 8, 19] }
     
    # Convert the dictionary into DataFrame
    df = pd.DataFrame(data)
  
    print('Before applying function: ')
    print(df)
      
    # applying the function to every value in the
    # dataframe and storing the result back in df
    # (apply passes one column at a time by default)
    df = df.apply(lambda row : replace(row))
      
  
    print('After Applying Function: ')
    # printing the new dataframe
    print(df)
  
if __name__ == '__main__':
    main()

Output: 

Before applying function: 
   A   B   C
0  0   4  47
1  2  15   8
2  3   6  19
After Applying Function: 
      A      B      C
0  0-10   0-10  40-50
1  0-10  10-20   0-10
2  0-10   0-10  10-20
 


dataframe.aggregate()

Dataframe.aggregate() function is used to apply some aggregation across one or more columns. It aggregates using a callable, string, dict, or list of strings/callables. The most frequently used aggregations are:

sum: Return the sum of the values for the requested axis
min: Return the minimum of the values for the requested axis
max: Return the maximum of the values for the requested axis

Syntax: DataFrame.aggregate(func, axis=0, *args, **kwargs)

Parameters:
func : callable, string, dictionary, or list of string/callables. Function to use for aggregating the data. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply. For a DataFrame, can pass a dict, if the keys are DataFrame column names.
axis : (default 0) {0 or ‘index’, 1 or ‘columns’} 0 or ‘index’: apply function to each column. 1 or ‘columns’: apply function to each row.

Returns: Aggregated DataFrame
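A minimal, self-contained sketch on a small in-memory frame (made-up data, independent of the nba.csv file used below):

# importing pandas package
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3],
                   "b": [10, 20, 30]})

# the same aggregations for every column
print(df.aggregate(['sum', 'min']))

# different aggregations per column
print(df.aggregate({"a": ['sum'], "b": ['min', 'max']}))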

For link to CSV file Used in Code, click here

Example #1: Aggregate ‘sum’ and ‘min’ function across all the columns in data frame.

# importing pandas package
import pandas as pd
  
# making data frame from csv file
df = pd.read_csv("nba.csv")
  
# printing the first 10 rows of the dataframe
df[:10]

Aggregation works with only numeric type columns.
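Depending on the pandas version, non-numeric columns may be skipped or may raise an error when aggregated this way, so it is often safer to restrict the frame to its numeric columns first. A minimal sketch, assuming df has been read from nba.csv as above:

# keep only the numeric columns before aggregating
numeric_df = df.select_dtypes(include = "number")
numeric_df.aggregate(['sum', 'min'])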

# Applying aggregation across all the columns 
# sum and min will be found for each 
# numeric type column in df dataframe
  
df.aggregate(['sum', 'min'])

Output:
For each column having numeric values, the minimum and the sum of all values have been found. For the dataframe df, there are four such columns: Number, Age, Weight, and Salary.

 
Example #2:

In Pandas, we can also apply different aggregation functions across different columns. For that, we need to pass a dictionary with key containing the column names and values containing the list of aggregation functions for any specific column.

# importing pandas package
import pandas as pd
  
# making data frame from csv file
df = pd.read_csv("nba.csv")
  
# We are going to find aggregation for these columns
df.aggregate({"Number":['sum', 'min'],
              "Age":['max', 'min'],
              "Weight":['min', 'sum'], 
              "Salary":['sum']})

Output:
A separate aggregation has been applied to each column; if a specific aggregation is not applied to a column, the corresponding cell holds a NaN value.




Adding new column

By declaring a new list as a column

Python3

# Import pandas package
import pandas as pd
 
# Define a dictionary containing Students data
data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Height': [5.1, 6.2, 5.1, 5.2],
        'Qualification': ['Msc', 'MA', 'Msc', 'Msc']}
 
# Convert the dictionary into DataFrame
df = pd.DataFrame(data)
 
# Declare a list that is to be converted into a column
address = ['Delhi', 'Bangalore', 'Chennai', 'Patna']
 
# Using 'Address' as the column name
# and equating it to the list
df['Address'] = address
 
# Observe the result
print(df)

Output: 

Note that the length of your list should match the length of the DataFrame's index, otherwise an error is raised. 

By using DataFrame.insert()

It gives the freedom to add a column at any position we like and not just at the end. It also provides different options for inserting the column values.

Python3

# Import pandas package
import pandas as pd
 
# Define a dictionary containing Students data
data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Height': [5.1, 6.2, 5.1, 5.2],
        'Qualification': ['Msc', 'MA', 'Msc', 'Msc']}
 
# Convert the dictionary into DataFrame
df = pd.DataFrame(data)
 
# Using DataFrame.insert() to add an 'Age' column at
# position 2; the final argument (True) is allow_duplicates
df.insert(2, "Age", [21, 23, 24, 21], True)
 
# Observe the result
print(df)

Output: 

Using Dataframe.assign() method

This method will create a new dataframe with a new column added to the old dataframe.

Python3

# Import pandas package
import pandas as pd
 
# Define a dictionary containing Students data
data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Height': [5.1, 6.2, 5.1, 5.2],
        'Qualification': ['Msc', 'MA', 'Msc', 'Msc']}
 
 
# Convert the dictionary into DataFrame
df = pd.DataFrame(data)
 
# Using 'Address' as the column name and equating it to the list
df2 = df.assign(address=['Delhi', 'Bangalore', 'Chennai', 'Patna'])
 
# Observe the result
print(df2)

Output: 

By using a dictionary

We can use a Python dictionary to add a new column to a pandas DataFrame: the values of an existing column serve as the dictionary keys, and the mapped dictionary values become the values of the new column.

Python3

# Import pandas package
import pandas as pd
 
# Define a dictionary containing Students data
data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Height': [5.1, 6.2, 5.1, 5.2],
        'Qualification': ['Msc', 'MA', 'Msc', 'Msc']}
 
# Define a dictionary keyed by the values of an
# existing column (Name); the mapped values become
# the values for our new column.
address = {'Jai': 'Delhi', 'Princi': 'Bangalore',
           'Gaurav': 'Patna', 'Anuj': 'Chennai'}
 
# Convert the dictionary into DataFrame
df = pd.DataFrame(data)
 
# Map the Name column through the dictionary and
# provide 'Address' as the new column name
df['Address'] = df['Name'].map(address)
 
# Observe the output
print(df)

Output: 



Delete

Drop Function in Pandas

Pandas provides data analysts with a way to delete and filter data frames using the dataframe.drop() method. Rows or columns can be removed from a data frame using an index label or column name with this method.

Syntax:
DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors=’raise’)

Parameters:
labels: String or list of strings referring to row or column names.
axis: int or string value; 0/'index' for rows and 1/'columns' for columns.
index or columns: Single label or list. index or columns are an alternative to axis and cannot be used together with it.
level: Used to specify the level in case the data frame has a multi-level index.
inplace: Makes the changes in the original data frame if True.
errors: When errors='ignore', any value from the list that doesn't exist is ignored and the rest of the values are dropped.

Return type: DataFrame with the dropped values
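The index and columns keywords described above can replace the labels/axis combination; a minimal sketch on a small, made-up frame:

# importing pandas module
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3],
                   "b": [4, 5, 6],
                   "c": [7, 8, 9]})

# drop a row by its index label and a column by its
# name, without specifying axis
df = df.drop(index = [0], columns = ["c"])
print(df)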

Dropping Rows by index label

In this code, a list of index labels is passed and the rows corresponding to those labels are dropped using the .drop() method. To download the CSV used in the code, click here.

Python3

# importing pandas module
import pandas as pd
 
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col="Name")
print(data.head(5))

Output: Data Frame before Dropping values

                         Team  Number Position   Age Height  Weight            College     Salary
Name                                                                  
Avery Bradley  Boston Celtics     0.0       PG  25.0    6-2   180.0              Texas  7730337.0
Jae Crowder    Boston Celtics    99.0       SF  25.0    6-6   235.0          Marquette  6796117.0
John Holland   Boston Celtics    30.0       SG  27.0    6-5   205.0  Boston University        NaN
R.J. Hunter    Boston Celtics    28.0       SG  22.0    6-5   185.0      Georgia State  1148640.0
Jonas Jerebko  Boston Celtics     8.0       PF  29.0   6-10   231.0                NaN  5000000.0

Applying the drop function.

Python3

# dropping passed values
data.drop(["Avery Bradley", "John Holland", "R.J. Hunter"], inplace = True)
 
# display
print(data)

Output: Data Frame after Dropping values

Compared with the output before dropping, the new output doesn’t contain the passed labels. Those rows were dropped, and the changes were made in the original data frame since inplace was True.

                         Team  Number Position   Age Height  Weight   College     Salary
Name                                                                  
Jae Crowder    Boston Celtics    99.0       SF  25.0    6-6   235.0 Marquette  6796117.0
Jonas Jerebko  Boston Celtics     8.0       PF  29.0   6-10   231.0       NaN  5000000.0 
Amir Johnson   Boston Celtics    90.0       PF  29.0    6-9   240.0       NaN 12000000.0 
Jordan Mickey  Boston Celtics    55.0       PF  21.0    6-8   235.0       LSU  1170960.0 
Kelly Olynyk   Boston Celtics    41.0        C  25.0    7-0   238.0   Gonzaga  2165160.0  

Dropping columns with column name

In this code, the passed columns are dropped using their column names. The axis parameter is set to 1 since 1 refers to columns.

Python3

# importing pandas module
import pandas as pd
   
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name" )
print(data.head())

Output: Data Frame before Dropping Columns

                         Team  Number Position   Age Height  Weight            College     Salary
Name                                                                  
Avery Bradley  Boston Celtics     0.0       PG  25.0    6-2   180.0              Texas  7730337.0
Jae Crowder    Boston Celtics    99.0       SF  25.0    6-6   235.0          Marquette  6796117.0
John Holland   Boston Celtics    30.0       SG  27.0    6-5   205.0  Boston University        NaN
R.J. Hunter    Boston Celtics    28.0       SG  22.0    6-5   185.0      Georgia State  1148640.0
Jonas Jerebko  Boston Celtics     8.0       PF  29.0   6-10   231.0                NaN  5000000.0

Applying drop function.

Python3

# dropping passed columns
data.drop(["Team", "Weight"], axis = 1, inplace = True)
   
# display
print(data.head())

Output: Data Frame after Dropping Columns

As shown in the output, the new data frame doesn’t have the passed columns. Those columns were dropped since axis was set to 1, and the changes were made in the original data frame since inplace was True.

               Number Position   Age Height            College     Salary
Name                                                                     
Avery Bradley     0.0       PG  25.0    6-2              Texas  7730337.0
Jae Crowder      99.0       SF  25.0    6-6          Marquette  6796117.0
John Holland     30.0       SG  27.0    6-5  Boston University        NaN
R.J. Hunter      28.0       SG  22.0    6-5      Georgia State  1148640.0
Jonas Jerebko     8.0       PF  29.0   6-10                NaN  5000000.0


DataFrame.truncate

Pandas DataFrame is a two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. It can be thought of as a dict-like container for Series objects. This is the primary data structure of pandas.

Pandas DataFrame.truncate() function is used to truncate a Series or DataFrame before and after some index value. This is a useful shorthand for boolean indexing based on index values above or below certain thresholds.

Syntax: DataFrame.truncate(before=None, after=None, axis=None, copy=True)

Parameter :
before : Truncate all rows before this index value.
after : Truncate all rows after this index value.
axis : Axis to truncate. Truncates the index (rows) by default.
copy : Return a copy of the truncated section.

Returns : The truncated Series or DataFrame.
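The axis parameter also allows truncating along the columns instead of the rows (the axis being truncated must be sorted); a minimal sketch on a small, made-up frame:

# importing pandas as pd
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4],
                   "C": [5, 6], "D": [7, 8]})

# keep only the columns from 'B' to 'C' (inclusive)
print(df.truncate(before = "B", after = "C", axis = "columns"))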

Example #1: Use DataFrame.truncate() function to truncate some entries before and after the passed labels of the given dataframe.

# importing pandas as pd
import pandas as pd
  
# Creating the DataFrame
df = pd.DataFrame({'Weight':[45, 88, 56, 15, 71],
                   'Name':['Sam', 'Andrea', 'Alex', 'Robin', 'Kia'],
                   'Age':[14, 25, 55, 8, 21]})
  
# Create the index
index_ = pd.date_range('2010-10-09 08:45', periods = 5, freq ='H')
  
# Set the index
df.index = index_
  
# Print the DataFrame
print(df)

Output :

Now we will use DataFrame.truncate() function to truncate the entries before ‘2010-10-09 09:45:00’ and after ‘2010-10-09 11:45:00’ in the given dataframe.

# return the truncated dataframe
result = df.truncate(before = '2010-10-09 09:45:00', after = '2010-10-09 11:45:00')
  
# Print the result
print(result)

Output :

As we can see in the output, the DataFrame.truncate() function has successfully truncated the entries before and after the passed labels in the given dataframe.
 
Example #2: Use DataFrame.truncate() function to truncate some entries before and after the passed labels of the given dataframe.

# importing pandas as pd
import pandas as pd
  
# Creating the DataFrame
df = pd.DataFrame({"A":[12, 4, 5, None, 1], 
                   "B":[7, 2, 54, 3, None], 
                   "C":[20, 16, 11, 3, 8], 
                   "D":[14, 3, None, 2, 6]}) 
  
# Create the index
index_ = ['Row_1', 'Row_2', 'Row_3', 'Row_4', 'Row_5']
  
# Set the index
df.index = index_
  
# Print the DataFrame
print(df)

Output :

Now we will use DataFrame.truncate() function to truncate the entries before ‘Row_3’ and after ‘Row_4’ in the given dataframe.

# return the truncated dataframe
result = df.truncate(before = 'Row_3', after = 'Row_4')
  
# Print the result
print(result)

Output :

As we can see in the output, the DataFrame.truncate() function has successfully truncated the entries before and after the passed labels in the given dataframe.



Dataframe.sort_values()

Pandas sort_values() function sorts a data frame in ascending or descending order of the passed column. It is different from Python’s built-in sorted() function, which cannot sort a data frame by a particular column.
Let’s discuss Dataframe.sort_values() single-parameter sorting:
Syntax: 
 

DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind=’quicksort’, na_position=’last’)

 
Every parameter has some default values except the ‘by’ parameter.
Parameters: 
 

by: Single/List of column names to sort Data Frame by. 
axis: 0 or ‘index’ for rows and 1 or ‘columns’ for Column. 
ascending: Boolean value which sorts Data frame in ascending order if True. 
inplace: Boolean value. Makes the changes in passed data frame itself if True. 
kind: String naming the sorting algorithm to use: ‘quicksort’, ‘mergesort’ or ‘heapsort’. 
na_position: Takes two string input ‘last’ or ‘first’ to set position of Null values. Default is ‘last’.

Return Type: 
 

Returns a sorted Data Frame with the same dimensions as the caller's Data Frame.
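Since the by parameter accepts a list, a data frame can also be sorted on several columns at once, with a separate ascending flag for each one; a minimal sketch on a small, made-up frame:

# importing pandas package
import pandas as pd

df = pd.DataFrame({"Team": ["B", "A", "A"],
                   "Salary": [200, 300, 100]})

# sort by Team ascending, then by Salary descending within each team
print(df.sort_values(by = ["Team", "Salary"],
                     ascending = [True, False]))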

For link to CSV file Used in Code, click here.
Example #1: Sorting by Name 
In the following example, a data frame is made from the csv file and the data frame is sorted in ascending order of the Names of the players.
Before Sorting-
 

# importing pandas package
import pandas as pd
 
# making data frame from csv file
data = pd.read_csv("nba.csv")
 
# display
data

Output: 
 

After Sorting-
 

# importing pandas package
import pandas as pd
 
# making data frame from csv file
data = pd.read_csv("nba.csv")
 
# sorting data frame by name
data.sort_values("Name", axis = 0, ascending = True,
                 inplace = True, na_position ='last')
 
# display
data

As shown in the output, the index column is now jumbled since the data frame is sorted by Name.
Output: 
 

  
Example #2: Changing position of Null values
In the given data, there are many null values in different columns, which are placed at the end by default. In this example, the data frame is sorted with respect to the Salary column and the null values are kept at the top.
 

# importing pandas package
import pandas as pd
 
# making data frame from csv file
data = pd.read_csv("nba.csv")
 
# sorting data frame by name
data.sort_values("Salary", axis = 0, ascending = True,
                 inplace = True, na_position ='first')
 
# display
data

As shown in the output, the NaN values are at the top, followed by the sorted values of Salary.
Output: 
 

 


