- Comparison with other tools
- Comparison with R / R libraries
- Comparison with SQL
- Comparison with SAS
- Comparison with Stata
Comparison with other tools
Comparison with R / R libraries
Since pandas aims to provide a lot of the data manipulation and analysis functionality that people use R for, this page provides a more detailed look at the R language and its many third-party libraries as they relate to pandas. In comparisons with R and CRAN libraries, we care about the following things:
- Functionality / flexibility: what can/cannot be done with each tool
- Performance: how fast are operations. Hard numbers/benchmarks are preferable
- Ease-of-use: Is one tool easier/harder to use (you may have to be the judge of this, given side-by-side code comparisons)
This page is also here to offer a bit of a translation guide for users of these R packages.
For transfer of DataFrame objects from pandas to R, one option is to use HDF5 files, see External Compatibility for an example.
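As a sketch of that HDF5 round trip on the pandas side (the file name `transfer.h5` is an arbitrary choice here, and writing HDF5 requires the optional PyTables dependency):

```python
import pandas as pd

# write a small DataFrame to an HDF5 file, then read it back
# (to_hdf/read_hdf require the optional PyTables package)
df = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})
df.to_hdf("transfer.h5", key="df", mode="w")
roundtrip = pd.read_hdf("transfer.h5", "df")
```

On the R side, the same file can then be opened with an HDF5 reader such as the rhdf5 package.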
Quick Reference
We’ll start off with a quick reference guide pairing some common R operations using dplyr with pandas equivalents.
Querying, Filtering, Sampling
| R | pandas |
|---|---|
| dim(df) | df.shape |
| head(df) | df.head() |
| slice(df, 1:10) | df.iloc[:10] |
| filter(df, col1 == 1, col2 == 1) | df.query('col1 == 1 & col2 == 1') |
| df[df$col1 == 1 & df$col2 == 1,] | df[(df.col1 == 1) & (df.col2 == 1)] |
| select(df, col1, col2) | df[['col1', 'col2']] |
| select(df, col1:col3) | df.loc[:, 'col1':'col3'] |
| select(df, -(col1:col3)) | df.drop(cols_to_drop, axis=1) but see [1] |
| distinct(select(df, col1)) | df[['col1']].drop_duplicates() |
| distinct(select(df, col1, col2)) | df[['col1', 'col2']].drop_duplicates() |
| sample_n(df, 10) | df.sample(n=10) |
| sample_frac(df, 0.01) | df.sample(frac=0.01) |
::: tip
[1] | R’s shorthand for a subrange of columns (select(df, col1:col3)) can be approached cleanly in pandas, if you have the list of columns, for example df[cols[1:3]] or df.drop(cols[1:3]), but doing this by column name is a bit messy.
:::
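Several of the pairings in the tables on this page can be checked side by side; here is a small illustrative sketch using a made-up three-column frame:

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 1, 2], "col2": [1, 0, 1], "col3": [9, 8, 7]})

# filter(df, col1 == 1, col2 == 1): the two pandas spellings agree
queried = df.query("col1 == 1 & col2 == 1")
masked = df[(df.col1 == 1) & (df.col2 == 1)]
assert queried.equals(masked)

# distinct(select(df, col1)): drop_duplicates on the selected column
assert df[["col1"]].drop_duplicates()["col1"].tolist() == [1, 2]

# select(df, col1:col3): label-based slicing with .loc includes both endpoints
assert list(df.loc[:, "col1":"col3"].columns) == ["col1", "col2", "col3"]
```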
Sorting
| R | pandas |
|---|---|
| arrange(df, col1, col2) | df.sort_values(['col1', 'col2']) |
| arrange(df, desc(col1)) | df.sort_values('col1', ascending=False) |
Transforming
| R | pandas |
|---|---|
| select(df, col_one = col1) | df.rename(columns={'col1': 'col_one'})['col_one'] |
| rename(df, col_one = col1) | df.rename(columns={'col1': 'col_one'}) |
| mutate(df, c=a-b) | df.assign(c=df.a-df.b) |
Grouping and Summarizing
| R | pandas |
|---|---|
| summary(df) | df.describe() |
| gdf <- group_by(df, col1) | gdf = df.groupby('col1') |
| summarise(gdf, avg=mean(col1, na.rm=TRUE)) | gdf.agg({'col1': 'mean'}) |
| summarise(gdf, total=sum(col1)) | gdf.sum() |
Base R
Slicing with R’s c
R makes it easy to access data.frame columns by name
```r
df <- data.frame(a=rnorm(5), b=rnorm(5), c=rnorm(5), d=rnorm(5), e=rnorm(5))
df[, c("a", "c", "e")]
```
or by integer location
```r
df <- data.frame(matrix(rnorm(1000), ncol=100))
df[, c(1:10, 25:30, 40, 50:100)]
```
Selecting multiple columns by name in pandas is straightforward
```python
In [1]: df = pd.DataFrame(np.random.randn(10, 3), columns=list('abc'))

In [2]: df[['a', 'c']]
Out[2]:
          a         c
0  0.469112 -1.509059
1 -1.135632 -0.173215
2  0.119209 -0.861849
3 -2.104569  1.071804
4  0.721555 -1.039575
5  0.271860  0.567020
6  0.276232 -0.673690
7  0.113648  0.524988
8  0.404705 -1.715002
9 -1.039268 -1.157892

In [3]: df.loc[:, ['a', 'c']]
Out[3]:
          a         c
0  0.469112 -1.509059
1 -1.135632 -0.173215
2  0.119209 -0.861849
3 -2.104569  1.071804
4  0.721555 -1.039575
5  0.271860  0.567020
6  0.276232 -0.673690
7  0.113648  0.524988
8  0.404705 -1.715002
9 -1.039268 -1.157892
```
Selecting multiple noncontiguous columns by integer location can be achieved with a combination of the iloc indexer attribute and numpy.r_.
```python
In [4]: named = list('abcdefg')

In [5]: n = 30

In [6]: columns = named + np.arange(len(named), n).tolist()

In [7]: df = pd.DataFrame(np.random.randn(n, n), columns=columns)

In [8]: df.iloc[:, np.r_[:10, 24:30]]
Out[8]:
            a         b         c         d         e         f         g         7         8         9        24        25        26        27        28        29
0   -1.344312  0.844885  1.075770 -0.109050  1.643563 -1.469388  0.357021 -0.674600 -1.776904 -0.968914 -1.170299 -0.226169  0.410835  0.813850  0.132003 -0.827317
1   -0.076467 -1.187678  1.130127 -1.436737 -1.413681  1.607920  1.024180  0.569605  0.875906 -2.211372  0.959726 -1.110336 -0.619976  0.149748 -0.732339  0.687738
2    0.176444  0.403310 -0.154951  0.301624 -2.179861 -1.369849 -0.954208  1.462696 -1.743161 -0.826591  0.084844  0.432390  1.519970 -0.493662  0.600178  0.274230
3    0.132885 -0.023688  2.410179  1.450520  0.206053 -0.251905 -2.213588  1.063327  1.266143  0.299368 -2.484478 -0.281461  0.030711  0.109121  1.126203 -0.977349
4    1.474071 -0.064034 -1.282782  0.781836 -1.071357  0.441153  2.353925  0.583787  0.221471 -0.744471 -1.197071 -1.066969 -0.303421 -0.858447  0.306996 -0.028665
5    0.384316  1.574159  1.588931  0.476720  0.473424 -0.242861 -0.014805 -0.284319  0.650776 -1.461665 -0.902937  0.068159 -0.057873 -0.368204 -1.144073  0.861209
6    0.800193  0.782098 -1.069094 -1.099248  0.255269  0.009750  0.661084  0.379319 -0.008434  1.952541  0.604603  2.121453  0.597701  0.563700  0.967661 -1.057909
..        ...       ...       ...       ...       ...       ...       ...       ...       ...       ...       ...       ...       ...       ...       ...       ...
23   1.534417 -1.374226 -0.367477  0.782551  1.356489  0.981552  0.304501  0.354041 -1.232756 -0.267074  0.641606 -1.690959  0.961088  0.052372  1.166439  0.407281
24   0.859275 -0.995910  0.261263  1.783442  0.380989  2.289726  0.309489  2.189028  1.389045 -0.873585 -0.169076  0.840316  0.638172  0.890673 -1.949397 -0.003437
25   1.492125 -0.068190  0.681456  1.221829 -0.434352  1.204815 -0.195612  1.251683 -1.040389 -0.796211  1.944517  0.042344 -0.307904  0.428572  0.880609  0.487645
26   0.725238  0.624607 -0.141185 -0.143948 -0.328162  2.095086 -0.608888 -0.926422  1.872601 -2.513465 -0.846188  1.190624  0.778507  1.008500  1.424017  0.717110
27   1.262419  1.950057  0.301038 -0.933858  0.814946  0.181439 -0.110015 -2.364638 -1.584814  0.307941 -1.341814  0.334281 -0.162227  1.007824  2.826008  1.458383
28  -1.585746 -0.899734  0.921494 -0.211762 -0.059182  0.058308  0.915377 -0.696321  0.150664 -3.060395  0.403620 -0.026602 -0.240481  0.577223 -1.088417  0.326687
29  -0.986248  0.169729 -1.158091  1.019673  0.646039  0.917399 -0.010435  0.366366  0.922729  0.869610 -1.209247 -0.671466  0.332872 -2.013086 -1.602549  0.333109

[30 rows x 16 columns]
```
aggregate
In R you may want to split data into subsets and compute the mean for each. Using a data.frame called df and splitting it into groups by the columns by1 and by2:

```r
df <- data.frame(
  v1 = c(1,3,5,7,8,3,5,NA,4,5,7,9),
  v2 = c(11,33,55,77,88,33,55,NA,44,55,77,99),
  by1 = c("red", "blue", 1, 2, NA, "big", 1, 2, "red", 1, NA, 12),
  by2 = c("wet", "dry", 99, 95, NA, "damp", 95, 99, "red", 99, NA, NA))
aggregate(x=df[, c("v1", "v2")], by=list(df$by1, df$by2), FUN = mean)
```
The groupby() method is similar to the base R aggregate function.
```python
In [9]: df = pd.DataFrame(
   ...:     {'v1': [1, 3, 5, 7, 8, 3, 5, np.nan, 4, 5, 7, 9],
   ...:      'v2': [11, 33, 55, 77, 88, 33, 55, np.nan, 44, 55, 77, 99],
   ...:      'by1': ["red", "blue", 1, 2, np.nan, "big", 1, 2, "red", 1, np.nan, 12],
   ...:      'by2': ["wet", "dry", 99, 95, np.nan, "damp", 95, 99, "red", 99, np.nan,
   ...:              np.nan]})
   ...:

In [10]: g = df.groupby(['by1', 'by2'])

In [11]: g[['v1', 'v2']].mean()
Out[11]:
            v1    v2
by1  by2
1    95    5.0  55.0
     99    5.0  55.0
2    95    7.0  77.0
     99    NaN   NaN
big  damp  3.0  33.0
blue dry   3.0  33.0
red  red   4.0  44.0
     wet   1.0  11.0
```
For more details and examples see the groupby documentation.
match / %in%
A common way to select data in R is using %in%, which is defined using the function match. The operator %in% returns a logical vector indicating whether there is a match:
```r
s <- 0:4
s %in% c(2,4)
```
The isin() method is similar to the R %in% operator:
```python
In [12]: s = pd.Series(np.arange(5), dtype=np.float32)

In [13]: s.isin([2, 4])
Out[13]:
0    False
1    False
2     True
3    False
4     True
dtype: bool
```
The match function returns a vector of the positions of matches of its first argument in its second:
```r
s <- 0:4
match(s, c(2,4))
```
For more details and examples see the reshaping documentation.
tapply
tapply is similar to aggregate, but data can be in a ragged array, since the subclass sizes are possibly irregular. Using a data.frame called baseball and retrieving information based on the team column:

```r
baseball <- data.frame(
  team = gl(5, 5, labels = paste("Team", LETTERS[1:5])),
  player = sample(letters, 25),
  batting.average = runif(25, .200, .400))

tapply(baseball$batting.average, baseball$team, max)
```
In pandas we may use the pivot_table() method to handle this:
```python
In [14]: import random

In [15]: import string

In [16]: baseball = pd.DataFrame(
   ....:     {'team': ["team %d" % (x + 1) for x in range(5)] * 5,
   ....:      'player': random.sample(list(string.ascii_lowercase), 25),
   ....:      'batting avg': np.random.uniform(.200, .400, 25)})
   ....:

In [17]: baseball.pivot_table(values='batting avg', columns='team', aggfunc=np.max)
Out[17]:
team           team 1    team 2    team 3    team 4    team 5
batting avg  0.352134  0.295327  0.397191  0.394457  0.396194
```
For more details and examples see the reshaping documentation.
subset
The query() method is similar to the base R subset function. In R you might want to get the rows of a data.frame where one column’s values are less than another column’s values:
```r
df <- data.frame(a=rnorm(10), b=rnorm(10))
subset(df, a <= b)
df[df$a <= df$b,]  # note the comma
```
In pandas, there are a few ways to perform subsetting. You can use query() or pass an expression as if it were an index/slice as well as standard boolean indexing:
```python
In [18]: df = pd.DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})

In [19]: df.query('a <= b')
Out[19]:
          a         b
1  0.174950  0.552887
2 -0.023167  0.148084
3 -0.495291 -0.300218
4 -0.860736  0.197378
5 -1.134146  1.720780
7 -0.290098  0.083515
8  0.238636  0.946550

In [20]: df[df.a <= df.b]
Out[20]:
          a         b
1  0.174950  0.552887
2 -0.023167  0.148084
3 -0.495291 -0.300218
4 -0.860736  0.197378
5 -1.134146  1.720780
7 -0.290098  0.083515
8  0.238636  0.946550

In [21]: df.loc[df.a <= df.b]
Out[21]:
          a         b
1  0.174950  0.552887
2 -0.023167  0.148084
3 -0.495291 -0.300218
4 -0.860736  0.197378
5 -1.134146  1.720780
7 -0.290098  0.083515
8  0.238636  0.946550
```
For more details and examples see the query documentation.
with
An expression using a data.frame called df in R with the columns a and b would be evaluated using with like so:
```r
df <- data.frame(a=rnorm(10), b=rnorm(10))
with(df, a + b)
df$a + df$b  # same as the previous expression
```
In pandas the equivalent expression, using the eval() method, would be:
```python
In [22]: df = pd.DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})

In [23]: df.eval('a + b')
Out[23]:
0   -0.091430
1   -2.483890
2   -0.252728
3   -0.626444
4   -0.261740
5    2.149503
6   -0.332214
7    0.799331
8   -2.377245
9    2.104677
dtype: float64

In [24]: df.a + df.b  # same as the previous expression
Out[24]:
0   -0.091430
1   -2.483890
2   -0.252728
3   -0.626444
4   -0.261740
5    2.149503
6   -0.332214
7    0.799331
8   -2.377245
9    2.104677
dtype: float64
```
In certain cases eval() will be much faster than evaluation in pure Python. For more details and examples see the eval documentation.
plyr
plyr is an R library for the split-apply-combine strategy for data analysis. The functions revolve around three data structures in R: a for arrays, l for lists, and d for data.frames. The table below shows how these data structures could be mapped in Python.
| R | Python |
|---|---|
| array | list |
| list | dictionary or list of objects |
| data.frame | DataFrame |
ddply
An expression using a data.frame called df in R where you want to summarize x by month:
```r
require(plyr)
df <- data.frame(
  x = runif(120, 1, 168),
  y = runif(120, 7, 334),
  z = runif(120, 1.7, 20.7),
  month = rep(c(5,6,7,8),30),
  week = sample(1:4, 120, TRUE))

ddply(df, .(month, week), summarize,
      mean = round(mean(x), 2),
      sd = round(sd(x), 2))
```
In pandas the equivalent expression, using the groupby() method, would be:
```python
In [25]: df = pd.DataFrame({'x': np.random.uniform(1., 168., 120),
   ....:                    'y': np.random.uniform(7., 334., 120),
   ....:                    'z': np.random.uniform(1.7, 20.7, 120),
   ....:                    'month': [5, 6, 7, 8] * 30,
   ....:                    'week': np.random.randint(1, 4, 120)})
   ....:

In [26]: grouped = df.groupby(['month', 'week'])

In [27]: grouped['x'].agg([np.mean, np.std])
Out[27]:
                  mean        std
month week
5     1      63.653367  40.601965
      2      78.126605  53.342400
      3      92.091886  57.630110
6     1      81.747070  54.339218
      2      70.971205  54.687287
      3     100.968344  54.010081
7     1      61.576332  38.844274
      2      61.733510  48.209013
      3      71.688795  37.595638
8     1      62.741922  34.618153
      2      91.774627  49.790202
      3      73.936856  60.773900
```
For more details and examples see the groupby documentation.
reshape / reshape2
melt.array
An expression using a 3-dimensional array called a in R where you want to melt it into a data.frame:
```r
a <- array(c(1:23, NA), c(2,3,4))
data.frame(melt(a))
```
In Python, since a is a NumPy array, you can use np.ndenumerate together with a list comprehension.
```python
In [28]: a = np.array(list(range(1, 24)) + [np.NAN]).reshape(2, 3, 4)

In [29]: pd.DataFrame([tuple(list(x) + [val]) for x, val in np.ndenumerate(a)])
Out[29]:
    0  1  2     3
0   0  0  0   1.0
1   0  0  1   2.0
2   0  0  2   3.0
3   0  0  3   4.0
4   0  1  0   5.0
5   0  1  1   6.0
6   0  1  2   7.0
..  .. .. ..  ...
17  1  1  1  18.0
18  1  1  2  19.0
19  1  1  3  20.0
20  1  2  0  21.0
21  1  2  1  22.0
22  1  2  2  23.0
23  1  2  3   NaN

[24 rows x 4 columns]
```
melt.list
An expression using a list called a in R where you want to melt it into a data.frame:
```r
a <- as.list(c(1:4, NA))
data.frame(melt(a))
```
In Python, this list would be a list of tuples, so the DataFrame() constructor would convert it to a DataFrame as required.
```python
In [30]: a = list(enumerate(list(range(1, 5)) + [np.NAN]))

In [31]: pd.DataFrame(a)
Out[31]:
   0    1
0  0  1.0
1  1  2.0
2  2  3.0
3  3  4.0
4  4  NaN
```
For more details and examples see the Intro to Data Structures documentation.
melt.data.frame
An expression using a data.frame called cheese in R where you want to reshape the data.frame:
```r
cheese <- data.frame(
  first = c('John', 'Mary'),
  last = c('Doe', 'Bo'),
  height = c(5.5, 6.0),
  weight = c(130, 150))

melt(cheese, id=c("first", "last"))
```
In Python, the melt() function is the equivalent:
```python
In [32]: cheese = pd.DataFrame({'first': ['John', 'Mary'],
   ....:                        'last': ['Doe', 'Bo'],
   ....:                        'height': [5.5, 6.0],
   ....:                        'weight': [130, 150]})
   ....:

In [33]: pd.melt(cheese, id_vars=['first', 'last'])
Out[33]:
  first last variable  value
0  John  Doe   height    5.5
1  Mary   Bo   height    6.0
2  John  Doe   weight  130.0
3  Mary   Bo   weight  150.0

In [34]: cheese.set_index(['first', 'last']).stack()  # alternative way
Out[34]:
first  last
John   Doe   height      5.5
             weight    130.0
Mary   Bo    height      6.0
             weight    150.0
dtype: float64
```
For more details and examples see the reshaping documentation.
cast
In R, acast is used on a melted data.frame called df to cast it into a higher-dimensional array:
```r
df <- data.frame(
  x = runif(12, 1, 168),
  y = runif(12, 7, 334),
  z = runif(12, 1.7, 20.7),
  month = rep(c(5,6,7),4),
  week = rep(c(1,2), 6))

mdf <- melt(df, id=c("month", "week"))
acast(mdf, week ~ month ~ variable, mean)
```
In Python the best way is to make use of pivot_table():
```python
In [35]: df = pd.DataFrame({'x': np.random.uniform(1., 168., 12),
   ....:                    'y': np.random.uniform(7., 334., 12),
   ....:                    'z': np.random.uniform(1.7, 20.7, 12),
   ....:                    'month': [5, 6, 7] * 4,
   ....:                    'week': [1, 2] * 6})
   ....:

In [36]: mdf = pd.melt(df, id_vars=['month', 'week'])

In [37]: pd.pivot_table(mdf, values='value', index=['variable', 'week'],
   ....:                columns=['month'], aggfunc=np.mean)
   ....:
Out[37]:
month                  5           6           7
variable week
x        1     93.888747   98.762034   55.219673
         2     94.391427   38.112932   83.942781
y        1     94.306912  279.454811  227.840449
         2     87.392662  193.028166  173.899260
z        1     11.016009   10.079307   16.170549
         2      8.476111   17.638509   19.003494
```
Similarly for dcast, which uses a data.frame called df to aggregate information based on Animal and FeedType:
```r
df <- data.frame(
  Animal = c('Animal1', 'Animal2', 'Animal3', 'Animal2', 'Animal1',
             'Animal2', 'Animal3'),
  FeedType = c('A', 'B', 'A', 'A', 'B', 'B', 'A'),
  Amount = c(10, 7, 4, 2, 5, 6, 2))

dcast(df, Animal ~ FeedType, sum, fill=NaN)
# Alternative method using base R
with(df, tapply(Amount, list(Animal, FeedType), sum))
```
Python can approach this in two different ways. Firstly, similar to above using pivot_table():
```python
In [38]: df = pd.DataFrame({
   ....:     'Animal': ['Animal1', 'Animal2', 'Animal3', 'Animal2', 'Animal1',
   ....:                'Animal2', 'Animal3'],
   ....:     'FeedType': ['A', 'B', 'A', 'A', 'B', 'B', 'A'],
   ....:     'Amount': [10, 7, 4, 2, 5, 6, 2],
   ....: })
   ....:

In [39]: df.pivot_table(values='Amount', index='Animal', columns='FeedType',
   ....:                aggfunc='sum')
   ....:
Out[39]:
FeedType     A     B
Animal
Animal1   10.0   5.0
Animal2    2.0  13.0
Animal3    6.0   NaN
```
The second approach is to use the groupby() method:
```python
In [40]: df.groupby(['Animal', 'FeedType'])['Amount'].sum()
Out[40]:
Animal   FeedType
Animal1  A           10
         B            5
Animal2  A            2
         B           13
Animal3  A            6
Name: Amount, dtype: int64
```
For more details and examples see the reshaping documentation or the groupby documentation.
factor
pandas has a data type for categorical data.
```r
cut(c(1,2,3,4,5,6), 3)
factor(c(1,2,3,2,2,3))
```
In pandas this is accomplished with pd.cut and astype("category"):
```python
In [41]: pd.cut(pd.Series([1, 2, 3, 4, 5, 6]), 3)
Out[41]:
0    (0.995, 2.667]
1    (0.995, 2.667]
2    (2.667, 4.333]
3    (2.667, 4.333]
4      (4.333, 6.0]
5      (4.333, 6.0]
dtype: category
Categories (3, interval[float64]): [(0.995, 2.667] < (2.667, 4.333] < (4.333, 6.0]]

In [42]: pd.Series([1, 2, 3, 2, 2, 3]).astype("category")
Out[42]:
0    1
1    2
2    3
3    2
4    2
5    3
dtype: category
Categories (3, int64): [1, 2, 3]
```
For more details and examples see the categorical introduction and the API documentation. There is also documentation on the differences relative to R's factor.
Comparison with SQL
Since many potential pandas users have some familiarity with SQL, this page is meant to provide some examples of how various SQL operations would be performed using pandas.
If you’re new to pandas, you might want to first read through 10 Minutes to pandas to familiarize yourself with the library.
As is customary, we import pandas and NumPy as follows:
```python
In [1]: import pandas as pd

In [2]: import numpy as np
```
Most of the examples will utilize the tips dataset found within pandas tests. We’ll read the data into a DataFrame called tips and assume we have a database table of the same name and structure.
```python
In [3]: url = ('https://raw.github.com/pandas-dev'
   ...:        '/pandas/master/pandas/tests/data/tips.csv')
   ...:

In [4]: tips = pd.read_csv(url)

In [5]: tips.head()
Out[5]:
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
```
SELECT
In SQL, selection is done using a comma-separated list of columns you’d like to select (or a * to select all columns):
```sql
SELECT total_bill, tip, smoker, time
FROM tips
LIMIT 5;
```
With pandas, column selection is done by passing a list of column names to your DataFrame:
```python
In [6]: tips[['total_bill', 'tip', 'smoker', 'time']].head(5)
Out[6]:
   total_bill   tip smoker    time
0       16.99  1.01     No  Dinner
1       10.34  1.66     No  Dinner
2       21.01  3.50     No  Dinner
3       23.68  3.31     No  Dinner
4       24.59  3.61     No  Dinner
```
Calling the DataFrame without the list of column names would display all columns (akin to SQL’s *).
WHERE
Filtering in SQL is done via a WHERE clause.
```sql
SELECT *
FROM tips
WHERE time = 'Dinner'
LIMIT 5;
```
DataFrames can be filtered in multiple ways; the most intuitive of which is using boolean indexing.
```python
In [7]: tips[tips['time'] == 'Dinner'].head(5)
Out[7]:
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
```
The above statement is simply passing a Series of True/False objects to the DataFrame, returning all rows with True.
```python
In [8]: is_dinner = tips['time'] == 'Dinner'

In [9]: is_dinner.value_counts()
Out[9]:
True     176
False     68
Name: time, dtype: int64

In [10]: tips[is_dinner].head(5)
Out[10]:
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
```
Just like SQL’s OR and AND, multiple conditions can be passed to a DataFrame using | (OR) and & (AND).
```sql
-- tips of more than $5.00 at Dinner meals
SELECT *
FROM tips
WHERE time = 'Dinner' AND tip > 5.00;
```
```python
# tips of more than $5.00 at Dinner meals
In [11]: tips[(tips['time'] == 'Dinner') & (tips['tip'] > 5.00)]
Out[11]:
     total_bill    tip     sex smoker  day    time  size
23        39.42   7.58    Male     No  Sat  Dinner     4
44        30.40   5.60    Male     No  Sun  Dinner     4
47        32.40   6.00    Male     No  Sun  Dinner     4
52        34.81   5.20  Female     No  Sun  Dinner     4
59        48.27   6.73    Male     No  Sat  Dinner     4
116       29.93   5.07    Male     No  Sun  Dinner     4
155       29.85   5.14  Female     No  Sun  Dinner     5
170       50.81  10.00    Male    Yes  Sat  Dinner     3
172        7.25   5.15    Male    Yes  Sun  Dinner     2
181       23.33   5.65    Male    Yes  Sun  Dinner     2
183       23.17   6.50    Male    Yes  Sun  Dinner     4
211       25.89   5.16    Male    Yes  Sat  Dinner     4
212       48.33   9.00    Male     No  Sat  Dinner     4
214       28.17   6.50  Female    Yes  Sat  Dinner     3
239       29.03   5.92    Male     No  Sat  Dinner     3
```
```sql
-- tips by parties of at least 5 diners OR bill total was more than $45
SELECT *
FROM tips
WHERE size >= 5 OR total_bill > 45;
```
```python
# tips by parties of at least 5 diners OR bill total was more than $45
In [12]: tips[(tips['size'] >= 5) | (tips['total_bill'] > 45)]
Out[12]:
     total_bill    tip     sex smoker   day    time  size
59        48.27   6.73    Male     No   Sat  Dinner     4
125       29.80   4.20  Female     No  Thur   Lunch     6
141       34.30   6.70    Male     No  Thur   Lunch     6
142       41.19   5.00    Male     No  Thur   Lunch     5
143       27.05   5.00  Female     No  Thur   Lunch     6
155       29.85   5.14  Female     No   Sun  Dinner     5
156       48.17   5.00    Male     No   Sun  Dinner     6
170       50.81  10.00    Male    Yes   Sat  Dinner     3
182       45.35   3.50    Male    Yes   Sun  Dinner     3
185       20.69   5.00    Male     No   Sun  Dinner     5
187       30.46   2.00    Male    Yes   Sun  Dinner     5
212       48.33   9.00    Male     No   Sat  Dinner     4
216       28.15   3.00    Male    Yes   Sat  Dinner     5
```
NULL checking is done using the notna() and isna() methods.
```python
In [13]: frame = pd.DataFrame({'col1': ['A', 'B', np.NaN, 'C', 'D'],
   ....:                       'col2': ['F', np.NaN, 'G', 'H', 'I']})
   ....:

In [14]: frame
Out[14]:
  col1 col2
0    A    F
1    B  NaN
2  NaN    G
3    C    H
4    D    I
```
Assume we have a table of the same structure as our DataFrame above. We can see only the records where col2 IS NULL with the following query:
```sql
SELECT *
FROM frame
WHERE col2 IS NULL;
```
```python
In [15]: frame[frame['col2'].isna()]
Out[15]:
  col1 col2
1    B  NaN
```
Getting items where col1 IS NOT NULL can be done with notna().
```sql
SELECT *
FROM frame
WHERE col1 IS NOT NULL;
```
```python
In [16]: frame[frame['col1'].notna()]
Out[16]:
  col1 col2
0    A    F
1    B  NaN
3    C    H
4    D    I
```
GROUP BY
In pandas, SQL's GROUP BY operations are performed using the similarly named groupby() method. groupby() typically refers to a process where we'd like to split a dataset into groups, apply some function (typically aggregation), and then combine the groups together.
A common SQL operation would be getting the count of records in each group throughout a dataset. For instance, a query getting us the number of tips left by sex:
```sql
SELECT sex, count(*)
FROM tips
GROUP BY sex;
/*
Female     87
Male      157
*/
```
The pandas equivalent would be:
```python
In [17]: tips.groupby('sex').size()
Out[17]:
sex
Female     87
Male      157
dtype: int64
```
Notice that in the pandas code we used size() and not count(). This is because count() applies the function to each column, returning the number of not null records within each.
```python
In [18]: tips.groupby('sex').count()
Out[18]:
        total_bill  tip  smoker  day  time  size
sex
Female          87   87      87   87    87    87
Male           157  157     157  157   157   157
```
Alternatively, we could have applied the count() method to an individual column:
```python
In [19]: tips.groupby('sex')['total_bill'].count()
Out[19]:
sex
Female     87
Male      157
Name: total_bill, dtype: int64
```
Multiple functions can also be applied at once. For instance, say we’d like to see how tip amount differs by day of the week - agg() allows you to pass a dictionary to your grouped DataFrame, indicating which functions to apply to specific columns.
```sql
SELECT day, AVG(tip), COUNT(*)
FROM tips
GROUP BY day;
/*
Fri   2.734737   19
Sat   2.993103   87
Sun   3.255132   76
Thur  2.771452   62
*/
```
```python
In [20]: tips.groupby('day').agg({'tip': np.mean, 'day': np.size})
Out[20]:
           tip  day
day
Fri   2.734737   19
Sat   2.993103   87
Sun   3.255132   76
Thur  2.771452   62
```
Grouping by more than one column is done by passing a list of columns to the groupby() method.
```sql
SELECT smoker, day, COUNT(*), AVG(tip)
FROM tips
GROUP BY smoker, day;
/*
smoker day
No     Fri      4  2.812500
       Sat     45  3.102889
       Sun     57  3.167895
       Thur    45  2.673778
Yes    Fri     15  2.714000
       Sat     42  2.875476
       Sun     19  3.516842
       Thur    17  3.030000
*/
```
```python
In [21]: tips.groupby(['smoker', 'day']).agg({'tip': [np.size, np.mean]})
Out[21]:
              tip
             size      mean
smoker day
No     Fri    4.0  2.812500
       Sat   45.0  3.102889
       Sun   57.0  3.167895
       Thur  45.0  2.673778
Yes    Fri   15.0  2.714000
       Sat   42.0  2.875476
       Sun   19.0  3.516842
       Thur  17.0  3.030000
```
JOIN
JOINs can be performed with join() or merge(). By default, join() will join the DataFrames on their indices. Each method has parameters allowing you to specify the type of join to perform (LEFT, RIGHT, INNER, FULL) or the columns to join on (column names or indices).
```python
In [22]: df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
   ....:                     'value': np.random.randn(4)})
   ....:

In [23]: df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'],
   ....:                     'value': np.random.randn(4)})
   ....:
```
Assume we have two database tables of the same name and structure as our DataFrames.
Now let’s go over the various types of JOINs.
INNER JOIN
```sql
SELECT *
FROM df1
INNER JOIN df2
  ON df1.key = df2.key;
```
```python
# merge performs an INNER JOIN by default
In [24]: pd.merge(df1, df2, on='key')
Out[24]:
  key   value_x   value_y
0   B -0.282863  1.212112
1   D -1.135632 -0.173215
2   D -1.135632  0.119209
```
merge() also offers parameters for cases when you’d like to join one DataFrame’s column with another DataFrame’s index.
```python
In [25]: indexed_df2 = df2.set_index('key')

In [26]: pd.merge(df1, indexed_df2, left_on='key', right_index=True)
Out[26]:
  key   value_x   value_y
1   B -0.282863  1.212112
3   D -1.135632 -0.173215
3   D -1.135632  0.119209
```
LEFT OUTER JOIN
```sql
-- show all records from df1
SELECT *
FROM df1
LEFT OUTER JOIN df2
  ON df1.key = df2.key;
```
```python
# show all records from df1
In [27]: pd.merge(df1, df2, on='key', how='left')
Out[27]:
  key   value_x   value_y
0   A  0.469112       NaN
1   B -0.282863  1.212112
2   C -1.509059       NaN
3   D -1.135632 -0.173215
4   D -1.135632  0.119209
```
RIGHT JOIN
```sql
-- show all records from df2
SELECT *
FROM df1
RIGHT OUTER JOIN df2
  ON df1.key = df2.key;
```
```python
# show all records from df2
In [28]: pd.merge(df1, df2, on='key', how='right')
Out[28]:
  key   value_x   value_y
0   B -0.282863  1.212112
1   D -1.135632 -0.173215
2   D -1.135632  0.119209
3   E       NaN -1.044236
```
FULL JOIN
pandas also allows for FULL JOINs, which display both sides of the dataset, whether or not the joined columns find a match. As of this writing, FULL JOINs are not supported in all RDBMS (e.g. MySQL).
```sql
-- show all records from both tables
SELECT *
FROM df1
FULL OUTER JOIN df2
  ON df1.key = df2.key;
```
```python
# show all records from both frames
In [29]: pd.merge(df1, df2, on='key', how='outer')
Out[29]:
  key   value_x   value_y
0   A  0.469112       NaN
1   B -0.282863  1.212112
2   C -1.509059       NaN
3   D -1.135632 -0.173215
4   D -1.135632  0.119209
5   E       NaN -1.044236
```
UNION
UNION ALL can be performed using concat().
```python
In [30]: df1 = pd.DataFrame({'city': ['Chicago', 'San Francisco', 'New York City'],
   ....:                     'rank': range(1, 4)})
   ....:

In [31]: df2 = pd.DataFrame({'city': ['Chicago', 'Boston', 'Los Angeles'],
   ....:                     'rank': [1, 4, 5]})
   ....:
```
```sql
SELECT city, rank
FROM df1
UNION ALL
SELECT city, rank
FROM df2;
/*
         city  rank
      Chicago     1
San Francisco     2
New York City     3
      Chicago     1
       Boston     4
  Los Angeles     5
*/
```
```python
In [32]: pd.concat([df1, df2])
Out[32]:
            city  rank
0        Chicago     1
1  San Francisco     2
2  New York City     3
0        Chicago     1
1         Boston     4
2    Los Angeles     5
```
SQL’s UNION is similar to UNION ALL, however UNION will remove duplicate rows.
```sql
SELECT city, rank
FROM df1
UNION
SELECT city, rank
FROM df2;
-- notice that there is only one Chicago record this time
/*
         city  rank
      Chicago     1
San Francisco     2
New York City     3
       Boston     4
  Los Angeles     5
*/
```
In pandas, you can use concat() in conjunction with drop_duplicates().
```python
In [33]: pd.concat([df1, df2]).drop_duplicates()
Out[33]:
            city  rank
0        Chicago     1
1  San Francisco     2
2  New York City     3
1         Boston     4
2    Los Angeles     5
```
pandas equivalents for some SQL analytic and aggregate functions
Top N rows with offset
```sql
-- MySQL
SELECT * FROM tips
ORDER BY tip DESC
LIMIT 10 OFFSET 5;
```
```python
In [34]: tips.nlargest(10 + 5, columns='tip').tail(10)
Out[34]:
     total_bill    tip     sex smoker   day    time  size
183       23.17   6.50    Male    Yes   Sun  Dinner     4
214       28.17   6.50  Female    Yes   Sat  Dinner     3
47        32.40   6.00    Male     No   Sun  Dinner     4
239       29.03   5.92    Male     No   Sat  Dinner     3
88        24.71   5.85    Male     No  Thur   Lunch     2
181       23.33   5.65    Male    Yes   Sun  Dinner     2
44        30.40   5.60    Male     No   Sun  Dinner     4
52        34.81   5.20  Female     No   Sun  Dinner     4
85        34.83   5.17  Female     No  Thur   Lunch     4
211       25.89   5.16    Male    Yes   Sat  Dinner     4
```
Top N rows per group
```sql
-- Oracle's ROW_NUMBER() analytic function
SELECT * FROM (
  SELECT
    t.*,
    ROW_NUMBER() OVER(PARTITION BY day ORDER BY total_bill DESC) AS rn
  FROM tips t
)
WHERE rn < 3
ORDER BY day, rn;
```
```python
In [35]: (tips.assign(rn=tips.sort_values(['total_bill'], ascending=False)
   ....:                     .groupby(['day'])
   ....:                     .cumcount() + 1)
   ....:      .query('rn < 3')
   ....:      .sort_values(['day', 'rn']))
   ....:
Out[35]:
     total_bill    tip     sex smoker   day    time  size  rn
95        40.17   4.73    Male    Yes   Fri  Dinner     4   1
90        28.97   3.00    Male    Yes   Fri  Dinner     2   2
170       50.81  10.00    Male    Yes   Sat  Dinner     3   1
212       48.33   9.00    Male     No   Sat  Dinner     4   2
156       48.17   5.00    Male     No   Sun  Dinner     6   1
182       45.35   3.50    Male    Yes   Sun  Dinner     3   2
197       43.11   5.00  Female    Yes  Thur   Lunch     4   1
142       41.19   5.00    Male     No  Thur   Lunch     5   2
```
The same can be achieved using the rank(method='first') function:
```python
In [36]: (tips.assign(rnk=tips.groupby(['day'])['total_bill']
   ....:                      .rank(method='first', ascending=False))
   ....:      .query('rnk < 3')
   ....:      .sort_values(['day', 'rnk']))
   ....:
Out[36]:
     total_bill    tip     sex smoker   day    time  size  rnk
95        40.17   4.73    Male    Yes   Fri  Dinner     4  1.0
90        28.97   3.00    Male    Yes   Fri  Dinner     2  2.0
170       50.81  10.00    Male    Yes   Sat  Dinner     3  1.0
212       48.33   9.00    Male     No   Sat  Dinner     4  2.0
156       48.17   5.00    Male     No   Sun  Dinner     6  1.0
182       45.35   3.50    Male    Yes   Sun  Dinner     3  2.0
197       43.11   5.00  Female    Yes  Thur   Lunch     4  1.0
142       41.19   5.00    Male     No  Thur   Lunch     5  2.0
```
```sql
-- Oracle's RANK() analytic function
SELECT * FROM (
  SELECT
    t.*,
    RANK() OVER(PARTITION BY sex ORDER BY tip) AS rnk
  FROM tips t
  WHERE tip < 2
)
WHERE rnk < 3
ORDER BY sex, rnk;
```
Let's find tips with (rank < 3) per gender group for (tips < 2). Notice that when using rank(method='min'), rnk_min remains the same for the same tip (as with Oracle's RANK() function):
```python
In [37]: (tips[tips['tip'] < 2]
   ....:      .assign(rnk_min=tips.groupby(['sex'])['tip']
   ....:                          .rank(method='min'))
   ....:      .query('rnk_min < 3')
   ....:      .sort_values(['sex', 'rnk_min']))
   ....:
Out[37]:
     total_bill   tip     sex smoker  day    time  size  rnk_min
67         3.07  1.00  Female    Yes  Sat  Dinner     1      1.0
92         5.75  1.00  Female    Yes  Fri  Dinner     2      1.0
111        7.25  1.00  Female     No  Sat  Dinner     1      1.0
236       12.60  1.00    Male    Yes  Sat  Dinner     2      1.0
237       32.83  1.17    Male    Yes  Sat  Dinner     2      2.0
```
UPDATE
```sql
UPDATE tips
SET tip = tip*2
WHERE tip < 2;
```
```python
In [38]: tips.loc[tips['tip'] < 2, 'tip'] *= 2
```
DELETE
```sql
DELETE FROM tips
WHERE tip > 9;
```
In pandas we select the rows that should remain instead of deleting them:
```python
In [39]: tips = tips.loc[tips['tip'] <= 9]
```
Comparison with SAS
For potential users coming from SAS, this page is meant to demonstrate how different SAS operations would be performed in pandas.
If you’re new to pandas, you might want to first read through 10 Minutes to pandas to familiarize yourself with the library.
As is customary, we import pandas and NumPy as follows:
```python
In [1]: import pandas as pd

In [2]: import numpy as np
```
::: tip Note
Throughout this tutorial, the pandas DataFrame will be displayed by calling df.head(), which displays the first N (default 5) rows of the DataFrame. This is often used in interactive work (e.g. Jupyter notebook or terminal) - the equivalent in SAS would be:
```sas
proc print data=df(obs=5);
run;
```
:::
Data Structures
General Terminology Translation
| pandas | SAS |
|---|---|
| DataFrame | data set |
| column | variable |
| row | observation |
| groupby | BY-group |
| NaN | . |
DataFrame / Series
A DataFrame in pandas is analogous to a SAS data set - a two-dimensional data source with labeled columns that can be of different types. As will be shown in this document, almost any operation that can be applied to a data set using SAS’s DATA step, can also be accomplished in pandas.
A Series is the data structure that represents one column of a DataFrame. SAS doesn’t have a separate data structure for a single column, but in general, working with a Series is analogous to referencing a column in the DATA step.
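As a small sketch of that analogy (the column names here are arbitrary):

```python
import pandas as pd

df = pd.DataFrame({"height": [5.5, 6.0], "weight": [130, 150]})

# selecting one column of a DataFrame yields a Series
height = df["height"]
assert isinstance(height, pd.Series)

# operations on a Series are vectorized, much like referencing
# a variable inside the DATA step
doubled = (height * 2).tolist()
assert doubled == [11.0, 12.0]
```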
Index
Every DataFrame and Series has an Index - which are labels on the rows of the data. SAS does not have an exactly analogous concept. A data set’s rows are essentially unlabeled, other than an implicit integer index that can be accessed during the DATA step (_N_).
In pandas, if no index is specified, an integer index is also used by default (first row = 0, second row = 1, and so on). While using a labeled Index or MultiIndex can enable sophisticated analyses and is ultimately an important part of pandas to understand, for this comparison we will essentially ignore the Index and just treat the DataFrame as a collection of columns. Please see the indexing documentation for much more on how to use an Index effectively.
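As a small illustration of the default integer index (and of how a column can be promoted to a labeled Index and back), consider a toy frame like the one used later on this page; this is only a sketch of the mechanics:

```python
import pandas as pd

# By default rows get an integer index - loosely comparable to SAS's
# implicit _N_, though zero-based rather than one-based.
df = pd.DataFrame({'x': [1, 3, 5], 'y': [2, 4, 6]})
print(df.index.tolist())      # [0, 1, 2]

# A column can be promoted to a labeled Index and used for lookups...
labeled = df.set_index('x')
print(labeled.loc[3, 'y'])    # 4

# ...and reset back to an ordinary column, restoring the integer index.
print(labeled.reset_index().index.tolist())  # [0, 1, 2]
```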
Data Input / Output
Constructing a DataFrame from Values
A SAS data set can be built from specified values by placing the data after a datalines statement and specifying the column names.
```sas
data df;
    input x y;
    datalines;
1 2
3 4
5 6
;
run;
```
A pandas DataFrame can be constructed in many different ways, but for a small number of values, it is often convenient to specify it as a Python dictionary, where the keys are the column names and the values are the data.
```python
In [3]: df = pd.DataFrame({'x': [1, 3, 5], 'y': [2, 4, 6]})

In [4]: df
Out[4]:
   x  y
0  1  2
1  3  4
2  5  6
```
Reading External Data
Like SAS, pandas provides utilities for reading in data from many formats. The tips dataset, found within the pandas tests (csv) will be used in many of the following examples.
SAS provides PROC IMPORT to read csv data into a data set.
```sas
proc import datafile='tips.csv' dbms=csv out=tips replace;
    getnames=yes;
run;
```
The pandas method is read_csv(), which works similarly.
```python
In [5]: url = ('https://raw.github.com/pandas-dev/'
   ...:        'pandas/master/pandas/tests/data/tips.csv')
   ...:

In [6]: tips = pd.read_csv(url)

In [7]: tips.head()
Out[7]:
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
```
Like PROC IMPORT, read_csv can take a number of parameters to specify how the data should be parsed. For example, if the data was instead tab delimited, and did not have column names, the pandas command would be:
```python
tips = pd.read_csv('tips.csv', sep='\t', header=None)

# alternatively, read_table is an alias to read_csv with tab delimiter
tips = pd.read_table('tips.csv', header=None)
```
In addition to text/csv, pandas supports a variety of other data formats such as Excel, HDF5, and SQL databases. These are all read via a pd.read_* function. See the IO documentation for more details.
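Each format has a matching pd.read_* reader paired with a DataFrame.to_* writer. A minimal round-trip sketch using JSON (chosen here only because it needs no optional dependencies; Excel, HDF5, and SQL readers follow the same pattern but require extra packages):

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame({'x': [1, 3, 5], 'y': [2, 4, 6]})

# Serialize to a JSON string, then read it back with the paired reader.
json_str = df.to_json(orient='records')
roundtripped = pd.read_json(StringIO(json_str), orient='records')

print(roundtripped.equals(df))  # True
```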
Exporting Data
The inverse of PROC IMPORT in SAS is PROC EXPORT.
```sas
proc export data=tips outfile='tips2.csv' dbms=csv;
run;
```
Similarly in pandas, the opposite of read_csv is to_csv(), and other data formats follow a similar API.
```python
tips.to_csv('tips2.csv')
```
Data Operations
Operations on Columns
In the DATA step, arbitrary math expressions can be used on new or existing columns.
```sas
data tips;
    set tips;
    total_bill = total_bill - 2;
    new_bill = total_bill / 2;
run;
```
pandas provides similar vectorized operations by specifying the individual Series in the DataFrame. New columns can be assigned in the same way.
```python
In [8]: tips['total_bill'] = tips['total_bill'] - 2

In [9]: tips['new_bill'] = tips['total_bill'] / 2.0

In [10]: tips.head()
Out[10]:
   total_bill   tip     sex smoker  day    time  size  new_bill
0       14.99  1.01  Female     No  Sun  Dinner     2     7.495
1        8.34  1.66    Male     No  Sun  Dinner     3     4.170
2       19.01  3.50    Male     No  Sun  Dinner     3     9.505
3       21.68  3.31    Male     No  Sun  Dinner     2    10.840
4       22.59  3.61  Female     No  Sun  Dinner     4    11.295
```
Filtering
Filtering in SAS is done with an if or where statement, on one or more columns.
```sas
data tips;
    set tips;
    if total_bill > 10;
run;

data tips;
    set tips;
    where total_bill > 10;
    /* equivalent in this case - where happens before the
       DATA step begins and can also be used in PROC statements */
run;
```
DataFrames can be filtered in multiple ways; the most intuitive of which is using boolean indexing.
```python
In [11]: tips[tips['total_bill'] > 10].head()
Out[11]:
   total_bill   tip     sex smoker  day    time  size
0       14.99  1.01  Female     No  Sun  Dinner     2
2       19.01  3.50    Male     No  Sun  Dinner     3
3       21.68  3.31    Male     No  Sun  Dinner     2
4       22.59  3.61  Female     No  Sun  Dinner     4
5       23.29  4.71    Male     No  Sun  Dinner     4
```
If/Then Logic
In SAS, if/then logic can be used to create new columns.
```sas
data tips;
    set tips;
    format bucket $4.;
    if total_bill < 10 then bucket = 'low';
    else bucket = 'high';
run;
```
The same operation in pandas can be accomplished using the where method from numpy.
```python
In [12]: tips['bucket'] = np.where(tips['total_bill'] < 10, 'low', 'high')

In [13]: tips.head()
Out[13]:
   total_bill   tip     sex smoker  day    time  size bucket
0       14.99  1.01  Female     No  Sun  Dinner     2   high
1        8.34  1.66    Male     No  Sun  Dinner     3    low
2       19.01  3.50    Male     No  Sun  Dinner     3   high
3       21.68  3.31    Male     No  Sun  Dinner     2   high
4       22.59  3.61  Female     No  Sun  Dinner     4   high
```
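When more than two buckets are needed (which in SAS would take chained if/then/else statements), numpy's select generalizes the same pattern: a list of conditions is checked in order and the first match wins. A minimal sketch, with cutoff values chosen arbitrarily for illustration:

```python
import numpy as np
import pandas as pd

# toy frame standing in for the tips data
tips = pd.DataFrame({'total_bill': [5.0, 15.0, 30.0]})

conditions = [
    tips['total_bill'] < 10,
    tips['total_bill'] < 20,
]
# the first matching condition wins; 'high' is the fall-through default
tips['bucket'] = np.select(conditions, ['low', 'mid'], default='high')
print(tips['bucket'].tolist())  # ['low', 'mid', 'high']
```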
Date Functionality
SAS provides a variety of functions to do operations on date/datetime columns.
```sas
data tips;
    set tips;
    format date1 date2 date1_plusmonth mmddyy10.;
    date1 = mdy(1, 15, 2013);
    date2 = mdy(2, 15, 2015);
    date1_year = year(date1);
    date2_month = month(date2);
    * shift date to beginning of next interval;
    date1_next = intnx('MONTH', date1, 1);
    * count intervals between dates;
    months_between = intck('MONTH', date1, date2);
run;
```
The equivalent pandas operations are shown below. In addition to these functions pandas supports other Time Series features not available in Base SAS (such as resampling and custom offsets) - see the timeseries documentation for more details.
```python
In [14]: tips['date1'] = pd.Timestamp('2013-01-15')

In [15]: tips['date2'] = pd.Timestamp('2015-02-15')

In [16]: tips['date1_year'] = tips['date1'].dt.year

In [17]: tips['date2_month'] = tips['date2'].dt.month

In [18]: tips['date1_next'] = tips['date1'] + pd.offsets.MonthBegin()

In [19]: tips['months_between'] = (
   ....:     tips['date2'].dt.to_period('M') - tips['date1'].dt.to_period('M'))
   ....:

In [20]: tips[['date1', 'date2', 'date1_year', 'date2_month',
   ....:       'date1_next', 'months_between']].head()
   ....:
Out[20]:
       date1      date2  date1_year  date2_month date1_next    months_between
0 2013-01-15 2015-02-15        2013            2 2013-02-01  <25 * MonthEnds>
1 2013-01-15 2015-02-15        2013            2 2013-02-01  <25 * MonthEnds>
2 2013-01-15 2015-02-15        2013            2 2013-02-01  <25 * MonthEnds>
3 2013-01-15 2015-02-15        2013            2 2013-02-01  <25 * MonthEnds>
4 2013-01-15 2015-02-15        2013            2 2013-02-01  <25 * MonthEnds>
```
Selection of Columns
SAS provides keywords in the DATA step to select, drop, and rename columns.
```sas
data tips;
    set tips;
    keep sex total_bill tip;
run;

data tips;
    set tips;
    drop sex;
run;

data tips;
    set tips;
    rename total_bill=total_bill_2;
run;
```
The same operations are expressed in pandas below.
```python
# keep
In [21]: tips[['sex', 'total_bill', 'tip']].head()
Out[21]:
      sex  total_bill   tip
0  Female       14.99  1.01
1    Male        8.34  1.66
2    Male       19.01  3.50
3    Male       21.68  3.31
4  Female       22.59  3.61

# drop
In [22]: tips.drop('sex', axis=1).head()
Out[22]:
   total_bill   tip smoker  day    time  size
0       14.99  1.01     No  Sun  Dinner     2
1        8.34  1.66     No  Sun  Dinner     3
2       19.01  3.50     No  Sun  Dinner     3
3       21.68  3.31     No  Sun  Dinner     2
4       22.59  3.61     No  Sun  Dinner     4

# rename
In [23]: tips.rename(columns={'total_bill': 'total_bill_2'}).head()
Out[23]:
   total_bill_2   tip     sex smoker  day    time  size
0         14.99  1.01  Female     No  Sun  Dinner     2
1          8.34  1.66    Male     No  Sun  Dinner     3
2         19.01  3.50    Male     No  Sun  Dinner     3
3         21.68  3.31    Male     No  Sun  Dinner     2
4         22.59  3.61  Female     No  Sun  Dinner     4
```
Sorting by Values
Sorting in SAS is accomplished via PROC SORT.
```sas
proc sort data=tips;
    by sex total_bill;
run;
```
pandas objects have a sort_values() method, which takes a list of columns to sort by.
```python
In [24]: tips = tips.sort_values(['sex', 'total_bill'])

In [25]: tips.head()
Out[25]:
     total_bill   tip     sex smoker   day    time  size
67         1.07  1.00  Female    Yes   Sat  Dinner     1
92         3.75  1.00  Female    Yes   Fri  Dinner     2
111        5.25  1.00  Female     No   Sat  Dinner     1
145        6.35  1.50  Female     No  Thur   Lunch     2
135        6.51  1.25  Female     No  Thur   Lunch     2
```
String Processing
Length
SAS determines the length of a character string with the LENGTHN and LENGTHC functions. LENGTHN excludes trailing blanks and LENGTHC includes trailing blanks.
```sas
data _null_;
    set tips;
    put(LENGTHN(time));
    put(LENGTHC(time));
run;
```
Python determines the length of a character string with the len function. len includes trailing blanks. Use len and rstrip to exclude trailing blanks.
```python
In [26]: tips['time'].str.len().head()
Out[26]:
67     6
92     6
111    6
145    5
135    5
Name: time, dtype: int64

In [27]: tips['time'].str.rstrip().str.len().head()
Out[27]:
67     6
92     6
111    6
145    5
135    5
Name: time, dtype: int64
```
Find
SAS determines the position of a character in a string with the FINDW function. FINDW takes the string defined by the first argument and searches for the first position of the substring you supply as the second argument.
```sas
data _null_;
    set tips;
    put(FINDW(sex,'ale'));
run;
```
Python determines the position of a character in a string with the find function. find searches for the first position of the substring. If the substring is found, the function returns its position. Keep in mind that Python indexes are zero-based and the function will return -1 if it fails to find the substring.
```python
In [28]: tips['sex'].str.find("ale").head()
Out[28]:
67     3
92     3
111    3
145    3
135    3
Name: sex, dtype: int64
```
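The -1 sentinel (in contrast to FINDW, which returns 0 when the substring is absent) can be seen with a small standalone Series; a brief sketch:

```python
import pandas as pd

sex = pd.Series(['Female', 'Male'])

# "ale" occurs in both values, at zero-based positions 3 and 1...
print(sex.str.find('ale').tolist())   # [3, 1]

# ...while a substring that is absent yields -1 rather than an error
print(sex.str.find('xyz').tolist())   # [-1, -1]
```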
Substring
SAS extracts a substring from a string based on its position with the SUBSTR function.
```sas
data _null_;
    set tips;
    put(substr(sex,1,1));
run;
```
With pandas you can use [] notation to extract a substring from a string by position locations. Keep in mind that Python indexes are zero-based.
```python
In [29]: tips['sex'].str[0:1].head()
Out[29]:
67     F
92     F
111    F
145    F
135    F
Name: sex, dtype: object
```
Scan
The SAS SCAN function returns the nth word from a string. The first argument is the string you want to parse and the second argument specifies which word you want to extract.
```sas
data firstlast;
    input String $60.;
    First_Name = scan(string, 1);
    Last_Name = scan(string, -1);
    datalines2;
John Smith;
Jane Cook;
;;;
run;
```
Python extracts a substring from a string based on its text by using regular expressions. There are much more powerful approaches, but this just shows a simple approach.
```python
In [30]: firstlast = pd.DataFrame({'String': ['John Smith', 'Jane Cook']})

In [31]: firstlast['First_Name'] = firstlast['String'].str.split(" ", expand=True)[0]

In [32]: firstlast['Last_Name'] = firstlast['String'].str.rsplit(" ", n=1, expand=True)[1]

In [33]: firstlast
Out[33]:
       String First_Name Last_Name
0  John Smith       John     Smith
1   Jane Cook       Jane      Cook
```
Upcase, Lowcase, and Propcase
The SAS UPCASE LOWCASE and PROPCASE functions change the case of the argument.
```sas
data firstlast;
    input String $60.;
    string_up = UPCASE(string);
    string_low = LOWCASE(string);
    string_prop = PROPCASE(string);
    datalines2;
John Smith;
Jane Cook;
;;;
run;
```
The equivalent Python functions are upper, lower, and title.
```python
In [34]: firstlast = pd.DataFrame({'String': ['John Smith', 'Jane Cook']})

In [35]: firstlast['string_up'] = firstlast['String'].str.upper()

In [36]: firstlast['string_low'] = firstlast['String'].str.lower()

In [37]: firstlast['string_prop'] = firstlast['String'].str.title()

In [38]: firstlast
Out[38]:
       String   string_up  string_low string_prop
0  John Smith  JOHN SMITH  john smith  John Smith
1   Jane Cook   JANE COOK   jane cook   Jane Cook
```
Merging
The following tables will be used in the merge examples.
```python
In [39]: df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
   ....:                     'value': np.random.randn(4)})
   ....:

In [40]: df1
Out[40]:
  key     value
0   A  0.469112
1   B -0.282863
2   C -1.509059
3   D -1.135632

In [41]: df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'],
   ....:                     'value': np.random.randn(4)})
   ....:

In [42]: df2
Out[42]:
  key     value
0   B  1.212112
1   D -0.173215
2   D  0.119209
3   E -1.044236
```
In SAS, data must be explicitly sorted before merging. Different types of joins are accomplished using the in= dummy variables to track whether a match was found in one or both input frames.
```sas
proc sort data=df1;
    by key;
run;

proc sort data=df2;
    by key;
run;

data left_join inner_join right_join outer_join;
    merge df1(in=a) df2(in=b);
    if a and b then output inner_join;
    if a then output left_join;
    if b then output right_join;
    if a or b then output outer_join;
run;
```
pandas DataFrames have a merge() method, which provides similar functionality. Note that the data does not have to be sorted ahead of time, and different join types are accomplished via the how keyword.
```python
In [43]: inner_join = df1.merge(df2, on=['key'], how='inner')

In [44]: inner_join
Out[44]:
  key   value_x   value_y
0   B -0.282863  1.212112
1   D -1.135632 -0.173215
2   D -1.135632  0.119209

In [45]: left_join = df1.merge(df2, on=['key'], how='left')

In [46]: left_join
Out[46]:
  key   value_x   value_y
0   A  0.469112       NaN
1   B -0.282863  1.212112
2   C -1.509059       NaN
3   D -1.135632 -0.173215
4   D -1.135632  0.119209

In [47]: right_join = df1.merge(df2, on=['key'], how='right')

In [48]: right_join
Out[48]:
  key   value_x   value_y
0   B -0.282863  1.212112
1   D -1.135632 -0.173215
2   D -1.135632  0.119209
3   E       NaN -1.044236

In [49]: outer_join = df1.merge(df2, on=['key'], how='outer')

In [50]: outer_join
Out[50]:
  key   value_x   value_y
0   A  0.469112       NaN
1   B -0.282863  1.212112
2   C -1.509059       NaN
3   D -1.135632 -0.173215
4   D -1.135632  0.119209
5   E       NaN -1.044236
```
Missing Data
Like SAS, pandas has a representation for missing data - which is the special float value NaN (not a number). Many of the semantics are the same, for example missing data propagates through numeric operations, and is ignored by default for aggregations.
```python
In [51]: outer_join
Out[51]:
  key   value_x   value_y
0   A  0.469112       NaN
1   B -0.282863  1.212112
2   C -1.509059       NaN
3   D -1.135632 -0.173215
4   D -1.135632  0.119209
5   E       NaN -1.044236

In [52]: outer_join['value_x'] + outer_join['value_y']
Out[52]:
0         NaN
1    0.929249
2         NaN
3   -1.308847
4   -1.016424
5         NaN
dtype: float64

In [53]: outer_join['value_x'].sum()
Out[53]: -3.5940742896293765
```
One difference is that missing data cannot be compared to its sentinel value. For example, in SAS you could do this to filter missing values.
```sas
data outer_join_nulls;
    set outer_join;
    if value_x = .;
run;

data outer_join_no_nulls;
    set outer_join;
    if value_x ^= .;
run;
```
This doesn’t work in pandas. Instead, the pd.isna or pd.notna functions should be used for comparisons.
```python
In [54]: outer_join[pd.isna(outer_join['value_x'])]
Out[54]:
  key  value_x   value_y
5   E      NaN -1.044236

In [55]: outer_join[pd.notna(outer_join['value_x'])]
Out[55]:
  key   value_x   value_y
0   A  0.469112       NaN
1   B -0.282863  1.212112
2   C -1.509059       NaN
3   D -1.135632 -0.173215
4   D -1.135632  0.119209
```
pandas also provides a variety of methods to work with missing data - some of which would be challenging to express in SAS. For example, there are methods to drop all rows with any missing values, replace missing values with a specified value (like the mean), or forward fill from previous rows. See the missing data documentation for more.
```python
In [56]: outer_join.dropna()
Out[56]:
  key   value_x   value_y
1   B -0.282863  1.212112
3   D -1.135632 -0.173215
4   D -1.135632  0.119209

In [57]: outer_join.fillna(method='ffill')
Out[57]:
  key   value_x   value_y
0   A  0.469112       NaN
1   B -0.282863  1.212112
2   C -1.509059  1.212112
3   D -1.135632 -0.173215
4   D -1.135632  0.119209
5   E -1.135632 -1.044236

In [58]: outer_join['value_x'].fillna(outer_join['value_x'].mean())
Out[58]:
0    0.469112
1   -0.282863
2   -1.509059
3   -1.135632
4   -1.135632
5   -0.718815
Name: value_x, dtype: float64
```
GroupBy
Aggregation
SAS’s PROC SUMMARY can be used to group by one or more key variables and compute aggregations on numeric columns.
```sas
proc summary data=tips nway;
    class sex smoker;
    var total_bill tip;
    output out=tips_summed sum=;
run;
```
pandas provides a flexible groupby mechanism that allows similar aggregations. See the groupby documentation for more details and examples.
```python
In [59]: tips_summed = tips.groupby(['sex', 'smoker'])[['total_bill', 'tip']].sum()

In [60]: tips_summed.head()
Out[60]:
               total_bill     tip
sex    smoker
Female No          869.68  149.77
       Yes         527.27   96.74
Male   No         1725.75  302.00
       Yes        1217.07  183.07
```
Transformation
In SAS, if the group aggregations need to be used with the original frame, it must be merged back together. For example, to subtract the mean for each observation by smoker group.
```sas
proc summary data=tips missing nway;
    class smoker;
    var total_bill;
    output out=smoker_means mean(total_bill)=group_bill;
run;

proc sort data=tips;
    by smoker;
run;

data tips;
    merge tips(in=a) smoker_means(in=b);
    by smoker;
    adj_total_bill = total_bill - group_bill;
    if a and b;
run;
```
pandas groupby provides a transform mechanism that allows this type of operation to be succinctly expressed in one step.
```python
In [61]: gb = tips.groupby('smoker')['total_bill']

In [62]: tips['adj_total_bill'] = tips['total_bill'] - gb.transform('mean')

In [63]: tips.head()
Out[63]:
     total_bill   tip     sex smoker   day    time  size  adj_total_bill
67         1.07  1.00  Female    Yes   Sat  Dinner     1      -17.686344
92         3.75  1.00  Female    Yes   Fri  Dinner     2      -15.006344
111        5.25  1.00  Female     No   Sat  Dinner     1      -11.938278
145        6.35  1.50  Female     No  Thur   Lunch     2      -10.838278
135        6.51  1.25  Female     No  Thur   Lunch     2      -10.678278
```
By Group Processing
In addition to aggregation, pandas groupby can be used to replicate most other by group processing from SAS. For example, this DATA step reads the data by sex/smoker group and filters to the first entry for each.
```sas
proc sort data=tips;
    by sex smoker;
run;

data tips_first;
    set tips;
    by sex smoker;
    if FIRST.sex or FIRST.smoker then output;
run;
```
In pandas this would be written as:
```python
In [64]: tips.groupby(['sex', 'smoker']).first()
Out[64]:
               total_bill   tip   day    time  size  adj_total_bill
sex    smoker
Female No            5.25  1.00   Sat  Dinner     1      -11.938278
       Yes           1.07  1.00   Sat  Dinner     1      -17.686344
Male   No            5.51  2.00  Thur   Lunch     2      -11.678278
       Yes           5.25  5.15   Sun  Dinner     2      -13.506344
```
Other Considerations
Disk vs Memory
pandas operates exclusively in memory, whereas a SAS data set exists on disk. This means that the size of data able to be loaded in pandas is limited by your machine’s memory, but also that the operations on that data may be faster.
If out-of-core processing is needed, one possibility is the dask.dataframe library (currently in development), which provides a subset of pandas functionality for an on-disk DataFrame.
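Short of adopting dask, pandas itself can stream a large csv in pieces via the chunksize argument to read_csv, aggregating as it goes. A minimal sketch (the in-memory StringIO below is a stand-in for a file too large to load at once, and the amount column is hypothetical):

```python
from io import StringIO

import pandas as pd

# Stand-in for a large on-disk csv file.
csv_data = StringIO("amount\n1\n2\n3\n4\n5\n")

# Process the file in chunks of 2 rows, keeping only a running total,
# so peak memory is bounded by the chunk size rather than the file size.
total = 0
for chunk in pd.read_csv(csv_data, chunksize=2):
    total += chunk['amount'].sum()
print(total)  # 15
```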
Data Interop
pandas provides a read_sas() method that can read SAS data saved in the XPORT or SAS7BDAT binary format.
```sas
libname xportout xport 'transport-file.xpt';
data xportout.tips;
    set tips(rename=(total_bill=tbill));
    * xport variable names limited to 6 characters;
run;
```
```python
df = pd.read_sas('transport-file.xpt')
df = pd.read_sas('binary-file.sas7bdat')
```
You can also specify the file format directly. By default, pandas will try to infer the file format based on its extension.
```python
df = pd.read_sas('transport-file.xpt', format='xport')
df = pd.read_sas('binary-file.sas7bdat', format='sas7bdat')
```
XPORT is a relatively limited format and the parsing of it is not as optimized as some of the other pandas readers. An alternative way to move data between SAS and pandas is to serialize to csv.
```python
# version 0.17, 10M rows

In [8]: %time df = pd.read_sas('big.xpt')
Wall time: 14.6 s

In [9]: %time df = pd.read_csv('big.csv')
Wall time: 4.86 s
```
Comparison with Stata
For potential users coming from Stata, this page is meant to demonstrate how different Stata operations would be performed in pandas.
If you’re new to pandas, you might want to first read through 10 Minutes to pandas to familiarize yourself with the library.
As is customary, we import pandas and NumPy as follows. This means that we can refer to the libraries as pd and np, respectively, for the rest of the document.
```python
In [1]: import pandas as pd

In [2]: import numpy as np
```
::: tip Note
Throughout this tutorial, the pandas DataFrame will be displayed by calling df.head(), which displays the first N (default 5) rows of the DataFrame. This is often used in interactive work (e.g. Jupyter notebook or terminal) – the equivalent in Stata would be:
```stata
list in 1/5
```
:::
Data Structures
General Terminology Translation
| pandas | Stata |
|---|---|
| DataFrame | data set |
| column | variable |
| row | observation |
| groupby | bysort |
| NaN | . |
DataFrame / Series
A DataFrame in pandas is analogous to a Stata data set – a two-dimensional data source with labeled columns that can be of different types. As will be shown in this document, almost any operation that can be applied to a data set in Stata can also be accomplished in pandas.
A Series is the data structure that represents one column of a DataFrame. Stata doesn’t have a separate data structure for a single column, but in general, working with a Series is analogous to referencing a column of a data set in Stata.
Index
Every DataFrame and Series has an Index – labels on the rows of the data. Stata does not have an exactly analogous concept. In Stata, a data set’s rows are essentially unlabeled, other than an implicit integer index that can be accessed with _n.
In pandas, if no index is specified, an integer index is also used by default (first row = 0, second row = 1, and so on). While using a labeled Index or MultiIndex can enable sophisticated analyses and is ultimately an important part of pandas to understand, for this comparison we will essentially ignore the Index and just treat the DataFrame as a collection of columns. Please see the indexing documentation for much more on how to use an Index effectively.
Data Input / Output
Constructing a DataFrame from Values
A Stata data set can be built from specified values by placing the data after an input statement and specifying the column names.
```stata
input x y
1 2
3 4
5 6
end
```
A pandas DataFrame can be constructed in many different ways, but for a small number of values, it is often convenient to specify it as a Python dictionary, where the keys are the column names and the values are the data.
```python
In [3]: df = pd.DataFrame({'x': [1, 3, 5], 'y': [2, 4, 6]})

In [4]: df
Out[4]:
   x  y
0  1  2
1  3  4
2  5  6
```
Reading External Data
Like Stata, pandas provides utilities for reading in data from many formats. The tips data set, found within the pandas tests (csv) will be used in many of the following examples.
Stata provides import delimited to read csv data into a data set in memory. If the tips.csv file is in the current working directory, we can import it as follows.
```stata
import delimited tips.csv
```
The pandas method is read_csv(), which works similarly. Additionally, it will automatically download the data set if presented with a url.
```python
In [5]: url = ('https://raw.github.com/pandas-dev'
   ...:        '/pandas/master/pandas/tests/data/tips.csv')
   ...:

In [6]: tips = pd.read_csv(url)

In [7]: tips.head()
Out[7]:
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
```
Like import delimited, read_csv() can take a number of parameters to specify how the data should be parsed. For example, if the data were instead tab delimited, did not have column names, and existed in the current working directory, the pandas command would be:
```python
tips = pd.read_csv('tips.csv', sep='\t', header=None)

# alternatively, read_table is an alias to read_csv with tab delimiter
tips = pd.read_table('tips.csv', header=None)
```
pandas can also read Stata data sets in .dta format with the read_stata() function.
```python
df = pd.read_stata('data.dta')
```
In addition to text/csv and Stata files, pandas supports a variety of other data formats such as Excel, SAS, HDF5, Parquet, and SQL databases. These are all read via a pd.read_* function. See the IO documentation for more details.
Exporting Data
The inverse of import delimited in Stata is export delimited.
```stata
export delimited tips2.csv
```
Similarly in pandas, the opposite of read_csv is DataFrame.to_csv().
```python
tips.to_csv('tips2.csv')
```
pandas can also export to the Stata file format with the DataFrame.to_stata() method.
```python
tips.to_stata('tips2.dta')
```
Data Operations
Operations on Columns
In Stata, arbitrary math expressions can be used with the generate and replace commands on new or existing columns. The drop command drops the column from the data set.
```stata
replace total_bill = total_bill - 2
generate new_bill = total_bill / 2
drop new_bill
```
pandas provides similar vectorized operations by specifying the individual Series in the DataFrame. New columns can be assigned in the same way. The DataFrame.drop() method drops a column from the DataFrame.
```python
In [8]: tips['total_bill'] = tips['total_bill'] - 2

In [9]: tips['new_bill'] = tips['total_bill'] / 2

In [10]: tips.head()
Out[10]:
   total_bill   tip     sex smoker  day    time  size  new_bill
0       14.99  1.01  Female     No  Sun  Dinner     2     7.495
1        8.34  1.66    Male     No  Sun  Dinner     3     4.170
2       19.01  3.50    Male     No  Sun  Dinner     3     9.505
3       21.68  3.31    Male     No  Sun  Dinner     2    10.840
4       22.59  3.61  Female     No  Sun  Dinner     4    11.295

In [11]: tips = tips.drop('new_bill', axis=1)
```
Filtering
Filtering in Stata is done with an if clause on one or more columns.
```stata
list if total_bill > 10
```
DataFrames can be filtered in multiple ways; the most intuitive of which is using boolean indexing.
```python
In [12]: tips[tips['total_bill'] > 10].head()
Out[12]:
   total_bill   tip     sex smoker  day    time  size
0       14.99  1.01  Female     No  Sun  Dinner     2
2       19.01  3.50    Male     No  Sun  Dinner     3
3       21.68  3.31    Male     No  Sun  Dinner     2
4       22.59  3.61  Female     No  Sun  Dinner     4
5       23.29  4.71    Male     No  Sun  Dinner     4
```
If/Then Logic
In Stata, an if clause can also be used to create new columns.
```stata
generate bucket = "low" if total_bill < 10
replace bucket = "high" if total_bill >= 10
```
The same operation in pandas can be accomplished using the where method from numpy.
```python
In [13]: tips['bucket'] = np.where(tips['total_bill'] < 10, 'low', 'high')

In [14]: tips.head()
Out[14]:
   total_bill   tip     sex smoker  day    time  size bucket
0       14.99  1.01  Female     No  Sun  Dinner     2   high
1        8.34  1.66    Male     No  Sun  Dinner     3    low
2       19.01  3.50    Male     No  Sun  Dinner     3   high
3       21.68  3.31    Male     No  Sun  Dinner     2   high
4       22.59  3.61  Female     No  Sun  Dinner     4   high
```
Date Functionality
Stata provides a variety of functions to do operations on date/datetime columns.
```stata
generate date1 = mdy(1, 15, 2013)
generate date2 = date("Feb152015", "MDY")
generate date1_year = year(date1)
generate date2_month = month(date2)

* shift date to beginning of next month
generate date1_next = mdy(month(date1) + 1, 1, year(date1)) if month(date1) != 12
replace date1_next = mdy(1, 1, year(date1) + 1) if month(date1) == 12

generate months_between = mofd(date2) - mofd(date1)

list date1 date2 date1_year date2_month date1_next months_between
```
The equivalent pandas operations are shown below. In addition to these functions, pandas supports other Time Series features not available in Stata (such as time zone handling and custom offsets) – see the timeseries documentation for more details.
```python
In [15]: tips['date1'] = pd.Timestamp('2013-01-15')

In [16]: tips['date2'] = pd.Timestamp('2015-02-15')

In [17]: tips['date1_year'] = tips['date1'].dt.year

In [18]: tips['date2_month'] = tips['date2'].dt.month

In [19]: tips['date1_next'] = tips['date1'] + pd.offsets.MonthBegin()

In [20]: tips['months_between'] = (tips['date2'].dt.to_period('M')
   ....:                           - tips['date1'].dt.to_period('M'))
   ....:

In [21]: tips[['date1', 'date2', 'date1_year', 'date2_month', 'date1_next',
   ....:       'months_between']].head()
   ....:
Out[21]:
       date1      date2  date1_year  date2_month date1_next    months_between
0 2013-01-15 2015-02-15        2013            2 2013-02-01  <25 * MonthEnds>
1 2013-01-15 2015-02-15        2013            2 2013-02-01  <25 * MonthEnds>
2 2013-01-15 2015-02-15        2013            2 2013-02-01  <25 * MonthEnds>
3 2013-01-15 2015-02-15        2013            2 2013-02-01  <25 * MonthEnds>
4 2013-01-15 2015-02-15        2013            2 2013-02-01  <25 * MonthEnds>
```
Selection of Columns
Stata provides keywords to select, drop, and rename columns.
```stata
keep sex total_bill tip
drop sex
rename total_bill total_bill_2
```
The same operations are expressed in pandas below. Note that in contrast to Stata, these operations do not happen in place. To make these changes persist, assign the operation back to a variable.
```python
# keep
In [22]: tips[['sex', 'total_bill', 'tip']].head()
Out[22]:
      sex  total_bill   tip
0  Female       14.99  1.01
1    Male        8.34  1.66
2    Male       19.01  3.50
3    Male       21.68  3.31
4  Female       22.59  3.61

# drop
In [23]: tips.drop('sex', axis=1).head()
Out[23]:
   total_bill   tip smoker  day    time  size
0       14.99  1.01     No  Sun  Dinner     2
1        8.34  1.66     No  Sun  Dinner     3
2       19.01  3.50     No  Sun  Dinner     3
3       21.68  3.31     No  Sun  Dinner     2
4       22.59  3.61     No  Sun  Dinner     4

# rename
In [24]: tips.rename(columns={'total_bill': 'total_bill_2'}).head()
Out[24]:
   total_bill_2   tip     sex smoker  day    time  size
0         14.99  1.01  Female     No  Sun  Dinner     2
1          8.34  1.66    Male     No  Sun  Dinner     3
2         19.01  3.50    Male     No  Sun  Dinner     3
3         21.68  3.31    Male     No  Sun  Dinner     2
4         22.59  3.61  Female     No  Sun  Dinner     4
```
Sorting by Values
Sorting in Stata is accomplished via sort.
```stata
sort sex total_bill
```
pandas objects have a DataFrame.sort_values() method, which takes a list of columns to sort by.
```python
In [25]: tips = tips.sort_values(['sex', 'total_bill'])

In [26]: tips.head()
Out[26]:
     total_bill   tip     sex smoker   day    time  size
67         1.07  1.00  Female    Yes   Sat  Dinner     1
92         3.75  1.00  Female    Yes   Fri  Dinner     2
111        5.25  1.00  Female     No   Sat  Dinner     1
145        6.35  1.50  Female     No  Thur   Lunch     2
135        6.51  1.25  Female     No  Thur   Lunch     2
```
String Processing
Finding Length of String
Stata determines the length of a character string with the strlen() and ustrlen() functions for ASCII and Unicode strings, respectively.
```stata
generate strlen_time = strlen(time)
generate ustrlen_time = ustrlen(time)
```
Python determines the length of a character string with the len function. In Python 3, all strings are Unicode strings. len includes trailing blanks. Use len and rstrip to exclude trailing blanks.
```python
In [27]: tips['time'].str.len().head()
Out[27]:
67     6
92     6
111    6
145    5
135    5
Name: time, dtype: int64

In [28]: tips['time'].str.rstrip().str.len().head()
Out[28]:
67     6
92     6
111    6
145    5
135    5
Name: time, dtype: int64
```
Finding Position of Substring
Stata determines the position of a character in a string with the strpos() function. This takes the string defined by the first argument and searches for the first position of the substring you supply as the second argument.
```stata
generate str_position = strpos(sex, "ale")
```
Python determines the position of a character in a string with the find() function. find searches for the first position of the substring. If the substring is found, the function returns its position. Keep in mind that Python indexes are zero-based and the function will return -1 if it fails to find the substring.
```python
In [29]: tips['sex'].str.find("ale").head()
Out[29]:
67     3
92     3
111    3
145    3
135    3
Name: sex, dtype: int64
```
Extracting Substring by Position
Stata extracts a substring from a string based on its position with the substr() function.
```stata
generate short_sex = substr(sex, 1, 1)
```
With pandas you can use [] notation to extract a substring from a string by position locations. Keep in mind that Python indexes are zero-based.
```python
In [30]: tips['sex'].str[0:1].head()
Out[30]:
67     F
92     F
111    F
145    F
135    F
Name: sex, dtype: object
```
Extracting nth Word
The Stata word() function returns the nth word from a string. The first argument is the string you want to parse and the second argument specifies which word you want to extract.
```stata
clear
input str20 string
"John Smith"
"Jane Cook"
end

generate first_name = word(string, 1)
generate last_name = word(string, -1)
```
Python extracts a substring from a string based on its text by using regular expressions. There are much more powerful approaches, but this just shows a simple approach.
```python
In [31]: firstlast = pd.DataFrame({'string': ['John Smith', 'Jane Cook']})

In [32]: firstlast['First_Name'] = firstlast['string'].str.split(" ", expand=True)[0]

In [33]: firstlast['Last_Name'] = firstlast['string'].str.rsplit(" ", n=1, expand=True)[1]

In [34]: firstlast
Out[34]:
       string First_Name Last_Name
0  John Smith       John     Smith
1   Jane Cook       Jane      Cook
```
Changing Case
The Stata strupper(), strlower(), strproper(), ustrupper(), ustrlower(), and ustrtitle() functions change the case of ASCII and Unicode strings, respectively.
```stata
clear
input str20 string
"John Smith"
"Jane Cook"
end

generate upper = strupper(string)
generate lower = strlower(string)
generate title = strproper(string)
list
```
The equivalent Python functions are upper, lower, and title.
```python
In [35]: firstlast = pd.DataFrame({'string': ['John Smith', 'Jane Cook']})

In [36]: firstlast['upper'] = firstlast['string'].str.upper()

In [37]: firstlast['lower'] = firstlast['string'].str.lower()

In [38]: firstlast['title'] = firstlast['string'].str.title()

In [39]: firstlast
Out[39]:
       string       upper       lower       title
0  John Smith  JOHN SMITH  john smith  John Smith
1   Jane Cook   JANE COOK   jane cook   Jane Cook
```
Merging
The following tables will be used in the merge examples.
```python
In [40]: df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
   ....:                     'value': np.random.randn(4)})
   ....:

In [41]: df1
Out[41]:
  key     value
0   A  0.469112
1   B -0.282863
2   C -1.509059
3   D -1.135632

In [42]: df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'],
   ....:                     'value': np.random.randn(4)})
   ....:

In [43]: df2
Out[43]:
  key     value
0   B  1.212112
1   D -0.173215
2   D  0.119209
3   E -1.044236
```
In Stata, to perform a merge, one data set must be in memory and the other must be referenced as a file name on disk. In contrast, Python must have both DataFrames already in memory.
By default, Stata performs an outer join, where all observations from both data sets are left in memory after the merge. One can keep only observations from the initial data set, the merged data set, or the intersection of the two by using the values created in the _merge variable.
```stata
* First create df2 and save to disk
clear
input str1 key
B
D
D
E
end
generate value = rnormal()
save df2.dta

* Now create df1 in memory
clear
input str1 key
A
B
C
D
end
generate value = rnormal()

preserve

* Left join
merge 1:m key using df2.dta
keep if _merge == 1 | _merge == 3

* Right join
restore, preserve
merge 1:m key using df2.dta
keep if _merge == 2 | _merge == 3

* Inner join
restore, preserve
merge 1:m key using df2.dta
keep if _merge == 3

* Outer join
restore
merge 1:m key using df2.dta
```
pandas DataFrames have a DataFrame.merge() method, which provides similar functionality. Note that different join types are accomplished via the how keyword.
```python
In [44]: inner_join = df1.merge(df2, on=['key'], how='inner')

In [45]: inner_join
Out[45]:
  key   value_x   value_y
0   B -0.282863  1.212112
1   D -1.135632 -0.173215
2   D -1.135632  0.119209

In [46]: left_join = df1.merge(df2, on=['key'], how='left')

In [47]: left_join
Out[47]:
  key   value_x   value_y
0   A  0.469112       NaN
1   B -0.282863  1.212112
2   C -1.509059       NaN
3   D -1.135632 -0.173215
4   D -1.135632  0.119209

In [48]: right_join = df1.merge(df2, on=['key'], how='right')

In [49]: right_join
Out[49]:
  key   value_x   value_y
0   B -0.282863  1.212112
1   D -1.135632 -0.173215
2   D -1.135632  0.119209
3   E       NaN -1.044236

In [50]: outer_join = df1.merge(df2, on=['key'], how='outer')

In [51]: outer_join
Out[51]:
  key   value_x   value_y
0   A  0.469112       NaN
1   B -0.282863  1.212112
2   C -1.509059       NaN
3   D -1.135632 -0.173215
4   D -1.135632  0.119209
5   E       NaN -1.044236
```
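pandas can also reproduce Stata's `_merge` variable directly: passing `indicator=True` to `merge` adds a `_merge` column flagging each row as `left_only`, `right_only`, or `both`. A brief sketch using small illustrative frames (the column names here are made up for the example):

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'D'], 'x': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['B', 'D', 'E'], 'y': [4, 5, 6]})

# indicator=True adds a _merge column, analogous to Stata's _merge variable
merged = df1.merge(df2, on='key', how='outer', indicator=True)

# Keep only the intersection, like Stata's "keep if _merge == 3"
inner = merged[merged['_merge'] == 'both']
print(merged)
```

This is handy for auditing a merge before deciding which rows to keep.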
Missing Data
Like Stata, pandas has a representation for missing data – the special float value NaN (not a number). Many of the semantics are the same; for example missing data propagates through numeric operations, and is ignored by default for aggregations.
```python
In [52]: outer_join
Out[52]:
  key   value_x   value_y
0   A  0.469112       NaN
1   B -0.282863  1.212112
2   C -1.509059       NaN
3   D -1.135632 -0.173215
4   D -1.135632  0.119209
5   E       NaN -1.044236

In [53]: outer_join['value_x'] + outer_join['value_y']
Out[53]:
0         NaN
1    0.929249
2         NaN
3   -1.308847
4   -1.016424
5         NaN
dtype: float64

In [54]: outer_join['value_x'].sum()
Out[54]: -3.5940742896293765
```
One difference is that missing data cannot be compared to its sentinel value. For example, in Stata you could do this to filter missing values.
```stata
* Keep missing values
list if value_x == .

* Keep non-missing values
list if value_x != .
```
This doesn’t work in pandas. Instead, the pd.isna() or pd.notna() functions should be used for comparisons.
```python
In [55]: outer_join[pd.isna(outer_join['value_x'])]
Out[55]:
  key  value_x   value_y
5   E      NaN -1.044236

In [56]: outer_join[pd.notna(outer_join['value_x'])]
Out[56]:
  key   value_x   value_y
0   A  0.469112       NaN
1   B -0.282863  1.212112
2   C -1.509059       NaN
3   D -1.135632 -0.173215
4   D -1.135632  0.119209
```
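The same checks are also available as Series methods, which can read more naturally in a chained expression. A minimal sketch, using a small stand-in frame with the same column names as the merged example:

```python
import pandas as pd
import numpy as np

outer_join = pd.DataFrame({
    'key': ['A', 'B', 'E'],
    'value_x': [0.5, -0.3, np.nan],
})

# Series.isna()/Series.notna() mirror the pd.isna()/pd.notna() functions
missing = outer_join[outer_join['value_x'].isna()]
present = outer_join[outer_join['value_x'].notna()]
print(missing)
print(present)
```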
pandas also provides a variety of methods to work with missing data, some of which would be challenging to express in Stata. For example, there are methods to drop all rows with any missing values, to replace missing values with a specified value such as the mean, or to forward fill from previous rows. See the missing data documentation for more.
```python
# Drop rows with any missing value
In [57]: outer_join.dropna()
Out[57]:
  key   value_x   value_y
1   B -0.282863  1.212112
3   D -1.135632 -0.173215
4   D -1.135632  0.119209

# Fill forwards
In [58]: outer_join.fillna(method='ffill')
Out[58]:
  key   value_x   value_y
0   A  0.469112       NaN
1   B -0.282863  1.212112
2   C -1.509059  1.212112
3   D -1.135632 -0.173215
4   D -1.135632  0.119209
5   E -1.135632 -1.044236

# Impute missing values with the mean
In [59]: outer_join['value_x'].fillna(outer_join['value_x'].mean())
Out[59]:
0    0.469112
1   -0.282863
2   -1.509059
3   -1.135632
4   -1.135632
5   -0.718815
Name: value_x, dtype: float64
```
GroupBy
Aggregation
Stata’s collapse can be used to group by one or more key variables and compute aggregations on numeric columns.
```stata
collapse (sum) total_bill tip, by(sex smoker)
```
pandas provides a flexible groupby mechanism that allows similar aggregations. See the groupby documentation for more details and examples.
```python
In [60]: tips_summed = tips.groupby(['sex', 'smoker'])[['total_bill', 'tip']].sum()

In [61]: tips_summed.head()
Out[61]:
               total_bill     tip
sex    smoker
Female No          869.68  149.77
       Yes         527.27   96.74
Male   No         1725.75  302.00
       Yes        1217.07  183.07
```
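Stata's collapse can also apply different statistics to different variables, e.g. `collapse (sum) total_bill (mean) tip, by(sex)`. In pandas this maps to passing a dict to `agg`. A sketch using a small stand-in for the tips data:

```python
import pandas as pd

tips = pd.DataFrame({
    'sex': ['Female', 'Female', 'Male', 'Male'],
    'total_bill': [10.0, 20.0, 30.0, 40.0],
    'tip': [1.0, 2.0, 3.0, 4.0],
})

# One aggregation per column, like collapse (sum) total_bill (mean) tip
summed = tips.groupby('sex').agg({'total_bill': 'sum', 'tip': 'mean'})
print(summed)
```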
Transformation
In Stata, if the group aggregations need to be used with the original data set, one would usually use bysort with egen. For example, to subtract the mean for each observation by smoker group:
```stata
bysort sex smoker: egen group_bill = mean(total_bill)
generate adj_total_bill = total_bill - group_bill
```
pandas groupby provides a transform mechanism that allows these types of operations to be succinctly expressed in one operation.
```python
In [62]: gb = tips.groupby('smoker')['total_bill']

In [63]: tips['adj_total_bill'] = tips['total_bill'] - gb.transform('mean')

In [64]: tips.head()
Out[64]:
     total_bill   tip     sex smoker   day    time  size  adj_total_bill
67         1.07  1.00  Female    Yes   Sat  Dinner     1      -17.686344
92         3.75  1.00  Female    Yes   Fri  Dinner     2      -15.006344
111        5.25  1.00  Female     No   Sat  Dinner     1      -11.938278
145        6.35  1.50  Female     No  Thur   Lunch     2      -10.838278
135        6.51  1.25  Female     No  Thur   Lunch     2      -10.678278
```
By Group Processing
In addition to aggregation, pandas groupby can be used to replicate most other bysort processing from Stata. For example, the following example lists the first observation in the current sort order by sex/smoker group.
```stata
bysort sex smoker: list if _n == 1
```
In pandas this would be written as:
```python
In [65]: tips.groupby(['sex', 'smoker']).first()
Out[65]:
               total_bill   tip   day    time  size  adj_total_bill
sex    smoker
Female No            5.25  1.00   Sat  Dinner     1      -11.938278
       Yes           1.07  1.00   Sat  Dinner     1      -17.686344
Male   No            5.51  2.00  Thur   Lunch     2      -11.678278
       Yes           5.25  5.15   Sun  Dinner     2      -13.506344
```
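Note that `first()` returns the first non-missing value per column and reindexes by the group keys. If the goal is to keep each group's first row exactly as it appears in the data (closer to Stata's `list if _n == 1`), `groupby(...).head(1)` preserves the original rows and index. A sketch with made-up data:

```python
import pandas as pd

tips = pd.DataFrame({
    'sex': ['Female', 'Male', 'Female', 'Male'],
    'smoker': ['No', 'No', 'No', 'Yes'],
    'total_bill': [5.25, 5.51, 6.35, 5.25],
})

# head(1) keeps the first row of each sex/smoker group, original index intact
first_rows = tips.groupby(['sex', 'smoker']).head(1)
print(first_rows)
```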
Other Considerations
Disk vs Memory
pandas and Stata both operate exclusively in memory. This means that the size of the dataset that can be loaded in pandas is limited by your machine's memory. If out-of-core processing is needed, one possibility is the dask.dataframe library, which provides a subset of pandas functionality for an on-disk DataFrame.
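Besides dask, pandas itself can process a file larger than memory in pieces via the `chunksize` argument to `read_csv`, at the cost of writing the aggregation by hand. A minimal sketch (an in-memory buffer stands in for a large on-disk file):

```python
import pandas as pd
from io import StringIO

# Stand-in for a large CSV on disk
csv_data = StringIO("value\n1\n2\n3\n4\n5\n")

# Process the file in chunks of 2 rows, accumulating a running total
total = 0
for chunk in pd.read_csv(csv_data, chunksize=2):
    total += chunk['value'].sum()
print(total)  # 15
```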
