Overview

Source of data: World Bank - Bangladesh Informal Firms Surveys 2010 (http://microdata.worldbank.org/index.php/catalog/2244/related_materials)

The World Bank conducted a survey of informal firms in Bangladesh in 2010. An informal firm is a business that is not registered with the authorities, as opposed to a formal business, which is registered with the government and is therefore subject to a greater degree of legal scrutiny and regulation. Informal businesses often dominate the economy in developing countries. Firms often remain informal because obtaining formal recognition from the government is costly in time and money, particularly relative to the low or nonexistent benefit of being registered; registration may even end up costing more than remaining informal, because a registered business must then pay taxes.

The survey asked informal business owners a series of questions covering topics ranging from business characteristics to market conditions. Surveys were conducted in a selected number of districts around Bangladesh. Each record in the data represents one surveyed business and includes district-level geographic information (districts are the second-level administrative boundary, after divisions). Geographic boundary data for Bangladesh was sourced from the World Food Programme via humanitarianresponse.info.

My analysis explores various characteristics of informal firms in Bangladesh; specifically, I analyze how annual revenue relates to other variables, such as number of employees, industry sector, financing sources, and geography.

The data set contains a long list of variables to compare, and part of the challenge of this project is to select the variables that are most useful for bringing out the story of the data. In the visualizations below, the primary dependent variable is Annual Revenue for the year 2009, denominated in Bangladeshi takas, the local currency; in late 2009, the exchange rate (http://www.exchange-rates.org/Rate/USD/BDT/12-23-2009) was about 68.4126 takas to the US dollar. The exploratory visualizations refer to this variable by its name in the data, “q4_7”. Another frequently used variable is the number of employees of the firm, which the exploratory visualizations refer to as “q1_12a5a”.
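To give a sense of scale, the taka amounts in this report can be converted to US dollars with the exchange rate quoted above. The following is a minimal sketch; the function name `taka_to_usd` and the example amounts are my own illustrations, not survey values.

```r
# Exchange rate quoted in the text (takas per US dollar, late 2009).
TAKA_PER_USD <- 68.4126

# Convert an amount in takas to US dollars (hypothetical helper).
taka_to_usd <- function(taka, rate = TAKA_PER_USD) {
  taka / rate
}

# Illustrative round amounts in takas, not taken from the survey.
revenue_taka <- c(100000, 1000000, 10000000)
revenue_usd <- taka_to_usd(revenue_taka)
print(round(revenue_usd))
```

At this rate, an annual revenue of one million takas corresponds to roughly 14,600 US dollars.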

I should also note that I augmented the data significantly by linking the survey data (which includes a variable for the ISIC code, ISIC being an industry classification system) with ISIC section category descriptions, to allow responses to be grouped by industry sector.

The following R code loads the libraries and data sets required for the analysis. These data sets include the survey data from Bangladesh and the geographic shapefile used to display the boundaries of districts in Bangladesh.

Analysis

A Geographic Overview of Bangladesh

## OGR data source with driver: ESRI Shapefile 
## Source: "shp", layer: "bgd_polbnda_adm2_wfp"
## with 64 features
## It has 4 fields

The figure above shows average revenue in 2009, denominated in local currency (takas), for businesses in the surveyed districts. One thing is immediately apparent on this plot: the survey was conducted only in a selection of districts, not the whole country, and only the surveyed districts are filled with color.
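The data preparation behind a map like this can be sketched as a per-district average joined onto the full list of districts, so that unsurveyed districts end up with a missing value and stay unfilled. This sketch uses base R only; the district labels and revenue figures are made up, and only the variable name `q4_7` comes from the survey.

```r
# Hypothetical firm-level records: district label and annual revenue (q4_7).
firms <- data.frame(
  district = c("A", "A", "B", "B", "B"),
  q4_7 = c(100000, 300000, 2000000, 4000000, 6000000),
  stringsAsFactors = FALSE
)

# Hypothetical list of every district in the boundary file; "C" was not surveyed.
districts <- data.frame(district = c("A", "B", "C"), stringsAsFactors = FALSE)

# Average revenue per surveyed district.
avg_rev <- aggregate(q4_7 ~ district, data = firms, FUN = mean)
names(avg_rev)[2] <- "avg_revenue"

# Left join: keep all districts; unsurveyed ones get NA and would be left unfilled.
map_data <- merge(districts, avg_rev, by = "district", all.x = TRUE)
print(map_data)
```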

Are there any clear geographic patterns to annual revenue? Such a conclusion is difficult to draw from the information on this map. The district with the lowest average revenue per firm (shaded in the darkest blue) appears to be right in the center of the country, and a number of other districts have comparable average revenue. The district with the highest average revenue, shaded in light blue, is in the southwest of the country; no other district seems to have a comparably high average revenue per firm.

Now, let’s take a closer look at the data for annual revenue.

Analysis of Annual Revenue variable

Let’s explore the Annual Revenue variable through a series of histograms.

The histograms above display the distribution of annual revenue for 2009. I chose to use three histograms to display the same variable with different binwidths, which allows us to view the data from an increasingly generalized perspective. We can see that the distribution is not normal: it resembles a negative exponential distribution, skewed to the right, with most firms at low revenue and a long tail of high-revenue firms. This shape makes sense if one expects that most firms in Bangladesh are small and would not have high revenue.

What do the differences in binwidth tell us? The histogram with a binwidth of 100000 shows that the data spikes periodically and does not appear as smooth as the next 2 histograms. These spikes likely arise because survey respondents tend to give estimated answers that correspond to rounded increments.

The second histogram reduces the number of spikes substantially, and the third almost eliminates them entirely, creating a smooth staircase shape. I believe that the second histogram manages the spikes sufficiently without totally eliminating the variety of responses, so it is the best one to use.
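The effect of binwidth on rounded answers can be reproduced on simulated data. This is a sketch under stated assumptions: the revenues are drawn from an exponential distribution and rounded to the nearest 500000 to imitate respondents giving estimates, and `bin_counts` is a hypothetical helper built on base R's `hist`.

```r
# Count observations per bin for a given binwidth, without drawing a plot.
bin_counts <- function(x, binwidth) {
  breaks <- seq(0, ceiling(max(x) / binwidth) * binwidth, by = binwidth)
  hist(x, breaks = breaks, plot = FALSE)$counts
}

# Simulated revenues (not survey data), rounded to the nearest 500000 takas.
set.seed(1)
revenue <- round(rexp(500, rate = 1 / 2e6) / 5e5) * 5e5

narrow <- bin_counts(revenue, binwidth = 1e5)  # spiky: most bins are empty
wide   <- bin_counts(revenue, binwidth = 1e6)  # smoother staircase

# Share of empty bins at each binwidth.
print(c(narrow = mean(narrow == 0), wide = mean(wide == 0)))
```

With the narrow binwidth, the rounded answers land in every fifth bin and leave the bins in between empty, which is what produces the periodic spikes; the wider bins absorb several rounded values each.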

Applying logarithmic adjustment, we can create a new histogram:

The logarithmic adjustment of the annual revenue variable creates a new histogram in which the data appears approximately normally distributed.
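One way to put a number on this change of shape is to compare the skewness before and after the transformation. The sketch below uses simulated log-normal revenues rather than the survey data, and `skewness` is a small helper written here (the moment-based definition), not a function from a package.

```r
# Moment-based sample skewness: 0 for a symmetric distribution,
# positive when the long tail is on the right.
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Simulated revenues (not survey data): log-normal, so right-skewed by construction.
set.seed(2)
revenue <- rlnorm(1000, meanlog = log(2e6), sdlog = 1.2)

skew_raw <- skewness(revenue)         # strongly positive
skew_log <- skewness(log10(revenue))  # close to zero
print(c(raw = skew_raw, log10 = skew_log))
```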

Let’s take another view of business size, by comparing annual revenue to number of employees.

The plot above displays the number of employees in the firm on the x axis and annual revenue on the y axis. It is difficult to discern any clear pattern; we appear to have a problem with overplotting, as many of the data points overlap. This probably results from the fact that the number of employees is an integer, and is limited on the graph to a maximum of 20. Let’s add jitter to the plot to make things clearer.
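Jitter simply adds a small random offset to each value so that identical integers no longer sit on top of each other. A minimal illustration with base R's `jitter` follows; the employee counts are made up, and the plots in this report may have used a different jitter mechanism.

```r
# Made-up integer employee counts with many ties, as in the survey variable.
set.seed(3)
employees <- c(1, 1, 1, 2, 2, 3)

# Shift each value by a random amount drawn uniformly from [-0.3, 0.3].
jittered <- jitter(employees, amount = 0.3)

print(length(unique(employees)))  # only 3 distinct x positions before jitter
print(round(jittered, 2))         # every point now has its own position
```

Keeping the offset well below 0.5 means each point still clearly belongs to its original integer.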

The jitter does help somewhat to make clear the varied relationship between number of employees and annual revenue. It makes sense that there is an overall positive relationship, but based on this plot, the relationship does not appear to be that strong. There seem to be many instances of firms with few employees and high revenue, and of firms with many employees and low revenue.

Let’s take another view of this relationship, by applying a logarithmic scale to the annual revenue, and maintaining the jitter.

The application of this change seems to make clearer a positive relationship between annual revenue and number of employees.

Let’s numerically verify this relationship by calculating the correlation between number of employees and annual revenue.

## 
##  Pearson's product-moment correlation
## 
## data:  surveydata$q4_7 and surveydata$q1_12a5a
## t = 14.8642, df = 1722, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2947005 0.3784047
## sample estimates:
##       cor 
## 0.3372189

The correlation of 0.337 indicates that there is a positive relationship between the number of employees and annual revenue, although it cannot be characterised as a strong relationship.
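The test statistic in the output above follows directly from the correlation and the degrees of freedom, via t = r * sqrt(df / (1 - r^2)). The sketch below checks this against the reported values and then confirms the same identity on toy data with `cor.test`; the helper name `t_from_r` and the toy data are my own.

```r
# t statistic of a Pearson correlation test, from r and the degrees of freedom.
t_from_r <- function(r, df) {
  r * sqrt(df / (1 - r^2))
}

# Values reported in the output above: r = 0.3372189, df = 1722.
t_reported <- t_from_r(0.3372189, 1722)
print(round(t_reported, 4))

# The same identity on toy data (not survey data).
set.seed(42)
x <- rpois(200, 3)
y <- x * 1e5 + rnorm(200, sd = 5e5)
ct <- cor.test(x, y)
t_toy <- t_from_r(unname(ct$estimate), unname(ct$parameter))
print(all.equal(unname(ct$statistic), t_toy))
```

Because the sample is large (df = 1722), even a moderate correlation yields a very large t value and a tiny p-value; the p-value speaks to whether a relationship exists, not to how strong it is.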

Analysis of Business Classifications

What type of businesses were surveyed, and is it possible to explore some of their characteristics? Perhaps a more insightful question would be, is it possible to determine if annual revenue varies by business sector?

The survey data includes information about the industry classification of the business; the business information is given via the ISIC code, an industry classification system. ISIC is a 4-tiered classification system, with the lower 3 tiers represented by a numeric code, and a corresponding top level represented by a letter. The lowest (and most granular) level is represented by a 4-digit code.
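Because the numeric tiers are nested, the coarser levels can be read off the leading digits of the 4-digit code. The sketch below assumes the codes may have been stored as integers (losing any leading zero); the example codes are placeholders chosen for illustration. The top-level section letter is not encoded in the digits, so it requires a separate lookup table.

```r
# Placeholder 4-digit codes, stored as numbers as they might be in a survey file.
isic4 <- c(5232, 5520, 111)

# Restore the leading zero so every code has exactly 4 characters.
code <- sprintf("%04d", as.integer(isic4))

# Derive the coarser numeric tiers from the leading digits.
isic_levels <- data.frame(
  level4 = code,                # 4 digits: most granular
  level3 = substr(code, 1, 3),  # first 3 digits
  level2 = substr(code, 1, 2),  # first 2 digits
  stringsAsFactors = FALSE
)
print(isic_levels)
```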

The following 2 lists and pie chart explore the different classification levels within the data. I wish to find the level of granularity in the classification data that best facilitates further analysis. There is a balance to be achieved: granular classifications could yield granular insights, but may require an unwieldy number of possible classifications; too few classifications could inhibit any meaningful analysis, because we need a critical mass of observations in each classification to make good, generalizable observations.

The following code imports the ISIC code descriptions and adds them to the survey data. The ISIC classification descriptions were sourced from the UN Department of Economic and Social Affairs (http://unstats.un.org/unsd/cr/registry/regcst.asp?Cl=27); some manipulation of this data was required to shorten the category descriptions, for better display of names on the plots, and to create a better mapping between different codes.
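As a rough sketch of this kind of join (not the original import code), descriptions can be attached with a left merge on the code column, followed by a check for codes that found no match. The column name `isic4`, the firm ids, and the codes are hypothetical placeholders; the two descriptions are borrowed from the Level 4 listing in this report.

```r
# Hypothetical survey extract: one row per firm with a placeholder ISIC code.
survey <- data.frame(
  firm_id = 1:4,
  isic4 = c("1001", "1002", "1001", "9999"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table; the code-to-description pairing is illustrative only.
isic_lookup <- data.frame(
  isic4 = c("1001", "1002"),
  ISIC4_desc = c("Retail sale of textiles, clothing, footwear and leather goods",
                 "Restaurants, bars and canteens"),
  stringsAsFactors = FALSE
)

# Left join: keep every firm, even when its code has no description.
merged <- merge(survey, isic_lookup, by = "isic4", all.x = TRUE)

# Firms whose code did not match the lookup table.
n_unmatched <- sum(is.na(merged$ISIC4_desc))
print(n_unmatched)
```

Counting unmatched rows after the join is a cheap sanity check, since a large group with a blank description usually points to a mismatch between the two code lists.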

ISIC Level 2 Number of Businesses

## Source: local data frame [4 x 2]
## 
##          ISIC2_desc    n
## 1                   1699
## 2     Communication   18
## 3    Land Transport    6
## 4 Transport Support    1

ISIC Level 4 Number of Businesses

## Source: local data frame [76 x 2]
## 
##                                                                       ISIC4_desc   n
## 1                  Retail sale of textiles, clothing, footwear and leather goods 217
## 2                                                 Restaurants, bars and canteens 179
## 3                                    Other retail sale in non-specialized stores  99
## 4                                       Manufacture of structural metal products  92
## 5                                        Other retail sale in specialized stores  86
## 6                    Retail sale of household appliances, articles and equipment  83
## 7                             Manufacture of wearing apparel, except fur apparel  82
## 8  Retail sale of pharmaceutical and medical goods, cosmetic and toilet articles  68
## 9                                                       Manufacture of furniture  64
## 10              Retail sale of food, beverages and tobacco in specialized stores  58
## ..                                                                           ... ...

ISIC Number of Businesses by Section

The most granular level of classification, Level 4, has 76 distinct values in the data set, while Level 2 has 30 distinct values. For purposes of initial exploration (especially visual exploration), both present a very large number of distinct values. However, the ISIC Section level has 7 distinct values, which seems to be an ideal number with which to explore. The other levels may be useful for deeper analysis at some point. The risk of using the ISIC Section level is that too much of the data is concentrated in a few categories; indeed, the pie chart under “ISIC Number of Businesses by Section” shows that over 50% of businesses are classified under “Retail and Repair”.
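The concentration can be cross-checked roughly against the correlation tests reported later in this report, since a Pearson test on n firms has n - 2 degrees of freedom. This is only an approximation: it assumes each test used every firm in the section that reported both revenue and number of employees, so the counts may differ slightly from those behind the pie chart.

```r
# Degrees of freedom from the correlation outputs later in this report.
df_all <- 1722
df_section <- c("Manufacturing" = 479,
                "Retail & Repair" = 907,
                "Hotels and restaurants" = 178,
                "Other community" = 63)

# A Pearson correlation test on n pairs has n - 2 degrees of freedom.
n_all <- df_all + 2
n_section <- df_section + 2

# Approximate share of firms in each of these sections, in percent.
share <- n_section / n_all
print(round(100 * share, 1))
```

Under these assumptions, Retail & Repair accounts for about 53% of the firms, in line with the pie chart.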

Let’s begin by comparing revenue levels for the different ISIC sections.

We will first facet annual revenue by ISIC section.

On the surface, the faceted histograms reveal a variety of distributions. The Manufacturing, Retail & Repair, and “Other community” sections have a very clearly right-skewed, exponential-like distribution. Hotels and restaurants, on the other hand, seem to have a very flat distribution.

Let’s use box plots to examine the relationship between ISIC sections and revenue.

The box plot does reveal variation in revenue levels across sections: Hotels & Restaurants shows the highest median revenue, and Health & Social Work shows the lowest (note, though, the relatively small number of observations in the data set for this category, as evidenced by the pie chart “ISIC Number of Businesses by Section”). Manufacturing, Logistics & Communications, and Retail & Repair have comparable median levels. Manufacturing and Logistics & Communications have the widest differences between the 1st and 3rd quartiles.

Let’s have a closer look at certain ISIC sections, first by generating summary statistics, then by reconciling these statistics against histograms for those sections.

## surveydata$ISIC_sectiondesc: Health and social work
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   80000   96000  112000  112000  128000  144000 
## --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
## surveydata$ISIC_sectiondesc: Hotels and restaurants
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    90000  1400000  2948000  4073000  5010000 28000000 
## --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
## surveydata$ISIC_sectiondesc: Logistics & Communications
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   200000  1300000  2020000  5711000  7000000 44400000 
## --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
## surveydata$ISIC_sectiondesc: Manufacturing
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
##        999     700000    1680000   19300000    6500000 1860000000 
## --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
## surveydata$ISIC_sectiondesc: Other community
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   16000  190000  330000  670300  600000 9600000 
## --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
## surveydata$ISIC_sectiondesc: Real estate
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    60000   240000   635000  1283000  1350000 11600000 
## --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
## surveydata$ISIC_sectiondesc: Retail & Repair
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##       999    850000   2184000   9384000   5400000 600000000
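The layout of the output above is what base R's `by` produces when `summary` is applied to one variable within each level of another. A small sketch on toy data follows; the variable names mirror those in the output, but the values are made up.

```r
# Toy data: the names mirror the survey variables, the values are made up.
surveydata <- data.frame(
  ISIC_sectiondesc = c("Manufacturing", "Manufacturing", "Manufacturing",
                       "Real estate", "Real estate"),
  q4_7 = c(700000, 1680000, 6500000, 240000, 1350000),
  stringsAsFactors = FALSE
)

# Six-number summary of revenue within each section.
print(by(surveydata$q4_7, surveydata$ISIC_sectiondesc, summary))

# A single statistic per section, returned as a named vector.
medians <- tapply(surveydata$q4_7, surveydata$ISIC_sectiondesc, median)
print(medians)
```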

This perspective makes clearer than the faceted histograms that the distributions of these sections are similar to the ones already mentioned, with an exponential shape. This was not as apparent in the faceted plot, because the y axis limit was more appropriate for the sections with more observations in the data, and therefore higher counts. This points to a limitation of making comparisons using faceted views: if the number of observations varies widely between facets, it is difficult to make accurate observations across all facets.

The observations taken from analysis of the boxplot are confirmed by the summary of the statistics above. Hotels and restaurants have a median of 2948000 takas, mean of 4073000, and range between 1st and 3rd quartiles of 1400000 and 5010000 takas.

The distribution of the histogram seems to be negative exponential; however, compared to the same plots of the other ISIC sections, it is a lot bumpier, peaking about 3 times. So this distribution is not as consistent as the others.

For Retail and Repair, we observe a median of 2184000 (the next highest median after Hotels and Restaurants), a mean of 9384000 (higher than Hotels and Restaurants), and 1st and 3rd quartiles of 850000 and 5400000. The distribution is negative exponential, with a fairly smooth decrease along the x axis.

Manufacturing has a median of 1680000 (lower than the previously discussed ISIC sections), a mean of 19300000 (much higher than the other sections), and 1st and 3rd quartiles of 700000 and 6500000, a difference of 5800000 takas. Note the large maximum value of 1,860,000,000 takas, which indicates that the absolute range of the data is quite large and is likely driving the large difference between mean and median. It seems that there is a large variety of firms in the Manufacturing section when compared by revenue.

The distribution of the data is negative exponential, with the largest number of observations concentrated in the first bin. The data includes a large number of firms making little revenue, and a small number that have very high revenue.

Logistics and Communications has a median of 2020000 (comparable to Manufacturing), a mean of 5711000 (much lower than Manufacturing, but closer to Retail & Repair), and 1st and 3rd quartiles of 1300000 and 7000000, a difference of 5700000 (slightly less than that of Manufacturing).
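The quartile differences and the gap between mean and median discussed above can be recomputed from the figures in the summary output. The sketch below copies those figures by hand; the data frame and column names are my own.

```r
# Figures copied from the summary output above (takas).
stats <- data.frame(
  section = c("Hotels and restaurants", "Retail & Repair",
              "Manufacturing", "Logistics & Communications"),
  q1     = c(1400000,  850000,   700000, 1300000),
  median = c(2948000, 2184000,  1680000, 2020000),
  mean   = c(4073000, 9384000, 19300000, 5711000),
  q3     = c(5010000, 5400000,  6500000, 7000000),
  stringsAsFactors = FALSE
)

# Interquartile range: the spread of the middle half of firms.
stats$iqr <- stats$q3 - stats$q1

# How far a few very large firms pull the mean above the median.
stats$mean_to_median <- round(stats$mean / stats$median, 1)

print(stats[, c("section", "iqr", "mean_to_median")])
```

For Manufacturing, the mean is about 11.5 times the median, by far the largest ratio of the four sections, which is consistent with a handful of very large firms dominating the average.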

While it does seem to have a negative exponential distribution like the others, it is also very uneven, peaking and dropping to zero at multiple points along the x axis. This may be caused by the relatively small number of observations in this category (which is reflected in the pie chart above).

Exploring Business Size - Revenue vs Number of Employees

As another measure of business size, how does the number of employees vary across ISIC sections?

Based on this boxplot, the Hotels & Restaurants section has the highest median number of employees, followed by Manufacturing. The medians for those sections are noticeably higher than those of the other sections, and the difference between the 1st and 3rd quartiles is wider.

Point Plot comparison - Annual Revenue & Number of Employees by ISIC section

Now, let’s explore the two size characteristics together with the ISIC section, all on the same point plot.

The same plot, with a logarithmic scale applied to annual revenue:

What is apparent in the first plot is made more explicit in the second plot (the one with the logarithmic scale applied to annual revenue): how intensively different ISIC sections utilise labor. In the Retail & Repair section, revenue grows very quickly for a small increase in the number of employees; in the Hotels & Restaurants and Manufacturing sections, however, annual revenue increases more slowly as the number of employees grows, implying that those sectors are more dependent on labor for growth.

This line plot shows the relationship between revenue and number of employees for selected sectors. This plot shows much more explicitly what was inferred in the previous plot: that revenue increases quite sharply as the number of employees grows for the Retail & Repair Section, and much more gradually for the other sectors.
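One way to quantify how steeply revenue rises with employees in each section is to fit a straight line to log10 revenue within each section and compare the slopes. The sketch below does this on simulated data with known slopes; it is an assumption-laden illustration, not the method used to draw the line plot, and the section slopes are invented.

```r
# Simulate one section: log10 revenue rises linearly with employees, plus noise.
make_section <- function(name, slope, n = 100) {
  employees <- sample(1:20, n, replace = TRUE)
  data.frame(
    section = name,
    employees = employees,
    q4_7 = 10^(5.5 + slope * employees + rnorm(n, sd = 0.3)),
    stringsAsFactors = FALSE
  )
}

# Invented slopes, chosen only to make the two sections differ.
set.seed(4)
firms <- rbind(make_section("Retail & Repair", slope = 0.10),
               make_section("Manufacturing", slope = 0.04))

# Fit log10(revenue) ~ employees separately within each section.
slopes <- sapply(split(firms, firms$section), function(d) {
  coef(lm(log10(q4_7) ~ employees, data = d))[["employees"]]
})
print(round(slopes, 3))
```

A slope of b on the log10 scale means each additional employee is associated with revenue multiplied by 10^b, so the slopes can be compared directly across sections.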

Let’s explore the correlation between revenue and number of employees by section.

Correlation – Annual Revenue to Employees, section Manufacturing

## 
##  Pearson's product-moment correlation
## 
## data:  q4_7 and q1_12a5a
## t = 8.1324, df = 479, p-value = 0.000000000000003553
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2672266 0.4244997
## sample estimates:
##      cor 
## 0.348312

Correlation – Annual Revenue to Employees, section Retail

## 
##  Pearson's product-moment correlation
## 
## data:  q4_7 and q1_12a5a
## t = 12.2992, df = 907, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3209411 0.4324666
## sample estimates:
##       cor 
## 0.3780747

Correlation – Annual Revenue to Employees, section Hotels and Restaurants

## 
##  Pearson's product-moment correlation
## 
## data:  q4_7 and q1_12a5a
## t = 19.7177, df = 178, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7759546 0.8691912
## sample estimates:
##       cor 
## 0.8282201

Correlation – Annual Revenue to Employees, section Other Community

## 
##  Pearson's product-moment correlation
## 
## data:  q4_7 and q1_12a5a
## t = 1.7646, df = 63, p-value = 0.08247
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.02837827  0.43775115
## sample estimates:
##       cor 
## 0.2170229

One can see that the correlations vary quite widely by industry. In the Manufacturing and Retail & Repair sections, the correlations of 0.348 and 0.378 are moderate, but not particularly strong. The correlation in the Other Community section is the weakest of the 4 selected sections at 0.217, and with a p-value of 0.082 and a confidence interval that includes zero, it is not statistically significant at the 5% level. On the other hand, in the “Hotels and Restaurants” section, the correlation of 0.828 is much higher than in the other sections.
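The confidence intervals in the outputs above come from the Fisher z-transformation, which makes it easy to see why the interval for Other Community is so wide: that test rests on only 65 firms. The sketch below recomputes two of the intervals from the reported correlations and degrees of freedom; `fisher_ci` is a helper written for this illustration.

```r
# Confidence interval for a Pearson correlation via the Fisher z-transformation.
fisher_ci <- function(r, n, level = 0.95) {
  z <- atanh(r)            # transform r to the z scale
  se <- 1 / sqrt(n - 3)    # standard error on the z scale
  crit <- qnorm(1 - (1 - level) / 2)
  tanh(z + c(-1, 1) * crit * se)  # transform the bounds back to the r scale
}

# Values from the outputs above; n = df + 2.
ci_other  <- fisher_ci(r = 0.2170229, n = 63 + 2)   # Other Community
ci_hotels <- fisher_ci(r = 0.8282201, n = 178 + 2)  # Hotels and Restaurants

print(round(ci_other, 4))
print(round(ci_hotels, 4))
```

The Other Community interval straddles zero, while the Hotels and Restaurants interval lies entirely above 0.77.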

Do these correlations contradict what we inferred from the previous plots? Perhaps we can say that the correlations temper those findings. While there is something of a tendency in the Retail & Repair section for revenue to increase as the number of employees increases, the correlation indicates that the relationship exists but is not particularly strong; further, we have not done any analysis to quantify the intensity of the relationship where it exists.

In any case, this avenue of analysis has been an interesting exploration of the labour intensity by different industries. We can potentially dig deeper by using the more granular ISIC classifications.

Exploring the sources of financing for businesses

Now, let’s explore some of the other characteristics of the business. The dataset contains information about sources of loans. The respondent is asked to give a series of yes / no answers about whether the business takes credit from various types of sources, such as banks or moneylenders. In order to create a plot that allows us to compare the responses, the data requires some manipulation to achieve the right input format. I used the “melt” function in the “reshape2” package to do so.
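The reshaping step turns one column per credit source into one row per firm and source. The sketch below shows the same wide-to-long idea in base R rather than with `melt`, so that it runs without extra packages; the column names and values are hypothetical, with answers coded 1 = Yes and 2 = No as in the survey.

```r
# Hypothetical wide data: one row per firm, one yes/no column per credit source.
firms <- data.frame(
  firm_id = 1:4,
  q4_7 = c(500000, 1200000, 300000, 8000000),
  bank = c(2, 2, 2, 1),
  moneylender = c(1, 2, 2, 2),
  family = c(2, 1, 1, 2)
)
source_cols <- c("bank", "moneylender", "family")

# Wide to long: stack the source columns, repeating the id columns for each one.
long <- data.frame(
  firm_id = rep(firms$firm_id, times = length(source_cols)),
  q4_7 = rep(firms$q4_7, times = length(source_cols)),
  financesource = rep(source_cols, each = nrow(firms)),
  answer = unlist(firms[source_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Keep only the "Yes" answers, e.g. to summarise revenue by source of credit.
long_yes <- long[long$answer == 1, ]
print(long_yes)
```

In the long format, the source of credit becomes a single categorical column, which is what faceting and grouped summaries need.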

These faceted boxplots reveal that annual revenue (y axis) varies widely by source of credit (x axis, 1 = Yes, 2 = No). Firms borrowing from Government and Private Banks show by far a higher median revenue than firms using the other options, and the difference between the 1st and 3rd quartiles is much wider than for the other sources, suggesting that banks lend to a wider variety of larger firms. Firms with smaller revenue turned to Family and Friends, Moneylenders, and Microfinance organisations for credit; that they would do so is not surprising, considering that banks have traditionally lent to larger and more established firms. Microfinance organisations, moneylenders, and family and friends are often sources of credit for businesses that have trouble obtaining loans from banks.

## surveyfinance_yes$financesource: Family and Friends
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    10.0   354.0   698.0   795.7  1263.0  1684.0 
## --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
## surveyfinance_yes$financesource: Government bank
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    11.0   687.8   967.0   931.7  1211.0  1725.0 
## --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
## surveyfinance_yes$financesource: Microfinance Organization
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     2.0   476.0   954.0   927.2  1428.0  1717.0 
## --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
## surveyfinance_yes$financesource: Moneylender
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    50.0   572.8  1006.0   915.2  1269.0  1725.0 
## --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
## surveyfinance_yes$financesource: Other
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      42      83    1352     855    1396    1509 
## --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
## surveyfinance_yes$financesource: Private bank
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    10.0   415.5   855.0   854.2  1272.0  1722.0

A look at the actual summary statistics for the loan data confirms the above assessment: the median loan sizes for Government Banks (6350000) and Private Banks (4800000) are far higher than those of the other sources of credit; the closest is Moneylenders at 23000000 takas.

Does the lending vary by sector?

Let’s look at how the different categories of lenders lend in different sectors.

Lending by sector – Government Banks

For firms borrowing from government banks, what is immediately clear is that the median annual revenue and the 1st and 3rd quartiles for firms in the manufacturing sector are markedly higher than in the other sectors (in this example, literally off the charts, because I chose the scaling to be consistent with the other lending categories so as not to distort the results in the other sectors). Hotels and restaurants that borrow from government banks have the next highest median annual revenue, with Retail & Repair after that. None of the other sectors seem to receive loans from Government Banks in any significant way.

Lending by sector – Moneylender

For firms that borrow from Moneylenders, the revenue for Manufacturing again seems to be much higher than in the other sectors, though the median value and box quartiles are much lower than for manufacturing firms that borrow from Government Banks. In comparison, the revenue in other sectors that borrow from moneylenders seems to be negligible; however, the scaling in place to facilitate comparison between lender types may also distort our perspective. Let’s have a look at the same chart, but with an adjusted scale on the y axis.

Lending by sector – Moneylender – Y axis rescaled

The scale adjustment only seems to emphasise the high annual revenue of manufacturers. The sector with the next highest median revenue is Retail & Repair, but that is substantially lower than that of Manufacturing.

Lending by sector – Microfinance

Microfinance lenders, in contrast to the above two lender types, appear to have a much lower level of lending to firms in the Manufacturing section. In fact, the ISIC section that appears to have higher levels of lending than all of the other sections is Logistics & Communications.

Let’s look at the data rescaled on the y axis to better understand lending by Microfinance.

The rescaled view confirms what we saw before: Microfinance loans are made to firms in the Logistics & Communications section that have a median revenue much higher than in other sections. The section with the next highest median revenue is Hotels and Restaurants; Manufacturing and Retail & Repair follow, and appear to have about the same median revenue. It’s worth noting that, other than in Logistics & Communications, in the other 3 sections that receive significant microfinance lending, revenue for firms that receive microfinance loans tends to be lower than for firms that do not; for example, in the Retail & Repair section, median revenue for firms that receive microfinance loans is lower than for firms that do not. Such a situation is no surprise: Microfinance lenders offer smaller loans designed for smaller firms. Bearing that in mind, however, the revenues of Logistics & Communications firms that borrow from Microfinance lenders seem anomalously high.

Due to the hierarchical nature of the ISIC classification system, we can look at the most granular ISIC level to try to obtain more granular insights.

Annual revenue for Transport firms that borrow from Microfinance lenders

In fact, the boxplots indicate that within the Logistics & Communications section, the only firms that receive Microfinance loans have an ISIC level 4 classification of “Telecommunications”.

Final Plots and Summary

Plot One

Description One

My first final plot is the map of Bangladesh that we used near the beginning of this project, with some features added to provide context. I plotted the 3 largest cities in Bangladesh on the map, with orange circles indicating the population size of each. As wealth often concentrates in urban areas (which are typically centers of economic activity), I wanted to test the idea that the firms with the highest revenue coincide with the major urban centers in Bangladesh. In fact, in this case, the highest average revenues do coincide with the three major urban centers: Khulna district has the highest average revenue, 29 million takas, and contains the city of Khulna, the 3rd largest in Bangladesh. Chittagong district, with the second highest average revenue (15.5 million takas), contains the second-largest city in Bangladesh. And the district with the 3rd highest average revenue (15.1 million takas) is Dhaka district, where the capital and largest city of Bangladesh is located.

So, it seems that urbanization in Bangladesh coincides with high annual revenue for firms in those urban locations.

Plot Two

Description Two

Plot two builds on the point plot comparison above, refining it by adding trendlines. This plot has the following features:

  • A log10 scale is applied to the Y axis (Annual Revenue 2009);
  • Jitter is applied to avoid overclustering of points;
  • The points are colored by ISIC section;
  • Trendlines are added for select ISIC sections, to highlight the differences between them.

The trendlines emphasize the findings from the original plot: within the Retail & Repair section, annual revenue increases at a distinctly faster rate as employees increase than in the Manufacturing and Hotels and restaurants sections. Using this combination of plotting methods, we have been able to identify a differentiating characteristic between certain sectors: at higher levels of revenue, Retail & Repair firms have fewer employees than firms in the Manufacturing and Hotels & Restaurants sections.

Plot Three

Description Three

Plot three is a refinement of the box plots that were explored above. Those box plots informed the creation of plot three in a number of ways.

First, to make an effective and attractive visualization that captures the salient findings, it helps to limit the number of items within each category on the plot. For the ISIC section variable, I selected items that have enough observations to draw meaningful insights. We can tell from the box plots and the point plots that some ISIC sections, such as “Health & Social Work” and “Other Community”, do not have many observations, so it is difficult to glean any insights about them. I selected 3 sections: “Hotels and restaurants”, “Manufacturing”, and “Retail and Repair”, the same items that I highlighted in plot two with trendlines. Considering the other categorical variable, Source of Financing, I had noted that these 3 ISIC sections displayed some distinctive characteristics as the source of financing changes. I also included “Logistics and Communications” because, for the Source of Financing “Microfinance Organization”, its revenue patterns were distinctive when compared to the other ISIC sections.

Second, I noted that when faceting the Source of Financing “Government Banks”, it was difficult to create a scale that could adequately show the range of responses for annual revenue; scaling the y axis to appropriately show the “Manufacturing” section would distort the plots of the other sections. To resolve this issue, I applied the “coord_flip” option to make a horizontal boxplot.
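As an illustration of the horizontal layout, the sketch below draws faceted box plots and flips them with ggplot2's `coord_flip()`. The data are simulated, and the column names `isic_section` and `financing` are assumptions; only `q4_7` is a survey variable name.

```r
library(ggplot2)

# Simulated stand-in data: 20 firms per section and financing source.
set.seed(2)
firms <- expand.grid(
  isic_section = c("Manufacturing", "Retail & Repair"),               # assumed name
  financing    = c("Government Banks", "Microfinance Organization"),  # assumed name
  rep          = 1:20
)
firms$q4_7 <- 10^rnorm(nrow(firms), mean = 5, sd = 0.5)

p <- ggplot(firms, aes(x = isic_section, y = q4_7)) +
  geom_boxplot() +
  # One panel per source of financing, stacked vertically
  facet_wrap(~ financing, ncol = 1) +
  # Swap the axes so revenue runs along the horizontal axis
  coord_flip() +
  labs(x = "ISIC section", y = "Annual revenue 2009 (takas)")
```

With the axes flipped, the revenue range is spread across the width of the page and the section labels read horizontally.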

Third, I saw that being specific about the order of the data would help to bring out the story of the data. So, I adjusted the input data frame to order both categorical variables, to generally place the ISIC sections and sources of financing with higher annual revenue towards the top of the plot, while maintaining the order of sources of financing within each sub-plot to facilitate comparison.
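One way to impose such an ordering is to reorder the factor levels by median revenue with base R's `reorder()`. The sketch below uses simulated values, and the column name `financing` is an assumption.

```r
# Simulated revenue for three sources of financing (values are NOT real).
set.seed(3)
firms <- data.frame(
  financing = rep(c("Moneylenders", "Government Banks",
                    "Microfinance Organization"), each = 30),
  q4_7 = c(rlnorm(30, meanlog = 12),   # moneylenders
           rlnorm(30, meanlog = 15),   # government banks
           rlnorm(30, meanlog = 9))    # microfinance organizations
)

# Order the factor levels by median revenue, lowest first.
# On a flipped (horizontal) boxplot the last level is drawn at the top,
# so the source with the highest median revenue ends up at the top.
firms$financing <- reorder(firms$financing, firms$q4_7, FUN = median)
levels(firms$financing)
```

Because ggplot2 draws discrete categories in factor-level order, setting the levels in the data frame keeps the order the same in every facet.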

This plot captures the relevant information that we gathered from the box plots. We can observe that government banks tend to provide credit to firms with higher levels of revenue, while moneylenders and microfinance organizations tend to lend to firms with relatively lower levels of revenue. We can also see the differentiated levels of revenue between the different ISIC sections, with Manufacturing firms having a markedly higher revenue than the other sectors.

Most interestingly, we can observe a potential segmentation of the finance market by discerning patterns in how different lenders tend to lend to different industries. Microfinance organizations are designed to lend to very small firms, so one would expect the firms that receive credit from such sources to have lower annual revenue. This seems to hold true in most of the sections; however, firms in Logistics & Communications that receive microfinance loans have a median revenue that is much higher than that of microfinance loan recipients in other sections. Government banks, on the other hand, may tend to target credit towards larger firms. This would explain why firms in the Manufacturing section that take government bank loans show much higher levels of revenue than firms in other sections; manufacturers seem to be a good market segment for government banks. Indeed, the growth of manufacturing in Bangladesh has been a well-known story of success, so much so that it is probably considered a strategic industry in the country, which is something that governments would tend to encourage (see http://www.mckinsey.de/sites/mck_files/files/2011_McKinsey_Bangladesh.pdf for more information).


Reflection

This process has been moderately successful in elucidating trends in Bangladeshi firms’ annual revenue across a number of dimensions, such as geography, number of employees, industry sector, and financing source. The intention was to find how these characteristics relate to revenue and, more broadly, to understand the dynamics of the market in which firms operate in Bangladesh.

My analysis came up with a number of interesting conclusions.
First, we observed that annual revenue varied by geography, with the 3 districts having the highest average revenue also containing the 3 largest cities in Bangladesh. This fact supports the hypothesis that there is a positive correlation between urbanization and economic activity. For the purpose of this analysis, the ability to plot this information on a map proved to be essential.

We also have a better understanding of how the two measures of business size in this project (annual revenue and number of employees) were positively correlated, though not strongly so. Our analysis of this relationship was enhanced when we broke down the analysis using the ISIC section variable as an indicator of industry sector, to find that revenue in certain sectors (such as Retail and Repair) had a higher rate of growth as employees increased than in other sectors, such as Manufacturing. Box plots were particularly useful for exploring the ISIC section data; further, viewing the business size data on a point plot, colored by ISIC section, allowed us to see the differentiated trends of business size across industries.

Additionally, we have a better understanding of financing trends. Certain sources of finance, such as government banks, tend to lend to firms with higher revenue, while others, such as microfinance organisations, tend to focus on firms with low revenue. These trends in revenue may mean that sources of financing tend to target certain sectors; government banks lend to a large number of manufacturing firms, as those firms have a high level of revenue. Boxplots were used extensively to arrive at these conclusions.

One major limitation of this analysis was that such insights were limited to only a selection of ISIC sections; this limitation was probably due to the limited availability of observations for ISIC sections beyond those that we examined in depth.

One initial challenge that I had to work through was how to choose a limited set of variables to explore when so many were available. This also means that there is a lot more data to explore within this data set; the data includes information such as family characteristics and education levels of business owners. Within this data set, I believe there are many other opportunities to segment businesses based on characteristics other than the ones we explored here.