Web Scraping: a case study of the 2016 plebiscite in Colombia

Description and Motivation

The 2016 plebiscite on the peace agreements in Colombia was the endorsement mechanism for approving the agreements between the Colombian government and the "Fuerzas Armadas Revolucionarias de Colombia" (FARC). The vote was held on Sunday, October 2, 2016.

This was the question posed in the plebiscite: "¿Apoya usted el acuerdo final para la terminación del conflicto y la construcción de una paz estable y duradera?" ("Do you support the final agreement to end the conflict and build a stable and lasting peace?")

Given this context, the project uses Python scraping tools to retrieve the results at the municipal and departmental levels, and then builds a dashboard in Google Data Studio from the findings.


Methodologies

  • The scraping of the Registraduría Nacional site involves the following steps:

  • We scrape the main page of the vote count to extract the access link to each of Colombia's departments. Then, from each department's page, we extract the links to each of its municipalities.
  • We scrape each municipal page to extract the voting results, such as the participation percentage and the plebiscite approval.

In this project, we extract the information of 1,186 municipalities across 34 departments (including consulates and Bogotá D.C.).

  • After extracting the information, we build a dashboard in Google Data Studio. The visualization below shows the dashboard that results from the scraping. A minimal sketch of the pipeline follows this list.
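
For orientation, this is a minimal end-to-end sketch of the pipeline, assuming the variables and functions defined in the sections below (departamentos1, func_mun and busqueda):

# Sketch only: the full pipeline in three steps, once the code below
# has defined departamentos1 (department links), func_mun and busqueda
total = pd.concat(func_mun(l, d) for l, d in zip(departamentos1['link'], departamentos1['departamento']))
base_total = pd.concat(busqueda(l) for l in total['link']).reset_index(drop=True)
base_total.to_csv("base_total1.csv", encoding='utf-8', index=False)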

Scraping process

We begin by importing the libraries needed to manipulate the data and extract it from the website. For this purpose we use pandas, BeautifulSoup and requests, as shown in the following code:

import re
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
import warnings
from requests.packages.urllib3.exceptions import InsecureRequestWarning

# Requests are made with verify=False, so we silence the TLS warnings that would follow
warnings.simplefilter('ignore', InsecureRequestWarning)

Once the libraries are imported, we use requests to download the HTML of the main results page, in order to find the links to the plebiscite results of each department.

html = requests.get("https://elecciones.registraduria.gov.co/pre_plebis_2016/99PL/DPLZZZZZZZZZZZZZZZZZ_L1.htm", verify=False).text
# Structure the data from the HTML file
soup = BeautifulSoup(html, "lxml")
# Capture the anchors that link to each department's results page
depa=re.findall('<a href="../99PL/DP["\w\s.>‘]+', str(soup))[2:-1]
departamentos=pd.DataFrame()
departamentos['link1']=depa

Next, with the list of links by department, we clean the data to generate the inputs for the second, municipal-level scraping.

# Strip the surrounding HTML, leaving "link,DEPARTMENT"; the relative link
# is always 32 characters long and the department name starts at position 33
departamentos['link']=departamentos['link1'].map(lambda x:x.replace('<a href="../', '').replace("</a></li>", "").replace('">', ",")).map(lambda x:x[:32])
departamentos['departamento']=departamentos['link1'].map(lambda x:x.replace('<a href="../', '').replace("</a></li>", "").replace('">', ",")).map(lambda x:x[33:])
departamentos1=departamentos[['link', 'departamento']]
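
For illustration, a hypothetical raw entry (in the shape the regex above captures) shows how the fixed-position slicing separates the link from the department name:

# Illustration only: 'raw' is a made-up entry; real links share this length
raw = '<a href="../99PL/DPL99999ZZZZZZZZZZZZ_L1.htm">ANTIOQUIA</a></li>'
clean = raw.replace('<a href="../', '').replace("</a></li>", "").replace('">', ",")
print(clean[:32])   # '99PL/DPL99999ZZZZZZZZZZZZ_L1.htm' -> relative link
print(clean[33:])   # 'ANTIOQUIA'                        -> department name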

The function that loops over a department's page to extract the list of municipalities belonging to it is described next.

def func_mun(link, departamento):
    html = requests.get("https://elecciones.registraduria.gov.co/pre_plebis_2016/"+link, verify=False).text
    # We structure the data from the HTML file
    soup = BeautifulSoup(html, "lxml")
    # Capture the <option> entries of the municipality selector
    muni=re.findall('<option value="../99PL/DPL["\w\s.>‘]+', str(soup))
    municipios=pd.DataFrame()
    municipios['link1']=muni
    # Keep only municipal entries: their second numeric code has more than two digits
    municipios['numeros']=municipios['link1'].map(lambda x:re.findall('[\d]+', str(x)))
    municipios['indicativos']=municipios['numeros'].map(lambda x:len(x[1]))
    municipios1=municipios[municipios['indicativos']>2].copy()  # .copy() avoids SettingWithCopyWarning
    municipios1['link']=municipios1['link1'].map(lambda x:x.replace('<option value="../', '')).map(lambda x:x[:32])
    municipios1['municipio']=municipios1['link1'].map(lambda x:x.replace('<option value="../', '')).map(lambda x:x[34:])
    municipios1['departamento']=departamento
    return municipios1

This is the loop used to extract the municipal links of each department.

total=pd.DataFrame()
for i in range(len(departamentos1)):
    print("Iteracion", i)
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    total=pd.concat([total, func_mun(departamentos1['link'][i], departamentos1['departamento'][i])])
total=total.reset_index(drop=True)
total1=total[['link', 'municipio', 'departamento']]

The following function extracts each municipality's results: the number of votes for "yes" and "no" with their percentages, the turnout, the reported polling stations, and the valid, unmarked, and null votes.

def busqueda(link):
    html = requests.get("https://elecciones.registraduria.gov.co/pre_plebis_2016/"+link, verify=False).text
    # We structure the data from the HTML file
    soup = BeautifulSoup(html, "lxml")
    # Header boxes: reported polling stations and turnout
    a=re.findall('<div class="cajaDatosHeader"><b>[\w\s+%.-]+|<div class="cajaDatosBody"><span class="porcentajesCajas">[\w,+%.-]+|</span><br/><span class="descripcionCaja">[\w,\s%.-]+', str(soup))
    ba=[]
    ba.append(a[1])                              # percentage of polling stations reported
    bb=[]
    bb.append(a[2])
    m=re.search("([\w]+) de ([\w]+)", str(bb))   # stations used "de" (of) stations available
    bc=[]
    bc.append(a[4])                              # turnout percentage
    bd=[]
    bd.append(a[5])
    v1=re.findall("[\d.]+", str(bd))[0]          # voters who cast a ballot
    v2=re.findall("[\d.]+", str(bd))[1]          # eligible voters
    # Result bars: percentages and vote counts for "yes" and "no"
    b=re.findall('<div class="skill-bar-percent pVotos">[\w,%()\s.]+|<div class="skill-bar-percent">[\w,()%\s.]+|<div class="skillbar-title"><span>[\w(),%\s.]+', str(soup))
    be=[]
    be.append(b[1])
    app=re.findall("[\d,%.]+", str(be))          # "yes" percentage

    bf=[]
    bf.append(b[2])
    ap=re.search("[\d.]+", str(bf))              # "yes" vote count

    bg=[]
    bg.append(b[4])
    noapp=re.findall("[\d,%.]+", str(bg))        # "no" percentage

    bh=[]
    bh.append(b[5])
    noap=re.search("[\d.]+", str(bh))            # "no" vote count
    # Bottom boxes: valid, unmarked and null votes
    c=re.findall('<div class="contenido"><b>[\w\s.¡]+|</b><br/>[\d.]+', str(soup))
    ca=[]
    ca.append(c[1])
    nu1=re.findall("[\d.]+", str(ca).replace("</b><br/>", ""))
    cb=[]
    cb.append(c[3])
    nu2=re.findall("[\d.]+", str(cb))
    cc=[]
    cc.append(c[5])
    nu3=re.findall("[\d.]+", str(cc))

    # Assemble a one-row DataFrame; percentages use a comma as decimal
    # separator and counts use a dot as thousands separator
    base=pd.DataFrame()

    base['link_real']=''
    base['nombre_muni']=''
    base['nombre_dep']=''
    base['mesas_informadas%']=ba
    base['mesas_informadas%']=base['mesas_informadas%'].map(lambda x:x.replace('<div class="cajaDatosBody"><span class="porcentajesCajas">',""))
    base["mesas_informadas%"]=base["mesas_informadas%"].map(lambda x:str(x).replace("%","").replace(",",".")).astype(float)
    base['mesas_usadas']=m.group(1)
    base['mesas_usadas']=base['mesas_usadas'].map(lambda x:str(x).replace("%","").replace(".","")).astype(int)
    base['mesas_disponibles']=m.group(2)
    base['mesas_disponibles']=base['mesas_disponibles'].map(lambda x:str(x).replace("%","").replace(".","")).astype(int)

    base['votantes%']=bc
    base['votantes%']=base['votantes%'].map(lambda x:x.replace('<div class="cajaDatosBody"><span class="porcentajesCajas">',""))
    base["votantes%"]=base["votantes%"].map(lambda x:str(x).replace("%","").replace(",",".")).astype(float)
    try:
        base['votantes_cumplieron']=v1
        base['votantes_cumplieron']=base['votantes_cumplieron'].map(lambda x:str(x).replace("%","").replace(".","")).astype(int)
        base['votantes_habilitados']=v2
        base['votantes_habilitados']=base['votantes_habilitados'].map(lambda x:str(x).replace("%","").replace(".","")).astype(int)
    except ValueError:
        # Some pages lack these figures; keep the row with missing values
        base['votantes_cumplieron']=None
        base['votantes_habilitados']=None
    base['apoya%']=app
    base["apoya%"]=base["apoya%"].map(lambda x:str(x).replace("%","").replace(",",".")).astype(float)
    base['apoya#']=ap.group(0)
    base['apoya#']=base['apoya#'].map(lambda x:str(x).replace(".","")).astype(int)
    base['no_apoya%']=noapp
    base["no_apoya%"]=base["no_apoya%"].map(lambda x:str(x).replace("%","").replace(",",".")).astype(float)
    base['no_apoya#']=noap.group(0)
    base['no_apoya#']=base['no_apoya#'].map(lambda x:str(x).replace(".","")).astype(int)
    base['votos_validos']=nu1
    base['votos_validos']=base['votos_validos'].map(lambda x:str(x).replace("%","").replace(".","")).astype(int)
    base['votos_no_marcados']=nu2
    base['votos_no_marcados']=base['votos_no_marcados'].map(lambda x:str(x).replace("%","").replace(".","")).astype(int)
    base['votos_anulados']=nu3
    base["votos_anulados"]=base["votos_anulados"].map(lambda x:str(x).replace(".","")).astype(int)
    return base
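
As a quick check before launching the full loop, busqueda can be run on a single municipality. A small sketch, assuming the total1 DataFrame built above:

# Inspect one municipality's row before scraping all 1,186
ejemplo = busqueda(total1['link'][0])
print(ejemplo[['votantes%', 'apoya%', 'no_apoya%']].iloc[0])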

Finally, we build the loop that scrapes every municipality in every department and stores the results in a DataFrame. That DataFrame is exported to CSV for use in Google Data Studio.

base_total=pd.DataFrame()
for i in range(len(total1['link'])):
    print("Iteracion:", i)
    base=busqueda(total1['link'][i])
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    base_total=pd.concat([base_total, base])
base_total=base_total.reset_index(drop=True)
# Attach the identifiers gathered in the earlier stages
base_total['link_real']=total1['link']
base_total['nombre_muni']=total1['municipio']
base_total['nombre_dep']=total1['departamento']
base_total.to_csv("base_total1.csv", encoding='utf-8', index=False)

Once the database has been generated and refined, we move to Google Data Studio to build the visualizations of the results.
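
Before connecting the CSV, a quick aggregation in pandas helps validate the totals. A minimal sketch, using the columns created by busqueda():

# Sanity check (sketch): re-read the exported CSV and total the votes
check = pd.read_csv("base_total1.csv")
print("Total Si:", check['apoya#'].sum(), "| Total No:", check['no_apoya#'].sum())
# Per-department totals, largest "yes" counts first
resumen = check.groupby('nombre_dep')[['apoya#', 'no_apoya#']].sum()
print(resumen.sort_values('apoya#', ascending=False).head())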

This process can be replicated on any website that does not render its content dynamically, i.e., that serves static HTML. A deeper analysis of the results is beyond the scope of this project, since it would have to account for additional demographic, political, and economic factors.

This publication is licensed by the author under CC BY 4.0