Programming and Technology
We need a list of the top 100 cities for each of the following countries: USA, Mexico, Colombia, Argentina, Chile, Peru and Uruguay.
We also need a second list with the regions/states of each country.
In addition, we require the software used to extract the locations.
The cities list shall have the following fields:
City Name, State Name, Country Name, Aliases
The states list shall have the following fields:
State Name, Country Name, Aliases
For example, suppose the same exercise were done for European countries, including Spain. The city of Barcelona should appear in the cities list and look like this:
City Name | State Name | Country Name | Aliases
Barcelona | Catalonia | Spain | Bcn, Barna
The aliases are needed because many Twitter locations say "Barcelona, Spain", while others just say "barna" or "bcn".
Likewise, Catalonia should appear in the regions/states list and look like this:
State Name | Country Name | Aliases
Catalonia | Spain | Catalunya, Cataluña
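As an illustration of the two record layouts, here is a minimal Python sketch. The class names `City` and `State`, the method `to_row` and the pipe-separated output are assumptions made for this example only; the brief does not prescribe a file format or a programming language.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class City:
    """One row of the cities list (field names follow the brief)."""
    city_name: str
    state_name: str
    country_name: str
    aliases: List[str] = field(default_factory=list)

    def to_row(self) -> str:
        # Aliases are comma-separated inside the last column.
        columns = [self.city_name, self.state_name, self.country_name,
                   ", ".join(self.aliases)]
        return " | ".join(columns)


@dataclass
class State:
    """One row of the regions/states list."""
    state_name: str
    country_name: str
    aliases: List[str] = field(default_factory=list)

    def to_row(self) -> str:
        columns = [self.state_name, self.country_name, ", ".join(self.aliases)]
        return " | ".join(columns)


# The two examples from the brief.
barcelona = City("Barcelona", "Catalonia", "Spain", ["Bcn", "Barna"])
catalonia = State("Catalonia", "Spain", ["Catalunya", "Cataluña"])

print(barcelona.to_row())
print(catalonia.to_row())
```

Keeping the aliases as a list in memory and joining them only on output makes it easy to append new aliases as they are discovered.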
We will provide a dataset with hundreds of thousands of anonymous locations from all over the world, extracted from the Twitter public API.
The dataset is a text file in which each line contains a location exactly as it is written on Twitter. Some lines will be useless, like "in my house"; some will be quite simple, like "New York"; and others will be harder to resolve, like "León, Gjto", which means "León, Guanajuato". In that case you are supposed to infer that "gjto" is an alias for Guanajuato (a Mexican state).
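Because the same place is written with different casing, accents and punctuation, it probably helps to normalize each line before comparing it with anything. The sketch below is one possible way to do that in Python; the function names `normalize` and `split_location`, and the choice of commas, slashes and pipes as separators, are assumptions for illustration and not part of the brief.

```python
import re
import unicodedata
from typing import List


def normalize(text: str) -> str:
    """Lowercase, strip accents, and turn punctuation into single spaces."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    cleaned = re.sub(r"[^\w\s]", " ", without_accents.lower())
    return " ".join(cleaned.split())


def split_location(raw: str) -> List[str]:
    """Split a raw location on commas, slashes or pipes and normalize each part."""
    parts = (normalize(part) for part in re.split(r"[,/|]", raw))
    return [part for part in parts if part]


for line in ["León, Gjto", "New York", "in my house"]:
    print(line, "->", split_location(line))
```

Note that stripping accents also turns "ñ" into "n", so the normalized form is only suitable as a matching key; the original spelling should be kept for the output lists.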
As a starting point, we suggest that the software work as follows. This is just an idea, so you can follow this approach or use whatever you think will work better:
Create a list with the country names and, possibly, the official region/state names and city names.
Iterate over the provided locations and extract the ones that match entries in those lists.
Analyze those matches to find and include new aliases.
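The three steps above could be sketched roughly as follows. This is only one possible reading of the suggestion: the seed dictionaries, the sample lines, the function names and the `min_count` threshold are all hypothetical, and the heuristic (a known city followed by an unknown second part suggests an alias for that city's state) is an assumption rather than a required method.

```python
import unicodedata
from collections import Counter, defaultdict
from typing import Dict, Iterable, List

# Step 1: seed lists (tiny hypothetical sample; keys are normalized names).
CITY_TO_STATE = {"leon": "Guanajuato", "barcelona": "Catalonia"}
KNOWN_NAMES = {"guanajuato", "catalonia", "mexico", "spain"}


def key(text: str) -> str:
    """Normalized matching key: no accents, lowercase, single spaces."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.lower().split())


def candidate_state_aliases(
    locations: Iterable[str], min_count: int = 2
) -> Dict[str, List[str]]:
    """Steps 2 and 3: match "city, something" lines and count unknown second parts."""
    counts = defaultdict(Counter)
    for raw in locations:
        parts = [key(p) for p in raw.split(",") if p.strip()]
        if len(parts) != 2:
            continue  # only handle the simple "city, rest" shape here
        city, rest = parts
        state = CITY_TO_STATE.get(city)
        if state is None or rest in KNOWN_NAMES:
            continue  # unknown city, or the second part is already a known name
        counts[state][rest] += 1
    result = {}
    for state, counter in counts.items():
        frequent = [alias for alias, n in counter.most_common() if n >= min_count]
        if frequent:
            result[state] = frequent
    return result


# Invented sample lines in the style of the dataset described in the brief.
SAMPLE = [
    "León, Gjto", "Leon, gjto", "León, Guanajuato", "León, MX",
    "Barcelona, Spain", "New York", "in my house",
]
print(candidate_state_aliases(SAMPLE))
```

With this heuristic a second part such as "MX" would be proposed as a state alias if it appeared often enough, so the candidates still need a review step before they are added to the final lists. Single-word aliases such as "barna" are not covered by this sketch and would need a different signal.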