Twitter Geolocation Predictor
=============================

:Repository path: modaiwang/uiuc-twitter-geolocation
:Project name: 18.4.4 Tweet Geolocation Prediction
:Description: A text-based geolocation program written by a developer from abroad
:Primary language: Python
:License: Not specified
:Default branch: master
:Created: 2018-04-04
:Last updated: 2020-12-19

This is a deep-learning tool that predicts the location of a Twitter user
based solely on the text of their tweets, without any other form of metadata.

Overview
--------

The Twitter Geolocation Predictor is a recurrent neural network classifier.
Every training sample is a collection of tweets labeled with a location
(e.g. country, state or city). The model tokenizes the tweets into a sequence
of words and feeds them into an embedding layer. The embedding layer learns a
vector representation of each word, and these vectors are the input to two
stacked Long Short-Term Memory (LSTM) layers. A fully connected softmax layer
at the end yields the classification result.

.. image:: https://dl.dropbox.com/s/tvar2ccihtq0ijg/GeoModelGraph.png
   :width: 500px
   :align: center

Getting Started
---------------

Dependencies
~~~~~~~~~~~~

1. Python 3.5
2. tensorflow
3. keras
4. nltk
5. pandas
6. numpy
7. sqlalchemy
8. sklearn
9. psycopg2

Installation
~~~~~~~~~~~~

Clone the repository and install all the dependencies using pip.

.. code:: console

    $ git clone git@github.com:jmatias/uiuc-twitter-geolocation.git
    $ cd uiuc-twitter-geolocation
    $ sudo pip3 install -r requirements.txt

This will install the latest CPU version of TensorFlow. If you would like to
run on a GPU, follow the TensorFlow GPU installation instructions.

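As a rough illustration of the input pipeline outlined in the Overview, the
sketch below turns raw text into fixed-length sequences of integer word ids,
which is the form an embedding layer expects. It is not the project's own
tokenizer: the helper names ``build_vocab`` and ``encode``, the reserved ids
``PAD`` and ``OOV``, and the choice to pad at the end of the sequence are all
assumptions made for this example.

.. code:: python

    from collections import Counter

    PAD, OOV = 0, 1  # reserved ids; an assumption of this sketch


    def build_vocab(texts, vocab_size):
        """Map the most frequent words to integer ids starting at 2."""
        counts = Counter(word for text in texts for word in text.lower().split())
        # Two ids are reserved for padding and out-of-vocabulary words.
        most_common = counts.most_common(vocab_size - 2)
        return {word: idx + 2 for idx, (word, _) in enumerate(most_common)}


    def encode(text, vocab, max_words):
        """Turn one user's text into exactly max_words integer ids."""
        ids = [vocab.get(word, OOV) for word in text.lower().split()][:max_words]
        # Short sequences are padded at the end so every sample has equal length.
        return ids + [PAD] * (max_words - len(ids))


    texts = ["going to see star wars tonite", "pizza was delicious"]
    vocab = build_vocab(texts, vocab_size=20000)
    encoded = [encode(text, vocab, max_words=8) for text in texts]

Every row of ``encoded`` has the same length, so the rows can be stacked into
a single integer matrix with one row per user and one column per word position.
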
Using A Pre-Processed Dataset
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The tool comes with a built-in dataset of ~430K users located in the U.S.
(~410K for training, ~10K for development and ~10K for testing). To train a
model using this dataset, run the train.py sample script.

Note: The dataset has a size of approximately 2.5GB.

.. code-block:: console

    $ python3 train.py --epochs 5 --batch_size 32 --vocab_size 20000 --hidden_size 100 --max_words 100 --classifier state
    Using TensorFlow backend.
    Downloading data from https://dl.dropbox.com/s/ze4ov5j30u9rf5m/twus_test.pickle
    55181312/55180071 [==============================] - 11s 0us/step
    Downloading data from https://dl.dropbox.com/s/kg09i1z32n12o98/twus_dev.pickle
    57229312/57227360 [==============================] - 12s 0us/step
    Downloading data from https://dl.dropbox.com/s/0d4l6jmgguzonou/twus_train.pickle
    2427592704/2427591168 [==============================] - 486s 0us/step

    Building model...
    Hidden layer size: 100
    Analyzing up to 100 words for each sample.
    Building tweet Tokenizer using a 20,000 word vocabulary. This may take a while...
    Tokenizing tweets from 59,546 users. This may take a while...
    Training model...
    Train on 50000 samples, validate on 9546 samples
    Epoch 1/1
     1664/50000 [..............................] - ETA: 3:59 - loss: 3.8578 - acc: 0.0950 - top_5_acc: 0.2536

You can also try using this data from your own source code.

.. code-block:: ipython

    In [1]: from twgeo.data import twus_dataset
    Using TensorFlow backend.

    In [2]: x_train, y_train, x_dev, y_dev, x_test, y_test = twus_dataset.load_state_data()

    In [3]: x_train.shape
    Out[3]: (410336,)

    In [4]: y_train.shape
    Out[4]: (410336,)

    In [5]: x_train, y_train, x_dev, y_dev, x_test, y_test = twus_dataset.load_state_data(size='small')

    In [6]: x_train.shape
    Out[6]: (50000,)

    In [7]: y_train.shape
    Out[7]: (50000,)

Pre-Processing your own data
----------------------------

+------------------------------------------------------+-------------+
| Tweet Text                                           | Location    |
+======================================================+=============+
| Hello world! This is a tweet. This is another tweet. | Florida     |
+------------------------------------------------------+-------------+
| Going to see Star Wars tonite!                       | Puerto Rico |
+------------------------------------------------------+-------------+
| Pizza was delicious! I'm another tweeeeeet           | California  |
+------------------------------------------------------+-------------+

Given a raw dataset stored in a CSV file like the one shown above, we can
preprocess the data using :code:`twgeo.data.input.read_csv_data()`. This
function will:

1. Tokenize the tweet text.
2. Limit repeated characters to a maximum of 2. For example: 'Greeeeeetings'
   becomes 'Greetings'.
3. Perform Porter stemming on each token.
4. Convert each token to lower case.

The location data may be any string or integer value.

.. code:: python

    # Aliased so the module does not shadow Python's built-in input().
    import twgeo.data.input as twgeo_input

    tweets, locations = twgeo_input.read_csv_data('mydata.csv', tweet_txt_column_idx=0, location_column_idx=1)

Training the Model
------------------

.. code:: python

    from twgeo.models.geomodel import Model
    from twgeo.data import twus_dataset as twus

    # x_train is an array of text. Each element contains all the tweets for a given user.
    # y_train is an array of integer values, one per user, identifying the location
    # we want to train against.
    x_train, y_train, x_dev, y_dev, x_test, y_test = twus.load_state_data(size='small')

    # num_outputs is the total number of possible classes (locations).
    # In this example, 50 US states plus 3 territories.
    # time_steps is the total number of individual words to consider for each user.
    # Some users have more tweets than others. In this example, we cap the input
    # at a total of 500 words per user.
    geoModel = Model(batch_size=64)
    geoModel.build_model(num_outputs=53, time_steps=500, vocab_size=20000)

    geoModel.train(x_train, y_train, x_dev, y_dev, epochs=5)
    geoModel.save_model('mymodel')

Making Predictions
------------------

.. code:: ipython

    In [1]: from twgeo.models.geomodel import Model
    Using TensorFlow backend.

    In [2]: from twgeo.data import twus_dataset as twus

    In [3]: x_train, y_train, x_dev, y_dev, x_test, y_test = twus.load_state_data(size='small')

    In [4]: geoModel = Model()

    In [5]: geoModel.load_saved_model('mymodel')
    Loading saved model...

    In [6]: geoModel.predict(x_test)
    Out[6]: array(['CA', 'FL', 'NY', ..., 'TX', 'MA', 'KY'], dtype=object)

Results
-------

The built-in TWUS dataset was used to train US State and US Census Region
classifiers. With a hidden layer size of 300 neurons, a time-step window of
500 words and a vocabulary size of 50,000 words, the model achieves the
following results.

+-------------------------+-------------------+-----------------------+
| Classification Task     | Test Set Accuracy | Test Set Accuracy @ 5 |
+=========================+===================+=======================+
| US Census Region        | 73.95%            | N/A                   |
+-------------------------+-------------------+-----------------------+
| US State                | 51.44%            | 75.39%                |
+-------------------------+-------------------+-----------------------+
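
The "Accuracy @ 5" column is presumably a top-5 accuracy: a prediction counts
as correct when the true label is among the five classes with the highest
scores. The sketch below shows that calculation on toy data in plain Python.
The function name ``top_k_accuracy`` and the sample scores are made up for
illustration; this is not the evaluation code behind the numbers above.

.. code:: python

    def top_k_accuracy(scores, labels, k=5):
        """Fraction of samples whose true label is among the k best-scoring classes."""
        hits = 0
        for row, label in zip(scores, labels):
            # Class indices ordered from highest to lowest score.
            ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
            hits += label in ranked[:k]
        return hits / len(labels)


    # Toy example: 3 samples, 3 classes, one row of class scores per sample.
    scores = [
        [0.1, 0.6, 0.3],
        [0.5, 0.2, 0.3],
        [0.2, 0.3, 0.5],
    ]
    labels = [1, 2, 1]

    top_1 = top_k_accuracy(scores, labels, k=1)
    top_2 = top_k_accuracy(scores, labels, k=2)

With ``k=1`` this reduces to ordinary accuracy, so the top-k figure can never
be lower than the plain accuracy for the same model and test set.
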