How to Prevent a Bot From Tweeting the Same Tweet Again and Again

How to Build a Web-Scraping Twitter Bot

@marvelbeings is a simple Twitter bot I wrote in early 2019 during a vacation trip. The bot scrapes Marvel's official characters site and tweets the Marvel character's URL link, the photo of said character found on the site, along with a boilerplate string and a few hashtags. A user tapping or clicking on the link is taken to the character's individual page where they can read more about the character.

This post will serve as the documentation for @marvelbeings, as I have made the decision to deprecate it (at least for the time being). This post can also be used as a guide for anyone interested in building their own Twitter bot. My code is available on my GitHub.

Dependencies: requests, tweepy, beautifulsoup4, time

I wanted to keep this project as short and modular as possible. My code consists of 3 files, and this documentation is split into four sections. The first section briefly discusses how to go about registering a bot's Twitter account and then obtaining tweepy API credentials. I won't provide step-by-step details regarding setting up a Developer account, but I will provide links to documentation about the necessary actions. I should mention that I employed the official means to create this bot, which required me to enter a valid email address, register the account as a bot, and answer a few basic questions about the bot's use case. I encourage anyone developing their own Twitter bot to use the official means as well. The second section recounts how simple it was to scrape the Marvel website and retrieve the URL path to each character's page. The third section covers the bot's basic functionality, with a good amount of room for customization. The fourth section concludes this documentation and highlights a few ideas for further development.

In this documentation, I'll be treating the 3 files found on my GitHub as one script.

Section 1: Bot Account Setup and Connecting to tweepy

Twitter allows developers to create and manage automated accounts through its Twitter Dev program.

Before connecting to tweepy, you'll need to create your bot's Twitter account (if you haven't yet done so) and progress through the application process. Also, be sure to review the Restricted Use Cases to ensure that how you intend to use your bot is actually permitted. Follow these steps on the Twitter Dev website to generate access tokens.

Now that we have our authentication credentials, we can start writing some code.

Importing tweepy:

import tweepy

# creds: replace with your Twitter Developer credentials
con_key = "your consumer key"
con_sec = "your consumer secret"
acc_tok = "your access token"
acc_sec = "your access secret"

auth = tweepy.OAuthHandler(con_key, con_sec) # authenticate
auth.set_access_token(acc_tok, acc_sec) # grant access

api = tweepy.API(auth) # connect to API

If you're able to run the above code without error, you've successfully established your connection to tweepy and can now begin writing your bot's functionality.

Test tweet:

Try calling update_status(). The update_status() method is how your bot will update its status, AKA send out a tweet.

api.update_status("Bot's 1st tweet. Thanks tweepy!🐥")

After running update_status(), open your bot's Twitter page in a browser window and make sure the tweet shows up. To combat spam, Twitter does not allow an automated account to post a tweet that the account has already tweeted before. So, running the above line of code more than once will yield the duplicate status TweepError in the screenshot below.
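Twitter enforces this server-side, but a bot can also sidestep the error locally by remembering what it has already posted. A minimal standard-library sketch (safe_update_status and the sent set are my own illustration, not part of the bot's actual code; posted.append stands in for api.update_status):

```python
# Sketch: remember what was posted and skip duplicates locally, so the
# duplicate status error is never triggered. The in-memory set is
# illustrative; a real bot would persist it between runs.
sent = set()

def safe_update_status(post, status):
    """Post `status` via the callable `post` unless it was already posted."""
    if status in sent:
        return False  # duplicate: skip instead of raising an API error
    post(status)
    sent.add(status)
    return True

posted = []  # stand-in for api.update_status
assert safe_update_status(posted.append, "Bot's 1st tweet!") is True
assert safe_update_status(posted.append, "Bot's 1st tweet!") is False
print(posted)  # the duplicate was only posted once
```

In a real run you would pass api.update_status as the callable and write sent out to a file between invocations.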

Section 2: Scraping

Beautiful Soup 4 (bs4) is a standard library for exploring and collecting web page information. We'll also use the requests library to read the url we want our code to scrape.

from bs4 import BeautifulSoup
import requests

The Marvel characters url is "https://www.marvel.com/characters".

Make some Soup:

url = "https://www.marvel.com/characters"
web_stuff = requests.get(url) # allows code to read the url
html = web_stuff.text # returns the page's content in unicode
soup = BeautifulSoup(html, "html.parser") # gets the page's html

Now, in order to find the relevant information, we'll have to do some digging through the site's html. I've laid these out as numbered steps.

  1. Using Safari or Chrome, navigate to the url declared in the above code.
  2. Right click on one of the character panes and click "Inspect Element".
  3. This allows you to view the page's html structure and access its tags. I've used my favorite Avenger as an example in the screenshot below.

It can be a bit tricky determining which html tag(s) to grab. Luckily, we're only looking for the character's name and the path to the character's individual page. The <div> tag (div) I've highlighted in the above screenshot gives us both the name and the path. We can grab the entire div and then parse out the information we need.

4. We'll grab all divs of class "mvl-card mvl-card--explore" since we want to be able to get every character's info, and not just Spiderman's. To do this we'll call the find_all() method and pass it the tag type, which in this case is "div", and the class name, "mvl-card mvl-card--explore".

char_content = soup.find_all("div", class_="mvl-card mvl-card--explore")
print(type(char_content), "\n", char_content)
# type() should return "<class 'bs4.element.ResultSet'>"

Printing char_content returns the jungle of tags, text, and links you see in the below screenshot. This is because char_content contains all of the page's divs of class "mvl-card mvl-card--explore", and there are 48 such divs per page at the time of writing. I've highlighted part of Spiderman's block as an example of one of the divs.

We'll be looping through this collection of divs and sending off one tweet per div.

5. To extract the name and url path from each div, we'll simply split() the content of the div to grab the value of "data-click-text" for the character's name and split() after "data-click-url" for the path.

Character name:

tweeted_chars = [] # Keep track of tweeted characters

def get_name(content, i): # i is the index to begin loop at
    html_block = str(content[i])
    temp_name = html_block.split('data-click-text="', 1)
    character = temp_name[1].split('" data-click-type')[0]
    tweeted_chars.append(character) # Track tweeted characters
    return character

Character path:

We'll use the path contained in the div to create the full path to the character's page by appending that path to the top level https://www.marvel.com. The full path for Spiderman, for example, would be https://www.marvel.com + "/characters/spider-man-peter-parker", which when clicked/tapped will open Spiderman's individual page.

def get_url(content, i): # i is the index to begin loop at
    # First six divs are not characters.
    html_block = str(content[i]) # List of divs -> str
    temp_path = html_block.split('url="', 1) # Single out the url
    path = temp_path[1].split('" data-further-details', 1)[0]
    char_url = "https://www.marvel.com" + path # Full path
    return char_url

That's it for our scraper.
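Both helpers can be exercised offline against a hypothetical, trimmed-down div. The snippet below is my own approximation of the markup in the screenshot (attribute order and values are assumptions), with the tweeted_chars bookkeeping omitted for brevity; the split() logic is unchanged from the functions above:

```python
# Hypothetical trimmed-down div mimicking the attributes the helpers split on.
fake_div = ('<div class="mvl-card mvl-card--explore" '
            'data-click-url="/characters/spider-man-peter-parker" '
            'data-further-details="..." '
            'data-click-text="Spider-Man" data-click-type="card"></div>')

def get_name(content, i):  # bookkeeping (tweeted_chars) omitted for brevity
    html_block = str(content[i])
    temp_name = html_block.split('data-click-text="', 1)
    return temp_name[1].split('" data-click-type')[0]

def get_url(content, i):
    html_block = str(content[i])
    temp_path = html_block.split('url="', 1)  # matches data-click-url="
    path = temp_path[1].split('" data-further-details', 1)[0]
    return "https://www.marvel.com" + path

divs = [fake_div]
print(get_name(divs, 0))  # Spider-Man
print(get_url(divs, 0))   # https://www.marvel.com/characters/spider-man-peter-parker
```

Against the live page, `divs` would be the `char_content` ResultSet from find_all().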

Section 3: Start Tweeting

This is the easier part. We loop through char_content and pass each div to our two functions, which extract the character's name and create the character's full path. We can then pass both values to api.update_status() which, if you recall, is how our bot sends out tweets. We'll pass an f-string containing our values with some hashtags to help our bot gain some attention.

i = 6 # index start to be passed to get_name() and get_url()

At the time of writing, the Marvel web page is structured in such a way that the first six divs of class "mvl-card mvl-card--explore" do not contain any Marvel character data. For this reason, we begin our loop at index 6.

import time

for content in char_content[6:]: # skip the first six non-character divs
    api.update_status(
        f"Today's Marvel character is {get_name(char_content, i).upper()}" +
        f" 🤖 #Marvel #MarvelComics #Heroes #Villains {get_url(char_content, i)}"
    )
    i += 1
    print("Posting character " + str(i) + ".")
    time.sleep(86400) # Sleep for 86400 sec -> 1 day

Change the time.sleep() integer to 1 and let it run for five seconds to make sure the bot is grabbing and tweeting the characters in the order they appear on the site. Don't run it for too long, because you'll need to manually delete those tweets if you re-run the loop to avoid the duplicate status TweepError.
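A safer way to do that sanity check is a dry run that never touches the network at all. This is my own sketch, not part of the original script; the two lambdas are hypothetical stand-ins for the real get_name()/get_url(), and a real check would reuse the scraper's char_content:

```python
# Sketch of a no-network dry run: build the statuses the bot would post.
def dry_run(char_content, get_name, get_url, start=6):
    """Return the list of statuses, without calling the API."""
    statuses = []
    for i in range(start, len(char_content)):
        statuses.append(
            f"Today's Marvel character is {get_name(char_content, i).upper()}"
            f" #Marvel {get_url(char_content, i)}"
        )
    return statuses

fake_divs = ["pad"] * 6 + ["spider-man", "hulk"]  # 6 non-character slots
previews = dry_run(fake_divs,
                   lambda c, i: c[i],
                   lambda c, i: "https://www.marvel.com/characters/" + c[i])
for status in previews:
    print(status)
```

If the printed order matches the site, swap print back for api.update_status and restore the one-day sleep.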

Section 4: Determination & Further Development

This concludes the documentation for the @marvelbeings Twitter bot. Below is the final script along with two ideas for further development.

Be sure to keep your keys and tokens secure. Misuse of bots can get your account reported, deactivated, or worse.
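One concrete way to keep keys out of the script is to load them from environment variables instead of hard-coding them. A standard-library sketch (the TW_* variable names are my own invention, and the placeholder values exist only so the demo runs end to end):

```python
import os

# Hypothetical environment-variable names; any names work as long as
# they match what you export in your shell before running the bot.
CRED_VARS = ["TW_CON_KEY", "TW_CON_SEC", "TW_ACC_TOK", "TW_ACC_SEC"]

def load_creds():
    """Return the four credentials, failing loudly if any is unset."""
    creds = {name: os.environ.get(name) for name in CRED_VARS}
    missing = [name for name, value in creds.items() if not value]
    if missing:
        raise RuntimeError("Missing credentials: " + ", ".join(missing))
    return creds

# Demo only: fake values so the sketch runs end to end.
for name in CRED_VARS:
    os.environ.setdefault(name, "placeholder")
print(sorted(load_creds().keys()))
```

The bot would then call tweepy.OAuthHandler with the values from load_creds() rather than with string literals committed to GitHub.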

import tweepy
from bs4 import BeautifulSoup
import requests
import time

# creds: replace with your Twitter Developer credentials
con_key = "your consumer key"
con_sec = "your consumer secret"
acc_tok = "your access token"
acc_sec = "your access secret"

auth = tweepy.OAuthHandler(con_key, con_sec)
auth.set_access_token(acc_tok, acc_sec)
api = tweepy.API(auth) # connect to API

url = "https://www.marvel.com/characters"
web_stuff = requests.get(url) # allows code to read the url
html = web_stuff.text # returns the page's content in unicode
soup = BeautifulSoup(html, "html.parser") # gets the page's html

char_content = soup.find_all("div", class_="mvl-card mvl-card--explore")
print(type(char_content), "\n", char_content)

tweeted_chars = []
def get_name(content, i): # i is the index to begin loop at
    html_block = str(content[i])
    temp_name = html_block.split('data-click-text="', 1)
    character = temp_name[1].split('" data-click-type')[0]
    tweeted_chars.append(character) # Record tweeted characters
    return character # name

def get_url(content, i): # i is the index to begin loop at
    # First six divs are not characters.
    html_block = str(content[i]) # List of divs -> str
    temp_path = html_block.split('url="', 1) # Single out the url
    path = temp_path[1].split('" data-further-details', 1)[0]
    char_url = "https://www.marvel.com" + path # Full path
    return char_url

i = 6 # index start to be passed to get_name() and get_url()

for content in char_content[6:]: # skip the first six non-character divs
    api.update_status(
        f"Today's Marvel character is {get_name(char_content, i).upper()}" +
        f" 🤖 #Marvel #MarvelComics #Heroes #Villains {get_url(char_content, i)}"
    )
    i += 1
    print("Posting character " + str(i) + ".")
    time.sleep(86400) # Sleep for 86400 sec -> 1 day

Scrape Multiple Pages: There are 72 pages worth of Marvel characters on the Marvel website.

@marvelbeings currently only scrapes page 1. The first difficulty here is that the url path will not change as you jump from one page to another; it remains https://www.marvel.com/characters. Using Selenium Webdriver to navigate through the pages may be a good approach. Also, the first twelve Marvel characters on each page are the key characters in movies like Avengers Endgame and Captain Marvel, and do not change regardless of the page you navigate to. For all pages after page 1, you will need to begin your loop at index 18 instead of index 6 in order to avoid posting the same character more than once and to avoid receiving the duplicate status TweepError (which will also break the loop).
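That start-index rule can be captured in a small helper (my own sketch; the constants 6 and 18 come straight from the page structure just described):

```python
def start_index(page):
    """First char_content index holding a page-specific character.

    Page 1: the first 6 divs are non-character cards, so start at 6.
    Pages 2+: those 6 plus 12 fixed headline characters, so start at 18.
    """
    return 6 if page == 1 else 18

print(start_index(1), start_index(2))  # 6 18
```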

Randomize Characters: Instead of following the order in which the characters appear on the site, you can make the bot grab and tweet the characters in a random order by using Lib/random.py and setting your index (i) to a random integer within range(6, len(char_content)). You'd need to pass get_name() and get_url() the random integer, record that random integer, and ensure it isn't passed again, or you'll end up getting the duplicate status TweepError (which will also break the loop) for attempting to post an already-tweeted tweet.
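Rather than drawing one random integer at a time and re-checking it against a record, a simpler sketch is to shuffle all valid indexes up front with the standard library's random module, which makes repeats impossible by construction (tweet_order is my own helper, not part of the original script):

```python
import random

def tweet_order(n_divs, start=6, seed=None):
    """Return every valid div index exactly once, in random order."""
    rng = random.Random(seed)  # seed is optional, for reproducibility
    order = list(range(start, n_divs))
    rng.shuffle(order)
    return order

# 48 divs per page at the time of writing; indexes 6-47 are characters.
order = tweet_order(48)
assert sorted(order) == list(range(6, 48))  # each character exactly once
print(order[:5])
```

The bot's loop would then iterate over this list, passing each index to get_name() and get_url() in turn.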

Thanks a bunch for allowing me to share, more to come. 😃

Source: https://chatbotslife.com/marvel-web-scraper-twitter-bot-in-under-50-lines-of-code-453456c917c?source=post_internal_links---------4----------------------------
