2024
This Open Source Scraper CHANGES the Game!!! - YouTube https://www.youtube.com/watch?v=45hMI2QH1c8
Transcript: (00:00) This application can scrape any website on the internet using only the URL and the fields that you want to extract. For example, if you want to scrape data out of Hacker News, all we need to do is get the URL, place it in here, then define the fields that we want to extract. In this case it's going to be the title, the number of points, the creator, the date of posting, and the number of comments. We go back here and define them: title, number of points, creator, date of posting, and number of comments. Then we click on Scrape (00:31) and it starts scraping the data. It says "Please wait, data is being scraped," and then it shows me a table in here. As you can see, it just scraped all the data and gave it to me in a nice table format. Then I can open it either in JSON or Excel, or I can even open the markdown if I want. If I open it in Excel, I get a file that has exactly the data that I wanted. Another thing is that it shows me exactly how much it spent in order to do this extraction: (01:04) in this case, for example, the input tokens were 3,868, which are the tokens that we have in the markdown, and the output tokens were 1,500, which is what we have inside of the JSON, this table represented as JSON. The total cost is $0.0015. It is absolutely cheap because we have used GPT-4o mini. Here we can introduce as many models as we want; in this case I have GPT-4o mini and GPT-4o, so if I want more precision and more power I can always default to GPT-4o, but in this case GPT-4o mini does the job for an absolutely (01:39) cheap price. And this works on any website. Let's say, for example, we take another website. This is a website that has a listing of cars, and we want to scrape this table that we have here. All we need, as always, is the URL here. I am going to define the new fields: image, vehicle name, vehicle info, condition, sale info, and bids. I am going to
click on Scrape. It will open the website, it will start scraping the data, and as you can see here it has scraped the data. These are the URLs that will take us to the cars; if I (02:32) copy one of them and paste it in here, it will take me to the URL of the car. The first car: 8,900, 4,800, 38,000. Let's see: 8,000, 4,800, 3800. And that is basically it, and that will work on any website. I've tried so many websites and this application actually works on all of them. Of course, this is going to be a bit more expensive because we have so much more data: here we have 21,000 tokens and we are still not even at one cent, we are at half a cent. So it makes perfect sense to use this application to scrape (03:02) data instead of making more effort to create one script per website. So before jumping into the code and seeing how this application works, and don't worry, we are going to see all of this in detail, I want to address the comments that I got on my last video, which was, by the way, about the same topic. I had already created a first version of this application, and there were a lot of comments in, let's say, three main categories. The first one is about how I couldn't get consistent names every (03:29) time, and how I can use something like Pydantic, the data validation library, in order to make sure that I am going to get the same names. Now, this is something that I had in mind last time, but I didn't want to do it so as not to make the video very long. Since then, OpenAI has actually introduced Structured Outputs, which made my life so much easier, because now I can define object schemas using Pydantic, meaning that I can define the names and OpenAI with 100% accuracy will give me the same names every time. So that is very important and (03:59) it was actually a great remark from you guys. The second thing is about why the use of Firecrawl. Now, there are some people that were genuinely asking why I am using Firecrawl, and there are other people that just thought I
was just, I don't know, sponsored or an affiliate or something like that. First of all, I am not sponsored by anything; it does not make sense for a tiny channel like mine, so if you can subscribe it would be amazing. But the idea is: why do we even need to use Firecrawl? We can just read the whole HTML and extract (04:24) markdown from that HTML without going through any kind of library. You're right about that, and I actually did that: this time we're not going to use Firecrawl, we're not going to use any library. But Firecrawl or Jina AI or ScrapeGraphAI are actually very good because they simplify the process of getting the markdown: we only need three or four lines of code and we have the markdown ready. If you don't have that, we have to go through some steps in order to make sure that the (04:50) websites that we are scraping are not blocking us from scraping by introducing CAPTCHAs, we have to make sure that the website is opened on our machine, and there are a lot of other complexities that are not present with all of these other libraries. Still, going through the process of opening the websites ourselves is going to give us so many more possibilities that we didn't have before, so not using Firecrawl can actually be beneficial, and this is what we are going to do today. The third point is about the fact that this will never (05:17) replace scraping as we know it today. Honestly, I don't want to argue about this point because I don't know the answer, but what I'm sure of is that the established industry of scraping does not have the same pace of innovation as the AI industry today. We can confidently say that because every two weeks we have a new state-of-the-art model that outperforms all of the other models, beats all of the benchmarks, and introduces a new layer of possibilities. Therefore dismissing this way of scraping data is not very wise, because (05:46) someone who is willing to get outside of their comfort zone and try this new method
will at least have a way of scraping data before going into the old way of getting your XPath to scrape every element from the website, and there are actually use cases where this would at least give you a starting point. So let's close this bracket and continue with the video. Now let's jump to our code and see how I have created this. The first thing that we start with is some boilerplate imports, so nothing (06:12) important to see here. After that we are going to use pandas, BeautifulSoup, and Pydantic, which is going to be very important: it's what is going to allow us to create the schemas. Then we have html2text, which is going to help us create the markdown, and we have tiktoken, which is what we are going to use in order to calculate the number of tokens and their cost. Finally, we have the Selenium imports and the OpenAI ones. The first thing that we are going to start with, and that we should absolutely pay attention to, is the Selenium setup, (06:39) because if you're just trying to extract data using, let's say for example, requests.get(url) without the setup, you will absolutely get the "verify you are human" and "solve the CAPTCHA" pages. So you have to mimic some human behavior in order for the website not to block you, and this is why I have downloaded the ChromeDriver, which you can find in here; I will keep the link in the description below. Okay, so the first thing that we're going to start with is creating an instance of the Options class (07:05) in order to add arguments to it. The first argument is --disable-gpu. This is important because it disables the GPU if you're running this on a VM, and it will make it faster because it will not try to initialize any kind of integrated GPU that you have inside of your CPU. After that we use this argument, which is to make sure that our Chrome instance that we are going to open is independent and separate, because it will have to access a folder called temp. It's a detail you don't really need to know (07:32) about, but if you are running this code in a Docker container this will prove to be important. After that we define the window size; this will help the website think that we are not scrapers, that we are actually a human user opening the window. It's really not that important, but it could help. Here we arrive at the first really important argument, which is the user agent argument, and here we have a long text. This is quite important because this is what proves to the website that we are not scrapers, because basically we are using artifacts that are (08:02) normally present if we were to open that website ourselves. Everything in here means something: for example, this is Windows 10, this is Chrome and its version, and all of the other things mean something. We don't need to go into details, but these are just artifacts that are usually present when we are opening a website ourselves. After that we open the service that I already have inside of my project in here, and then we are going to initialize the web driver and we are going to
return it. We then get to the fetch HTML Selenium function, and (08:27) this is where we are going to open the URL, then add some sleeps and mimic a scroll action, which is always going to help us not get blocked. This is also quite good if we have an infinite-scroll case; this is how we can do it: we can just add three or four in here so that all of the data is loaded. Then we can get the page source and return the HTML. Then we have driver.quit() (08:53) inside of a finally block, just to make sure that if something crashes in here we are still going to close our Chrome instance and not leave it open; this is quite important. We then get to the part where we are going to create the markdown. Here, inside the clean HTML step, we want to keep only the main content, so what we are going to do is that we are going to remove the