1) My name is Glen Parker , I'm from the University of South Florida, in Tampa. This is a talk about data mining, about finding useful information within your blackboard application data. The slides for this presentation are available on my website, at "presentation.glenparker.net". They're available now if you have wireless access, or you can get them whenever it's convenient. 2) This is an outline what I'll be talking about today. I'll start by setting the stage with some examples of data that we've gathered at USF. Then I'll move on to show how we can find the data in our Blackboard systems, how to make sense of the numbers we'll find, and then how to present those numbers to different audiences. I'll finish up with some suggestions that I've found useful in my own data gathering, and then open the floor to questions. 3) Okay, The figures on the next 3 slides are specific to USF, but the techniques used to gather the data are useful to any institution. The rest of this talk will show you where Blackboard stores the data and how you can get access to it. You can use these figures as inspiration when thinking about what sorts of data you'd like to see presented for your own school. We started creating course sites automatically using Snapshot in the Spring of 2002. Since making this move, the adoption of Blackboard by faculty and students has been great. As part of this automation, we create a Blackboard course site for each course section in our Banner Student Information system. This results in around 8000 Blackboard course sites each semester. In Spring 2002 we had 844 live courses which represents about 12% of all sections created that semester. 3 years later we've find more than half of created course sites are actively using Blackboard with more than 95% of our students using some portion of the system. We've see similar growth among instructor adoption, tripling in 3 years from 500 to more than 1700, and student use doubling from 17000 to 35000 active students each semester. 4) These graphs give another look at the way that Blackboard is used. The top graphs show us that during the Spring Semester, 20000 different people logged into our blackboard system each day. The lower graphs shows that during the same time period, 1.6 million web pages were being served each day. That's an average of 80 pages per unique user per day. 5) Looking inside our group of active courses for the Spring, we see that 84% used some Content, 43% posted announcements, 13% used Blackboards assessment engine, 21% used discussion boards. We also see that 26% of live courses had instructors who were interested enough in the student usage of their course material that they made use of Item Tracking. 6) These sorts of statistics and numbers, and graphs are impressive, But even more important is that we have them at all. Having access to this sort of information has shown itself to be very important in our efforts to promote Blackboard. Having poked around the Blackboard database a while now, I appreciate how difficult it is to get clean, consistent numbers. The rest of this talk will try to answer these questions. I'll show samples of how Blackboard works internally, and how I came about the numbers I used in the previous slides. By the end of this talk, you'll have enough background to start gathering your own statistics about your Blackboard installation. If you are creating course sections automatically, then this information will help determine how many of your course sections are being used. If you are creating courses on demand, then this information will still help YOU to determine how those courses are using the Blackboard system. 7) I've broken the process of data mining into three distinct stages, the Collection of data, the Interpretation of that data, and the Presentation of the data. Collection means getting the important nuggets of data out of the massive heap that comprises Blackboard. There's a lot of data available in Blackboard, and you need to what to get and how to get it. The Interpretation stage tries to make some sense out of your data. Ideally you would find it useful to relate the data to your particular institution, and use it to solve other needs like Capacity Planning or to prove a certain level of adoption by faculty or students. In the presentation stage I'll talk about making the data useful to people other than yourself, and includes some sample reports and graphs. 8) The collection phase is where we try to make sense of the set of data structures that Blackboard uses. These data structures primarily include the database schema, but also include the App server file system and App server logs. We can round out our set of data with operating system metrics like CPU load average, and with external data sources such as the Student Information System, Human Resources system, and a metadirectory if your school uses such sources. All of these put together can be used to create a comprehensive picture of how your system has grown in use, how it's being used today, and how that growth might affect your system in the future. 9) Blackboard has done a good job of applying relational database standards to the creation of their database schema. The database as of App Pack 2 is a real treat to work in. Columns are clearly named. Tables are consistently normalized. It wasn't too difficult to figure out how the system was reacting to changes, making it easy to find where to look for data about different Blackboard functions. What's on the screen now is the description of the msg_main table. This table stores discussion board messages. I'm going to use this table to demonstrate some common columns you'll see in many blackboard tables. It has a pk1. Every table has a pk1 column, which is the primary key for that table. The column is a one-up number generated from a sequence You'll use these pk1's to join tables. It makes foreign key references to 3 other tables. forummain_pk1 tells which discussion forum the message was posted into. This can be a forum in the main discussion board, or a forum in a group discussion board, or even a forum on the portal discussion board. We would need a more complex query to learn more about the forum this message belongs to. users_pk1 joins to the Users table, and tells us which users made the post. archmain_pk1 tells us if this message is part of an archived discussion forum. If it's archived, then it won't display on the active discussion forum, instead you'll need to look in an archive to find this particular message. We can query to determine how many instructors are using the archive discussion board feature, or even to determine how many messages on average are archived. msgmain_pk1 is an example of a parent-child relation. These are also called nested rows or self-referential tables. In some cases, Blackboard wants to create a hierarchy of data within the same table. An example are Subfolders in the content area, or threaded discussion postings. Each folder is a content item and has a row in the course_contents table. Other Contents items make reference to the folder in which they reside. In such cases, you need to know what parent a child has, that is, which folder does this content item belong in. The next slide has an example of this. In this case, if msgmain_pk1 is null, then the message is a top level or root message. It has no parents. If msgmain_pk1 is not null, then the value of msgmain_pk1 refers to the parent message of this message. It has a dtcreated and a dtmodified, which you can use to see when a posting was first added, and if it was ever updated. This column has often been used to show instructors that system problems were in fact user errors. status codes: Blackboard uses status code all around the database. These often show up as char type columns, and often have a column name ending in Ô_ind', like the available_ind column. I believe the ind is short for indicator, as in "avaialble indicator" Determining what each code represents and what the range of values for each code requires a lot of trial and error experimentation. cartrg_flag tells us if this is owned by a course cartridge, which might affect the ability of a student to view the message, or might affect an instructors ability to perform a course copy. post_as_annon_ind tells us if the owner wished to remain hidden. preformat_ind tells us if the message was created as HTML or smart_html, or as text. Finally it has a row_status value. Row_status is primarily used by the snapshot, and I'm not sure what circumstances would cause snapshot to disable a message post. row_status seems to come up most often in the course_users table, where we operate in SNAPSHOT mode. if a user was previously enrolled in a course by snapshot, and they are not now in the snapshot file, the snapshot disables their course_user record by setting row_status to 2. From the administrator list/modify users view, this shows up as a red Ôx'. available_ind and row_status tells us if the row is to be made available to the users or not. availability depends on many factors, including your course role, your portal role, and your system role. Depending on the scope, the item might simply be invisible, or you might be denied access to the object or location. Or it might appear, as in the case of the instructor, but be flagged in italics or with a red slash, a visual reminder that it is unavailable. 10) this list shows some of the major tables you'll be interested in when digging around the blackboard system. There are other tables related to community/portal activity, but thats another show. The course_main table describes everything about a course. The title, the batch_uid, it's availability, color scheme, button style, course cartridge use, and quota. Most of your other queries looking into course stats will include this table in a join as they all reference this tables pk1 column. Similarly, the users table describes users in the system. It includes their name, email address, birthdate, student ID, department, cdrom settings, instant message name and type, privacy settings, primary institution role, and system role. A lot of good demographic information is stored in this table about users, but not enough in my opinion. If you have the option, you'll want to join users information with outside sources to gather additional demographics. I'm personally using a homegrown metadirectory to determine the users college and department, as well as home campus. The course_users table is a relation between course_main and course_users, and essentially describes course enrollments. Use this table when looking for users in a course. This table includes the role of the user in the course, student, instructor, TA, etc., and their course homepage if one is being used. course_contents is the table that stores all the content in a course. It is stored in a parent/child structure to describe nested folders and content in those folders. In this table you'll find the cnthndlr_handle and content_type which describes the type of content, it's location on the application server storage, the availability parameters for the content item, the parent folder of the item. This table can be used to build the hierarchy of the content structure in a course, and used to determine the size of content items. The activity_accumulator is the user tacking table. It keeps track of almost any click a user makes in their travels around your system. If the student clicks on a content item in a course, the activity accumulator knows about it. If they check messages, activity accumulator probably has that knowledge as well. If a student makes multiple calls to the course roster to harvest email addresses in an effort to spam the class, the activity accumulator knows. The three tables conference, forum, and msg main comprise the discussion board. Useful for counting how many messages a course receives, how many forums those messages are spread across, and to know if group discussion boards are being used. Conference main is a table for general discussion board settings. A course has at least one conference for it's main discussion board, and an additional conference for each course group that makes use of group discussion boards. As an example, if you had a course with 3 groups, your course would have 4 rows in conference_main, one for the course, and one each for the groups. A forum is container for individual messages, and belongs one conference. A conference has many forums, while a forum belongs to one conference. Each forum is described by one row in the forum_main table. gradebook, attempt, and qti_asi hold all data about quiz questions and answers, and gradebook information. I look here to know who has added columns to their course gradebook. There's a wealth of data in these qti_* tables for those looking for more indepth analysis of assessments. Blackboard's recently announced Project Caliper will be addressing many of these problems, but if you want to get a head start with your own analysis, these are the tables you want to look at. 11) These next two slides have a number of SQl queries. I've picked a handful of queries that I hope will demonstrate some of the different blackboard structures and how you would query them with SQL. If you see a question mark in a query, this represents a bind variable, and you should substitute an appropriate value if you actually want to execute these queries. First Query, To determine if a message has any children, you would do a query looking for messages whose msgmain_pk1 was equal to this messages pk1. Said another way "Return a count of all messages whose parent is the message we are interested in." If we wanted to get the name of a user who posted that message, we simply join the msg_main table with the users table and select the rows. "Return the first and last name of the person who posted the message we are interested". A slightly more complex query, if we wanted to get all courses which have at least one posting in at least one of their forums, we can use the query with joins on the msg_main table, forum_main, conference_main, and course_main tables. This resulting query is read "Select all courses that have messages in msg_main" An example of a less deterministic query is getting the number times email was sent by the instructor of a course. I approximate this by assuming that a visit the the control panel send email page constitutes sending email. It's approximate because while I can tell how many times the page was visited, the Activity Accumulator does not keep track of how many emails were actually sent. Getting actual emails requires App Server logs, either apache logs or a custom email log, and even this information in not 100% accurate. To approximate the email being sent using only database information, I assume that if the control panel send email page was visited more than 5 times, then at least one of these visited resulted in email being sent out. So the query should read "return the number of times that any of the control panel send email pages were visited for a particular course" 12) Some more queries If you have a portal system, this query will tell you how many users have used the Blackboard Bookmarks module to store URL's. Recently, we wanted to know who was using the Blackboard Bookmarks module so we could let them know about our own Bookmarks module. Blackboard stores the bookmark data itself in the portal_extra_info table, which is useful for many things related to persistent portal data. If you are writing your own building block, it would be worth convincing Blackboard to fix this table so that data is persisted when building blocks are reinstalled. Doing so would help make your building block 100% cross-platform. To keep this query clean, I looked up the module pk1 for the Bookmarks module and determined it to be 15. "return the user_id of users who have data in the portal extra info table and that data belongs to the module with pk1 15" A second portal query is determining how many people have customized their tabs. If you have only one customizable tab in your portal, then this query will tell you how many still have the default layout. If you customize your layout, then you have a row in the layout table. This information is useful for finding out if modules you are adding are being used by the population. The query then reads "return a count of rows when we take the set of all users from the users table and subtract the set of users who have a row in the layout table" This query is useful for estimating the storage that your course contents are taking up. file_size is stored in bytes, so dividing this number by 1024 twice returns the value in Megabytes. "Return each course and the sum of bytes used by that course for all courses whose sum of bytes is greater than 1Mb" Finally, this is implementation of my definition of concurrent users. My definition reads "return a count of all distinct users who have accessed the system in the last 5 minutes" A note about this strange number. Oracle returns date calculations as fractions of a day. This number is 5 minutes shows as a fraction of a day. 13) With the introduction of App Pack 2, Blackboard introduced two new functions, messaging and glossary. Unfortunately, the data for these apps is stored in the course content directory on your app server, not in your database. If you want to know if these tools are being used, you need to make a check of the file system. You can find Glossary use by looking for a non-zero byte file called glossary.html Messaging requires you to look Inside the messaging subdirectory. Look inside for files ending with .body. These are the messages. Each .body file corresponds to a messages that was received by this user. 14) The App Servers various logs contain a wealth of tracking information The apache logs can be mined to show page requests over time as well as bytes served. There are logs of apache log analyzer tools freely available that will give you great results with minimal effort. You can also write your own parser to get more Blackboard specific information. Mod_perl logs could be used to show the frequency of access of different perl based applications, like the courses.pl tab or the discussion board. As Blackboard moves more code from perl to java, this will be much less useful. And of course, the tomcat logs are useless. However, they could be made more useful if developers, including Blackboard, wrote useful information to system.out and didn't simply dump exception threads to the log file. 15) Alrighty, You've collected the data, now you need to make some sense of it. My own work started by trying to find out the answer to the question Òhow many courses are being used?Ó This got me looking to find which blackboard components each course was using. We ended up calling courses live if they used any Blackboard component. Another way to use your data is to track user activity. You can watch where the users go, what they click, how they move from place to place within a course and within the portal. A very practical way to use the information is to demonstrate growth. You can use this to show the adoption rate, growth rate, and other trends needed to get more money for additional resources. This data is very useful when Capacity Planning. 16) A course is live if it includes a blackboard component that I've been able to identify. This slide shows all the components that I'm currently tracking within Blackboard App pack 2. If you know of more things to look at, please talk to me afterwards. So far, I've found 18 blackboard components and 13 different content types. Components include Announcements, gradebook columns, discussion board postings, Groups, Tasks, Calendar entries, Staff Information pages. One area not on the list is the Virtual Classroom. I've not yet found a way to accurately know when a classroom is used. If you have any ideas how to identify when a virtual classroom is being used, lets talk afterwards. 17) User Activity is primarily stored in the activity accumulator. It's a great way to follow students around Blackboard. With a simple query to the Activity Accumulator table, we can see most every click a student makes in the system. This slide shows a sample output from a web based tool we recently developed to perform just such queries. We enter the course ID and user ID and a range of dates we are interested in, and a report comes back in very short order. We've even gone so far as to group the activity by individual sessions. In this sample output, we see that our fictional student has entered the system on May 1. From the welcome tab, they used their My Courses box to enter their course EXP4204 and saw the Announcement page, then clicked on Course Tools to check their grade, Clicked another page which took them back to the announcements. From here they clicked the courses tab to bring up their course list, and they used to visit their other course. They looked at their announcements and clicked the tools area, but left before they could do anything else. They returned in a new session the next day to continue checking for end of semester grades. If you've got someone with a better eye toward pattern analysis, this sort of data could prove valuable. I'd really like to talk with someone who knew more about pattern analysys and see what they could do with this information. Or, you could do as we recently did, and used this tool to identify a student who was impersinating the instructor by sending emails with a forged header. We identified them by observing a person making repeated roster lookups, and also noticing that the emails were sent within minutes of the roster lookups, and that the email were delivered in the same order that the roster returned the addresses, alphabetically by last name. If you choose to perform such an investigation, beware that this sort of forensics evidence is circumstantial at best. There's no guarantee that the person logged in was the owner of the account used. We ended up dropping the issue because the evidence was so weak. 18) Alrighty, You've gathered your data and made sense of it, now you need to give that data to others in a format that makes sense to them. It's now your job to make the data useful to others. There are a number of typical ways to present the data depending on it's use and audience. These are some of these methods i've used myself. Reports include tables and one-line factoids that explain counts of stuff. Use reports to present how many live courses you have. Among your live courses, how each Blackboard component is being used. You can show counts of each course role, or you can present how live courses are distributed among different departments or campuses. These reports provide good overviews when talking to upper management about growth trends and the need for more money to continue supporting that growth. Graphs are another form of information, except that they show relations over time or relations between groups of elements. Think pie charts and line graphs. If you have access to college, department, or campus data, breaking down reports into finer grains is useful. Better still, storing the data in a less cooked format is good for data analysts you have around campus. 19) Reports are comprehensive snapshots of your system at a point in time, and this slide shows a sample of one I've created for USF. I use these reports to aggregate a number of metrics within a single snapshot. The higest level of aggregation of a Blackboard report would include all the courses on your system. This is useful as a system overview, However, if you can identify them, grouping courses by academic year or by academic semester is much more useful. Unfortunately, Blackboard doesn't provide an clean way to label course sites by year or semester, so you're on your own. At USF we've found it useful to encode this semester information in the courseID. This report shows that we had 3200 live courses in the spring, and these live courses consumes 50Gb of storage. Of these live course, 43% used Announcements, 84% used content, 64% used the gradebook, and so on. This report is a useful overview of your system. 20) I've also found it useful to break down live courses by college and department. You'll likely have to look outside of Blackboard to get this information, but it's worth the extra effort. This report shows that the College of Arts and Sciences used 1735 live course sites. Also, within the college of arts and science, we see the department of Anthropology used 48 course sites, Astronomy used 3 course sites, Communication used 80 course sites. And so on. When it comes time to ask for money from the college to support the schools Blackboard installation, having numbers showing how the different departments are using the system is a good way to convince them of the need to contribute. 21) We can get even more detail about how each course is using Blackboard. Looking within the Department of Communication, we find that the first course used a number of Blackboard areas including Course Content, Discussion Boards, added columns to the gradebook, sent email, used the digital dropbox, created groups, and posted a staff information page. This course made great use of the available Blackboard tools. I also find it useful to group by Blackboard functional areas. For each Blackboard function, I can identify the course sites using that function. For example, if I want to know all courses using the gradebook, I have that information. 22) Graphs look at a single aspect of your system viewed over a period of time. If you can count something, you can probably graph it. Just count it at the same time every day and store the data points somewhere safe. I'm currently graphing things like users logins per day, web pages served per day, bytes served per day, user activity per minute. It would be easy to extend this to include things like content clicks per day, discussion postings per minute, or many other Blackboard metrics. On this slide, the top graphs shows the concurrent users. Using the SQL query from earlier, I record the number of concurrent users by taking a measurement every 5 minutes, and I plot these values over 7 days. This graphs is showing that we had upwards of 250 concurrent users at some point last Monday. It also shows that we were consistently serving more than 200 concurrent users throughout the week. The lower slide shows the total data served by the apache servers each day. At the end of the day, I read the apache log server for that day, and extract the bytes served. I add these bytes up, and plot this value over an entire year. We can see from this graph that during the previous Spring semester we were serving 60Gb a day and now during the Summer we're serving around 20Gb a day. 23) The numbers you used to compute your graphs and reports can be stored in their raw format as well. With the raw numbers, it's easier to create new and customized reports or to identify trends that aren't apparent on your precomputed reports. Letting others have access to the raw numbers to compute graphs or reports that suit their own needs frees time for you to do other things. You're best bet is to store these numbers in an Relational database and allow interested people to make their own queries. This requires that anybody using the data have a knowledge of SQL, and that your database system is set up to behave like a data warehouse. This means that you need to store this data in a database other than your production Blackboard database, which is inherently OLTP in nature. Your next option is to provide the data in downloadable csv files. These tend to be really large files. Just be aware of their size and their impact if you try to load one of these directly into Excel. 24) Some random thoughts to round out this talk Try to keep your data collectors separate from your data presenters. That is, collect your data via whatever mechanism you need, run any necessary calculations, then store that computed data somewhere. Later, have your data presenter read the stored data and create the reports and graphs. Separating these two functions insulates the presenter from changes in the way the collector operates. Try to minimize the number of collectors you run. Each collectors places a load on your system. Enough collectors can begin to noticably degrade performance. If possible, try to have collectors perform double duty. Example: Big Brother is a monitoring tool that computes a set of numbers to determine the health of the Blackboard system. Rather than write a separate collectors to compute and store the same numbers, modify the Blackboard monitor to store the numbers it is already collecting. A corralary to this is to use the tools you already use. Don't write a brand new data collection scheme if you already have a scheduler in place that make regular collections. Our installation of Big Brother makes it fairly easy to write plugins that called on a regular basis. If you are already using a monitoring tool already, see if you can add your plugins to the existing framework. You can use your innate knowledge of the system to produce useful administratoive tools. lI wrote a web CGI for our helpdesk which lets them search the Blackboard by courses and by users. If they search for a user, it shows the status of all courses that user is a member of, including the availability, row_status, and available date range. It flags any potential trouble settings Giving the helpdesk student information without allowing them more access to the admin panel than they really need. Automate everything. If you don't automate the collectors, you won't remember to run them. Better yet, automate the presenters as well, so that your reports are always up-to-date. In addition to automating the presenters, make regular archives of the reports and graphs so that you can look back and see growth trends. It's nice to be able to look back at the Spring semester and see a report from the first day of classes, the midterm week of classes, and the last day of classes, and see how the system use changed. How many courses used the gradebook in the beginning vs. the end of semester, how many added discussion boards in the middle of the semester, and so on. If you do decide to write your own collectors, be prepared to reevaluate how you collect after each Blackboard upgrade. Everything I've done is subject to change and reinterpretation after an upgrade. Remember to Be Flexible. 25) That's my presentation. Again, a copy of these slides along with notes is available on my web site presentation.glenparker.net Also on my site you find a copy of the code used to produce the course overview statistics. The code is written in Perl and used the perl binary that ships with Blackboard. For Unix installations using Oracle, this code should work right out of the box. Instructions for use are included in a README file in the download package. The floor is now open for questions. . .